>> Jitu Padhye: It's a pleasure to welcome Sangjin. This paper is going to appear in SIGCOMM. It's about routers that use GPUs. Thanks. >> Sangjin Han: Thank you. Hi, I'm Sangjin Han, and I'm going to present PacketShader, our GPU-accelerated software router. This is joint work with Keon Jang, KyoungSoo Park and my advisor, Sue Moon. Let me explain what PacketShader is all about. PacketShader is a software router that exploits the GPU for high-performance packet processing. The PacketShader prototype shows 40 gigabits per second of performance on a single PC. First, I'm going to give a general idea of software routers. A software router is literally driven by software, so it is inherently flexible and programmable. The role of a software router is not limited to IP routing. If you want more functionality, for example support for new protocols, you can get it by just modifying software. Software routers are based on commodity hardware, which has two implications: commodity hardware is cheap, and its evolution is much faster because it's backed by millions of customers. Now let's review the commodity hardware market. 10 Gigabit Ethernet has become popular, and this is a great opportunity for software routers. The cost of 10 GbE equipment is dropping quickly, and it's now below $300 per port. You can build a multi-10G software router at a very affordable price. However, high-speed links and low cost alone do not guarantee the performance of software routers. The Achilles' heel of software routers has long been their low performance. Measured with minimum-size packets, many have tried to push the performance up to the limit, but it has not scaled beyond ten gigabits per second. This means that it doesn't matter how many 10 GbE ports you have; your system will not be able to handle even a single port at line rate.
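To see the scale of the problem, here is a quick back-of-envelope check of the per-packet CPU cycle budget at 10 GbE line rate. The 2.66 GHz clock is my own assumption for a commodity server CPU of that era, not a figure from the talk; the 84-byte figure counts the minimum 64-byte frame plus 20 bytes of on-wire overhead (preamble and inter-frame gap).

```python
# Back-of-envelope: per-packet CPU cycle budget at 10 GbE line rate.
# Assumes minimum 64-byte Ethernet frames plus 20 bytes of on-wire
# overhead; the 2.66 GHz clock is an assumed commodity-CPU figure.
LINE_RATE_BPS = 10e9
WIRE_BYTES = 64 + 20                       # frame + preamble/inter-frame gap
pps = LINE_RATE_BPS / (WIRE_BYTES * 8)     # packets per second at line rate
cycles_per_packet = 2.66e9 / pps           # budget for one 2.66 GHz core
print(f"{pps/1e6:.2f} Mpps, {cycles_per_packet:.0f} cycles/packet")
# -> 14.88 Mpps, 179 cycles/packet
```

So a single core has well under two hundred cycles per minimum-size packet, which is why the talk next turns to where those cycles go.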
The performance bottleneck is in the CPU, and I'm going to give you a little more detail. Let's see how much computation power is needed and why a CPU-only approach is not enough for 10 Gb networks. First of all, you need about 300 CPU cycles just to receive and transmit a packet through the network interface card. Furthermore, you need additional computation power depending on what you want to do in the software router. The problem is that you don't have enough CPU cycles, so how can you solve this problem? >>: Those CPU cycle numbers must be specific to an architecture, right? Is this x86, or --? >> Sangjin Han: Yeah, these are actually based on -- and unfortunately [indiscernible]. So our approach is two-fold. The first one is to optimize packet I/O, and we get significant improvement, by a factor of six. I'm not going to address the details in this talk; you can find our optimization techniques in the paper. Our second approach is about how to handle these application-dependent operations in an effective way. Our solution is to offload those operations to the GPU, and this is the main topic of this talk. I assume most of you are not familiar with GPUs, so let me introduce the GPU briefly. The GPU is the central processor in a graphics card. In order to synthesize realistic 3D graphics in real time, the GPU handles millions of polygons and billions of pixels per second on its massively parallel architecture. Software routers have nothing to do with graphics and rendering, so why am I introducing the GPU to you? Essentially, the GPU has become flexible and programmable enough for non-graphics workloads -- general-purpose computation on GPUs is increasingly popular -- and basically this is a novel application of the GPU, for software routers. GPU architecture is different from that of a general-purpose CPU. Today's CPU has a small number of cores, and each core is quite [indiscernible]. In contrast, the GPU has a much larger number of cores, and each GPU core is not as smart as a CPU core.
GPU cores don't have some of the advanced features of CPUs, such as out-of-order execution or dynamic branch prediction. If you want to transport a few people quickly, then a CPU is the best option. But if you want to transport many people -- in other words, if your workload is throughput-centric -- then you should use a GPU. This difference comes from how transistors are allocated in each processor. The CPU dedicates most transistors to execution control and large caches to accelerate single-threaded programs. The GPU has many more transistors in general, and most of the transistors are devoted to arithmetic units, and these arithmetic units form a large array of GPU cores. So if your workload has enough parallelism, you can expect significant throughput improvement with a GPU. Now let's move on to how we can utilize the GPU for packet processing. For packet processing, the GPU can be quite an effective solution. I'm going to explain the merits of the GPU in terms of raw computation power, memory latency and memory bandwidth. Software routers need lots of computation power. Hashing, encryption, pattern matching -- all these operations are computation-hungry. The good thing is that the GPU has much higher computation power, so you can adopt those compute-intensive operations in your software router. The second aspect is memory latency. Software routers tend to do random memory access, so your CPU will experience lots of cache misses, and it thus wastes computation power. In contrast, the GPU can work around memory access penalties. When a GPU thread has a cache miss, it switches to another thread that is immediately executable. This context switching is done by a hardware scheduler, so it doesn't incur any extra overhead. And the last thing I want to compare is memory bandwidth. The memory bandwidth of top-of-the-line GPUs is over a hundred gigabytes per second today, and this number is theoretical; depending on the memory access pattern, you will get much less empirical memory bandwidth.
And you can't even fully utilize this number, because you have many mouths to feed. Basic reception and transmission of packets already take a large amount of memory bandwidth. If your software router forwards traffic at a certain rate, the memory bandwidth consumption is more than four times that rate. As a result, the budget left for packet processing can be less than 10 gigabytes per second on high-speed networks. Within that budget, you have to make a forwarding decision or perform deep packet inspection for every packet. The problem is that your limited budget is not quite enough to handle such memory-intensive operations. The good thing is that the GPU has an order of magnitude higher memory bandwidth than the CPU. So if you offload memory-intensive operations to the GPU, you can secure enough memory bandwidth. So far I have explained why the GPU can be useful for packet processing. Now I'm going to explain how to use the GPU for packet processing. The basic idea of using the GPU in a software router is very simple. The CPU's role of receiving and transmitting packets through the [indiscernible] remains, as in a traditional software router. The important part is that we offload core operations to the GPU. Such operations could be, for example, forwarding table lookup, encryption, or anything that requires much processing power. One thing you have to consider is that, as I said, the GPU is basically a parallel processor. For optimal performance, you must feed enough parallelism to the GPU; otherwise the GPU will be underutilized, and you can't expect any performance advantage from the GPU. Then where can you find such parallelism in packet processing? We can take multiple packets from RX queues in a batch, not one by one, and process them in parallel. The key insight behind this work is that packet processing is usually parallelizable by definition. You don't need any coordination between packets for processing.
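A natural worry with batching is the time spent waiting for a batch to fill. A quick check, assuming minimum-size frames on a saturated 10 GbE link (84 bytes on the wire per frame, counting preamble and inter-frame gap):

```python
# Time to accumulate a batch of minimum-size packets on a saturated
# 10 GbE link: each 64-byte frame occupies 84 bytes on the wire.
LINE_RATE_BPS = 10e9
WIRE_BYTES = 84
batch = 1000
delay_us = batch * WIRE_BYTES * 8 / LINE_RATE_BPS * 1e6
print(f"{delay_us:.1f} microseconds")   # -> 67.2 microseconds
```

So even a batch of a thousand packets accumulates in tens of microseconds at line rate.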
Some of you might worry that accumulating hundreds of packets may induce unreasonable delay. My answer to that question is: it's not true. For example, on a fully saturated 10 GbE link, you can get up to 1,000 packets in only 67 microseconds. Now let's move on to the detailed architecture of PacketShader. PacketShader's basic design is very simple. The packet processing is done in three stages: pre-shader, shader and post-shader. I'm going to explain how each stage works with an example of IPv4 forwarding. When packets are received in a batch, the pre-shader performs some preprocessing of the packets, such as checksum and TTL checks. Some packets that need further processing go to the IP stack of the operating system. For the rest of the packets, the pre-shader collects the destination IP addresses of the packets and passes them to the shader stage. In the shader stage, the collected IP addresses are copied to the GPU, and the GPU performs forwarding table lookups in parallel. Then the resulting next-hop information is delivered back to the CPU. Finally, in the post-shader stage, the packets are updated and transmitted through output ports based on the forwarding results. For simplicity, I omitted the device driver in the previous slide, but of course you need a device driver to communicate with the network interface cards, as in the figure. We implemented our own device driver for optimal performance, and you can find some details in the paper. So far we have seen the basic design of PacketShader, but things become a little more complicated because of some challenges in today's commodity architecture. The first challenge comes from the adoption of multi-core CPUs. CPUs have multiple cores these days, and each core must run as independently as possible in order to achieve linear scalability with the number of cores. So we duplicated the packet-processing pipeline across multiple cores, as shown in the slide.
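As a sketch of the three-stage flow just described, here is a toy, CPU-only version of IPv4 forwarding. The packet format, the two-entry routing table and all function names are illustrative only, not PacketShader's code; in the real system the shader stage runs on the GPU in parallel over the whole batch.

```python
# Toy CPU-only sketch of the pre-shader / shader / post-shader flow for
# IPv4 forwarding. Packets are plain dicts, the table is made up, and
# the "GPU" stage is an ordinary sequential loop.

ROUTES = {"10.0.0.0/8": 1, "10.1.0.0/16": 2}   # prefix -> output port

def addr_to_int(a):
    return int.from_bytes(bytes(int(x) for x in a.split(".")), "big")

def match_len(prefix, addr):
    """Prefix length if addr falls inside prefix, else -1."""
    net, plen = prefix.split("/")
    plen = int(plen)
    mask = (~0 << (32 - plen)) & 0xFFFFFFFF
    return plen if (addr_to_int(addr) & mask) == (addr_to_int(net) & mask) else -1

def lookup(addr):
    """Longest-prefix match over the toy table; None means no route."""
    best = max(ROUTES, key=lambda p: match_len(p, addr))
    return ROUTES[best] if match_len(best, addr) >= 0 else None

def pre_shader(batch):
    # Sanity checks; packets needing further work would go to the OS stack.
    fast = [p for p in batch if p["ttl"] > 1]
    return fast, [p["dst"] for p in fast]

def shader(addrs):
    # In the real system this batch of addresses is copied to the GPU
    # and looked up in parallel; here it is a plain loop.
    return [lookup(a) for a in addrs]

def post_shader(batch, ports):
    # Update packets and hand them to the chosen output ports.
    return [dict(p, ttl=p["ttl"] - 1, out_port=port)
            for p, port in zip(batch, ports) if port is not None]

batch = [{"dst": "10.1.2.3", "ttl": 64}, {"dst": "10.9.9.9", "ttl": 64}]
fast, addrs = pre_shader(batch)
for pkt in post_shader(fast, shader(addrs)):
    print(pkt)   # each packet gains an out_port and a decremented TTL
```

The point of the split is that only the small per-packet inputs (here, destination addresses) cross to the shader stage, while the packet payloads stay in host memory.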
Another choice we made is to dedicate one CPU core as the master core, which communicates with the GPU. If multiple cores contend for the same GPU, the performance degrades significantly. This is the reason why we have only one master core per GPU. For [indiscernible], multi-socket systems have become increasingly popular. The challenge with multiple sockets is that the cost of communication between separate CPUs is very expensive. So we partitioned the PacketShader system in a way that any communication between the master and workers happens only within the same CPU. Now let's move on to how we evaluated the system and what performance numbers we got. This is how we set up our PacketShader prototype. We used multiple 10 GbE ports [indiscernible] and two GPUs in the system. The reason why we used multiple components in the system is first to prove that PacketShader is really scalable, and second to confirm that our system reflects the trend of top-of-the-line commodity servers. For measurement, we connected a packet generator and PacketShader back to back. We built the packet generator ourselves, based on our highly optimized packet I/O engine. It can generate up to 80 gigabits per second with minimum-size UDP packets. And here is the result. We implemented four applications -- IPv4 and IPv6 forwarding, OpenFlow switching and IPsec tunnelling -- on top of PacketShader. This graph compares our CPU-only and CPU-plus-GPU implementations, clearly showing the effectiveness of the GPU for packet processing. The speedup due to GPU acceleration depends on the application. For IPv4, which is simple enough already, you can't expect much improvement with GPU acceleration. In contrast, IPv6 shows a 4.8 times improvement over the CPU-only implementation, because IPv6 forwarding is highly memory intensive, so the GPU can help a lot. >>: How do you manage to get 38.2 gigabits per second? The max was under 10, and the memory bandwidth was roughly 10.
>> Sangjin Han: Memory bandwidth is not the issue in this case. Note that the throughput numbers are in gigabits per second, while the memory bandwidth numbers I mentioned are in gigabytes per second -- a factor of eight difference. >>: [indiscernible]. >> Sangjin Han: In this talk I'll give you more details of IPv6 forwarding and IPsec tunnelling as examples. The first example is IPv6 forwarding. IPv6 lookup requires more processing power than IPv4, because an IPv6 address is 128 bits long, which is four times longer than an IPv4 address. So we offloaded the longest prefix matching operation to the GPU. The algorithm we used in our work performs binary search on hash tables. Here is the performance result of IPv6 forwarding. The performance improvement with the GPU is clear when the packet size is small, and our GPU acceleration guarantees around 40 gigabits per second for any packet size. I told you that the total capacity of our system is 80 gigabits per second, but for both the CPU-only and CPU-plus-GPU cases the throughput does not scale beyond 40 gigabits per second. This is because of a hardware problem in our motherboard, and the details are again in the paper. We expect much higher performance once the hardware problem is fixed. The second example is IPsec tunnelling. IPsec tunnelling is done in two stages: in the first stage the entire packet is encrypted with the AES algorithm, and in the second stage a hashed message authentication code is appended with the [indiscernible] algorithm. We offloaded the AES and [indiscernible] algorithms to the GPU, because these cryptographic operations are highly [indiscernible] intensive. For IPsec, GPU acceleration improves the throughput by a factor of 3.5. You can see the performance improvement is stable for all packet sizes. This is because IPsec operates on entire packets, not only on packet headers. For minimum-size packets, PacketShader achieves around 10 gigabits per second, and its performance goes up to 20 gigabits per second for larger packet sizes.
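Going back to the IPv6 lookup just described as "binary search on hash tables", here is a sketch of that idea: a simplified Waldvogel-style scheme that keeps one hash table per distinct prefix length and binary-searches over the lengths. Addresses are toy 32-bit integers for readability (real IPv6 uses 128-bit addresses), the table contents are made up, and "marker" entries carry the best matching prefix seen so far so the search never has to backtrack. This is my reconstruction of the general technique, not the paper's code.

```python
# Longest-prefix match via binary search on hash tables of distinct
# prefix lengths (simplified Waldvogel-style scheme, toy data).

MISS = object()   # sentinel: distinguishes "no entry" from a None bmp

def top_bits(value, nbits, width):
    """Keep the top nbits of a width-bit value, left-aligned."""
    return value & (((1 << nbits) - 1) << (width - nbits))

def best_match(prefixes, key, max_len, width):
    """Longest prefix covering key with length <= max_len (build-time helper)."""
    best, best_len = None, -1
    for (val, plen), nh in prefixes.items():
        if plen <= max_len and plen > best_len and top_bits(key, plen, width) == val:
            best, best_len = nh, plen
    return best

def build(prefixes, width):
    """prefixes: {(left_aligned_value, prefix_len): nexthop}."""
    lengths = sorted({plen for _, plen in prefixes})
    tables = {l: {} for l in lengths}
    for (val, plen), nh in prefixes.items():
        # Insert markers along the binary-search path toward this length.
        lo, hi = 0, len(lengths) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            l = lengths[mid]
            if l < plen:
                key = top_bits(val, l, width)
                tables[l].setdefault(key, best_match(prefixes, key, l, width))
                lo = mid + 1
            elif l > plen:
                hi = mid - 1
            else:
                break
        tables[plen][val] = nh   # the real entry is its own best match
    return lengths, tables

def lpm(addr, lengths, tables, width):
    best, lo, hi = None, 0, len(lengths) - 1
    while lo <= hi:                       # O(log #lengths) hash probes
        mid = (lo + hi) // 2
        entry = tables[lengths[mid]].get(top_bits(addr, lengths[mid], width), MISS)
        if entry is not MISS:
            if entry is not None:
                best = entry
            lo = mid + 1                  # a longer match may still exist
        else:
            hi = mid - 1
    return best

routes = {(0x0A000000, 8): "A", (0x0A010100, 24): "C"}
lengths, tables = build(routes, 32)
print(lpm(0x0A0101FF, lengths, tables, 32))   # longest match wins: C
print(lpm(0x0A0FFFFF, lengths, tables, 32))   # falls back to the /8: A
```

For 128-bit IPv6 there are at most 128 distinct lengths, so a lookup needs at most seven hash probes, and each probe in a batch of packets is independent, which is what makes the workload a good fit for the GPU.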
And I can say this number is comparable, or even superior, to commercial hardware-based IPsec tunnelling appliances. Let me compare our system with previous software routers. PacketShader is the first multi-10G software router on a single machine, and with GPU acceleration PacketShader achieves the speed of 40 gigabits per second. What's interesting here is that previous software routers were implemented in the kernel, not in user space. The common belief has been that user-level packet processing, despite its many other advantages, is much slower than kernel-level packet processing. In contrast, we implemented PacketShader in user space and have shown that high-performance packet processing is possible in user space. >>: [indiscernible] for minimum-size packets, right? 64. >> Sangjin Han: Right. >>: Otherwise they mimic your [indiscernible] as well? >>: I guess that's with IPv6 for 1500-byte packets on their CPU-only implementation. >> Sangjin Han: For larger packet sizes, due to our I/O bottleneck, it shows the same performance. But what I want to say is that the GPU implementation guarantees 40 gigabits per second of performance for any combination of packet sizes. Now I'm going to conclude this talk. I have shown that the GPU is a great opportunity for software routers, and we realized that idea in PacketShader. PacketShader is a software router that exploits the GPU for packet processing acceleration on top of our optimized packet I/O engine. We implemented four applications -- IPv4 and IPv6 forwarding, OpenFlow and IPsec -- on PacketShader to prove the effectiveness of our system. Here is our plan to extend PacketShader. The first item is to integrate a control plane in PacketShader. Our current implementation only supports static forwarding tables for IP routing, and we are planning to integrate [indiscernible] or [indiscernible] in our system. The second item is about the programming environment.
We are planning to support a Click-like modular programming environment to make our system more extensible. And the third item is to support the concept of opportunistic offloading. If the volume of input traffic is low enough, then we may use only the CPU, for low latency and low power consumption. And if the traffic volume goes up, then we can use the GPU for high throughput. This is the end of my talk, and I would be glad -- happy to answer your questions. Thank you. >>: So you do batch processing, right? You put the packets in a buffer, give it to the GPU -- so that adds some delay. How much delay do you --? I was expecting to see some sort of graph for [indiscernible]. >> Sangjin Han: Usually we collect hundreds or thousands of packets, but the delay is acceptable. This graph shows the latency with PacketShader, and you can see that the CPU-plus-GPU implementation shows a latency between 200 microseconds and 400 microseconds. This may not be trivial for some people, but I think it's acceptable. And if I implement the opportunistic offloading idea, as I said, then the latency graph will look like this. >>: Is the bottleneck the PCI Express? It seems like you would be bound by [indiscernible] transferring data to the graphics card and back. >> Sangjin Han: The thing is, this is how our system looks. There is something called the IOH between the CPU and the GPU. Can you see? The IOH basically bridges I/O devices to the CPU. In our motherboard we have two IOHs, because we really needed a large number of PCIe slots in our system. But the problem is that if we have two IOHs on the motherboard, then the I/O throughput in our system does not work as well as it's supposed to. So I think I need your help. >>: We actually played with a single-IOH case, and with a single IOH plus three network interface cards we actually got more than 50 gig.
But when we have dual IOHs on our system, then the [indiscernible] -- it's actually the reception part, the receiver side, that's bottlenecked at 40 gig. So our packet generator can spit out packets at 80 gig, but cannot receive -- basically take the packets from the [indiscernible] and touch memory -- at 80 gig. It's capped at 40 gig. So I think it's a dual-IOH problem, an Intel chipset problem, and it's been reported in other user groups. We're just waiting for the next-generation IOH from Intel. >>: Even if that's fixed, from the IOH to the GPU that communication is happening over the PCI Express bus, right? >>: Yeah. >>: How much benefit do you think you'd get if you actually took the processors from the GPU and put them on another core on the actual CPU die itself, so you didn't have to make memory requests from the CPU to the GPU? >> Sangjin Han: You mean, how about putting more CPU cores instead of GPUs, right? >>: No, I mean actually putting vector processors like you have on the GPU into the actual CPU die. They take the place of an extra core. >> Sangjin Han: Something like AMD. Yeah. What I want to say is that the GPU in this work may be transitional, and maybe some CPU-plus-GPU integrated chip may be the final destination, I think. But today the separate graphics card is the best way to get a performance advantage over the CPU, because the GPU consumes a lot of power and dissipates a lot of heat. Maybe someday CPUs and GPUs will be able to be together on one chip. >>: Yes, but my question was how much the memory bandwidth to the GPU is currently bottlenecking your system -- do you think that's actually a significant factor or not? >> Sangjin Han: Maybe I don't understand. >>: The PCIe bandwidth. So from the NIC it's first loaded to the RAM, then from the RAM to the GPU, and then back to the RAM. And so the GPU -- >>: So you have an extra back and forth? >>: Yes. >>: That you wouldn't need if you didn't have that. >>: Which is worse as your packet size goes up.
>>: Yeah. >> Sangjin Han: But the important thing is that you don't -- >>: In the first place, where he had the different pieces eating up parts of it. >>: Depending on the processing, you don't need to copy the entire packet. For forwarding, you only need the address. >>: Right. >>: But for IPsec you need the entire packet. So it depends on the application. >>: For your CPU numbers, did you take advantage of SIMD instructions? >> Sangjin Han: What kind of instructions? >>: Instructions on the -- >> Sangjin Han: Yeah, basically -- yeah. We could have used SSE for our implementation, because our implementation is done at user level, and at user level we can use SSE instructions. You can't do that in a kernel-level implementation. But basically we didn't use SSE instructions, because it needs lots of work and some expertise in the instruction set. And the same story can be applied to the GPU -- our GPU implementation is just a straightforward port of the CPU program to the GPU. So maybe SSE instructions could help, but the performance bottleneck of the CPU implementation of IP forwarding is memory latency, not the number of instructions. So I don't think that SSE was a [indiscernible]. >>: What did you use for programming the GPU, and what was the -- >> Sangjin Han: Basically, let's see -- this is C for the GPU. It has some extensions for parallelism, but -- >>: Was it Open [indiscernible]? >> Sangjin Han: We used CUDA. >>: Okay. >>: There's a line of work coming out of Intel and other research places making network processors. What happened to that? Are there still network processors that are supposed to handle this kind of thing -- processors customized for packet forwarding and such things? >> Sangjin Han: What I want to say is that a network processor is not commodity. You can't buy a network processor at the cost of $500. >>: What's the cost of a network processor? >> Sangjin Han: I don't know. >>: $100,000.
>>: But they're not -- they're also, I think, substantially less parallel. I think there's something like 60, 32, 500. >>: Network processors are going out of fashion, as they often get [indiscernible] processors doing well in ATCA chassis; but it's not all coming together, so you need to design a board with a network processor. So it's not commodity, of course. It's not readily available technology like the graphics card. >>: So things like GPUs will make network processors obsolete? >>: I don't know. But a network processor is very hard to program. Failing to provide a programming environment like the graphics processor companies have done is one major weakness of network processor marketing. >>: So what about reordering? >> Sangjin Han: Reordering -- yeah, no, there is no reordering in this system. If you ask why, then -- I should have prepared a backup slide -- >>: Today? >>: What about the slide. >>: The answer was -- basically some reordering can happen, but packets in the same flow will not be reordered. >>: The packets -- >>: The possibility of packets being reordered takes place from the NICs to the queues in memory, in the driver. So from the NIC, the packets are sent to the queues by the driver. That's the only time reordering can take place. Packets one, two, three, four arrive at the NIC; one goes to queue one, two goes to queue two, three goes to queue three, and those packets might actually come out not in order but out of order. But when the NIC sends packets to its queues, it actually uses RSS -- the five-tuple hash support at the NIC. So packets from the same flow go to the same queue, so packets from the same flow are never reordered -- and all the GPU processing is done in order. So only at that part is there room for reordering, but not for packets from the same flow -- >>: The batch goes and comes back in the same order; that's how it's enforced? >>: Yes.
But the shader process -- the pre-shader, shader and post-shader -- is all -- >>: What exactly did you code on the GPU for forwarding? It seems like you always do it in one shot -- like for each packet there's one operation that happens on the GPU. But it looks like some checking could also be parallelized. Then you might want to do two different things on the batch, do some checking. >> Sangjin Han: Basically we need to do something like that for IPsec tunnelling, because we do AES and [indiscernible], and these are done in separate stages. So we can do two things, not just one. Why not three? >>: So it's general. Okay, that was my question. So can you go to the graph about IPv6 forwarding? This one, right. So is this assuming you're doing a forwarding lookup per packet? How many destinations? The average router only has a few thousand active destinations in it, right? And you could cache them. Does that mean that this is not happening except for like one in a thousand packets, or is it the case that you're actually going to do a lookup every time? What's going on? >> Sangjin Han: This is basically the worst case of IPv6 forwarding. We have a large number of prefixes in the system, and every packet sent to PacketShader has a different -- >>: Destination. >> Sangjin Han: -- destination IP address. >>: Worst case from both the incoming stream and also the routing table, because actually with an IPv6 routing table you would see fewer entries than with IPv4, because you have more consolidation of routes. >>: For IPv6 there's no realistic forwarding table that's challenging; ours is randomly generated, while the IPv4 table is more a reflection of today's routers in operation. >>: Presumably a very high percentage of your lookups would be served from a cache rather than actually looking up the tables? Is that wrong? >>: I mean, actually -- >>: I see, so if the lookup is expensive you'd only cache the 500 flows and where they would go.
>>: [indiscernible] you can't just deal with different fast-lookup and slow-lookup packets. You have to basically put the entire forwarding table onto the line card, however you implement it. >>: But yours isn't a real router, right? It's a software router. >>: You want me to be realistic, don't you? >>: Why would you penalize yourself? Okay, fair enough. >> Sangjin Han: Thank you. [applause]