High Performance Modular Packet Processing with Click and GPU
Weibin Sun, Robert Ricci
{wbsun, ricci}@cs.utah.edu
University of Utah, School of Computing
Software packet processing, such as the Click modular router, is far more flexible than hardware-based packet processing. Recent advances in software packet I/O (Netmap, psio, PF_RING) have enabled 10Gb/s throughput on 10Gb NICs, with the help of Receive Side Scaling (RSS) on multi-queue NICs and multi-core CPUs, and of zero-copy buffer sharing between kernel and user mode. With these advances, it is possible to build software packet processing systems that run at line rate. Such systems include intrusion detection, firewalls, VPN, OpenFlow, IP forwarding, deep packet inspection for ISPs to detect downloads of pirated content, and many others.
However, computation is now becoming the bottleneck that prevents these systems from reaching line rate. For instance, the pattern matching algorithm for intrusion detection runs at only 2.15Gb/s on one Intel Xeon X5680 core, according to the Kargus research, so it takes five CPU cores to saturate a single 10Gb port. This requires a significant number of CPU cores dedicated to packet processing, which may contend with the packet I/O cores and lower overall system performance. In addition, multiple CPUs cost considerably more than another feasible commodity hardware choice: the parallel GPU. Many research projects have shown large performance improvements for various computations on GPUs compared with CPUs. For pattern matching, recent work on Kargus shows 39.1Gb/s throughput on a $400 GTX580 GPU, which is 3x faster than the $1639 six-core Xeon X5680 CPU above. Therefore, to build a flexible, high-performance packet processing system that runs at 10Gb/s line rate, we take the modular Click router and integrate GPU computing into it for computation-intensive processing.
To use the GPU computing library and other userspace libraries, we use usermode Click. Click was not originally designed for usermode zero-copy packet I/O or parallel GPU computing. We adopt the following techniques to deal with the obstacles in Click, multi-queue NICs and multi-core CPUs:
a) GPU Management: A new GPURuntime (specifically, CUDA runtime) Click information element manages GPU memory, kernels and state. GPU computing can use either mapped or non-mapped page-locked CUDA memory; the two have different characteristics that are worth investigating in our evaluation.
b) Batching: We use batching to replace Click's one-packet-at-a-time processing style. A Batcher element batches packets and prepares GPU memory. It also supports sliced copy, which copies only a specified range of each packet's data.
c) Wider Interface: To push and pull packet batches, we define BElement and BPort, which provide packet batch support.
d) Hybrid CPU and GPU Processing: A CGBalancer element load-balances packets between the GPU and the CPU according to a specified policy and the system state. This provides a flexible GPU offloading mechanism.
e) Zero-copy: We use Netmap for zero-copy packet capture and transmission. The GPU-enabled Click has been modified so that Netmap uses the CUDA memory that the GPU driver uses for DMA. The result is zero-copy buffer sharing among the GPU driver, the NIC driver and Click.
f) Multi-queue Support: Netmap provides multi-queue packet capture and transmission. To make the buffers used on different CPU cores NUMA-aware, we use CUDA's cudaHostRegister() to pin NUMA-aware allocated memory, so that it can be used both for GPU DMA and as zero-copy packet buffers.
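To make the batching and sliced-copy idea in item b) concrete, here is a minimal sketch in plain Python. It is not the Click element itself: the class layout, the parameter names (`batch_size`, `slice_offset`, `slice_len`) and the zero-padding of short packets are assumptions made for illustration. A real implementation would write into page-locked CUDA memory rather than a `bytearray`.

```python
class Batcher:
    """Hypothetical batcher: copies one byte range of each packet
    into a flat, fixed-stride buffer and emits it when full."""

    def __init__(self, batch_size, slice_offset, slice_len):
        self.batch_size = batch_size      # packets per batch
        self.slice_offset = slice_offset  # first byte to copy from each packet
        self.slice_len = slice_len        # bytes copied per packet (the stride)
        self._buf = bytearray()
        self._count = 0

    def push(self, packet):
        """Add one packet. Return (buffer, count) when the batch is full, else None."""
        chunk = packet[self.slice_offset:self.slice_offset + self.slice_len]
        # Assumption: short packets are zero-padded so every slot has the same size.
        self._buf += chunk.ljust(self.slice_len, b"\x00")
        self._count += 1
        if self._count < self.batch_size:
            return None
        return self.flush()

    def flush(self):
        """Emit whatever has been collected and start a new batch."""
        batch = (bytes(self._buf), self._count)
        self._buf = bytearray()
        self._count = 0
        return batch


batcher = Batcher(batch_size=2, slice_offset=2, slice_len=4)
first = batcher.push(b"\xaa\xbb\x01\x02\x03\x04\xff")   # batch not full yet
second = batcher.push(b"\xaa\xbb\x05\x06")              # completes the batch
```

Because packets are appended in arrival order and the batch is emitted as one unit, the slots in the buffer keep the original packet order, and the fixed stride lets a GPU thread find packet `i` at offset `i * slice_len`.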
We have implemented the Click-level improvements, including the GPU-related elements above, and the Netmap/CUDA/Click integration for zero-copy memory sharing. We are developing several computational packet processing elements for evaluation, such as a simple firewall, IPv6 routing, IPsec, VPN and OpenFlow. Besides these functional applications, we also want to investigate the following problems to fully study and analyze our system:
a) Compare the costs of a CPU-only system and a GPU-enabled system, and build theoretical cost models for them when scaling up.
b) Compare the different GPU runtime memory types under different workloads and other conditions.
c) Study the effect of the load balancer policy when running different workloads.
d) Answer the question: is multi-queue really needed? The question follows from two facts: the Netmap research found that a single CPU core can handle line-rate forwarding, and we now have GPUs dedicated to computing.
e) Study and measure the scalability of our GPU-enabled Click with regard to the number of NICs.
The Problem
CPU limits functionality in line rate packet processing:
  10Gb/s line rate packet I/O available [Netmap, psio].
  Compute intensive packet processing:
    Intrusion detection
    Firewall
    Deep packet inspection
    VPN, OpenFlow, ..., etc.
  Flexible platform needed.

Our Solution
Flexibility: Click modular router!
Compute intensive processing: Faster/More CPU cores?
  NO! We have faster and cheaper GPUs.

How much faster can we run with GPU?
Computation      GPU                     CPU                     Speedup
String Matching  39.1 Gbps [1]           2.15 Gbps [1]           18.2
RSA-1024         74,732 ops/s [2]        3,301 ops/s [2]         22.6
AES-CBC Dec      ~32 Gbps [3]            15 Gbps [2]             2.13
HMAC-SHA1        31 Gbps [2]             3.343 Gbps [2]          9.3
IPv6 Lookup      62.4 x 10^6 ops/s [4]   1.66 x 10^6 ops/s [4]   37.6
[1]: Kargus [CCS'12], [2]: SSLShader [NSDI'11], [3]: GPUstore [SYSTOR'12], [4]: PacketShader [SIGCOMM'10]

Cheaper? Take 'String Matching' for example:
In [1], a six-core Xeon X5680 CPU costs $1639 and a GTX 580 GPU costs $400. One GPU performs about as well as 18 CPU cores, hence three X5680s, which cost $4917: about 12 times more.

What existing technologies do we have now?
CUDA Runtime
Multi-queue (RSS) NICs
Multi-core CPUs: 1 thread/queue
Pre-allocated zero-copy NUMA-aware packet buffer
GPU Driver
GPU

GPU and Click, How?
Heterogeneous GPU Computing in Click
  Manage GPU resources in Click, allow GPU Click elements.
  GPURuntime (CUDA) element for Click to manage GPU resources.
Batching needed for GPU
  Batcher element for batch, slice, copy, and naturally no REORDER problem.
  BElement and BPort, with wider bpush/bpull for batched packets.
Flexible CPU/GPU load balancing
  Utilize what we have now for CPU-based packet processing.
  CGBalancer element, to load-balance between GPU and CPU.
Efficient MM for GPU, NIC and Click
  GPUDirect, to zero-copy among GPU driver, NIC driver, and Click.
  ✓ CUDA memory for Netmap, NIC driver and Click.
MultiQueue support
  ✓ NUMA-aware CUDA memory allocation.
[Figure: architecture diagram of the GPU-enabled Click. Labels: eth1 ... ethX (rx) and (tx), Multiple Queues, Zero-copy Packet Buffer, MultiCore (Core1 ... CoreY with P1/T1 ... Py/Ty), Click, Batcher, CGBalancer, CPUXXX and GPUXXX elements, MemcpyHtoD, Launch GPU Kernel, MemcpyDtoH, GPURuntime (information element).]

Problems To Investigate (and To Discuss)
How many 10Gb NICs can a single CPU core handle?
  To know the ratio of #CPU cores to #GPUs and the cost saving, and hence the overall system cost comparison.
Study mapped and non-mapped GPU memory under different workloads, batches, slicing, and scattered packet buffers.
Effects of workload-specific balancer policy on hybrid CPU+GPU packet processing.
Is multi-queue really needed?
  According to the Netmap work, a single CPU core can handle both RX and TX at line rate for forwarding.
  Using the GPU as the main computing resource, the CPU can just do I/O and interrupt handling.
To what extent (#NICs) can this GPU-enabled Click scale up?
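For the balancer-policy question above, it may help to see what one very simple policy could look like. The sketch below is hypothetical and is not the CGBalancer implementation: the function name, the two thresholds and their default values are all assumptions chosen for illustration. It sends a batch to the GPU only when the batch is large enough to be worth the transfer and the GPU is not already backed up.

```python
def choose_device(batch_len, gpu_inflight, min_gpu_batch=64, max_gpu_inflight=4):
    """Hypothetical threshold policy: return "gpu" or "cpu" for one batch.

    batch_len        -- number of packets in the batch
    gpu_inflight     -- batches already queued on the GPU (system state)
    min_gpu_batch    -- assumed smallest batch worth offloading
    max_gpu_inflight -- assumed limit on queued GPU batches
    """
    if batch_len >= min_gpu_batch and gpu_inflight < max_gpu_inflight:
        return "gpu"
    return "cpu"


decisions = [
    choose_device(batch_len=128, gpu_inflight=0),  # large batch, idle GPU
    choose_device(batch_len=8, gpu_inflight=0),    # too small to offload
    choose_device(batch_len=128, gpu_inflight=4),  # GPU already saturated
]
```

A workload-specific policy would tune such thresholds per computation, since the table of GPU speedups suggests the break-even batch size differs widely between, say, AES decryption and IPv6 lookup.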
Current Progress and Todo
Done: infrastructure level, including Click GPU-related elements and Netmap/CUDA integration.
Todo: computational Click packet processing elements on GPU for evaluation:
  Simple firewall: online packet inspection.
  IP routing, IPsec, VPN
  OpenFlow, ...