Harnessing GPU Computing in System Level Software
Weibin Sun

Software stack needs a fast system-level base!

System functionality can be expensive!
‣ Crypto: AES, SHA1, MD5, RSA
‣ Lookup: IP routing, DNS, key-value store, DB
‣ Erasure coding: RAID, distributed storage, wireless
‣ Pattern matching: NIDS, virus scanning, searching
‣ Compression: storage, network
‣ …

Modern digital workloads are big!

⇒ System software needs high-throughput processing!

Many system tasks are data-parallelizable: packets, sectors, data blocks, …

GPUs are the de-facto standard parallel processors!

⇒ Use GPUs in system software!
Thesis statement

The throughput of system software with parallelizable, computationally expensive tasks can be improved by using GPUs and frameworks with memory-efficient and throughput-oriented designs.
Thesis work

• Generic principles
• Concretely:
  • Two generic frameworks for representative system tasks: storage and network
  • Example applications on top of both frameworks
• Literature survey*
CUDA GPU essentials

[GPU architecture diagram: several Stream Multiprocessors (each with registers, shared memory, and an L1 cache), a shared L2 cache, constant memory, texture memory, and global (device) memory; the GPU connects to main (host) memory over PCIe.]

• SIMT wimpy cores, executed as 32-core warps
• Wide device-memory interface: 384 bits
• Necessary memory copies between host and device memory over PCIe
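The copy-compute-copy pattern implied by the diagram looks roughly like the sketch below; it is a generic CUDA illustration (the kernel and sizes are made up), not code from the thesis.

#include <cuda_runtime.h>

__global__ void xor_kernel(unsigned char *d, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        d[i] ^= 0x5a;                     /* stand-in for real per-byte work */
}

int process_on_gpu(unsigned char *host_buf, size_t n)
{
    unsigned char *dev;
    if (cudaMalloc(&dev, n) != cudaSuccess)
        return -1;
    cudaMemcpy(dev, host_buf, n, cudaMemcpyHostToDevice);  /* PCIe copy in  */
    xor_kernel<<<(n + 255) / 256, 256>>>(dev, n);          /* warps of 32   */
    cudaMemcpy(host_buf, dev, n, cudaMemcpyDeviceToHost);  /* PCIe copy out */
    cudaFree(dev);
    return 0;
}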
Related work: pioneers

• Gnort [RAID ’08]
• PacketShader [SIGCOMM ’10]
• SSLShader [NSDI ’11]
• EigenCFA [POPL ’11]
• Gibraltar [ICPP ’10]
• CrystalGPU [HPDC ’10]
• …

Specialized, ad-hoc solutions to one particular application!
GPU computing frameworks/models

• Hydra [ASPLOS ’08]
• CUDA-Lite [LCPC ’08]
• hiCUDA [GPGPU ’09]
• ASDSM [ASPLOS ’10]
• Sponge [ASPLOS ’11]
• CGCM [PLDI ’11]
• PTask [SOSP ’11]
• …

These assume:
• a simple workflow
• long computations, so PCIe copies and synchronization do not matter
• no divergence within batched data

⇒ System-level GPU computing needs generic frameworks!
Rest of this talk

• Generic principles
• GPUstore: for storage systems
• Snap: for packet processing
System level GPU computing: generic principles
Problems

Latency-oriented system code vs. throughput-oriented GPU
CPU-GPU synchronization hurts asynchronous systems
Wasted PCIe data transfer
Double buffering for GPU DMA
Latency vs. throughput

Batched processing: wait for enough workload
• How much is enough?
• Increased latency
CPU-GPU synchronization

Blocking:
  CPU: sleeps waiting for the GPU to finish
  GPU: runs the kernel

Non-blocking:
  CPU: keeps working while the GPU runs
  GPU: runs the kernel

System code (network, disk, etc.) likes callbacks and polling, not sleeping for synchronization.
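A minimal sketch of the non-blocking style with CUDA streams: the CPU enqueues the copy/kernel/copy, registers a completion callback, and returns immediately instead of sleeping. The request type, kernel, and callback are hypothetical illustrations, not code from GPUstore or Snap.

#include <cuda_runtime.h>

typedef struct { unsigned char *host_buf, *dev_buf; size_t len; } request_t;

__global__ void process_kernel(unsigned char *data, size_t len)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < len)
        data[i] ^= 0x5a;                 /* stand-in for real per-byte work */
}

static void CUDART_CB on_done(cudaStream_t s, cudaError_t err, void *arg)
{
    /* completion callback: hand the finished request back to the
     * asynchronous system code (complete a bio, push a packet, ...) */
    request_t *req = (request_t *)arg;
    (void)s; (void)err; (void)req;
}

void submit_request(request_t *req, cudaStream_t stream)
{
    cudaMemcpyAsync(req->dev_buf, req->host_buf, req->len,
                    cudaMemcpyHostToDevice, stream);
    process_kernel<<<(req->len + 255) / 256, 256, 0, stream>>>(
        req->dev_buf, req->len);
    cudaMemcpyAsync(req->host_buf, req->dev_buf, req->len,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamAddCallback(stream, on_done, req, 0);
    /* no cudaStreamSynchronize(): the CPU keeps working, and progress can
     * also be polled with cudaStreamQuery(stream) == cudaSuccess */
}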
How is PCIe transfer wasted?

Do you need the entire packet to do a routing lookup?
What you need: 4 bytes (the destination IP)
What you may transfer: ≥ 64 bytes (the whole packet)
Compact your workload data

For the “use-partial” data-usage pattern:

[Diagram: compaction gathers only the needed bytes in host memory, and only the compacted buffer crosses PCIe into GPU memory.]
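The idea in code, as a minimal sketch with made-up packet and batch types: gather just the 4-byte destination addresses into one contiguous pinned buffer and do a single small PCIe transfer, instead of shipping whole packets.

#include <stdint.h>
#include <string.h>
#include <cuda_runtime.h>

struct packet    { uint8_t *data; uint16_t len; };   /* assumed layout      */
struct pkt_batch { struct packet *pkts; int n; };    /* assumed batch type  */

#define IP_DST_OFF 30    /* 14-byte Ethernet header + offset 16 to IPv4 dst */

void copy_dst_ips(const struct pkt_batch *b, uint32_t *pinned_roi,
                  uint32_t *dev_roi, cudaStream_t stream)
{
    for (int i = 0; i < b->n; i++)                   /* CPU-side compaction */
        memcpy(&pinned_roi[i], b->pkts[i].data + IP_DST_OFF, sizeof(uint32_t));

    /* one 4*n-byte transfer instead of n transfers of >= 64 bytes each */
    cudaMemcpyAsync(dev_roi, pinned_roi, b->n * sizeof(uint32_t),
                    cudaMemcpyHostToDevice, stream);
}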
GPU DMA constraints

• DMA only works on locked (pinned) memory pages
• GPU drivers only recognize their own locked memory pool

[Diagram: memory outside the GPU driver’s locked area must first be copied into driver-locked memory before it can be DMAed over PCIe into GPU memory.]

⇒ Make the GPU driver and system code use the same locked memory for zero-copy DMA!
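In user space, the same-locked-memory idea looks like the sketch below: allocate the buffer from the CUDA driver's pinned pool with cudaHostAlloc(), let the producer fill it in place, and the async copy DMAs straight from it with no staging copy. A generic illustration, not the thesis code.

#include <unistd.h>
#include <cuda_runtime.h>

int load_and_upload(int fd, size_t len, void **dev_out, cudaStream_t stream)
{
    void *pinned, *dev;

    if (cudaHostAlloc(&pinned, len, cudaHostAllocDefault) != cudaSuccess)
        return -1;                       /* buffer lives in the pinned pool */
    if (cudaMalloc(&dev, len) != cudaSuccess)
        return -1;

    if (read(fd, pinned, len) < 0)       /* producer writes it in place     */
        return -1;

    cudaMemcpyAsync(dev, pinned, len,    /* direct DMA, no bounce buffer    */
                    cudaMemcpyHostToDevice, stream);
    *dev_out = dev;
    return 0;
}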
Generic principles

Batched processing
Asynchronous non-blocking GPU programming
Reduce PCIe data transfer with compacted workloads
All use the same locked memory for zero-copy DMA
GPUstore: Harnessing GPU Computing
in OS Storage Systems
[Weibin Sun, Robert Ricci, Matthew L. Curry @ SYSTOR’12]
GPUstore overview

• A storage framework for the Linux kernel to use CUDA GPUs in filesystems, block drivers, etc.

[Diagram: the Linux storage stack: syscall → VFS → filesystem → page cache → block I/O → virtual block driver → block driver → disk.]

• Minimally invasive GPU integration:
  • small changes
  • preserve interfaces and semantics
  • stay efficient
GPUstore integration

Filesystems (e.g., eCryptfs):
    ...
    CPUCipher(buf);   →   GPUCipher(buf);
    ...

Functional virtual block drivers (e.g., dm-crypt, MD RAID6):
    ...
    CPURSCode(buf);   →   GPURSCode(buf);
    ...

The GPUCipher/GPURSCode calls send computation requests to GPUstore, which dispatches them to GPU services such as GPU ciphers and GPU RAID6 recovery.
GPUstore request workflow

Service users (syscalls, filesystems, block drivers, …) issue requests through a service interface; for each request GPUstore performs:
  1. service-specific scheduling (CPU)
  2. execution preparation (CPU)
  3. GPU execution (GPU)
  4. post-execution (CPU)
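As an illustration of what a service call from a kernel client might look like, here is a hypothetical request shape; the struct and function names are invented for this sketch and are not GPUstore's actual interface.

#include <linux/types.h>

struct gpu_svc_req {
    void   *buf;                       /* data in GPU-DMA-able locked memory */
    size_t  len;
    u32     service_id;                /* e.g. AES cipher, RAID6 recovery    */
    void  (*callback)(struct gpu_svc_req *req, int status);
    void   *private_data;              /* client context: bio, page, ...     */
};

/* Non-blocking submit: the request is queued for service-specific
 * scheduling (merge/split), preparation, GPU execution, and post-execution;
 * the callback runs when the request's GPU work completes. */
int gpu_svc_submit(struct gpu_svc_req *req);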
Apply the generic principles in GPUstore

Batched processing:
  merge small requests, split large requests
Asynchronous non-blocking GPU programming:
  callback-based request processing
Request scheduling

Merge small requests; split large requests.
How to define “small” and “large”? Up to the specific service.
Apply the generic principles in GPUstore

All use the same locked memory for zero-copy DMA:
  problem: user-space CUDA library dependency
  solution: an in-kernel allocator for CUDA’s locked memory
In-kernel CUDA locked memory allocator

[Diagram: CUDA locked memory is allocated as a contiguous VA range in a user-space helper; the same physical pages are mapped into a contiguous kernel VA range with vmap(), so kernel code can allocate from the same pool.]

What’s next? Replace kmalloc()/vmalloc()/get_free_pages() with GPUstoreMalloc()?
Sometimes that is infeasible!
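A minimal sketch of the mapping step, assuming a user-space helper has already cudaHostAlloc()ed a pinned region and told the kernel its address and length: pin the helper's pages and vmap() them so the kernel sees the same CUDA-locked memory as one contiguous range. The exact get_user_pages variant differs across kernel versions; this is illustrative, not GPUstore's actual allocator.

#include <linux/mm.h>
#include <linux/vmalloc.h>
#include <linux/slab.h>

void *map_cuda_locked(unsigned long uaddr, int npages, struct page ***pages_out)
{
    struct page **pages;
    void *kaddr;
    int got = 0;

    pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
    if (!pages)
        return NULL;

    got = get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages);
    if (got != npages)
        goto err;

    kaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);  /* contiguous kernel VA */
    if (!kaddr)
        goto err;

    *pages_out = pages;        /* caller keeps the refs for later vunmap/put */
    return kaddr;
err:
    while (got > 0)
        put_page(pages[--got]);
    kfree(pages);
    return NULL;
}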
Why infeasible?

➡ Who allocates the memory?
  ๏ “pass-by-ref” interfaces
➡ We can NOT modify heavily-depended-upon cache allocators:
  ‣ page cache, buffer cache
  ‣ object caches
  ‣ packet pools
  ‣ …

⇒ Can we use arbitrary memory for GPU DMA?
Remap external memory for GPU DMA

[Diagram: set page-table entries so the external memory’s physical pages appear inside the GPU driver’s locked memory area (VMA); the driver then DMAs over PCIe directly from those pages into GPU memory.]
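A minimal sketch of the remapping idea, with hypothetical arguments: install the external pages (for example, page-cache pages) into page slots of the driver's locked VMA so the driver DMAs from them without a bounce copy. The real mechanism has to cooperate with the closed driver's own VMA bookkeeping, so treat this only as an illustration.

#include <linux/mm.h>

int remap_into_locked_vma(struct vm_area_struct *locked_vma,
                          unsigned long locked_uaddr,
                          struct page **ext_pages, int npages)
{
    int i, err;

    for (i = 0; i < npages; i++) {
        /* point the i-th page slot of the locked region at ext_pages[i] */
        err = vm_insert_page(locked_vma,
                             locked_uaddr + i * PAGE_SIZE, ext_pages[i]);
        if (err)
            return err;
    }
    return 0;
}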
Apply the generic principles in GPUstore

All use the same locked memory for zero-copy DMA:
  • in-kernel CUDA locked memory allocator
  • remap external memory into the GPU driver’s locked memory area
Implementation and evaluation

• Case studies of major storage residents:
  • dm-crypt: disk encryption layer (GPU encryption service)
  • eCryptfs: encrypted filesystem (GPU encryption service)
  • md: software RAID6 (GPU RAID recovery service)

• Integration cost (approximate modified lines of code required to call GPU services):

  Subsystem   Total LOC   Modified LOC   Percent
  dm-crypt    1,800       50             3%
  eCryptfs    11,000      200            2%
  md          6,000       20             0.3%
AES cipher performance

[Chart: AES throughput (MB/s) vs. buffer size from 4 KB to 4 MB, comparing CPU, Base GPU, GPU with Split, GPU no RB, and GPU no RB with Split.]
Faster than SSD: dm-crypt

[Chart: dm-crypt throughput (MB/s) vs. read/write size from 4 KB to 4 MB for GPU and CPU reads and writes; an SSD reference line is marked at 250 MB/s.]
Working with existing optimizations: eCryptfs on RAM disks

[Chart: eCryptfs throughput (MB/s) vs. read/write size from 4 KB to 4 MB for GPU and CPU reads and writes; Linux’s 128 KB maximum read-ahead affects the read curves.]
GPUstore summary

• Enables efficient GPU-accelerated storage in the Linux kernel with small changes
• https://github.com/wbsun/kgpu
• Details not presented: the user-space helper, non-blocking kernel/user communication, more experimental results
Snap: Fast and Flexible Packet Processing
With GPUs and Click
[Weibin Sun, Robert Ricci @ ANCS’13]
Click background [Kohler et al., TOCS ’00]

• Elements, e.g. FromDevice(eth0), CheckSum, IPLookup, ToDevice(eth1), ToDevice(eth2)
• Ports
• push/pull processing
Why use GPUs in Click?

Configuration                                                        Throughput    Slowdown
Simple forwarder:
  FromDevice(ethX, RingY) → ToDevice(ethX, RingY)                     30.97 Gbps    N/A
Simple SDN forwarder:
  FromDevice(ethX, RingY) → SDNClassifier → ToDevice(ethX, RingY)     17.7 Gbps     42.7%

⇒ Packet processing engines need more computing power!
Why use GPUs in Click?

[Figure 2 from the Snap paper, GPU vs. CPU performance on packet processing: throughput (Mpps) for (a) an IP lookup algorithm (radix tree) vs. number of packets per batch, (b) an IDS matcher (Aho-Corasick) vs. packet size, and (c) an SDN forwarder, comparing the CPU against GPU batches of 64, 256, 1K, and 8K packets.]

⇒ And the GPU rocks!
Snap: the idea

• Move parts of the Click pipeline onto GPUs
  • CPU for sequential work
  • GPU for parallel work
• Keep Click’s modular style

[Diagram: a Click pipeline from SomeSource to SomeDestination in which SequentialElement, SequentialWork, MoreSeqWork, and FurtherSequential stay on the CPU, while ParallelWork, MoreParallelWork, and AgainParallelWork run on the GPU; a Drop branch discards packets.]
Apply the generic principles in Snap

Batched processing
Snap batched processing

[Diagram: on the single-packet path, Packets move between Elements via push/pull; a Batcher assembles them into PacketBatches that move between BElements via bpush()/bpull() on the multi-packet path; a Debatcher converts back to the single-packet path.]
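A minimal sketch of the data structure this implies, with hypothetical field names (not Snap's real classes): a PacketBatch keeps the original packet pointers plus per-batch GPU state, so BElements can operate on the whole batch at once.

#include <stdint.h>
#include <cuda_runtime.h>

#define MAX_BATCH 8192

struct pkt;                              /* opaque Click packet              */

struct packet_batch {
    struct pkt  *pkts[MAX_BATCH];        /* originals, for the Debatcher     */
    int          npkts;

    cudaStream_t stream;                 /* per-batch stream (see later)     */

    uint8_t     *host_rois;              /* sliced ROIs, CUDA locked memory  */
    uint8_t     *dev_rois;               /* same data on the GPU             */
    size_t       roi_stride;             /* bytes of ROI per packet          */

    uint8_t      predicates[MAX_BATCH];  /* which elements touch each packet */
};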
Batched packet divergence

• Packets in one batch go to different paths (e.g., a GPUClassifier sends some to GPUChecksum and others to GPUEncPacket)

Two choices:
1. Split the batch into two copies
   ➡ needs PCIe copies and synchronization!
2. Keep the batch and skip some packets on the GPU
Predicated execution in a batch

[Diagram: a PacketBatch flows through GPUElement-1, GPUElement-2, and GPUElement-3 to a Debatcher and Dispatcher; each packet carries predicate flags (predicate 0, predicate 1) that select which elements actually process it and which output it leaves from.]

GPU side:

    void ge2kernel(...) {
        if (pkt.predicates[0]) {
            ... // GPUElement-2 logic
        }
    }

    void ge3kernel(...) {
        if (pkt.predicates[1]) {
            ... // GPUElement-3 logic
        }
    }

CPU-side dispatcher, after debatching:

    if (pkt.predicates[0])
        output(0).push(pkt);
    else if (pkt.predicates[1])
        output(1).push(pkt);
    ...
Apply the generic principles in Snap

Asynchronous non-blocking GPU programming:
  • each batch binds a CUDA stream
  • BElements only issue asynchronous GPU operations
  • a GPUCompletionQueue checks stream status
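A minimal sketch of that style, reusing the hypothetical packet_batch above: a BElement enqueues only asynchronous work on the batch's stream, and the completion queue polls the stream from Click's task loop rather than blocking.

#include <stdint.h>
#include <stdbool.h>
#include <cuda_runtime.h>

__global__ void lookup_kernel(uint8_t *rois, size_t stride, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        rois[i * stride] &= 0x7f;   /* stand-in for a real per-packet lookup */
}

void belement_bpush(struct packet_batch *b)
{
    size_t bytes = (size_t)b->npkts * b->roi_stride;

    cudaMemcpyAsync(b->dev_rois, b->host_rois, bytes,
                    cudaMemcpyHostToDevice, b->stream);
    lookup_kernel<<<(b->npkts + 255) / 256, 256, 0, b->stream>>>(
        b->dev_rois, b->roi_stride, b->npkts);
    cudaMemcpyAsync(b->host_rois, b->dev_rois, bytes,
                    cudaMemcpyDeviceToHost, b->stream);
    /* no synchronization here: push the batch downstream immediately */
}

/* Polled by a completion-queue element until the batch's stream is idle. */
bool batch_done(const struct packet_batch *b)
{
    return cudaStreamQuery(b->stream) == cudaSuccess;
}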
Apply the generic principles in Snap

Reduce PCIe data transfer with compacted workloads:
  “region-of-interest” (ROI) based packet slicing
Packet slicing for the “use-partial” pattern

Packet classification’s regions of interest (ROIs): protocol, source IP, destination IP, source port, destination port.

Slicing copies only those ROIs out of each packet into a compact per-batch buffer, which also gives coalescable memory access on the GPU.
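A minimal sketch of why sliced ROIs coalesce well, assuming slicing lays each ROI out as its own array with one entry per packet (the roi_soa layout and the toy rule are invented for this example): consecutive threads then load consecutive words.

#include <stdint.h>

struct roi_soa {                    /* one array per ROI, filled by slicing */
    uint32_t *src_ip, *dst_ip;
    uint16_t *src_port, *dst_port;
    uint8_t  *proto;
};

__global__ void classify_kernel(struct roi_soa rois, int npkts, int *rule_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= npkts)
        return;
    /* neighboring threads read neighboring elements: coalesced loads */
    uint32_t dst   = rois.dst_ip[i];
    uint8_t  proto = rois.proto[i];
    rule_out[i] = (proto == 6 /* TCP */ && (dst >> 24) == 10) ? 1 : 0;
}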
Apply the generic principles in Snap

All use the same locked memory for zero-copy DMA:
  ROIs are stored in CUDA locked memory
Evaluation

• Basic forwarding I/O improvement
• Example applications: classification, IP routing, pattern matching for IDS
• Latency and re-ordering
• Flexibility and modularity
Classifier + routing + pattern matching

[Figure 5.8, performance of Click and Snap with three different applications: throughput (Gbps) vs. packet size (64–1518 bytes) for Click, Snap-CPU, Snap-GPU, and Snap-GPU with slicing; panels include (b) a DPI router and (c) an IDS router.]
Evaluation
[Charts: Snap packet I/O and forwarding throughput compared with Click across packet sizes and NIC/thread configurations.]

• CPU: only 1/3 to 1/4 of the GPU’s throughput
• GPU: reached the 40 Gb/s line rate at 128-byte packets
• Latency: tolerable in a LAN for non-latency-sensitive apps, negligible in a WAN
Preserving Click’s flexibility: full IP router

[Fig. 8 from Kohler et al., Click’s standard two-interface IP router configuration: FromDevice, Classifier, ARPResponder/ARPQuerier, Paint, Strip(14), CheckIPHeader, GetIPAddress(16), LookupIPRoute, DropBroadcasts, CheckPaint, IPGWOptions, FixIPSrc, DecIPTTL, IPFragmenter(1500), ICMPError elements, ToDevice.]

We add a pattern-matching (IDS) element to this pipeline.

[Figure 5.11, fully functional IP router + IDS performance: throughput (Gbps) vs. packet size (64–1518 bytes) for Click, Snap-CPU, and Snap-GPU with slicing.]
Snap summary

• A generic parallel packet-processing framework
  • the flexibility of Click
  • the parallel processing power of GPUs
• https://github.com/wbsun/snap
• Details not presented: network I/O, asynchronous scheduling
Thesis statement
The throughput of system software with parallelizable, computationally expensive tasks can be improved by using GPUs and frameworks with memory-efficient and throughput-oriented designs.
Conclusion

Generic principles:
  Batched processing
  Asynchronous non-blocking GPU programming
  Reduce PCIe data transfer with compacted workloads
  All use the same locked memory for zero-copy DMA

Concrete frameworks:
  GPUstore, with high-throughput storage applications
  Snap, with high-throughput network packet processing
Thanks!
Q&A
Backup slides
ROI for memory access coalescing

[Diagram: the sliced ROIs of many packets are stored contiguously, so GPU threads access them with coalesced memory transactions instead of scattered per-packet reads.]
How does ROI slicing work?

• For Click insiders:
  • The Batcher accepts ROI requests from BElements
  • The Batcher merges the requested ROIs into the resulting ROIs
  • Each BElement asks for its ROIs’ offsets
  • GPU kernels invoked by BElements use variables for those offsets
How does predicated execution work?

• For Click insiders: predicates are assigned manually: the configuration decides where they are set and what they mean.
Path-encoded predicate

[Diagram: a tree of GPU elements (GPUElement-1 through GPUElement-10) in which each branch decision contributes a bit, so a packet’s predicate encodes the path it takes (e.g., 0 or 1 at the first branch, then 0,0 / 0,1 / 1,1, then 00,0 / 01,0 / 10,0, and so on).]
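One plausible way to realize such an encoding, sketched here with invented helpers (the actual scheme in Snap may differ): each branching element appends its output-port choice to the packet's predicate word, and a downstream element runs only when the packet's path prefix matches the path that leads to it.

#include <stdint.h>
#include <stdbool.h>

/* append an output-port decision (port < 2^bits) to the encoded path */
static inline void path_append(uint32_t *path, uint32_t *len,
                               uint32_t port, uint32_t bits)
{
    *path |= port << *len;
    *len  += bits;
}

/* does this packet's path start with the prefix that reaches the element? */
static inline bool path_matches(uint32_t pkt_path,
                                uint32_t elem_prefix, uint32_t prefix_len)
{
    uint32_t mask = (1u << prefix_len) - 1u;
    return (pkt_path & mask) == elem_prefix;
}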