Harnessing GPU Computing in System Level Software
Weibin Sun

Software stack needs a fast system-level base!
• System functionality can be expensive:
  ‣ Crypto: AES, SHA1, MD5, RSA
  ‣ Lookup: IP routing, DNS, key-value store, DB
  ‣ Erasure coding: RAID, distributed storage, wireless
  ‣ Pattern matching: NIDS, virus scanning, searching
  ‣ Compression: storage, network
  ‣ …
• Modern digital workloads are big, so system software needs high-throughput processing.
• Many system tasks are data-parallelizable: packets, sectors, data blocks, …
• GPUs are the de facto standard parallel processors.
➡ Use GPUs in system software.

Thesis statement
The throughput of system software with parallelizable, computationally expensive tasks can be improved by using GPUs and frameworks with memory-efficient and throughput-oriented designs.
Thesis work
• Generic principles
• Concretely: two generic frameworks for representative system tasks (storage and network), plus example applications on top of both frameworks
• Literature survey

GPU essentials (CUDA)
• A GPU is built from stream multiprocessors (SMs); each SM has its own registers, shared memory, and L1 cache, and all SMs share an L2 cache, constant memory, and texture memory backed by global (device) memory.
• The GPU is attached to main (host) memory over PCIe, so host-device memory copies are necessary.
• SIMT wimpy cores: threads execute in 32-wide warps.
• Wide memory interface: 384 bits.
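To make the necessary PCIe copy concrete, here is a minimal CUDA sketch (mine, not from the slides): allocate device memory, copy the input over PCIe, run a kernel across many threads, and copy the result back. The kernel body is just a stand-in for real per-element work.

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Toy kernel: each thread processes one 4-byte word of the input.
    __global__ void process_blocks(const unsigned int *in, unsigned int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] ^ 0xdeadbeefu;   // stand-in for real per-element work
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(unsigned int);

        unsigned int *h_in = (unsigned int *)malloc(bytes);
        unsigned int *h_out = (unsigned int *)malloc(bytes);
        for (int i = 0; i < n; i++) h_in[i] = i;

        unsigned int *d_in, *d_out;
        cudaMalloc((void **)&d_in, bytes);
        cudaMalloc((void **)&d_out, bytes);

        // The "necessary memory copy": host memory -> device memory over PCIe.
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

        process_blocks<<<(n + 255) / 256, 256>>>(d_in, d_out, n);

        // And back: device memory -> host memory over PCIe.
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

        printf("out[0] = %u\n", h_out[0]);
        cudaFree(d_in); cudaFree(d_out); free(h_in); free(h_out);
        return 0;
    }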
Related work: pioneers
• Gnort [RAID '08]
• PacketShader [SIGCOMM '10]
• SSLShader [NSDI '11]
• EigenCFA [POPL '11]
• Gibraltar [ICPP '10]
• CrystalGPU [HPDC '10]
• …
➡ Specialized, ad-hoc solutions to one particular application.

GPU computing frameworks/models
• Hydra [ASPLOS '08]
• CUDA-Lite [LCPC '08]
• hiCUDA [GPGPU '09]
• ASDSM [ASPLOS '10]
• Sponge [ASPLOS '11]
• CGCM [PLDI '11]
• PTask [SOSP '11]
• …
These assume a simple workflow, computation long enough that PCIe copies and synchronization do not matter, and no divergence within batched data.
➡ System-level GPU computing needs generic frameworks.

Rest of this talk
• Generic principles
• GPUstore: for storage systems
• Snap: for packet processing

System-level GPU computing: generic principles

Problems
• Latency-oriented system code vs. throughput-oriented GPU
• CPU-GPU synchronization hurts asynchronous systems
• Wasted PCIe data transfer
• Double buffering for GPU DMA

Latency vs. throughput
• Batched processing: wait for enough workload.
  • How much is enough?
  • Increased latency.
➡ Solution: batched processing.

CPU-GPU synchronization
• Blocking: the CPU sleeps until the GPU finishes.
• Non-blocking: the CPU keeps working while the GPU runs.
• System code (network, disk, etc.) prefers callbacks and polling.
➡ Solution: asynchronous, non-blocking GPU programming.
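A minimal sketch of the non-blocking style (mine, not the talk's code): issue the copies and kernel on a CUDA stream and return immediately, then let the system's event loop poll cudaStreamQuery() and fire a callback when the stream drains, instead of sleeping in a synchronize call. The request structure and callback convention are illustrative.

    #include <cuda_runtime.h>
    #include <stddef.h>

    __global__ void negate(const unsigned char *in, unsigned char *out, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = ~in[i];
    }

    // Hypothetical per-request state with a system-style completion callback.
    struct gpu_request {
        cudaStream_t stream;
        void (*done_cb)(struct gpu_request *req);
    };

    // Issue everything asynchronously on the request's stream and return.
    // (Host buffers should be page-locked for the copies to be truly async.)
    void submit_request(struct gpu_request *req,
                        const unsigned char *h_in, unsigned char *d_in,
                        unsigned char *d_out, unsigned char *h_out, size_t bytes)
    {
        int threads = 256;
        int blocks = (int)((bytes + threads - 1) / threads);

        cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, req->stream);
        negate<<<blocks, threads, 0, req->stream>>>(d_in, d_out, bytes);
        cudaMemcpyAsync(h_out, d_out, bytes, cudaMemcpyDeviceToHost, req->stream);
        // No cudaStreamSynchronize() here: the caller keeps running.
    }

    // Called periodically from the system's event loop (polling style).
    void poll_request(struct gpu_request *req)
    {
        if (cudaStreamQuery(req->stream) == cudaSuccess)
            req->done_cb(req);   // all work on the stream has finished
    }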
How is PCIe transfer wasted?
• Do you need the entire packet to do a routing lookup?
• What you need: 4 bytes (the destination IP address).
• What you may transfer: ≥ 64 bytes (the whole packet).

Compact your workload data
• For the "use-partial" data usage pattern: compact the needed fields in host memory, then transfer only the compacted data over PCIe to GPU memory.
➡ Solution: reduce PCIe data transfer with compacted workloads.

GPU DMA constraints
• DMA works only on locked (pinned) memory pages.
• GPU drivers only recognize their own locked memory pool, so data in external memory must first be staged into that pool (double buffering) before it can be DMAed over PCIe to GPU memory.
➡ Solution: make the GPU driver and system code use the same locked memory for zero-copy DMA.

Generic principles
• Batched processing
• Asynchronous, non-blocking GPU programming
• Reduce PCIe data transfer with compacted workloads
• All use the same locked memory for zero-copy DMA
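As a concrete illustration of the last two principles (a sketch of mine, not code from the talk): gather only the 4-byte destination addresses of a packet batch into a page-locked staging buffer allocated with cudaHostAlloc(), so the PCIe transfer carries just the compacted data and the GPU can DMA from it without an extra copy. The offset assumes untagged Ethernet plus IPv4 without options; the names are illustrative.

    #include <cuda_runtime.h>
    #include <stdint.h>
    #include <string.h>

    #define BATCH_SIZE      4096
    #define IPV4_DST_OFFSET 30   /* Ethernet header (14) + IPv4 dst offset (16) */

    // Pinned staging buffer: cudaHostAlloc returns page-locked memory that the
    // GPU driver knows about, so cudaMemcpyAsync can DMA from it directly.
    static uint32_t *pinned_dst_ips;

    void init_staging(void)
    {
        cudaHostAlloc((void **)&pinned_dst_ips,
                      BATCH_SIZE * sizeof(uint32_t), cudaHostAllocDefault);
    }

    // Compact: copy only the 4 interesting bytes out of each >= 64-byte packet.
    size_t compact_batch(uint8_t *const pkts[], size_t npkts)
    {
        size_t n = npkts < BATCH_SIZE ? npkts : BATCH_SIZE;
        for (size_t i = 0; i < n; i++)
            memcpy(&pinned_dst_ips[i], pkts[i] + IPV4_DST_OFFSET, sizeof(uint32_t));
        return n;
    }

    // Transfer only n * 4 bytes instead of n * packet_size bytes.
    void upload_batch(uint32_t *d_dst_ips, size_t n, cudaStream_t s)
    {
        cudaMemcpyAsync(d_dst_ips, pinned_dst_ips, n * sizeof(uint32_t),
                        cudaMemcpyHostToDevice, s);
    }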
GPUstore: Harnessing GPU Computing in OS Storage Systems
[Weibin Sun, Robert Ricci, Matthew L. Curry @ SYSTOR '12]

GPUstore overview
• A storage framework for the Linux kernel to use CUDA GPUs in filesystems, block drivers, etc.
• The storage stack it targets: syscall → VFS → filesystem → page cache → block I/O → virtual block driver → block driver → disk.
• Minimally invasive GPU integration:
  • Small changes
  • Preserve interfaces and semantics
  • Stay efficient

GPUstore integration
• Filesystems (e.g., eCryptfs) replace a call like CPUCipher(buf) with GPUCipher(buf).
• Functional virtual block drivers (e.g., dm-crypt, MD RAID6) replace CPURSCode(buf) with GPURSCode(buf).
• These calls become computation requests to GPUstore's GPU services (GPU ciphers, GPU RAID6 recovery).

GPUstore request workflow
• Service users (driven by sys-calls) issue requests through a service interface.
• GPUstore performs service-specific scheduling, execution preparation, GPU execution, and post-execution, splitting work between CPU and GPU.

Apply the generic principles in GPUstore
• Batched processing: merge small requests, split large requests.
• Asynchronous, non-blocking GPU programming: callback-based request processing.

Request scheduling
• Merge small requests; split large requests.
• How to define "small" and "large"? Up to the specific services.
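To show what callback-based request processing might look like from a storage client's side, here is a sketch in kernel-style C. The API names (gpustore_alloc_request, gpustore_submit, the "aes-cipher" service name) are hypothetical stand-ins, not GPUstore's actual interface.

    #include <linux/types.h>
    #include <linux/errno.h>

    /* Hypothetical request descriptor; field names are illustrative. */
    struct gpustore_request {
        void   *buf;                                   /* data to encrypt     */
        size_t  len;
        void  (*complete)(struct gpustore_request *);  /* completion callback */
        void   *private_data;                          /* caller context      */
    };

    /* Assumed framework entry points (not the real GPUstore symbols). */
    struct gpustore_request *gpustore_alloc_request(const char *service);
    int gpustore_submit(struct gpustore_request *req);

    /* Runs later, when the GPU service finishes; resume the I/O path here. */
    static void crypt_done(struct gpustore_request *req)
    {
    }

    /* dm-crypt-style write path: instead of blocking in CPUCipher(buf),
     * hand the buffer to the GPU cipher service and return immediately. */
    static int queue_encrypt(void *buf, size_t len, void *ctx)
    {
        struct gpustore_request *req = gpustore_alloc_request("aes-cipher");

        if (!req)
            return -ENOMEM;
        req->buf = buf;
        req->len = len;
        req->complete = crypt_done;
        req->private_data = ctx;
        return gpustore_submit(req);   /* non-blocking */
    }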
Apply the generic principles in GPUstore: shared locked memory for zero-copy DMA
• Problem: CUDA's locked-memory allocation lives in a user-space library.
• GPUstore therefore provides an in-kernel CUDA locked memory allocator: a user-space helper allocates CUDA locked memory (contiguous in user virtual addresses), and the kernel maps the same physical pages into a contiguous kernel virtual address range with vmap().
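One way the kernel side of such an allocator can work (my reconstruction, assuming the physical pages behind the helper's CUDA pinned buffer are already available as struct page pointers, e.g., via get_user_pages()): vmap() them into one contiguous kernel virtual range so kernel code can use the buffer directly.

    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    /*
     * Map the page-locked pages backing a user-space CUDA pinned buffer into
     * a single contiguous kernel virtual address range.
     */
    void *map_cuda_locked_pages(struct page **pages, unsigned int npages)
    {
        /* vmap() builds one contiguous kernel mapping over physically
         * scattered (but pinned) pages. */
        return vmap(pages, npages, VM_MAP, PAGE_KERNEL);
    }

    void unmap_cuda_locked_pages(void *kaddr)
    {
        vunmap(kaddr);
    }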
What's next? Simply replace kmalloc()/vmalloc()/get_free_pages() with GPUstoreMalloc()?
• Sometimes that is infeasible.

Why infeasible?
➡ Who allocates the memory?
  ๏ "Pass-by-ref" interfaces: callers hand in buffers they allocated.
➡ Cannot modify heavily-depended-on cache allocators:
  ‣ page cache, buffer cache
  ‣ object cache
  ‣ packet pool
  ‣ …
➡ Can we use arbitrary memory for GPU DMA?

Remap external memory for GPU DMA
• Take the GPU driver's locked memory area (a vma) and set its page-table entries to point at the external pages.
• The external memory then appears inside the driver's DMA-locked area and can be DMAed over PCIe to GPU memory directly.

Apply the generic principles in GPUstore: shared locked memory for zero-copy DMA
• In-kernel CUDA locked memory allocator.
• Remap external memory into the GPU driver's locked memory area.

Implementation and evaluation
• Case studies of major storage residents:
  • dm-crypt: disk encryption layer
  • eCryptfs: encrypted filesystem
  • md: software RAID6
• Storage services implemented with GPUstore: GPU encryption (ciphers) and GPU RAID6 recovery.
• Integration cost (approximate lines of code modified to call GPU services):
  • dm-crypt: 1,800 total LOC, 50 modified (3%)
  • eCryptfs: 11,000 total LOC, 200 modified (2%)
  • md: 6,000 total LOC, 20 modified (0.3%)

AES cipher performance
[Figure: AES cipher throughput (MB/s, axis up to 4,500) vs. buffer size from 4 KB to 4 MB, comparing CPU, base GPU, GPU with split, GPU no RB, and GPU no RB with split.]

Faster than SSD: dm-crypt
[Figure: dm-crypt throughput (MB/s) vs. read/write size from 4 KB to 4 MB for GPU and CPU reads and writes; the SSD baseline is 250 MB/s.]

Working with existing optimizations
[Figure: eCryptfs on RAM disks, throughput (MB/s) vs. read/write size for GPU and CPU reads and writes; Linux's maximum 128 KB read-ahead affects reads.]

GPUstore summary
• Enables efficient GPU-accelerated storage in the Linux kernel with small changes.
• https://github.com/wbsun/kgpu
• Details not presented here: the user-space helper, non-blocking kernel-user communication, more experimental results.
Snap: Fast and Flexible Packet Processing With GPUs and Click
[Weibin Sun, Robert Ricci @ ANCS '13]

Click background [Kohler et al., TOCS '00]
• Elements, ports, push/pull processing.
• Example pipeline: FromDevice(eth0) → CheckSum → IPLookup → ToDevice(eth1) / ToDevice(eth2).

Why use GPUs in Click?
• Simple forwarder (FromDevice(ethX, RingY) → ToDevice(ethX, RingY)): 30.97 Gbps.
• Simple SDN forwarder (FromDevice(ethX, RingY) → SDNClassifier → ToDevice(ethX, RingY)): 17.7 Gbps, a 42.7% slowdown.
➡ Packet processing needs more computing power!
Why use GPUs in Click?
[Figure 2 from the Snap paper: GPU vs. CPU performance on packet processing. (a) IP lookup algorithm (radix tree): throughput (Mpps) vs. number of packets per batch. (b) IDSMatcher algorithm (Aho-Corasick): throughput (Mpps) vs. packet size, for the CPU and for GPU batch sizes from 64 to 8K packets.]
➡ With large enough batches, the GPU far outpaces the CPU on these data-parallel packet-processing tasks.
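To make the data-parallelism concrete, here is a simplified CUDA sketch (mine, not Snap's code) of a batched route lookup: one thread per packet, each looking up its own destination address. A flat table indexed by the top 16 bits stands in for the radix tree used in the real measurement.

    #include <cuda_runtime.h>
    #include <stdint.h>

    // Simplified forwarding table: next-hop port indexed by the top 16 bits of
    // the destination IPv4 address (a stand-in for the real radix-tree lookup).
    __global__ void ip_lookup_batch(const uint32_t *dst_ips,   // compacted ROIs
                                    const uint8_t  *route_table16,
                                    uint8_t        *out_ports,
                                    int             npkts)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npkts)
            return;
        uint32_t dst = dst_ips[i];        // coalesced load: thread i reads word i
        out_ports[i] = route_table16[dst >> 16];
    }

    // Host side: one thread per packet in the batch.
    void launch_lookup(const uint32_t *d_dst_ips, const uint8_t *d_table,
                       uint8_t *d_ports, int npkts, cudaStream_t stream)
    {
        int threads = 256;
        int blocks = (npkts + threads - 1) / threads;
        ip_lookup_batch<<<blocks, threads, 0, stream>>>(d_dst_ips, d_table,
                                                        d_ports, npkts);
    }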
Snap: the idea
• Move parts of the Click pipeline onto GPUs:
  • CPU for the sequential elements
  • GPU for the parallel elements
• Keep Click's modular style: a configuration mixes CPU-side elements (sources, sequential work, destinations) with GPU-side parallel elements.

Apply the generic principles in Snap: batched processing
• Element (single-packet path): push()/pull() on a Packet.
• Batcher: collects Packets into a PacketBatch.
• BElement (multi-packet path): bpush()/bpull() on a PacketBatch.
• Debatcher: returns packets to the single-packet path.
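A rough C++ sketch of what a packet batch and a batched element interface might look like (the Batcher/BElement/bpush/bpull names come from the slides; the fields and everything else are my guesses, not Snap's actual definitions):

    #include <cuda_runtime.h>
    #include <cstdint>
    #include <vector>

    struct Packet;   // Click's per-packet descriptor (opaque here)

    // One batch of packets handed between BElements (sketch).
    struct PacketBatch {
        std::vector<Packet *> pkts;   // host-side packet descriptors
        uint32_t *roi_dev;            // sliced ROI data already on the GPU
        uint8_t  *predicates_dev;     // per-packet predicate bits on the GPU
        cudaStream_t stream;          // the CUDA stream bound to this batch
    };

    // Batched element: like a Click Element, but pushes/pulls whole batches.
    class BElement {
    public:
        virtual void bpush(int port, PacketBatch *batch) = 0;
        virtual PacketBatch *bpull(int port) = 0;
        virtual ~BElement() = default;
    };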
Batched packet divergence
• Packets in a batch may go to different paths (e.g., a GPUClassifier sends some packets to GPUChecksum and others to GPUEncPacket).
• Two choices:
  1. Split the batch into two copies ➡ needs extra PCIe copies and synchronization.
  2. Keep the batch and skip some packets on the GPU.

Predicated execution in a batch
• The classifier sets per-packet predicate bits; downstream GPU elements run over the whole batch but only act on packets whose predicate matches, and the Dispatcher after the Debatcher routes each packet by its predicate:

    void ge2kernel(…) {
        if (pkt.predicates[0]) {
            … // GPUElement-2 logic
        }
    }

    void ge3kernel(…) {
        if (pkt.predicates[1]) {
            … // GPUElement-3 logic
        }
    }

    // Dispatcher, after the Debatcher:
    if (pkt.predicates[0])
        output(0).push(pkt);
    else if (pkt.predicates[1])
        output(1).push(pkt);
    …

Apply the generic principles in Snap: asynchronous, non-blocking GPU programming
• Each batch binds a CUDA stream.
• BElements only issue asynchronous GPU operations.
• GPUCompletionQueue is used to check stream status instead of blocking.

Apply the generic principles in Snap: reduce PCIe data transfer
• "Region-of-interest" (ROI) based packet slicing.

Packet slicing for the "use-partial" pattern
• Packet classification's regions of interest: protocol, source IP, destination IP, source port, destination port.
• Slicing copies only these ROIs out of each packet.
• The sliced ROIs also give coalescable memory access on GPUs.
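A small CUDA sketch (mine) of why sliced ROIs coalesce well: when the ROIs are stored as separate per-field arrays, thread i reads element i of each array, so a warp's 32 loads fall in consecutive words and the hardware merges them into a few wide memory transactions.

    #include <stdint.h>

    // Sliced ROIs laid out as structure-of-arrays (one entry per packet).
    struct SlicedRois {
        uint8_t  *proto;
        uint32_t *src_ip, *dst_ip;
        uint16_t *src_port, *dst_port;
    };

    // Toy classifier: mark TCP packets destined to port 80 with predicate bit 0.
    __global__ void classify_batch(SlicedRois rois, uint8_t *predicates, int npkts)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= npkts)
            return;
        // Adjacent threads read adjacent array elements: coalesced accesses.
        bool match = (rois.proto[i] == 6 /* TCP */) && (rois.dst_port[i] == 80);
        predicates[i] = match ? 0x01 : 0x00;
    }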
Apply the generic principles in Snap: shared locked memory for zero-copy DMA
• Sliced ROIs are stored in CUDA locked memory.

Evaluation
• Basic forwarding I/O improvement
• Example applications:
  • Classification
  • IP routing
  • Pattern matching for IDS
• Latency and packet re-ordering
• Flexibility and modularity

Classifier + routing + pattern matching
[Figure 5.8: performance of Click and Snap with three different applications, including (b) a DPI router and (c) an IDS router; throughput (Gbps) vs. packet size (64 to 1518 bytes) for Click, Snap-CPU, Snap-GPU, and Snap-GPU with slicing.]

[Garbled slide from the dissertation's evaluation: basic forwarding performance of Click and Snap using the improved packet I/O engine, comparing single-path and four-path (multi-threaded) configurations in Gbps and Mpps.]
Evaluation highlights
• Snap-GPU reached the 40 Gb/s line rate at 128-byte packets; the CPU versions managed only about 1/3 to 1/4 of that.
• The added latency is tolerable on a LAN for non-latency-sensitive applications and negligible across a WAN.

Preserving Click's flexibility: a full IP router
[Figure (Kohler et al., Fig. 8): the standard Click IP router configuration, with FromDevice, Classifier, ARPResponder/ARPQuerier, Paint, Strip(14), CheckIPHeader, GetIPAddress(16), LookupIPRoute, DropBroadcasts, CheckPaint, IPGWOptions, FixIPSrc, DecIPTTL, IPFragmenter(1500), ICMPError branches, and ToDevice elements.]
• Add a pattern matching (IDS) element to this router.
• The configuration involves many elements, some of which are duplicated sixteen times.
[Figure 5.11: fully functional IP router + IDS performance; throughput (Gbps) vs. packet size for Click, Snap-CPU, and Snap-GPU with slicing.]
Snap summary
• A generic parallel packet processing framework:
  • the flexibility of Click
  • the parallel processing power of GPUs
• https://github.com/wbsun/snap
• Details not presented here: network I/O, asynchronous scheduling.

Thesis statement
The throughput of system software with parallelizable, computationally expensive tasks can be improved by using GPUs and frameworks with memory-efficient and throughput-oriented designs.

Conclusion
• Generic principles:
  • Batched processing
  • Asynchronous, non-blocking GPU programming
  • Reduce PCIe data transfer with compacted workloads
  • All use the same locked memory for zero-copy DMA
• Concrete frameworks:
  • GPUstore, with high-throughput storage applications
  • Snap, with high-throughput network packet processing

Thanks! Q&A

Backup slides

ROI for memory access coalescing
[Figure: how per-packet ROIs are laid out for coalesced GPU memory access.]

How does ROI slicing work?
• To Click insiders:
  • The Batcher accepts ROI requests from BElements.
  • The Batcher merges the requested ROIs into the result ROIs.
  • Each BElement asks for its ROIs' offsets.
  • GPU kernels invoked by BElements use variables for those offsets.

How does predicated execution work?
• To Click insiders:
  • Manually assigned: where and what.

Path-encoded predicate
[Figure: a tree of GPUElements 1 through 10 whose edges carry path-encoded predicate codes such as 0, 1, (0,0), (0,1), (1,0), (1,1), (0,0,1), and (1,0,1); a packet's predicate encodes the path it should take through the GPU elements.]
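A sketch (my interpretation of the backup slide, not Snap's implementation) of how a path through the GPU element graph could be encoded as predicate bits and tested cheaply inside a kernel: each branching element appends one bit, and a downstream element only acts on packets whose recorded path starts with its own prefix.

    // CUDA device code (.cu); depths are assumed small (< 32 branch levels).
    #include <stdint.h>

    struct PathPredicate {
        uint32_t code;   // branch bits taken so far, LSB = first branch
        uint32_t depth;  // number of branch bits that are valid
    };

    // Set on the GPU by a classifier element when it picks branch `bit`.
    __device__ void take_branch(PathPredicate *p, uint32_t bit)
    {
        p->code |= (bit & 1u) << p->depth;
        p->depth += 1;
    }

    // Test in a downstream element's kernel: run the element body only for
    // packets whose recorded path matches this element's prefix.
    __device__ bool on_my_path(const PathPredicate *p,
                               uint32_t my_code, uint32_t my_depth)
    {
        uint32_t mask = (1u << my_depth) - 1u;
        return p->depth >= my_depth && (p->code & mask) == (my_code & mask);
    }

For example, a third-level element owning the hypothetical prefix (0, 0, 1) would call on_my_path(&pkt_path, 0b100, 3) at the top of its kernel and return early for packets on other paths.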