Chris Rossbach, Jon Currey, Microsoft Research
Mark Silberstein, Technion
Baishakhi Ray, Emmett Witchel, UT Austin
SOSP October 25, 2011

There are lots of GPUs
◦ 3 of top 5 supercomputers use GPUs
◦ In all new PCs, smart phones, tablets
◦ Great for gaming and HPC/batch
◦ Unusable in other application domains
GPU programming challenges
◦ GPU+main memory disjoint
◦ Treated as I/O device by OS
These two things are related: we need OS abstractions
PTask SOSP 2011

The case for OS support
PTask: Dataflow for GPUs
Evaluation
Related Work
Conclusion

[Diagram layers: programmer-visible interface / OS-level abstractions / hardware interface]
1:1 correspondence between OS-level and user-level abstractions

[Diagram: programmer-visible interface (GPGPU APIs, shaders/kernels, language integration, DirectX/CUDA/OpenCL runtime)]
1 OS-level abstraction!
1. No kernel-facing API
2. No OS resource-management
3. Poor composability

[Chart: GPU benchmark throughput, no CPU load vs. high CPU load; higher is better]
CPU scheduler and GPU scheduler not integrated!
• Image-convolution in CUDA
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• nVidia GeForce GT230

OS cannot prioritize cursor updates
[Chart: flatter lines are better]
WDDM + DWM + CUDA == dysfunction
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• nVidia GeForce GT230

[Diagram: pipeline capture, xform, filter, detect]
◦ capture: capture camera images
◦ xform: geometric transformation
◦ filter: noise filtering
◦ detect: detect gestures
Data: raw images → noisy point cloud → “hand” events
High data rates
Data-parallel algorithms … good fit for GPU
NOT Kinect: this is a harder problem!

#> capture | xform | filter | detect &
[Diagram: each stage labeled CPU or GPU]
Modular design
◦ flexibility, reuse
Utilize heterogeneous hardware
◦ Data-parallel components → GPU
◦ Sequential components → CPU
Using OS provided tools
◦ processes, pipes

GPUs cannot run OS: different ISA
Disjoint memory space, no coherence
Host CPU must manage GPU execution
◦ Program inputs explicitly transferred/bound at runtime
◦ Device buffers pre-allocated
User-mode apps must implement
[Diagram: CPU with main memory, GPU with GPU memory; steps: copy inputs, send commands, copy outputs]

#> capture | xform | filter | detect &
[Diagram: data path through the user processes (capture, xform, filter, detect), the OS executive, the drivers (camdrv, HIDdrv, GPU driver), and the GPU. Stages use read()/write(); the path includes two copy-to-GPU steps, two copy-from-GPU steps, four PCI transfers, an IRP, and the GPU run]

GPU Analogues for:
◦ Process API
◦ IPC API
◦ Scheduler hints
Abstractions that enable:
◦ Fairness/isolation
◦ OS use of GPU
◦ Composition/data movement optimization

ptask (parallel task)
◦ Has priority for fairness
◦ Analogous to a process for GPU execution
◦ List of input/output resources (e.g.
stdin, stdout…)
ports
◦ Can be mapped to ptask input/outputs
◦ A data source or sink
channels
◦ Similar to pipes, connect arbitrary ports
◦ Specialize to eliminate double-buffering
graph
◦ DAG: connected ptasks, ports, channels
datablocks
◦ Memory-space transparent buffers
• OS objects → OS RM possible
• data: specify where, not how

#> capture | xform | filter | detect &
[Diagram: ptask graph with nodes capture, xform, filter, detect; ports rawimg, cloud, f-in, f-out; memory labels mapped mem, GPU mem. Legend: process (CPU), ptask (GPU), port, channel, ptask graph, datablock]
Optimized data movement
Data arrival triggers computation

Graphs scheduled dynamically
◦ ptasks queue for dispatch when inputs ready
Queue: dynamic priority order
◦ ptask priority user-settable
◦ ptask prio normalized to OS prio
Transparently support multiple GPUs
◦ Schedule ptasks for input locality

Datablock
space   V   M   RW   data
main    1   1   11   Main Memory
gpu0    0   1   10   GPU 0 Memory
gpu1    1   1   10   GPU 1 Memory
…
Logical buffer
◦ backed by multiple physical buffers
◦ buffers created/updated lazily
◦ mem-mapping used to share across process boundaries
Track buffer validity per memory space
◦ writes invalidate other views
Flags for access control/data placement

#> capture | xform | filter …
[Diagram: a datablock in the ptask graph (capture, xform, filter; ports rawimg, cloud, f-in) with per-space V/M/RW entries for main and gpu, backed by Main Memory and GPU Memory; the entry values are overlaid in the original animation. Legend: process, ptask, port, channel, datablock]

[Diagram labels: port, datablock]
• 1-1 correspondence between programmer and OS abstractions
• GPU APIs can be built on top of new OS abstractions

Windows 7
◦ Full PTask API implementation
◦ Stacked UMDF/KMDF driver
  • Kernel component: mem-mapping, signaling
  • User component: wraps DirectX, CUDA, OpenCL
◦ syscalls → DeviceIoControl() calls
Linux 2.6.33.2
◦ Changed OS scheduling to manage GPU
  • GPU accounting added to
task_struct

Windows 7, Core2-Quad, GTX580 (EVGA)
Implementations
◦ pipes: capture | xform | filter | detect
◦ modular: capture+xform+filter+detect, 1 process
◦ handcode: data movement optimized, 1 process
◦ ptask: ptask graph
Configurations
◦ real-time: driven by cameras
◦ unconstrained: driven by in-memory playback

[Chart: runtime, user, and sys relative to handcode, for handcode, modular, pipes, and ptask; lower is better]
compared to hand-code
• 11.6% higher throughput
• lower CPU util: no driver program
compared to pipes
• ~2.7x less CPU usage
• 16x higher throughput
• ~45% less memory usage
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• GTX580 (EVGA)

[Chart: PTask invocations/second vs. PTask priority (2, 4, 6, 8); legend: fifo, priority, ptask; higher is better]
PTask provides throughput proportional to priority
• FIFO – queue invocations in arrival order
• ptask – aged priority queue with OS priority
• graphs: 6x6 matrix multiply
• priority same for every PTask node
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• GTX580 (EVGA)

[Chart: speedup over 1 GPU for priority and data-aware scheduling; higher is better]
• Synthetic graphs: varying depths
• Data-aware == priority + locality
• Graph depth > 1 req. for any benefit
Data-aware provides best throughput, preserves priority
• Windows 7 x64 8GB RAM
• Intel Core 2 Quad 2.66GHz
• 2 x GTX580 (EVGA)

[Diagram: software stack. user-prgs: R/W bnc, cuda-1, cuda-2; user-libs: EncFS, FUSE, libc; OS: PTask, Linux 2.6.33; HW: SSD1, SSD2, GPU]
Simple GPU usage accounting
• Restores performance

        GPU/CPU   cuda-1 Linux   cuda-2 Linux   cuda-1 PTask   cuda-2 PTask
Read    1.17x     -10.3x         -30.8x         1.16x          1.16x
Write   1.28x     -4.6x          -10.3x         1.21x          1.20x

• EncFS: nice -20
• cuda-*: nice +19
• AES: XTS chaining
• SATA SSD, RAID
• seq.
R/W 200 MB

OS support for heterogeneous platforms:
◦ Helios [Nightingale 09], Barrelfish [Baumann 09], Offcodes [Weinsberg 08]
GPU Scheduling
◦ TimeGraph [Kato 11], Pegasus [Gupta 11]
Graph-based programming models
◦ Synthesis [Massalin 89]
◦ Monsoon/Id [Arvind]
◦ Dryad [Isard 07]
◦ StreamIt [Thies 02]
◦ DirectShow
◦ TCP Offload [Currid 04]
Tasking
◦ Tessellation, Apple GCD, …

OS abstractions for GPUs are critical
◦ Enable fairness & priority
◦ OS can use the GPU
Dataflow: a good fit abstraction
◦ system manages data movement
◦ performance benefits significant

Thank you. Questions?
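
Backup sketch: datablock validity tracking. The datablock slide describes a logical buffer backed by lazily created per-space copies, where writes invalidate other views. The Python below is only a toy model of that idea under stated assumptions: the class name `Datablock`, its `read`/`write` methods, and the `copies` counter are hypothetical and are not the PTask API, and a plain assignment stands in for a PCI transfer.

```python
from dataclasses import dataclass, field


@dataclass
class Datablock:
    """Hypothetical model: one logical buffer, one physical copy per memory space."""
    buffers: dict = field(default_factory=dict)  # space name -> bytes
    valid: dict = field(default_factory=dict)    # space name -> validity flag
    copies: int = 0                              # simulated cross-space transfers

    def write(self, space, data):
        """Write from `space`; every other view becomes stale."""
        self.buffers[space] = bytes(data)
        for name in self.valid:
            self.valid[name] = False
        self.valid[space] = True

    def read(self, space):
        """Return the data as seen from `space`, copying only if that view is stale."""
        if not self.valid.get(space, False):
            source = next((s for s, ok in self.valid.items() if ok), None)
            if source is None:
                raise ValueError("datablock has no valid copy")
            # Assumption: this assignment stands in for a real transfer.
            self.buffers[space] = self.buffers[source]
            self.valid[space] = True
            self.copies += 1
        return self.buffers[space]


block = Datablock()
block.write("main", b"rawimg")   # producer writes in main memory
block.read("gpu0")               # first GPU consumer: one transfer
block.read("gpu0")               # second GPU consumer: view already valid
block.write("gpu0", b"cloud")    # GPU write invalidates the main-memory view
result = block.read("main")      # CPU consumer: one transfer back
print(block.copies, result)
```

In this toy run the data crosses memory spaces twice, because the second GPU read finds a valid view and skips the copy. The pipe-based diagram earlier in the deck shows four PCI transfers; the sketch makes no claim about real transfer counts or performance.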