Chris Rossbach, Microsoft Research
Emmett Witchel, University of Texas at Austin
September 23, 2010

GPU application domains are limited. CUDA helps:
◦ Rich APIs/abstractions
◦ Language integration, familiar environment
Composition/integration into systems remains hard:
◦ The programmer-visible abstractions are expressive
◦ Their realization/implementations are unsatisfactory
Better OS-level abstractions are required
◦ (IP/business concerns aside)

How do I program just a CPU and a disk? The programmer gets to work with great abstractions:

    int main(int argc, char* argv[]) {
        FILE* fp = fopen("quack", "w");
        if (fp == NULL)
            fprintf(stderr, "failure\n");
        /* ... */
        return 0;
    }

One OS-level abstraction (the FILE) does the work here: the programmer-visible interface sits on OS-level abstractions, which sit on the hardware interface.

Why is this a problem? GPUs have been doing fine without OS support:
◦ Gaming/graphics: shader languages, DirectX, OpenGL
◦ "GPU computing," nee "GPGPU": user-mode/batch, scientific algorithms, latency-tolerant, CUDA
But the application ecosystem is more diverse: gestural interfaces, brain-computer interfaces, spatial audio, image recognition…
◦ No OS abstractions -> no traction

Processing user input:
• needs low latency and concurrency
• must be multiplexed by the OS
• needs protection/isolation

A gestural interface pipeline: raw images -> image filtering -> geometric transform -> point cloud -> gesture recognition -> "hand" events -> HID input -> OS. High data rates, noisy input (ack! noise!), data-parallel algorithms.

#> catusb | xform | detect | hidinput &

◦ catusb: captures image data from USB (inherently sequential)
◦ xform: noise filtering, geometric transformation (data parallel)
◦ detect: extracts gestures from the point cloud
◦ hidinput: sends mouse events (or whatever)
We could parallelize this on a CMP, but…

Better: run catusb on the CPU, xform on the GPU, detect on the GPU, hidinput on the CPU, and use CUDA to write xform and detect!

GPUs cannot run an OS: different ISA, disjoint memory space, no coherence*. The host CPU must manage execution:
◦ Program inputs are explicitly bound at runtime
User-mode apps must implement the data movement themselves: copy inputs from main memory to GPU memory, send commands to the GPU, copy outputs back. For our pipeline that means:
• 12 kernel crossings
• 6 copy_to_user
• 6 copy_from_user
• performance tradeoffs for the runtime/abstractions

[Diagram: catusb, xform, detect, and hidinput run in user mode on the CUDA runtime and user-mode drivers (DXVA); below them sit the OS executive, kernel-mode drivers, and HAL in kernel mode, driving the USB controller, CPU, and GPU.]

Alternatively, run the GPU kernel from inside the OS kernel itself, directly on the kernel-mode drivers and HAL:
• No CUDA, no high-level abstractions
• If you're MS and/or nVidia, this might be tenable…
• The solution is specialized, and there is still a data migration problem…

We'd prefer:
• catusb: USB bus -> GPU memory directly (no cache pollution, no wasted FSB bandwidth, no wasted power)
• xform, detect: no transfers at all
• hidinput: a single GPU -> main memory transfer
• and better still if GPUs become coherent with main memory…

[Diagram: CPU on the FSB, DDR2/3 DIMMs on the Northbridge, GPU on PCI-e, Southbridge on DMI with USB 2.0. Today catusb and xform shuttle the data USB -> main memory -> GPU memory and back.]

The machine can do this, so where are the interfaces?
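To make the cost concrete, here is a minimal sketch of what xform's host-side work looks like against the stock CUDA interface. The kernel name xform_kernel, the frame dimensions, and the buffer names are invented for illustration; the copy-in/launch/copy-out structure is the point.

    #include <cuda_runtime.h>

    /* Invented kernel standing in for xform's filtering + geometry stages. */
    __global__ void xform_kernel(float* img) { /* noise filter, transform */ }

    #define FRAME_FLOATS (640 * 480)

    /* One pipeline iteration: the app itself must shepherd every frame
     * across the user/kernel boundary and the PCI-e bus, twice. */
    int process_frame(const float* raw, float* out) {
        float* d_img = NULL;
        size_t bytes = FRAME_FLOATS * sizeof(float);
        if (cudaMalloc((void**)&d_img, bytes) != cudaSuccess)
            return -1;

        cudaMemcpy(d_img, raw, bytes, cudaMemcpyHostToDevice); /* copy in  */
        xform_kernel<<<640, 480>>>(d_img);                     /* launch   */
        cudaMemcpy(out, d_img, bytes, cudaMemcpyDeviceToHost); /* copy out */

        cudaFree(d_img);
        return 0;
    }

Each of these calls crosses into the kernel-mode driver, and the frame makes two PCI-e trips per stage, even though the next stage (detect) wants the data exactly where it already was: in GPU memory.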
Outline:
Motivation
Problems with the lack of OS abstractions
Can CUDA solve these problems?
◦ CUDA streams
◦ Asynchrony
◦ GPUDirect™
◦ OS guarantees
New OS abstractions for GPUs
Related work
Conclusion

CUDA already offers mechanisms that attack data migration:
◦ "Write-combining memory" (uncacheable)
◦ Page-locked host memory (faster DMA)
◦ Portable memory (share page-locked memory between contexts)
◦ Mapped memory (map host memory into GPU space)
◦ CUDA streams, asynchrony (overlap capture/transfer/execution)
◦ GPUDirect™ (transparent transfers, app-level upcalls)
[These annotate the same CPU/FSB/Northbridge/PCI-e/GPU/Southbridge/USB diagram as before.]

Overlap communication with computation. Stream X performs Copy X0, Kernel Xa, Kernel Xb, Copy X1; stream Y performs Copy Y0, Kernel Y, Copy Y1. Each stream proceeds serially, but different streams overlap, so ideally:

    Copy engine:    Copy X0, Copy Y0, Copy X1, Copy Y1
    Compute engine: Kernel Xa, Kernel Xb, Kernel Y

Naive programming eliminates the potential concurrency. Issuing each stream's work depth-first:

    cudaMemcpyAsync(X0…);
    KernelXa<<<…>>>();
    KernelXb<<<…>>>();
    cudaMemcpyAsync(X1…);
    cudaMemcpyAsync(Y0…);
    KernelY<<<…>>>();
    cudaMemcpyAsync(Y1…);

queues Copy X1 on the copy engine ahead of Copy Y0; Copy X1 cannot start until Kernel Xb finishes, so stream Y is serialized behind stream X:

    Copy engine:    Copy X0, Copy X1, Copy Y0, Copy Y1
    Compute engine: Kernel Xa, Kernel Xb, Kernel Y

Reordering the issue (ping-pong between streams) restores the overlap:

    cudaMemcpyAsync(X0…);
    KernelXa<<<…>>>();
    KernelXb<<<…>>>();
    cudaMemcpyAsync(Y0…);
    KernelY<<<…>>>();
    cudaMemcpyAsync(X1…);
    cudaMemcpyAsync(Y1…);

But our design can't use this anyway:
• In … xform | detect …, the CUDA streams live in xform and detect: different processes, different address spaces
• The issue order is sensitive, and applications must determine it statically
• Getting it right would require additional IPC coordination
• Couldn't a scheduler with a global view do a better job dynamically?
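The ordering hazard is easier to see in complete code. Below is a sketch under stated assumptions: a single copy engine (true of GT230-era hardware), two streams, and invented names (kernel_xa, kernel_xb, kernel_y, the buffer arguments, n). The host buffers are assumed page-locked (cudaHostAlloc), since cudaMemcpyAsync only overlaps with pinned memory.

    #include <cuda_runtime.h>

    __global__ void kernel_xa(float* p) { /* first stage of X's work  */ }
    __global__ void kernel_xb(float* p) { /* second stage of X's work */ }
    __global__ void kernel_y (float* p) { /* independent work in Y    */ }

    /* hx, hy are assumed page-locked so the copies are truly asynchronous */
    void issue_depth_first(float* hx, float* hy, float* dx, float* dy, size_t n) {
        cudaStream_t sx, sy;
        cudaStreamCreate(&sx);
        cudaStreamCreate(&sy);

        /* Depth-first issue: Copy X1 reaches the single copy engine before
         * Copy Y0, but cannot start until kernel_xb finishes, so stream Y's
         * input copy (and hence kernel_y) waits behind all of stream X. */
        cudaMemcpyAsync(dx, hx, n, cudaMemcpyHostToDevice, sx); /* Copy X0   */
        kernel_xa<<<1, 256, 0, sx>>>(dx);                       /* Kernel Xa */
        kernel_xb<<<1, 256, 0, sx>>>(dx);                       /* Kernel Xb */
        cudaMemcpyAsync(hx, dx, n, cudaMemcpyDeviceToHost, sx); /* Copy X1   */
        cudaMemcpyAsync(dy, hy, n, cudaMemcpyHostToDevice, sy); /* Copy Y0   */
        kernel_y<<<1, 256, 0, sy>>>(dy);                        /* Kernel Y  */
        cudaMemcpyAsync(hy, dy, n, cudaMemcpyDeviceToHost, sy); /* Copy Y1   */

        /* The fix is purely an issue-order change: move Copy Y0 and Kernel Y
         * up before Copy X1. That only works if one piece of code sees both
         * streams; here xform and detect are separate processes. */
        cudaStreamSynchronize(sx);
        cudaStreamSynchronize(sy);
        cudaStreamDestroy(sx);
        cudaStreamDestroy(sy);
    }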
[Chart: xform performance (higher is better) for H->D (host-to-device only), H<-D (device-to-host only), and H<->D (duplex) communication, comparing an OS-supported ptask-analogue against CUDA+streams, CUDA+async ping-pong, CUDA+async, and plain CUDA; the ptask-analogue wins. Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, nVidia GeForce GT230.]

GPUDirect™ "allows 3rd-party devices to access CUDA memory" and eliminates a data copy. Great! But:
• it requires per-driver support
• and not just CUDA support!
• there is no programmer-visible interface
• the OS could generalize this

Traditional OS guarantees: fairness and isolation. No user-space runtime can provide these! It can support them; it cannot guarantee them.

[Chart: impact of CPU saturation on xform performance, normal load vs. loaded, for H->D, H<-D, and H<->D communication; flatter lines are better. Throughput collapses under load: the CPU scheduler and the GPU scheduler are not integrated. Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, nVidia GeForce GT230.]

What should the OS provide?
◦ Process API analogues
◦ IPC API analogues
◦ Scheduler hint analogues
It must integrate with existing interfaces:
◦ CUDA/DXGI/DirectX
◦ DRI/DRM/OpenGL

Outline: Motivation; Problems with the lack of OS abstractions; Can CUDA solve these problems?; -> New OS abstractions for GPUs; Related work; Conclusion

ptask:
◦ Like a process or thread, but it can exist without a user host process
◦ An OS abstraction… not a full CPU process
◦ Carries a list of mappable input/output resources

endpoint:
◦ A globally named kernel object
◦ Can be mapped to ptask input/output resources
◦ A data source or sink (e.g. a buffer in GPU memory)

channel:
◦ Similar to a pipe
◦ Connects arbitrary endpoints
◦ 1:1, 1:M, M:1, N:M
◦ A generalization of the GPUDirect™ mechanism

Expand the system call interface with:
• process API analogues
• IPC API analogues
• scheduler hints
There is a 1:1 correspondence between programmer and OS abstractions, and existing APIs can be built on top of the new OS abstractions.

The computation is expressed as a graph of processes, ptasks, endpoints, and channels, in the spirit of Synthesis [Massalin 89] (streams, pumps), Dryad [Isard 07], StreamIt [Thies 02], Offcodes [Weinsberg 08], and others:

    process catusb -> endpoint usbsrc => endpoint rawimg -> ptask xform
      -> endpoint cloud => endpoint g_input -> ptask detect
      -> endpoint hands => endpoint hid_in -> process hidinput
    (=> denotes a channel)

The graph eliminates unnecessary communication: usbsrc -> rawimg is the only USB -> GPU memory transfer, cloud -> g_input stays entirely in GPU memory, and hands -> hid_in is the single GPU -> main memory transfer.

New data arriving on a channel triggers new computation:
• eliminates unnecessary communication
• eliminates user/kernel crossings; computation starts without application involvement

[Chart: xform performance (segmentation + geometry, higher is better) for H->D, H<-D, and H<->D communication, ptask-analogue vs. naive CUDA, with gains of 10x and 3.9x. Windows 7 x64, 8 GB RAM, Intel Core 2 Quad 2.66 GHz, nVidia GeForce GT230.]

Outline: Motivation; Problems with the lack of OS abstractions; Can CUDA solve these problems?; New OS abstractions for GPUs; -> Related work; Conclusion

OS support for heterogeneous architectures:
◦ Helios [Nightingale 09]
◦ Barrelfish [Baumann 09]
◦ Offcodes [Weinsberg 08]
Graph-based programming models:
◦ Synthesis [Massalin 89]
◦ Monsoon/Id [Arvind]
◦ Dryad [Isard 07]
◦ StreamIt [Thies 02]
◦ DirectShow
◦ TCP offload [Currid 04]
GPU computing:
◦ CUDA, OpenCL

Conclusion: CUDA gets the programming interface right, but the OS must get involved:
◦ current interfaces waste data movement
◦ current interfaces inhibit modularity and reuse
◦ user-level runtimes cannot guarantee fairness or isolation
◦ OS-level abstractions are required

Questions?
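Backup: for concreteness, a hypothetical sketch of how an application might assemble the gesture pipeline against the proposed abstractions. Every identifier below (ptask_t, endpoint_t, ptask_create, ptask_endpoint, endpoint_open, channel_connect, PTASK_IN/PTASK_OUT) is invented for illustration; the talk proposes the ptask/endpoint/channel abstractions, not this particular API.

    /* Hypothetical declarations, invented for this sketch. */
    typedef struct ptask*    ptask_t;
    typedef struct endpoint* endpoint_t;
    enum ptask_dir { PTASK_IN, PTASK_OUT };

    ptask_t    ptask_create(const char* gpu_program);           /* process API analogue */
    endpoint_t ptask_endpoint(ptask_t t, enum ptask_dir d, int idx);
    endpoint_t endpoint_open(const char* global_name);          /* globally named object */
    int        channel_connect(endpoint_t src, endpoint_t dst); /* IPC API analogue */

    int build_pipeline(void) {
        /* ptasks: GPU computations the OS schedules; no host process needed */
        ptask_t xform  = ptask_create("xform.gpu");
        ptask_t detect = ptask_create("detect.gpu");

        /* endpoints: named data sources/sinks mapped to ptask inputs/outputs */
        endpoint_t usbsrc  = endpoint_open("usbsrc");      /* written by catusb */
        endpoint_t rawimg  = ptask_endpoint(xform,  PTASK_IN,  0);
        endpoint_t cloud   = ptask_endpoint(xform,  PTASK_OUT, 0);
        endpoint_t g_input = ptask_endpoint(detect, PTASK_IN,  0);
        endpoint_t hands   = ptask_endpoint(detect, PTASK_OUT, 0);
        endpoint_t hid_in  = endpoint_open("hid_in");      /* read by hidinput */

        /* channels: pipe-like links; new data on a channel triggers the
         * downstream ptask, and a GPU-to-GPU channel never touches host memory */
        channel_connect(usbsrc, rawimg);
        channel_connect(cloud,  g_input);
        channel_connect(hands,  hid_in);
        return 0;
    }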