TxLinux: Transactional Memory in an Operating System

Chris Rossbach, Jon Currey, Microsoft Research Mark Silberstein, Technion Baishakhi Ray, Emmett Witchel, UT Austin SOSP October 25, 2011  There are lots of GPUs ◦ ◦ ◦ ◦  3 of top 5 supercomputers use GPUs In all new PCs, smart phones, tablets Great for gaming and HPC/batch Unusable in other application domains GPU programming challenges ◦ GPU+main memory disjoint ◦ Treated as I/O device by OS PTask SOSP 2011 2  There are lots of GPUs ◦ ◦ ◦ ◦  3 of top 5 supercomputers use GPUs In all new PCs, smart phones tablets These two things are related: Great for gaming and HPC/batch We need OS abstractions Unusable in other application domains GPU programing challenges ◦ GPU+main memory disjoint ◦ Treated as I/O device by OS PTask SOSP 2011 3      The case for OS support PTask: Dataflow for GPUs Evaluation Related Work Conclusion PTask SOSP 2011 4 programmervisible interface OS-level abstractions Hardware interface 1:1 correspondence between OS-level and user-level abstractions PTask SOSP 2011 5 programmervisible interface GPGPU APIs Shaders/ Kernels Language Integration DirectX/CUDA/OpenCL Runtime 1 OS-level abstraction! 1. 2. 3. No kernel-facing API No OS resource-management Poor composability PTask SOSP 2011 6 GPU benchmark throughput 1200 1000 800 600 400 200 0 Higher is better no CPU load CPU scheduler and GPU not integrated! high CPU load • Image-convolution in CUDA scheduler • Windows 7 x64 8GB RAM • Intel Core 2 Quad 2.66GHz • nVidia GeForce GT230 PTask SOSP 2011 7 OS cannot prioritize cursor updates • Flatter lines Are better WDDM + DWM + CUDA == dysfunction • Windows 7 x64 8GB RAM • Intel Core 2 Quad 2.66GHz • nVidia GeForce GT230 PTask SOSP 2011 8 Raw images “Hand” events detect capture capture camera images noisy point cloud xform detect gestures filter geometric transformation   noise filtering High data rates Data-parallel algorithms … good fit for GPU NOT Kinect: this is a harder problem! PTask SOSP 2011 9 #> capture | xform | filter | detect & CPU    GPU CPU GPU Modular design  flexibility, reuse Utilize heterogeneous hardware  Data-parallel components  GPU  Sequential components  CPU Using OS provided tools  processes, pipes PTask SOSP 2011 10    GPUs cannot run OS: different ISA Disjoint memory space, no coherence Host CPU must manage GPU execution ◦ Program inputs explicitly transferred/bound at runtime ◦ Device buffers pre-allocated User-mode apps must implement Main memory Copy inputs CPU Send commands Copy outputs GPU memory GPU PTask SOSP 2011 11 #> capture | xform | filter | detect & xform capture read() write() read() copy to GPU camdrv filter write() read() OS copy from GPU executive copy to GPU detect write() copy from GPU GPU driver PCI-xfer PCI-xfer PCI-xfer read() IRP HIDdrv PCI-xfer GPU Run! PTask SOSP 2011 12  GPU Analogues for: ◦ Process API ◦ IPC API ◦ Scheduler hints  Abstractions that enable: ◦ Fairness/isolation ◦ OS use of GPU ◦ Composition/data movement optimization PTask SOSP 2011 13      The case for OS support PTask: Dataflow for GPUs Evaluation Related Work Conclusion PTask SOSP 2011 14  ptask (parallel task) ◦ Has priority for fairness ◦ Analogous to a process for GPU execution ◦ List of input/output resources (e.g. stdin, stdout…)  ports ◦ Can be mapped to ptask input/outputs ◦ A data source or sink  channels ◦ Similar to pipes, connect arbitrary ports • OS objectsOS RM possible ◦ Specialize to eliminate double-buffering • data: specify where, not how  graph ◦ DAG: connected ptasks, ports, channels  datablocks ◦ Memory-space transparent buffers PTask SOSP 2011 15 #> capture | xform | filter | detect & mapped mem filter f-out xform f-in capture cloud rawimg ptask graph detect GPU mem GPU mem process (CPU) ptask (GPU) port channel Optimized data movement Data arrival triggers computation ptask graph datablock PTask SOSP 2011 16  Graphs scheduled dynamically ◦ ptasks queue for dispatch when inputs ready  Queue: dynamic priority order ◦ ptask priority user-settable ◦ ptask prio normalized to OS prio  Transparently support multiple GPUs ◦ Schedule ptasks for input locality PTask SOSP 2011 17 Datablock V main 1 gpu0 0 gpu1 1 space  M 1 1 1 RW 11 10 10 data Main Memory GPU 0 Memory GPU 1 Memory … Logical buffer ◦ backed by multiple physical buffers ◦ buffers created/updated lazily ◦ mem-mapping used to share across process boundaries  Track buffer validity per memory space ◦ writes invalidate other views  Flags for access control/data placement PTask SOSP 2011 18 Main Memory Datablock V M RW 0 1 01 0 0 1 main 1 0 1 0 1 01 0 gpu 1 space f-in xform cloud capture rawimg #> capture | xform | filter … filter GPU Memory data PTask SOSP 2011 … process ptask port channel datablock 19 port datablock port • 1-1 correspondence between programmer and OS abstractions • GPU APIs can be built on top of new OS abstractions PTask SOSP 2011 20      The case for OS support PTask: Dataflow for GPUs Evaluation Related Work Conclusion PTask SOSP 2011 21  Windows 7 ◦ Full PTask API implementation ◦ Stacked UMDF/KMDF driver  Kernel component: mem-mapping, signaling  User component: wraps DirectX, CUDA, OpenCL ◦ syscalls  DeviceIoControl() calls  Linux 2.6.33.2 ◦ Changed OS scheduling to manage GPU  GPU accounting added to task_struct PTask SOSP 2011 22   Windows 7, Core2-Quad, GTX580 (EVGA) Implementations ◦ ◦ ◦ ◦  pipes: capture | xform | filter | detect modular: capture+xform+filter+detect, 1process handcode: data movement optimized, 1process ptask: ptask graph Configurations ◦ real-time: driven by cameras ◦ unconstrained: driven by in-memory playback PTask SOSP 2011 23 relative to handcode 3.5 lower is better 3 2.5 handcode 2 modular 1.5 pipes 1 ptask 0.5 0 runtime compared to hand-code • pipes 11.6% higher throughput compared to user • sys lower CPU util: no driver • ~2.7x less CPU usage program • 16x higher throughput • Windows 7 x64 8GB RAM • ~45% less memory • Intelusage Core 2 Quad 2.66GHz • GTX580 (EVGA) PTask SOSP 2011 24 PTask invocations/second 1600 1400 1200 1000 800 fifo 600 priority ptask 400 200 0 Higher is better 2 4 6 PTask provides throughput 8 proportional to priority PTask priority • FIFO – queue invocations in arrival order • ptask – aged priority queue w OS priority • graphs: 6x6 matrix multiply • priority same for every PTask node • Windows 7 x64 8GB RAM • Intel Core 2 Quad 2.66GHz • GTX580 (EVGA) PTask SOSP 2011 25 Speedup over 1 GPU 2 1.5 1 0.5 • Synthetic graphs: Varying depths priority data-aware 0 Higher is better • Data-aware == priority + locality • Graph depth > 1 req. for any benefit Data-aware provides best throughput, preserves priority • Windows 7 x64 8GB RAM • Intel Core 2 Quad 2.66GHz • 2 x GTX580 (EVGA) PTask SOSP 2011 26 user-prgs R/W bnc cuda-1 cuda-2 user-libs EncFS FUSE libc PTask OS HW SSD1 GPU/ CPU … Linux 2.6.33 SSD2 GPU Simple GPU usage accounting • Restores performance cuda-1 Linux cuda-2 Linux cuda-1 PTask cuda-2 Ptask Read 1.17x -10.3x -30.8x 1.16x 1.16x Write 1.28x -4.6x -10.3x 1.21x 1.20x PTask SOSP 2011 27 • EncFS: nice -20 • cuda-*: nice +19 • AES: XTS chaining • SATA SSD, RAID • seq. R/W 200 MB      The case for OS support PTask: Dataflow for GPUs Evaluation Related Work Conclusion PTask SOSP 2011 28  OS support for heterogeneous platforms: ◦ Helios  [Nightingale 09], BarrelFish [Baumann 09] ,Offcodes [Weinsberg 08] GPU Scheduling ◦ TimeGraph  Pegasus [Gupta 11] Graph-based programming models ◦ ◦ ◦ ◦ ◦ ◦  [Kato 11], Synthesis [Masselin 89] Monsoon/Id [Arvind] Dryad [Isard 07] StreamIt [Thies 02] DirectShow TCP Offload [Currid 04] Tasking ◦ Tessellation, Apple GCD, … PTask SOSP 2011 29  OS abstractions for GPUs are critical ◦ Enable fairness & priority ◦ OS can use the GPU  Dataflow: a good fit abstraction ◦ system manages data movement ◦ performance benefits significant Thank you. Questions? PTask SOSP 2011 30

TxLinux: Transactional Memory in an Operating System

Related documents

Products

Support

TxLinux: Transactional Memory in an Operating System

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib