Last time: Runtime infrastructure for hybrid (GPU-based) platforms
- Task scheduling
- Extracting performance models at runtime
- Memory management: Asymmetric Distributed Shared Memory

- StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, Cédric Augonnet, Samuel Thibault, and Raymond Namyst. TR-7240, INRIA, March 2010. [link]
- An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems, Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro, Wen-mei Hwu, ASPLOS'10 [pdf]

Today: Bridging runtime and language support ('Virtualizing GPUs')
- Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, Jungwon Kim, Honggyu Kim, Joo Hwan Lee, Jaejin Lee, PPoPP'11 [pdf]
- Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework, Vignesh T. Ravi et al., HPDC 2011 (best paper!)

Context: clouds are shifting to support HPC applications
- Initially, tightly coupled applications were not well suited to cloud platforms
- Today: a Chinese cloud offers 40 Gbps InfiniBand; Amazon has HPC instances
- GPU instances: Amazon, Nimbix

Challenge: make GPUs a shared resource in the cloud.

Why do this?
- GPUs are costly resources
- Multiple VMs on a node can share a single GPU
- Increase utilization
  - App level: some apps might not use GPUs much
  - Kernel level: some kernels can be collocated

Two streams:
1. How?
2. Evaluate: opportunities, gains, overheads
1. The 'How?' Preamble:
- Concurrent kernels are supported by today's GPUs
- Each kernel can execute a different task
- Tasks can be mapped to different streaming multiprocessors (SMs), using the thread-block configuration
- Problem: concurrent execution is limited to the set of kernels invoked within a single process context

Past virtualization solutions:
- API rerouting / intercept library

1. The 'How?' Architecture

2. Evaluation: the opportunity
- Key assumption: under-utilization of GPUs
- Sharing:
  - Space-sharing: kernels occupy different SMs
  - Time-sharing: kernels time-share the same SMs (benefiting from hardware support for context switches)
  - Note: resource conflicts may prevent this
- Molding: change the kernel configuration (a different number of thread blocks / threads per block) to improve collocation

2. Evaluation: the gains

2. Evaluation: the overheads

Discussion:
- Limitations
- Hardware support
- OpenCL vs. CUDA

http://ft.ornl.gov/doku/shoc/level1
http://ft.ornl.gov/pubs-archive/shoc.pdf