Last time: Runtime infrastructure for hybrid (GPU-based) platforms
- Task scheduling
- Extracting performance models at runtime
- Memory management: Asymmetric Distributed Shared Memory

- StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, Cédric Augonnet, Samuel Thibault, and Raymond Namyst. TR-7240, INRIA, March 2010. [link]
- An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems, Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro, Wen-mei Hwu, ASPLOS'10 [pdf]

Today: Bridging runtime and language support ('Virtualizing GPUs')
- Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, Jungwon Kim, Honggyu Kim, Joo Hwan Lee, Jaejin Lee, PPoPP'11 [pdf]
- Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework, Vignesh T. Ravi et al., HPDC 2011 (best paper!)

Context: clouds are shifting to support HPC applications
- Initially, tightly coupled applications were not well suited to cloud platforms
- Today: a Chinese cloud offers 40 Gbps InfiniBand; Amazon has HPC instances
- GPU instances: Amazon, Nimbix

Challenge: make GPUs a shared resource in the cloud.

Why do this?
- GPUs are costly resources
- Multiple VMs on a node can share a single GPU
- Increase utilization
  - App level: some apps might not use GPUs much
  - Kernel level: some kernels can be collocated

Two streams:
1. How?
2. Evaluate: opportunities, gains, overheads
1. The 'How?' Preamble:
- Concurrent kernels are supported by today's GPUs
- Each kernel can execute a different task
- Tasks can be mapped to different streaming multiprocessors (SMs), using the thread-block configuration
- Problem: concurrent execution is limited to the set of kernels invoked within a single process context

Past virtualization solutions:
- API rerouting / intercept library

1. The 'How?' Architecture

2. Evaluation: the opportunity
- Key assumption: under-utilization of GPUs
- Sharing:
  - Space-sharing: kernels occupy different SMs
  - Time-sharing: kernels time-share the same SMs (benefiting from hardware support for context switches)
  - Note: resource conflicts may prevent this
- Molding: change the kernel configuration (a different number of thread blocks / threads per block) to improve collocation

2. Evaluation: the gains

2. Evaluation: the overheads

Discussion:
- Limitations
- Hardware support
- OpenCL vs. CUDA

http://ft.ornl.gov/doku/shoc/level1
http://ft.ornl.gov/pubs-archive/shoc.pdf