
Advertisement

Last time: Runtime infrastructure for hybrid (GPU-based) platforms

- Task scheduling
  - Extracting performance models at runtime
- Memory management
  - Asymmetric Distributed Shared Memory

StarPU: A Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, Cédric Augonnet, Samuel Thibault, and Raymond Namyst, TR-7240, INRIA, March 2010. [link]
An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems, Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro, and Wen-mei Hwu, ASPLOS'10. [pdf]


Today: Bridging runtime and language support

'Virtualizing GPUs'
- Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, Jungwon Kim, Honggyu Kim, Joo Hwan Lee, and Jaejin Lee, PPoPP'11. [pdf]
- Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework, Vignesh T. Ravi et al., HPDC'11. (best paper!)

Context: clouds shift to support HPC applications

- Initially, tightly coupled applications were not well suited to cloud platforms
- Today:
  - a Chinese cloud with 40 Gbps InfiniBand
  - Amazon HPC instances
  - GPU instances: Amazon, Nimbix

Challenge: make GPUs a shared resource in the cloud.


Why do this?

- GPUs are costly resources
- Multiple VMs on a node can share a single GPU
- Increase utilization:
  - app level: some apps might not use the GPU much
  - kernel level: some kernels can be collocated

Two streams:
1. How?
2. Evaluate…
   - opportunities
   - gains
   - overheads
1. The 'How?'

Preamble: concurrent kernels are supported by today's GPUs
- Each kernel can execute a different task
- Tasks can be mapped to different streaming multiprocessors (via the thread-block configuration)
- Problem: concurrent execution is limited to the set of kernels invoked within a single process context (see the sketch below)
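A minimal CUDA sketch of this preamble (kernelA/kernelB are illustrative names, not from the paper): two independent kernels launched on separate streams, which the hardware may overlap precisely because both launches come from one process context.

// Two independent kernels on separate streams; the GPU may run them
// concurrently. Both launches share a single process context -- exactly
// the limitation named above.
#include <cuda_runtime.h>

__global__ void kernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void kernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Small grids leave SMs idle, so the two kernels can space-share the GPU.
    kernelA<<<32, 256, 0, s1>>>(x, n);
    kernelB<<<32, 256, 0, s2>>>(y, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(x); cudaFree(y);
    return 0;
}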
Past virtualization solutions
- API rerouting via an intercept library (sketched below)
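A hedged sketch of the intercept-library approach, assuming an LD_PRELOAD shim around the real CUDA runtime entry point cudaLaunchKernel; the logging shown is illustrative, not the paper's implementation.

// Hypothetical intercept library: build as a shared object and activate
// with LD_PRELOAD. It wraps the real cudaLaunchKernel so a consolidation
// runtime could observe -- and in principle reschedule -- every launch.
// Only the CUDA runtime API call is real; the logging policy is illustrative.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <cstdio>
#include <cuda_runtime.h>

extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
                                        dim3 blockDim, void **args,
                                        size_t sharedMem, cudaStream_t stream) {
    typedef cudaError_t (*launch_t)(const void *, dim3, dim3, void **,
                                    size_t, cudaStream_t);
    // Look up the next (real) definition of the symbol.
    static launch_t real_launch =
        (launch_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    // A sharing framework would pick a device, stream, or molded
    // configuration here before forwarding the launch.
    fprintf(stderr, "intercepted: grid=(%u,%u,%u) block=(%u,%u,%u)\n",
            gridDim.x, gridDim.y, gridDim.z,
            blockDim.x, blockDim.y, blockDim.z);
    return real_launch(func, gridDim, blockDim, args, sharedMem, stream);
}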
1. The ‘How?’

Architecture
2. Evaluation – The opportunity

Key assumption: under-utilization of GPUs

Sharing
- Space-sharing: kernels occupy different SMs
- Time-sharing: kernels time-share the same SMs (benefiting from hardware support for context switches)
- Note: resource conflicts may prevent sharing

Molding – change the kernel configuration (a different number of thread blocks / threads per block) to improve collocation; see the sketch below.
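A brief sketch of why molding can be safe, assuming a grid-stride kernel (the scale kernel below is illustrative, not from the paper): such a kernel computes the same result under any grid/block configuration, so a runtime can shrink its grid to free SMs for a co-located kernel.

// Illustrative grid-stride kernel: correct under any launch configuration,
// which is what lets a runtime "mold" its grid to improve collocation.
__global__ void scale(float *x, int n, float a) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)  // stride spans the whole grid
        x[i] *= a;
}

// Original launch:  scale<<<1024, 256>>>(x, n, 2.0f);
// Molded launch:    scale<<<128, 256>>>(x, n, 2.0f);  // same result, fewer SMs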
2. Evaluation – The gains
2. Evaluation – The overheads

Discussion

- Limitations
- Hardware support
- OpenCL vs. CUDA

SHOC benchmark suite (download):
- http://ft.ornl.gov/doku/shoc/level1
- http://ft.ornl.gov/pubs-archive/shoc.pdf