Multi-GPU and Stream Programming

Kishan Wimalawarne
Stream programming
Multi-GPU programming
UVA & GPUDirect
• Page-locked memory (pinned memory)
– Useful for overlapping data transfers with kernel execution
– Use cudaHostAlloc() and cudaFreeHost() to allocate and free
page-locked host memory
• Mapped memory
– A block of page-locked host memory can also be mapped
into the address space of the device by passing flag
cudaHostAllocMapped to cudaHostAlloc()
• Zero-copy enables GPU threads to access host memory directly.
• Requires mapped pinned (non-pageable) memory.
• Zero-copy can be used in place of streams because
kernel-originated data transfers automatically
overlap kernel execution without the overhead of
setting up and determining the optimal number of
streams.
• Call cudaSetDeviceFlags() with the cudaDeviceMapHost flag to
enable mapping of pinned host memory.
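The zero-copy steps above can be sketched as follows. This is a minimal illustration, not code from the original slides: the kernel scale(), the function runZeroCopy(), and the buffer size are hypothetical names and values chosen for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles each element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Returns 0 on success, 1 on a CUDA error or a wrong result.
int runZeroCopy(void) {
    const int n = 1 << 20;              // assumed element count
    float *hostPtr = NULL, *devPtr = NULL;

    // Ask for host-memory mapping before any other CUDA work on the device.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return 1;

    // Page-locked host buffer that is also mapped into the device address space.
    if (cudaHostAlloc((void **)&hostPtr, n * sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) return 1;
    for (int i = 0; i < n; i++) hostPtr[i] = 1.0f;

    // Device-side pointer that refers to the same host buffer.
    if (cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0) != cudaSuccess) {
        cudaFreeHost(hostPtr);
        return 1;
    }

    // No cudaMemcpy: the kernel reads and writes host memory directly.
    scale<<<(n + 255) / 256, 256>>>(devPtr, n);

    // Wait for the kernel before the host reads the results.
    cudaError_t err = cudaDeviceSynchronize();
    int ok = (err == cudaSuccess) && hostPtr[0] == 2.0f && hostPtr[n - 1] == 2.0f;

    cudaFreeHost(hostPtr);
    return ok ? 0 : 1;
}

int main(void) {
    int rc = runZeroCopy();
    printf("zero-copy %s\n", rc == 0 ? "succeeded" : "failed");
    return rc;
}
```

Every access the kernel makes to devPtr crosses the bus to host memory, so this pattern tends to suit data that is read or written once rather than reused many times.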
Stream Programming
• Stream programming (pipelining) is a useful parallel pattern.
• Data transfer from host to device is a major
performance bottleneck in GPU programming
• CUDA provides support for asynchronous data transfer
and kernel executions.
• A stream is simply a sequence of operations that are
performed in order on the device.
• Streams allow concurrent execution of kernels.
• On Fermi-class devices, at most 16 kernels can execute
concurrently; the limit depends on the device's compute capability.
Asynchronous memory Transfer
• Use cudaMemcpyAsync() instead of cudaMemcpy().
• cudaMemcpyAsync() is a non-blocking data transfer
method that uses pinned host memory.
• cudaError_t cudaMemcpyAsync(void *dst,
const void *src, size_t count,
cudaMemcpyKind kind, cudaStream_t stream)
Stream Structures
• cudaStream_t
– Specifies a stream in a CUDA program
• cudaStreamCreate(cudaStream_t *stm)
– Creates a new stream
Streaming example
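The following is a minimal sketch of a streamed pipeline, assuming the work can be split into independent chunks. The kernel addOne(), the function runStreams(), and the constant NUM_STREAMS are hypothetical names for illustration and do not come from the original slide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_STREAMS 4   // assumed number of chunks, one stream per chunk

// Hypothetical kernel: adds 1 to each element of its chunk.
__global__ void addOne(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Returns 0 on success, 1 on a CUDA error or a wrong result.
int runStreams(void) {
    const int n = 1 << 22;
    const int chunk = n / NUM_STREAMS;
    const size_t chunkBytes = chunk * sizeof(float);
    float *h = NULL, *d = NULL;
    cudaStream_t s[NUM_STREAMS];

    // Asynchronous copies need pinned host memory to be truly non-blocking.
    if (cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) { cudaFreeHost(h); return 1; }
    for (int i = 0; i < n; i++) h[i] = 0.0f;
    for (int k = 0; k < NUM_STREAMS; k++) cudaStreamCreate(&s[k]);

    // Each stream copies its chunk in, processes it, and copies it back.
    // Operations in one stream run in order; different streams may overlap.
    for (int k = 0; k < NUM_STREAMS; k++) {
        int off = k * chunk;
        cudaMemcpyAsync(d + off, h + off, chunkBytes, cudaMemcpyHostToDevice, s[k]);
        addOne<<<(chunk + 255) / 256, 256, 0, s[k]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunkBytes, cudaMemcpyDeviceToHost, s[k]);
    }

    // Wait for every stream to drain before checking the results.
    int ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; i++) ok = (h[i] == 1.0f);

    for (int k = 0; k < NUM_STREAMS; k++) cudaStreamDestroy(s[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return ok ? 0 : 1;
}

int main(void) {
    int rc = runStreams();
    printf("stream pipeline %s\n", rc == 0 ? "succeeded" : "failed");
    return rc;
}
```

Whether the copies and kernels actually overlap depends on the device's copy engines and on the chunk size, so the number of streams is usually tuned by measurement.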
Event processing
• Events are used for
– Monitoring the device's progress
– Accurate timing
• cudaEvent_t e
• cudaEventCreate(&e);
• cudaEventDestroy(e);
Event processing
• cudaEventRecord() records an event associated with
a stream.
• cudaEventElapsedTime() finds the time between two
input events.
• cudaEventSynchronize() blocks until the event has
actually been recorded.
• cudaEventQuery() checks the status of an event.
• cudaStreamWaitEvent() makes all future work
submitted to the stream wait until the event reports
completion before beginning execution.
• cudaEventCreateWithFlags() creates an event with flags,
e.g. cudaEventDefault, cudaEventBlockingSync
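A minimal sketch of timing a kernel with the event calls listed above. The kernel busy() and the function timeKernelMs() are hypothetical names used only for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that gives the timer something to measure.
__global__ void busy(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

// Returns the kernel time in milliseconds, or -1.0f on a CUDA error.
float timeKernelMs(void) {
    const int n = 1 << 20;
    float *d = NULL;
    float ms = -1.0f;
    cudaEvent_t start, stop;

    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                 // enqueue "start" in the default stream
    busy<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop, 0);                  // enqueue "stop" after the kernel

    // Block the host until "stop" has actually been recorded on the device.
    if (cudaEventSynchronize(stop) == cudaSuccess) {
        if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}

int main(void) {
    float ms = timeKernelMs();
    if (ms < 0.0f) { printf("timing failed\n"); return 1; }
    printf("kernel took %f ms\n", ms);
    return 0;
}
```

Because both events are recorded on the device, the measured interval excludes host-side launch overhead, which is why events are preferred over CPU timers for kernel timing.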
Stream Synchronization
• cudaDeviceSynchronize() waits until all preceding
commands in all streams of all host threads have
completed.
• cudaStreamSynchronize() takes a stream as a parameter
and waits until all preceding commands in the given stream
have completed
• cudaStreamWaitEvent() takes a stream and an event as
parameters and makes all the commands added to the
given stream after the call to cudaStreamWaitEvent() delay
their execution until the given event has completed.
• cudaStreamQuery() provides applications with a way to
know if all preceding commands in a stream have
completed.
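As a sketch of how these calls combine, the example below makes one stream wait for work queued in another. The kernels fill() and increment(), the function runDependency(), and the stream names are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical producer kernel: writes v into every element.
__global__ void fill(int *d, int n, int v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = v;
}

// Hypothetical consumer kernel: adds 1 to every element.
__global__ void increment(int *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

// Returns 0 on success, 1 on a CUDA error or a wrong result.
int runDependency(void) {
    const int n = 1 << 16;
    int *d = NULL, first = 0, last = 0;
    cudaStream_t producer, consumer;
    cudaEvent_t ready;

    if (cudaMalloc((void **)&d, n * sizeof(int)) != cudaSuccess) return 1;
    cudaStreamCreate(&producer);
    cudaStreamCreate(&consumer);
    // Timing is not needed for an ordering-only event.
    cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

    fill<<<(n + 255) / 256, 256, 0, producer>>>(d, n, 41);
    cudaEventRecord(ready, producer);          // marks the end of fill()

    // Work queued in "consumer" after this call waits for "ready".
    cudaStreamWaitEvent(consumer, ready, 0);
    increment<<<(n + 255) / 256, 256, 0, consumer>>>(d, n);

    // Block the host only on the stream whose result it needs.
    int ok = (cudaStreamSynchronize(consumer) == cudaSuccess);
    cudaMemcpy(&first, d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&last, d + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
    ok = ok && first == 42 && last == 42;

    cudaEventDestroy(ready);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(d);
    return ok ? 0 : 1;
}

int main(void) {
    int rc = runDependency();
    printf("cross-stream dependency %s\n", rc == 0 ? "succeeded" : "failed");
    return rc;
}
```

cudaStreamWaitEvent() does not block the host; the wait happens on the device, so the host thread is free to queue further work.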
Multi GPU programming
Multiple device access
• cudaSetDevice(devID)
– Selects a device within the code by its identifier;
subsequent CUDA calls and kernel launches from the
calling host thread run on the selected GPU.
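A minimal sketch of selecting each available device in turn from a single host thread. The kernel writeId() and the function runOnAllDevices() are hypothetical names for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: stores the id it was given.
__global__ void writeId(int *out, int id) {
    *out = id;
}

// Returns 0 on success, 1 on a CUDA error, a wrong result, or no devices.
int runOnAllDevices(void) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) return 1;

    for (int dev = 0; dev < count; dev++) {
        // All following CUDA calls from this thread target device "dev".
        if (cudaSetDevice(dev) != cudaSuccess) return 1;

        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s\n", dev, prop.name);

        int *d = NULL, h = -1;
        if (cudaMalloc((void **)&d, sizeof(int)) != cudaSuccess) return 1;
        writeId<<<1, 1>>>(d, dev);
        // A blocking copy also waits for the kernel to finish.
        cudaError_t err = cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(d);
        if (err != cudaSuccess || h != dev) return 1;
    }
    return 0;
}

int main(void) {
    return runOnAllDevices();
}
```

Memory allocated while one device is selected belongs to that device, so the allocation and the kernel launch that uses it are kept under the same cudaSetDevice() call here.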
Peer to peer memory Access
• Peer-to-peer memory access
– Supported only on Tesla-series devices or above
– cudaDeviceCanAccessPeer() checks whether peer
access is possible
– cudaDeviceEnablePeerAccess() enables peer access
Peer to peer memory Copy
• Using cudaMemcpyPeer()
– Works on the GeForce GTX 480 and other GPUs.
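A minimal sketch of a peer-to-peer copy between device 0 and device 1, enabling direct peer access when the hardware reports it. The function runPeerCopy() and its return codes are assumptions made for this example.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Returns 0 on success, 1 on error, 2 if fewer than two GPUs are present.
int runPeerCopy(void) {
    const size_t n = 1024, bytes = n * sizeof(int);
    int count = 0, canAccess = 0;
    int src[1024], dst[1024];
    int *d0 = NULL, *d1 = NULL;

    if (cudaGetDeviceCount(&count) != cudaSuccess) return 1;
    if (count < 2) return 2;                     // nothing to demonstrate

    for (size_t i = 0; i < n; i++) src[i] = (int)i;
    memset(dst, 0, bytes);

    // Source buffer lives on device 0.
    cudaSetDevice(0);
    if (cudaMalloc((void **)&d0, bytes) != cudaSuccess) return 1;
    cudaMemcpy(d0, src, bytes, cudaMemcpyHostToDevice);

    // Destination buffer lives on device 1.
    cudaSetDevice(1);
    if (cudaMalloc((void **)&d1, bytes) != cudaSuccess) return 1;

    // Can device 1 address device 0's memory directly?
    cudaDeviceCanAccessPeer(&canAccess, 1, 0);
    if (canAccess) cudaDeviceEnablePeerAccess(0, 0);  // current device is 1

    // Works with or without peer access; without it the copy is staged.
    cudaError_t err = cudaMemcpyPeer(d1, 1, d0, 0, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(dst, d1, bytes, cudaMemcpyDeviceToHost);

    int ok = (err == cudaSuccess) && memcmp(src, dst, bytes) == 0;

    cudaFree(d1);
    cudaSetDevice(0);
    cudaFree(d0);
    return ok ? 0 : 1;
}

int main(void) {
    int rc = runPeerCopy();
    if (rc == 2) printf("skipped: fewer than two GPUs\n");
    else printf("peer copy %s\n", rc == 0 ? "succeeded" : "failed");
    return rc == 1 ? 1 : 0;
}
```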
Programming multiple GPUs
• The most efficient way to use multiple GPUs is
to use host threads for multiple GPUs and
divide the work among them.
– E.g. pthreads
• Combine the parallelism of a multi-core
processor with that of multiple GPUs.
• In each thread use cudaSetDevice() to specify
the device to run.
Multiple GPU
• For each computation on a GPU, create a
separate host thread and specify the device its
CUDA kernel should run on.
• Synchronize both the CPU threads and the GPUs.
Multiple GPU Example
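// p0, p1 (host buffers), p2, p3 (device pointers), size and the
// kernel test() are declared elsewhere (not shown on the slide).
void *GPUprocess(void *id) {
    long tid = (long)id;
    cudaSetDevice((int)tid);   // bind this host thread to its own GPU
    if (tid == 0) {
        cudaMalloc((void **)&p2, size);
        cudaMemcpy(p2, p0, size, cudaMemcpyHostToDevice);
        test<<<10 * 5024, 1024>>>(p2, tid + 2);
        cudaMemcpy(p0, p2, size, cudaMemcpyDeviceToHost);
        cudaFree(p2);
    } else if (tid == 1) {
        cudaMalloc((void **)&p3, size);
        cudaMemcpy(p3, p1, size, cudaMemcpyHostToDevice);
        test<<<10 * 5024, 1024>>>(p3, tid + 2);
        cudaMemcpy(p1, p3, size, cudaMemcpyDeviceToHost);
        cudaFree(p3);
    }
    return NULL;
}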
Multiple GPU Example
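#include <pthread.h>
#include <stdio.h>
// Inside main(); NUM_THREADS is the number of GPUs to use.
pthread_t thread[NUM_THREADS];
pthread_attr_t attr;
pthread_attr_init(&attr);   // must precede pthread_attr_setdetachstate()
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for (long t = 0; t < NUM_THREADS; t++) {
    int rc = pthread_create(&thread[t], &attr, GPUprocess, (void *)t);
    if (rc) {
        printf("ERROR; return code from pthread_create() is %d\n", rc);
        return 1;
    }
}
for (long t = 0; t < NUM_THREADS; t++)
    pthread_join(thread[t], NULL);   // wait for every GPU worker thread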
Unified Virtual Address Space (UVA)
• Requires a 64-bit process; on Windows Vista/7 the GPU
must be in TCC mode (Tesla only).
• GPUDirect builds on UVA for Tesla (Fermi) products.
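With a unified address space the runtime can tell from a pointer's value where the memory lives, so a copy does not need an explicit direction. The sketch below illustrates this with cudaMemcpyDefault; the function runUvaCopy() and its return codes are assumptions made for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 0 on success, 1 on error, 2 if the device lacks unified addressing.
int runUvaCopy(void) {
    const int n = 1 << 16;
    const size_t bytes = n * sizeof(float);
    float *h = NULL, *d = NULL;
    cudaDeviceProp prop;

    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    if (!prop.unifiedAddressing) return 2;       // UVA not available here
    cudaSetDevice(0);

    if (cudaHostAlloc((void **)&h, bytes, cudaHostAllocDefault) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { cudaFreeHost(h); return 1; }
    for (int i = 0; i < n; i++) h[i] = 3.0f;

    // Direction is inferred from the pointers: host to device here.
    cudaError_t err = cudaMemcpy(d, h, bytes, cudaMemcpyDefault);

    // Clear the host copy, then let the runtime infer device to host.
    for (int i = 0; i < n; i++) h[i] = 0.0f;
    if (err == cudaSuccess) err = cudaMemcpy(h, d, bytes, cudaMemcpyDefault);

    int ok = (err == cudaSuccess) && h[0] == 3.0f && h[n - 1] == 3.0f;
    cudaFree(d);
    cudaFreeHost(h);
    return ok ? 0 : 1;
}

int main(void) {
    int rc = runUvaCopy();
    if (rc == 2) printf("skipped: no unified addressing\n");
    else printf("UVA copy %s\n", rc == 0 ? "succeeded" : "failed");
    return rc == 1 ? 1 : 0;
}
```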