05-hetero-programming

Martin Kruliš
November 4, 2014

GPU
◦ “Independent” device
◦ Controlled by host
◦ Used for “offloading”

Host Code
◦ Needs to be designed so that it
 Utilizes the GPU(s) efficiently
 Utilizes the CPU while the GPU is working
 Ensures the CPU and GPU do not wait for each other

Bad Example
CPU/GPU timeline: the host issues each command and then waits; the device is working only while a kernel or a copy is in flight.

cudaMemcpy(..., HostToDevice);
Kernel1<<<...>>>(...);
cudaDeviceSynchronize();
cudaMemcpy(..., DeviceToHost);
...
cudaMemcpy(..., HostToDevice);
Kernel2<<<...>>>(...);
cudaDeviceSynchronize();
cudaMemcpy(..., DeviceToHost);
...

Overlapping CPU and GPU Work
◦ Kernels
 Are started asynchronously
 Can be waited for (cudaDeviceSynchronize())
 A little more can be done with streams
◦ Memory transfers
 cudaMemcpy() is synchronous and blocking
 Alternatively, cudaMemcpyAsync() starts the transfer and returns immediately
 Can be synchronized in the same way as a kernel

Using Asynchronous Transfers
CPU/GPU timeline: the host queues the transfers and the kernel, keeps working on the CPU, and synchronizes only at the end.

cudaMemcpyAsync(..., HostToDevice);
Kernel1<<<...>>>(...);
cudaMemcpyAsync(..., DeviceToHost);
...
do_something_on_cpu();
...
cudaDeviceSynchronize();

Workload balance between the CPU and the GPU becomes an issue.
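A minimal self-contained sketch of this pattern (the kernel body, buffer names, and do_something_on_cpu() are placeholders; the host buffers are assumed to be page-locked so the copies are truly asynchronous):

__global__ void Kernel1(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                  // placeholder device work
}

void process(float *pinnedIn, float *pinnedOut, float *devData, int n) {
    // Queue the whole GPU pipeline without blocking the host.
    cudaMemcpyAsync(devData, pinnedIn, n * sizeof(float),
                    cudaMemcpyHostToDevice);
    Kernel1<<<(n + 255) / 256, 256>>>(devData, n);
    cudaMemcpyAsync(pinnedOut, devData, n * sizeof(float),
                    cudaMemcpyDeviceToHost);

    do_something_on_cpu();       // overlapped CPU work (placeholder)

    cudaDeviceSynchronize();     // wait for the copies and the kernel to finish
}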

CPU Threads
◦ Multiple CPU threads may use the GPU

GPU Overlapping Capabilities
◦ Multiple kernels may run simultaneously
 Since the Fermi architecture
 cudaDeviceProp.concurrentKernels
◦ Kernel execution may overlap with data transfers
 Or even with multiple data transfers
 cudaDeviceProp.asyncEngineCount
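These capabilities can be queried at runtime, e.g. for device 0 (a small sketch):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("concurrent kernels supported: %d\n", prop.concurrentKernels);
printf("async copy engines:           %d\n", prop.asyncEngineCount);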

Stream
◦ In-order GPU command queue (like in OpenCL)
 Asynchronous GPU operations are registered in the queue
 Kernel execution
 Memory data transfers
 Commands in different streams may overlap
 Provides means for explicit and implicit synchronization
◦ Default stream (stream 0)
 Always present, does not have to be created
 Global synchronization capabilities

Stream Creation
cudaStream_t stream;
cudaStreamCreate(&stream);

Stream Usage
cudaMemcpyAsync(dst, src, size, kind, stream);
kernel<<<grid, block, sharedMem, stream>>>(...);

Stream Destruction
cudaStreamDestroy(stream);
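
Put together, a minimal sketch of a stream's lifetime (buffer names and sizes are placeholders):

cudaStream_t stream;
cudaStreamCreate(&stream);

// The copy and the kernel are queued into the same stream, so they run
// in order, but asynchronously with respect to the host.
cudaMemcpyAsync(devBuf, hostBuf, size, cudaMemcpyHostToDevice, stream);
kernel<<<grid, block, 0, stream>>>(devBuf);
cudaMemcpyAsync(hostBuf, devBuf, size, cudaMemcpyDeviceToHost, stream);

cudaStreamSynchronize(stream);   // wait for this stream only
cudaStreamDestroy(stream);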

Synchronization
◦ Explicit
 cudaStreamSynchronize(stream) – waits until all commands issued to the stream have completed
 cudaStreamQuery(stream) – a non-blocking test whether the stream has finished
◦ Implicit
 Operations in different streams cannot overlap if a special operation is issued between them, such as
 Memory allocation
 A CUDA command issued to the default stream
 A switch of the L1/shared memory configuration
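
A hedged sketch of both explicit variants (do_some_cpu_work() is a placeholder):

// Blocking wait for a single stream:
cudaStreamSynchronize(stream);

// Non-blocking polling, keeping the CPU busy meanwhile:
while (cudaStreamQuery(stream) == cudaErrorNotReady) {
    do_some_cpu_work();
}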

Overlapping Behavior
◦ Commands in different streams overlap if the hardware is capable of running them concurrently
◦ Unless implicit/explicit synchronization prohibits it
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(…HostToDevice, stream[i]);
    MyKernel<<<g, b, 0, stream[i]>>>(...);
    cudaMemcpyAsync(…DeviceToHost, stream[i]);
}
May cause many implicit synchronizations, depending on the compute capability and hardware overlapping capabilities.

Overlapping Behavior
◦ Commands in different streams overlap if the hardware is capable of running them concurrently
◦ Unless implicit/explicit synchronization prohibits it
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(…HostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<g, b, 0, stream[i]>>>(...);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(…DeviceToHost, stream[i]);
Far fewer opportunities for implicit synchronization

Callbacks
◦ Callbacks are registered in streams by
cudaStreamAddCallback(stream, fnc, data, 0);
◦ The callback function is invoked asynchronously after all preceding commands terminate
◦ A callback registered to the default stream is invoked after the preceding commands in all streams terminate
◦ Operations issued after the registration start only after the callback returns
◦ The callback looks like
void CUDART_CB MyCallback(cudaStream_t stream,
    cudaError_t errorStatus, void *userData) { ...
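
A minimal sketch of registering such a callback (the callback body is a placeholder; note that CUDA API calls must not be made from inside a callback):

void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t errorStatus,
                          void *userData) {
    // Runs on a host thread once all preceding commands in the stream finish.
    printf("stream done, status = %d\n", (int)errorStatus);
}
...
cudaStreamAddCallback(stream, MyCallback, NULL, 0);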

Events
◦ Special markers that can be used for synchronization and performance monitoring
◦ The typical usage is
 Waiting until all commands issued before the marker have finished
 Explicit synchronization between selected streams
 Measuring the time between two events
◦ Example
cudaEvent_t event;
cudaEventCreate(&event);
cudaEventRecord(event, stream);
cudaEventSynchronize(event);
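
For example, measuring the time spent between two events in a stream (a sketch; the kernel and launch configuration are placeholders):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);
kernel<<<grid, block, 0, stream>>>(...);
cudaEventRecord(stop, stream);

cudaEventSynchronize(stop);               // wait until the stop event is reached
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);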

Making Good Use of Overlapping
◦ Split the work into smaller fragments
◦ Create a pipeline effect (load, process, store)
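
A hedged sketch of such a pipeline with two streams (chunk sizes, buffer names, and the ProcessChunk kernel are placeholders; host buffers are assumed to be pinned):

// Two streams let chunk i+1's upload overlap with chunk i's kernel and download.
for (int i = 0; i < numChunks; ++i) {
    int s = i % 2;
    cudaMemcpyAsync(devIn[s], hostIn + i * chunkElems, chunkBytes,
                    cudaMemcpyHostToDevice, stream[s]);
    ProcessChunk<<<grid, block, 0, stream[s]>>>(devIn[s], devOut[s]);
    cudaMemcpyAsync(hostOut + i * chunkElems, devOut[s], chunkBytes,
                    cudaMemcpyDeviceToHost, stream[s]);
}
cudaDeviceSynchronize();   // wait for all chunks

Reusing devIn[s]/devOut[s] across iterations is safe because commands within one stream execute in order.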

Data Gather and Scatter Problem
[Diagram: input data in host memory → gather → GPU memory → kernel execution → scatter → results in host memory. Multiple cudaMemcpy() calls may be quite inefficient.]

Gather and Scatter
◦ Reducing the overhead
◦ The gather and scatter steps are performed by the CPU before/after cudaMemcpy()
[Diagram: the main thread gathers input for the next chunk and scatters results of the previous one, while stream 0 and stream 1 alternately run HtD copy → kernel → DtH copy.]
The number of threads per GPU and the number of streams per thread depend on the workload structure.

Page-locked (Pinned) Host Memory
◦ Host memory that is prevented from being swapped out
◦ Created/released by
cudaHostAlloc(), cudaFreeHost()
cudaHostRegister(), cudaHostUnregister()
◦ Optionally with flags
cudaHostAllocWriteCombined – optimized for writing, not cached on the CPU
cudaHostAllocMapped
cudaHostAllocPortable
◦ Copies between pinned host memory and the device are automatically performed asynchronously
◦ Pinned memory is a scarce resource
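
A small sketch of allocating and releasing pinned memory (the size N and devBuf are placeholders):

float *hostBuf;
// Page-locked allocation; needed for truly asynchronous copies.
cudaHostAlloc((void**)&hostBuf, N * sizeof(float), cudaHostAllocDefault);

cudaMemcpyAsync(devBuf, hostBuf, N * sizeof(float),
                cudaMemcpyHostToDevice, stream);
...
cudaFreeHost(hostBuf);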

Device Memory Mapping
◦ Allows the GPU to access portions of host memory directly (i.e., without explicit copy operations)
 For both reading and writing
◦ The memory must be allocated/registered with the flag cudaHostAllocMapped
◦ The context must have the cudaDeviceMapHost flag (set by cudaSetDeviceFlags())
◦ The function cudaHostGetDevicePointer() takes a host pointer and returns the corresponding device pointer
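
A hedged sketch of the whole mapping sequence (the kernel and size N are placeholders):

cudaSetDeviceFlags(cudaDeviceMapHost);   // must be set before the context is created

float *hostPtr, *devPtr;
cudaHostAlloc((void**)&hostPtr, N * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);

kernel<<<grid, block>>>(devPtr);   // the kernel accesses host memory directly
cudaDeviceSynchronize();           // make the device writes visible to the host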

Asynchronous Errors
◦ An error may occur outside of a CUDA call
 E.g., during an asynchronous memory transfer or kernel execution
◦ The error is reported by the next CUDA call
◦ To make sure all errors have been reported, the device must be synchronized (cudaDeviceSynchronize())
◦ Error handling functions
 cudaGetLastError()
 cudaPeekAtLastError()
 cudaGetErrorString(error)
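
A common checking pattern after an asynchronous kernel launch (a sketch):

kernel<<<grid, block>>>(...);
cudaError_t err = cudaGetLastError();   // errors detected at launch time
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // errors that occurred during execution
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));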