Martin Kruliš (v1.0), 4. 11. 2014

GPU
◦ An "independent" device
◦ Controlled by the host
◦ Used for "offloading" work

Host Code
◦ Needs to be designed in a way that
   - utilizes the GPU(s) efficiently,
   - utilizes the CPU while the GPU is working,
   - ensures the CPU and the GPU do not wait for each other.

Bad Example
◦ Everything is synchronous, so the CPU only waits while the device is working, and vice versa:

cudaMemcpy(..., HostToDevice);
Kernel1<<<...>>>(...);
cudaDeviceSynchronize();
cudaMemcpy(..., DeviceToHost);
...
cudaMemcpy(..., HostToDevice);
Kernel2<<<...>>>(...);
cudaDeviceSynchronize();
cudaMemcpy(..., DeviceToHost);
...

Overlapping CPU and GPU Work
◦ Kernels
   - Started asynchronously
   - Can be waited for (cudaDeviceSynchronize())
   - A little more can be done with streams
◦ Memory transfers
   - cudaMemcpy() is synchronous and blocking
   - Alternatively, cudaMemcpyAsync() starts the transfer and returns immediately
   - Can be synchronized the same way as a kernel

Using Asynchronous Transfers

cudaMemcpyAsync(..., HostToDevice);
Kernel1<<<...>>>(...);
cudaMemcpyAsync(..., DeviceToHost);
...
do_something_on_cpu();
...
cudaDeviceSynchronize();

◦ Workload balance between the CPU and the GPU becomes an issue

CPU Threads
◦ Multiple CPU threads may use the GPU

GPU Overlapping Capabilities
◦ Multiple kernels may run simultaneously
   - Since the Fermi architecture
   - cudaDeviceProp.concurrentKernels
◦ Kernel execution may overlap with data transfers
   - Or even with multiple data transfers
   - cudaDeviceProp.asyncEngineCount
◦ (A sketch of querying these properties follows the overlapping examples below.)

Stream
◦ An in-order GPU command queue (like in OpenCL)
   - Asynchronous GPU operations (kernel executions, memory data transfers) are registered in the queue
   - Commands in different streams may overlap
   - Streams provide means for explicit and implicit synchronization
◦ Default stream (stream 0)
   - Always present, does not have to be created
   - Has global synchronization capabilities

Stream Creation

cudaStream_t stream;
cudaStreamCreate(&stream);

Stream Usage

cudaMemcpyAsync(dst, src, size, kind, stream);
kernel<<<grid, block, sharedMem, stream>>>(...);

Stream Destruction

cudaStreamDestroy(stream);

Synchronization
◦ Explicit
   - cudaStreamSynchronize(stream) – waits until all commands issued to the stream have completed
   - cudaStreamQuery(stream) – a non-blocking test of whether the stream has finished
◦ Implicit
   - Operations in different streams cannot overlap if a special operation is issued between them:
      - a memory allocation,
      - a CUDA command issued to the default stream,
      - a switch of the L1/shared memory configuration.

Overlapping Behavior
◦ Commands in different streams overlap if the hardware is capable of running them concurrently
◦ Unless implicit/explicit synchronization prohibits it

for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(..., HostToDevice, stream[i]);
    MyKernel<<<g, b, 0, stream[i]>>>(...);
    cudaMemcpyAsync(..., DeviceToHost, stream[i]);
}

◦ This depth-first issue order may cause many implicit synchronizations, depending on the compute capability and the hardware overlapping capabilities
◦ Issuing the commands breadth-first instead leaves far fewer opportunities for implicit synchronization:

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(..., HostToDevice, stream[i]);
for (int i = 0; i < 2; ++i)
    MyKernel<<<g, b, 0, stream[i]>>>(...);
for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(..., DeviceToHost, stream[i]);
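Putting the pieces together, a minimal self-contained sketch of the breadth-first pattern above, using pinned host memory and two streams (the kernel, buffer sizes, and stream count are illustrative assumptions, not from the original slides):

#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int N = 1 << 20;                     // items per stream (illustrative)
    const int STREAMS = 2;

    float *hData, *dData;
    cudaHostAlloc((void **)&hData, STREAMS * N * sizeof(float), cudaHostAllocDefault);  // pinned
    cudaMalloc((void **)&dData, STREAMS * N * sizeof(float));

    cudaStream_t stream[STREAMS];
    for (int i = 0; i < STREAMS; ++i)
        cudaStreamCreate(&stream[i]);

    // Breadth-first issue order: all HtD copies, then all kernels, then all DtH copies.
    for (int i = 0; i < STREAMS; ++i)
        cudaMemcpyAsync(dData + i * N, hData + i * N, N * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
    for (int i = 0; i < STREAMS; ++i)
        myKernel<<<(N + 255) / 256, 256, 0, stream[i]>>>(dData + i * N, N);
    for (int i = 0; i < STREAMS; ++i)
        cudaMemcpyAsync(hData + i * N, dData + i * N, N * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);

    cudaDeviceSynchronize();                   // wait until both streams drain

    for (int i = 0; i < STREAMS; ++i)
        cudaStreamDestroy(stream[i]);
    cudaFree(dData);
    cudaFreeHost(hData);
    return 0;
}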
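Whether the commands in such a pipeline actually overlap depends on the device properties mentioned earlier (concurrentKernels, asyncEngineCount); a minimal sketch of querying them, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

    printf("concurrent kernels: %s\n", prop.concurrentKernels ? "yes" : "no");
    printf("async copy engines: %d\n", prop.asyncEngineCount);
    // asyncEngineCount == 1: one transfer may overlap with kernel execution;
    // asyncEngineCount == 2: HtD and DtH transfers may additionally overlap each other.
    return 0;
}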
Callbacks
◦ Callbacks are registered in streams by
   cudaStreamAddCallback(stream, fnc, data, 0);
◦ The callback function is invoked asynchronously after all preceding commands terminate
◦ A callback registered to the default stream is invoked after the previous commands in all streams terminate
◦ Operations issued after the registration start only after the callback returns
◦ The callback looks like (see the sketch at the end of this text)
   void CUDART_CB MyCallback(cudaStream_t stream, cudaError_t errorStatus, void *userData) { ... }

Events
◦ Special markers that can be used for synchronization and performance monitoring
◦ Typical usage
   - Waiting until all commands issued before the marker finish
   - Explicit synchronization between selected streams
   - Measuring the time between two events (see the timing sketch at the end of this text)
◦ Example

cudaEvent_t event;
cudaEventCreate(&event);
cudaEventRecord(event, stream);
cudaEventSynchronize(event);

Making Good Use of Overlapping
◦ Split the work into smaller fragments
◦ Create a pipeline effect (load, process, store)

Data Gather and Scatter Problem
◦ Input data are gathered in host memory, copied to GPU memory for kernel execution, and the results are scattered back to host memory
◦ Multiple cudaMemcpy() calls may be quite inefficient

Gather and Scatter
◦ Reducing the overhead
   - The gather/scatter is performed by the CPU before/after each cudaMemcpy()
   - The main thread feeds two streams in a pipeline (gather, HtD copy, kernel, DtH copy, scatter), so one stream computes while the other transfers
◦ The number of threads per GPU and the number of streams per thread depend on the workload structure

Page-locked (Pinned) Host Memory
◦ Host memory that is prevented from being swapped out
◦ Created/dismissed by
   - cudaHostAlloc(), cudaFreeHost()
   - cudaHostRegister(), cudaHostUnregister()
◦ Optionally with flags
   - cudaHostAllocWriteCombined – optimized for writing, not cached on the CPU
   - cudaHostAllocMapped
   - cudaHostAllocPortable
◦ Copies between pinned host memory and the device can be truly asynchronous (with pageable memory, cudaMemcpyAsync() may fall back to synchronous behavior)
◦ Pinned memory is a scarce resource

Device Memory Mapping
◦ Allows the GPU to access portions of host memory directly (i.e., without explicit copy operations)
   - For both reading and writing
◦ The memory must be allocated/registered with the cudaHostAllocMapped flag
◦ The context must have the cudaDeviceMapHost flag (set by cudaSetDeviceFlags())
◦ The function cudaHostGetDevicePointer() takes a host pointer and returns the corresponding device pointer (see the zero-copy sketch at the end of this text)

Asynchronous Errors
◦ An error may occur outside of any CUDA call
   - In the case of asynchronous memory transfers or kernel execution
◦ Such an error is reported by the next CUDA call that follows it
◦ To make sure all errors have been reported, the device must synchronize (cudaDeviceSynchronize())
◦ Error handling functions (see the checking sketch at the end of this text)
   - cudaGetLastError()
   - cudaPeekAtLastError()
   - cudaGetErrorString(error)
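A minimal runnable sketch of the callback mechanism described above (the payload string and the otherwise empty stream are illustrative assumptions):

#include <cstdio>
#include <cuda_runtime.h>

// The signature required by cudaStreamAddCallback.
void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void *userData)
{
    // Runs on a runtime thread once all preceding commands in the stream finish;
    // no CUDA API calls may be made from inside a callback.
    printf("stream finished (status %d): %s\n", (int)status, (const char *)userData);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... asynchronous copies and kernels would be issued to the stream here ...

    static const char msg[] = "batch #1";                       // illustrative payload
    cudaStreamAddCallback(stream, myCallback, (void *)msg, 0);  // flags must be 0

    cudaStreamSynchronize(stream);             // also waits for the callback
    cudaStreamDestroy(stream);
    return 0;
}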
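Measuring the time between two events, as mentioned in the Events section; a minimal sketch using cudaEventElapsedTime() (the dummy kernel and sizes are illustrative assumptions):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *dData;
    cudaMalloc((void **)&dData, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);                 // mark the start in the default stream
    dummyKernel<<<(n + 255) / 256, 256>>>(dData, n);
    cudaEventRecord(stop, 0);                  // mark the end
    cudaEventSynchronize(stop);                // wait until the stop marker is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed time in milliseconds
    printf("kernel took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dData);
    return 0;
}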
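A minimal sketch of the device memory mapping (zero-copy) described above; the kernel and buffer size are illustrative assumptions:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void incrementKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;                   // accesses host memory directly
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);     // must precede context creation

    const int n = 1024;
    int *hData, *dData;
    cudaHostAlloc((void **)&hData, n * sizeof(int), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) hData[i] = i;

    cudaHostGetDevicePointer((void **)&dData, hData, 0);  // device alias of the host buffer

    incrementKernel<<<(n + 255) / 256, 256>>>(dData, n);
    cudaDeviceSynchronize();                   // no explicit cudaMemcpy() needed

    printf("hData[42] = %d\n", hData[42]);     // expected: 43
    cudaFreeHost(hData);
    return 0;
}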
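A common pattern for catching the asynchronous errors described above is a small checking helper; a sketch (the CUDA_CHECK macro name is a hypothetical convention, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: aborts with a readable message if a call fails.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err = (call); \
        if (err != cudaSuccess) { \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
            exit(1); \
        } \
    } while (0)

int main()
{
    float *dData;
    CUDA_CHECK(cudaMalloc((void **)&dData, 1024 * sizeof(float)));

    // Kernel launches return no error code; check the last error instead:
    // myKernel<<<1, 256>>>(dData);             // illustrative, not defined here
    CUDA_CHECK(cudaPeekAtLastError());          // launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize());        // asynchronous (execution-time) errors

    CUDA_CHECK(cudaFree(dData));
    return 0;
}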