CUDA Slides by David Kirk

What is GPGPU?
• General Purpose computation using GPU
in applications other than 3D graphics
– GPU accelerates critical path of application
• Data parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Low-latency floating point (FP) computation
• Applications – see GPGPU.org
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix
algebra, convolution, correlation, sorting
Previous GPGPU Constraints
• Dealing with graphics API
  – Working with the corner cases of the graphics API
• Addressing modes
  – Limited texture size/dimension
• Shader capabilities
  – Limited outputs
• Instruction sets
  – Lack of Integer & bit ops
• Communication limited
  – Between pixels
  – Scatter a[i] = p
[Figure: fragment-program model – per-thread input/output registers, per-shader texture, constants, and temp registers, per-context FB memory.]
CUDA
• “Compute Unified Device Architecture”
• General purpose programming model
– User kicks off batches of threads on the GPU
– GPU = dedicated super-threaded, massively data parallel co-processor
• Targeted software stack
– Compute oriented drivers, language, and tools
• Driver for loading computation programs into GPU
  – Standalone Driver – Optimized for computation
  – Interface designed for compute – graphics-free API
  – Data sharing with OpenGL buffer objects
  – Guaranteed maximum download & readback speeds
  – Explicit GPU memory management
Parallel Computing on a GPU
• NVIDIA GPU Computing Architecture
  – Via a separate HW interface
  – In laptops, desktops, workstations, servers
• 8-series GPUs deliver 50 to 200 GFLOPS on compiled parallel C applications
• GPU parallelism is doubling every year
• Programming model scales transparently
• Programmable in C with CUDA tools
• Multithreaded SPMD model uses application data parallelism and thread parallelism
[Images: GeForce 8800, Tesla D870, Tesla S870]
Extended C
• Declspecs
  – global, device, shared, local, constant
• Keywords
  – threadIdx, blockIdx
• Intrinsics
  – __syncthreads
• Runtime API
  – Memory, symbol, execution management
• Function launch

__device__ float filter[N];

__global__ void convolve (float *image)
{
  __shared__ float region[M];
  ...
  region[threadIdx] = image[i];
  __syncthreads();
  ...
  image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes)

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);
CUDA Programming Model:
A Highly Multithreaded Coprocessor
• The GPU is viewed as a compute device that:
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel
• Data-parallel portions of an application are
executed on the device as kernels which run in
parallel on many threads
• Differences between GPU and CPU threads
– GPU threads are extremely lightweight
• Very little creation overhead
– GPU needs 1000s of threads for full efficiency
• Multi-core CPU needs only a few
Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
  – All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
  – Synchronizing their execution
    • For hazard-free shared memory accesses
  – Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks, e.g. Block (0, 0) … Block (2, 1), and each block is a 2D array of threads, e.g. Thread (0, 0) … Thread (4, 2). Courtesy: NVIDIA]
Block and Thread IDs
• Threads and blocks have IDs
  – So each thread can decide what data to work on
  – Block ID: 1D or 2D
  – Thread ID: 1D, 2D, or 3D
• Simplifies memory addressing when processing multidimensional data (see the sketch below)
  – Image processing
  – Solving PDEs on volumes
  – …
[Figure: grid of blocks and block of threads, as on the previous slide. Courtesy: NVIDIA]
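For instance, a minimal sketch (the kernel name, image layout, and parameters are illustrative assumptions) of how a thread can derive the pixel it works on from its block and thread IDs:

__global__ void brighten(float *image, int width, int height, float gain)
{
    // Each thread handles one pixel of a row-major width x height image.
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index

    if (x < width && y < height)                     // guard against partial blocks
        image[y * width + x] *= gain;
}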
CUDA Device Memory Space
Overview
• Each thread can:
  – R/W per-thread registers
  – R/W per-thread local memory
  – R/W per-block shared memory
  – R/W per-grid global memory
  – Read only per-grid constant memory
  – Read only per-grid texture memory
• The host can R/W global, constant, and texture memories
[Figure: (Device) Grid containing Block (0, 0) and Block (1, 0); each block has a shared memory, and each thread has its own registers and local memory; the global, constant, and texture memories are shared by the grid and accessible from the host.]
Global, Constant, and Texture Memories
(Long Latency Accesses)
• Global memory
  – Main means of communicating R/W data between host and device
  – Contents visible to all threads
• Texture and constant memories
  – Constants initialized by host
  – Contents visible to all threads
[Figure: same device memory diagram as on the previous slide. Courtesy: NVIDIA]
CUDA – API
CUDA Highlights:
Easy and Lightweight
• The API is an extension to the ANSI C programming language
  – Low learning curve
• The hardware is designed to enable lightweight runtime and driver
  – High performance
CUDA Device Memory Allocation
• cudaMalloc()
  – Allocates object in the device Global Memory
  – Requires two parameters
    • Address of a pointer to the allocated object
    • Size of allocated object
• cudaFree()
  – Frees object from device Global Memory
    • Pointer to freed object
[Figure: device memory diagram as before.]
CUDA Device Memory Allocation
(cont.)
• Code example:
  – Allocate a 64 * 64 single precision float array
  – Attach the allocated storage to the d_matrix pointer
  – The “d_” prefix is often used to indicate a device data structure

const int BLOCK_SIZE = 64;
float *d_matrix;
int size = BLOCK_SIZE * BLOCK_SIZE * sizeof(float);
cudaMalloc((void**)&d_matrix, size);
cudaFree(d_matrix);
CUDA Host-Device Data Transfer
• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type of transfer
      – Host to Host
      – Host to Device
      – Device to Host
      – Device to Device
• Asynchronous in CUDA 1.1
[Figure: device memory diagram as before.]
CUDA Host-Device Data Transfer
(cont.)
• Code example:
  – Transfer a 64 * 64 single precision float array
  – h_matrix is in host memory and d_matrix is in device memory
  – cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(d_matrix, h_matrix, size, cudaMemcpyHostToDevice);
cudaMemcpy(h_matrix, d_matrix, size, cudaMemcpyDeviceToHost);
Calling a Kernel Function – Thread
Creation
• A kernel function must be called with an
execution configuration:
__global__ void KernelFunc(...);
dim3   DimGrid(100, 50);        // 5000 thread blocks
dim3   DimBlock(4, 8, 8);       // 256 threads per block
size_t SharedMemBytes = 64;     // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

• Any call to a kernel function is asynchronous from CUDA 1.1 on; explicit synchronization is needed for blocking (see the sketch below)
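For example, a minimal sketch of blocking the host until a kernel finishes (the kernel and buffer names are illustrative; cudaThreadSynchronize() is the blocking call in the CUDA 1.x runtime, later renamed cudaDeviceSynchronize()):

__global__ void KernelFunc(float *data) { /* ... */ }

int main()
{
    float *d_data;
    cudaMalloc((void**)&d_data, 1024 * sizeof(float));

    dim3 dimGrid(8);        // 8 thread blocks
    dim3 dimBlock(128);     // 128 threads per block

    KernelFunc<<<dimGrid, dimBlock>>>(d_data);  // returns to the host immediately
    cudaThreadSynchronize();                    // block until the kernel completes

    cudaFree(d_data);
    return 0;
}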
Memory Model
Why Use the GPU for
Computing ?
• The GPU has evolved into a very flexible and powerful processor:
  – It’s programmable using high-level languages
  – It supports 32-bit floating point precision
  – It offers lots of GFLOPS
• GPU in every PC and workstation
[Figure: GFLOPS over time for NV30 (GeForce FX 5800), NV35 (GeForce FX 5950 Ultra), NV40 (GeForce 6800 Ultra), G70 (GeForce 7800 GTX), G71 (GeForce 7900 GTX), and G80 (GeForce 8800 GTX).]
What is Behind such an Evolution?
• The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about)
  – So, more transistors can be devoted to data processing rather than data caching and flow control
[Figure: CPU die (large control logic and cache, few ALUs) vs. GPU die (minimal control and cache, many ALUs), each attached to DRAM.]
• The fast-growing video game industry exerts strong economic pressure that forces constant innovation
split personality
n.
Two distinct personalities in the same entity,
each of which prevails at a particular time.
G80 Thread Computing Pipeline
• The future of GPUs is programmable processing
• So – build the architecture around the processor
• Processors execute computing threads
• Alternative operating mode specifically for computing
[Figure: G80 block diagram. In graphics mode, the Host feeds an Input Assembler and Setup/Rstr/ZCull stage, and Vtx/Geom/Pixel Thread Issue units dispatch work to arrays of streaming processors (SP) with texture fetch (TF) units, parallel data caches, texture L1/L2 caches, and frame buffer (FB) partitions. In compute mode, the Host feeds the Input Assembler and a Thread Execution Manager that dispatches threads to the same SP arrays, which access Global Memory through load/store units and L2.]
GeForce 8800 Series
Technical Specs
• Maximum number of threads per block: 512
• Maximum size of each dimension of a grid: 65,535
• Number of streaming multiprocessors (SM):
  – GeForce 8800 GTX: 16 @ 675 MHz
  – GeForce 8800 GTS: 12 @ 600 MHz
• Device memory:
  – GeForce 8800 GTX: 768 MB
  – GeForce 8800 GTS: 640 MB
• Shared memory per multiprocessor: 16 KB divided into 16 banks
• Constant memory: 64 KB
• Warp size: 32 threads (16 warps/block)
What is the GPU Good at?
• The GPU is good at data-parallel processing
  – The same computation executed on many data elements in parallel – low control flow overhead
• … with high SP floating point arithmetic intensity
  – Many calculations per memory access
  – Currently also need high floating point to integer ratio
• High floating-point arithmetic intensity and many data elements mean that memory access latency can be hidden with calculations instead of big data caches – still need to avoid bandwidth saturation!
Drawbacks of (legacy) GPGPU
Model: Hardware Limitations
• Memory accesses are done as pixels
  – Only gather: can read data from other pixels
  – No scatter: can only write to one pixel
[Figure: SIMD units with control and cache reading arbitrary DRAM locations d0 … d7 (gather), but each writing only its own output location (no scatter).]
Less programming flexibility
Drawbacks of (legacy) GPGPU
Model: Hardware Limitations
• Applications can easily be limited by DRAM memory bandwidth
[Figure: SIMD units with small caches all streaming d0 … d7 from DRAM.]
Waste of computation power due to data starvation
CUDA Highlights: Scatter
• CUDA provides generic DRAM memory addressing
  – Gather
  – And scatter: no longer limited to writing one pixel
[Figure: processing units reading from (gather) and writing to (scatter) arbitrary DRAM locations d0 … d7.]
More programming flexibility
CUDA Highlights:
On-Chip Shared Memory
• CUDA enables access to a parallel on-chip shared memory for efficient inter-thread data sharing
[Figure: each multiprocessor has an on-chip shared memory holding d0 … d3 / d4 … d7 next to its ALUs, so threads can reuse data without re-reading DRAM.]
Big memory bandwidth savings
A Common Programming Pattern
• Local and global memory reside in device memory (DRAM) – much slower access than shared memory
• So, a profitable way of performing computation on the device is to block data to take advantage of fast shared memory (a sketch follows this list):
  – Partition data into data subsets that fit into shared memory
  – Handle each data subset with one thread block by:
    • Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
    • Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
    • Copying results from shared memory to global memory
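A minimal sketch of this pattern, assuming a hypothetical kernel that smooths a 1D array; the tile size and all names are illustrative:

#define TILE 256   // threads per block (assumption for this sketch)

// Each block stages one tile of the input in shared memory, computes on it,
// and writes the results back to global memory.
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Load the subset from global memory into shared memory
    if (i < n)
        tile[threadIdx.x] = in[i];
    __syncthreads();

    // 2. Compute on the subset out of shared memory
    if (i < n) {
        float centre = tile[threadIdx.x];
        float left   = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : centre;
        float right  = (threadIdx.x < TILE - 1 && i + 1 < n) ? tile[threadIdx.x + 1] : centre;

        // 3. Copy the result back to global memory
        out[i] = (left + centre + right) / 3.0f;
    }
}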
A Common Programming Pattern
(cont.)
• Texture and constant memory also reside in device memory (DRAM) – much slower access than shared memory
  – But… cached!
  – Highly efficient access for read-only data
• Carefully divide data according to access patterns
  – R/O, no structure → constant memory
  – R/O, array structured → texture memory
  – R/W, shared within block → shared memory
  – R/W registers spill to local memory
  – R/W inputs/results → global memory
G80 Hardware Implementation:
A Set of SIMD Multiprocessors
• The device is a set of 16 multiprocessors
• Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture – shared instruction unit
• At each clock cycle, a multiprocessor executes the same instruction on a group of threads called a warp
• The number of threads in a warp is the warp size
[Figure: the device contains Multiprocessor 1 … N; each multiprocessor contains Processor 1 … M sharing one Instruction Unit.]
Hardware Implementation:
Memory Architecture
• The local, global, constant, and texture spaces are regions of device memory
• Each multiprocessor has:
  – A set of 32-bit registers per processor
  – On-chip shared memory
    • Where the shared memory space resides
  – A read-only constant cache
    • To speed up access to the constant memory space
  – A read-only texture cache
    • To speed up access to the texture memory space
[Figure: device with Multiprocessor 1 … N; each has per-processor registers, shared memory, a constant cache, a texture cache, and an instruction unit, all backed by device memory holding the global, constant, and texture spaces.]
Hardware Implementation:
Execution Model (review)
• Each thread block of a grid is split into warps, each of which gets executed by one multiprocessor (SM)
  – The device processes only one grid at a time
• Each thread block is executed by one multiprocessor
  – So that the shared memory space resides in the on-chip shared memory
• A multiprocessor can execute multiple blocks concurrently
  – Shared memory and registers are partitioned among the threads of all concurrent blocks
  – So, decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently
Threads, Warps, Blocks
• There are (up to) 32 threads in a Warp
  – Only < 32 when there are fewer than 32 total threads
• There are (up to) 16 Warps in a Block
• Each Block (and thus, each Warp) executes on a single SM
• G80 has 16 SMs
• At least 16 Blocks are required to “fill” the device
• More is better
  – If resources (registers, thread space, shared memory) allow, more than one Block can occupy each SM
Language Extensions:
Built-in Variables
• dim3 gridDim;
– Dimensions of the grid in blocks (gridDim.z
unused)
• dim3 blockDim;
– Dimensions of the block in threads
• dim3 blockIdx;
– Block index within the grid
• dim3 threadIdx;
– Thread index within the block
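For example, a hedged sketch (the kernel and parameter names are illustrative) of combining the four built-ins to give every thread a unique element index and to stride over an array larger than the grid:

__global__ void scale(float *data, int n, float s)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the grid
    int stride = gridDim.x * blockDim.x;                 // total number of threads launched

    for (int i = idx; i < n; i += stride)                // cover n elements with any grid size
        data[i] *= s;
}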
Common Runtime Component
• Provides:
– Built-in vector types
– A subset of the C runtime library supported
in both host and device codes
Common Runtime Component:
Built-in Vector Types
• [u]char[1..4], [u]short[1..4],
[u]int[1..4], [u]long[1..4],
float[1..4]
– Structures accessed with x, y, z, w fields:
uint4 param;
int y = param.y;
• dim3
– Based on uint3
– Used to specify dimensions
Common Runtime Component:
Mathematical Functions
• pow, sqrt, cbrt, hypot
• exp, exp2, expm1
• log, log2, log10, log1p
• sin, cos, tan, asin, acos, atan, atan2
• sinh, cosh, tanh, asinh, acosh, atanh
• ceil, floor, trunc, round
• Etc.
  – When executed on the host, a given function uses the C runtime implementation if available
  – These functions are only supported for scalar types, not vector types
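As a hedged illustration (the function name and qualifiers below are assumptions, not from the slides), the same scalar math call can be compiled for both host and device code:

// Callable from host code and from kernels; on the host the call resolves to
// the C runtime, on the device to the CUDA math library.
__host__ __device__ float norm2(float x, float y)
{
    return sqrtf(x * x + y * y);
}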
Host Runtime Component:
Memory Management
• Device memory allocation
– cudaMalloc(), cudaFree()
• Memory copy from host to device, device to
host, device to device
– cudaMemcpy(), cudaMemcpy2D(),
cudaMemcpyToSymbol(),
cudaMemcpyFromSymbol()
• Memory addressing
– cudaGetSymbolAddress()
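A minimal sketch of symbol-based copying (the symbol and function names are illustrative): host data is copied into a __constant__ variable with cudaMemcpyToSymbol().

__constant__ float coeff[16];               // per-grid constant memory

void upload_coefficients(const float *h_coeff)
{
    // Copy 16 floats from host memory into the constant-memory symbol.
    cudaMemcpyToSymbol(coeff, h_coeff, 16 * sizeof(float));
}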
Device Runtime Component:
Mathematical Functions
• Some mathematical functions (e.g.
sin(x)) have a less accurate, but faster
device-only version (e.g. __sin(x))
  – __pow
  – __log, __log2, __log10
  – __exp
  – __sin, __cos, __tan
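As a hedged illustration (in the shipping CUDA math library the single-precision intrinsics carry an f suffix, e.g. __sinf, __expf; the kernel below is illustrative):

__global__ void phase(float *out, const float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(x[i]);    // fast, less accurate device-only intrinsic
        // out[i] = sinf(x[i]);   // slower, more accurate library version
}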
Device Runtime Component:
Synchronization Function
• void __syncthreads();
• Synchronizes all threads in a block
• Once all threads have reached this point,
execution resumes normally
• Used to avoid RAW/WAR/WAW hazards when
accessing shared or global memory
• Allowed in conditional constructs only if the
conditional is uniform across the entire thread
block
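A minimal sketch of the hazard being avoided (hypothetical kernel that reverses each block-sized tile in place through shared memory; assumes the array length is a multiple of the block size):

#define BLOCK 128

__global__ void reverse_tiles(float *data)
{
    __shared__ float tmp[BLOCK];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tmp[threadIdx.x] = data[i];
    __syncthreads();              // all writes to tmp finish before any thread reads it

    data[i] = tmp[blockDim.x - 1 - threadIdx.x];
}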
Some Useful Information on
Tools
Compilation
• Any source file containing CUDA language
extensions must be compiled with nvcc
• nvcc is a compiler driver
– Works by invoking all the necessary tools and
compilers like cudacc, g++, cl, ...
• nvcc can output:
– Either C code
• That must then be compiled with the rest of the application
using another tool
– Or object code directly
Linking
• Any executable with CUDA code requires
two dynamic libraries:
– The CUDA runtime library (cudart)
– The CUDA core library (cuda)
Debugging Using the
Device Emulation Mode
• An executable compiled in device emulation
mode (nvcc -deviceemu) runs completely on
the host using the CUDA runtime
  – No need of any device and CUDA driver
  – Each device thread is emulated with a host thread
• When running in device emulation mode, one can:
  – Use host native debug support (breakpoints, inspection, etc.)
  – Access any device-specific data from host code and vice-versa
  – Call any host function from device code (e.g. printf) and vice-versa
  – Detect deadlock situations caused by improper usage of __syncthreads
Device Emulation Mode Pitfalls
• Emulated device threads execute sequentially, so
simultaneous accesses of the same memory
location by multiple threads could produce
different results.
• Dereferencing device pointers on the host or host
pointers on the device can produce correct
results in device emulation mode, but will
generate an error in device execution mode
• Results of floating-point computations will slightly
differ because of:
– Different compiler outputs, instruction sets
– Use of extended precision for intermediate results
• There are various options to force strict single precision on
the host