
CUDA Programming
Floating Point Operations for the CPU and the GPU
Memory Bandwidth for the CPU and the GPU
GPU Devotes More Transistors to Data Processing
Hillis’ Thesis ’85
(back to the future!)
[Figure: how a piece of silicon is organized as a sequential computer vs. a parallel computer]
• Proposed "The Connection Machine": a massive number of processors, each
with a small memory, operating in SIMD mode.
• The CM-1 and CM-2 machines from Thinking Machines Corporation (TMC) were
examples of this architecture, with 32K-128K processors.
CUDA Supports Various Languages or Application
Programming Interfaces
Automatic Scalability
A multithreaded program is partitioned into blocks of threads that execute
independently from each other, so that a GPU with more cores will automatically
execute the program in less time than a GPU with fewer cores.
• NVIDIA GPUs have a number of multiprocessors, each of which executes in
parallel with the others.
• On Tesla, each multiprocessor has a group of 8 stream processors;
a Fermi multiprocessor has two groups of 16 stream processors.
• A core refers to a stream processor. High-end Tesla accelerators have 30
multiprocessors, for a total of 240 cores.
• A high-end Fermi has 16 multiprocessors, for a total of 512 cores.
• Each core can execute a sequential thread, but the cores execute in what NVIDIA
calls SIMT (Single Instruction, Multiple Thread) fashion; all cores in the same
group execute the same instruction at the same time, much like classical SIMD
processors.
• SIMT handles conditionals somewhat differently than SIMD, though the effect is
much the same: some cores are disabled for conditional operations.
Compute Capability
• The compute capability of a device is defined by a major revision number
and a minor revision number.
• Devices with the same major revision number are of the same core
architecture.
• The major revision number of devices based on the Fermi architecture is 2.
• Prior devices are all of compute capability 1.x (their major revision
number is 1).
• The minor revision number corresponds to an incremental improvement
to the core architecture, possibly including new features.
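As an aside (not on the slides), a program can query these revision numbers at runtime through the standard cudaGetDeviceProperties call; a minimal sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major / prop.minor are the revision numbers described above
        printf("Device %d (%s): compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}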
CUDA-Enabled Devices with Compute Capability,
Number of Multiprocessors, and Number of CUDA Cores
Features and
Technical Specifications
Grid of Thread Blocks
Memory Hierarchy
Programming Model
• Heterogeneous Programming
• Serial code executes on the host while
parallel code executes on the device.
CUDA Thread Organization
• A thread block can have up to 512 threads (on compute capability 1.x devices)
Kernels
• CUDA C extends C by allowing the programmer to define C functions, called kernels,
that, when called, are executed N times in parallel by N different CUDA threads, as
opposed to only once like regular C functions.
• A kernel is defined using the __global__ declaration specifier, and the number of
CUDA threads that execute that kernel for a given kernel call is specified using a new
<<<…>>> execution configuration syntax.
• Each thread that executes the kernel is given a unique thread ID that is accessible
within the kernel through the built-in threadIdx variable.
Vector Addition Example
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C) {
int i = threadIdx.x;
C[i] = A[i] + B[i];
}
int main()
{
...
...
// Kernel invocation with N threads
VecAdd<<<1, N>>>(A, B, C);
}
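The "…" in main elides the host-side setup. A minimal sketch of what it typically contains, using hypothetical host arrays h_A, h_B, h_C (device allocation, transfers, launch, and cleanup):

int main()
{
    const int N = 256;
    size_t size = N * sizeof(float);
    float h_A[N], h_B[N], h_C[N];   // hypothetical host arrays
    // ... initialize h_A and h_B ...
    float *A, *B, *C;               // device pointers
    cudaMalloc((void**)&A, size);
    cudaMalloc((void**)&B, size);
    cudaMalloc((void**)&C, size);
    cudaMemcpy(A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(B, h_B, size, cudaMemcpyHostToDevice);
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    cudaMemcpy(h_C, C, size, cudaMemcpyDeviceToHost);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}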
Thread Hierarchy
• threadIdx is a 3-component vector, so threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block.
• For a one-dimensional block of size Dx, the thread ID of a thread of index (x) is
x
• For a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is
(x + y*Dx)
• For a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index
(x, y, z) is
(x + y*Dx + z*Dx*Dy)
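These three formulas can be collected into a small device helper (a sketch, not from the slides; the function name is made up):

// Flatten a 3D thread index into the linear thread ID defined above
__device__ unsigned int linearThreadId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}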
Matrix Addition Example Using 1 Block
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N]) {
int i = threadIdx.x;
int j = threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
Matrix Addition Example Using Multiple Blocks
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i < N && j < N)
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation
dim3 threadsPerBlock(16, 16);
dim3 numBlocks(N/threadsPerBlock.x, N/threadsPerBlock.y);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}
Overview of the CUDA Device Memory Model
CUDA API Functions for Device Global
Memory Management.
CUDA API Functions for Data Transfer
Between Memories.
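The API tables from these two slides are not reproduced here; the functions they cover are cudaMalloc and cudaFree for device global memory management, and cudaMemcpy for transfers. A minimal usage sketch (h_buf and d_buf are hypothetical names):

float h_buf[1024];                                      // host buffer
float* d_buf;                                           // device pointer
size_t size = sizeof(h_buf);
cudaMalloc((void**)&d_buf, size);                       // allocate device global memory
cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice); // host-to-device transfer
// ... launch kernels that read and write d_buf ...
cudaMemcpy(h_buf, d_buf, size, cudaMemcpyDeviceToHost); // device-to-host transfer
cudaFree(d_buf);                                        // release device memory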
Matrix-Matrix Multiplication Example
void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
int size = Width * Width * sizeof(float);
float *Md, *Nd, *Pd;
// 1. Allocate device memory and load M and N to the device
cudaMalloc((void**)&Md, size);
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMalloc((void**)&Nd, size);
cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
// Allocate P on the device
cudaMalloc((void**)&Pd, size);
// 2. Kernel invocation code – to be shown later
…
// 3. Read P from the device
cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
// Free device matrices
cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
Kernel Function
// Matrix multiplication kernel – thread specification
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
// 2D Thread ID
int tx = threadIdx.x;
int ty = threadIdx.y;
// Pvalue stores the Pd element that is computed by the thread
float Pvalue = 0;
for (int k = 0; k < Width; ++k) {
float Mdelement = Md[ty * Width + k];
float Ndelement = Nd[k * Width + tx];
Pvalue += Mdelement * Ndelement;
}
// Write the result to device memory; each thread writes one element
Pd[ty * Width + tx] = Pvalue;
}
Host Code that Launches a Kernel
// Setup the execution configuration
dim3 dimBlock(WIDTH, WIDTH);
dim3 dimGrid(1, 1);
// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WIDTH);
Block / Grid Definition Examples
dim3 dimBlock(4, 2, 2);
dim3 dimGrid(2, 2, 1);
KernelFunction<<<dimGrid, dimBlock>>>(…);
dim3 dimBlock(16, 16, 1);
dim3 dimGrid(100, 1, 1);
KernelFunction<<<dimGrid, dimBlock>>>(…);
• Note: the dimensions can be given as the contents of runtime variables;
they do not need to be compile-time constants, as the sketch below shows.
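For instance, a sketch (with a hypothetical kernel and wrapper) that derives the grid dimensions from a runtime problem size n, using a ceiling division so that all n x n elements are covered:

__global__ void KernelFunction(float* data) { /* ... */ }

void launch(float* d_data, int n)   // hypothetical wrapper
{
    dim3 dimBlock(16, 16, 1);
    // Ceiling division: enough 16x16 blocks to cover n x n elements
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x,
                 (n + dimBlock.y - 1) / dimBlock.y, 1);
    KernelFunction<<<dimGrid, dimBlock>>>(d_data);
}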
Matrix-Matrix Multiplication on Multiple Blocks
A simple example of using multiple
blocks to calculate Pd
__global__ void MatrixMulKernel(float* Md, float* Nd,
float* Pd, int Width)
{
// Calculate the row index of the Pd element (and of M)
int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
// Calculate the column index of Pd (and of N)
int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;
float Pvalue = 0;
// Each thread computes one element of the block sub-matrix
for (int k = 0; k < Width; ++k)
Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
Pd[Row * Width + Col] = Pvalue;
}
CUDA Memories
CUDA Variable Type Qualifiers
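The qualifier table from this slide is not reproduced here; a sketch (not from the slides) illustrating where each qualifier places a variable:

__constant__ float coeff[16];   // constant memory: read-only in kernels, cached
__device__ float scaleFactor;   // device global memory: visible to all threads
                                // (both set from the host, e.g. via cudaMemcpyToSymbol)

__global__ void scale(float* out)
{
    __shared__ float tile[256];              // shared memory: one copy per thread block
    float local = coeff[0] * scaleFactor;    // automatic variable: lives in a register
    tile[threadIdx.x] = local;
    __syncthreads();
    out[threadIdx.x] = tile[threadIdx.x];
}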
Small Matrix Multiplication Example
• Assume 2x2 blocks
• Observation: thread(0,0) and thread(1,0) both access row 0 of Md.
• Both threads access these Md elements from global memory.
• If we manage to have thread(0,0) and thread(1,0) collaborate so that these Md
elements are loaded from global memory only once, we can reduce the total
number of accesses to global memory by half.
• In general, every Md and Nd element is accessed exactly twice
during the execution of block(0,0). Therefore, if all four threads
collaborate in their accesses to global memory, we can reduce the traffic to global
memory by half.
Global memory accesses performed by threads in block(0,0)
• The potential reduction of global memory traffic in matrix multiplication is
proportional to the dimension of the blocks used.
• With NxN blocks, the potential reduction of global memory traffic would be N.
Tiling Md and Nd to utilize shared memory
• Let the threads collaboratively load
Md and Nd elements into shared memory
before they individually use these elements
in their dot-product calculations.
• Note that shared memory is quite small,
and one must be careful not to exceed
its capacity when loading these
Md and Nd elements into it.
• This is accomplished by dividing
the Md and Nd matrices into smaller tiles.
• The size of these tiles is chosen so that
they fit into shared memory.
• For simplicity, the tile dimensions can be chosen to be
equal to the block dimensions.
Calculation of the matrix indices in tiled multiplication
Execution phases of a tiled matrix multiplication algorithm (block size=2)
Tiled Matrix Multiplication Kernel using shared memories
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
int bx = blockIdx.x; int by = blockIdx.y;
int tx = threadIdx.x; int ty = threadIdx.y;
// Identify the row and column of the Pd element to work on
int Row = by * TILE_WIDTH + ty;
int Col = bx * TILE_WIDTH + tx;
float Pvalue = 0;
// Loop over the Md and Nd tiles required to compute the Pd element
for (int m = 0; m < Width/TILE_WIDTH; ++m) {
// Collaborative loading of Md and Nd tiles into shared memory
Mds[ty][tx] = Md[Row*Width + m*TILE_WIDTH + tx];
Nds[ty][tx] = Nd[(m*TILE_WIDTH + ty)*Width + Col];
__syncthreads();   // wait until the whole tile is loaded
for (int k = 0; k < TILE_WIDTH; ++k)
Pvalue += Mds[ty][k] * Nds[k][tx];
__syncthreads();   // wait before overwriting the tile in the next iteration
}
Pd[Row*Width + Col] = Pvalue;
}
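A host-side launch sketch for this kernel (assuming, as the loop bound above does, that Width is a multiple of TILE_WIDTH):

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);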
Thread Assignment in GeForce-8 Series GPU Devices
• In the GeForce-8 series hardware, the execution resources are organized into
Streaming Multiprocessors (SMs). For example:
• The GeForce 8800GTX implementation has 16 Streaming Multiprocessors.
Up to 8 blocks can be assigned to each SM, as long as there are enough
resources to satisfy the needs of all the blocks.
With 16 SMs in a GeForce 8800GTX processor, up to 128 blocks can be
simultaneously assigned to the Streaming Multiprocessors.
Thread Assignment in GeForce-8 Series GPU Devices
• Most grids contain many more than 128 blocks.
• The run-time system maintains a list of blocks that need to execute and
assigns new blocks to Streaming Multiprocessors as they complete the
execution of blocks previously assigned to them.
• In the GeForce 8800GTX design, up to 768 threads can be assigned to
each SM.
• This could be in the form of :
3 blocks of 256 threads each, 6 blocks of 128 threads each, etc.
• With 16 SMs in GeForce 8800 GTX, there can be up to 12,288 threads
simultaneously residing in SMs for execution.
Thread Scheduling
• Implementation specific.
• In the GeForce 8800GTX, once a block is assigned to an SM, it is further divided
into 32-thread units called warps.
• Warps are the unit of thread scheduling in SMs.
• Each warp consists of 32 threads with consecutive threadIdx values: threads 0 through 31
form the first warp, 32 through 63 the second warp, and so on.
• Example:
If each block has 256 threads, we can determine the number of
warps that reside in each SM.
Each block has 256/32 = 8 warps.
With three blocks in each SM, we have 8*3 = 24 warps in each SM.
This is in fact the maximum number of warps that can reside in each SM in the
GeForce 8800GTX, since there can be no more than 768 threads in each SM,
which amounts to 768/32 = 24 warps.
• SMs are designed such that only one of these warps is actually executed by
the hardware at any point in time.
Warp Based Thread Scheduling
• Why do we need so many warps (if only one of them can execute at any point in
time)?
• This is how these processors efficiently execute long-latency operations such as
accesses to global memory.
• When an instruction executed by the threads in a warp needs to wait for
the result of a previously initiated long-latency operation, the warp is placed into a
waiting area.
• One of the other resident warps that is no longer waiting for results is selected for
execution.
• If more than one warp is ready for execution, a priority mechanism is used to
select one.
Other Models

Compute Capability                          1.0      1.1      1.2      1.3
Threads / Warp                               32       32       32       32
Warps / Multiprocessor                       24       24       32       32
Threads / Multiprocessor                    768      768     1024     1024
Thread Blocks / Multiprocessor                8        8        8        8
Shared Memory / Multiprocessor (bytes)    16384    16384    16384    16384
Register File Size (32-bit registers)      8192     8192    16384    16384
Divergence in Execution
• At any point in time, the hardware selects and executes one warp at a time.
• An instruction is run for all threads in the same warp before moving to the next
instruction.
• This style of execution is motivated by hardware cost constraints: it allows the cost of
fetching and processing an instruction to be amortized over a large number of
threads.
• It works well when all threads within a warp follow the same control-flow path when
working through their data.
• For an if-then-else construct, execution works well when either all threads execute
the then part or all execute the else part.
• When threads within a warp take different control-flow paths, that is, when some
threads execute the then part and others execute the else part, this simple execution
style no longer works well. In such situations, the execution of the warp requires
multiple passes through the divergent paths.
• One pass is needed for the threads that follow the then part and another pass
for those that follow the else part. These passes are sequential to each other and thus
add to the execution time.
• When threads in the same warp follow different paths of control flow, we say that
these threads diverge in their execution (see the sketch below).
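A sketch (not from the slides) contrasting a branch that diverges within a warp with one that is uniform across each 32-thread warp:

__global__ void branching(float* data)
{
    int tid = threadIdx.x;

    // Diverges: even and odd lanes of the same warp take different paths,
    // so the warp makes one pass for the then part and another for the else part
    if (tid % 2 == 0)
        data[tid] *= 2.0f;
    else
        data[tid] += 1.0f;

    // Does not diverge: all 32 threads of a warp share the same value of tid/32,
    // so each warp takes a single path
    if ((tid / 32) % 2 == 0)
        data[tid] -= 1.0f;
}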
All Reduce
(total on all threads)
// Butterfly all-reduce within one block: after log2(N) exchange rounds,
// every A[i] holds the sum of all N elements.
// Assumes a single block of N threads, with N a power of two.
__global__ void sumall(int* A, unsigned int N) {
for (N = N >> 1; N; N = N >> 1) {
int temp = A[threadIdx.x] + A[threadIdx.x ^ N];
__syncthreads();   // all reads complete before any write
A[threadIdx.x] = temp;
__syncthreads();   // all writes complete before the next round
}
}
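A usage sketch: one block of N threads, with N a power of two so that every thread executes each __syncthreads():

int h_A[8] = {1, 2, 3, 4, 5, 6, 7, 8};
int* d_A;
cudaMalloc((void**)&d_A, sizeof(h_A));
cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
sumall<<<1, 8>>>(d_A, 8);   // butterfly reduction over 8 elements
cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
// Every h_A[i] now holds the total, 36
cudaFree(d_A);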
Memory Coalescing
• If an application makes use of data from multiple consecutive locations before
moving on to other locations, the DRAMs can supply the data at a much higher rate
than if a truly random sequence of locations were accessed.
• In order to achieve anything close to the advertised 86.4 GB/s global memory
bandwidth of G80, a kernel must arrange its data accesses so that each request to the
DRAMs is for a large number of consecutive DRAM locations.
• The NVIDIA hardware detects whether threads access consecutive global memory
locations.
• When they do, the hardware combines, or coalesces, all these accesses into a
consolidated access to the DRAMs that requests all the consecutive locations involved.
Matrix Layout
Memory Coalescing
Global memory accesses will be coalesced into a single access (one transaction) if:
– The size of the memory element accessed by each thread is 4, 8, or 16 bytes
– The elements form a contiguous block of memory
– The Nth element is accessed by the Nth thread in the half-warp
– The address of the first element is aligned to 16 times the element’s size
A half-warp of 16 threads can coordinate global memory accesses into a single transaction.
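A sketch (not from the slides) contrasting an access pattern that satisfies these conditions with a strided one that does not:

// Coalesced: the Nth thread of each half-warp reads the Nth consecutive float
__global__ void copyCoalesced(float* in, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];
}

// Not coalesced: consecutive threads read locations stride elements apart,
// so each half-warp touches 16 scattered locations
__global__ void copyStrided(float* in, float* out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];
}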