CUDA
All you wanted to know about it, but were afraid to ask!
Paulo Ivson Netto Santos
Waldemar Celes Filho
Nov 2007
CUDA is aimed at GPGPU
What is GPGPU?

General Purpose computation using GPU
– Applications other than 3D graphics
– GPU accelerates critical path of application

Data parallel algorithms leverage GPU attributes
– Large data arrays, streaming throughput
– Fine-grain SIMD parallelism
– Floating point (FP) computation

Applications – see GPGPU.org
– Game effects (FX) physics, image processing
– Physical modeling, computational engineering, matrix
algebra, convolution, correlation, sorting, etc, etc
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Importance of Data Parallelism

GPUs are designed for graphics
– Highly parallel tasks

Data-parallel processing
– GPU architecture is ALU-heavy
 Multiple pipelines, multiple ALUs per pipe
– Large memory latency
– HUGE memory bandwidth
– Hide memory latency (with more computation)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
CPU vs GPU
Design Strategies and Tactics
 CPU Strategy: Make a few threads run fast
– Tactics – minimize latency
 Big Cache – build for hit
 Instruction/Data Prefetch
 Speculative Execution
 limited by “perimeter” – communication bandwidth
 GPU Strategy: Make many threads run fast
– Tactics – maximize throughput
 Small Cache – build for miss
 Parallelism (1000s of threads)
 Pipelining
 limited by “area” – compute capability
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
What does a GPU look like?
(from a graphics point of view)
GeForce 7800 GTX Parallelism
[Block diagram: 8 Vertex Engines → Triangle Setup/Raster → Z-Cull → Shader Instruction Dispatch → 24 Pixel Shaders → Fragment Crossbar → 16 Raster Operation Pipelines → 4 Memory Partitions]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
G80 replaces the pipeline model
 The future of GPUs is programmable processing
 So – build the architecture around the processor
[Block diagram of the G80: Host → Data Assembler → Setup/Rstr/ZCull and Vtx/Geom/Pixel Thread Issue → Thread Processor; arrays of SPs grouped with TF (texture fetch) units and L1 caches; L2 caches and FB (framebuffer) memory partitions]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Work Distribution for Graphics
 Vertices are serially distributed to all the SMs
– SPA processes vertices in parallel
– Vertices are serially gathered from the SMs
– And sent to Primitive Setup
 Pixels are serially distributed in parallel tiles
– SPA processes pixels in parallel
– Pixels are sent to ROP/FB
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
G80 vs. G7x
                         GeForce 7       GeForce 8800
Shader Model             SM3             SM4
Vertex Shaders           8               128*
HDR Texture Filtering    6ppc            32ppc
Dedicated Shader Pipes   24              128*
ROP Processing           Up to 32ppc     Up to 192ppc
Memory Bandwidth         51 GB/sec       96 GB/sec
Compressed Bandwidth     204 GB/sec      768 GB/sec
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Common GPGPU Constraints

Dealing with graphics API
– Working with the corner cases of the graphics API

Addressing modes
– Limited texture size/dimension

Shader capabilities
– Limited outputs

Instruction sets
– Lack of Integer & bit ops

Communication limited
– Between pixels
– Scatter a[i] = p
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Just what is CUDA anyway?


“Compute Unified Device Architecture”
General purpose programming model
– User kicks off batches of threads on the GPU
– GPU is viewed as a dedicated super-threaded co-processor

Targeted software stack
– Compute oriented drivers, language, and tools

Driver for loading computation programs into GPU
–
–
–
–
–
–
Standalone driver - optimized for computation
Interface designed for compute - graphics free API
Data sharing with OpenGL buffer objects
Guaranteed maximum download & readback speeds
Explicit GPU memory management
Debugging support on the CPU!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Performance: CUDA/G80 Advantage Over Dual Core
[Bar chart: speedups of CUDA on a G80 over a dual-core CPU, ranging from roughly 10x to 197x across the application areas below]
– Rigid Body Physics Solver
– Matrix Numerics (BLAS1: 60+ GB/s, BLAS3: 100+ GFLOPS)
– Wave Equation (FDTD: 1.2 Gcells/s, FFT: 52 GFLOPS)
– Biological Sequence Match (SSEARCH: 5.2 Gcells/s)
– Finance (Black Scholes: 4.7 GOptions/s)
(GFLOPS as defined by benchFFT)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
GPU: A Highly Multithreaded Coprocessor

The GPU is viewed as a compute device that:
– Is a coprocessor to the CPU or host
– Has its own DRAM (device memory)
– Runs many threads in parallel


Identify data-parallel portions of an application
Execute them on the device as kernels
– Which run in parallel on many threads

Differences between GPU and CPU threads
– GPU threads are extremely lightweight
 Very little creation overhead
– GPU needs 1000s of threads for full efficiency
 Multi-core CPU needs only a few
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Thread Batching: Grids and Blocks
 Grid of thread blocks
– Corresponds to one kernel
– All threads access global memory
 Thread block
– A batch of threads that can cooperate with each other
– Share data through a low-latency shared memory
– Barrier synchronization for hazard-free shared memory accesses
 Threads from different blocks cannot cooperate
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
[Figure: the host launches Kernel 1 on Grid 1 (a 3x2 array of blocks) and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 5x3 array of threads]
Courtesy: NVIDIA
Block and Thread IDs
 Threads and blocks have IDs
– Each thread can decide what data to work on
– Block ID: 1D or 2D
– Thread ID: 1D, 2D, or 3D
 Multidimensional data
– Image processing
– Solving PDEs on volumes
– …
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
[Figure: same grid/block/thread layout as the previous slide — Grid 1 with a 3x2 array of blocks, and Block (1, 1) expanded into a 5x3 array of threads]
Courtesy: NVIDIA
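For example, a kernel working on 2D data such as an image would typically combine blockIdx and threadIdx to pick the element each thread handles. A minimal sketch — the kernel and data layout are illustrative, not from the slides:

__global__ void brighten(float* image, int width, int height)
{
    // One thread per pixel: 2D block and thread IDs map to pixel coordinates
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        image[y * width + x] += 0.1f;
}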
CUDA Device Memory
Overview

 Each thread can:
– R/W per-thread registers
– R/W per-thread local memory
– R/W per-block shared memory
– R/W per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory
 The host can R/W global, constant, and texture memories
[Figure: CUDA memory model — each thread has registers and local memory, each block has shared memory, and the grid shares global, constant, and texture memories, which the host can also access]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Global, Constant, and Texture
Memories
 Global memory
– Communicating data between host and device
– Visible to all threads
 Texture and Constant memories
– Read-only data initialized by host
– Visible to all threads
[Figure: same CUDA memory model diagram as the previous slide]
Courtesy: NVIDIA
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
A Common Programming Pattern
 Local and global memory reside in DRAM
– Much slower access than shared memory
 Profitable way of performing computation
– Block data and computation to take advantage of fast shared memory
– Partition data into data subsets that fit into shared memory
– Handle each data subset with one thread block by:
 Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
 Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element
 Copying results from shared memory to global memory
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
A Common Programming Pattern
 Texture and Constant memory also reside in device memory (DRAM)
– Much slower access than shared memory
– But… cached!
– Highly efficient access for read-only data
 Carefully divide data according to access patterns (see the sketch below)
– R/O no structure → constant memory
– R/O array structured → texture memory
– R/W shared within Block → shared memory
– R/W registers spill to local memory
– R/W inputs/results → global memory
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
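A minimal sketch of what that division looks like in source; the kernel, array names, and sizes are illustrative and not part of the slides:

__constant__ float alpha;             // R/O, no structure -> constant memory
texture<float, 1> inputTex;           // R/O, array structured -> texture memory

__global__ void scaleInput(float* result)        // R/W inputs/results -> global memory
{
    __shared__ float tile[256];       // R/W, shared within the block -> shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = tex1Dfetch(inputTex, i);  // stage read-only input through the texture cache
    __syncthreads();
    result[i] = alpha * tile[threadIdx.x];        // per-thread temporaries live in registers
}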
That’s it!
Or not... so many things still
missing!
1. How to code?
• API, SDK, etc
2. How does it actually work in the GPU?
• HW details that make all the difference
3. How to get the best of it?
• Tips and tricks to get those GFLOPs!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
CUDA API
Extended C
 Declspecs
– global, device, shared, local, constant
 Keywords
– threadIdx, blockIdx
 Intrinsics
– __syncthreads
 Runtime API
– Memory, symbol, execution management
 Function launch

__device__ float filter[N];

__global__ void convolve (float *image) {
  __shared__ float region[M];
  ...
  region[threadIdx] = image[i];
  __syncthreads()
  ...
  image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes)

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Extended C
[Compilation flow: integrated source (foo.cu) → cudacc (EDG C/C++ frontend, Open64 Global Optimizer) → CPU host code (foo.cpp), compiled by gcc / cl, plus GPU assembly (foo.s), compiled by OCG into G80 SASS (foo.sass)]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Device Memory Allocation
 cudaMalloc()
– Allocates the device Global Memory
– Requires two parameters
 Address of a pointer to the allocated object
 Size of allocated object
 cudaFree()
– Frees object from device Global Memory
[Figure: CUDA memory model, as shown earlier]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Device Memory Allocation

Code example:
– Allocate 256 * 256 single precision float array
– Use “d” suffix to indicate device data structure
float* elementsd;
int size = 256 * 256 * sizeof(float);
cudaMalloc( (void**)&elementsd, size );
cudaFree( elementsd );
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Host-Device Data Transfer
 cudaMemcpy()
– Memory data transfer
– Requires four parameters
1. Pointer to destination
2. Pointer to source
3. Number of bytes copied
4. Type of transfer
– Host to Host
– Host to Device
– Device to Host
– Device to Device
[Figure: CUDA memory model, as shown earlier]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Host-Device Data Transfer
(cont.)

 Code example:
– Transfer a 64 * 64 single precision float array
– elements is in host memory
– elementsd is in device memory
– cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants
cudaMemcpy( elementsd, elements, size, cudaMemcpyHostToDevice );
cudaMemcpy( elements, elementsd, size, cudaMemcpyDeviceToHost );
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
CUDA Function Declarations
                                   Executed on the:   Only callable from the:
__device__ float DeviceFunc()      device             device
__global__ void KernelFunc()       device             host
__host__ float HostFunc()          host               host

 __global__ defines a kernel function
– Must return void
 __device__ and __host__ can be used together
– __host__ is optional
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
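A minimal sketch of how the three qualifiers combine; the function names are illustrative, not from the slides:

__device__ float square(float x)            // runs on the GPU, callable only from device code
{
    return x * x;
}

__host__ __device__ float clampPositive(float x)   // compiled for both host and device
{
    return x < 0.0f ? 0.0f : x;
}

__global__ void scaleAll(float* data)       // kernel: runs on the device, launched from the host
{
    data[threadIdx.x] = square(clampPositive(data[threadIdx.x]));
}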
CUDA Function Declarations
 __device__ functions cannot have their address taken
 For functions executed on the device:
– No recursion
– No static variable declarations inside the function
– No variable number of arguments
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Calling a Kernel – Thread Creation

Kernel functions are called with an execution configuration
__global__ void KernelFunc(...);
dim3 DimGrid(100, 50); // 5000 thread blocks
dim3 DimBlock(4, 8, 8); // 256 threads per block
size_t SharedMemBytes = 64; // 64 bytes of shared memory
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...);

Calls to a kernel function are asynchronous
– But only one kernel active at a time per GPU
– Implicit synchronizations
 Second kernel launch
 Memory read backs
– Explicit synchronizations

cudaThreadSynchronize()
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
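A small sketch of relying on an explicit synchronization after an asynchronous launch; the kernel and pointer names are illustrative:

// Control returns to the host immediately after the launch
KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(d_data);
cudaThreadSynchronize();   // block the host until the kernel has finished
// Only now is it safe to time the kernel or reuse d_data from the host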
Some Additional API
Features
math functions, thread and block
ids, etc
Application Programming Interface

The API is an extension to the C programming
language
 It consists of:
– Language extensions
 To target portions of the code for execution on the device
– A runtime library split into:
 A common component providing built-in vector types and a
subset of the C runtime library in both host and device
codes
 A host component to control and access one or more
devices from the host
 A device component providing device-specific functions
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Language Extensions:
Built-in Variables

dim3 gridDim;
– Dimensions of the grid in blocks
– Grids are at most 2D! gridDim.z is unused

dim3 blockDim;
– Dimensions of the block in threads

dim3 blockIdx;
– Block index within the grid

dim3 threadIdx;
– Thread index within the block
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
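A common way to combine these built-in variables (a sketch; the kernel is illustrative, not from the slides) is a global index plus a grid-wide stride, so a 1D kernel can cover arrays larger than the grid:

__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    int stride = gridDim.x * blockDim.x;                  // total number of threads in the grid
    for (; i < n; i += stride)                            // grid-stride loop
        y[i] = a * x[i] + y[i];
}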
Common Runtime Component

Provides:
– Built-in vector types
– A subset of the C runtime library supported
in both host and device codes
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Built-in Vector Types

[u]char[1..4], [u]short[1..4],
[u]int[1..4], [u]long[1..4],
float[1..4]
– Structures accessed with x, y, z, w fields:
uint4 param;
int y = param.y;

dim3
– Based on uint3
– Used to specify dimensions
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
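A small sketch of the constructor-style helpers and of dim3 defaults (the values are illustrative):

float4 v   = make_float4(1.0f, 2.0f, 3.0f, 4.0f);   // build a vector value on host or device
float len2 = v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
dim3 block(16, 16);   // unspecified components default to 1, so this is 16 x 16 x 1
dim3 grid(64, 64);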
Mathematical Functions
 pow, sqrt, cbrt, hypot
 exp, exp2, expm1
 log, log2, log10, log1p
 sin, cos, tan, asin, acos, atan, atan2
 sinh, cosh, tanh, asinh, acosh, atanh
 ceil, floor, trunc, round
 Etc.
– When executed on the host, a given function uses the C runtime implementation if available
– These functions are only supported for scalar types, not vector types
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Host Runtime Component
 Provides functions to deal with:
– Device management (including multi-device systems)
– Memory management
– Error handling
 Initializes the first time a runtime function is called
 A host thread can invoke a kernel on only one device
– Multiple host threads required to run on multiple devices
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
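For error handling, every runtime call returns a cudaError_t. A minimal sketch of checking it; the wrapper macro is illustrative, not part of the API:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper: print a readable message and abort on any runtime error
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc((void**)&devPtr, bytes));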
Memory Management

Device memory allocation
– cudaMalloc(), cudaFree()

Memory copy from host to device, device to host,
device to device
– cudaMemcpy(), cudaMemcpy2D(),
cudaMemcpyToSymbol(), cudaMemcpyFromSymbol()

Memory addressing
– cudaGetSymbolAddress()
– Used to transfer data to constant memory
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
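A short sketch of moving data into constant memory through a symbol; the array name and size are illustrative (the same pattern appears later in the constant-memory matrix multiplication example):

__constant__ float coeffs[64];                  // device constant memory

void uploadCoeffs(const float* hostCoeffs)
{
    // Copy 64 floats from host memory into the constant-memory symbol
    cudaMemcpyToSymbol(coeffs, hostCoeffs, 64 * sizeof(float));
}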
Device Mathematical Functions

Some mathematical functions (e.g.
sin(x)) have a less accurate, but faster
device-only version (e.g. __sin(x))
– __pow
– __log, __log2, __log10
– __exp
– __sin, __cos, __tan
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Device Synchronization Function
 void __syncthreads();
 Synchronizes all threads in a block
 Once all threads have reached this point, execution resumes normally
 Avoid RAW/WAR/WAW hazards when accessing shared or global memory
 Allowed in conditional constructs only if the conditional is uniform across the entire thread block
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
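A tiny sketch of why the barrier matters (illustrative kernel, assuming one element per thread and a block size of 256): each thread stages a value in shared memory and then reads a neighbour's slot, which is only safe once every write has completed.

__global__ void rotateWithinBlock(const float* in, float* out)
{
    __shared__ float buf[256];                    // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = in[i];                     // each thread writes its own slot
    __syncthreads();                              // all writes now visible to the whole block
    int next = (threadIdx.x + 1) % blockDim.x;    // read the neighbour's slot
    out[i] = buf[next];
}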
Graphics Interoperability
one last API bit...
Overview
 Interface to exchange data between OpenGL / D3D and CUDA without reading it back to the host
 Buffer objects can be mapped into the CUDA address space and then used as global memory
– Textures can be accessed by casting them to buffer objects
 Data can be accessed as any other global data in the device code
 Useful for
– Frame post-processing
– Visualization
– Physical Simulation
– …
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
OpenGL Interoperability

Mapping GL buffer object to CUDA
cudaError_t
cudaGLMapBufferObject( unsigned int bufobj,
void **Ptr,
cudaContext_t ctxt = def)

Unmapping GL buffer object from CUDA
cudaError_t
cudaGLUnmapBufferObject( unsigned int bufobj,
cudaContext_t ctxt = def)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
OpenGL Interoperability

Example (from simpleGL in the SDK)
float *dptr;
cudaGLMapBufferObject( vbo, (void**)&dptr );
dim3 grid( 1, 1, 1 );
dim3 block( num_threads, 1, 1 );
kernel<<< grid, block >>>( dptr );
cudaGLUnmapBufferObject( vbo );
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Practical Code Example
AKA: breaking the inertia with a simple, illustrative (= useless) example :)
Matrix Multiplication

Illustrates the basic features of
– Global Memory usage
– Memory transfer API
– Thread allocation
– Local, register usage
– Thread ID usage
– Only example, not efficient!
 i.e. Leave shared memory usage for later
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
A Matrix Data Type
 NOT part of CUDA
– 2D matrix
– single precision float elements
– width * height elements
– data elements allocated and attached to elements

typedef struct {
    int width;
    int height;
    float* elements;
} Matrix;

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Square Matrix Multiplication
 P = M * N of size WIDTH x WIDTH
 Without blocking
 One thread handles one element of P
[Figure: matrices M, N, and P, each WIDTH x WIDTH]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Step 1: Matrix Data Transfers
// Allocate the host memory M where we will copy to device
Matrix AllocateMatrix(const int height, const int width, float initVal)
{
    Matrix M;
    M.width = width;
    M.height = height;
    int size = width * height * sizeof(float);
    M.elements = (float*) malloc(size);
    for (int i = 0; i < height; i++) {
        for (int j = 0; j < width; j++) {
            M.elements[i*width + j] = initVal;
        }
    }
    return M;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Step 2: Validation Method
// Matrix multiplication on the (CPU) host in double precision
// For simplicity, we will assume that all dimensions are equal
void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P)
{
    for (int i = 0; i < M.height; ++i)
        for (int j = 0; j < N.width; ++j) {
            double sum = 0;
            for (int k = 0; k < M.width; ++k) {
                double a = M.elements[i * M.width + k];
                double b = N.elements[k * N.width + j];
                sum += a * b;
            }
            P.elements[i * N.width + j] = sum;
        }
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Multiply Using One Thread Block
 One block of threads computes matrix P
– Each thread computes one element of P
 Each thread
– Loads a row of matrix M
– Loads a column of matrix N
– Performs one multiply and addition for each pair of M and N elements
– Compute to off-chip memory access ratio close to 1:1 (not very high)
 Size of matrix limited by the number of threads allowed in a thread block
[Figure: Grid 1 contains a single Block 1; thread (2, 2) reads one row of M and one column of N to compute one element of P; matrices are MATRIX_SIZE x MATRIX_SIZE]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Step 3: Host-side Main Code
int main(void)
{
// Allocate and initialize the matrices
Matrix M = AllocateMatrix(MATRIX_SIZE, MATRIX_SIZE, 1);
Matrix N = AllocateMatrix(MATRIX_SIZE, MATRIX_SIZE, 1);
Matrix P = AllocateMatrix(MATRIX_SIZE, MATRIX_SIZE, 0);
// M * N on the device
MatrixMulOnDevice(M, N, P);
// Free matrices
FreeMatrix(M);
FreeMatrix(N);
FreeMatrix(P);
return 0;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Step 4: Host-side Code
// Matrix multiplication on the device
void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P)
{
// Load M and N to the device
Matrix Md = AllocateDeviceMatrix(M);
CopyToDeviceMatrix(Md, M);
Matrix Nd = AllocateDeviceMatrix(N);
CopyToDeviceMatrix(Nd, N);
// Allocate P on the device
Matrix Pd = AllocateDeviceMatrix(P);
CopyToDeviceMatrix(Pd, P); // Clear memory
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Step 4: Host-side Code (cont.)
// Setup the execution configuration
dim3 dimBlock(MATRIX_SIZE, MATRIX_SIZE);
dim3 dimGrid(1, 1);
// Launch the device computation threads!
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd);
// Read P from the device
CopyFromDeviceMatrix(P, Pd);
// Free device matrices
FreeDeviceMatrix(Md);
FreeDeviceMatrix(Nd);
FreeDeviceMatrix(Pd);
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Step 5: Device-side Kernel
// Matrix multiplication kernel – thread specification
__global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P)
{
// 2D Thread ID
int tx = threadIdx.x;
int ty = threadIdx.y;
// Pvalue is used to store the element of the matrix that is computed by the thread
float Pvalue = 0;
for (int k = 0; k < MATRIX_SIZE; ++k)
{
float Melement = M.elements[ty * M.width + k];
float Nelement = N.elements[k * N.width + tx];
Pvalue += Melement * Nelement;
}
// Write the matrix to device memory; each thread writes one element
P.elements[ty * P.width + tx] = Pvalue;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Step 3: Some Loose Ends
// Allocate a device matrix of same size as M.
Matrix AllocateDeviceMatrix(const Matrix M)
{
Matrix Mdevice = M;
int size = M.width * M.height * sizeof(float);
cudaMalloc((void**)&Mdevice.elements, size);
return Mdevice;
}
// Free a device matrix.
void FreeDeviceMatrix(Matrix M) {
cudaFree(M.elements);
}
void FreeMatrix(Matrix M) {
free(M.elements);
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Step 3: Some Loose Ends
// Copy a host matrix to a device matrix.
void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost)
{
int size = Mhost.width * Mhost.height * sizeof(float);
cudaMemcpy(Mdevice.elements, Mhost.elements, size,
cudaMemcpyHostToDevice);
}
// Copy a device matrix to a host matrix.
void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice)
{
int size = Mdevice.width * Mdevice.height * sizeof(float);
cudaMemcpy(Mhost.elements, Mdevice.elements, size,
cudaMemcpyDeviceToHost);
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Performance Results (??)
 Core 2 Duo 2.4GHz vs 8800 GTS 640MB
 Matrix size = 16x16
 1 block of 256 threads
 Host processing time: 0.005550 (ms)
 Device processing time: 0.398564 (ms)
 I told you it was an illustrative example!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Performance Results
 Of course, we can try cheating and make 12 multiplications in parallel
 Matrix size = 16x16
 12 blocks of 256 threads each
 Host processing time: 0.062140 (ms)
 Device processing time: 0.396850 (ms)
 Hmm... since it is still illustrative, let's experiment a little more!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Matrix Multiplication
Shared Memory
__global__ void matrixMulSimpleKernelShared( float* m, float* n, float* p )
{
const int tx = threadIdx.x;
const int ty = threadIdx.y;
float sum = 0;
__shared__ float MMs[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float NNs[BLOCK_SIZE][BLOCK_SIZE];
MMs[ty][tx] = m[ty*BLOCK_SIZE + tx];
NNs[ty][tx] = n[ty*BLOCK_SIZE + tx];
__syncthreads();
for( int k = 0; k < BLOCK_SIZE; ++k )
{
sum += MMs[ty][k] * NNs[k][tx];
}
p[ty*BLOCK_SIZE + tx] = sum;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Matrix Multiplication
Constant Memory
__constant__ float Mc[BLOCK_SIZE*BLOCK_SIZE];
__constant__ float Nc[BLOCK_SIZE*BLOCK_SIZE];
__global__ void matrixMulSimpleKernelConstant( float* m, float* n, float* p )
{
const int tx = threadIdx.x;
const int ty = threadIdx.y;
float sum = 0;
for( int k = 0; k < BLOCK_SIZE; ++k )
{
const float a = Mc[ty*BLOCK_SIZE + k];
const float b = Nc[k*BLOCK_SIZE + tx];
sum += a * b;
}
p[ty*BLOCK_SIZE + tx] = sum;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Matrix Multiplication
Constant Memory (host)
int byteTotal = BLOCK_SIZE*BLOCK_SIZE*sizeof(float);
cudaMemcpyToSymbol( Mc, m, byteTotal ) ;
cudaMemcpyToSymbol( Nc, n, byteTotal ) ;

then call kernel
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Matrix Multiplication
Texture Memory
texture<float, 2> texM;
texture<float, 2> texN;
__global__ void matrixMulSimpleKernelTexture( float* p )
{
const int tx = threadIdx.x;
const int ty = threadIdx.y;
float sum = 0;
for( int k = 0; k < BLOCK_SIZE; ++k )
{
const float a = tex2D( texM, k, ty );
const float b = tex2D( texN, tx, k );
sum += a * b;
}
p[ty*BLOCK_SIZE + tx] = sum;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Matrix Multiplication
Texture Memory (host)
// Allocate arrays for texture access
cudaArray* mArray = NULL;
cudaArray* nArray = NULL;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
cudaMallocArray( &mArray, &channelDesc, BLOCK_SIZE, BLOCK_SIZE );
cudaMallocArray( &nArray, &channelDesc, BLOCK_SIZE, BLOCK_SIZE );
// Bind the arrays to the textures
cudaBindTextureToArray( texM, mArray ) ;
cudaBindTextureToArray( texN, nArray ) ;
// Set M texture parameters
texM.addressMode[0] = cudaAddressModeClamp;
texM.addressMode[1] = cudaAddressModeClamp;
texM.filterMode = cudaFilterModePoint;
texM.normalized = false;
// Set N texture parameters
texN.addressMode[0] = cudaAddressModeClamp;
texN.addressMode[1] = cudaAddressModeClamp;
texN.filterMode = cudaFilterModePoint;
texN.normalized = false;
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Matrix Multiplication
Texture Memory (host, pt 2)
int byteTotal = BLOCK_SIZE*BLOCK_SIZE*sizeof(float);
cudaMemcpyToArray( mArray, 0, 0, m, byteTotal, cudaMemcpyHostToDevice );
cudaMemcpyToArray( nArray, 0, 0, n, byteTotal, cudaMemcpyHostToDevice );

then call kernel
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Performance Results?
 About the same
 Constant memory seems faster (about 0.01ms)
 Not really any difference, still slower than CPU
 We will see the proper way of doing Matrix Multiplication in a few slides!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Useful Information on
Tools
like...err... DEBUGGING!
Compilation
 Any source file containing CUDA language extensions must be compiled with nvcc
 nvcc is a compiler driver
– Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ...
 nvcc can output:
– Either C code
 That must then be compiled with the rest of the application using another tool
– Or object code directly
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Linking

Any executable with CUDA code requires
two dynamic libraries:
– The CUDA runtime library (cudart)
– The CUDA core library (cuda)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Debugging Using the
Device Emulation Mode

An executable compiled in device emulation mode (nvcc -deviceemu) runs completely on the host using the CUDA runtime
– No need of any device and CUDA driver
– Each device thread is emulated with a host thread

When running in device emulation mode, one can:
– Use host native debug support (breakpoints, inspection, etc.)
– Access any device-specific data from host code and vice-versa
– Call any host function from device code (e.g. printf) and vice-versa
– Detect deadlock situations caused by improper usage of
__syncthreads
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Device Emulation Mode Pitfalls

Emulated device threads execute sequentially, so
simultaneous accesses of the same memory
location by multiple threads could produce
different results.
 Dereferencing device pointers on the host or host
pointers on the device can produce correct
results in device emulation mode, but will
generate an error in device execution mode
 Results of floating-point computations will slightly
differ because of:
– Different compiler outputs, instruction sets
– Use of extended precision for intermediate results
 There are various options to force strict single precision on
the host
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
CUDA SDK

CUBLAS
– Blas level 1, 2 and 3 ready-to-use functions

CUFFT
– Discrete Fast Fourier Transform
– API similar to the popular FFTW

CUDPP (still beta)
– Parallel primitives
– Prefix sum, sort, reduction, etc

Full support for clusters running Rocks
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Hardware
now, take a breath...
G80 Thread Computing Pipeline

The future of GPUs is programmable processing
So – build the architecture around the processor

[Block diagram of the G80 graphics pipeline: Host → Input Assembler → Setup/Rstr/ZCull and Vtx/Geom/Pixel Thread Issue → Thread Processor; arrays of SPs grouped with TF units and L1 caches; L2 caches and FB memory partitions]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
G80 Thread Computing Pipeline


Processors execute computing threads
Alternative operating mode specifically for computing
[Block diagram of G80 in compute mode: Host → Input Assembler → Thread Execution Manager → processor groups, each with a Parallel Data Cache and texture unit → Load/store paths → Global Memory]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
GeForce 8800 Series
Technical Specs
 Maximum number of threads per block: 512
 Maximum size of each dimension of a grid: 65,535
 Number of streaming multiprocessors (SM):
– GeForce 8800 GTX: 16 @ 1.35 GHz
– GeForce 8800 GTS: 12 @ 1.2 GHz
 Device memory:
– GeForce 8800 GTX: 768 MB
– GeForce 8800 GTS: 640 MB
 Shared memory per multiprocessor: 16KB divided in 16 banks
 Constant memory: 64 KB
 Warp size: 32 threads (16 Warps/Block)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
What is the GPU Good at?


Data-parallel processing
– Same computation executed on many
data elements in parallel
– Low control flow overhead
With high SP floating point arithmetic
intensity
– Many calculations per memory access
– Currently need high floating point to
integer ratio
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
What is the GPU Good at?
 High floating-point arithmetic intensity
+ Many data elements
= Memory access latency can be hidden with calculations instead of big caches
 Still need to avoid bandwidth saturation!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Drawbacks of (legacy)
GPGPU Model
how it was done prior to CUDA
Hardware Limitations

 Memory accesses are done as pixels
– Only gather: can read data from other pixels
– No scatter: can only write to one pixel (each shader writes to one pixel)
 Less programming flexibility
[Diagram: SIMD control/cache/ALU groups can read arbitrary DRAM locations d0..d7 (gather), but each shader can write only its own output pixel (no scatter)]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Hardware Limitations
 Applications can easily be limited by DRAM memory bandwidth
 Waste of computation power due to data starvation
[Diagram: the same SIMD control/cache/ALU groups all contending for DRAM locations d0..d7]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
But with CUDA
what we can do differently
Scatter
 CUDA provides generic DRAM memory addressing
– Gather:
[Diagram: control/cache/ALU groups reading from arbitrary DRAM locations d0..d7]
– And scatter: no longer limited to write one pixel
[Diagram: control/cache/ALU groups writing to arbitrary DRAM locations d0..d7]
 More programming flexibility
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
On-Chip Shared Memory
 CUDA enables access to a parallel on-chip shared memory for efficient inter-thread data sharing
[Diagram: each control/cache/ALU group now has its own on-chip shared memory holding d0..d3 and d4..d7, reducing trips to DRAM]
 Big memory bandwidth savings
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Memory Model &
Hardware
how does the CUDA memory
model work in the hardware?
CUDA Memory Spaces
 Each thread can:
– Read/write per-thread registers
– Read/write per-thread local memory
– Read/write per-block shared memory
– Read/write per-grid global memory
– Read only per-grid constant memory
– Read only per-grid texture memory
 The host can read/write global, constant, and texture memory
[Figure: CUDA memory model, as shown earlier]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Hardware Implementation
 The local, global, constant, and texture spaces are regions of device memory
 Each multiprocessor has:
– A set of 32-bit registers per processor
– On-chip shared memory
 Where the shared memory space resides
– A read-only constant cache
 To speed up access to the constant memory space
– A read-only texture cache
 To speed up access to the texture memory space
[Diagram: device with N multiprocessors; each multiprocessor has M processors with registers, an instruction unit, shared memory, a constant cache, and a texture cache, sitting above device memory that holds the global, constant, and texture memories]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Memory Summary
Memory     Location   Cached           Access       Who
Local      Off-chip   No               Read/write   One thread
Shared     On-chip    N/A - resident   Read/write   All threads in a block
Global     Off-chip   No               Read/write   All threads + host
Constant   Off-chip   Yes              Read         All threads + host
Texture    Off-chip   Yes              Read         All threads + host
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Access Times
 Register
– dedicated HW - single cycle
 Shared Memory
– dedicated HW - single cycle
 Local Memory
– DRAM, no cache - *slow*
 Global Memory
– DRAM, no cache - *slow*
 Constant Memory
– DRAM, cached, 1…10s…100s of cycles, depending on cache locality
 Texture Memory
– DRAM, cached, 1…10s…100s of cycles, depending on cache locality
 Instruction Memory (invisible)
– DRAM, cached
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Processing Model &
Hardware
now that we can reach the data,
how do we actually process it?
CUDA: A Set of SIMD
Multiprocessors
 A set of 16 multiprocessors
 Each multiprocessor
– A set of 8 processors (32-bit)
– Single Instruction Multiple Data architecture (shared instruction unit)
 At each clock cycle
– The multiprocessor executes the same instruction on a group of threads called a warp
– The number of threads in a warp is the warp size
[Diagram: device with N multiprocessors; each has M processors sharing one instruction unit]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Threads, Warps, Blocks

There are (up to) 32 threads in a Warp
–
Only <32 when there are fewer than 32 total threads

There are (up to) 16 Warps in a Block
 Each Block (and thus, each Warp) executes on a
single SM
 G80 has 16 SMs
 At least 16 Blocks required to “fill” the device
 More is better
–
If resources (registers, thread space, shared memory) allow,
more than 1 Block can occupy each SM
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Execution Model (review)
 Each thread block of a grid is split into warps, each gets executed by one multiprocessor (SM)
– The way a block is split into warps is always the same
– Each warp contains threads of consecutive, increasing thread indices, with the first warp containing thread 0
 Each thread block is executed by one multiprocessor
– So that the shared memory space resides in the on-chip shared memory
 A multiprocessor can execute multiple blocks concurrently
– Shared memory and registers are partitioned among the threads of all concurrent blocks
– So, decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Matrix: Reloaded
a better approach to matrix
multiplication
Recall: Matrix Multiplication
Device-Side Kernel Function
for (int k = 0; k < M.width; ++k)
{
    float Melement = M.elements[ty * M.pitch + k];
    float Nelement = N.elements[k * N.pitch + tx];
    Pvalue += Melement * Nelement;
}
// Write the matrix to device memory;
// each thread writes one element
P.elements[ty * P.pitch + tx] = Pvalue;
[Figure: thread (tx, ty) reads row ty of M and column tx of N, each of size WIDTH, and writes one element of P]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Idea: Use Shared Memory to reuse
global memory data
Each input element is read by WIDTH threads
 Load each element into Shared Memory
 Several threads use the local version
 Drastically reduce the memory bandwidth

– Tiled algorithms
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Tiled Multiply Using Thread
Blocks
 One block computes one square sub-matrix Psub of size BLOCK_SIZE
 One thread computes one element of Psub
 Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square shape
[Figure: M, N, and P (each WIDTH x WIDTH) tiled into BLOCK_SIZE x BLOCK_SIZE sub-matrices; block (bx, by) computes Psub from a row of tiles of M and a column of tiles of N]
Shared Memory Usage

Each SM has 16KB shared memory
– Each Thread Block uses 2*256*4B = 2KB of shared
memory.
– Potentially up to 8 Thread Blocks actively executing
– For BLOCK_SIZE = 16, this allows up to 8*512 =
4,096 pending loads
 In practice, there will probably be up to half of this due to
scheduling to make use of SPs.
– The next BLOCK_SIZE 32 would lead to 2*32*32*4B=
8KB shared memory usage per Thread Block,
allowing only up to 2 Thread Blocks active at the
same time
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
First-order Size Considerations

 Each Thread Block should have a minimum of 192 threads
– BLOCK_SIZE of 16 gives 16*16 = 256 threads
 A minimum of 32 Thread Blocks
– A 1024*1024 P Matrix gives 64*64 = 4096 Thread Blocks
 Each thread block performs
– 2*256 = 512 float loads from global memory
– for 256 * (2*16) = 8,192 mul/add operations
– Memory bandwidth no longer a limiting factor
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Kernel Execution Configuration
// Setup computation
// Matrix size could be 1024
dim3 dimBlock( BLOCK_SIZE, BLOCK_SIZE );
dim3 dimGrid( MATRIX_SIZE/BLOCK_SIZE, MATRIX_SIZE/BLOCK_SIZE );
// Launch device kernel
matrixMulKernel<<< dimGrid, dimBlock >>>( md, nd, pd );
For very large N and M dimensions, one will need to add another level of blocking and execute the second-level blocks sequentially (several kernels).
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Kernel Code: initialization
__global__ void matrixMulKernel( float* m, float* n, float* p )
{
// Block and thread index
int bx = blockIdx.x; int by = blockIdx.y; int tx = threadIdx.x;
int ty = threadIdx.y;
// Index of the first and last sub-matrix of A processed by the block
int mBegin = MATRIX_SIZE * BLOCK_SIZE * by; int mEnd = mBegin + MATRIX_SIZE - 1;
// Step size used to iterate through the sub-matrices of A
int mStep = BLOCK_SIZE;
// Index of the first and last sub-matrix of B processed by the block
int nBegin = BLOCK_SIZE * bx; int nStep = BLOCK_SIZE * MATRIX_SIZE;
// sum is used to store the element of the block sub-matrix that is computed by the thread
float sum = 0;
// Declaration of the shared memory arrays Ms and Ns used to store the sub-matrices of A and B
__shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Tiled Multiply Using Thread Blocks (repeated)
[Figure: same tiling diagram as before — one block computes one square BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub, one thread computes one element of Psub]
Kernel Code: main computation
// Loop over all the sub-matrices of A and B required to compute the block sub-matrix
for( int a = mBegin, b = nBegin; a <= mEnd; a+=mStep, b+=nStep )
{
// Load the matrices from device memory to shared memory;
// each thread loads one element of each matrix
Ms[ty][tx] = m[a + MATRIX_SIZE * ty + tx];
Ns[ty][tx] = n[b + MATRIX_SIZE * ty + tx];
// Synchronize to make sure the matrices are loaded
__syncthreads();
// Multiply the two matrices together; each thread computes one element of the block sub-matrix
for( int k = 0; k < BLOCK_SIZE; ++k )
sum += Ms[ty][k] * Ns[k][tx];
// Synchronize to make sure that the preceding computation is done before
// loading two new sub-matrices of A and B in the next iteration
__syncthreads();
}
// Write the block sub-matrix to device memory; each thread writes one element
int c = MATRIX_SIZE * BLOCK_SIZE * by + BLOCK_SIZE * bx;
p[c + MATRIX_SIZE*ty + tx] = sum;
}
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
This code should run at
about 45 GFLOPS
Performance Results
[Chart: processing time (ms) vs. matrix size (N x N) for N between roughly 60 and 140, CPU vs GPU; measured values range from about 0.0055 ms up to about 4.8 ms, with the GPU times staying roughly flat at a few tenths of a millisecond]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Performance Results
 GPU block sizes
– Size = 4: 314 ms
– Size = 8: 156 ms
– Size = 16: 65 ms
 Matrix: 1024 x 1024
– CPU: 5420 ms
– GPU: 65 ms
[Chart: processing time (ms) vs. matrix size (N x N) up to about 1024; the CPU curve climbs to several seconds while the GPU curve stays near zero on the same scale]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Performance Results
 Important note: CPU code not optimized!
 GPU seems almost 100x faster than CPU
 But
– Even if we can get a 2x speed-up in CPU (!)
– GPU would still be about 50x faster!
 Do the math!
– Fastest Core 2 Duo has peak at ~10 GFLOPs
– Previous CUDA code is ~45 GFLOPs
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Matrix Multiplication: Verdict
No need to code and optimize this!
 Use CUBLAS

– CUDA Blas level 1, 2 and 3 library
– Ready for use, optimized like hell!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
HW Architecture
everybody take a deep breath
now......
Streaming Processor Array
(SPA)
[Diagram: the Streaming Processor Array (SPA) is a set of Texture Processor Clusters (TPC); each TPC contains a TEX unit and two Streaming Multiprocessors (SM); each SM has an instruction L1 cache, a data L1 cache, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Texture Processor Cluster (TPC)

[Diagram: one TPC = a texture unit with its L1 texture cache plus two SMs; each SM has instruction fetch/decode, an instruction L1 cache, a constant L1 cache, shared memory, 8 SPs with register files, 2 SFUs, and a load/store path; an L2 cache sits below]
 Texture-Processor Cluster (TPC)
– Texture unit, L1 texture cache
– 2 Streaming Multiprocessors (SM)
– 8 FP MAD / clock
– L2 Instruction & Data Caches
 Memory and Texture access
– Texture, load/store interfaces
– Registers decouple latency
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Streaming Multiprocessor (SM)
 Streaming Multiprocessor (SM)
– 8 Streaming Processors (SP)
– 2 Super Function Units (SFU)
 Multi-threaded instruction dispatch
– 1 to 768 threads active
– SIMD instruction per 16/32 threads
– Cover latency of texture/memory loads
 Hot clock 1.35 GHz
– 20+ GFLOPS
 Local register file (RFn)
 16 KB shared memory
 DRAM texture and memory access
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Streaming Multiprocessor (SM)
[Diagram: SM internals — instruction fetch, instruction L1 cache, thread/instruction dispatch, shared memory, 8 SPs (SP0–SP7) with register files (RF0–RF7), 2 SFUs, a constant L1 cache, and load/store paths to texture and memory]
Streaming Processor (SP)

One scalar ALU
– Serves as datapath for 1 thread of a warp
– Each SM has 8 SP
– Each SM has 2 SFU

Threads
– A warp instruction can issue every clock
– Need ~8 warps to typically saturate the
MAD/SFU pipes
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
SM Instruction Buffer
 Fetch one warp instruction/cycle
– from instruction L1 cache
– into any instruction buffer slot
 Issue one “ready-to-go” warp instruction/cycle
– from any warp - instruction buffer slot
– operand scoreboarding used to prevent hazards
 Issue selection based on round-robin/age of warp
 SM broadcasts SIMD instruction to 32 threads of a warp
[Diagram: SM datapath — I$ L1, multithreaded instruction buffer, register file (RF), C$ L1, shared memory, operand select, MAD and SFU pipes]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Scoreboarding

All operands of all instructions in the
Instruction Buffer are scoreboarded
– prevents hazards
– cleared instructions are eligible for issue

Decoupled Memory/Processor pipelines
– any thread can continue to issue instructions
until scoreboarding prevents issue
– allows Memory/Processor ops to proceed in
shadow of Memory/Processor ops
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Branching



Conditional branch to label, subroutine call
SM schedules each Warp independently
SM executes 32 threads of a Warp as a SIMD
instruction
– SM enables/disables sets of threads when branches
diverge

Synchronization
– Re-converge diverged threads in a Warp

Barrier Synchronization
– CUDA uses barrier instruction to synchronize
multiple Warps in a Thread Block
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
SM Register File
 Register File (RF)
– 32 KB
– Provides 4 operands/clock
 TEX pipe can also read/write RF
– 2 SMs share 1 TEX
 Load/Store pipe can also read/write RF
[Diagram: same SM datapath as before — I$ L1, multithreaded instruction buffer, RF, C$ L1, shared memory, operand select, MAD and SFU]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Constants
 Immediate address constants
 Indexed address constants
 Constants stored in memory, and cached on chip
– L1 per SM
[Diagram: same SM datapath — the constant cache (C$ L1) feeds operand select]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Shared Memory
 Each SM has 16 KB of Shared Memory
– 16 banks of 32bit words
 CUDA uses Shared Memory as shared storage visible to all threads in a thread block
– read and write access
 Not used explicitly for pixel shader programs
– we dislike pixels talking to each other :)
[Diagram: same SM datapath — shared memory feeds operand select alongside the register file and constant cache]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Execution Pipes
 Scalar MAD pipe
– FMUL, FADD, FMAD
– integer ops, conversions
– One instruction/clock
 Scalar SFU pipe
– RCP, RSQ, LG2, EX2, SIN, COS
– one instruction/4 clocks
– also supports FMUL, MOV
 TEX pipe (external to SM, shared by all SM’s in a TPC)
 LD/ST pipe
– thread register spill to memory, used for indexable registers
– CUDA has both global and local memory access through LD/ST
[Diagram: same SM datapath — operand select feeding the MAD and SFU pipes]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Texture (Memory read)
Clustering/Batching
 Use another independent Texture memory read to hide Texture memory latency
– Use same thread to help hide own latency
 Instead of:
– TEX 0 (long latency)
– Dependent MATH 0
– TEX 1 (long latency)
– Dependent MATH 1
 Do:
– TEX 0 (long latency)
– TEX 1 (long latency - hidden)
– MATH 0
– MATH 1
 Compiler handles this!
– But, you must have enough non-dependent LDs and Math
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Load/Store (Memory read/write)
Clustering/Batching
 Use LD to hide LD latency (non-dependent LD ops only)
– Use same thread to help hide own latency
 Instead of:
– LD 0 (long latency)
– Dependent MATH 0
– LD 1 (long latency)
– Dependent MATH 1
 Do:
– LD 0 (long latency)
– LD 1 (long latency - hidden)
– MATH 0
– MATH 1
 Compiler handles this!
– But, you must have enough non-dependent LDs and Math
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Performance Issues
the real gems of how to code
efficient algorithms in CUDA
(and G80+ in general!)
CUDA Instruction Performance
 Instruction cycles (per warp) = sum of
– Operand read cycles
– Instruction execution cycles (both memory access and FP)
– Result update cycles
 Therefore instruction throughput depends on
– Nominal instruction throughput
– Memory latency
– Memory bandwidth
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Maximizing Instruction Throughput
 Minimize use of low-throughput instructions
– Will cover specifics later
 Maximize use of high-bandwidth memory
– Maximize use of shared memory
– Maximize locality and synchrony of cached accesses
– Minimize accesses to (uncached) global and local memory
– Maximize coalescing of global memory accesses
 Optimize performance by overlapping memory accesses with HW computation
– High arithmetic intensity programs
 i.e. high ratio of math to memory transactions
– Many concurrent threads
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Arithmetic Instruction Throughput
 int and float add, shift, min, max and float mul, mad: 2 cycles per warp
– int multiply (*) is by default 32-bit
 requires multiple cycles / warp
– Use __mul24() / __umul24() intrinsics for 2-cycle 24-bit int multiply
 Integer divide and modulo are expensive
– Compiler will convert literal power-of-2 divides to shifts
– Be explicit in cases where compiler can’t tell that divisor is a power of 2!
– Useful trick: foo % n == foo & (n-1) if n is a power of 2
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
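A minimal sketch combining the 24-bit multiply intrinsic and the power-of-two modulo trick; the kernel and names are illustrative, and n is assumed to be a power of 2:

__global__ void bucketize(int* bucketOf, int n)
{
    // __umul24: 2-cycle 24-bit multiply (fine while grid and block counts stay below 2^24)
    int tid = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
    // Same result as tid % n when n is a power of 2, without the expensive modulo
    bucketOf[tid] = tid & (n - 1);
}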
Arithmetic Instruction Throughput
 Reciprocal, reciprocal square root, sin/cos, log, exp: 8 cycles per warp
– These are the versions prefixed with “__”
– Examples: __rcp(), __sin(), __exp()
 Other functions are combinations of the above
– y / x == rcp(x) * y == 10 cycles per warp
– sqrt(x) == rcp(rsqrt(x)) == 16 cycles per warp
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Runtime Math Library
 There are two types of runtime math operations
– __func(): direct mapping to hardware ISA
 Fast but low accuracy (see prog. guide for details)
 Examples: __sin(x), __exp(x), __pow(x,y)
– func(): compile to multiple instructions
 Slower but higher accuracy (5 ulp or less)
 Examples: sin(x), exp(x), pow(x,y)
 The -use_fast_math compiler option forces every func() to compile to __func()
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Inside the Hardware
 Need ~8 warps to typically saturate the MAD/SFU pipes
 Avoid many SFU calls (RCP, RSQ, SIN, etc)
 Optimized for FMUL operations
 SMs share instruction slots between integer ops, loads, stores, etc. and floating point operations
– The more floating point you fit, the more flops you get
 Keep instruction workload constant: spread fmads, memory fetches, etc
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Make your program float-safe!
 Future hardware will have double precision support
– G80 is single-precision only
– Double precision will have additional performance cost
– Careless use of double or undeclared types may run more slowly on G80+
 Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed
– Add ‘f’ specifier on float literals:
 foo = bar * 0.123;    // double assumed
 foo = bar * 0.123f;   // float explicit
– Use float version of standard library functions
 foo = sin(bar);       // double assumed
 foo = sinf(bar);      // single precision explicit
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Deviations from IEEE-754
 Addition and Multiplication are IEEE 754 compliant
– Maximum 0.5 ulp error
 However, often combined into multiply-add (FMAD)
– Intermediate result is truncated
 Division is non-compliant (2 ulp)
 Not all rounding modes are supported
 Denormalized numbers are not supported
 No mechanism to detect floating-point exceptions
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
GPU Floating Point Features
Feature                            | G80                                 | SSE                                            | IBM Altivec                  | Cell SPE
-----------------------------------+-------------------------------------+------------------------------------------------+------------------------------+-----------------------------
Precision                          | IEEE 754                            | IEEE 754                                       | IEEE 754                     | IEEE 754
Rounding modes for FADD and FMUL   | Round to nearest and round to zero  | All 4 IEEE: round to nearest, zero, +inf, -inf | Round to nearest only        | Round to zero/truncate only
Denormal handling                  | Flush to zero                       | Supported, 1000’s of cycles                    | Supported, 1000’s of cycles  | Flush to zero
NaN support                        | Yes                                 | Yes                                            | Yes                          | No
Overflow and Infinity support      | Yes, only clamps to max norm        | Yes                                            | Yes                          | No, infinity
Flags                              | No                                  | Yes                                            | Yes                          | Some
Square root                        | Software only                       | Hardware                                       | Software only                | Software only
Division                           | Software only                       | Hardware                                       | Software only                | Software only
Reciprocal estimate accuracy       | 24 bit                              | 12 bit                                         | 12 bit                       | 12 bit
Reciprocal sqrt estimate accuracy  | 23 bit                              | 12 bit                                         | 12 bit                       | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit                              | No                                             | 12 bit                       | No
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
How thread blocks are partitioned
 Thread blocks are partitioned into warps
– Thread IDs within a warp are consecutive and increasing
– Warp 0 starts with Thread ID 0
 Partitioning is always the same
– Thus you can use this knowledge in control flow (covered next)
 However, DO NOT rely on any ordering between warps
– If there are any dependencies between threads, you must __syncthreads() to get correct results
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Branching and Divergence
“This way! No, that way!”
SIMD Operation

The SM is a multithreaded SIMD machine
– SIMD allows overhead of fetch-decode-schedule to be amortized
across many threads (the threads of a warp)
– Implication is that higher percentage of SM area is computation
units => better perf/area than say a multicore CPU

However, only works if threads are truly “coherent” (in
lock step executing the exact same instructions on
different data sets)
– Branches represent opportunities for thread “divergence”
– When threads of a warp diverge, we lose a degree of SIMD
– Loss of SIMD increases exponentially for each divergence until
all threads of a warp are executed one-at-a-time
– At that point, for a warp size of W, we operate at efficiency 1/W
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Coding for Performance in the Face of Branches

If you have to branch…
– Do it as little as possible
– The “divergence distance” in a shader should
be as small as possible
 The distance between a branch that can diverge
and an instruction which “resyncs” a divergence –
a join point
– SM provides means in the ISA to converge a
set of threads at some common point
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Control Flow Instructions
 Main performance concern with branching is divergence
– Threads within a single warp take different paths
– Different execution paths must be serialized
 Avoid divergence when the branch condition is a function of thread ID (see the sketch below)
– Example with divergence:
 if (threadIdx.x > 2) { }
 Branch granularity < warp size
– Example without divergence:
 if (threadIdx.x / WARP_SIZE > 2) { }
 Branch granularity is a whole multiple of warp size
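A small sketch contrasting both cases; kernel and array names are illustrative:
#define WARP_SIZE 32
__global__ void divergentBranch(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Divergent: threads 0-2 and 3-31 of the same warp take different paths
    if (threadIdx.x > 2)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}
__global__ void uniformBranch(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Not divergent: every thread of a given warp evaluates the condition
    // identically, because the branch granularity is a whole warp
    if (threadIdx.x / WARP_SIZE > 2)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}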
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Instruction Predication in G80
 Comparison instructions set condition codes (CC)
 Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.)
 Compiler tries to predict if a branch condition is likely to produce many divergent warps
– If guaranteed not to diverge: only predicates if < 4 instructions
– If not guaranteed: only predicates if < 7 instructions
– May replace branches with instruction predication
 ALL predicated instructions take execution cycles
– Those with false conditions don’t write their output, or invoke memory loads and stores
– Saves branch instructions, so can be cheaper than serializing divergent paths
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Memory Performance
Memory Bandwidth
Memory Instructions
 Memory instructions take 2 cycles per warp
– Issue global and local memory loads / stores (not cached)
– Constant and texture loads (cached)
– Shared memory reads / writes
 Example:
__shared__ float shared[];
__device__ float global[];
shared[threadIdx.x] = global[threadIdx.x];
– 2 cycles to issue the read from global (device) memory
– 2 cycles to issue the write to shared memory
 200-300 cycles to actually read a float from global (device) memory
– But this latency can be hidden by scheduling independent math instructions, or even other loads / stores, if there are enough active threads
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Memory Bandwidth
 Effective bandwidth depends on access patterns
 Minimize device memory accesses
– Much lower bandwidth than on-chip shared memory
 Common CUDA kernel structure (a minimal sketch follows below):
1. Load data from global memory to shared memory
2. __syncthreads()
3. Process the data in shared memory with many threads
4. __syncthreads() (if needed)
5. Store results from shared memory to global memory
 Notes:
– Steps 2-4 may be repeated, looped, etc.
– Step 4 is not necessary if there is no dependence of stored data on other threads
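A minimal sketch of this structure, assuming the array length is a multiple of TILE and blockDim.x == TILE; each block stages a tile in shared memory and writes it back reversed (the "reverse" operation, kernel name, and tile size are placeholders):
#define TILE 256
__global__ void reverseTiles(const float *in, float *out)
{
    __shared__ float tile[TILE];
    int base = blockIdx.x * TILE;
    tile[threadIdx.x] = in[base + threadIdx.x];   // 1. global -> shared
    __syncthreads();                              // 2. whole tile visible to the block
    float v = tile[TILE - 1 - threadIdx.x];       // 3. process in shared memory
                                                  // 4. second __syncthreads() not needed:
                                                  //    nothing else writes shared memory
    out[base + threadIdx.x] = v;                  // 5. shared -> global
}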
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Global, Local and Shared memory

Local and global device memory not cached on
GeForce 8800 Series GPUs
– High latency, launching more threads hides latency
– Important to minimize accesses, optimize
coherence
– Coalesce global memory accesses (more later)

Shared memory is on-chip, very high bandwidth
– Low latency
– Like a user-managed per-multiprocessor cache
– But must be careful to avoid bank conflicts (more
later)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Texture and Constant Memory

Texture partition is cached
– Uses the texture cache also used for graphics
– Optimized for 2D spatial locality
– Best performance when threads of a warp read
locations that are close together in 2D

Constant memory is cached
– 2 cycles per address read within a single warp
 Total cost 2 cycles if all threads in a warp read the same address
 Total cost 32 cycles if all threads read different addresses
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Registers


Register reads are generally “free”
But delays can be caused by
– Register read-after-write dependencies
– Register memory bank conflicts

Register bank conflicts are minimized by
thread scheduler
– No programmer control
– No need to pack data into float4 or int4 types
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Data Transfers

Device memory to host memory bandwidth
much lower than device memory to device
bandwidth

Minimize transfers
– Intermediate data structures can be allocated,
operated on, and deallocated without ever copying
them to host memory

Group transfers
– One large transfer much better than many small
ones
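A host-side sketch of these two points; the stage1/stage2 kernels are placeholders, n is assumed to be a multiple of 256, and the intermediate buffer never visits the host:
#include <cuda_runtime.h>
__global__ void stage1(const float *in, float *out) { int i = blockIdx.x * blockDim.x + threadIdx.x; out[i] = in[i] * 2.0f; }
__global__ void stage2(const float *in, float *out) { int i = blockIdx.x * blockDim.x + threadIdx.x; out[i] = in[i] + 1.0f; }
void pipeline(const float *h_in, float *h_out, int n)
{
    float *d_in, *d_tmp, *d_out;
    size_t bytes = n * sizeof(float);
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_tmp, bytes);   // intermediate data: never copied to the host
    cudaMalloc((void**)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);   // one large transfer in
    stage1<<<n / 256, 256>>>(d_in, d_tmp);
    stage2<<<n / 256, 256>>>(d_tmp, d_out);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); // one large transfer out
    cudaFree(d_in); cudaFree(d_tmp); cudaFree(d_out);
}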
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Memory Performance
Hiding Latencies
Highlights So Far

Whenever make memory access
– Make as many computations as possible to
hide latency!

The same computation executed on many
data elements in parallel
– Low control flow overhead

Many calculations per memory access
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Texture (Memory Read) Clustering/Batching
 Use another independent texture memory read to hide texture memory latency
– Use the same thread to help hide its own latency
 Instead of:
– TEX 0 (long latency)
– Dependent MATH 0
– TEX 1 (long latency)
– Dependent MATH 1
 Do:
– TEX 0 (long latency)
– TEX 1 (long latency - hidden)
– MATH 0
– MATH 1
 Compiler handles this!
– But you must have enough non-dependent LDs and math
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Load/Store (Memory Read/Write) Clustering/Batching
 Use an LD to hide LD latency (non-dependent LD ops only)
– Use the same thread to help hide its own latency
 Instead of:
– LD 0 (long latency)
– Dependent MATH 0
– LD 1 (long latency)
– Dependent MATH 1
 Do:
– LD 0 (long latency)
– LD 1 (long latency - hidden)
– MATH 0
– MATH 1
 Compiler handles this! (see the sketch below)
– But you must have enough non-dependent LDs and math
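A minimal sketch of such an LD group: both independent loads are issued before the first dependent math. In practice the compiler performs this reordering itself when the loads are independent, as noted above; the names and the math are illustrative:
__global__ void batchedLoads(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // LD 0 and LD 1 issued back-to-back (no dependent use in between)
    float x = a[i];
    float y = b[i];
    // MATH 0 and MATH 1 only after both loads are in flight
    float r0 = x * x;
    float r1 = y * y;
    out[i] = r0 + r1;
}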
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Load Groups
 FP:LD ratio from 8:1 to 32:1
 Need high data re-use for memory operands
– Mimics FP:TEX => FP:LD ratio
– Higher ratios imply less memory BW needed to keep the FP units busy
– Make use of shared memory as a type of SW-controlled cache – higher data reuse rates
 Larger “LD groups”
– Code programs to dispatch multiple loads before the first “use” of a load result
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Memory Access Strategies
Coalescing Global Memory Access
Coalesced Loads and Stores
 __local__ and __device__ memory are not cached on G80
– Important to minimize accesses, optimize coherence
 If the per-thread memory accesses for a single warp form a contiguous range of addresses, the accesses will be coalesced into a single access
– Coalesced accesses are much faster than non-coalesced
– Non-coalesced accesses are serialized
 Thread N within a warp should access address BaseAddress + size * N (see the sketch below)
– size is the byte size of each read/written memory block: 4, 8, or 16
– BaseAddress is aligned to 16 * size
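A small sketch of the rule above, assuming 4-byte floats; the first kernel follows the BaseAddress + size * N pattern, the second breaks it (array and kernel names are illustrative):
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    // Thread N of the warp touches BaseAddress + 4 * N: coalesced on G80
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    // For stride > 1 the per-warp addresses are no longer contiguous,
    // so both the reads and the writes are serialized
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}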
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Memory Access Strategies
Shared Memory Bank Conflicts
Parallel Memory Architecture
 In a parallel machine, many threads access memory
– Therefore, memory is divided into banks
– Essential to achieve high bandwidth
 Each bank can service one address per cycle
– A memory can service as many simultaneous accesses as it has banks
 Multiple simultaneous accesses to a bank result in a bank conflict
– Conflicting accesses are serialized
[Figure: shared memory organized as Banks 0 through 15]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Bank Addressing Examples
 No Bank Conflicts
– Linear addressing, stride == 1
 No Bank Conflicts
– Random 1:1 permutation
[Figure: each of Threads 0-15 maps to a distinct one of Banks 0-15 in both cases]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Bank Addressing Examples
 2-way Bank Conflicts
– Linear addressing, stride == 2
 8-way Bank Conflicts
– Linear addressing, stride == 8
[Figure: with stride 2, pairs of threads share a bank; with stride 8, eight threads land on each of Banks 0 and 8]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
How addresses map to banks on G80
 Each bank has a bandwidth of 32 bits per clock cycle
 Successive 32-bit words are assigned to successive banks
 G80 has 16 banks
– So bank = address % 16
– Same as the size of a half-warp
 No bank conflicts between different half-warps, only within a single half-warp
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Shared memory bank conflicts
 Shared memory is as fast as registers if there are no bank conflicts
 The fast case:
– If all threads of a half-warp access different banks, there is no bank conflict
– If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
 The slow case:
– Bank Conflict: multiple threads in the same half-warp access the same bank
– Must serialize the accesses
– Cost = max # of simultaneous accesses to a single bank
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Linear Addressing
 Given:
__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];
 This is only bank-conflict-free if s shares no common factors with the number of banks
– 16 on G80, so s must be odd
[Figure: thread-to-bank mapping for s=1 and s=3, both conflict-free]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Data types and bank conflicts
 This has no conflicts if the type of shared is 32 bits:
foo = shared[baseIndex + threadIdx.x];
 But not if the data type is smaller
– 4-way bank conflicts:
__shared__ char shared[];
foo = shared[baseIndex + threadIdx.x];
– 2-way bank conflicts:
__shared__ short shared[];
foo = shared[baseIndex + threadIdx.x];
[Figure: with char data, groups of 4 consecutive threads fall in the same bank]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Structs and Bank Conflicts
 Struct assignments compile into as many memory accesses as there are struct members:
struct vector { float x, y, z; };
struct myType {
    float f;
    char c;
};
__shared__ struct vector vectors[64];
__shared__ struct myType myTypes[64];
 This has no bank conflicts; the struct size is 3 words
– 3 accesses per thread, contiguous banks (no common factor with 16)
struct vector v = vectors[baseIndex + threadIdx.x];
 This has 2-way bank conflicts; the struct size is 5 bytes (2 accesses per thread)
struct myType m = myTypes[baseIndex + threadIdx.x];
[Figure: thread-to-bank mapping for the 3-word struct case]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Common Bank Conflict Patterns (1)
 Each thread loads 2 elements into shared mem:
– 2-way-interleaved loads result in 2-way bank conflicts:
int tid = threadIdx.x;
shared[2*tid] = global[2*tid];
shared[2*tid + 1] = global[2*tid + 1];
– Better to not interleave:
shared[tid] = global[tid];
shared[tid + blockDim.x] = global[tid + blockDim.x];
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
[Figure: thread-to-bank mappings for the interleaved (2-way conflict) and non-interleaved (conflict-free) cases]
Common Bank Conflict Patterns (2)
 Operating on a 2D array of floats in shared memory
– e.g. image processing
 Example: 16x16 block
– Each thread processes a row
– So the threads in a block access the columns simultaneously (example: row 1 in purple)
– 16-way bank conflicts: rows all start at bank 0
 Solution 1) pad the rows (see the sketch below)
– Add one float to the end of each row
 Solution 2) transpose before processing
– Suffer bank conflicts during the transpose
– But possibly save them later
[Figure: bank indices without padding (every row starts at bank 0) vs. with padding (each row shifted by one bank)]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
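A minimal sketch of Solution 1, assuming a row-major input matrix that is 16 columns wide with (gridDim.x * 16) rows; the kernel, array names, and the row-sum operation are illustrative:
#define TILE 16
__global__ void rowSumsPadded(const float *in, float *out)
{
    // Padding each row to 17 floats shifts successive rows into different banks,
    // so the 16 threads reading "column k" at the same time do not conflict
    __shared__ float tile[TILE][TILE + 1];
    int t = threadIdx.x;                 // one thread per row (blockDim.x == 16)
    int firstRow = blockIdx.x * TILE;
    for (int k = 0; k < TILE; ++k)       // each thread loads its own row
        tile[t][k] = in[(firstRow + t) * TILE + k];   // (global loads not coalesced here;
                                                      //  the focus is shared-memory banking)
    __syncthreads();
    float sum = 0.0f;
    for (int k = 0; k < TILE; ++k)       // at step k all 16 threads touch column k
        sum += tile[t][k];
    out[firstRow + t] = sum;
}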
Resource Management and Occupancy
Scarce Resources
Performance Variables
 Shorter programs are overhead limited
 Longer programs are instruction-rate limited
– Must have enough threads per thread block: at least 192, more is better
– Must have enough thread blocks: at least 32, more is better
 RF (register file) load balancing
– RF space is commonly in high demand
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Performance Variables (2)

Compiler quality important for good
performance
– Minimize register usage in CUDA programs
 Reduces spilling to memory
– Interleave non-dependent FP/DATA ops
 maximizes issue rate
– Cluster non-dependent texture and memory
reads
 decreases program latency
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Resource Management and Occupancy
Optimizing Threads per Block
Optimizing threads per block
 Given: total threads in a grid
– Choose block size / # of blocks to maximize utilization of the device
 Enough threads per block to keep the machine busy
– Cover memory latency
 Enough blocks to avoid idle multiprocessors during syncs
– If multiple blocks exist that aren’t all waiting at a __syncthreads(), the machine can stay busy
– Blocks / multiprocessor > 2 increases efficiency
– Blocks stream through the machine in pipeline fashion
 Per-block resources at most half of the total available
– Shared memory and registers
– So multiple blocks can coexist in a multiprocessor
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Optimizing threads per block
 Choose threads per block as a multiple of the warp size
– Avoid wasting computation on under-populated warps
 More threads per block == fewer regs per thread
– Kernel invocations can fail if too many registers are used
 Heuristics
– Minimum: 64 threads per block
 Only if multiple concurrent blocks
– 192 or 256 threads is a better choice
 Usually still enough regs to compile and invoke successfully
– Blocks per grid > 100 to scale to future devices
 1000 blocks per grid will scale across multiple generations
 Of course this depends on the problem size
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Optimizing threads per block

Only up to 8 thread blocks can be active
on a multiprocessor
– If your block is only of size 8, max active
threads is 8x8=64 (occupancy = 64/768 =
1/12th).

A compute intensive kernel with few
threads per block may be faster
– 64 threads/block with loop unrolling nearly 2x
faster than 256 threads/block without unrolling
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Parameterize Your Application
 Parameterization helps adaptation to different GPUs
 GPUs vary in many ways
– # of multiprocessors
– Shared memory size
– Register file size
– Threads per block
– Memory bandwidth
 You can even make apps self-tuning (like FFTW)
– “Experiment” mode discovers and saves the optimal config
(a small host-side sketch of querying these limits follows below)
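A host-side sketch of reading these parameters at run time with cudaGetDeviceProperties(); the kernel and the particular launch heuristic (2 blocks per SM, 256 threads per block) are illustrative choices:
#include <cuda_runtime.h>
#include <cstdio>
__global__ void scale(float *data) { data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs: %d  shared/block: %zu B  regs/block: %d  max threads/block: %d\n",
           prop.multiProcessorCount, prop.sharedMemPerBlock,
           prop.regsPerBlock, prop.maxThreadsPerBlock);
    int threads = 256;                           // multiple of the warp size
    int blocks  = 2 * prop.multiProcessorCount;  // at least two blocks per SM
    float *d_data;
    cudaMalloc((void**)&d_data, blocks * threads * sizeof(float));
    scale<<<blocks, threads>>>(d_data);
    cudaFree(d_data);
    return 0;
}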
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Design Evaluation

Key questions to ask
– How many threads can be supported?
– How many threads are needed?
– How are the data structures shared?
– Is there enough work in each thread between
synchronizations to make parallel execution
worthwhile?
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Design Evaluation Example – Matrix Multiplication
 Each thread likely needs at least 32 FOPS between synchronizations to make parallel processing worthwhile
 At least 192 threads are needed in a block to fully utilize the hardware
 The M and N sub-blocks of each thread group must fit into 16 KB of shared memory
 The design will likely end up with 16 x 16 sub-blocks given all the constraints (see the sketch below)
 The minimal matrix size is around 1K elements in each dimension to make parallel execution worthwhile
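A minimal sketch of the tiled kernel this reasoning leads to, assuming square row-major matrices whose width is a multiple of 16:
#define TILE 16
__global__ void matMulTiled(const float *M, const float *N, float *P, int width)
{
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < width / TILE; ++t) {
        // Stage one M sub-block and one N sub-block in shared memory
        Ms[threadIdx.y][threadIdx.x] = M[row * width + t * TILE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = N[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();
        // 16 fmads (32 floating-point ops) per thread between synchronizations
        for (int k = 0; k < TILE; ++k)
            acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();
    }
    P[row * width + col] = acc;
}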
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Conclusion

CUDA and GeForce 8800 can achieve
great results on data-parallel
computations if you use a few
performance strategies
– Optimize for memory locality
– Size thread blocks to maximize
multiprocessor utilization and reduce
memory stalls
– Ensure memory addresses are coalesced
– Avoid shared memory bank conflicts
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
General Guidelines
 Block count at least double the number of SMs: 16 x 2 = 32
 Shared memory allocated per block should be at most half of the total available shared memory: 16 KB / 2 = 8 KB
 Threads per block should be a multiple of 64
– Ideally each SM has more than 192 threads
– Minimum of 64 threads only if there are lots of concurrent blocks
– Usually, 192 or 256 threads per block is good
– Maximum is 512 threads per block
 Number of blocks per grid at least 100
 Maximize arithmetic intensity
– Hide memory access latencies
 Branches
– May not be worth it if the instruction count is 5 or less
– Pre-compute values if possible
– Avoid branches when the result may be pre-determined
– Granularity of 16x16 (16x4 is ok)
 Make neighboring threads follow the same execution path!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Possible Performance Hotspots (to get 100% occupancy)
 1. registers <= 10
 2. threads/block mod 32 == 0
 3. warps/block is a divisor of 24
 4. shared mem/block <= 16 KB * (warps/block) / 24 (any alignment constraint?)
 Constant memory does not come into it, as it is the same for all blocks
 Then run N * 16 * 24 / (warps/block) blocks, assuming they all execute for the same time
 Since there are only a few thread counts that satisfy those requirements, the options can be summarized like this (max registers: 10):

Threads per block | Max shared mem (bytes)
------------------+-----------------------
96                | 2048
128               | 2730
192               | 4096
256               | 5461
384               | 8192
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
References
 CUDA home page
– http://developer.nvidia.com/object/cuda.html
 Official CUDA forum
– http://forums.nvidia.com/index.php?showforum=62
 University of Illinois Parallel Computing Course
– http://courses.ece.uiuc.edu/ece498/al/
– Presentations
– Podcasts
– NVIDIA chief scientist David Kirk presents at all classes!
– Source of inspiration for the slides shown 
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign