Lecture 8: CUDA

CUDA

• A scalable parallel programming model for GPUs and multicore CPUs
• Provides facilities for heterogeneous programming
• Allows the GPU to be used as both a graphics processor and a computing processor
Pollack's Rule

• Performance increase is roughly proportional to the square root of the increase in complexity: performance ∝ √complexity
• Power consumption increase is roughly linearly proportional to the increase in complexity: power consumption ∝ complexity
CUDA

• SPMD (Single Program Multiple Data) programming model
• Programmer writes code for a single thread and the GPU runs thread instances in parallel
• Extends C and C++
CUDA
Three key abstractions:
• A hierarchy of thread groups
• Shared memories
• Barrier synchronization

CUDA provides fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism.
CUDA
Kernel: a function designed to be executed by many threads
Thread block: a set of concurrent threads that can cooperate among themselves through barrier synchronization and through shared-memory access
Grid: a set of thread blocks that execute the same kernel program
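
To make these terms concrete, here is a minimal sketch (the kernel name add_one and the array size N are illustrative assumptions, not from the lecture): each thread computes its global index from its block and thread coordinates, and the grid is sized so that every element gets one thread.

__global__ void add_one(int *data, int N)
{
    // global index = block index * block size + thread index within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)          // guard: the last block may contain extra threads
        data[i] += 1;
}

// Host side: 256 threads per block, enough blocks to cover N elements
// add_one<<<(N + 255) / 256, 256>>>(d_data, N);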
CUDA
__global__ void mykernel(int a, …)
{
    ...
}

main()
{
    ...
    nblocks = N/512;    // max. 512 threads per block
    mykernel<<<nblocks, 512>>>(aa, …);
    ...
}
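
Note that the integer division on the slide assumes N is a multiple of 512; a common idiom (shown here as an assumption, not part of the slide) rounds the block count up and guards inside the kernel:

nblocks = (N + 511) / 512;                 // round up so every element is covered
mykernel<<<nblocks, 512>>>(aa, …);
// inside the kernel: if (blockIdx.x * blockDim.x + threadIdx.x < N) { ... }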
CUDA

• Thread management is performed by hardware
• Max. 512 threads per block
• The number of blocks can exceed the number of processors
• Blocks execute independently and in any order
• Threads can communicate through shared memory
• Atomic memory operations exist on the global memory
CUDA
Memory Types
• Local memory: private to a thread
• Shared memory: shared by all threads of the block (__shared__)
• Device memory: shared by all threads of an application (__device__)
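
A brief sketch of how these qualifiers appear in source (the variable and kernel names are illustrative assumptions):

__device__ float g_scale;              // device memory: visible to all threads of the application

__global__ void scale(float *a, int N)
{
    __shared__ float tile[256];        // shared memory: one copy per thread block
    int tid = threadIdx.x;             // tid and i are per-thread local variables
    int i = blockIdx.x * blockDim.x + tid;
    if (i < N) tile[tid] = a[i] * g_scale;
    __syncthreads();
    if (i < N) a[i] = tile[tid];
}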
CUDA
__global__ void mykernel(float* a, …)
{
    ...
}

main()
{
    ...
    int    nbytes = N * sizeof(float);
    float* ha = (float*) malloc(nbytes);     // host array
    float* da = 0;                           // device array
    cudaMalloc((void**)&da, nbytes);
    cudaMemcpy(da, ha, nbytes, cudaMemcpyHostToDevice);
    mykernel<<<N/blocksize, blocksize>>>(da, …);
    cudaMemcpy(ha, da, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    ...
}
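
Kernel launches are asynchronous with respect to the host; a hedged sketch of the usual error-checking pattern (not on the slide):

mykernel<<<N/blocksize, blocksize>>>(da, …);
cudaError_t err = cudaGetLastError();      // reports launch-configuration errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();                   // wait for the kernel to finish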
CUDA
Synchronization barrier: threads wait until all threads in the block arrive at the barrier
__syncthreads()
An arriving thread increments the barrier count and the scheduler marks it as waiting. When all threads of the block have arrived at the barrier, the scheduler releases all waiting threads.
CUDA
__global__ void shift_reduce(int *inp, int N, int *tot)
{
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
    __shared__ int x[blocksize];        // blocksize: compile-time constant, threads per block
    x[tid] = (i < N) ? inp[i] : 0;      // load one element per thread; pad the tail with 0
    __syncthreads();
    // tree reduction in shared memory: halve the number of active threads each step
    for (int s = blockDim.x / 2; s > 0; s = s / 2)
    {
        if (tid < s) x[tid] += x[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(tot, x[tid]);   // thread 0 adds the block's partial sum to the global total
}
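
A possible host-side driver for this kernel (a sketch; h_inp, d_inp and d_tot are assumed names, and blocksize is the same compile-time constant, e.g. 256, that sizes the kernel's shared array):

// at file scope, shared with the kernel: const int blocksize = 256;
int nblocks = (N + blocksize - 1) / blocksize;   // round up so every element is covered

int *d_inp, *d_tot;
cudaMalloc((void**)&d_inp, N * sizeof(int));
cudaMalloc((void**)&d_tot, sizeof(int));
cudaMemcpy(d_inp, h_inp, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemset(d_tot, 0, sizeof(int));               // the running total starts at 0

shift_reduce<<<nblocks, blocksize>>>(d_inp, N, d_tot);

int total;
cudaMemcpy(&total, d_tot, sizeof(int), cudaMemcpyDeviceToHost);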
CUDA
SPMD (Single Program Multiple Data) programming model
• All threads execute the same program
• Threads coordinate with barrier synchronization
• Threads of a block express fine-grained data parallelism and thread parallelism
• Independent blocks of a grid express coarse-grained data parallelism
• Independent grids express coarse-grained task parallelism
CUDA

Scheduler
• Hardware management and scheduling of threads and thread blocks
• The scheduler has minimal runtime overhead
CUDA
Multithreading
• Memory and texture fetch latency is hundreds of processor clock cycles
• While one thread is waiting for a load or texture fetch, the processor can execute another thread
• Thousands of independent threads can keep many processors busy
CUDA
GPU Multiprocessor Architecture
• Lightweight thread creation
• Zero-overhead thread scheduling
• Fast barrier synchronization
• Each thread has its own
  • private registers
  • private per-thread memory
  • PC (program counter)
  • thread execution state
• Supports very fine-grained parallelism
CUDA

GPU Multiprocessor Architecture
Each SP core
• contains scalar integer and floating-point units
• is hardware multithreaded
• supports up to 64 threads
• is pipelined and executes one instruction per thread per clock
• has a large register file (RF): 1024 32-bit registers
• registers are partitioned among the assigned threads (programs declare their register demand; the compiler optimizes register allocation. Ex: (a) 32 registers per thread => 256 threads per block, or (b) fewer registers – more threads, or (c) more registers – fewer threads)
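
The 256-thread figure is consistent with assuming 8 SP cores per multiprocessor pooling their register files (an assumption, not stated on the slide): 8 × 1024 = 8192 registers per multiprocessor, and 8192 / 32 registers per thread = 256 threads per block.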
CUDA
Single Instruction Multiple Thread (SIMT)
SIMT: a processor architecture that applies one instruction to multiple independent threads in parallel
Warp: the set of parallel threads that execute the same instruction together in a SIMT architecture
• Warp size is 32 threads (4 threads per SP, executed in 4 clock cycles)
• Threads in a warp start at the same program address, but they can branch and execute independently
• Individual threads may be inactive due to independent branching
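
A small sketch of such branching (a hypothetical kernel, not from the lecture): when threads of one warp take different sides of the if, the warp executes both paths one after the other, with the threads on the untaken path inactive.

__global__ void divergent(int *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)    // even and odd lanes of the same warp diverge here
        out[i] = 2 * i;          // executed with odd lanes masked off
    else
        out[i] = 3 * i;          // executed with even lanes masked off
}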
CUDA

SIMT Warp Execution
• There are 4 thread lanes per SP
• An issued warp instruction executes in 4 processor cycles
• The instruction scheduler selects a warp every 4 clocks
• The controller:
  • collects thread programs into warps
  • allocates a warp
  • allocates registers for the warp threads (it can start a warp only when it can allocate the requested register count)
  • starts warp execution
  • frees the registers when all threads of the warp exit
CUDA
Streaming Processor (SP)
• Has 1024 32-bit registers (RF)
• Can perform 32-bit and 64-bit integer operations: arithmetic, comparison, conversion, logic operations
• Can perform 32-bit floating-point operations: add, multiply, min, max, multiply-add, etc.

SFU (Special Function Unit)
• Pipelined unit
• Generates one 32-bit floating-point function result per cycle: square root, sin, cos, 2^x, log2(x)
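
As a rough illustration, the fast single-precision intrinsics below are the kind of operations typically serviced by the SFU (the kernel itself is a hypothetical example):

__global__ void sfu_demo(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];
    // fast, reduced-precision hardware approximations
    out[i] = __sinf(x) + __cosf(x) + __expf(x) + __log2f(x + 1.0f);
}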
CUDA

Memory System
• Global memory – external DRAM
• Shared memory – on chip
• Per-thread local memory – external DRAM
• Constant memory – in external DRAM and cached in shared memory
• Texture memory – on chip
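
A hedged sketch of how constant memory is used (the names coeffs and apply are assumptions):

__constant__ float coeffs[16];           // constant memory: read-only from kernels

__global__ void apply(float *a, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) a[i] *= coeffs[i % 16];
}

// Host side: copy the table into constant memory before launching.
// float h_coeffs[16] = { ... };
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));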
Project
Performance Measurement, Evaluation and Prediction of Multicore and GPU Systems

Multicore systems
• CPU performance (instruction execution time, pipelining, etc.)
• Cache performance
• Performance using algorithmic structures

GPU systems (NVIDIA-CUDA)
• GPU core performance (instruction execution time, pipelining, etc.)
• Global and shared memory performance (2)
• Performance using algorithmic structures

GPU performance in the MATLAB environment