GPU Architecture & Implications
David Luebke
NVIDIA Research
GPU Architecture
CUDA provides a parallel programming model
The Tesla GPU architecture implements this
This talk will describe the characteristics, goals,
and implications of that architecture
© NVIDIA Corporation 2007
G80 GPU Implementation: Tesla C870
681 million transistors
470 mm2 in 90 nm CMOS
128 thread processors
518 GFLOPS peak
1.35 GHz processor clock
1.5 GB DRAM
76 GB/s peak
800 MHz GDDR3 clock
384-pin DRAM interface
ATX form factor card
PCI Express x16
170 W max with DRAM
Block Diagram Redux
G80 (launched Nov 2006)
128 Thread Processors execute kernel threads
Up to 12,288 parallel threads active
Per-block shared memory (PBSM) accelerates processing
[Block diagram: Host → Input Assembler → Thread Execution Manager → clusters of Thread Processors, each paired with per-block shared memory (PBSM) → Load/store → Global Memory]
Streaming Multiprocessor (SM)
Processing elements
8 scalar thread processors (SP)
32 GFLOPS peak at 1.35 GHz
8192 32-bit registers (32KB)
½ MB total register file space!
usual ops: float, int, branch, …
Hardware multithreading
up to 8 blocks resident at once
up to 768 active threads in total
16KB on-chip shared memory
low-latency storage
shared amongst threads of a block
supports thread communication
Goal: Scalability
Scalable execution
Program must be insensitive to the number of cores
Write one program for any number of SM cores
Program runs on any size GPU without recompiling
Hierarchical execution model
Decompose problem into sequential steps (kernels)
Decompose each kernel into parallel blocks
Decompose each block into parallel threads
Hardware distributes independent blocks to SMs as available (see the launch sketch below)
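Below is a minimal sketch of this hierarchy in CUDA C (the kernel, data, and sizes are illustrative, not from the talk): the grid of blocks is the unit the hardware distributes across however many SMs the GPU has.

// Hypothetical kernel: each thread scales one element of x.
__global__ void scale(float *x, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // block index × block size + thread index
    if (i < n)
        x[i] = s * x[i];
}

void scaleOnGPU(float *d_x, float s, int n)
{
    // One program for any GPU size: the hardware schedules these blocks onto SMs as they become free.
    scale<<< (n + 255) / 256, 256 >>>(d_x, s, n);
}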
Blocks Run on Multiprocessors
Kernel launched by host
[Diagram: the kernel's grid of blocks is distributed across the device processor array; each SM (MT issue unit, SPs, shared memory) runs its blocks independently, and all SMs access device memory]
Goal: easy to program
Strategies:
Familiar programming language mechanics
C/C++ with small extensions
Simple parallel abstractions
Simple barrier synchronization
Shared memory semantics
Hardware-managed hierarchy of threads
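A minimal sketch of these abstractions (the block-reversal kernel and the 256-thread block size are illustrative assumptions, not from the talk): threads of a block stage data in per-block shared memory and meet at a barrier before reading values written by other threads.

__global__ void reverseBlock(float *d_out, const float *d_in)
{
    __shared__ float tile[256];                 // per-block shared memory (assumes blockDim.x == 256)
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    tile[t] = d_in[i];                          // each thread stages one element
    __syncthreads();                            // barrier: all writes now visible to the whole block
    d_out[i] = tile[blockDim.x - 1 - t];        // read an element another thread wrote
}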
Hardware Multithreading
Hardware allocates resources to blocks
blocks need: thread slots, registers, shared memory
blocks don't run until resources are available
Hardware schedules threads
threads have their own registers
any thread not waiting for something can run
context switching is (basically) free – every cycle
Hardware relies on threads to hide latency
i.e., parallelism is necessary for performance
Goal: Performance per millimeter
For GPUs, performance == throughput
Strategy: hide latency with computation not cache
Heavy multithreading – already discussed by Kevin
Implication: need many threads to hide latency
Occupancy – typically need 128 threads/SM minimum
Multiple thread blocks/SM good to minimize effect of barriers
Strategy: Single Instruction Multiple Thread (SIMT)
Balances performance with ease of programming
SIMT Thread Execution
Groups of 32 threads formed into warps
always executing same instruction
shared instruction fetch/dispatch
some become inactive when code path diverges
hardware automatically handles divergence
Warps are the primitive unit of scheduling
pick 1 of 24 warps for each instruction slot
SIMT execution is an implementation choice
sharing control logic leaves more space for ALUs
largely invisible to programmer
must understand for performance, not correctness
SIMT Multithreaded Execution
Weaving: the original parallel thread technology is about 10,000 years old
Warp: a set of 32 parallel threads that execute a SIMD instruction
[Diagram: the SM's multithreaded instruction scheduler interleaves ready warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, …, warp 8 instruction 12, warp 3 instruction 96]
SM hardware implements zero-overhead warp and thread scheduling
Each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads
Threads can execute independently
SIMD warp automatically diverges and converges when threads branch
Best efficiency and performance when threads of a warp execute together
SIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency
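A sketch of that last point (illustrative, not from the talk): both kernels below are correct, but the first splits every 32-thread warp across two paths, so each warp executes both paths serially; the second branches on a warp-aligned condition, so no warp diverges.

// Divergent: even and odd lanes of the same warp take different paths.
__global__ void divergent(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)  x[i] *= 2.0f;
    else                       x[i] += 1.0f;
}

// Warp-aligned: all 32 threads of a warp take the same path.
__global__ void warpAligned(float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)  x[i] *= 2.0f;
    else                              x[i] += 1.0f;
}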
Memory Architecture
Direct load/store access to device memory
treated as the usual linear sequence of bytes (i.e., not pixels)
Texture & constant caches are read-only access paths
On-chip shared memory shared amongst threads of a block
important for communication amongst threads
provides low-latency temporary storage (~100x lower latency than DRAM)
[Diagram: each SM (I-cache, MT issue unit, SPs, shared memory) reaches device memory directly or through the read-only texture and constant caches; host memory is reached over PCIe]
Myths of GPU Computing
GPUs layer normal programs on top of graphics
NO: CUDA compiles directly to the hardware
GPU architectures are:
Very wide (1000s) SIMD machines… NO: warps are 32-wide
…on which branching is impossible or prohibitive… NOPE
…with 4-wide vector registers. NO: scalar thread processors
GPUs are power-inefficient
NO: 4-10x perf/W advantage, up to 89x reported for some studies
GPUs don't do real floating point
NO: see the floating point features below
GPU Floating Point Features (columns: G80 / SSE / IBM Altivec / Cell SPE)
Precision: IEEE 754 / IEEE 754 / IEEE 754 / IEEE 754
Rounding modes for FADD and FMUL: round to nearest and round to zero / all 4 IEEE (nearest, zero, inf, -inf) / round to nearest only / round to zero (truncate) only
Denormal handling: flush to zero / supported, 1000's of cycles / supported, 1000's of cycles / flush to zero
NaN support: yes / yes / yes / no
Overflow and infinity support: yes, only clamps to max norm / yes / yes / no, infinity
Flags: no / yes / yes / some
Square root: software only / hardware / software only / software only
Division: software only / hardware / software only / software only
Reciprocal estimate accuracy: 24 bit / 12 bit / 12 bit / 12 bit
Reciprocal sqrt estimate accuracy: 23 bit / 12 bit / 12 bit / 12 bit
log2(x) and 2^x estimates accuracy: 23 bit / no / 12 bit / no
Do GPUs Do Real IEEE FP?
G8x GPU FP is IEEE 754
Comparable to other processors / accelerators
More precise / usable in some ways
Less precise in other ways
GPU FP getting better every generation
Double precision support shortly
Goal: best of class by 2009
Questions?
David Luebke
dluebke@nvidia.com
Applications
&
Sweet Spots
GPU Computing Sweet Spots
Applications:
High arithmetic intensity:
Dense linear algebra, PDEs, n-body, finite difference, …
High bandwidth:
Sequencing (virus scanning, genomics), sorting, database…
Visual computing:
Graphics, image processing, tomography, machine vision…
GPU Computing Example Markets
Computational Geoscience
Computational Chemistry
Computational Medicine
Computational Modeling
Computational Science
Computational Biology
Computational Finance
Image Processing
Applications - Condensed
3D image analysis
Adaptive radiation therapy
Acoustics
Astronomy
Audio
Automobile vision
Bioinformatics
Biological simulation
Broadcast
Cellular automata
Computational Fluid Dynamics
Computer Vision
Cryptography
CT reconstruction
Data Mining
Digital cinema/projections
Electromagnetic simulation
Equity trading
Film
Financial - lots of areas
Languages
GIS
Holographic cinema
Imaging (lots)
Mathematics research
Military (lots)
Mine planning
Molecular dynamics
MRI reconstruction
Multispectral imaging
N-body
Network processing
Neural network
Oceanographic research
Optical inspection
Particle physics
Protein folding
Quantum chemistry
Ray tracing
Radar
Reservoir simulation
Robotic vision/AI
Robotic surgery
Satellite data analysis
Seismic imaging
Surgery simulation
Surveillance
Ultrasound
Video conferencing
Telescope
Video
Visualization
Wireless
X-ray
GPU Computing Sweet Spots
From cluster to workstation
The “personal supercomputing” phase change
From lab to clinic
From machine room to engineer, grad student desks
From batch processing to interactive
From interactive to real-time
GPU-enabled clusters
A 100x or better speedup changes the science
Solve at different scales
Direct brute-force methods may outperform cleverness
New bottlenecks may emerge
Approaches once inconceivable may become practical
New Applications
Real-time options implied volatility engine
Ultrasound imaging
Swaption volatility cube calculator
HOOMD Molecular Dynamics
Manifold 8 GIS
Also…
Image rotation/classification
Graphics processing toolbox
Microarray data analysis
Data parallel primitives
Astrophysics simulations
SDK: Mandelbrot, computer vision
Seismic migration
The Future of GPUs
GPU Computing drives new applications
Reducing “Time to Discovery”
100x Speedup changes science and research methods
New applications drive the future of GPUs and GPU Computing
Drives new GPU capabilities
Drives hunger for more performance
Some exciting new domains:
Vision, acoustic, and embedded applications
Large-scale simulation & physics
Accuracy
&
Performance
GPU Floating Point Features (columns: G80 / SSE / IBM Altivec / Cell SPE)
Precision: IEEE 754 / IEEE 754 / IEEE 754 / IEEE 754
Rounding modes for FADD and FMUL: round to nearest and round to zero / all 4 IEEE (nearest, zero, inf, -inf) / round to nearest only / round to zero (truncate) only
Denormal handling: flush to zero / supported, 1000's of cycles / supported, 1000's of cycles / flush to zero
NaN support: yes / yes / yes / no
Overflow and infinity support: yes, only clamps to max norm / yes / yes / no, infinity
Flags: no / yes / yes / some
Square root: software only / hardware / software only / software only
Division: software only / hardware / software only / software only
Reciprocal estimate accuracy: 24 bit / 12 bit / 12 bit / 12 bit
Reciprocal sqrt estimate accuracy: 23 bit / 12 bit / 12 bit / 12 bit
log2(x) and 2^x estimates accuracy: 23 bit / no / 12 bit / no
Do GPUs Do Real IEEE FP?
G8x GPU FP is IEEE 754
Comparable to other processors / accelerators
More precise / usable in some ways
Less precise in other ways
GPU FP getting better every generation
Double precision support shortly
Goal: best of class by 2009
CUDA Performance Advantages
Performance:
BLAS1: 60+ GB/sec
BLAS3: 127 GFLOPS
FFT: 52 benchFFT* GFLOPS
FDTD: 1.2 Gcells/sec
SSEARCH: 5.2 Gcells/sec
Black Scholes: 4.7 GOptions/sec
VMD: 290 GFLOPS
All benchmarks are compiled code!
How:
Leveraging shared memory
GPU memory bandwidth
GPU GFLOPS performance
Custom hardware intrinsics: __sinf(), __cosf(), __expf(), __logf(), …
GPGPU
vs.
GPU Computing
Problem: GPGPU
OLD: GPGPU – trick the GPU into general-purpose
computing by casting problem as graphics
Turn data into images (“texture maps”)
Turn algorithms into image synthesis (“rendering passes”)
Promising results, but:
Tough learning curve, particularly for non-graphics experts
Potentially high overhead of graphics API
Highly constrained memory layout & access model
Need for many passes drives up bandwidth consumption
Solution: CUDA
NEW: GPU Computing with CUDA
CUDA = Compute Unified Device Architecture
Co-designed hardware & software for direct GPU computing
Hardware: fully general data-parallel architecture
General thread launch
Global load-store
Parallel data cache
Scalar architecture
Integers, bit operations
Double precision (soon)
Software: program the GPU in C
Scalable data-parallel execution/memory model
C with minimal yet powerful extensions
Graphics Programming Model
Graphics Application
Vertex Program
Rasterization
Fragment Program
Display
Streaming GPGPU Programming
OpenGL program to add A and B:
Start by creating a quad
Vertex Program
Rasterization
Fragment Program: "programs" created with raster operation; read textures A and B as input to the OpenGL shader program; write the answer to texture memory as a "color"
CPU reads texture memory for results
All this just to do A + B
What’s Wrong With GPGPU?
[Pipeline diagram: Application → Input Registers → Vertex Program → Rasterization → Pixel Program (reading Texture and Constants, using Temp Registers) → Output Registers → Display]
What’s Wrong With GPGPU?
APIs are specific to graphics
Limited texture size and dimension
Limited instruction set
No thread communication
Limited local storage
Limited shader outputs
No scatter
Building a Better Pixel
[Diagram: the pixel/fragment program sees Input Registers and Texture as inputs, Constants and temporary Registers as working state, and Output Registers as its only output]
Building a Better Pixel Thread
Features
Millions of instructions
Full integer and bit instructions
No limits on branching, looping
1D, 2D, or 3D thread ID allocation
[Diagram: the thread program sees a Thread Number plus Texture, Constants, Registers, and Output Registers]
Global Memory
Features
Fully general load/store to GPU memory
Untyped, not fixed texture types
Pointer support
[Diagram: the thread program now reads and writes Global Memory directly, in addition to Texture, Constants, and Registers]
Parallel Data Cache
Features
Dedicated on-chip memory
Shared between threads for inter-thread communication
Explicitly managed
As fast as registers
[Diagram: the Parallel Data Cache sits between the thread program and Global Memory]
Example Algorithm - Fluids
Goal: Calculate PRESSURE in a fluid
Pressure = Sum of neighboring pressures
Pn’ = P1 + P2 + P3 + P4
So the pressure for each particle is…
Pressure1 = P1 + P2 + P3 + P4
Pressure2 = P3 + P4 + P5 + P6
Pressure3 = P5 + P6 + P7 + P8
Pressure4 = P7 + P8 + P9 + P10
Pressure depends on neighbors
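A literal-minded sketch of that computation (the indexing scheme is an assumption chosen to match the sums above): one thread per output pressure, with every neighbor read going to global memory. The next slides contrast how GPGPU and the parallel data cache service those overlapping reads.

// Pressure_i = p[2i] + p[2i+1] + p[2i+2] + p[2i+3]; adjacent threads re-read
// the same two pressures from DRAM -- the overfetch the parallel data cache removes.
__global__ void pressureSum(const float *p, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = p[2*i] + p[2*i + 1] + p[2*i + 2] + p[2*i + 3];
}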
Example Fluid Algorithm
[Diagram comparing three ways to compute Pn' = P1 + P2 + P3 + P4:
CPU: control logic pulls a single thread's data out of the cache and computes one sum at a time against DRAM.
GPGPU: the sums are spread across ALUs, but data and partial results make multiple passes through video memory.
GPU Computing with CUDA: the Thread Execution Manager launches many threads; pressures P1…P5 are staged once in the Parallel Data Cache as shared data, and the sums execute in parallel through that cache with a single trip to DRAM.]
Parallel Data Cache
Bring the data closer to the ALU
Addresses a fundamental problem of stream computing:
The data are far from the FLOPS, video RAM latency is high
Threads can only communicate their results through this high-latency RAM
[Diagram: in the GPGPU model, each ALU computes Pn' = P1 + P2 + P3 + P4 from operands held in video memory, so the data makes multiple passes through video memory]
Parallel Data Cache
Bring the data closer to the ALU
Stage computation for the parallel data cache
Minimize trips to external memory
Share values to minimize overfetch and computation
Increases arithmetic intensity by keeping data close to the processors
User-managed generic memory, threads read/write arbitrarily
[Diagram: the Thread Execution Manager feeds many ALUs; P1…P5 sit in the Parallel Data Cache as shared data, and each Pn' = P1 + P2 + P3 + P4 executes in parallel through the cache, with DRAM touched only once]
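A sketch of staging the same pressure sum through the parallel data cache (tile size, indexing, and padding are illustrative assumptions; assumes n is a multiple of TILE and p holds 2n + 2 entries): each pressure is loaded from DRAM once per block, the block synchronizes, and every sum then reads only on-chip values.

#define TILE 128   // threads per block (assumed)

__global__ void pressureSumPDC(const float *p, float *out, int n)
{
    __shared__ float s[2 * TILE + 2];       // tile of pressures plus a 2-element halo
    int t = threadIdx.x;
    int i = blockIdx.x * TILE + t;

    s[2*t]     = p[2*i];                    // each thread stages two pressures
    s[2*t + 1] = p[2*i + 1];
    if (t < 2)                              // two threads stage the halo
        s[2*TILE + t] = p[2*(blockIdx.x*TILE + TILE) + t];
    __syncthreads();                        // all staged data now visible to the block

    if (i < n)
        out[i] = s[2*t] + s[2*t + 1] + s[2*t + 2] + s[2*t + 3];
}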
Streaming vs. GPU Computing
Streaming (GPGPU)
Gather in, restricted write
Memory is far from ALU
No inter-element communication
GPU Computing with CUDA
More general data-parallel model
Full scatter / gather
PDC brings the data closer to the ALU
App decides how to decompose the problem across threads
Share and communicate between threads to solve problems efficiently
GPU Design
CPU/GPU Parallelism
Moore’s Law gives you more and more transistors
What do you want to do with them?
CPU strategy: make the workload (one compute thread) run as
fast as possible
Tactics:
– Cache (area limiting)
– Instruction/Data prefetch
– Speculative execution
limited by “perimeter” – communication bandwidth
…then add task parallelism…multi-core
GPU strategy: make the workload (as many threads as possible)
run as fast as possible
Tactics:
– Parallelism (1000s of threads)
– Pipelining
limited by “area” – compute capability
Background: Unified Design
[Diagram: a discrete design gives each shader stage (Shader A, B, C, D) its own unit with input and output buffers between stages; a unified design cycles all stages through a single Shader Core with shared buffers]
Hardware Implementation:
Collection of SIMT Multiprocessors
Each multiprocessor is a set of SIMT thread processors
Single Instruction Multiple Thread
Each thread processor has:
program counter, register file, etc.
scalar data path
read/write memory access
Unit of SIMT execution: warp
execute same instruction/clock
Hardware handles thread scheduling and divergence transparently
Warps enable a friendly data-parallel programming model!
[Diagram: the device contains multiprocessors 1…N; each holds processors 1…M driven by a single instruction unit]
Hardware Implementation:
Memory Architecture
The device has local device memory
Can be read and written by the host and by the multiprocessors
Each multiprocessor has:
A set of 32-bit registers per processor
On-chip shared memory
A read-only constant cache
A read-only texture cache
[Diagram: each multiprocessor pairs per-processor registers with shared memory, an instruction unit, and constant and texture caches, all backed by device memory]
Hardware Implementation:
Memory Model
Each thread can:
Read/write per-block on-chip shared memory
Read per-grid cached constant memory
Read/write non-cached device memory:
Per-grid global memory
Per-thread local memory
Read cached texture memory
[Diagram: blocks (0,0) and (1,0) of a grid each have their own shared memory plus per-thread registers and local memory; all threads reach global, constant, and texture memory]
CUDA
Programming
CUDA SDK
Libraries: FFT, BLAS, …
Example source code
[Toolchain diagram: integrated CPU and GPU C source code goes to the NVIDIA C compiler, which emits NVIDIA assembly for computing (run on the GPU through the CUDA driver, with debugger and profiler) and CPU host code (built with a standard C compiler and run on the CPU)]
CUDA: Features available to kernels
Standard mathematical functions
sinf, powf, atanf, ceil, etc.
Built-in vector types
float4, int4, uint4, etc. for dimensions 1..4
Texture accesses in kernels
texture<float,2> my_texture; // declare texture reference
float4 texel = texfetch(my_texture, u, v);
G8x CUDA = C with Extensions
Philosophy: provide minimal set of extensions necessary to expose power
Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }
Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__   float MySharedArray[32];
Execution configuration:
dim3 dimGrid(100, 50);  // 5000 thread blocks
dim3 dimBlock(4, 8, 8); // 256 threads per block
MyKernel <<< dimGrid, dimBlock >>> (...); // Launch kernel
Built-in variables and functions valid in device code:
dim3 gridDim;   // Grid dimension
dim3 blockDim;  // Block dimension
dim3 blockIdx;  // Block index
dim3 threadIdx; // Thread index
void __syncthreads(); // Thread synchronization
CUDA: Runtime support
Explicit memory allocation returns pointers to GPU memory
cudaMalloc(), cudaFree()
Explicit memory copy for host ↔ device, device ↔ device
cudaMemcpy(), cudaMemcpy2D(), ...
Texture management
cudaBindTexture(), cudaBindTextureToArray(), ...
OpenGL & DirectX interoperability
cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), …
Example: Adding matrices w/ 2D grids
CPU C program:

void addMatrix(float *a, float *b, float *c, int N)
{
  int i, j, index;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      index = i + j * N;
      c[index] = a[index] + b[index];
    }
  }
}

void main()
{
  .....
  addMatrix(a, b, c, N);
}

CUDA C program:

__global__ void addMatrix(float *a, float *b, float *c, int N)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  int j = blockIdx.y*blockDim.y + threadIdx.y;
  int index = i + j * N;
  if (i < N && j < N)
    c[index] = a[index] + b[index];
}

void main()
{
  ..... // allocate & transfer data to GPU
  dim3 dimBlk(blocksize, blocksize);
  dim3 dimGrd(N/dimBlk.x, N/dimBlk.y);
  addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
C[i] = A[i] + B[i];
}
Example: Invoking the Kernel
__global__ void vecAdd(float* A, float* B, float* C);
void main()
{
// Execute on N/256 blocks of 256 threads each
vecAdd<<< N/256, 256>>>(d_A, d_B, d_C);
}
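The launch above assumes N is a multiple of 256. A common variant (not in the talk) rounds the block count up and guards the kernel so any N works:

// Guarded kernel: the last block may be partially full, so pass n explicitly.
__global__ void vecAddGuarded(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        C[i] = A[i] + B[i];
}

// Round the block count up so every element is covered:
// vecAddGuarded<<< (N + 255) / 256, 256 >>>(d_A, d_B, d_C, N);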
Example: Host code for memory
// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc( (void**) &d_A, N * sizeof(float));
cudaMalloc( (void**) &d_B, N * sizeof(float));
cudaMalloc( (void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy( d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy( d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
A quick review
device = GPU = set of multiprocessors
Multiprocessor = set of processors & shared
memory
Kernel = GPU program
Grid = array of thread blocks that execute a kernel
Thread block = group of SIMD threads that execute
a kernel and can communicate via shared memory
Memory    Location   Cached           Access       Who
Local     Off-chip   No               Read/write   One thread
Shared    On-chip    N/A – resident   Read/write   All threads in a block
Global    Off-chip   No               Read/write   All threads + host
Constant  Off-chip   Yes              Read         All threads + host
Texture   Off-chip   Yes              Read         All threads + host
Data-Parallel
Programming
Scan Literature
Pre-Hibernation
First proposed in APL by Iverson (1962)
Used as a data parallel primitive in the Connection Machine (1990)
Feature of C* and CM-Lisp
Guy Blelloch used scan as a primitive for various parallel algorithms; his balanced-tree scan is used in the example here
Blelloch, 1990, “Prefix Sums and Their Applications”
Post-Democratization
O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)
Applied to Summed Area Tables by Hensley et al. (EG05)
O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)
O(n) work & space GPU implementation by Harris et al. (2007)
NVIDIA CUDA SDK and GPU Gems 3
Applied to radix sort, stream compaction, and summed area tables
Parallel Reduction Complexity
log(N) parallel steps; each step S does N/2^S independent ops
Step Complexity is O(log N)
For N = 2^D, performs ∑_{S=1..D} 2^(D−S) = N−1 operations
Work Complexity is O(N) – it is work-efficient
i.e. does not perform more operations than a sequential algorithm
With P threads physically in parallel (P processors),
time complexity is O(N/P + log N)
Compare to O(N) for sequential reduction
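For reference, a minimal sketch of the baseline shared-memory reduction that the next two slides optimize (not shown in the deck; the names data, t, and bd match the following slides, the rest is an assumption). Launch with bd threads per block and bd*sizeof(int) bytes of dynamic shared memory.

__global__ void reduce(const int *g_idata, int *g_odata)
{
    extern __shared__ int data[];                // one element per thread
    unsigned int t  = threadIdx.x;
    unsigned int bd = blockDim.x;
    data[t] = g_idata[blockIdx.x * bd + t];      // load one element per thread
    __syncthreads();

    // Tree reduction: each step halves the number of active threads.
    for (unsigned int s = bd / 2; s > 0; s >>= 1) {
        if (t < s)
            data[t] += data[t + s];
        __syncthreads();
    }
    if (t == 0)
        g_odata[blockIdx.x] = data[0];           // one partial sum per block
}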
Unrolling Last Steps
Only one warp is active during the last few steps
Unroll them and remove unneeded
__syncthreads()
for (unsigned int s = bd/2; s > 32; s >>= 1)
{
if (t < s) {
data[t] += data[t + s];
}
__syncthreads();
}
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t < 8) data[t] += data[t + 8];
if (t < 4) data[t] += data[t + 4];
if (t < 2) data[t] += data[t + 2];
if (t < 1) data[t] += data[t + 1];
Unrolling the Loop Completely
When block size is known at compile time, we can completely unroll the loop
It often is, since the maximum thread block size of 512 constrains us
Use templates…

#define STEP(d) \
  if (t < (d)) data[t] += data[t+(d)];
#define SYNC __syncthreads();

template <unsigned int bsize>
__global__ void d_reduce(int *g_idata, int *g_odata)
{ ...
  if (bsize == 512) STEP(512) SYNC
  if (bsize >= 256) STEP(256) SYNC
  if (bsize >= 128) STEP(128) SYNC
  if (bsize >= 64)  STEP(64)  SYNC
  if (bsize >= 32)  { STEP(32) STEP(16) STEP(8) STEP(4) STEP(2) STEP(1) }
}
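Since the template parameter must be a compile-time constant, the host picks the instantiation when it launches the kernel. A sketch of the usual dispatch (threads, blocks, smem, d_in, and d_out are hypothetical host variables):

// Instantiate d_reduce for the block size actually chosen at runtime.
switch (threads) {
    case 512: d_reduce<512><<<blocks, 512, smem>>>(d_in, d_out); break;
    case 256: d_reduce<256><<<blocks, 256, smem>>>(d_in, d_out); break;
    case 128: d_reduce<128><<<blocks, 128, smem>>>(d_in, d_out); break;
    case  64: d_reduce< 64><<<blocks,  64, smem>>>(d_in, d_out); break;
}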
GPU Computing
Motivation
Computing Challenge
Task Computing
Data Computing
Extreme Growth in Raw Data
[Charts: Walmart transaction tracking, in millions (Source: Hedburg, CPI, Walmart); YouTube bandwidth growth (Source: Alexa, YouTube 2006); BP oil and gas active data, in terabytes (Source: Jim Farnsworth, BP May 2005); NOAA/NASA weather data, in petabytes, 2002–2017 (Source: John Bates, NOAA Nat. Climate Center)]
Computational Horsepower
GPU is a massively parallel computation engine
High memory bandwidth (5-10x CPU)
High floating-point performance (5-10x CPU)
Benchmarking: CPU vs. GPU Computing
G80 vs. Core2 Duo 2.66 GHz
Measured against commercial CPU benchmarks when possible
“Free” Massively Parallel Processors
It’s not science fiction, it’s just funded by them
Success Stories
Success Stories: Data to Design
Acceleware EM Field simulation technology for the GPU
3D Finite-Difference and Finite-Element (FDTD)
Modeling of:
Cell phone irradiation
MRI Design / Modeling
Printed Circuit Boards
Radar Cross Section (Military)
Pacemaker with Transmit Antenna
[Chart, performance in Mcells/s: CPU 3.2 GHz = 1X baseline; 1 GPU ≈ 5X; 2 GPUs ≈ 10X; 4 GPUs ≈ 20X]
EvolvedMachines
130X Speed up
Simulate brain circuitry
Sensory computing: vision, olfactory
Matlab: Language of Science
10X with MATLAB CPU+GPU
Pseudo-spectral simulation of 2D Isotropic turbulence
http://developer.nvidia.com/object/matlab_cuda.html
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
MATLAB Example:
Advection of an elliptic vortex
256x256 mesh, 512 RK4 steps, Linux, MATLAB file
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m
Matlab
168 seconds
Matlab with CUDA
(single precision FFTs)
20 seconds
MATLAB Example:
Pseudo-spectral simulation of 2D Isotropic turbulence
512x512 mesh, 400 RK4 steps, Windows XP, MATLAB file
http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m
MATLAB
992 seconds
MATLAB with CUDA
(single precision FFTs)
93 seconds
NAMD/VMD Molecular Dynamics
240X speedup
Computational biology
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/
Molecular Dynamics Example
Case study: molecular dynamics research at U. Illinois Urbana-Champaign
(Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu)
Next slides stolen from a nice description of problem, algorithms, and iterative optimization process available at:
http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/
Molecular Modeling: Ion Placement
Biomolecular simulations attempt to replicate in vivo conditions in silico
Model structures are initially constructed in vacuum
Solvent (water) and ions are added as necessary for the required biological conditions
Computational requirements scale with the size of the simulated structure
Evolution of Ion Placement Code
First implementation was sequential
Virus structure with 10^6 atoms would require 10 CPU days
Tuned for Intel C/C++ vectorization + SSE, ~20x speedup
Parallelized w/ pthreads: high data parallelism = linear speedup
Parallelized, GPU-accelerated implementation: 3 GeForce 8800 GTX cards outrun ~300 Itanium2 CPUs!
Virus structure now runs in 25 seconds on 3 GPUs!
Further speedups should still be possible…
Multi-GPU CUDA
Coulombic Potential Map Performance
Host: Intel Core 2 Quad, 8GB RAM, ~$3,000
3 GPUs: NVIDIA GeForce 8800 GTX, ~$550 each
32-bit RHEL4 Linux (want 64-bit CUDA!!)
235 GFLOPS per GPU for current version of coulombic potential map kernel
705 GFLOPS total for multithreaded multi-GPU version
Three GeForce 8800 GTX GPUs in a single machine, cost ~$4,650
Professor Partnership
NVIDIA Professor Partnership
Support faculty research & teaching efforts
Small equipment gifts (1-2 GPUs)
Significant discounts on GPU purchases
Especially Quadro, Tesla equipment
Useful for cost matching
Research contracts
Small cash grants (typically ~$25K gifts)
Medium-scale equipment donations
(10-30 GPUs)
Easy
Competitive
Informal proposals, reviewed quarterly
Focus areas: GPU computing, especially with an
educational mission or component
http://www.nvidia.com/page/professor_partnership.html