Introduction to CUDA (1 of n*)
Joseph Kider
University of Pennsylvania
CIS 565 - Spring 2011
* Where n is 2 or 3
Agenda
- GPU architecture review
- CUDA
  - First of two or three dedicated classes
Acknowledgements
- Many slides are from Kayvon Fatahalian's "From Shader Code to a Teraflop: How GPU Shader Cores Work":
  http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
- David Kirk and Wen-mei Hwu's UIUC course:
  http://courses.engr.illinois.edu/ece498/al/
GPU Architecture Review
- GPUs are:
  - Parallel
  - Multithreaded
  - Many-core
- GPUs have:
  - Tremendous computational horsepower
  - High memory bandwidth
GPU Architecture Review
- GPUs are specialized for
  - Compute-intensive, highly parallel computation
  - Graphics!
- Transistors are devoted to:
  - Processing
  - Not: data caching and flow control
GPU Architecture Review
Transistor Usage
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
[Eleven architecture slides follow, from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf]
Threading Hardware in G80
Sources
- Slides by ECE 498 AL: Programming Massively Parallel Processors, Wen-mei Hwu
- John Nickolls, NVIDIA
Fixed-function pipeline
[Diagram: 3D application or game → 3D API commands (OpenGL or Direct3D) → CPU-GPU boundary (AGP/PCIe) → GPU front end (command & data stream) → vertex processor (pre-transformed vertices in, transformed vertices out) → primitive assembly → rasterization & interpolation (pre-transformed fragments) → fragment processor (transformed fragments) → raster operations → frame buffer (pixel updates)]
Programmable pipeline
[Diagram: the same pipeline, with a programmable vertex processor and a programmable fragment processor in place of the fixed-function vertex and fragment stages]
Unified Programmable pipeline
[Diagram: the same pipeline, with a single unified vertex, fragment, and geometry processor replacing the separate programmable stages]
General Diagram (6800/NV40)
TurboCache
- Uses PCI-Express bandwidth to render directly to system memory
  - Card needs less memory
  - Performance boost while lowering cost
- TurboCache Manager dynamically allocates from main memory
- Local memory used to cache data and to deliver peak performance when needed
NV40 Vertex Processor
An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle.
NV40 Fragment Processors
Early termination from mini z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units.
Why the NV40 series was better
- Massive parallelism
- Scalability
  - Lower-end products have fewer pixel pipes and fewer vertex shader units
Computation Power
- 222 million transistors
- First to comply with Microsoft's DirectX 9 spec
- Dynamic branching in pixel shaders
Dynamic Branching
- Helps detect if a pixel needs shading
- Instruction flow handled in groups of pixels
- Specify branch granularity (the number of consecutive pixels that take the same branch)
- Better distribution of blocks of pixels between the different quad engines
General Diagram (7800/G70)
General Diagram (7800/G70)
General Diagram (6800/NV40)
GeForce Go 7800 – Power Issues
- Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
- Dynamic clock scaling can run as slow as 16 MHz
  - This is true for the engine, memory, and pixel clocks
- Heavier use of clock gating than the desktop version
- Runs at voltages lower than any other mobile performance part
- Regardless, you won't get much battery-based runtime for a 3D game
GeForce 7800 GTX Parallelism
[Diagram: 8 vertex engines → Z-cull → triangle setup/raster → shader instruction dispatch → 24 pixel shaders → fragment crossbar → 16 raster operation pipelines → 4 memory partitions]
G80 – Graphics Mode
- The future of GPUs is programmable processing
- So – build the architecture around the processor
[Diagram: host → input assembler → setup/raster/z-cull → vertex, geometry, and pixel thread issue → arrays of streaming processors (SP) with texture fetch (TF) units and L1 caches → thread processor → L2 caches and frame buffer (FB) partitions]
G80 CUDA Mode – A Device Example
- Processors execute computing threads
- New operating mode/HW interface for computing
[Diagram: host → input assembler → thread execution manager → streaming processors with parallel data caches and texture units → load/store paths to global memory]
Why Use the GPU for Computing?
- The GPU has evolved into a very flexible and powerful processor:
  - It's programmable using high-level languages
  - It supports 32-bit floating point precision
  - It offers lots of GFLOPS
- GPU in every PC and workstation
[GFLOPS chart legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]
What is Behind such Evolution?
- The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about)
  - So, more transistors can be devoted to data processing rather than data caching and flow control
[Diagram: a CPU die devotes large areas to control logic and cache, with a few ALUs and DRAM; a GPU die devotes most of its area to ALUs, backed by DRAM]
- The fast-growing video game industry exerts strong economic pressure for constant innovation
What is (Historical) GPGPU?
- General-purpose computation using the GPU and graphics API in applications other than 3D graphics
  - GPU accelerates the critical path of the application
- Data-parallel algorithms leverage GPU attributes
  - Large data arrays, streaming throughput
  - Fine-grain SIMD parallelism
  - Low-latency floating point (FP) computation
- Applications – see GPGPU.org
Previous GPGPU Constraints
- Dealing with the graphics API
  - Working with the corner cases of the graphics API
- Addressing modes
  - Limited texture size/dimension
- Shader capabilities
  - Limited outputs
- Instruction sets
  - Lack of integer & bit ops
- Communication limited
  - Between pixels
  - Scatter a[i] = p
[Diagram: a fragment program reads input registers, texture, and constants, uses temp registers, and writes output registers to the FB; memory is partitioned per thread, per shader, and per context]
An Example of Physical Reality Behind CUDA
[Diagram: CPU (host) connected to a GPU with local DRAM (device)]
Arrays of Parallel Threads
- A CUDA kernel is executed by an array of threads
  - All threads run the same code (SPMD)
  - Each thread has an ID that it uses to compute memory addresses and make control decisions

threadID: 0 1 2 3 4 5 6 7

…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…
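The snippet above is only the kernel body. A minimal, self-contained sketch of the same SPMD pattern (kernel name, element function, and launch configuration are illustrative assumptions, not from the slides):

// One thread block; each thread processes the element selected by its ID.
__device__ float func(float x) { return 2.0f * x; }   // stand-in for the real per-element function

__global__ void apply_func(const float* input, float* output)
{
    int threadID = threadIdx.x;      // thread ID selects this thread's element
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
}

// Example launch with 8 threads, matching the thread IDs 0-7 shown above:
// apply_func<<<1, 8>>>(d_input, d_output);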
Thread Blocks: Scalable Cooperation
- Divide the monolithic thread array into multiple blocks
  - Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
  - Threads in different blocks cannot cooperate
[Diagram: Thread Block 0 through Thread Block N - 1, each with threadIDs 0-7 running the same body:
  float x = input[threadID];
  float y = func(x);
  output[threadID] = y; ]
Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks
  - All threads share data memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free shared memory accesses
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Diagram: the host launches Kernel 1 on Grid 1 (Blocks (0,0)-(2,1)) and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5x3 array of Threads (0,0)-(4,2)]
Courtesy: NVIDIA
Block and Thread IDs
- Threads and blocks have IDs
  - So each thread can decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - …
[Diagram: device grid of Blocks (0,0)-(2,1); Block (1,1) is expanded into a 5x3 array of Threads (0,0)-(4,2)]
Courtesy: NVIDIA
CUDA Device Memory Space Overview
- Each thread can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can R/W global, constant, and texture memories
[Diagram: a (device) grid with Blocks (0,0) and (1,0); each block has shared memory plus per-thread registers and local memory; global, constant, and texture memories are shared across the grid and accessible from the host]
Global, Constant, and Texture Memories (Long-Latency Accesses)
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
- Texture and constant memories
  - Constants initialized by the host
  - Contents visible to all threads
[Diagram: the same device memory-space figure as the previous slide]
Courtesy: NVIDIA
Block IDs and Thread IDs
- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
[Figure 3.2, "An Example of CUDA Thread Organization": the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Block (1,1) is expanded into a 3D array of threads]
Courtesy: NVIDIA
CUDA Memory Model Overview
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long-latency access
- We will focus on global memory for now
[Diagram: a grid with Blocks (0,0) and (1,0), each with shared memory and per-thread registers; global memory is shared with the host]
Parallel Computing on a GPU
- 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
  - Available in laptops, desktops, and clusters
- GPU parallelism is doubling every year
- Programming model scales transparently
- Programmable in C with CUDA
[Images: GeForce 8800, Tesla D870, Tesla S870]
Single-Program Multiple-Data (SPMD)
- CUDA integrated CPU + GPU application C program
  - Serial C code executes on the CPU
  - Parallel kernel C code executes on GPU thread blocks
[Diagram: CPU serial code, then GPU parallel kernel KernelA<<< nBlk, nTid >>>(args) on Grid 0, more CPU serial code, then GPU parallel kernel KernelB<<< nBlk, nTid >>>(args) on Grid 1]
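A minimal sketch of this interleaving; the kernel bodies and launch parameters are illustrative assumptions standing in for the KernelA/KernelB calls in the diagram:

#include <cuda_runtime.h>

__global__ void KernelA(float* d) { d[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }
__global__ void KernelB(float* d) { d[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }

void run(float* d_data, int nBlk, int nTid)
{
    // ... serial C code on the CPU ...
    KernelA<<<nBlk, nTid>>>(d_data);   // Grid 0: nBlk blocks of nTid threads on the GPU
    // ... more serial C code on the CPU ...
    KernelB<<<nBlk, nTid>>>(d_data);   // Grid 1
    cudaDeviceSynchronize();           // wait for the device before using results on the host
}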
Grids and Blocks
- A kernel is executed as a grid of thread blocks
  - All threads share global memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution using barriers
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Figure 3.2, "An Example of CUDA Thread Organization": the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Block (1,1) is expanded into a 3D array of threads]
Courtesy: NVIDIA
CUDA Thread Block
- Programmer declares a (thread) block:
  - Block size: 1 to 512 concurrent threads
  - Block shape: 1D, 2D, or 3D
  - Block dimensions in threads
- All threads in a block execute the same thread program
- Threads share data and synchronize while doing their share of the work
- Threads have thread ID numbers within the block
[Diagram: a CUDA thread block with thread IDs 0, 1, 2, 3, ..., m running one thread program]
Courtesy: John Nickolls, NVIDIA
GeForce-8 Series HW Overview
[Diagram: the Streaming Processor Array (SPA) is built from Texture Processor Clusters (TPC); each TPC contains a TEX unit and two Streaming Multiprocessors (SM); each SM contains an instruction L1, a data L1, instruction fetch/dispatch, shared memory, 8 streaming processors (SP), and 2 SFUs]
CUDA Processor Terminology
- SPA: Streaming Processor Array (variable across the GeForce 8 series; 8 in the GeForce 8800)
- TPC: Texture Processor Cluster (2 SM + TEX)
- SM: Streaming Multiprocessor (8 SP)
  - Multi-threaded processor core
  - Fundamental processing unit for a CUDA thread block
- SP: Streaming Processor
Streaming Multiprocessor (SM)
- Streaming Multiprocessor (SM)
  - 8 Streaming Processors (SP)
  - 2 Super Function Units (SFU)
- Multi-threaded instruction dispatch
  - 1 to 512 threads active
  - Shared instruction fetch per 32 threads
  - Cover latency of texture/memory loads
- 20+ GFLOPS
- 16 KB shared memory
- Texture and global memory access
[Diagram: SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs]
G80 Thread Computing Pipeline
- Processors execute computing threads
- Alternative operating mode specifically for computing
- Generates thread grids based on kernel calls
[Diagram: the G80 graphics pipeline (host, input assembler, setup/raster/z-cull, vertex/geometry/pixel thread issue, SPs with TF/L1, L2, FB) redrawn for computing: host, input assembler, thread execution manager, SPs with parallel data caches and texture units, and load/store paths to global memory]
Thread Life Cycle in HW
- Grid is launched on the SPA
- Thread blocks are serially distributed to all the SMs
  - Potentially >1 thread block per SM
- Each SM launches warps of threads
  - 2 levels of parallelism
- SM schedules and executes warps that are ready to run
- As warps and thread blocks complete, resources are freed so the SPA can distribute more thread blocks
[Diagram: the host launches Kernel 1 on Grid 1 (Blocks (0,0)-(2,1)) and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5x3 array of threads]
SM Executes Blocks
- Threads are assigned to SMs at block granularity
  - Up to 8 blocks per SM, as resources allow
  - An SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) * 3 blocks
    - Or 128 (threads/block) * 6 blocks, etc.
- Threads run concurrently
[Diagram: SM 0 and SM 1 (each with MT IU, SPs, and shared memory) holding blocks; TF with texture L1; L2 and memory below]
Thread Scheduling/Execution
- Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 * 3 = 24 warps
[Diagram: Block 1 warps and Block 2 warps (t0 t1 t2 ... t31) feeding the SM's instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs]
SM Warp Scheduling
- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
- 4 clock cycles needed to dispatch the same instruction for all threads in a warp in G80
[Diagram: SM multithreaded warp scheduler timeline – warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96]
SM Instruction Buffer – Warp Scheduling
- Fetch one warp instruction/cycle
  - From the instruction L1 cache
  - Into any instruction buffer slot
- Issue one "ready-to-go" warp instruction/cycle
  - From any warp / instruction buffer slot
  - Operand scoreboarding used to prevent hazards
- Issue selection based on round-robin/age of warp
[Diagram: SM with I$ (L1), multithreaded instruction buffer, register file (RF), C$ (L1), shared memory, operand select, MAD and SFU units]
Scoreboarding
- All register operands of all instructions in the instruction buffer are scoreboarded
  - An instruction becomes ready after the needed values are deposited
  - Prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled memory/processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
  - Allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops
[Diagram: warp issue timeline interleaving TB1, TB2, and TB3 warps (W1-W3), with stalls for TB1 W1, TB2 W1, and TB3 W2; TB = thread block, W = warp]
Granularity Considerations
- For matrix multiplication, should I use 4x4, 8x8, 16x16, or 32x32 tiles?
  - For 4x4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM!
    - There are 8 warps, but each warp is only half full.
  - For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, it could take up to 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
    - There are 16 warps available for scheduling in each SM
    - Each warp spans four slices in the y dimension
  - For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity unless other resource considerations overrule.
    - There are 24 warps available for scheduling in each SM
    - Each warp spans two slices in the y dimension
Memory Hardware in G80
CUDA Device Memory Space: Review
- Each thread can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can R/W global, constant, and texture memory
[Diagram: the same device memory-space figure as before]
Parallel Memory Sharing
- Local memory: per-thread
  - Private per thread
  - Auto variables, register spill
- Shared memory: per-block
  - Shared by threads of the same block
  - Inter-thread communication
- Global memory: per-application
  - Shared by all threads
  - Inter-grid communication
[Diagram: a thread with its local memory; a block with its shared memory; Grid 0 and Grid 1 sharing global memory, with grids running sequentially in time]
SM Memory Architecture
- Threads in a block share data & results
  - In memory and shared memory
  - Synchronize at barrier instruction
- Per-block shared memory allocation
  - Keeps data close to the processor
[Diagram: SM 0 and SM 1 (each with MT IU, SPs, and shared memory) holding blocks; TF with texture L1; L2 and memory below]
Courtesy: John Nickolls, NVIDIA
SM Register File
- Register file (RF)
  - 32 KB (8K entries) for each SM in G80
- TEX pipe can also read/write RF
  - 2 SMs share 1 TEX
- Load/store pipe can also read/write RF
[Diagram: I$ (L1), multithreaded instruction buffer, register file (RF), C$ (L1), shared memory, operand select, MAD and SFU units]
Programmer View of Register File
- There are 8192 registers in each SM in G80
  - This is an implementation decision, not part of CUDA
  - Registers are dynamically partitioned across all blocks assigned to the SM
  - Once assigned to a block, a register is not accessible by threads in other blocks
[Diagram: the register file partitioned among 3 blocks vs. 4 blocks]
Matrix Multiplication Example
- If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
  - Each block requires 10*256 = 2560 registers
  - 8192 = 3 * 2560 + change
  - So, three blocks can run on an SM as far as registers are concerned
- How about if each thread increases its register use by 1?
  - Each block now requires 11*256 = 2816 registers
  - 8192 < 2816 * 3, so only two blocks can run on the SM
More on Dynamic Partitioning
- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each, or a larger number of threads that require few registers each
    - This allows for finer-grain threading than traditional CPU threading models
  - The compiler can trade off between instruction-level parallelism and thread-level parallelism
Let's program this thing!
GPU Computing History
- 2001/2002 – researchers see the GPU as a data-parallel coprocessor
  - The GPGPU field is born
- 2007 – NVIDIA releases CUDA
  - CUDA – Compute Unified Device Architecture
  - GPGPU shifts to GPU Computing
- 2008 – Khronos releases the OpenCL specification
CUDA Abstractions
- A hierarchy of thread groups
- Shared memories
- Barrier synchronization
CUDA Terminology
- Host – typically the CPU
  - Code written in ANSI C
- Device – typically the GPU (data-parallel)
  - Code written in extended ANSI C
- Host and device have separate memories
- CUDA program
  - Contains both host and device code
CUDA Terminology
- Kernel – data-parallel function
  - Invoking a kernel creates lightweight threads on the device
    - Threads are generated and scheduled with hardware
- Does a kernel remind you of a shader in OpenGL?
CUDA Kernels
- Executed N times in parallel by N different CUDA threads
[Code figure callouts: declaration specifier, thread ID, execution configuration – see the sketch below]
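The three callouts refer to pieces of a kernel definition and launch. A minimal sketch (the vector-add kernel and the names d_A, d_B, d_C are illustrative, not the slide's code):

// Declaration specifier: __global__ marks a function as a kernel that runs on the device.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;        // Thread ID: each of the N threads handles one element
    C[i] = A[i] + B[i];
}

// Execution configuration: <<<blocks, threads>>> runs the kernel N times in parallel.
// VecAdd<<<1, N>>>(d_A, d_B, d_C);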
CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
- Grid – one or more thread blocks
  - 1D or 2D
- Block – array of threads
  - 1D, 2D, or 3D
  - Each block in a grid has the same number of threads
  - Each thread in a block can
    - Synchronize
    - Access shared memory

Thread Hierarchies
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
- Block – 1D, 2D, or 3D
  - Example: index into a vector, matrix, or volume
Thread Hierarchies
- Thread ID: scalar thread identifier
- Thread index: threadIdx
- 1D: thread ID == thread index
- 2D with size (Dx, Dy)
  - Thread ID of index (x, y) == x + y Dx
- 3D with size (Dx, Dy, Dz)
  - Thread ID of index (x, y, z) == x + y Dx + z Dx Dy
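The same flattening written as device code, as a sketch (the helper name is an assumption):

__device__ int flatThreadId2D()
{
    // For a block of size (Dx, Dy) = (blockDim.x, blockDim.y),
    // thread (x, y) has ID x + y * Dx.
    return threadIdx.x + threadIdx.y * blockDim.x;
}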
Thread Hierarchies
[Figure: one thread block laid out as a 2D block, with a 2D index per thread]
Thread Hierarchies
- Thread block
  - Group of threads
    - G80 and GT200: up to 512 threads
    - Fermi: up to 1024 threads
  - Reside on the same processor core
  - Share the memory of that core
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
- Block index: blockIdx
- Block dimension: blockDim
  - 1D or 2D
Thread Hierarchies
[Figure: a 2D thread block with 16x16 threads per block]
Thread Hierarchies
- Example: N = 32
  - 16x16 threads per block (independent of N)
    - threadIdx ([0, 15], [0, 15])
  - 2x2 thread blocks in the grid
    - blockIdx ([0, 1], [0, 1])
    - blockDim = 16
  - i = [0, 1] * 16 + [0, 15]
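A sketch of the same index computation inside a kernel (the kernel and array names are assumptions):

__global__ void process32x32(float* data, int N)   // N = 32 in this example
{
    // blockIdx in [0,1] x [0,1], blockDim = 16, threadIdx in [0,15] x [0,15]
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // column index in [0, 31]
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row index in [0, 31]
    if (i < N && j < N)
        data[j * N + i] *= 2.0f;
}

// Launch: dim3 blocks(2, 2); dim3 threads(16, 16);
// process32x32<<<blocks, threads>>>(d_data, 32);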
Thread Hierarchies
- Thread blocks execute independently
  - In any order: parallel or series
  - Scheduled in any order by any number of cores
    - Allows code to scale with core count
Thread Hierarchies
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Thread Hierarchies
- Threads in a block
  - Share (limited) low-latency memory
  - Synchronize execution
    - To coordinate memory accesses
    - __syncthreads()
      - Barrier – threads in the block wait until all threads reach this point
      - Lightweight

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
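A minimal sketch of block-level sharing plus the barrier (the in-block reversal is an illustrative example, not from the slides):

__global__ void reverse_in_block(float* data)
{
    __shared__ float tile[256];        // low-latency memory shared by the block
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();                   // barrier: all writes to tile complete before any read
    data[t] = tile[blockDim.x - 1 - t];
}

// Launch with a single block of up to 256 threads, e.g. reverse_in_block<<<1, 256>>>(d_data);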
CUDA Memory Transfers
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
- Host can transfer to/from device
  - Global memory
  - Constant memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
- cudaMalloc()
  - Allocates global memory on the device
- cudaFree()
  - Frees device global memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
CUDA Memory Transfers
[Code figure callouts for cudaMalloc(): pointer to device memory, size in bytes – see the sketch below]
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
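A sketch of the allocate/free pattern matching the callouts above (the variable name and size are assumptions):

float* d_A = NULL;
size_t bytes = 256 * sizeof(float);
cudaMalloc((void**)&d_A, bytes);   // allocate global memory on the device
// ... use d_A in kernels ...
cudaFree(d_A);                     // release the device allocation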
CUDA Memory Transfers
- cudaMemcpy()
  - Memory transfer
    - Host to host
    - Host to device
    - Device to host
    - Device to device
- Does this remind you of VBOs in OpenGL?
[Diagram: host and device, with the device's global memory]
- Note: plain cudaMemcpy() blocks the host until the copy completes; cudaMemcpyAsync() provides asynchronous transfers
CUDA Memory Transfers
- Host-to-device copy
[Code figure callouts: destination (device pointer), source (host pointer) – see the sketch below]
[Diagram: host and device, with the device's global memory]
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
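A sketch of the copy matching the callouts above (pointer names and byte count are assumptions):

// Host-to-device: destination is a device pointer, source is a host pointer.
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
// ... launch kernels that read/write d_A ...
// Device-to-host brings results back:
cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);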
Matrix Multiply
- P = M * N
- Assume M and N are square for simplicity
- Is this data-parallel?
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply
- 1,000 x 1,000 matrix
  - 1,000,000 dot products
  - Each: 1,000 multiplies and 1,000 adds
Matrix Multiply: CPU Implementation

void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j)
        {
            float sum = 0;
            for (int k = 0; k < width; ++k)
            {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}

Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt
Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Matrix Multiply
- Step 1
  - Add CUDA memory transfers to the skeleton (see the sketch after the data-transfer callouts below)
Matrix Multiply: Data Transfer
[Code figure callouts: allocate the input matrices on the device; allocate the output matrix; read the result back from the device]
- Does this remind you of GPGPU with GLSL?
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
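The slide's code figure is not reproduced here; a sketch of what Step 1 adds, in the textbook's style (function and variable names such as MatrixMulOnDevice, Md, Nd, Pd are assumptions):

void MatrixMulOnDevice(float* M, float* N, float* P, int width)
{
    int size = width * width * sizeof(float);
    float *Md, *Nd, *Pd;

    // Allocate and load the input matrices on the device.
    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**)&Nd, size);
    cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

    // Allocate the output matrix on the device.
    cudaMalloc((void**)&Pd, size);

    // (Step 2/3: kernel definition and launch go here.)

    // Read the result back and free device memory.
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
    cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}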
Matrix Multiply
- Step 2
  - Implement the kernel in CUDA C (see the sketch after the kernel callouts below)
Matrix Multiply: CUDA Kernel
[Code figure callouts: accessing a matrix, so using a 2D block; each thread computes one output element]
- Where did the two outer for loops in the CPU implementation go?
- No locks or synchronization – why?
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
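Again, the code figure is not reproduced; a sketch of the kernel it describes, with one output element per thread in a single 2D block (the name MatrixMulKernel is an assumption):

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
{
    // The 2D thread index selects one element of Pd; the two outer loops of
    // the CPU version are replaced by the thread grid itself.
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    float Pvalue = 0;
    for (int k = 0; k < width; ++k)
        Pvalue += Md[ty * width + k] * Nd[k * width + tx];

    // Each thread writes a distinct element, so no locks or synchronization are needed.
    Pd[ty * width + tx] = Pvalue;
}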
Matrix Multiply
- Step 3
  - Invoke the kernel in CUDA C (see the sketch below)
Matrix Multiply: Invoke Kernel
[Code figure callout: one block with width-by-width threads]
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
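A sketch of that launch, reusing the names from the earlier sketches (one block of width x width threads):

dim3 dimGrid(1, 1);              // a single thread block
dim3 dimBlock(width, width);     // width x width threads cover the whole output matrix
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);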
Matrix Multiply
- One block of threads computes matrix Pd
  - Each thread computes one element of Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and addition for each pair of Md and Nd elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block
[Diagram: Grid 1, Block 1 – Thread (2, 2) computes one element of Pd from a row of Md and a column of Nd, each of size WIDTH]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009
ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
Matrix Multiply
- What is the major performance problem with our implementation?
- What is the major limitation?
