Getting Started with GPU Computing
Dan Negrut
Assistant Professor
Simulation-Based Engineering Lab
Dept. of Mechanical Engineering
University of Wisconsin-Madison
San Diego
August 30, 2009
Acknowledgement

- Colleagues helping to organize the GPU Workshop: Sara McMains, Krishnan Suresh, Roshan D'Souza
- Wen-mei W. Hwu
- NVIDIA Corporation
- My students: Hammad Mazhar, Toby Heyn
2
Acknowledgements: Financial Support [Dan Negrut]

- NSF
- NVIDIA Corporation
- British Aerospace Engineering (BAE), Land Division
- Argonne National Lab
3
Overview

- Parallel computing: why, and why now? (15 mins)
- GPU programming: the democratization of parallel computing (60 mins)
  - NVIDIA's CUDA, a facilitator of GPU computing
  - Comments on the execution configuration and execution model
  - The memory layout
  - Gauging resource utilization
  - IDE support
- Comments on GPU computing (15 mins)
  - Sources of information
  - Beyond CUDA
4
Scientific Computing: A Change of Tide...

- A paradigm shift is taking place in scientific computing
- Moving from sequential to parallel data processing
- Triggered by changes in the microprocessor industry
5
CPU: Three Walls to Serial Performance

- Memory Wall
- Instruction Level Parallelism (ILP) Wall
- Power Wall

Source: an excellent article, "The Many-Core Inflection Point for Mass Market Computer Systems", by John L. Manferdelli, Microsoft Corporation
http://www.ctwatch.org/quarterly/articles/2007/02/the-many-core-inflection-point-for-mass-market-computer-systems/
6
Memory Wall

- There is a growing disparity of speed between the CPU and memory access outside the CPU chip
- S. Cray: "Anyone can build a fast CPU. The trick is to build a fast system"
7
Memory Wall

- The processor is often data-starved (idle) due to latency and limited communication bandwidth beyond chip boundaries
  - From 1986 to 2000, CPU speed improved at an annual rate of 55% while memory access speed improved at only 10%
- Some fixes:
  - A strong push for ever-growing caches to improve the average memory reference time to fetch or write data
  - Hyper-Threading Technology (HTT)
8
The Power Wall

- "Power, and not manufacturing, limits traditional general purpose microarchitecture improvements" (F. Pollack, Intel Fellow)
- Leakage power dissipation gets worse as gates get smaller, because gate dielectric thicknesses must proportionately decrease

[Chart: power density (W/cm2) vs. process technology from older to newer (μm) for the i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, and Core Duo; the trend heads toward the power density of a nuclear reactor. Adapted from F. Pollack (MICRO'99)]
9
The Power Wall

- Power dissipation in clocked digital devices is proportional to the square of the clock frequency, imposing a natural limit on clock rates
- A significant increase in clock speed without heroic (and expensive) cooling is not possible. Chips would simply melt.
10
The Power Wall

- Clock speed increased by a factor of 4,000 in less than two decades
- The ability of manufacturers to dissipate heat is limited, though...
- Looking back at the last five years, clock rates have been pretty much flat
- Intel's Sandy Bridge microprocessor architecture (2010) is expected to go up to 4.0 GHz
11
The Bright Spot: Moore's Law

- 1965 paper: doubling of the number of transistors on integrated circuits every two years
  - Moore himself wrote only about the density of components (or transistors) at minimum cost
  - The increase in transistor count serves, to some extent, as a rough measure of computer processing performance

http://news.cnet.com/Images-Moores-Law-turns-40/2009-1041_3-5649019.html
12
Micro2015: Evolving Processor Architecture, Intel® Developer Forum, March 2005

Intel's Vision: Evolutionary Configurable Architecture

[Diagram: evolution from large scalar cores for high single-thread performance toward "scalar plus many core" for highly threaded workloads]
- Dual core: symmetric multithreading
- Multi-core array: CMP with ~10 cores
- Many-core array: CMP with 10s-100s of low-power scalar cores, capable of TFLOPS+, full System-on-Chip, for servers, workstations, embedded...

CMP = "chip multi-processor"
Presentation by Paul Petersen, Sr. Principal Engineer, Intel
13
Putting things in perspective...

The way business has been run in the past → It will probably change to this...
- Rely exclusively on frequency increases → Parallelism is the primary method of performance improvement
- For the commoner: don't bother parallelizing an application (after all, you get a meager speedup) → No scientific computing application relies on single-core chips
- Less than linear scaling for a multiprocessor is failure → Sub-linear speedups are OK as long as you beat the sequential code

Slide source: Berkeley View of the Landscape
14

Some numbers would be good...
15

GPU vs. CPU: Flop Rate Comparison
[Plot: flop rates over time for GPU vs. CPU; the single precision rate is shown for the GPU]
Seymour Cray: "If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?"
16
Key Parameters: GPU vs. CPU

GPU – NVIDIA Tesla C1060:
- Processing cores: 240
- Memory: 4 GB
- Clock speed: 1.33 GHz
- Memory bandwidth: 102 GB/s
- Floating point operations/s: 933 x 10^9 (single precision)

CPU – Intel Core i7 975 Extreme:
- Processing cores: 4
- Memory: 32 KB L1 cache / core, 256 KB L2 (I&D) cache / core, 8 MB L3 (I&D) cache shared by all cores
- Clock speed: 3.20 GHz
- Memory bandwidth: 32.0 GB/s
- Floating point operations/s: 70 x 10^9 (double precision)
17
The GPU Hardware
18
19
GPU: Underlying Hardware

- NVIDIA nomenclature used below, reminiscent of the GPU's graphics mission
- The hardware is organized as follows:
  - One Stream Processor Array (SPA)...
  - ...has a collection of Texture Processor Clusters (TPCs, ten of them on the C1060)...
  - ...and each TPC has three Stream Multiprocessors (SMs)...
  - ...and each SM is made up of eight Stream (or Scalar) Processors (SPs)
20
NVIDIA TESLA C1060

- 240 Scalar Processors
- 4 GB device memory
- Memory bandwidth: 102 GB/s
- Clock rate: 1.3 GHz
- Approx. $1,250
21
Layout of Typical Hardware Architecture
[Diagram: the CPU (the host) connected to the GPU with its local DRAM (the device)]
22
GPGPU Computing

- GPGPU computing: "General Purpose" GPU computing
- The GPU can be used for more than just graphics: the computational resources are there, and they are most of the time underutilized
- The GPU can be used to accelerate data-parallel parts of an application
23
GPGPU: Pluses and Minuses

- Simple architecture optimized for compute-intensive tasks
  - Large data arrays, streaming throughput
  - Fine-grain SIMD (Single Instruction Multiple Data) parallelism
  - Low-latency floating point (FP) computation
- High precision floating point arithmetic support
  - 32-bit floating point, IEEE 754
- However, the GPU was only programmable through graphics library APIs
24
GPGPU: Pluses and Minuses [Cntd.]

- Dealing with the graphics API:
  - Addressing modes: limited texture size/dimension
  - Shader capabilities: limited outputs
  - Instruction sets: lack of integer & bit operations
  - Communication limited between pixels: only gather (a pixel can read data from other pixels), but no scatter (a pixel can only write to itself)

[Diagram: the fragment-program view of the hardware, with per-thread input and output registers, per-shader temp registers, per-context textures and constants, and the frame buffer (FB) memory]

Summing up: mapping computation problems to the graphics rendering pipeline is tedious...
25
CUDA: Addressing the Minuses in GPGPU

- "Compute Unified Device Architecture"
- It represents a general purpose programming model
  - User kicks off batches of threads on the GPU
- Targeted software stack
  - Scientific computing oriented drivers, language, and tools
- Driver for loading computation programs into the GPU
  - Standalone driver, optimized for computation
  - Interface designed for compute - a graphics-free API
  - Guaranteed maximum download & readback speeds
  - Explicit GPU memory management
26
The CUDA Execution Model
GPU Computing – The Basic Idea

- The GPU is linked to the CPU by a reasonably fast connection
- The idea is to use the GPU as a co-processor
  - Farm out big parallelizable tasks to the GPU
  - Keep the CPU busy with the control of the execution and "corner" tasks
28
GPU Computing – The Basic Idea [Cntd.]

- You have to copy data onto the GPU and later fetch results back
- For this to pay off, the data transfer should be overshadowed by the number crunching that draws on that data
- GPUs also work in asynchronous mode (see the sketch below)
  - The data transfer for a future task can happen while the GPU processes the current job
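A minimal sketch of that overlap, under assumptions not covered in these slides: it uses the CUDA streams API, page-locked host buffers, and a device that supports concurrent copy and kernel execution. The kernel, buffer names, and sizes are made up for illustration.

#include <cuda_runtime.h>

// Hypothetical kernel standing in for the "number crunching"; it just doubles each entry.
__global__ void ProcessChunk(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    const size_t bytes = N * sizeof(float);

    float *h_chunk0, *h_chunk1, *d_chunk0, *d_chunk1;
    cudaMallocHost((void**)&h_chunk0, bytes);   // page-locked host buffers (needed for async copies)
    cudaMallocHost((void**)&h_chunk1, bytes);
    cudaMalloc((void**)&d_chunk0, bytes);
    cudaMalloc((void**)&d_chunk1, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // While the kernel in stream s0 crunches chunk 0, the copy for chunk 1 proceeds in stream s1.
    cudaMemcpyAsync(d_chunk0, h_chunk0, bytes, cudaMemcpyHostToDevice, s0);
    ProcessChunk<<<N / 256, 256, 0, s0>>>(d_chunk0, N);
    cudaMemcpyAsync(d_chunk1, h_chunk1, bytes, cudaMemcpyHostToDevice, s1);
    ProcessChunk<<<N / 256, 256, 0, s1>>>(d_chunk1, N);

    cudaStreamSynchronize(s0);   // wait for both pipelines to drain
    cudaStreamSynchronize(s1);
    return 0;
}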
29
Some Nomenclature...

- The HOST
  - This is your CPU, executing the "master" thread
- The DEVICE
  - This is the GPU card, connected to the HOST through a PCIe X16 connection
- The HOST (the master thread) calls the DEVICE to execute a KERNEL
- When calling the KERNEL, the HOST also has to inform the DEVICE how many threads are supposed to each execute the KERNEL
  - This is called "defining the execution configuration"
30
Calling a Kernel Function, Details

- A kernel function must be called with an execution configuration:

__global__ void KernelFoo(...);         // declaration

dim3 DimGrid(100, 50);                  // 5000 thread blocks
dim3 DimBlock(4, 8, 8);                 // 256 threads per block

KernelFoo<<< DimGrid, DimBlock >>>(...arg list here...);

- Any call to a kernel function is asynchronous
  - By default, execution on the host doesn't wait for the kernel to finish
31
Example

- The host call below instructs the GPU to execute the function (kernel) "foo" using 25,600 threads
- Two arguments are passed down to each thread executing the kernel "foo"
- In this execution configuration, the host instructs the device that it is supposed to run 100 blocks, each having 256 threads in it
- The concept of a block is important, since it represents the entity that gets executed by an SM
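The call itself appeared as an image on the original slide; it would look like the line below, where arg1 and arg2 are placeholder names for the two arguments:

foo<<< 100, 256 >>>(arg1, arg2);   // 100 blocks x 256 threads/block = 25,600 threads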
32
30,000 Feet Perspective

[Diagram: on the left, what your C code looks like; on the right, how that code gets executed on the hardware in heterogeneous computing]
33
34
More on the Execution Model

- There is a limitation on the number of blocks in a grid:
  - The grid of blocks can be organized as a 2D structure: max of 65535 by 65535 blocks (that is, no more than 4,294,836,225 blocks for a kernel call)
- Threads in each block:
  - The threads can be organized as a 3D structure (x, y, z)
  - The total number of threads in each block cannot be larger than 512 (see the sketch below)
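A minimal sketch of an execution configuration that respects these limits; the element count N, the kernel name ProcessArray, and the device pointer d_data are made up for illustration:

const int N = 1000000;                                                // problem size
const int threadsPerBlock = 256;                                      // <= 512 threads per block
const int numBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;    // 3907 blocks, fits a 1D grid (< 65535)

dim3 dimGrid(numBlocks);                  // use dim3(gx, gy) to go 2D if numBlocks ever exceeds 65535
dim3 dimBlock(threadsPerBlock);
ProcessArray<<< dimGrid, dimBlock >>>(d_data, N);   // d_data: device array with at least N elements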
35
Kernel Call Overhead

- How much time is burnt by the CPU calling the GPU?
- Values reported below are averages over 100,000 kernel calls
- No arguments in the kernel call:
  - GT 8800 series, CUDA 1.1: 0.115305 milliseconds
  - Tesla C1060, CUDA 1.3: 0.088493 milliseconds
- Arguments present in the kernel call:
  - GT 8800 series, CUDA 1.1: 0.146812 milliseconds
  - Tesla C1060, CUDA 1.3: 0.116648 milliseconds
36
Languages Supported in CUDA

- Note that everything is done in C
  - Yet minor extensions are needed to flag the fact that a function actually represents a kernel, that there are functions that will only run on the device, etc.
  - Called "C with extensions"
- FORTRAN is supported; an ongoing project with the Portland Group (PGI)
- There is support for C++ programming (operator overloading, for instance)
37
CUDA Function Declarations
(the "C with extensions" part)

                                     Executed on the:   Only callable from the:
__device__ float myDeviceFunc()      device             device
__global__ void  myKernelFunc()      device             host
__host__   float myHostFunc()        host               host

- __global__ defines a kernel function
  - Must return void
- For a full list, see the CUDA Reference Manual
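As a quick illustration, the three qualifiers look like this in source code (the function names and bodies below are hypothetical, not from the slides):

__device__ float squareIt(float x) { return x * x; }        // runs on the device, callable only from device code

__global__ void myKernelFunc(float *data) {                 // runs on the device, launched from the host; returns void
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = squareIt(data[i]);                            // assumes data has gridDim.x * blockDim.x elements
}

__host__ float addOne(float x) { return x + 1.0f; }         // plain CPU function (the default when no qualifier is given)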
38
Block Execution Scheduling Issues

Who's Executing Here? [The Stream Multiprocessor (SM)]

- The SM represents the quantum of scalability on NVIDIA's architecture
  - My laptop: 4 SMs
  - The Tesla C1060: 30 SMs

Stream Multiprocessor

- Stream Multiprocessor (SM)
  - 8 Scalar Processors (SP)
  - 2 Special Function Units (SFU)
  - It's where a block lands for execution
- Multi-threaded instruction dispatch
  - From 1 up to 1024 (!) threads active
  - Shared instruction fetch per 32 threads
- 16 KB shared memory + 16 KB of registers
- DRAM texture and memory access

[Diagram: an SM with instruction L1 cache, data L1 cache, instruction fetch/dispatch unit, shared memory, eight SPs, and two SFUs]
40
Scheduling on the Hardware

- The grid is launched on the SPA
- Thread Blocks are serially distributed to all the SMs
  - Potentially more than one Thread Block per SM
- Each SM launches Warps of Threads
- The SM schedules and executes Warps that are ready to run
- As Warps and Thread Blocks complete, resources are freed
  - The SPA can launch the next Block[s] in line
- NOTE: two levels of scheduling:
  - For running [desirably] a large number of blocks on a small number of SMs (16/14/etc.)
  - For running up to 32 warps of threads on the 8 SPs available on each SM

[Diagram: the host launches Kernel 1 on Grid 1 (blocks (0,0) through (2,1)) and Kernel 2 on Grid 2; each block, e.g. Block (1,1), is shown as a 5x3 arrangement of threads (0,0) through (4,2)]
41
SM Executes Blocks

- Threads are assigned to SMs at Block granularity
  - Up to 8 Blocks to each SM (doesn't mean you'll have eight, though...)
  - One SM can take up to 1024 threads
    - This is 32 warps
    - Could be 256 (threads/block) x 4 blocks
    - Or 128 (threads/block) x 8 blocks, etc.
- Threads run concurrently, but time slicing is involved
  - The SM assigns/maintains thread IDs
  - The SM manages/schedules thread execution
- There is NO time slicing for block execution

[Diagram: two SMs (SM 0, SM 1), each with an MT issue unit (MT IU), SPs (running threads t0 t1 t2 ... tm), and shared memory, receiving blocks; texture fetch (TF), texture L1, L2, and device memory sit below]
42
Thread Scheduling/Execution

- Each Thread Block is divided into 32-thread Warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the basic scheduling units in the SM
- If 3 blocks are assigned to an SM and each Block has 256 threads, how many Warps are there in the SM?
  - Each Block is divided into 256/32 = 8 Warps
  - There are 8 x 3 = 24 Warps
  - At any point in time, only *one* of the 24 Warps will be selected for instruction fetch and execution.

[Diagram: Block 1 Warps and Block 2 Warps (threads t0 t1 t2 ... t31) feeding a Streaming Multiprocessor with instruction L1, data L1, instruction fetch/dispatch, shared memory, eight SPs, and two SFUs]
HK-UIUC
43
SM Warp Scheduling

- SM hardware implements zero-overhead Warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible Warps are selected for execution based on a prioritized scheduling policy
  - All threads in a Warp execute the same instruction when selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp on G80
- Side comment:
  - Suppose your code has one global memory access every four instructions
  - Then a minimum of 13 Warps is needed to fully tolerate a 200-cycle memory latency (see the arithmetic below)

[Diagram: the SM multithreaded Warp scheduler interleaving, over time, warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 35, ..., warp 8 instruction 12, warp 3 instruction 36]
HK-UIUC
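A back-of-the-envelope check of that number, using only the figures quoted on this slide (4 cycles to issue a warp instruction on G80, one 200-cycle global memory access every 4 instructions):

work one warp provides between memory accesses = 4 instructions x 4 cycles/instruction = 16 cycles
warps needed to cover the latency = ceil(200 / 16) = ceil(12.5) = 13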
44
Review: The CUDA Programming Model

- GPU architecture paradigm: Single Instruction Multiple Data (SIMD)
- What's the overall software (application) development model?
  - CUDA integrates the CPU and GPU in one application C program
  - Serial C code executes on the CPU
  - Parallel kernel C code executes on the GPU in thread blocks

[Diagram: CPU serial code, then the GPU parallel kernel KernelA<<< nBlkA, nTidA >>>(args) running as Grid 0, then more CPU serial code, then the GPU parallel kernel KernelB<<< nBlkB, nTidB >>>(args) running as Grid 1]
45
The CPU perspective of the GPU...

- The GPU is viewed as a compute device that:
  - Is a co-processor to the CPU, or host
  - Runs many threads in parallel
- Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
- When a kernel is invoked, you will have to instruct the GPU how many threads are supposed to run this kernel
  - You have to indicate the number of blocks of threads
  - You have to indicate how many threads are in each block
46
Caveats [1]

- Flop rates for GPUs are reported for single precision operations
- Double precision is supported, but the rule of thumb is that you get about a 4X slowdown relative to single precision
- Also, some small deviations from IEEE 754 exist
  - Combining multiplication and addition in one operation is not compliant
47
Caveats [2]

- There is no synchronization between threads that live in different blocks
  - If all threads need to synchronize, this is accomplished by getting out of the kernel and invoking another one (see the sketch below)
  - The average overhead for a kernel launch is roughly 90-110 microseconds (small...)
- IMPORTANT: global, constant, and texture memory spaces are persistent across successive kernel calls made by the same application
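A minimal sketch of that "exit and relaunch" pattern, with hypothetical kernel names; kernels launched by the same host thread execute in order on the device, so the boundary between the two launches acts as a global barrier:

// Global synchronization across all blocks: split the work into two kernel launches.
Phase1<<< dimGrid, dimBlock >>>(d_data);   // every thread of Phase1 completes before Phase2 starts
Phase2<<< dimGrid, dimBlock >>>(d_data);   // sees everything Phase1 wrote to global memory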
48
CUDA Memory Spaces
49
The Memory Space

- The memory space is the union of:
  - Registers
  - Shared memory
  - Device memory, which can be:
    - Global memory
    - Constant memory
    - Texture memory
- Remarks:
  - The constant memory is cached
  - The texture memory is cached
  - The global memory is NOT cached
- Memory bandwidth to device memory: 102 GB/s
50
CUDA Runtime Partitioning of the Memory Space

- The device memory is split into global, constant, and texture memory
- Note the presence of local memory, which is virtual memory
  - If too many registers are needed for a computation, the data overflow is stored in local memory
  - "Local" means that it's local, or specific, to one thread
  - In fact, local memory is part of the global memory
  - Long access times for local memory

[Diagram: the host next to the (Device) Grid; each block, e.g. Block (0,0) and Block (1,0), has its own shared memory plus per-thread registers and local memory for Thread (0,0) and Thread (1,0); global, constant, and texture memory sit below, accessible to all blocks]
51
CUDA Device Memory Space

- Each thread can:
  - At thread level: R/W registers
  - At thread level: R/W local memory
  - At block level: R/W shared memory
  - At grid level: R/W global memory
  - At grid level: read-only constant memory
  - At grid level: read-only texture memory
- The host can R/W the global, constant, and texture memories
- NOTE: the texture, constant, and global memory are persistent across kernels called by the same application

[Diagram: the same memory layout as on the previous slide: per-thread registers and local memory, per-block shared memory, and device-wide global, constant, and texture memory, the latter three also accessible from the host]
HK-UIUC
52
Access Times

- Register - dedicated HW - single cycle
- Shared memory - dedicated HW - single cycle
- Local memory - DRAM, no cache - *slow*
- Global memory - DRAM, no cache - *slow*
- Constant memory - DRAM, cached, 1...10s...100s of cycles, depending on cache locality
- Texture memory - DRAM, cached, 1...10s...100s of cycles, depending on cache locality
- Instruction memory (invisible) - DRAM, cached
HK-UIUC
53
Compute Capabilities, Things Change Fast...
[Table image listing hardware features by compute capability. Credit: NVIDIA]
54
Most Common Programming Pattern
[interacting with the device memory space]

- Sequence of steps most commonly used in GPU computing (a sketch follows the list):
  - Step 1: the host allocates memory on the device
  - Step 2: the host copies data onto the device
  - Step 3: the host invokes a kernel that gets executed in parallel and which processes/uses data from device memory for useful computation
  - Step 4: the host copies the results back from the device
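A minimal end-to-end sketch of these four steps; the kernel and variable names are made up for illustration:

#include <cuda_runtime.h>

__global__ void ScaleKernel(float *data) {             // hypothetical kernel: doubles each entry
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    const int N = 1024;
    const size_t bytes = N * sizeof(float);
    float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc((void**)&d_data, bytes);                           // Step 1: allocate memory on the device

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);    // Step 2: copy data onto the device

    ScaleKernel<<< N / 256, 256 >>>(d_data);                      // Step 3: launch the kernel (4 blocks x 256 threads)

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);    // Step 4: copy results back to the host

    cudaFree(d_data);
    return 0;
}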
55
56
CUDA Device Memory Allocation

- cudaMalloc()
  - Allocates an object in device Global Memory
  - Requires two parameters:
    - Address of a pointer to the allocated object
    - Size of the allocated object
- cudaFree()
  - Frees an object from device Global Memory
  - Takes the pointer to the freed object

[Diagram: the same device memory layout as before (per-thread registers and local memory, per-block shared memory, global/constant/texture memory)]
HK-UIUC
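For instance (a sketch; Md and size are placeholder names, chosen in the spirit of the matrix example on the next slides):

float *Md;                               // will hold a device address
int size = 64 * 64 * sizeof(float);      // number of bytes to allocate

cudaMalloc((void**)&Md, size);           // parameter 1: address of the pointer; parameter 2: size in bytes
// ... use Md in kernels ...
cudaFree(Md);                            // release the device allocation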
57
CUDA Host-Device Data Transfer

- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters:
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type of transfer: Host to Host, Host to Device, Device to Host, Device to Device
- Transfers happen over a PCIe 2.0 16X connection
  - Basically 8 GB/s (each way)

[Diagram: the same host/device memory layout as on the previous slides]
HK-UIUC
58
CUDA Host-Device Data Transfer (cont.)

- Example:
  - Transfer "size" bytes
  - M is in host memory and Md is in device memory
  - cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);
cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);
59
CUDA GPU Programming
~ Resource Management Considerations ~
60
What Do I Mean By "Resource Management"?

- The GPU is a resourceful device
- What do you have to do to make sure you capitalize on these resources?
  - In other words, how can you ensure that all the SPs are busy all the time?
- To fully exploit the GPU's potential, what matters is:
  - How many threads you decide to use
  - What memory requirements are associated with a thread
  - How much shared memory gets allocated/used by one block of threads
61
Resource Management – The Key Actors: Threads, Warps, Blocks

- A collection of 32 Threads makes up a Warp
  - A Warp is something virtual; it's how the GPU groups threads together for execution
- A Block has at most 512 threads, that is, 16 Warps
  - Threads are organized in a 3D fashion; each thread has a unique (Tx, Ty, Tz) thread ID
  - Threads in a block get to use the shared memory together
- Each Block of threads is executed on a single SM
  - If you run an application with 100 blocks of threads and your GPU has 16 SMs (GTX 8800, for instance), chances are each SM will get to execute about 6 or 7 blocks
62
Resource Management – The Key Actors: Threads, Warps, Blocks [Cntd.]

- A kernel is executed as a grid of blocks
  - Grid: up to 65535 x 65535 blocks
  - Each block has a unique (Bx, By) ID
- The threads that belong to the *same* block can cooperate with each other by:
  - Synchronizing their execution
    - For hazard-free shared memory accesses (see the sketch below)
  - Efficiently sharing data through a low-latency shared memory
    - Shared memory is allocated per block
- Threads from two different blocks cannot cooperate!!!
  - This has important software design implications

[Diagram: as before, the host launches Kernel 1 on Grid 1 (blocks (0,0) through (2,1)) and Kernel 2 on Grid 2; Block (1,1) is shown as a 5x3 arrangement of threads (0,0) through (4,2)]
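A minimal sketch of that intra-block cooperation (a hypothetical kernel that reverses each 256-element chunk in place, assuming it is launched with 256-thread blocks): every thread stages one value in shared memory, and __syncthreads() guarantees no thread reads a slot before its neighbor has written it.

__global__ void ReverseWithinBlock(float *data) {
    __shared__ float tile[256];                    // one tile per block; shared memory is allocated per block

    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[t] = data[base + t];                      // each thread writes one slot
    __syncthreads();                               // barrier: the whole block has written before anyone reads

    data[base + t] = tile[blockDim.x - 1 - t];     // read a slot written by a *different* thread of the same block
}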
63
Execution Model, Key Observations [1 of 2]

- Each block is executed on *one* Stream Multiprocessor (SM)
  - There is no time slicing when executing a block of threads
- Each block is split into warps of threads, executed one at a time by the eight SPs of the SM (time slicing in warp execution is constantly done)
64
Execution Model, Key Observations [2 of 2]

- A Stream Multiprocessor can execute multiple blocks concurrently
- Shared memory and registers are partitioned among the threads of all concurrent blocks
  - Decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently (very desirable)
- The shared memory "belongs" to the block, not to the threads (which merely use it...)
  - The shared memory space resides in the on-chip shared memory and it "spans" (or encompasses) a thread block
65
Some Hard Constraints [1 of 2]

- Max number of warps that one SM can service simultaneously:
  - 32 (on the latest generation of GPUs)
- Max number of blocks that one SM can process simultaneously:
  - 8 (it's been like this for a while)
66
Some Hard Constraints [2 of 2]

- The number of registers available on each SM is limited:
  - 16 K registers on the latest NVIDIA hardware
- The amount of shared memory available to each SM is limited:
  - 16 KB today
67
The Concept of Occupancy

- Ideally, you want to have 32 warps serviced at the same time by one SM
  - This keeps the SM busy and hides the latencies associated with memory access
- Examples:
  - Two blocks with 512 threads each running together on one SM: 100% occupancy
  - Four blocks of 256 threads each running on one SM: 100% occupancy
  - 16 blocks with 64 threads each: not good, you can't have more than 8 blocks running on an SM
    - Effectively this scenario gives you 50% occupancy
68
The Concept of Occupancy [Cntd.]

- What prevents you from getting high occupancy?
  - Many warps means many threads and possibly many blocks
  - Many blocks => you can't have too much shared memory allocated to each one of them
    - Total amount of shared memory in one SM: 16 KB
  - Many threads => you can't have too many registers used by each thread
    - Size of the register file in one SM: 16 K registers
69
Examples, Occupancy of HW

- Example 1: if each of your blocks gets assigned 20 KB of shared memory, the kernel will fail to launch
  - Not enough memory on the SM to run even one block
- Example 2: if your blocks each use 5 KB of shared memory, you can have three blocks running on one SM (there will be some shared memory that goes unused)
- Example 3: like Example 2 above, but you have 512 threads per block and each thread uses 16 registers. Will one SM be able to handle 2 blocks?
  - Total number of registers: 512 x 2 x 16 = 16,384 out of the 16,384 available are used => OK
  - Number of warps: 2 blocks x 512 threads = 1024 threads = 32 warps => OK in CUDA 1.3
  - You actually have 100% occupancy, maxed out on registers, and lots of shared memory left
70
Resource Utilization

- There is an "occupancy calculator" that can tell you what percentage of the HW gets utilized by your kernel
  - It takes the form of an Excel spreadsheet
- It requires the following input:
  - Threads per block
  - Registers per thread
  - Shared memory per block
- Google "occupancy calculator cuda" to access it (a rough sketch of the bookkeeping it does is given below)
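A rough sketch of that bookkeeping (my own simplification, not NVIDIA's spreadsheet), using the per-SM limits quoted on the preceding slides (8 blocks, 32 warps, 16 K registers, 16 KB shared memory):

// Hypothetical helper for a compute-capability-1.3-class SM; real calculators also account
// for allocation granularities, so treat the result as a ballpark figure.
int EstimateOccupancyPercent(int threadsPerBlock, int regsPerThread, int smemPerBlock) {
    const int maxWarpsPerSM  = 32;
    const int maxBlocksPerSM = 8;
    const int regsPerSM      = 16 * 1024;   // 16 K registers
    const int smemPerSM      = 16 * 1024;   // 16 KB shared memory

    int warpsPerBlock = (threadsPerBlock + 31) / 32;

    // Blocks that fit per SM: the most restrictive of warps, registers, shared memory, and the block cap
    int blocks  = maxBlocksPerSM;
    int byWarps = maxWarpsPerSM / warpsPerBlock;
    int byRegs  = regsPerThread > 0 ? regsPerSM / (threadsPerBlock * regsPerThread) : maxBlocksPerSM;
    int bySmem  = smemPerBlock  > 0 ? smemPerSM / smemPerBlock : maxBlocksPerSM;
    if (byWarps < blocks) blocks = byWarps;
    if (byRegs  < blocks) blocks = byRegs;
    if (bySmem  < blocks) blocks = bySmem;

    return 100 * (blocks * warpsPerBlock) / maxWarpsPerSM;   // active warps vs. the 32-warp maximum
}

// Example 3 from the previous slide: EstimateOccupancyPercent(512, 16, 5 * 1024) returns 100.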
71
72
CUDA GPU Code Development
73
Code Development Support

- How do I compile?
- How do I link?
- How do I debug?
- How do I profile?
74
The CUDA Way: Extended C

- Declaration specifications: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads()
- Runtime API: for memory, symbol, and execution management
- Kernel launch

__device__ float filter[N];

__global__ void convolve (float *image) {
    __shared__ float region[M];
    ...
    region[threadIdx.x] = image[i];

    __syncthreads();
    ...
    image[j] = result;
}

// Allocate GPU memory
void *myimage = cudaMalloc(bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>> (myimage);

HK-UIUC
75
Compiling CUDA

- nvcc
  - Compile driver
  - Invokes cudacc, gcc, cl, etc.
- PTX
  - Parallel Thread eXecution
  - Like assembly language, e.g.:
    ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32           $f1, $f5, $f3, $f1;

[Diagram: a C/C++ CUDA application goes through NVCC, which splits it into CPU code and PTX code; a PTX-to-target compiler then produces target code for a specific GPU (e.g., G80)]
Courtesy NVIDIA
76
More on the nvcc compiler

File suffix      How the nvcc compiler interprets the file
.cu              CUDA source file, containing host and device code
.cup             Preprocessed CUDA source file, containing host code and device functions
.c               'C' source file
.cc, .cxx, .cpp  C++ source file
.gpu             GPU intermediate file (device code only)
.ptx             PTX intermediate assembly file (device code only)
.cubin           CUDA device-only binary file
77
Compiling CUDA Extended C
[Diagram: the nvcc compilation trajectory; details at http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf]
78
Gauging Memory Use on the GPU

- Compile with the "-keep" flag and investigate the .cubin file; the lmem, smem, and reg entries report the kernel's local memory, shared memory, and register usage:

architecture {sm_10}
abiversion {1}
modname {cubin}
code {
    name = _Z21MatVecMulKernelShared6Matrix6VectorS0_
    lmem = 0
    smem = 1068
    reg  = 8
    bar  = 1
    const {
        segname = const
        segnum  = 1
        offset  = 0
        bytes   = 8
        mem {
            0x000000ff 0x0000042c
        }
    }
    bincode {
        0x10004209 0x0023c780 0xa000000d 0x04000780
        0x1000c801 0x0423c780 0x301fce11 0xec300780
79
Debugging Using the Device Emulation Mode

- An executable compiled in device emulation mode (nvcc -deviceemu) runs entirely on the host using the CUDA runtime
  - No need for any device or CUDA driver
  - Each device thread is emulated with a host thread
  - In a Developer Studio project, select the "EmuDebug" or "EmuRelease" build configuration
- When running in device emulation mode, one can:
  - Use host native debug support (breakpoints, variable QuickWatch and edit, etc.)
  - Access any device-specific data from host code and vice-versa
  - Call any host function from device code (e.g. printf) and vice-versa
  - Detect deadlock situations caused by improper usage of __syncthreads
80
Device Emulation Mode Pitfalls [1/3]

- Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results
HK-UIUC
81
Device Emulation Mode Pitfalls [2/3]

- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
HK-UIUC
82
Device Emulation Mode Pitfalls [3/3]

- Results of floating-point computations will differ slightly because of:
  - Different compiler outputs, instruction sets
  - Use of extended precision for intermediate results
    - There are various options to force strict single precision on the host
HK-UIUC
83
Concluding Remarks
84
GPU Computing in Engineering

- Who stands to benefit in the Engineering community?
  - FEA
  - Monte Carlo
  - Molecular Dynamics
  - Granular Dynamics
  - Image processing
  - Agent-based modeling
  - ...
- Generally, any application that fits the SIMD paradigm
85
[Reported application speedups, in the 50x - 150x range overall (Credit: NVIDIA Corporation):]
- 146X: Medical Imaging, U of Utah
- 36X: Molecular Dynamics, U of Illinois, Urbana
- 18X: Video Transcoding, Elemental Tech
- 50X: MATLAB Computing, AccelerEyes
- 100X: Astrophysics, RIKEN
- 149X: Financial simulation, Oxford
- 47X: Linear Algebra, Universidad Jaime
- 20X: 3D Ultrasound, Techniscan
- 130X: Quantum Chemistry, U of Illinois, Urbana
- 30X: Gene Sequencing, U of Maryland
A Word on HPC beyond GPU

- We are witnessing a very momentous transformation
  - A shift from sequential to parallel computing
- The support for parallel computing is very homogeneous in structure
  - The GPU is not alone in this race to capitalize on parallel computing for scientific apps
87
Parallel Computing, SW Side...

- Other options for leveraging parallel computing in scientific applications:
  - Threads (POSIX, Windows)
  - OpenMP
  - The MPI standard (see the MPICH implementation)
  - Intel's Threading Building Blocks (TBB) library
  - The OpenCL standard for heterogeneous computing
    - AMD and NVIDIA provide implementations; Apple to follow up shortly
88
Parallel Computing, HW Side...

- Hardware options for HPC:
  - GPU (NVIDIA)
  - The "fusion" idea (Intel's Larrabee, AMD's Fusion)
  - Cell blades
  - Cluster computing (IBM's BlueGene/P, Q, ...)
  - Cloud computing
89
Sources of Information, GPU Computing

- Read, in this order:
  - NVIDIA CUDA Development Tools 2.3: Getting Started (short doc, July 09)
  - NVIDIA CUDA Programming Guide 2.3 (July 09)
  - NVIDIA CUDA C Programming Best Practices Guide 2.3 (short doc, July 09)
  - NVIDIA CUDA Reference Manual 2.3 (comprehensive, July 09)
- Lots of very good examples come with the CUDA SDK distribution
  - More than 25 applications ready to compile/run
  - Makefiles available, ready for use
  - Lots of good code available for reuse + templates for applications
- Online material
  - NVIDIA website: code available for many application fields
  - Libs: thrust (http://code.google.com/p/thrust/), cudpp (http://gpgpu.org/developer/cudpp)
  - Course on GPU programming: http://sbel.wisc.edu/Courses/ME964/2008/index.htm
Conclusions

- We are in the middle of a shift to parallel computing
- Hardware changes at a higher pace
- CUDA - a bright spot in a software landscape that is otherwise pretty bleak
- GPU computing is not the silver bullet
  - The GPU, for the right application, can deliver amazing benefits at a small investment of time and money
- In general, investing in parallel programming skills is bound to pay off
91
Thank You.
92
Review, Execution Model

- Move data to the device, launch the kernel, transfer relevant data back to the host
- The kernel is a C function executed on the device
- Each thread executes the kernel; this happens in parallel
93
Review, Key Concepts

- Kernel = GPU program executed by each parallel thread in a block
- Block = a 3D collection of threads that can cooperate in using the block's shared memory and can synchronize during execution
- Grid = 2D array of blocks of threads that execute a kernel
- Device = GPU = set of stream multiprocessors (30 SMs on the C1060)
- Stream Multiprocessor = 8 scalar processors + shared memory + registers

Memory    Location  Cached          Access      Who
Local     Off-chip  No              Read/write  One thread
Shared    On-chip   N/A - resident  Read/write  All threads in a block
Global    Off-chip  No              Read/write  All threads + host
Constant  Off-chip  Yes             Read        All threads + host
Texture   Off-chip  Yes             Read        All threads + host

Off-chip means on-device; i.e., slow access time.
94
Vision of the Future

[Plot: performance vs. time; around 2007 the "Frequency Era" gives way to the "Multi-core Era", with a growing gap highlighted. "SD": software development. From a presentation by Paul Petersen, Sr. Principal Engineer, Intel]

- "Parallelism for Everyone"
- Parallelism changes the game
  - A large percentage of people who provide applications are going to have to care about parallelism in order to match the capabilities of their competitors.

Competitive pressures = demand for parallel applications
95