Training Program on GPU Programming with CUDA
31st July, 7th Aug, 14th Aug 2011
CUDA Teaching Center @ UoM

Day 1, Session 2: CUDA Programming Model, CUDA Threads
Sanath Jayasena
Outline for Day 1 Session 2
CUDA Programming Model, CUDA Threads
• Data Parallelism
• CUDA Program Structure
• Memory Model & Data Transfer (Brief)
• Kernel Functions & Threading
(Discussion with Example: Matrix Multiplication)
Data Parallelism
• Data Parallelism
– A problem/program property
– Many arithmetic operations can be safely
performed on the data structures simultaneously
– Example: matrix multiplication (next slide)
• CUDA devices can exploit data parallelism to
accelerate execution of applications
Example: Matrix Multiplication
P = M · N
[Figure: matrices M, N and P, each of size width × width]
• Each element in P is computed as the dot product between a row of M and a column of N
• All elements in P can be computed independently and simultaneously
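Written out, each element of P is the dot product

    P(i,j) = Σ_{k=0..width-1} M(i,k) · N(k,j)

so the width × width results share no data dependences and can all be computed at once.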
CUDA Program Structure
• A CUDA program consists of one or more phases executed on either the host (CPU) or a device (GPU), supplied as a single source code
• Little or no data parallelism → host code
  – ANSI C, compiled with a standard compiler
• Significant data parallelism → device code
  – ANSI C extended with keywords to specify kernels and data structures
• The NVIDIA C Compiler separates the two and …
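A minimal sketch of this split in one source file (the kernel name and sizes here are illustrative, not from the slides); the __global__ qualifier marks device code, the rest is ordinary host C:

    #include <stdio.h>

    // Device code: runs on the GPU, written in extended ANSI C
    __global__ void scaleKernel(float *data, float factor)
    {
        data[threadIdx.x] *= factor;   // each thread scales one element
    }

    // Host code: ordinary ANSI C, runs on the CPU
    int main(void)
    {
        float h[4] = {1, 2, 3, 4};
        float *d;
        cudaMalloc((void**)&d, sizeof(h));
        cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
        scaleKernel<<<1, 4>>>(d, 2.0f);               // 1 block of 4 threads
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        cudaFree(d);
        printf("%f\n", h[0]);                         // prints 2.0
        return 0;
    }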
Execution of a CUDA Program
[Figure: serial host code alternating with device-side grids of parallel threads]
Execution of a CUDA Program
• Execution starts on the host (CPU)
• When a kernel is invoked, execution moves to the device (GPU)
  – A large number of threads are generated
  – Grid: the collection of all threads generated by a kernel
  – (Previous slide shows two grids of threads)
• Once all threads in a grid complete execution, the grid terminates and execution continues on the host
Example: Matrix Multiplication
A simple CUDA host code skeleton for matrix multiplication:

    int main(void)
    {
        // 1. Allocate and initialize matrices M, N, P
        //    I/O to read the input matrices M and N
        …

        // 2. M * N on the device
        MatrixMulOnDevice(M, N, P, width);

        // 3. I/O to write the output matrix P
        //    Free matrices M, N, P
        …

        return 0;
    }
CUDA Device Memory Model
• Host and devices have separate memory spaces
  – E.g., hardware cards with their own DRAM
• To execute a kernel on a device
  – Need to allocate memory on the device
  – Transfer data: host memory → device memory
• After device execution
  – Transfer results: device memory → host memory
  – Free device memory that is no longer needed
CUDA Device Memory Model
[Figure: host memory and device memory as separate spaces, linked by data transfers]
CUDA API : Memory Mgt.
[Table: device memory management functions, including cudaMalloc() and cudaFree()]
CUDA API : Memory Mgt.
• Example

    float *Md;
    int size = Width * Width * sizeof(float);
    cudaMalloc((void**)&Md, size);
    …
    cudaFree(Md);
CUDA API : Data Transfer
[Table: data transfer functions, including cudaMemcpy() and its transfer-direction constants]
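For illustration, a sketch of a typical transfer sequence, assuming Width is defined as in the example above (cudaMemcpy takes destination, source, byte count, and direction):

    float M[Width * Width];                    // host copy of the matrix
    float *Md;                                 // device copy
    int size = Width * Width * sizeof(float);

    cudaMalloc((void**)&Md, size);
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host → device
    // ... kernel runs on the device here ...
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // device → host
    cudaFree(Md);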
Example: Matrix Multiplication
[Code: host-side MatrixMulOnDevice(), using the memory management and data transfer APIs above]
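A sketch of what such a host function could look like, assuming the single-block launch discussed later in this session (the device-pointer names Md, Nd, Pd follow the slides; the rest is illustrative):

    void MatrixMulOnDevice(float *M, float *N, float *P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;

        // 1. Allocate device memory and copy the inputs over
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Pd, size);

        // 2. Launch one Width x Width block (requires Width*Width <= 1024)
        dim3 dimBlock(Width, Width, 1);
        dim3 dimGrid(1, 1, 1);
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

        // 3. Copy the result back and free device memory
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }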
Kernel Functions & Threading
• A kernel function specifies the code to be executed by all threads of a parallel phase
  – All threads of a parallel phase execute the same code → single-program multiple-data (SPMD), a popular programming style for parallel computing
• Need a mechanism to
  – Allow threads to distinguish themselves
  – Direct them to the specific parts of the data they are supposed to work on
Kernel Functions & Threading
• Keywords "threadIdx.x" and "threadIdx.y"
  – Thread indices of a thread
  – Allow a thread to identify itself at runtime (by accessing hardware registers associated with it)
• Can refer to a thread as Thread(threadIdx.x, threadIdx.y)
• Thread indices reflect a multi-dimensional organization for threads
Example: Matrix Multiplication Kernel
[Code: the MatrixMulKernel() device function; see the next slide for more details on accessing the relevant data]
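A reconstruction of the single-block kernel, following the Chapter 2 reference for this session (not the exact slide code):

    // Each thread computes one element of Pd as the dot product of one
    // row of Md and one column of Nd; all matrices are Width x Width,
    // stored as 1-D arrays in row-major order
    __global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int Width)
    {
        int tx = threadIdx.x;   // column index of this thread's Pd element
        int ty = threadIdx.y;   // row index of this thread's Pd element

        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[ty * Width + k] * Nd[k * Width + tx];

        Pd[ty * Width + tx] = Pvalue;
    }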
Thread Indices & Accessing Data Relevant to a Thread
[Figure: matrices Md, Nd and Pd with a thread's (tx, ty) indices; Pd laid out in memory as a 1-D array, row ty starting at offset ty * width]
• Each thread uses tx, ty to identify the relevant row of Md, the relevant column of Nd, and its element of Pd in the for loop
• E.g., Thread(2,3) will perform the dot product between row 2 of Md and column 3 of Nd and write the result into element (2,3) of Pd
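The addressing the figure describes, written out as a small illustration (tx, ty, k and Width as in the kernel above):

    // Row-major layout: element (row ty, column tx) of a Width-wide matrix
    int pdIndex = ty * Width + tx;   // this thread's element of Pd
    int mdIndex = ty * Width + k;    // k-th element of row ty of Md
    int ndIndex = k * Width + tx;    // k-th element of column tx of Nd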
Threading & Grids
• When a kernel is invoked/launched, it is executed as a grid of parallel threads
• A CUDA thread grid can have millions of lightweight GPU threads per kernel invocation
  – To fully utilize the hardware → enough threads required → large data parallelism required
• Threads in a grid have a two-level hierarchy
  – A grid consists of 1 or more thread blocks
  – All blocks in a grid have the same # of threads
CUDA Thread Organization
[Figure: a grid composed of thread blocks, each block composed of threads]
Threading with Grids & Blocks
• Each thread block has a unique 2-D coordinate given by the CUDA keywords "blockIdx.x" and "blockIdx.y"
  – All blocks must have the same structure and # of threads
• Each block has a 3-D array of threads, up to a total of 1024 threads
  – Coordinates of threads in a block are defined by the indices threadIdx.x, threadIdx.y, threadIdx.z
  – (Not all apps will use all 3 dimensions)
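Combining the two index levels, each thread can compute a unique global coordinate; this is the standard pattern (not from the slides):

    // Global 2-D coordinates of a thread within the whole grid
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;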
Our Example: Matrix Multiplication
• The kernel shown earlier (Example: Matrix Multiplication Kernel)
  – can only use one thread block
  – The block is organized as a 2-D array of threads
• The code can compute a product matrix Pd of only up to 1024 elements
  – As a block can have a max of 1024 threads
  – Each thread computes one element of Pd
  – Is this sufficient / acceptable?
Our Example: Matrix Multiplication
• When host code invokes the kernel, the grid and block dimensions are set by passing them as parameters
• Example

    // Set up the execution configuration
    dim3 dimBlock(16, 16, 1);   // Width = 16, as an example
    dim3 dimGrid(1, 1, 1);      // last 1 ignored

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 16);
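For matrices larger than a single block can cover, an illustrative extension (not from the slides) tiles Pd over a grid of blocks; the kernel would then combine blockIdx and threadIdx as shown earlier to find its row and column:

    // Hypothetical configuration for a larger Width:
    // each 16 x 16 block computes one 16 x 16 tile of Pd
    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(Width / 16, Width / 16, 1);   // assumes Width divisible by 16
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);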
Here is an Exercise…
• Implement Matrix Multiplication
  – Execute it with different matrix dimensions using (a) CPU only, (b) GPUs and (c) GPUs with different grid/block organizations
• Fill a table like the following

    Dimensions (M, N)            | CPU time (s) | GPU time (s) | Speedup
    [400,800] , [400, 400]       |              |              |
    [800,1600] , [800, 800]      |              |              |
    ….                           |              |              |
    [2400,4800] , [2400, 4800]   |              |              |
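One way to collect the GPU timings for the table (a sketch; the CUDA event API is standard, the surrounding names follow the earlier examples):

    // Time the kernel launch with CUDA events
    cudaEvent_t start, stop;
    float ms = 0.0f;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);               // wait for the kernel to finish

    cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
    printf("GPU time: %f s\n", ms / 1000.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);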
Conclusion
• We discussed the CUDA Programming Model and CUDA Thread Basics
  – Data Parallelism
  – CUDA Program Structure
  – Memory Model & Data Transfer (briefly)
  – Kernel Functions & Threading
  – (Discussion with Example: Matrix Multiplication)
References for this Session
• Chapter 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010
• Chapters 4-5 of: E. Kandrot and J. Sanders, CUDA by Example, Addison-Wesley, 2010
• Chapter 2 of: NVIDIA CUDA C Programming Guide, v. 3.2/4.0, NVIDIA Corp., 2010-2011