CUDA and GPU Training: Sessions 1 & 2
April 16 & 23, 2012
University of Georgia CUDA Teaching Center
UGA CUDA Teaching Center
 UGA, through the efforts of Professor Thiab Taha, has been
selected by NVIDIA as a 2011-2012 CUDA Teaching Center
 Presenters: Jennifer Rouan, Shan-ho Tsai, John Kerry
 Visit us at http://cuda.uga.edu
Workshop Outline
 Introduction to GPUs and CUDA
 CUDA Programming Concepts
 Georgia Advanced Computing Resource Center (GACRC)
 “My First CUDA Program” – hands-on programming project
A Little Bit of GPU Background
 Graphics processing unit (GPU) evolution has been driven by market demand for high-quality, real-time 3D graphics in computer applications, especially video games
 Microsoft’s DirectX 10 API (2006) introduced a geometry shading stage, which demanded an increase in operation rate, particularly for floating-point operations
 Nvidia’s GeForce 8800 GPU (2006) introduced unified processors, mapping three separate graphics stages (vertex shading, geometry processing, and pixel processing) onto a single array of processors
 Scientists recognized the raw performance potential of this hardware and developed general-purpose GPU computing (GPGPU)
Graphics Pipeline
 Logical pipeline: vertex shading → geometry processing → pixel processing
 Physical loop: a unified array of processors handles all three stages
 Load balancing makes maximum use of the hardware
 All design resources can be dedicated to optimizing one type of processor
Compute Unified Device Architecture (CUDA)
 Nvidia created CUDA in 2007 to facilitate the development of parallel programs on GPUs
 The CUDA language is ANSI C extended with a small number of keywords for labeling data-parallel functions (kernels) and their associated data
 Because Nvidia’s technology benefits from massive economies of scale in the gaming market, CUDA-enabled cards are very inexpensive for the performance they provide
Hardware Summary: CPUs and GPUs
 Central Processing Units (CPUs) are optimized to complete a large variety
of sequential tasks very quickly
 Graphics Processing Units (GPUs) are optimized to do one thing: perform floating-point operations on a large amount of data at one time
 Compared to CPUs, GPUs dedicate very little chip area to memory in
exchange for more computing cores on the chip
[Diagram: relative chip area. The CPU devotes much of its area to memory, while the GPU devotes most of its area to computing cores]
Why Program Massively Parallel
Processors?
 Potential to bring applications that are currently considered supercomputing applications (or “superapplications” [Kirk 2010]) to the mass market, such as biology research, image processing, and 3D imaging and visualization
 Many of today’s medical imaging applications still run on microprocessor clusters and special-purpose hardware, and could achieve size and cost improvements on a GPU
 Market demand for even better user interfaces and still more realistic gaming is not going to go away
Speed Tests on UGA Equipment
 Equipment:
 Barracuda: Nvidia GeForce GTX 480 GPU (480 cores)
 Z-Cluster: Nvidia Tesla C2075 GPU (448 cores)
 R-Cluster: Nvidia Tesla S1070 GPU (240 cores)
 CPU only (serial): Intel dual-processor, quad-core Xeon CPU
 Testing:
 Multiply two square matrices of single-precision floating-point numbers, ranging from 16 x 16 to 8192 x 8192
 Time to move data from host to device and back is included in GPU timing
 Conducted five rounds of tests and averaged the results
Small Problem Size
 Depending on the hardware configuration, the overhead to copy the
data may overwhelm the performance improvement of the GPU
[Chart: execution time in milliseconds (ms) vs. matrix size (N x N), N = 16 to 256, for Barracuda, Z-Cluster, R-Cluster, and CPU only]
Medium Problem Size
 GPU advantage becomes apparent as the matrix size
increases
[Chart: execution time in seconds (s) vs. matrix size (N x N), N = 256 to 4096, for Barracuda, Z-Cluster, R-Cluster, and CPU only]
Large Problem Size
 The GPUs can still finish a job in a matter of seconds that
takes several hours on the CPU
[Chart: execution time in hours (h) vs. matrix size (N x N), N = 4096 to 8192, for Barracuda, Z-Cluster, R-Cluster, and CPU only]
Speedup Summary
 Speed-up = CPU time / GPU time
[Chart: speed-up relative to the CPU-only run vs. matrix size (N x N), N = 16 to 8192, for Barracuda, Z-Cluster, and R-Cluster]
CUDA Computing System
 A CUDA computing system consists of a host (CPU) and one
or more devices (GPUs)
 The portions of the program that can be evaluated in parallel
are executed on the device. The host handles the serial
portions and the transfer of execution and data to and from
the device
CUDA Program Source Code
 A CUDA program is a unified source code encompassing both
host and device code. Convention: program_name.cu
 NVIDIA’s compiler (nvcc) separates the host and device code
at compilation
 The host code is compiled by the host’s standard C compilers.
The device code is further compiled by nvcc for execution on
the GPU
CUDA Program Execution
 Execution of a CUDA program begins on the host CPU
 When a kernel function (or simply “kernel”) is launched,
execution is transferred to the device and a massive “grid” of
lightweight threads is spawned
 When all threads of a kernel have finished executing, the grid
terminates and control of the program returns to the host
until another kernel is launched
CUDA Program Execution
[Diagram: program timeline alternating between serial execution on the host and parallel grids of threads on the device as each kernel is launched and completes]
CUDA Program Structure example
int main(void) {
    float *a_h, *a_d;                 // pointers to host and device arrays
    const int N = 10;                 // number of elements in array
    size_t size = N * sizeof(float);  // size of array in memory

    // allocate memory on host and device for the array
    // initialize array on host (a_h)
    // copy array a_h to allocated device memory location (a_d)
    // kernel invocation code – to have the device perform the parallel operations
    // copy a_d from the device memory back to a_h
    // free allocated memory on device and host
}
Data Movement and Memory
Management
 In CUDA, host and device have separate memory spaces
 To execute a kernel, the program must allocate memory on
the device and transfer data from the host to the device
 After kernel execution, the program needs to transfer the
resultant data back to the host memory and free the device
memory
 C functions: malloc(), free()
CUDA functions: cudaMalloc(), cudaMemcpy(), and
cudaFree()
Data Movement example
int main(void)
{
    float *a_h, *a_d;
    const int N = 10;
    size_t size = N * sizeof(float);            // size of array in memory
    int i;

    a_h = (float *)malloc(size);                // allocate array on host
    cudaMalloc((void **) &a_d, size);           // allocate array on device
    for (i = 0; i < N; i++) a_h[i] = (float)i;  // initialize array
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    // kernel invocation code

    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    cudaFree(a_d); free(a_h);                   // free allocated memory
}
Execution Parameters and Kernel Launch
 A kernel is invoked by the host program with execution
parameters surrounded by ‘<<<’ and ‘>>>’ as in:
function_name <<< grid_dim, block_dim >>> (arg1, arg2);
 At kernel launch, a “grid” is spawned on the device. A grid
consists of a one- or two-dimensional array of “blocks”. In
turn, a block consists of a one-, two-, or three-dimensional
array of “threads”
 Grid and block dimensions are passed to the kernel function
at invocation as execution parameters
Execution Parameters and Kernel Launch
 gridDim and blockDim are CUDA built-in variables of type dim3, essentially a C struct with three unsigned integer fields: x, y, and z
 Since a grid is generally two-dimensional, gridDim.z is ignored but should be set to 1 for clarity

dim3 grid_d(n_blocks, 1, 1);        // this is still host code
dim3 block_d(block_size, 1, 1);
function_name <<< grid_d, block_d >>> (arg1, arg2);

 For one-dimensional grids and blocks, scalar values can be used instead of the dim3 type
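
As a brief, hedged sketch of a two-dimensional configuration (the kernel name scale_matrix, the 16 x 16 block size, and the device pointer m_d are illustrative assumptions, not from the slides), an N x N matrix can be covered like this:

// each thread handles one matrix element; blockIdx and threadIdx give 2D coordinates
__global__ void scale_matrix(float *m, int n, float s) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n) m[row * n + col] *= s;
}

// host code: 16 x 16 = 256 threads per block (within the 512-thread limit),
// and enough blocks in each dimension to cover the whole matrix
int n = 1024;
dim3 block_d(16, 16, 1);
dim3 grid_d((n + 15) / 16, (n + 15) / 16, 1);
scale_matrix <<< grid_d, block_d >>> (m_d, n, 2.0f);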
Execution Parameters and Kernel Launch
dim3 grid_dim(2, 2, 1);
dim3 block_dim(4, 2, 2);
[Diagram: a 2 x 2 grid of blocks, each block a 4 x 2 x 2 array of threads]
Limits on gridDim and blockDim
 The maximum size of a block (blockDim.x *
blockDim.y * blockDim.z) is 512 threads, regardless of
dimension. You cannot increase the number of allowed
threads by adding another dimension
 Since a block is limited to 512 threads, one block per grid will
usually not be sufficient
 The values of gridDim.x and gridDim.y can range from
1 to 65,535
Kernel Invocation example
int main(void)
{
    float *a_h, *a_d;
    const int N = 10;
    size_t size = N * sizeof(float);            // size of array in memory
    int i;

    a_h = (float *)malloc(size);                // allocate array on host
    cudaMalloc((void **) &a_d, size);           // allocate array on device
    for (i = 0; i < N; i++) a_h[i] = (float)i;  // initialize array
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

    int block_size = 4;                         // set up execution parameters
    int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);

    cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
    cudaFree(a_d); free(a_h);                   // free allocated memory
}
Kernel Functions
 A kernel function specifies the code to be executed by all threads in
parallel – an instance of single-program, multiple-data (SPMD)
parallel programming.
 A kernel function declaration is a C function extended with one of
three keywords: “__device__”, “__global__”, or “__host__”.
Declaration                        Executed on the:   Only callable from the:
__device__ float DeviceFunc()      device             device
__global__ void KernelFunc()       device             host
__host__ float HostFunc()          host               host
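
A small sketch showing the three qualifiers together (the function names are illustrative, not from the slides):

__device__ float squared(float x) {              // runs on the device, callable only from device code
    return x * x;
}

__global__ void square_all(float *a, int N) {    // kernel: runs on the device, launched from the host
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = squared(a[idx]);
}

__host__ float host_squared(float x) {           // ordinary host function (__host__ is the default)
    return x * x;
}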
CUDA Thread Organization
 Since all threads of a grid execute the same code, they rely on
a two-level hierarchy of coordinates to distinguish themselves:
blockIdx and threadIdx
 The example code fragment:
ID = blockIdx.x * blockDim.x + threadIdx.x;
will yield a unique ID for every thread across all blocks of a grid
Kernel Function and Threading example
CUDA kernel function:

__global__ void square_array(float *a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

Compare with serial C version:

void square_array(float *a, int N) {
    int i;
    for (i = 0; i < N; i++) a[i] = a[i] * a[i];
}
CUDA Device Memory Types
 Global Memory and Constant Memory can be accessed by the host and device. Constant Memory serves read-only data to the device at high bandwidth. Global Memory is read-write and has a longer latency
 Registers, Local Memory, and Shared Memory are accessible only to the device. Registers and Local Memory are available only to their own thread. Shared Memory is accessible to all threads within the same block
CUDA Memory Model
[Diagram: the device's per-thread registers and local memory, per-block shared memory, and per-grid global and constant memory, with the host reading and writing global and constant memory]
CUDA Variable Type Qualifiers
Variable Declaration                      Memory     Scope     Lifetime
Automatic variables (non-array)           Register   Thread    Kernel
Automatic array variables                 Local      Thread    Kernel
__device__ __shared__   int var1;         Shared     Block     Kernel
__device__              int var2;         Global     Grid      Application
__device__ __constant__ int var3;         Constant   Grid      Application
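
A hedged sketch illustrating each row of the table (the variable and kernel names are illustrative, not from the slides):

__constant__ float coeffs[16];          // constant memory: grid scope, application lifetime
__device__   float global_scale;        // global memory: grid scope, application lifetime

__global__ void qualifier_demo(float *out) {
    int tid = threadIdx.x;              // automatic scalar -> register (thread scope, kernel lifetime)
    float local_buf[4];                 // automatic array  -> local memory (thread scope)
    __shared__ float tile[64];          // shared memory: block scope (assumes blockDim.x <= 64)

    local_buf[0] = coeffs[tid % 16] * global_scale;
    tile[tid] = local_buf[0];
    out[blockIdx.x * blockDim.x + tid] = tile[tid];
}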
Synchronization
 Threads in the same block can synchronize using __syncthreads() (see the sketch after this list)
 When used in an if-then-else construction, all threads must branch on the same path or they will wait on each other forever, i.e., the __syncthreads() in the if branch is distinct from the __syncthreads() in the else branch
 Threads in different blocks cannot perform barrier synchronization with each other. This is a major constraint, but it comes as part of a big scalability trade-off
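
A minimal sketch of the barrier in action, assuming a block-local reversal kernel (reverse_block) that is not from the slides and a block size of 128 threads:

__global__ void reverse_block(float *a) {
    __shared__ float tile[128];                   // one slot per thread in the block
    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    tile[tid] = a[base + tid];                    // every thread stages one element
    __syncthreads();                              // wait until the whole block has finished loading
    a[base + tid] = tile[blockDim.x - 1 - tid];   // now safe to read another thread's element
}

Without the barrier, a thread could read tile[blockDim.x - 1 - tid] before the thread that owns that slot has written it.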
Transparent Scalability
 CUDA’s synchronization constraint allows blocks to be executed in any order relative to each other, providing transparent scalability
 The exact same code can be executed on devices with different execution resources; execution time scales roughly inversely with the available resources
CUDA Atomic Operations
 Race condition review: When two or more concurrently
running threads access a shared data item and the result
depends on the order of execution.
Deposit operation:  load balance; add amount; store balance

Desired action:
    Balance: 100
    Deposit: 10
    Deposit: 200
    Balance: 310

Possible problem (two deposits interleaved):
    Thread 1: load  100
    Thread 1: add   10
    Thread 2: load  100
    Thread 2: add   200
    Thread 1: store 110
    Thread 2: store 300
    Balance: 300   (the deposit of 10 is lost)

 We use “atomics” to solve this problem
CUDA Atomic Operations
 Race conditions are exceptionally problematic in massively
parallel programs when thousands of threads access data
simultaneously
 CUDA provides many atomic functions for integers, including atomicAdd(), atomicSub(), atomicExch(), atomicMin(), and atomicMax() (see the sketch after this list)
 Atomic operations can create bottlenecks which collapse your
parallel program to a serial program and significantly degrade
performance
 Use sparingly and use wisely
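
As a hedged sketch of atomicAdd() (the histogram kernel and its bin scheme are illustrative, not from the slides):

__global__ void histogram(const int *values, int n, int *bins, int num_bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // assumes non-negative values; concurrent increments of the same bin cannot be lost
    if (i < n) atomicAdd(&bins[values[i] % num_bins], 1);
}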
Example of using a CUDA atomic wisely to find a global maximum

Naïve approach:

__global__ void global_max(int* values, int* gl_max) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    int val = values[i];
    atomicMax(gl_max, val);
}

Better idea:

__global__ void global_max(int* values, int* gl_max,
                           int* regional_maxes, int num_regions) {
    // int i and val as before; region indexes this thread's slot in regional_maxes
    if (atomicMax(&regional_maxes[region], val) < val) {
        atomicMax(gl_max, val);
    }
}
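
 In the second version, most threads contend only for their region’s maximum; only the few values that exceed a regional maximum touch the single global location, so serialization on gl_max is greatly reduced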
Georgia Advanced Computing Resource
Center (GACRC)
 GACRC resources
 Requesting an account
 Setting up the user environment (e.g., path variables)
 Compiling a CUDA program using nvcc
 Creating a submission shell script
 Submitting a job to the queue
My First CUDA Program
 squares.c and squares.cu are identical serial C programs
 Edit squares.cu with CUDA keywords to port it to a parallel program (leave squares.c clean to refer back to if necessary); a possible end result is sketched after this list
 Compile with makefile
 Create submission shell script
 Submit to queue
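
For reference, one possible end result, assembled from the code fragments on the earlier slides (the printf loop at the end is an illustrative addition, not from the slides):

#include <stdio.h>
#include <stdlib.h>

__global__ void square_array(float *a, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

int main(void)
{
    float *a_h, *a_d;
    const int N = 10;
    size_t size = N * sizeof(float);
    int i;

    a_h = (float *)malloc(size);                            // allocate array on host
    cudaMalloc((void **) &a_d, size);                       // allocate array on device
    for (i = 0; i < N; i++) a_h[i] = (float)i;              // initialize array on host

    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);     // copy to device
    int block_size = 4;
    int n_blocks = N/block_size + (N%block_size == 0 ? 0 : 1);
    square_array <<< n_blocks, block_size >>> (a_d, N);
    cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);     // copy result back

    for (i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);   // print the squared values
    cudaFree(a_d); free(a_h);                               // free device and host memory
    return 0;
}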
More CUDA Training Resources
 University of Georgia CUDA Teaching Center: http://cuda.uga.edu
 Nvidia training and education site: http://developer.nvidia.com/cuda-education-training
 Stanford University course on iTunes U: http://itunes.apple.com/us/itunesu/programming-massively-parallel/id384233322
 University of Illinois: http://courses.engr.illinois.edu/ece498/al/Syllabus.html
 University of California, Davis: https://smartsite.ucdavis.edu/xsl-portal/site/1707812c4009-4d91-a80e-271bde5c8fac/page/de40f2cc-40d9-4b0f-a2d3-e8518bd0266a
 University of Wisconsin: http://sbel.wisc.edu/Courses/ME964/2011/me964Spring2011.pdf
 University of North Carolina at Charlotte: http://coitweb.uncc.edu/~abw/ITCS6010S11/index.html
References
 Kirk, D., & Hwu, W. (2010). Programming Massively Parallel Processors: A Hands-on Approach, 1 – 75
 Tarjan, D. (2010). Introduction to CUDA, Stanford University on iTunes U
 Atallah, M. J. (Ed.), (1998). Algorithms and theory of computation handbook. Boca Raton, FL: CRC Press
 von Neumann, J. (1945). First draft of a report on the EDVAC. Contract No. W-670-ORD-4926, U.S. Army Ordnance Department and University of Pennsylvania
 Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue, 3(7), 54 – 62
 Stratton, J. A., Stone, S. S., & Hwu, W. W. (2008). MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. Edmonton, Canada
 Vandenbout, Dave (2008). My First Cuda Program, http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/