Basic Concepts. Learning CUDA to Solve Scientific Problems.

Table of Contents
Basic Concepts
Objectives
Introduction
CUDA Program Structure
Programming Model
Kernel
Thread Hierarchy
Memory Hierarchy
Function Qualifiers
Variable Qualifiers
Launching Kernels
Synchronization Function
Final Recap
Learning CUDA to Solve Scientific Problems.
Miguel Cárdenas Montes
Centro de Investigaciones Energéticas Medioambientales y Tecnológicas,
Madrid, Spain
miguel.cardenas@ciemat.es
2010
Objectives
To understand the main differences between the CUDA Programming Model and standard C.
To recognize the main features of the CUDA Programming Model.
Technical issues:
Kernel.
Thread Hierarchy.
Memory Hierarchy.
Introduction

Introduction I

During the last years, the programmable Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational power and very high memory bandwidth.
These GPUs can be used to tackle scientific problems requiring high computational effort.

Introduction II
Specifically, the GPU is well suited to address problems that can be expressed as data-parallel computations: the same program/algorithm/calculation is executed on many data elements in parallel.
Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements and has high arithmetic intensity, memory access latency can be hidden with calculations instead of big data caches.
Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations.
Introduction III

On the contrary, the GPU architecture is not suitable for every kind of heterogeneous calculation.

Introduction IV

In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing architecture – with a new parallel programming model and instruction set architecture – that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU.
CUDA comes with a software environment that allows developers to use C as a high-level programming language.
CUDA Program Structure

CUDA Program Structure I

A CUDA program consists of one or more phases that are executed on either the host (CPU) or a device such as a GPU.
The phases that exhibit a rich amount of data parallelism are implemented in the device code.
The program supplies a single source code encompassing both host and device code.
The device code is written in ANSI C extended with keywords for labeling data-parallel functions, called kernels, and their associated data structures.
CUDA Program Structure II

The kernel functions generate a large number of threads to exploit data parallelism.
The execution starts with host execution.
When a kernel function is invoked, the execution is moved to a device, where a large number of threads are generated to take advantage of abundant data parallelism.
All the threads that are generated by a kernel during an invocation are collectively called a grid.
When all the threads of a kernel complete their execution, the corresponding grid terminates and the execution continues on the host until another kernel is invoked.
The CUDA threads are much lighter than CPU threads.
CUDA program structure (schema): [figure not reproduced]

CUDA Program Structure III

What runs on a CUDA device?
The device is suited for computations that can be run in parallel. That is, data parallelism is optimally handled on the device. This typically involves arithmetic on large data sets (such as matrices), where the same operation can be performed across thousands or millions of elements at the same time.
There should be some coherence in memory access by a kernel. Certain memory access patterns enable the hardware to coalesce groups of data items so that they are written and read in one operation. Data that cannot be laid out so as to enable coalescing will not enjoy much of a performance lift when used in computations on CUDA; a short sketch follows.
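As an illustration of coalescing (not part of the original slides), the sketch below contrasts a coalesced access pattern with a strided one; the kernel names copyCoalesced and copyStrided are assumptions chosen for the example.

// Coalesced: consecutive threads read consecutive addresses, so the hardware
// can service a whole group of threads with a single memory transaction.
__global__ void copyCoalesced(float* out, const float* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Not coalesced: consecutive threads read addresses that are `stride` elements
// apart, which requires many separate transactions and hurts performance.
__global__ void copyStrided(float* out, const float* in, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}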
CUDA Program Structure IV

What runs on a CUDA device?
Traffic along the PCI bus should be minimized. To use CUDA, data values must be transferred from the host to the device. These data transfers are costly in terms of performance, so they should be minimized.
The complexity of the operations should justify the cost of moving data to the device. Code that transfers data for brief use by a small number of threads will see little or no performance lift.
Data should be kept on the device as long as possible. Because transfers should be minimized, programs that run multiple kernels on the same data should favor leaving the data on the device between kernel calls, rather than transferring intermediate results to the host and then sending them back to the device for subsequent calculations.
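A minimal sketch of this guideline (not from the original slides): two hypothetical kernels, stepOne and stepTwo, operate on the same device buffer a_d, and the intermediate result never travels back to the host.

cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);  // one transfer in
stepOne<<<n_blocks, block_size>>>(a_d, n);             // intermediate result stays in a_d
stepTwo<<<n_blocks, block_size>>>(a_d, n);             // consumes it directly on the device
cudaMemcpy(a_h, a_d, nbytes, cudaMemcpyDeviceToHost);  // one transfer out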
Programming Model
Programming Model I

Now the challenge is to develop applications that transparently scale their parallelism to leverage the increasing number of processor cores.
At its core, the CUDA programming model offers three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization.

Programming Model II

Data-parallel, compute-intensive functions should be off-loaded to the device.
A function compiled for the device is called a kernel. The kernel is executed on the device by many different threads.
Functions that are executed many times, but independently on different data, are prime candidates to become kernels, e.g. the body of for-loops.
Both the host (CPU) and the device (GPU) manage their own memory, host memory and device memory. Data can be copied between them.
Kernel I
C for CUDA extends C by allowing the programmer to define C
functions, called kernels, that, when called, are executed N times in
parallel by N different CUDA threads, as opposed to only once like
regular C functions.
A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads for each call is specified using the <<<...>>> execution-configuration syntax.
Kernel II. Example
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    ...
}

int main()
{
    // Kernel invocation
    VecAdd<<<1, N>>>(A, B, C);
}
Thread Hierarchy I
Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.
For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block.
This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or field.

// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    // Kernel invocation
    VecAdd<<<1, N>>>(A, B, C);
}

Example: the code adds two vectors A and B of size N and stores the result into vector C.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock(N, N);
    MatAdd<<<1, dimBlock>>>(A, B, C);
}
Example: code adds two matrices A and B of size NxN
and stores the result into matrix C.
Thread Hierarchy II

The index of a thread and its thread ID relate to each other in a straightforward way:
For a one-dimensional block, they are the same;
for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y Dx);
for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y Dx + z Dx Dy).
On current GPUs, a thread block may contain up to 512 threads. However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
These multiple blocks are organized into a one-dimensional or two-dimensional grid of thread blocks; a short sketch of the ID computation follows.
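A minimal sketch (not from the original slides) of the flattening formula above; the kernel name flatThreadId and the output array are assumptions chosen for the example.

// Hypothetical kernel: computes the flattened thread ID inside a 3D block
// using the formula x + y*Dx + z*Dx*Dy, with (Dx, Dy, Dz) = blockDim.
__global__ void flatThreadId(int* out)
{
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    out[tid] = tid;   // each thread writes its own flattened ID
}

// Possible launch: a single block of 8 x 4 x 2 = 64 threads.
// dim3 block(8, 4, 2);
// flatThreadId<<<1, block>>>(out_d);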
Thread Hierarchy III

These multiple blocks are organized into a one-dimensional or two-dimensional grid of thread blocks.
The dimension of the grid is specified by the first parameter of the <<< ... >>> syntax.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    // Kernel invocation
    dim3 dimBlock(16, 16);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (N + dimBlock.y - 1) / dimBlock.y);
    MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
}
Example: code adds two matrices A and B of size NxN and stores the
result into matrix C.
Each block has a unique block ID.
Thread Hierarchy IV

Each block within the grid can be identified by a one-dimensional or two-dimensional index, accessible within the kernel through the built-in blockIdx variable.
Recap I

Multiple levels of parallelism:
Thread block:
  Up to 512 threads per block.
  Threads communicate via shared memory.
  Threads are guaranteed to be resident.
  Built-in variables: threadIdx, blockIdx.
Grid of thread blocks:
  kernel<<<N,T>>>(a, b, c)
  Blocks communicate via global memory.
blockIdx and threadIdx provide a means to distinguish the threads among themselves when they are executing the same kernel.
Recap II

The computational grid consists of a grid of blocks.
Each thread executes the kernel.
The kernel invocation specifies the grid and block dimensions (mandatory); further execution-configuration parameters are optional.
The grid layouts can be 1-, 2-, or 3-dimensional.
Each block has a unique block ID.
Each thread has a unique thread ID (within its block).

Example I

__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) a[idx] = a[idx] * a[idx];
}

// Number of elements in the arrays
const int N = 1600;

// Do the calculation on the device:
int block_size = 4;
int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
square_array<<<n_blocks, block_size>>>(a_d, N);
Example II

C code:

void add_matrix( float* a, float* b, float* c, int N )
{
    int index;
    for ( int i = 0; i < N; ++i )
        for ( int j = 0; j < N; ++j ) {
            index = i + j*N;
            c[index] = a[index] + b[index];
        }
}

int main()
{
    add_matrix( a, b, c, N );
}

CUDA code:

__global__ void add_matrix( float* a, float* b, float* c, int N )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j*N;
    if ( i < N && j < N )
        c[index] = a[index] + b[index];
}

int main()
{
    dim3 dimBlock( blocksize, blocksize );
    dim3 dimGrid( N/dimBlock.x, N/dimBlock.y );
    add_matrix<<<dimGrid, dimBlock>>>( a, b, c, N );
}
Memory Hierarchy
CPU and GPU have separate memory spaces.
Host (CPU) code manages device (GPU) memory:
  Allocate / free.
  Copy data to and from the device.
  Applies to global device memory (DRAM).

cudaMalloc
cudaMalloc(void** pointer, size_t nbytes)
Called from the host code to allocate a piece of global memory for an object.

cudaMemcpy
cudaMemcpy(destination, source, size, movement direction)
Copies information from one location to another. It cannot be used to copy between different GPUs in multi-GPU systems.

cudaFree
cudaFree(void* pointer)
Releases memory.

Example:
int n = 1024;
int nbytes = 1024*sizeof(int);
int *a_d = 0;
cudaMalloc( (void**)&a_d, nbytes );
cudaMemset( a_d, 0, nbytes);
cudaFree(a_d);
Copy Data

The commands for moving data between host and device are:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice
cudaMemcpyHostToHost

Data Movement Example I

Code example of data allocation, movement between host and device, copy, and release:
#include <assert.h>
#include <stdlib.h>

int main(void)
{
    float *a_h, *b_h; // host data
    float *a_d, *b_d; // device data
    int N = 14, nBytes, i;
    nBytes = N*sizeof(float);
    a_h = (float *)malloc(nBytes);
    b_h = (float *)malloc(nBytes);
    cudaMalloc((void **) &a_d, nBytes);
    cudaMalloc((void **) &b_d, nBytes);
    for (i=0; i<N; i++) a_h[i] = 100.f + i;
    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
    cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);
    for (i=0; i<N; i++) assert( a_h[i] == b_h[i] );
    free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
    return 0;
}
Data Handle. Recap. I

In CUDA, the host and the devices have separate memory spaces.
The GPU card has its own Dynamic Random Access Memory (DRAM).
In order to execute a kernel on a device, the programmer needs to allocate memory on the device and transfer the pertinent data from the host memory to the allocated device memory.
After device execution, the programmer needs to transfer the result data from the device back to the host and free up the device memory that is no longer needed.
Data Handle. Recap. II

The function cudaMalloc() can be called from the host code to allocate a piece of global memory for an object.
The first parameter of cudaMalloc() is the address of a pointer that must point to the allocated object after a piece of global memory is allocated to it.
The second parameter gives the size of the object to be allocated.
Data Handle. Recap. III

Once a program has allocated device memory for the data objects, it can request that data be transferred from the host to the device memory.
The cudaMemcpy() function requires four parameters:
The first parameter points to the destination location for the copy operation.
The second parameter is a pointer to the source data object to be copied.
The third parameter specifies the number of bytes to be copied.
The fourth parameter indicates the types of memory involved in the copy.
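For instance (a sketch reusing the a_h and a_d pointers from the Data Movement Example above), the four parameters appear in that order: destination, source, size in bytes, and direction.

cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);  // host -> device
cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);  // device -> host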
Function Qualifiers
Kernels are designated by the function qualifier:
__global__
  Function called from the host and executed on the device.
  Must return void.

Other CUDA function qualifiers:
__device__
  Function called from the device and run on the device.
  Cannot be called from host code.
__host__
  Function called from the host and executed on the host (default).

The __host__ and __device__ qualifiers can be combined to generate code for both the CPU and the GPU; a short sketch follows.
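A minimal sketch (not from the original slides) showing the qualifiers together; the function names scale, deviceHelper, applyScale, and hostOnly are assumptions chosen for the example.

// Callable from host and device alike: compiled for both CPU and GPU.
__host__ __device__ float scale(float x) { return 2.0f * x; }

// Callable only from device code (for instance, from a kernel).
__device__ float deviceHelper(float x) { return scale(x) + 1.0f; }

// Kernel: called from the host, executed on the device, returns void.
__global__ void applyScale(float* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = deviceHelper(a[i]);
}

// Ordinary host function (the __host__ qualifier is the default).
__host__ void hostOnly(float x) { /* runs on the CPU */ }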
Variable Qualifiers I
Variable qualifiers:
__device__
  Stored in global memory (large, high latency, no cache).
  Allocated with cudaMalloc (the __device__ qualifier is implied).
  Accessible by all threads.
  Lifetime: application.
__constant__
  Read-only access by the device.
  Provides faster and more parallel data access paths for CUDA kernel execution than the global memory.
  Lifetime: application.
__shared__
  Stored in on-chip shared memory (very low latency).
  Size specified by the execution configuration or at compile time.
  Accessible by all threads in the same thread block.
  Lifetime: thread block.

Variable Qualifiers II

Variable declaration                      Memory     Scope    Lifetime
Automatic variables (other than arrays)   register   thread   kernel
Automatic array variables                 global     thread   kernel
__device__ __shared__ int SharedVar;      shared     block    kernel
__device__ int GlobalVar;                 global     grid     application
__device__ __constant__ int ConstVar;     constant   grid     application
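A brief sketch (not from the original slides) of the qualifiers classified above; the names ConstCoeff, GlobalCounter, and useQualifiers are assumptions chosen for the example, and the block is assumed to have at most 256 threads.

// Constant memory: read-only for the device; typically set from the host
// with cudaMemcpyToSymbol().
__constant__ float ConstCoeff;

// Global memory variable: visible to all threads, lives for the whole application.
__device__ int GlobalCounter;

__global__ void useQualifiers(float* a, int n)
{
    // Shared memory: one copy per thread block, very low latency.
    __shared__ float tile[256];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = a[i];   // stage data in shared memory
    __syncthreads();                       // the whole block waits here
    if (i < n) a[i] = ConstCoeff * tile[threadIdx.x];
}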
Launching Kernels
When a kernel is invoked, it is executed as a grid of parallel threads.
Modified C function call syntax:
kernel<<<dim3 dG, dim3 dB>>>(...)
Execution configuration ("<<< >>>"):
dG - dimension and size of the grid in blocks:
  Two-dimensional: x and y.
  Blocks launched in the grid: dG.x * dG.y.
  Hardware-imposed limit: 65,535 blocks per dimension.
dB - dimension and size of the blocks in threads:
  Three-dimensional: x, y, and z.
  Threads per block: dB.x * dB.y * dB.z.
  Hardware-imposed limit: 512 threads per block.
  Examples: (512,1,1), (8,16,2) or (16,16,2).
  Not allowed: (32,32,1), since 32*32*1 = 1024 exceeds the 512-thread limit.
A launch sketch follows.
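A minimal launch sketch (not from the original slides); the kernel name scaleKernel, the device pointer a_d, and the problem size are assumptions chosen for the example.

// Hypothetical launch: a 1D grid of 1D blocks covering n elements.
int n = 100000;
dim3 dB(256, 1, 1);                 // 256 threads per block (within the 512 limit)
dim3 dG((n + dB.x - 1) / dB.x, 1);  // enough blocks to cover all n elements
scaleKernel<<<dG, dB>>>(a_d, n);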
Synchronization Function

Synchronization Function I

Device Runtime Component: Synchronization Function.
Kernel launches are always asynchronous.
cudaMemcpy is synchronous by default, although an asynchronous version of the instruction exists.
Synchronization Function II

void __syncthreads();
Synchronizes all the threads in a block. Once all threads have reached this point, execution resumes normally.
Threads in different blocks cannot synchronize with each other.
The only safe way for threads in different blocks to synchronize with each other is to terminate the kernel and start a new kernel for the activities after the synchronization point.

Synchronization Function III

cudaThreadSynchronize(void);
Blocks until the device has completed all preceding requested tasks.
A sketch using both functions follows.
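A minimal sketch (not from the original slides) using both functions; the kernel name reverseInBlock and the block size of 256 threads are assumptions chosen for the example.

// Hypothetical kernel: reverses the elements handled by each block, using
// shared memory and a barrier so that every load finishes before any store.
__global__ void reverseInBlock(float* a)
{
    __shared__ float buf[256];                  // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = a[i];
    __syncthreads();                            // barrier: the whole block has loaded
    a[i] = buf[blockDim.x - 1 - threadIdx.x];
}

// Host side:
// reverseInBlock<<<n_blocks, 256>>>(a_d);      // asynchronous launch
// cudaThreadSynchronize();                     // wait until the kernel has finished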
Final Recap. A typical CUDA program
//allocate memory space in global device memory for input data
cudaMalloc(...);
//copy input data from host to the allocated device space
cudaMemcpy(...);
//allocate memory space in global device memory for the output
cudaMalloc(...);
//define block and grid size for the kernel;
dim3 grid (x,y);
dim3 block (x,y,z);
// launch kernel
CUDA_kernel<<<grid,block>>>(...);
//copy output data from device memory to the host
cudaMemcpy(...);
//free all device allocated memory (inputs and outputs)
cudaFree(...);
A typical CUDA kernel
__global__ void CUDA_kernel (...)
{
    // declare a shared memory array (optional)
    __shared__ float array_s[...];
    // figure out the index into the different arrays in terms of blockIdx, threadIdx, and the block size
    int index = ...;
    // bring in data from global memory (into registers, or shared memory)
    ...
    // do the computation
    ...
    // copy the data back to global memory (from registers or shared memory)
    ...
}
Thanks
Questions?
More questions?