Learning CUDA to Solve Scientific Problems
T1. Basic Concepts

Miguel Cárdenas Montes
Centro de Investigaciones Energéticas Medioambientales y Tecnológicas (CIEMAT), Madrid, Spain
miguel.cardenas@ciemat.es
2010


Table of Contents

- Objectives
- Introduction
- CUDA Program Structure
- Programming Model
  - Kernel
  - Thread Hierarchy
  - Memory Hierarchy
  - Function Qualifiers
  - Variable Qualifiers
  - Launching Kernels
  - Synchronization Function
- Final Recap


Objectives

- To understand the main differences between the CUDA programming model and normal C.
- To recognize the main features of the CUDA programming model.

Outline: Introduction; Technical issues (Kernel, Thread Hierarchy, Memory Hierarchy).


Introduction I

During the last years, the programmable Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational power and very high memory bandwidth. These GPUs can be used to tackle scientific problems requiring high computational effort.


Introduction II

Specifically, the GPU is well suited to address problems that can be expressed as data-parallel computations: the same program/algorithm/calculation is executed on many data elements in parallel. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control; and because it is executed on many data elements with high arithmetic intensity, memory access latency can be hidden with calculations. Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations.


Introduction III

On the other hand, the GPU architecture is not suitable for every kind of calculation; it pays off only where abundant data parallelism exists.


Introduction IV

In November 2006, NVIDIA introduced CUDA, a general-purpose parallel computing architecture, with a new parallel programming model and instruction set architecture, that leverages the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU. CUDA comes with a software environment that allows developers to use C as a high-level programming language.


CUDA Program Structure I

A CUDA program consists of one or more phases that are executed on either the host (CPU) or a device such as a GPU. The phases that exhibit a rich amount of data parallelism are implemented in the device code. The program supplies a single source code encompassing both host and device code. The device code is written using ANSI C extended with keywords for labelling data-parallel functions, called kernels, and their associated data structures.
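As a minimal sketch of this single-source structure (not from the original slides; the kernel name, the data size of 256 and the launch configuration are arbitrary choices), a .cu file mixing host and device code might look like this:

    // sketch.cu: one source file containing both host and device code.
    __global__ void scale(float *data, float factor)        // device code: a kernel
    {
        int i = threadIdx.x;
        data[i] = data[i] * factor;
    }

    int main(void)                                           // host code: runs on the CPU
    {
        float *d_data = 0;
        cudaMalloc((void **)&d_data, 256 * sizeof(float));   // device memory for 256 floats
        cudaMemset(d_data, 0, 256 * sizeof(float));          // initialise it on the device
        scale<<<1, 256>>>(d_data, 2.0f);                     // execution moves to the device here
        cudaFree(d_data);
        return 0;
    }

Everything marked __global__ is device code; the rest is ordinary host C code.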
CUDA Program Structure II

The kernel functions generate a large number of threads to exploit data parallelism. The execution starts with host execution. When a kernel function is invoked, the execution is moved to a device, where a large number of threads are generated to take advantage of abundant data parallelism. All the threads that are generated by a kernel during an invocation are collectively called a grid.


CUDA Program Structure III

When all threads of a kernel complete their execution, the corresponding grid terminates and the execution continues on the host until another kernel is invoked. CUDA threads are much lighter than CPU threads.

CUDA program structure (schema): [figure in the original slides]


What runs on a CUDA device?

The device is suited for computations that can be run in parallel; that is, data parallelism is optimally handled on the device. This typically involves arithmetic on large data sets (such as matrices), where the same operation can be performed across thousands or millions of elements at the same time.

There should be some coherence in memory access by a kernel. Certain memory access patterns enable the hardware to coalesce groups of data items so that they are written and read in one operation. Data that cannot be laid out so as to enable coalescing will not enjoy much of a performance lift when used in computations on CUDA.


CUDA Program Structure IV. What runs on a CUDA device? (continued)

Traffic along the PCI bus should be minimized. To use CUDA, data values must be transferred from the host to the device. These data transfers are costly in terms of performance, so they should be minimized.

The complexity of the operations should justify the cost of moving data to the device. Code that transfers data for brief use by a small number of threads will see little or no performance lift.

Data should be kept on the device as long as possible. Because transfers should be minimized, programs that run multiple kernels on the same data should favour leaving the data on the device between kernel calls, rather than transferring intermediate results to the host and then sending them back to the device for subsequent calculations.


Programming Model

Programming Model I

Now the challenge is to develop applications that transparently scale their parallelism to leverage the increasing number of processor cores. At its core, the CUDA programming model offers three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization.


Programming Model II

Data-parallel, compute-intensive functions should be off-loaded to the device. A function compiled for the device is called a kernel. The kernel is executed on the device by many different threads. Functions that are executed many times, but independently on different data, are prime candidates to become kernels, e.g. the body of a for-loop. Both host (CPU) and device (GPU) manage their own memory, host memory and device memory; data can be copied between them.
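As an illustration of the last point (a sketch, not from the original slides; saxpy and the 256-thread block size are arbitrary choices), an independent for-loop and its kernel counterpart:

    // Serial C version: each iteration is independent of the others.
    //   for (int i = 0; i < n; i++) y[i] = a * x[i] + y[i];

    // CUDA version: one thread per loop iteration.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // which iteration this thread handles
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Possible launch from host code: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);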
Kernels I

C for CUDA extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads for each call is specified using a new <<< ... >>> syntax.


Kernel II. Example

    // Kernel definition
    __global__ void VecAdd(float* A, float* B, float* C)
    {
    }

    int main()
    {
        // Kernel invocation
        VecAdd<<<1, N>>>(A, B, C);
    }


Thread Hierarchy I

Each of the threads that execute a kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or field.

Example: the code adds two vectors A and B of size N and stores the result into vector C.

    // Kernel definition
    __global__ void VecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;
        C[i] = A[i] + B[i];
    }

    int main()
    {
        // Kernel invocation
        VecAdd<<<1, N>>>(A, B, C);
    }

For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block.

Example: the code adds two matrices A and B of size NxN and stores the result into matrix C.

    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Kernel invocation
        dim3 dimBlock(N, N);
        MatAdd<<<1, dimBlock>>>(A, B, C);
    }


Thread Hierarchy II

The index of a thread and its thread ID relate to each other in a straightforward way: for a one-dimensional block they are the same; for a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is x + y*Dx; for a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is x + y*Dx + z*Dx*Dy. (A snippet after Thread Hierarchy IV writes these formulas with the built-in variables.)

On current GPUs, a thread block may contain up to 512 threads. However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks. These multiple blocks are organized into a one-dimensional or two-dimensional grid of thread blocks.


Thread Hierarchy III

The dimension of the grid is specified by the first parameter of the <<< ... >>> syntax. Each block has a unique block ID.

Example: the code adds two matrices A and B of size NxN and stores the result into matrix C; the indices now combine blockIdx, blockDim and threadIdx.

    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        // Kernel invocation
        dim3 dimBlock(16, 16);
        dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                     (N + dimBlock.y - 1) / dimBlock.y);
        MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
    }


Thread Hierarchy IV

Each block within the grid can be identified by a one-dimensional or two-dimensional index, available inside the kernel through the built-in blockIdx variable, as the MatAdd example above already shows.
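The following snippet (a sketch, not from the original slides; the helper name flatThreadId is arbitrary) writes the thread-ID formulas of Thread Hierarchy II with the built-in variables, where blockDim plays the role of (Dx, Dy, Dz):

    __device__ int flatThreadId(void)
    {
        return threadIdx.x                               // x
             + threadIdx.y * blockDim.x                  // + y*Dx
             + threadIdx.z * blockDim.x * blockDim.y;    // + z*Dx*Dy
    }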
Recap I

Multiple levels of parallelism:

- Thread block:
  - up to 512 threads per block;
  - threads communicate via shared memory;
  - threads are guaranteed to be resident;
  - identified via threadIdx and blockIdx.
- Grid of thread blocks:
  - kernel<<<N, T>>>(a, b, c);
  - blocks communicate via global memory.

blockIdx and threadIdx provide a means for threads to distinguish themselves from one another when they are executing the same kernel.


Recap II

The computational grid consists of a grid of blocks, and each thread executes the kernel. The kernel invocation specifies the grid and block dimensions (mandatory); further execution parameters, such as the amount of dynamic shared memory, are optional. The grid layout can be one- or two-dimensional, while each block can be one-, two-, or three-dimensional. Each block has a unique block ID, and each thread has a unique thread ID within its block.


Example I

    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] * a[idx];
    }

    // Number of elements in arrays
    const int N = 1600;

    // Do calculation on device:
    int block_size = 4;
    int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
    square_array<<<n_blocks, block_size>>>(a_d, N);


Example II

C code:

    void add_matrix(float* a, float* b, float* c, int N)
    {
        int index;
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                index = i + j*N;
                c[index] = a[index] + b[index];
            }
    }

    int main()
    {
        add_matrix(a, b, c, N);
    }

CUDA code:

    __global__ void add_matrix(float* a, float* b, float* c, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int index = i + j*N;
        if (i < N && j < N)
            c[index] = a[index] + b[index];
    }

    int main()
    {
        dim3 dimBlock(blocksize, blocksize);
        dim3 dimGrid(N/dimBlock.x, N/dimBlock.y);
        add_matrix<<<dimGrid, dimBlock>>>(a, b, c, N);
    }


Memory Hierarchy

CPU and GPU have separate memory spaces. Host (CPU) code manages device (GPU) memory: it allocates and frees it, and copies data to and from the device. This applies to global device memory (DRAM).

cudaMalloc

    cudaMalloc(void **pointer, size_t nbytes)

Called from the host code to allocate a piece of global memory for an object.

cudaMemcpy

    cudaMemcpy(destination, source, size, direction)

Copies information from one location to another. It cannot be used to copy between different GPUs in multi-GPU systems.

cudaFree

    cudaFree(void *pointer)

Releases the memory.

Example:

    int n = 1024;
    int nbytes = 1024 * sizeof(int);
    int *a_d = 0;
    cudaMalloc((void **)&a_d, nbytes);
    cudaMemset(a_d, 0, nbytes);
    cudaFree(a_d);
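The slides above omit error handling, but each of these runtime calls returns a cudaError_t that can be checked; a minimal sketch (not from the original slides; checkCuda is a hypothetical helper):

    #include <stdio.h>
    #include <stdlib.h>

    static void checkCuda(cudaError_t err, const char *what)
    {
        if (err != cudaSuccess) {
            fprintf(stderr, "CUDA error in %s: %s\n", what, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    int main(void)
    {
        int *a_d = 0;
        checkCuda(cudaMalloc((void **)&a_d, 1024 * sizeof(int)), "cudaMalloc");
        checkCuda(cudaMemset(a_d, 0, 1024 * sizeof(int)), "cudaMemset");
        checkCuda(cudaFree(a_d), "cudaFree");
        return 0;
    }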
Copy Data. Data Movement Example

The constants specifying the direction of a cudaMemcpy transfer are:

- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
- cudaMemcpyHostToHost

Code example of data allocation, movement between host and device, copy and release:

    #include <stdlib.h>
    #include <assert.h>

    int main(void)
    {
        float *a_h, *b_h;   // host data
        float *a_d, *b_d;   // device data
        int N = 14, nBytes, i;

        nBytes = N * sizeof(float);
        a_h = (float *)malloc(nBytes);
        b_h = (float *)malloc(nBytes);
        cudaMalloc((void **)&a_d, nBytes);
        cudaMalloc((void **)&b_d, nBytes);

        for (i = 0; i < N; i++) a_h[i] = 100.f + i;

        cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);
        cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);
        cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);

        for (i = 0; i < N; i++) assert(a_h[i] == b_h[i]);

        free(a_h); free(b_h);
        cudaFree(a_d); cudaFree(b_d);
        return 0;
    }
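Combining this data-movement pattern with the square_array kernel from Example I gives a complete round trip (a sketch, not from the original slides; the block size of 256 and the printed element are arbitrary choices):

    #include <stdio.h>
    #include <stdlib.h>

    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] * a[idx];
    }

    int main(void)
    {
        const int N = 1600;
        int nBytes = N * sizeof(float);
        float *a_h = (float *)malloc(nBytes);   // host copy of the data
        float *a_d = 0;                         // device copy of the data
        cudaMalloc((void **)&a_d, nBytes);

        for (int i = 0; i < N; i++) a_h[i] = (float)i;           // initialise on the host

        cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);    // host -> device

        int block_size = 256;
        int n_blocks = (N + block_size - 1) / block_size;
        square_array<<<n_blocks, block_size>>>(a_d, N);          // compute on the device

        cudaMemcpy(a_h, a_d, nBytes, cudaMemcpyDeviceToHost);    // device -> host

        printf("a_h[10] = %f\n", a_h[10]);                       // expect 100.0

        free(a_h);
        cudaFree(a_d);
        return 0;
    }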
Data Handle. Recap I

In CUDA, host and device have separate memory spaces; the GPU card has its own Dynamic Random Access Memory (DRAM). In order to execute a kernel on a device, the programmer needs to allocate memory on the device and transfer the pertinent data from the host memory to the allocated device memory. After device execution, the programmer needs to transfer the result data from the device back to the host and free up the device memory that is no longer needed.


Data Handle. Recap II

The function cudaMalloc() can be called from the host code to allocate a piece of global memory for an object. The first parameter of cudaMalloc() is the address of a pointer that needs to point to the allocated object after a piece of global memory has been allocated to it. The second parameter gives the size of the object to be allocated.
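For instance (a sketch, not from the original slides; the pointer name and the size are arbitrary):

    int main(void)
    {
        float *d_v = 0;                        // will point to the allocated object in global memory
        size_t nBytes = 256 * sizeof(float);   // size of the object to be allocated
        cudaMalloc((void **)&d_v, nBytes);     // 1st parameter: address of the pointer; 2nd: size in bytes
        cudaFree(d_v);                         // release the allocation once it is no longer needed
        return 0;
    }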
Data Handle. Recap III

Once a program has allocated device memory for a data object, it can request that the data be transferred from the host to the device memory. The cudaMemcpy() function requires four parameters:

- The first parameter points to the destination location for the copy operation.
- The second parameter is a pointer to the source data object to be copied.
- The third parameter specifies the number of bytes to be copied.
- The fourth parameter indicates the types of memory involved in the copy.


Function Qualifiers

Kernels are designated by the function qualifier __global__: such a function is called from the host and executed on the device, and it must return void.

Other CUDA function qualifiers:

- __device__: the function is called from the device and runs on the device; it cannot be called from host code.
- __host__: the function is called from the host and executed on the host (the default).

The __host__ and __device__ qualifiers can be combined to generate both CPU and GPU code.


Variable Qualifiers I

- __device__: stored in global memory (large, high latency, no cache). Allocated with cudaMalloc (the __device__ qualifier is implied). Accessible by all threads. Lifetime: application.
- __constant__: read-only access by the device. Provides faster and more parallel data access paths for CUDA kernel execution than global memory. Lifetime: application.
- __shared__: stored in on-chip shared memory (very low latency). Size specified by the execution configuration or at compile time. Accessible by all threads in the same thread block. Lifetime: thread block.


Variable Qualifiers II

    Variable declaration                      Memory     Scope    Lifetime
    Automatic variables (other than arrays)   register   thread   kernel
    Automatic array variables                 global     thread   kernel
    __device__ __shared__ int SharedVar;      shared     block    kernel
    __device__ int GlobalVar;                 global     grid     application
    __device__ __constant__ int ConstVar;     constant   grid     application


Launching Kernels

When a kernel is invoked, it is executed as a grid of parallel threads. Modified C function call syntax:

    kernel<<<dim3 dG, dim3 dB>>>(...)

Execution configuration ("<<< >>>"):

- dG: dimension and size of the grid, in blocks.
  - Two-dimensional: x and y.
  - Blocks launched in the grid: dG.x * dG.y.
  - Hardware-imposed limit: 65,535 blocks per dimension.
- dB: dimension and size of each block, in threads.
  - Three-dimensional: x, y and z.
  - Threads per block: dB.x * dB.y * dB.z.
  - Examples: (512,1,1), (8,16,2) or (16,16,2). Not allowed: (32,32,1).
  - Hardware-imposed limit: 512 threads per block, which is why (32,32,1) is not allowed.


Synchronization Function I

Kernel launches are always asynchronous. cudaMemcpy is synchronous by default, although an asynchronous version of the call (cudaMemcpyAsync) exists.


Synchronization Function II

Device runtime component:

    void __syncthreads();

Synchronizes all threads in a block; once all threads have reached this point, execution resumes normally.
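As an illustration of __syncthreads() (a sketch, not from the original slides; block_sum, the fixed block size of 256 and the tree-shaped reduction are choices made for the example), a kernel in which each block computes a partial sum of its elements in shared memory:

    __global__ void block_sum(const float *in, float *block_results, int n)
    {
        __shared__ float cache[256];                 // one entry per thread; launch with 256 threads per block
        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        cache[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();                             // wait until every thread has stored its value

        // Tree-shaped reduction inside the block.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                cache[threadIdx.x] += cache[threadIdx.x + stride];
            __syncthreads();                         // partial sums of this step must be complete
        }

        if (threadIdx.x == 0)
            block_results[blockIdx.x] = cache[0];    // one partial result per block
    }

    // Possible launch: block_sum<<<(n + 255) / 256, 256>>>(d_in, d_partial, n);

Note that the synchronization is confined to one block; combining the per-block results requires another kernel launch (or a copy back to the host), which is exactly the limitation described next.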
Synchronization Function III

Threads in different blocks cannot synchronize with each other. The only safe way for threads in different blocks to synchronize with each other is to terminate the kernel and start a new kernel for the activities after the synchronization point.

Host runtime component:

    cudaThreadSynchronize(void);

Blocks until the device has completed all preceding requested tasks.


Final Recap

A typical CUDA program:

    // allocate memory space in global device memory for the input data
    cudaMalloc(...);
    // copy input data from the host to the allocated device space
    cudaMemcpy(...);
    // allocate memory space in global device memory for the output
    cudaMalloc(...);
    // define block and grid size for the kernel
    dim3 grid(x, y);
    dim3 block(x, y, z);
    // launch the kernel
    CUDA_kernel<<<grid, block>>>(...);
    // copy output data from device memory to the host
    cudaMemcpy(...);
    // free all device allocated memory (inputs and outputs)
    cudaFree(...);

A typical CUDA kernel:

    __global__ void CUDA_kernel(...)
    {
        // declare a shared memory array (optional)
        __shared__ float array_s[...];
        // figure out the index into the different arrays in terms of blockIdx, threadIdx and block size
        int index = ...;
        // bring in data from global memory (into registers or shared memory)
        ...
        // do the computation
        ...
        // copy data back to global memory (from registers or shared memory)
        ...
    }


Thanks

Questions? More questions?