Easy GPU Parallelism with OpenACC
By Rob Farber, June 11, 2012

An emerging standard uses pragmas to move parallel computations in C/C++ and Fortran to the GPU.

This is the first in a series of articles by Rob Farber on OpenACC directives, which enable existing C/C++ and Fortran code to run with high performance on massively parallel devices such as GPUs. The magic in OpenACC lies in how it extends the familiar face of OpenMP pragma programming to encompass coprocessors. As a result, OpenACC opens the door to scalable, massively parallel GPU acceleration of millions of lines of legacy application code, without requiring a new language such as CUDA or OpenCL and without forking the application source tree to support multiple languages.

OpenACC is a set of standardized, high-level pragmas that enables C/C++ and Fortran programmers to utilize massively parallel coprocessors with much of the convenience of OpenMP. A pragma is a form of code annotation that informs the compiler of something about the code. In this case, it identifies the succeeding block of code or structured loop as a good candidate for parallelization. OpenMP is a well-known and widely supported standard that defines pragmas programmers have used since 1997 to parallelize applications on shared-memory multicore processors. The OpenACC standard has generated excitement because it preserves the familiarity of OpenMP code annotation while extending the execution model to encompass devices that reside in separate memory spaces. To support coprocessors, OpenACC pragmas annotate data placement and transfer as well as loop and block parallelism.

The success of GPU computing in recent years has motivated compiler vendors to extend the OpenMP shared-memory pragma programming approach to coprocessors. Approved by the OpenACC standards committee in November 2011, the OpenACC version 1.0 standard creates a unified syntax and prevents a "tower of Babel" proliferation of incompatible pragmas. Adoption has been rapid by companies such as NVIDIA, PGI (The Portland Group), CAPS Enterprise, and Cray.

Make Your Life Simple

Pragmas and high-level APIs are designed to provide software functionality. They hide many details of the underlying implementation to free a programmer's attention for other tasks. A colleague humorously refers to pragma-based programming as a negotiation between the developer and the compiler. Note that pragmas are informational statements provided by the programmer to assist the compiler. This means that pragmas are not subject to the same level of syntax, type, and sanity checking as the rest of the source code. The compiler is free to ignore any pragma for any reason, including: it does not support the pragma, syntax errors, code complexity, unresolved (or potentially unresolved) dependencies, edge cases where the compiler cannot guarantee that vectors or matrices do not overlap, use of pointers, and many others. Profiling tools and informational messages from the compiler about parallelization, or an inability to parallelize, are essential to achieving high performance.

An OpenACC pragma for C/C++ can be identified by the string "#pragma acc", just as an OpenMP pragma can be identified by "#pragma omp". Similarly, Fortran OpenACC directives can be identified by "!$acc". Always ensure that these strings begin all OpenACC (or OpenMP) pragmas. Moreover, it is legal to mix OpenMP, OpenACC, and other pragmas in a single source file.
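As a quick illustration of these prefixes, the following minimal sketch (not one of the article's examples; the file name and loop bodies are invented for illustration) contains one loop annotated for OpenMP and another annotated for OpenACC in the same file. A compiler that recognizes only one of the two standards simply ignores the other pragma.

/* mixed-pragmas.c: an illustrative sketch showing that OpenMP and OpenACC
   pragmas can coexist in one source file. */
#include <stdio.h>

#define N 1000000
float x[N], y[N];

int main(void)
{
  int i;

  /* An OpenMP pragma, recognized by the "#pragma omp" prefix; this loop
     runs in parallel on the multicore host when compiled with OpenMP enabled. */
#pragma omp parallel for private(i)
  for (i = 0; i < N; ++i)
    x[i] = (float)i;

  /* An OpenACC pragma, recognized by the "#pragma acc" prefix; this loop
     is offloaded to the accelerator when compiled with OpenACC enabled. */
#pragma acc kernels copyin(x) copyout(y)
  for (i = 0; i < N; ++i)
    y[i] = 2.0f * x[i];

  printf("y[N-1] = %g\n", y[N-1]);
  return 0;
}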
OpenACC Syntax

OpenACC provides a fairly rich pragma language to annotate data location, data transfer, and loop or code-block parallelism. The syntax of OpenACC pragmas (sometimes referred to as OpenACC directives) is:

C/C++:   #pragma acc directive-name [clause [[,] clause]…] new-line
Fortran: !$acc directive-name [clause [[,] clause]…] new-line

OpenACC pragmas in C/C++ are somewhat more concise than their Fortran counterparts because the compiler can determine the extent of a code block from the curly bracket "{}" notation.

The OpenACC specification also requires that the _OPENACC preprocessor macro be defined when compiling OpenACC applications. This macro can be used for the conditional compilation of OpenACC code. The _OPENACC macro has the value yyyymm, where yyyy is the year and mm is the month of the version of the OpenACC directives supported by the implementation.

Table 1 shows the OpenACC version 1.0 pragmas and their clauses.

#pragma acc kernels / !$acc kernels
    Clauses: if(), async(), copy(), copyin(), copyout(), create(), present(), present_or_copy(), present_or_copyin(), present_or_copyout(), present_or_create(), deviceptr()

#pragma acc parallel / !$acc parallel
    Clauses: if(), async(), num_gangs(), num_workers(), vector_length(), reduction(), copy(), copyin(), copyout(), create(), present(), present_or_copy(), present_or_copyin(), present_or_copyout(), present_or_create(), deviceptr(), private(), firstprivate()

#pragma acc data / !$acc data
    Clauses: if(), copy(), copyin(), copyout(), create(), present(), present_or_copy(), present_or_copyin(), present_or_copyout(), present_or_create(), deviceptr()

#pragma acc loop / !$acc loop
    Clauses (within a kernels or parallel region): collapse(), gang(), worker(), vector(), seq(), independent, private(), reduction()

#pragma acc wait / !$acc wait
    No clauses.

Table 1. Currently supported OpenACC version 1.0 pragmas and clauses.

Two OpenACC environment variables, ACC_DEVICE_TYPE and ACC_DEVICE_NUM, can be set by the user:

ACC_DEVICE_TYPE: Controls the default device type to use when executing accelerator parallel and kernels regions, when the program has been compiled to use more than one type of device. The allowed values of this environment variable are implementation-defined. One example is ACC_DEVICE_TYPE=NVIDIA.

ACC_DEVICE_NUM: Specifies the default device number to use when executing accelerator regions. The value of this environment variable must be a nonnegative integer between zero and the number of devices of the desired type attached to the host. If the value is zero, the implementation-defined default is used. If the value is greater than the number of devices attached, the behavior is implementation-defined. On multi-GPU systems, this variable can be used to avoid the TDR (Timeout Detection and Recovery) watchdog reset for long-running GPU applications by running on a GPU that is not used for the display. (Consult the vendor driver information to see how to modify the TDR time for your operating system.)

In addition, OpenACC provides several runtime routines: acc_get_num_devices(), acc_set_device_type(), acc_get_device_type(), acc_set_device_num(), acc_get_device_num(), acc_async_test(), acc_async_test_all(), acc_async_wait(), acc_async_wait_all(), acc_init(), acc_shutdown(), acc_on_device(), acc_malloc(), and acc_free().

Vendor-specific information can be found on the NVIDIA, PGI, CAPS, and Cray websites.
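To make the macro and the runtime interface concrete, here is a minimal sketch (the file name is invented, and the choice to prefer a non-default NVIDIA device is an illustrative assumption, not required behavior) that combines the _OPENACC macro with the acc_get_num_devices() and acc_set_device_num() routines listed above:

/* device-query.c: a sketch, not one of the article's numbered examples. */
#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

int main(void)
{
#ifdef _OPENACC
  int ndev;

  /* _OPENACC expands to yyyymm for the supported version of the standard. */
  printf("Compiled with OpenACC version %d\n", _OPENACC);

  /* acc_device_nvidia is the device type used by the PGI implementation. */
  ndev = acc_get_num_devices(acc_device_nvidia);
  printf("%d NVIDIA device(s) available\n", ndev);

  /* On a multi-GPU system, select a device other than the implementation
     default (device numbering is implementation-defined), for example to
     keep long-running kernels off the display GPU. */
  if (ndev > 1)
    acc_set_device_num(1, acc_device_nvidia);
#else
  printf("Built without OpenACC support; running on the host only.\n");
#endif
  return 0;
}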
Building, Running, and Profiling a First Program

This tutorial uses The Portland Group (PGI) Accelerator C and Fortran compilers, release 12.5, with OpenACC support. PGI has been deeply involved in developing pragma-based programming for coprocessors since 2008, and the company is a founding member of the OpenACC standards body. The PGI OpenACC compilers currently target NVIDIA GPUs, but it is important to note that OpenACC can support other coprocessors (such as AMD GPUs and Intel MIC) as well. More information about the PGI compilers is available on the company's website.

How To Try Out OpenACC

An extended 30-day trial license for the PGI software can be obtained by registering with NVIDIA. The Portland Group also provides a free 15-day OpenACC trial license, which can be obtained in three steps:

1. Download any of the available software packages for your operating system.
2. Review the PGI Installation Guide [PDF] or the PGI Visual Fortran Installation Guide [PDF] and configure your environment.
3. Generate the trial license keys.

Note that the trial keys, and all executable files compiled using them, will cease operating at the end of the trial period.

The following set of examples multiplies two matrices a and b and stores the result in matrix c. The examples utilize a useful set of basic OpenACC data transfer, parallelization, and memory creation/access clauses. A C-language OpenMP matrix multiply is also provided to show the similarity between OpenACC and OpenMP and to provide CPU and GPU performance comparisons. While the PGI matrix multiplication performance is good, look to highly optimized BLAS (Basic Linear Algebra Subroutines) packages such as CUBLAS and phiGEMM for production GPU and hybrid CPU + GPU implementations.

Following is our first OpenACC program, matrix-acc-check.c. This simple code creates a static set of square matrices (a, b, c, seq), initializes them, and performs a matrix multiplication on the OpenACC device. The test code then performs the matrix multiplication sequentially on the host processor and double-checks the OpenACC result.

/* matrix-acc-check.c */
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];
float seq[SIZE][SIZE];

int main()
{
  int i,j,k;

  // Initialize matrices.
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      a[i][j] = (float)i + j;
      b[i][j] = (float)i - j;
      c[i][j] = 0.0f;
    }
  }

  // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  // ****************
  // double-check the OpenACC result sequentially on the host
  // ****************

  // Initialize the seq matrix
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      seq[i][j] = 0.f;

  // Perform the multiplication
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      for (k = 0; k < SIZE; ++k)
        seq[i][j] += a[i][k] * b[k][j];

  // check all the OpenACC matrices
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      if (c[i][j] != seq[i][j]) {
        printf("Error %d %d\n", i, j);
        exit(1);
      }

  printf("OpenACC matrix multiplication test was successful!\n");
  return 0;
}

Example 1: matrix-acc-check.c source code.

The OpenACC pragma tells the compiler the following:

#pragma acc: This is an OpenACC pragma.

kernels: A kernels region. No jumps are allowed into or out of the kernels region. Loops will be sent to the OpenACC device.
The scope of the kernels region code block is denoted by the curly brackets in a C program.

copyin(): Copy the contiguous region of memory from the host to the device. The variables, arrays, or subarrays in the list have values in host memory that need to be copied to device memory. If a subarray is specified, then only that subarray of the array needs to be copied.

copy(): Copy the contiguous memory region from the host to the device and back again. The variables, arrays, or subarrays in the list have values in host memory that need to be copied to device memory. If a subarray is specified, then only that subarray of the array needs to be copied. The data is copied to device memory on entry to the kernels region, and copied back to host memory when the code block is complete.

The source code is compiled with the pgcc compiler, and a successful test is indicated after the application runs, as shown below:

pgcc -acc -fast -Minfo matrix-acc-check.c -o matrix-acc-check
./matrix-acc-check
OpenACC matrix multiplication test was successful!

The source code for matrix-acc.c was created by removing the host-side verification code from matrix-acc-check.c to simplify the following discussion.

/* matrix-acc.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;

  // Initialize matrices.
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      a[i][j] = (float)i + j;
      b[i][j] = (float)i - j;
      c[i][j] = 0.0f;
    }
  }

  // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  return 0;
}

Example 2: matrix-acc.c source code.

Note the similarity between matrix-acc.c and the following OpenMP implementation, matrix-omp.c. Only the pragmas differ, as the OpenACC pragma includes copy operations that are not required in the OpenMP implementation.

/* matrix-omp.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;

  // Initialize matrices.
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      a[i][j] = (float)i + j;
      b[i][j] = (float)i - j;
      c[i][j] = 0.0f;
    }
  }

  // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(a,b,c) private(i,j,k)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  return 0;
}

Example 3: matrix-omp.c source code.

Fortran programmers will find the corresponding source code in Example 4. Again, the OpenACC pragmas annotate data movement with the copy() and copyin() clauses. Note that the C-based pragmas know the extent of the code block from the curly brackets, while the Fortran version must explicitly specify the end of the pragma's scope with "!$acc end …".

! matrix-acc.f
      program example1
      parameter ( n_size=1000 )
      real*4, dimension(:,:) :: a(n_size,n_size)
      real*4, dimension(:,:) :: b(n_size,n_size)
      real*4, dimension(:,:) :: c(n_size,n_size)

! Initialize matrices (values differ from the C version)
      do i=1, n_size
         do j=1, n_size
            a(i,j) = i + j
            b(i,j) = i - j
            c(i,j) = 0.
         enddo
      enddo

!$acc data copyin(a,b) copy(c)
!$acc kernels loop
! Compute matrix multiplication.
      do i=1, n_size
         do j=1, n_size
            do k = 1, n_size
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo
!$acc end data

      end program example1

Example 4: matrix-acc.f source code.

The following commands compile the source code for each application with the PGI C and Fortran compilers. These commands assume the source code has been saved to the file name provided in the comment at the beginning of each example.

pgcc -fast -mp -Minfo -Mconcur=allcores matrix-omp.c -o matrix-omp
pgcc -fast -acc -Minfo matrix-acc.c -o matrix-acc-gpu
pgfortran -fast -acc -Minfo matrix-acc.f -o matrix-acc-gpuf

The command-line arguments to the PGI C compiler (pgcc) and Fortran compiler (pgfortran) are:

-fast: Chooses generally optimal flags for the target platform.
-mp: Interpret OpenMP pragmas to explicitly parallelize regions of code for execution by multiple threads on a multiprocessor system.
-acc: Interpret OpenACC pragmas.
-Minfo: Emit useful information to stderr.
-Mconcur: Instructs the compiler to enable auto-concurrentization of loops.

The Portland Group also provides a profiling capability that can be enabled via the PGI_ACC_TIME environment variable. By default, profiling is not enabled. Setting PGI_ACC_TIME to a positive integer value enables profiling, while a negative value disables it. The profiling overhead is minimal because the runtime only reports information collected by the GPU hardware performance counters.

The wealth of information gathered by the runtime profiler can be seen in the output generated by matrix-acc-gpu after setting PGI_ACC_TIME=1:

rmfarber@bd:~/PGI/example1$ ./matrix-acc-gpu

Accelerator Kernel Timing data
/home/rmfarber/PGI/example1/matrix-acc.c
  main
    21: region entered 1 time
        time(us): total=139658 init=88171 region=51487
                  kernels=43848 data=7049
        w/o init: total=51487 max=51487 min=51487 avg=51487
        25: kernel launched 1 times
            grid: [63x63]  block: [16x16]
            time(us): total=43848 max=43848 min=43848 avg=43848

Example 5: Runtime profile output when PGI_ACC_TIME=1 for matrix-acc-gpu.

This output from the PGI runtime profiling tells us that the application spent 7 milliseconds transferring data and 43 milliseconds computing the matrix multiply kernel.

It is possible to create a timeline plot using the NVIDIA Visual Profiler (nvvp), which runs on Windows, Linux, and Mac computers. (The nvvp application was previously known as computeprof.) The timeline is a new feature in the CUDA 4.2 release and is extremely useful!

Figure 1: nvvp timeline for matrix-acc-gpu.

Notice that there are:

Three host-to-device data transfers at the start of the computation. These transfers correspond to the copyin() clauses for matrices a and b plus the copy() clause for matrix c.

A GPU computation that requires 39.1% of the time for kernel main_24_gpu. A helpful feature of the PGI OpenACC compiler is that it intelligently labels the kernel with the routine name and line number to make these timelines intelligible.

A single data transfer back from the device to the host, which was required by the copy() clause for matrix c at the end of the kernel.

The visual profiler provides an interactive display of the timeline. A larger screenshot would show the calls to the driver API for the CUDA context setup and the data transfers, along with a host of other information. In addition, the nvvp profiler will analyze the application and provide automated suggestions.
This automated analysis requires running the application many times, so it is recommended to look at the timeline first, as the timeline only requires running the application once. For example, the following screenshot shows the initial analysis of the timeline shown in Figure 1:

Figure 2: Automated analysis performed by the NVIDIA Visual Profiler.

Matrix Multiply Is an Ideal Case

Most computationally oriented scientists and programmers are familiar with the BLAS (Basic Linear Algebra Subprograms) library. BLAS is the de facto programming interface for basic linear algebra. BLAS is structured as three levels with increasing data and runtime requirements.

1. Level 1: Vector-vector operations that require O(N) data and O(N) work. Examples include taking the inner product of two vectors, or scaling a vector by a constant multiplier.

2. Level 2: Matrix-vector operations that require O(N^2) data and O(N^2) work. Examples include matrix-vector multiplication or a single right-hand-side triangular solve.

3. Level 3: Matrix-matrix operations that require O(N^2) data and O(N^3) work. Examples include dense matrix-matrix multiplication.

The following table describes the amount of work performed by each BLAS level relative to the amount of data transferred from the host to the device. The table does not take into account the time required to transfer the data back to the host.

BLAS level    Data       Work       Work per Datum
1             O(N)       O(N)       O(1)
2             O(N^2)     O(N^2)     O(1)
3             O(N^2)     O(N^3)     O(N)

Table 2: Work per datum for the three BLAS levels.

Matrix multiply is an ideal example for OpenACC acceleration because the data transfers become less important as the size of the matrices increases. Matrix multiply is a level-3 BLAS operation that performs O(N) work for every floating-point value transferred to the device; for the 1000 x 1000 case used here, roughly 3 million floats cross the PCIe bus while about 2 billion floating-point operations are performed, several hundred operations per value transferred. The effect of this high computational density can be seen in the following plot of wall-clock time on a dedicated system as the problem size increases. Multiplying 1k x 1k square matrices results in a 1.7x speedup of matrix-acc.c over matrix-omp.c when running on an NVIDIA C2050 GPU compared with a 2.65 GHz quad-core Intel Xeon E5630 processor. Increasing the matrix sizes to 11k x 11k shows a 6.4x speedup over OpenMP. This empirically demonstrates the high work-per-datum runtime behavior of matrix multiply. Similar speedups will occur for other high work-per-datum computations as well.

Figure 3: Runtime behavior by matrix size of OpenACC and OpenMP implementations (lower is better).

The Three Rules of Coprocessor Programming

Matrix multiply is an excellent teaching tool, but most real-world calculations do not exhibit such ideal behavior. Instead, the programmer must be creative and pay close attention to data transfers and computational density on the OpenACC device(s). High performance can be achieved when the compute-intensive portions of the application conform to the following three rules of high-performance coprocessor programming. If not, expect application performance to be either PCIe or device memory bandwidth limited.

1. Transfer the data across the PCIe bus onto the device and keep it there (a sketch illustrating this rule follows the list).
2. Give the device enough work to do.
3. Focus on data reuse within the coprocessor(s) to avoid memory bandwidth bottlenecks.
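The data construct from Table 1 is the most direct way to follow the first rule: wrap several compute regions in one data region so the matrices cross the PCIe bus only once. The following is a minimal sketch rather than one of the article's numbered examples; it assumes the global matrices and loop indices declared in matrix-acc.c, and the second loop (scaling c) is a hypothetical extra use of the data, added only to show on-device reuse.

// A sketch of rule 1: one data region, two compute regions, and a single
// round trip over the PCIe bus for the matrices.
#pragma acc data copyin(a,b) copy(c)
{
  // First kernels region: the matrix multiply from the earlier examples.
  // The arrays are already present on the device, so no extra copies occur.
  #pragma acc kernels
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      for (k = 0; k < SIZE; ++k)
        c[i][j] += a[i][k] * b[k][j];

  // Second kernels region: reuses c while it is still resident on the
  // device, so no host-device transfer happens between the two regions.
  #pragma acc kernels
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      c[i][j] *= 0.5f;
} // c is copied back to the host here, at the end of the data region.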
Using the Create Clause to Allocate on the OpenACC Device

It is easy to create the matrices on the OpenACC device and initialize them there. Creating and initializing data on the OpenACC device conforms to the first rule and avoids data transfers. The following example, matrix-acc-create.c, demonstrates the use of the create() clause in a kernels region.

/* matrix-acc-create.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;

#pragma acc kernels create(a,b) copyout(c)
  { // start of kernels
    // Initialize matrices.
    for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
        a[i][j] = (float)i + j;
        b[i][j] = (float)i - j;
        c[i][j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
    for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
        for (k = 0; k < SIZE; ++k) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  } // end of kernels

  return 0;
}

Example 6: matrix-acc-create.c source code.

Following is the nvvp timeline showing that two kernels are now running on the GPU.

Figure 4: High-resolution nvvp timeline showing two kernels.

The Visual Profiler shows that only one data transfer occurs at the end of the data region, as required by the copyout() clause.

Figure 5: Lower-resolution nvvp timeline showing two kernels and the copy at the end of the data region.

Removing the host-to-device data transfers increases the OpenACC speedup over the OpenMP version:

Speedup with copyin() and copy() clauses over OpenMP: 6.4x
Speedup with create() and copyout() clauses over OpenMP: 6.9x

Multidimensional Dynamic Arrays

The previous examples utilized static 2D globally accessible C arrays. Most applications utilize dynamic allocation of all data structures, including multidimensional arrays that are frequently passed to functions and subroutines. A particular challenge for C/C++ programmers is that OpenACC transfers occur between contiguous regions of host and device memory. The use of non-contiguous multidimensional arrays (such as float ** arrays) is not recommended because they require individual transfers of each contiguous memory region.

The following example, matrix-acc-func.c, dynamically allocates the 2D matrices for the test and passes them to doTest(), which performs the matrix initializations and multiplication. This test utilizes the C-language (as of C99) restrict keyword, which indicates that the matrices do not overlap. For convenience, the 2D nature of the arrays was declared in the function signature to make array accesses straightforward:

int doTest(restrict float a[][SIZE], restrict float b[][SIZE],
           restrict float c[][SIZE], int size)
{
  …
  c[i][j] = 0.0f;
  …
}

Example 7: Code snippet for straightforward 2D array indexing.

Of course, the programmer can instead pass the pointer to the contiguous region of memory and manually calculate the offsets into the multidimensional array, as demonstrated in the next example.

int doTest(restrict float *a, restrict float *b, restrict float *c, int size)
{
  …
  c[i*size+j] = 0.0f;
  …
}

Example 8: Code snippet demonstrating manual calculation of the 2D array offset.

The matrix-acc-func.c example also demonstrates the use of the OpenACC pragma "#pragma acc loop independent". The independent clause tells the compiler to ignore its own dependency analysis and trust that the programmer knows the loops have no dependencies. Incorrect and non-deterministic program behavior can result if the programmer is mistaken. Conversely, the OpenACC pragma "#pragma acc loop seq" tells the compiler to generate code that will execute sequentially on the device.
/* matrix-acc-func.c */
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

int doTest(restrict float a[][SIZE], restrict float b[][SIZE],
           restrict float c[][SIZE], int size)
{
  int i,j,k;

#pragma acc kernels create(a[0:size][0:size], b[0:size][0:size]) \
  copyout(c[0:size][0:size])
  {
    // Initialize matrices.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
        a[i][j] = (float)i + j;
        b[i][j] = (float)i - j;
        c[i][j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
#pragma acc loop seq
        for (k = 0; k < size; ++k) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  }
}

int main()
{
  int i,j,k;
  int size=SIZE;

  float *a= (float*)malloc(sizeof(float)*size*size);
  float *b= (float*)malloc(sizeof(float)*size*size);
  float *c= (float*)malloc(sizeof(float)*size*size);

  doTest(a,b,c, size);

  free(a);
  free(b);

  // ****************
  // double-check the OpenACC result sequentially on the host
  // ****************
  float *seq= (float*)malloc(sizeof(float)*size*size);

  // Initialize the seq matrix
  for(i = 0; i < size; ++i)
    for(j = 0; j < size; ++j)
      seq[i*SIZE+j] = 0.f;

  // Perform the multiplication
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      for (k = 0; k < size; ++k)
        seq[i*size+j] += (i+k) * (k-j);

  // check all the OpenACC matrices
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      if(c[i*size+j] != seq[i*size+j]) {
        printf("Error (%d %d) (%g, %g)\n", i,j, c[i*size+j], seq[i*size+j]);
        exit(1);
      }

  free(c);
  free(seq);

  printf("OpenACC matrix multiplication test was successful!\n");
  return 0;
}

Example 9: matrix-acc-func.c source code.

Using Data Allocated on the Device

OpenACC also provides the ability to use previously allocated device memory via the deviceptr() clause. The following example, matrix-acc-alloc.c, demonstrates how to allocate memory in main() with the OpenACC runtime routine acc_malloc(). The pointer is then passed to doTest(), where it is accessed via deviceptr(). The copyout() clause also includes the size of the contiguous region of memory. For timing purposes, this code utilizes a size specified by the user on the command line.

/* matrix-acc-alloc.c */
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

int doTest(restrict float *a, restrict float *b, restrict float *c, int size)
{
  int i,j,k;

#pragma acc kernels deviceptr(a, b) copyout(c[0:size*size-1])
  {
    // Initialize matrices.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
        a[i*size+j] = (float)i + j;
        b[i*size+j] = (float)i - j;
        c[i*size+j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
#pragma acc loop seq
        for (k = 0; k < size; ++k) {
          c[i*size+j] += a[i*size+k] * b[k*size+j];
        }
      }
    }
  }
}

int main(int argc, char *argv[])
{
  int i,j,k;

  if(argc < 2) {
    fprintf(stderr,"Use: size (for size x size) matrices\n");
    return -1;
  }
  int size=atoi(argv[1]);

  float *a = (float *)acc_malloc(sizeof(float)*size*size);
  float *b = (float *)acc_malloc(sizeof(float)*size*size);
  float *c= (float*)malloc(sizeof(float)*size*size);

  printf("size = %d\n",size);

  doTest(a,b,c, size);

  acc_free(a);
  acc_free(b);
  free(c);

  printf("OpenACC matrix multiplication test was successful!\n");
  return 0;
}

Example 10: Source code for matrix-acc-alloc.c.

Conclusion

OpenACC has been designed to provide OpenMP-style programmers with an easy transition to GPU programming. Following the common-sense adage, "Make your life easy and use the highest-level API first," OpenACC provides a natural starting point for transitioning any C or Fortran code to massive parallelism. For legacy code, OpenACC can be the only viable route to massively parallel coprocessors because it eliminates the need for a total rewrite of the software and because Fortran is supported. As a result, OpenACC opens the door to scalable, massively parallel GPU (or, more generically, coprocessor) acceleration of millions of lines of legacy application code.

Currently, OpenACC is supported by compilers that must be purchased from either PGI or CAPS Enterprise. The PGI compiler used in this article is free for evaluation, but continued use after the trial period expires requires a purchased license. As with OpenMP, it is expected that open-source compilers will eventually provide free OpenACC support.

Profiling and informational compiler messages play a key role in achieving high performance in pragma-based programming. Instead of having to blindly add pragmas and then guess at the impact each might have on an application, free tools like the NVIDIA Visual Profiler let the developer actually see what is happening at runtime on Windows, Linux, and Mac computers. Being able to see what effect OpenACC pragmas have on runtime behavior greatly speeds the OpenACC learning process as well as application acceleration.

My next article in this series will discuss the OpenACC memory and execution model, including the gang and worker clauses, plus more sophisticated ways to handle data.

Rob Farber is an analyst who writes frequently on High-Performance Computing hardware topics.