Easy GPU Parallelism with OpenACC

By Rob Farber, June 11, 2012
An emerging standard uses pragmas to move parallel computations in C/C++ and Fortran to the GPU
This is the first in a series of articles by Rob Farber on OpenACC directives, which enable existing C/C++ and Fortran code to run with high performance on massively parallel devices such as GPUs. The magic in OpenACC lies in how it extends the familiar face of OpenMP pragma programming to encompass coprocessors. As a result, OpenACC opens the door to scalable, massively parallel GPU acceleration of millions of lines of legacy application code without requiring a new language such as CUDA or OpenCL, or a fork of the application source tree to support multiple languages.
OpenACC is a set of standardized, high-level pragmas that enables C/C++ and Fortran programmers to utilize
massively parallel coprocessors with much of the convenience of OpenMP. A pragma is a form of code
annotation that informs the compiler of something about the code. In this case, it identifies the succeeding
block of code or structured loop as a good candidate for parallelization. OpenMP is a well-known and widely
supported standard that defines pragmas programmers have used since 1997 to parallelize applications on
shared memory multicore processors. The OpenACC standard has generated excitement because it preserves
the familiarity of OpenMP code annotation while extending the execution model to encompass devices that
reside in separate memory spaces. To support coprocessors, OpenACC pragmas annotate data placement and
transfer as well as loop and block parallelism.
The success of GPU computing in recent years has motivated compiler vendors to extend the OpenMP shared memory pragma programming approach to coprocessors. Approved by the OpenACC standards committee in November 2011, the OpenACC version 1.0 standard creates a unified syntax and prevents a "Tower of Babel" proliferation of incompatible pragmas. Adoption by companies such as NVIDIA, PGI (The Portland Group), CAPS Enterprise, and Cray has been rapid.
Make Your Life Simple
Pragmas and high-level APIs are designed to provide software functionality while hiding many details of the underlying implementation, freeing a programmer's attention for other tasks. A colleague humorously refers to pragma-based programming as a negotiation that occurs between the developer and the compiler. Note that pragmas are informational statements provided by the programmer to assist the compiler. This means that pragmas are not subject to the same level of syntax, type, and sanity checking as the rest of the source code. The compiler is free to ignore any pragma for any reason, including: it does not support the pragma, syntax errors, code complexity, unresolved (or potentially unresolved) dependencies, edge cases where the compiler cannot guarantee that vectors or matrices do not overlap, use of pointers, and many others. Profiling tools and informational messages from the compiler about parallelization, or an inability to parallelize, are therefore essential to achieving high performance.
An OpenACC pragma for C/C++ can be identified by the string "#pragma acc", just as an OpenMP pragma can be identified by "#pragma omp". Similarly, Fortran OpenACC directives can be identified by "!$acc". Always ensure that these strings begin all OpenACC (or OpenMP) pragmas. Moreover, it is legal to mix OpenMP, OpenACC, and other pragmas in a single source file, as the sketch below illustrates.
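Here is a minimal illustrative sketch (not from the article; the array names, sizes, and function names are arbitrary) of one source file that parallelizes one loop on the host with OpenMP and another on the device with OpenACC:

/* Illustrative sketch only: OpenMP and OpenACC pragmas coexisting in one file. */
#define N 1000000
float x[N], y[N];

void scale_on_host(float s)
{
    // Host multicore parallelism via OpenMP.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        x[i] *= s;
}

void scale_on_device(float s)
{
    // Device parallelism via OpenACC; y is copied to and from the device.
    #pragma acc kernels copy(y)
    for (int i = 0; i < N; ++i)
        y[i] *= s;
}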
OpenACC Syntax
OpenACC provides a fairly rich pragma language to annotate data location, data transfer, and loop or code block parallelism. The syntax of OpenACC pragmas (sometimes referred to as OpenACC directives) is:

C/C++:
#pragma acc directive-name [clause [[,] clause]…] new-line

Fortran:
!$acc directive-name [clause [[,] clause]…] new-line
OpenACC pragmas in C/C++ are somewhat more concise than their Fortran counterparts as the compiler can
determine a code block from the curly bracket "{}" notation. The OpenACC specification also requires that the
_OPENACC preprocessor macro be defined when compiling OpenACC applications. This macro can be used for
the conditional compilation of OpenACC code. The _OPENACC macro name will have a value yyyymm where
yyyy is the year and mm is the month designation of the version of the OpenACC directives supported by the
implementation.
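As a minimal sketch (not taken from the article), the _OPENACC macro might be used like this to guard an OpenACC-specific code path:

/* Sketch: conditional compilation based on the _OPENACC macro. */
#include <stdio.h>

int main()
{
#ifdef _OPENACC
    // Defined by an OpenACC-enabled compile (e.g., pgcc -acc);
    // the value is the yyyymm version of the supported specification.
    printf("Compiled with OpenACC support, version %d\n", _OPENACC);
#else
    printf("Compiled without OpenACC support.\n");
#endif
    return 0;
}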
Table 1 shows a list of OpenACC version 1.0 pragmas and clauses.
#pragma acc kernels (Fortran: !$acc kernels). Clauses: if(), async(), copy(), copyin(), copyout(), create(), present(), present_or_copy(), present_or_copyin(), present_or_copyout(), present_or_create(), deviceptr()
#pragma acc parallel (Fortran: !$acc parallel). Clauses: if(), async(), num_gangs(), num_workers(), vector_length(), reduction(), copy(), copyin(), copyout(), create(), present(), present_or_copy(), present_or_copyin(), present_or_copyout(), present_or_create(), deviceptr(), private(), firstprivate()
#pragma acc data (Fortran: !$acc data). Clauses: if(), copy(), copyin(), copyout(), create(), present(), present_or_copy(), present_or_copyin(), present_or_copyout(), present_or_create(), deviceptr()
#pragma acc loop (Fortran: !$acc loop), within a kernels region. Clauses: collapse(), gang(), worker(), vector(), seq(), private(), reduction()
#pragma acc wait (Fortran: !$acc wait). No clauses.
Table 1. Currently supported OpenACC pragmas.
Two OpenACC environment variables, ACC_DEVICE_TYPE and ACC_DEVICE_NUM, can be set by the user:

ACC_DEVICE_TYPE: Controls the default device type to use when executing accelerator parallel and kernels regions, when the program has been compiled to use more than one type of device. The allowed values of this environment variable are implementation-defined. Examples include ACC_DEVICE_TYPE=NVIDIA.

ACC_DEVICE_NUM: Specifies the default device number to use when executing accelerator regions. The value of this environment variable must be a nonnegative integer between zero and the number of devices of the desired type attached to the host. If the value is zero, the implementation-defined default is used. If the value is greater than the number of devices attached, the behavior is implementation-defined. On multi-GPU systems, this variable can be used to avoid the TDR (Timeout Detection and Recovery) watchdog reset for long-running GPU applications by running on a GPU that is not used for the display. (Consult the vendor driver information to see how to modify the TDR time for your operating system.)
In addition, OpenACC provides several runtime routines: acc_get_num_devices(), acc_set_device_type(), acc_get_device_type(), acc_set_device_num(), acc_get_device_num(), acc_async_test(), acc_async_test_all(), acc_async_wait(), acc_async_wait_all(), acc_init(), acc_shutdown(), acc_on_device(), acc_malloc(), and acc_free().
Vendor-specific information can be found on the NVIDIA, PGI, CAPS, and Cray websites.
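The following is a minimal sketch (not from the article) of how a few of these routines might be used; acc_device_nvidia is the NVIDIA device type defined in openacc.h by the PGI/NVIDIA implementations, and device numbering follows the ACC_DEVICE_NUM convention described above:

/* Sketch: querying and selecting a device with the OpenACC runtime API. */
#include <stdio.h>
#include <openacc.h>

int main()
{
    // How many devices of this type does the runtime see?
    int ndev = acc_get_num_devices(acc_device_nvidia);
    printf("NVIDIA devices found: %d\n", ndev);

    if (ndev > 1) {
        // Select device 2 (for example, a GPU not driving the display)
        // instead of setting ACC_DEVICE_NUM in the environment.
        acc_set_device_num(2, acc_device_nvidia);
    }

    // Pay the device initialization cost up front so it is not
    // charged to the first kernels region.
    acc_init(acc_device_nvidia);

    /* ... accelerator regions would go here ... */

    acc_shutdown(acc_device_nvidia);
    return 0;
}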
Building, Running and Profiling a First Program
This tutorial uses The Portland Group (PGI) Accelerator C and Fortran compilers, release 12.5, with OpenACC support. PGI has been deeply involved in developing pragma-based programming for coprocessors since 2008, and it is a founding member of the OpenACC standards body. The PGI OpenACC compilers currently target NVIDIA GPUs, but it is important to note that OpenACC can support other coprocessors (such as AMD GPUs and Intel MIC) as well. More information about the PGI compilers is available on the company's website.
How To Try Out OpenACC
An extended 30-day trial license for the PGI software can be obtained by registering with NVIDIA. The Portland Group also provides a free 15-day OpenACC trial license, which can be obtained in three steps:
1. Download any of the available software packages for your operating system.
2. Review the PGI Installation Guide [PDF] or the PGI Visual Fortran Installation Guide [PDF] and configure your
environment.
3. Generate the trial license keys. Note that the trial keys, and all executables compiled using them, will cease operating at the end of the trial period.
The following set of examples multiplies two matrices a and b and stores the result in matrix c. They utilize a useful set of basic OpenACC data transfer, parallelization, and memory creation/access clauses. A C-language OpenMP matrix multiply is also provided to show the similarity between OpenACC and OpenMP and to provide CPU and GPU performance comparisons. While the PGI matrix multiplication performance is good, please look to highly optimized BLAS (Basic Linear Algebra Subroutines) packages such as CUBLAS and phiGEMM for production GPU and hybrid CPU + GPU implementations.
Following is our first OpenACC program, matrix-acc-check.c. This simple code creates a static set of square matrices (a, b, c, seq), initializes them, and then performs a matrix multiplication on the OpenACC device. The test code then performs the matrix multiplication sequentially on the host processor and double-checks the OpenACC result.
/* matrix-acc-check.c */
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];
float seq[SIZE][SIZE];

int main()
{
  int i,j,k;

  // Initialize matrices.
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      a[i][j] = (float)i + j;
      b[i][j] = (float)i - j;
      c[i][j] = 0.0f;
    }
  }

  // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  // ****************
  // double-check the OpenACC result sequentially on the host
  // ****************

  // Initialize the seq matrix
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      seq[i][j] = 0.f;

  // Perform the multiplication
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      for (k = 0; k < SIZE; ++k)
        seq[i][j] += a[i][k] * b[k][j];

  // check all the OpenACC matrices
  for (i = 0; i < SIZE; ++i)
    for (j = 0; j < SIZE; ++j)
      if (c[i][j] != seq[i][j]) {
        printf("Error %d %d\n", i, j);
        exit(1);
      }

  printf("OpenACC matrix multiplication test was successful!\n");
  return 0;
}
Example 1: matrix-acc-check.c source code.
The OpenACC pragma tells the compiler the following:
#pragma acc: This is an OpenACC pragma.
kernels: A kernels region. No jumps are allowed into or out of the kernels region. Loops inside it will be sent to the OpenACC device. The scope of the kernels region code block is denoted by the curly brackets in a C program.
copyin(): Copy the contiguous region of memory from the host to the device. The variables, arrays, or subarrays in the list have values in host memory that need to be copied to device memory. If a subarray is specified, then only that subarray of the array needs to be copied (a short subarray sketch follows this list).
copy(): Copy the contiguous memory region from the host to the device and back again. The variables, arrays, or subarrays in the list have values in host memory that need to be copied to device memory. If a subarray is specified, then only that subarray of the array needs to be copied. The data is copied to device memory before entry to the kernels region, and copied back to host memory when the code block is complete.
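As a minimal sketch (not from the article; the array name, size, and function name are arbitrary), a subarray can be specified in C with the OpenACC notation array[start:length]:

/* Sketch: copying only a subarray to and from the device. */
#define N 100000
float v[N];

void scale_first_half(float s)
{
    // Only the first N/2 elements are moved to the device and back.
    #pragma acc kernels copy(v[0:N/2])
    for (int i = 0; i < N/2; ++i)
        v[i] *= s;
}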
The source code is compiled with the pgcc compiler and a successful test is indicated after the application runs
as shown below:
pgcc -acc -fast -Minfo matrix-acc-check.c -o matrix-acc-check
./matrix-acc-check
OpenACC matrix multiplication test was successful!
The source code for matrix-acc.c was created by removing the host-side verification code from matrix-acc-check.c to simplify the following discussion.
/* matrix-acc.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;

  // Initialize matrices.
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      a[i][j] = (float)i + j;
      b[i][j] = (float)i - j;
      c[i][j] = 0.0f;
    }
  }

  // Compute matrix multiplication.
#pragma acc kernels copyin(a,b) copy(c)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  return 0;
}
Example 2: matrix-acc.c source code.
Note the similarity between matrix-acc.c and the following OpenMP implementation, matrix-omp.c. Only the
pragmas are different as the OpenACC pragma includes copy operations that are not required in the OpenMP
implementation.
/* matrix-omp.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;

  // Initialize matrices.
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      a[i][j] = (float)i + j;
      b[i][j] = (float)i - j;
      c[i][j] = 0.0f;
    }
  }

  // Compute matrix multiplication.
#pragma omp parallel for default(none) shared(a,b,c) private(i,j,k)
  for (i = 0; i < SIZE; ++i) {
    for (j = 0; j < SIZE; ++j) {
      for (k = 0; k < SIZE; ++k) {
        c[i][j] += a[i][k] * b[k][j];
      }
    }
  }

  return 0;
}
Example 3: matrix-omp.c source code.
Fortran programmers will find the corresponding source code in Example 4. Again, the OpenACC pragmas
annotate data movement with the copy() and copyin() clauses. Note that the C-based pragmas know
the extent of the code block due to the use of curly brackets while the Fortran version must explicitly specify
the end of the scope of the pragma with "!$acc end …".
! matrix-acc.f
      program example1
      parameter ( n_size=1000 )
      real*4, dimension(:,:) :: a(n_size,n_size)
      real*4, dimension(:,:) :: b(n_size,n_size)
      real*4, dimension(:,:) :: c(n_size,n_size)

! Initialize matrices (values differ from the C version)
      do i=1, n_size
         do j=1, n_size
            a(i,j) = i + j
            b(i,j) = i - j
            c(i,j) = 0.
         enddo
      enddo

!$acc data copyin(a,b) copy(c)
!$acc kernels loop
! Compute matrix multiplication.
      do i=1, n_size
         do j=1, n_size
            do k = 1, n_size
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            enddo
         enddo
      enddo
!$acc end data

      end program example1
Example 4: matrix-acc.f source code.
The following commands compile the source code for each application with the PGI C and Fortran compilers.
These commands assume the source code has been saved to the file name provided in the comment at the
beginning of each example.
pgcc -fast -mp -Minfo -Mconcur=allcores matrix-omp.c -o matrix-omp
pgcc -fast -acc -Minfo matrix-acc.c -o matrix-acc-gpu
pgfortran -fast -acc -Minfo matrix-acc.f -o matrix-acc-gpuf
The command line arguments to the PGI C compiler (pgcc) and Fortran compiler (pgfortran) are:
-fast: Chooses generally optimal flags for the target platform.
-mp: Interpret OpenMP pragmas to explicitly parallelize regions of code for execution by multiple threads on a multiprocessor system.
-acc: Interpret OpenACC pragmas.
-Minfo: Emit useful information to stderr.
-Mconcur: Instructs the compiler to enable auto-concurrentization of loops.
The Portland Group also provides a profiling capability that can be enabled via the PGI_ACC_TIME
environment variable. By default, profiling is not enabled. Setting PGI_ACC_TIME to a positive integer value
enables profiling while a negative value will disable it. The profiling overhead is minimal because the runtime
only reports information collected by the GPU hardware performance counters. The wealth of information
gathered by the runtime profiler can be seen in the output generated by matrix-acc-gpu after setting
PGI_ACC_TIME=1:
rmfarber@bd:~/PGI/example1$ ./matrix-acc-gpu
Accelerator Kernel Timing data
/home/rmfarber/PGI/example1/matrix-acc.c
  main
    21: region entered 1 time
        time(us): total=139658 init=88171 region=51487
                  kernels=43848 data=7049
        w/o init: total=51487 max=51487 min=51487 avg=51487
        25: kernel launched 1 times
            grid: [63x63]  block: [16x16]
            time(us): total=43848 max=43848 min=43848 avg=43848
Example 5: Runtime profile output when PGI_ACC_TIME=1 for matrix-acc-gpu.
This output from the PGI runtime profiling tells us that the application spent about 7 milliseconds transferring data (data=7049 us) and about 44 milliseconds computing the matrix multiply kernel (kernels=43848 us).
It is possible to create a timeline plot using the NVIDIA Visual Profiler (nvvp), which runs on Windows, Linux
and Mac computers. (The nvvp application was previously known as computeprof.) The timeline is a new feature
in the CUDA 4.2 release and is extremely useful!
Figure 1: nvvp timeline for matrix-acc-gpu.
Notice that there are:
1. Three host-to-device data transfers at the start of the computation. These transfers correspond to the copyin() clauses for matrices a and b plus the copy() clause for matrix c.
2. A GPU computation that requires 39.1% of the time for the kernel main_24_gpu. A helpful feature of the PGI OpenACC compiler is that it intelligently labels the kernel with the routine name and line number to make these timelines intelligible.
3. A single data transfer back from the device to the host, which was required by the copy() clause for matrix c at the end of the kernels region.
The visual profiler provides an interactive display of the timeline. A larger screenshot would show the calls to
the driver API for the CUDA context setup and the data transfers along with a host of other information. In
addition, the nvvp profiler will analyze the application and provide automated suggestions. This requires running
the application many times. It is recommended to look at the timeline first as this only requires running the
application once.
For example, the following screenshot shows the initial analysis of the timeline shown in Figure 1:
Figure 2: Automated analysis performed by the NVIDIA Visual Profiler.
Matrix Multiply Is an Ideal Case
Most computationally oriented scientists and programmers are familiar with the BLAS (Basic Linear Algebra Subprograms) library. BLAS is the de facto programming interface for basic linear algebra.
BLAS is structured according to three levels with increasing data and runtime requirements:
1. Level-1: Vector-vector operations that require O(N) data and O(N) work. Examples include taking the inner product of two vectors, or scaling a vector by a constant multiplier.
2. Level-2: Matrix-vector operations that require O(N²) data and O(N²) work. Examples include matrix-vector multiplication or a single right-hand-side triangular solve.
3. Level-3: Matrix-matrix operations that require O(N²) data and O(N³) work. Examples include dense matrix-matrix multiplication.
The following table summarizes the amount of data and work for each BLAS level, along with the work performed per datum transferred from the host to the device. The table does not take into account the time required to transfer the data back to the host.
BLAS level    Data      Work      Work per Datum
1             O(N)      O(N)      O(1)
2             O(N²)     O(N²)     O(1)
3             O(N²)     O(N³)     O(N)
Table 2: Work per datum for the three BLAS levels.
Matrix multiply is an ideal example for OpenACC acceleration because the data transfers become less important as the size of the matrices increases. Matrix multiply is a level-3 BLAS operation that performs O(N) work for every floating-point value transferred to the device. The effect of this high computational density can be seen in the following plot of wall clock time on a dedicated system as the problem size increases. Multiplying 1k by 1k square matrices results in a 1.7x speedup of matrix-acc.c over matrix-omp.c when running on an NVIDIA C2050 GPU compared with a 2.65 GHz quad-core Intel Xeon E5630 processor. Increasing the matrix sizes to 11k by 11k shows a 6.4x speedup over OpenMP. This empirically demonstrates the high work-per-datum runtime behavior of matrix multiply. Similar speedups will occur for other high work-per-datum computations as well.
Figure 3: Runtime behavior by matrix size of OpenACC and OpenMP implementations (lower is
better).
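As a rough back-of-the-envelope check (my numbers, not from the article): multiplying two N x N single-precision matrices with the copyin(a,b) copy(c) clauses moves about 4N² values across the PCIe bus (a, b, and c in, c back out) while performing roughly 2N³ floating-point operations, or about N/2 operations per value transferred. That works out to roughly 500 operations per value for the 1k case and more than 5,000 for the 11k case, which is why the transfer overhead matters less and less as N grows.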
The Three Rules of Coprocessor Programming
Matrix multiply is an excellent teaching tool but most real-world calculations do not exhibit such ideal behavior.
Instead, the programmer must be creative and pay close attention to data transfers and computational density
on the OpenACC device(s). High performance can be achieved when the compute-intensive portions of the application conform to the following three rules of high-performance coprocessor programming. If not, expect application performance to be either PCIe or device memory bandwidth limited.
1. Transfer the data across the PCIe bus onto the device and keep it there.
2. Give the device enough work to do.
3. Focus on data reuse within the coprocessor(s) to avoid memory bandwidth
bottlenecks.
Using the Create Clause to Allocate on the OpenACC Device
It is easy to create the matrices on the OpenACC device and initialize them. Creating and initializing data on the
OpenACC device conforms to the first rule and avoids data transfers. The following example, matrix-acccreate.c demonstrates the use of the create() clause in a kernels region.
/* matrix-acc-create.c */
#define SIZE 1000
float a[SIZE][SIZE];
float b[SIZE][SIZE];
float c[SIZE][SIZE];

int main()
{
  int i,j,k;

#pragma acc kernels create(a,b) copyout(c)
  { // start of kernels
    // Initialize matrices.
    for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
        a[i][j] = (float)i + j;
        b[i][j] = (float)i - j;
        c[i][j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
    for (i = 0; i < SIZE; ++i) {
      for (j = 0; j < SIZE; ++j) {
        for (k = 0; k < SIZE; ++k) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  } // end of kernels

  return 0;
}
Example 6: matrix-acc-create.c
Following is the nvvp timeline showing that two kernels are now running on the GPU.
Figure 4: High resolution nvvp timeline showing two kernels.
The Visual Profiler shows that only one data transfer occurs at the end of the data region as required by the
copyout() clause.
Figure 5: lower resolution nvvp timeline showing two kernels and copy at the end of the data region.
Removal of the data transfers speeds up the OpenACC performance relative to the OpenMP version:

Speedup with copyin() and copy() clauses over OpenMP: 6.4x
Speedup with create() and copyout() clauses over OpenMP: 6.9x
Multidimensional Dynamic Arrays
The previous examples utilized static 2D globally accessible C arrays. Most applications utilize dynamic
allocation of all data structures including multidimensional arrays that are frequently passed to functions
and subroutines. A particular challenge for C/C++ programmers is that OpenACC transfers occur between
contiguous regions of host and device memory. The use of non-contiguous multidimensional arrays (such as
float ** arrays) is not recommended because they require individual transfers of each contiguous
memory region.
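To make the contrast concrete, here is a minimal sketch (not from the article) of a single contiguous allocation that still supports a[i][j] indexing, next to the discouraged float ** layout of separate row allocations:

/* Sketch: contiguous 2D allocation versus an array of row pointers. */
#include <stdlib.h>
#define SIZE 1000

int main()
{
    // One contiguous block of SIZE*SIZE floats: a single copy clause can move it.
    float (*a)[SIZE] = malloc(sizeof(float[SIZE][SIZE]));

    // SIZE separate row allocations: each row is its own contiguous region,
    // so each row would need its own transfer.
    float **b = malloc(SIZE * sizeof(float *));
    for (int i = 0; i < SIZE; ++i)
        b[i] = malloc(SIZE * sizeof(float));

    a[3][4] = 1.0f;   // identical indexing syntax on the host...
    b[3][4] = 1.0f;   // ...but a very different memory layout

    for (int i = 0; i < SIZE; ++i)
        free(b[i]);
    free(b);
    free(a);
    return 0;
}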
The following example, matrix-acc-func.c, dynamically allocates the 2D matrices for the test and passes them to doTest(), which performs the matrix initializations and multiplication. This test utilizes the C-language restrict keyword (introduced in C99), which indicates that the matrices do not overlap. For convenience, the 2D nature of the arrays is declared in the function signature to make array accesses straightforward:
int doTest(restrict float a[][SIZE], restrict float b[][SIZE],
           restrict float c[][SIZE], int size)
{
   …
   c[i][j] = 0.0f;
}
Example 7: Code snippet for straightforward 2D array indexing.
Of course, the programmer can pass the pointer to the contiguous region of memory and manually
calculate the offsets into the multidimensional array as will be demonstrated in the next example.
int doTest(restrict float *a, restrict float *b, restrict float *c, int size)
{
   …
   c[i*size+j] = 0.0f;
}
Example 8: Code snippet demonstrating manual calculation of the 2D array offset.
The matrix-acc-func.c example also demonstrates the use of the OpenACC pragma "#pragma acc
loop independent". The independent clause tells the compiler to ignore its own dependency analysis
and trust that the programmer knows the loops have no dependencies. Incorrect and non-deterministic
program behavior can happen if the programmer is mistaken. Conversely, the OpenACC pragma
"#pragma acc loop seq" tells the compiler to generate code that will execute sequentially on the
device.
/* matrix-acc-func.c */
#include <stdio.h>
#include <stdlib.h>

#define SIZE 1000

int doTest(restrict float a[][SIZE],
           restrict float b[][SIZE],
           restrict float c[][SIZE], int size)
{
  int i,j,k;

#pragma acc kernels create(a[0:size][0:size], b[0:size][0:size]) \
                    copyout(c[0:size][0:size])
  {
    // Initialize matrices.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
        a[i][j] = (float)i + j;
        b[i][j] = (float)i - j;
        c[i][j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
#pragma acc loop seq
        for (k = 0; k < size; ++k) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
  }
}

int main()
{
  int i,j,k;
  int size=SIZE;
  float *a= (float*)malloc(sizeof(float)*size*size);
  float *b= (float*)malloc(sizeof(float)*size*size);
  float *c= (float*)malloc(sizeof(float)*size*size);

  doTest(a,b,c, size);
  free(a);
  free(b);

  // ****************
  // double-check the OpenACC result sequentially on the host
  // ****************
  float *seq= (float*)malloc(sizeof(float)*size*size);

  // Initialize the seq matrix
  for(i = 0; i < size; ++i)
    for(j = 0; j < size; ++j)
      seq[i*SIZE+j] = 0.f;

  // Perform the multiplication
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      for (k = 0; k < size; ++k)
        seq[i*size+j] += (i+k) * (k-j);

  // check all the OpenACC matrices
  for (i = 0; i < size; ++i)
    for (j = 0; j < size; ++j)
      if(c[i*size+j] != seq[i*size+j]) {
        printf("Error (%d %d) (%g, %g)\n", i,j, c[i*size+j], seq[i*size+j]);
        exit(1);
      }
  free(c);
  free(seq);

  printf("OpenACC matrix multiplication test was successful!\n");
  return 0;
}
Example 9: matrix-acc-func.c source code.
Using Data Allocated on the Device
OpenACC also provides the ability to use previously allocated device memory with the deviceptr() clause. The following example, matrix-acc-alloc.c, demonstrates how to allocate memory in main() with the OpenACC runtime method acc_malloc(). The pointer is then passed to doTest(), where it is accessed via deviceptr(). The copyout() clause also includes the size of the contiguous region of memory. For timing purposes, this code utilizes a size specified by the user on the command line.
/* matrix-acc-alloc.c */
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>

int doTest(restrict float *a, restrict float *b,
           restrict float *c, int size)
{
  int i,j,k;

#pragma acc kernels deviceptr(a, b) copyout(c[0:size*size-1])
  {
    // Initialize matrices.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
        a[i*size+j] = (float)i + j;
        b[i*size+j] = (float)i - j;
        c[i*size+j] = 0.0f;
      }
    }

    // Compute matrix multiplication.
#pragma acc loop independent
    for (i = 0; i < size; ++i) {
#pragma acc loop independent
      for (j = 0; j < size; ++j) {
#pragma acc loop seq
        for (k = 0; k < size; ++k) {
          c[i*size+j] += a[i*size+k] * b[k*size+j];
        }
      }
    }
  }
}

int main(int argc, char *argv[])
{
  int i,j,k;

  if(argc < 2) {
    fprintf(stderr,"Use: size (for size x size) matrices\n");
    return -1;
  }
  int size=atoi(argv[1]);

  float *a = (float *)acc_malloc(sizeof(float)*size*size);
  float *b = (float *)acc_malloc(sizeof(float)*size*size);
  float *c= (float*)malloc(sizeof(float)*size*size);

  printf("size = %d\n",size);

  doTest(a,b,c, size);

  acc_free(a);
  acc_free(b);
  free(c);

  printf("OpenACC matrix multiplication test was successful!\n");
  return 0;
}
Example 10: Source code for matrix-acc-alloc.c.
Conclusion
OpenACC has been designed to provide OpenMP-style programmers with an easy transition to GPU programming. Following the common-sense adage, "Make your life easy and use the highest level API first," OpenACC provides a natural starting point for transitioning any C or Fortran code to massive parallelism. For legacy code, OpenACC can be the only viable route to massively parallel coprocessors because it eliminates the need for a total rewrite of the software and because Fortran is supported. As a result, OpenACC opens the door to scalable, massively parallel GPU (or, more generically, coprocessor) acceleration of millions of lines of legacy application code. Currently, OpenACC is supported by compilers that must be purchased from either PGI or CAPS Enterprise. The PGI compiler used in this article is free for evaluation, but continued use after the trial period expires requires a license that must be purchased. As with OpenMP, it is assumed that open-source compilers will eventually provide free OpenACC support.
Profiling and informational compiler messages play a key role in achieving high performance in pragma-based programming. Instead of having to blindly add pragmas and then guess at the impact each might have on an application, free tools like the NVIDIA Visual Profiler let the developer actually see what is happening during runtime on Windows, Linux, and Mac computers. Being able to see what effect OpenACC pragmas have on runtime behavior greatly speeds the OpenACC learning process as well as application acceleration.
My next article in this series will discuss the OpenACC memory and execution model including the gang and
worker clauses plus more sophisticated ways to handle data.
Rob Farber is an analyst who writes frequently on High-Performance Computing hardware topics.