A Source-to-Source OpenACC compiler for CUDA

advertisement
A Source-to-Source OpenACC
compiler for CUDA
Akihiro Tabuchi†1
Masahiro Nakao†2
Mitsuhisa Sato†1
†1. Graduate School of Systems and Information Engineering,
University of Tsukuba
†2. Center for Computational Sciences, University of Tsukuba
Outline
•
•
•
•
•
Background
OpenACC
Compiler Implementation
Performance Evaluation
Conclusion & Future Work
Background
• Accelerator programming model
– CUDA (for NVIDIA GPU)
– OpenCL (for various accelerators)
• Accelerator programming is complex
– memory management, kernel function, …
– low productivity & low portability
• OpenACC is proposed to solve these problems
OpenACC
• The directive-based programming model for
accelerators
– support C, C++ and Fortran
• Offloading model
– offload a part of code to an accelerator
• High productivity
– only adding directives
• High portability
– run on any accelerators
as long as the compiler supports it
Example of OpenACC
int main(){
int i;
int a[N], b[N], c[N];
/* initialize array ‘a’ and ‘b’ */
#pragma acc parallel loop copyin(a,b) copyout(c)
for(i = 0; i < N; i++){
c[i] = a[i] + b[i];
}
}
This directive specifies data transfers and loop
offloading and parallelization
Purpose of Research
• Designing and implementing an open source
OpenACC compiler
– Target language
:C
– Target accelerator : NVIDIA GPU
– Source-to-source approach
• C + OpenACC → C + CUDA API
• This approach enables to leave detailed machinespecific code optimization to the mature CUDA
compiler by NVIDIA
– The result of compilation is a executable file
Related Work
• Commercial compiler
– PGI Accelerator compiler
– CAPS HMPP
– Cray compiler
• Open source compiler
– accULL
•
•
•
•
developed at University of La Laguna in Spain
Source-to-source translation
Backend is CUDA and OpenCL
Output is codes and a Makefile
OpenACC directives
•
•
•
•
•
•
•
parallel
kernels
loop
data
host_data
update
wait
•
•
•
•
cache
declare
parallel loop
kernels loop
(OpenACC specification 1.0)
data construct
int a[4];
#pragma acc data copy(a)
{
/* some codes using ‘a’ */
host memory
device memory
computation
on device
}
Data management on Accelerator
If an array is specified in “copy” clause …
1. Device memory allocation
at the beginning of region
2. Data transfer from host to device
3. Data transfer from device to host
at the end of region
4. Device memory release
Translation of data construct
int a[4];
#pragma acc data copy(a)
{
/* some codes using ‘a’ */
}
•
•
•
•
host address
device address
size
….
int a[4];
{
allocate ‘a’ on GPU
void *_ACC_DEVICE_ADDR_a,*_ACC_HOST_DESC_a;
_ACC_gpu_init_data(&_ACC_HOST_DESC_a, &_ACC_DEVICE_ADDR_a, a,
4*sizeof(int)); _ACC_gpu_copy_data(_ACC_HOST_DESC_a, 400);
copy ‘a’ to GPU from host
{
/* some codes using ‘a’ */
}
copy ‘a’ to host from GPU
_ACC_gpu_copy_data(_ACC_HOST_DESC_a, 401);
_ACC_gpu_finalize_data(_ACC_HOST_DESC_b);
free ‘a’ on GPU
}
parallel construct
#pragma acc parallel num_gangs(1) vector_length(128)
{
/* codes in parallel region */
}
• Codes in parallel region are executed on device
• Three levels of parallelism
– gang
– worker
– vector
• The number of gang or worker
or vector length can be
specified by clauses
OpenACC CUDA
gang
thread block
worker
(warp)
vector
thread
Translation of parallel construct
#pragma acc parallel num_gangs(1) vector_length(128)
{
/* codes in parallel region */
}
__global__ static void _ACC_GPU_FUNC_0_DEVICE( ... )
{
GPU kernel
/* codes in parallel region */
function
}
extern "C” void _ACC_GPU_FUNC_0( … )
{
dim3 _ACC_block(1, 1, 1), _ACC_thread(128, 1, 1);
kernel
_ACC_GPU_FUNC_0_DEVICE<<<_ACC_block,_ACC_thread>>>( .. launch
. );
function
_ACC_GPU_M_BARRIER_KERNEL();
}
loop construct
/* inside parallel region */
#pragma acc loop vector
for(i = 0; i < 256; i++){
a[i]++;
}
• Loop construct describes parallelism of loop
– Distribute loop iteration among gang, worker or vector
– Two or more parallelisms can be specified for a loop
• Loops with no loop directive in parallel region is
basically executed serially.
Translation of loop construct (1/3)
/* inner parallel region */
#pragma acc loop vector
for(i = 0; i < N; i++){
a[i]++;
}
1. A virtual index which is
the same length as loop
iteration is prepared
0
1 2 3 4 5 6 7
8 9 1 1 1 1 1 1
0 1 2 3 4 5
2. The virtual index is divided
and distributed among
blocks and/or threads
0
1
2
3
8
9 10 11
4
5
6
7
12 13 14 15
3. Each thread calculates
the value of loop variable
from the virtual index
and executes loop body
Translation of loop construct (2/3)
/* inner parallel region */
#pragma acc loop vector
for(i = 0; i < N; i++){
a[i]++;
}
virtual index:_ACC_idx
virtual index range : _ACC_init, cond, step
calculate the range of virtual index
/* inner gpu kernel code */
virtual index
int i, _ACC_idx;
range variables
int _ACC_init, _ACC_cond, _ACC_step;
_ACC_gpu_init_thread_x_iter(&_ACC_init, &_ACC_cond, &_ACC_step, 0, N, 1);
for(_ACC_idx = _ACC_init; _ACC_idx < _ACC_cond; _ACC_idx += _ACC_step){
_ACC_gpu_calc_idx(_ACC_idx, &i, 0, N, 1);
a[i]++;
calculate ‘i’ from virtual index
loop body
}
Translation of loop construct(3/3)
• Our compiler supports 2D blocking for nested
loops
– Nested loops are distributed among the 2D blocks
in the 2D grid in CUDA (default block size is 16x16)
– But it’s not allowed in OpenACC 2.0 and
2D Grid
“tile” clause is provided instead
#pragma acc loop gang vector
for( i = 0; i < N; i++)
#pragma acc loop gang vector
for(j = 0; j < N; j++)
/* … */
distribute
2D Block
Compiler Implementation
• Our compiler translates C with OpenACC
directives to C with CUDA API
– read C code with directives and output translated
code
– using Omni compiler infrastructure
• Omni compiler infrastructure
– a set of programs for a source-to-source compiler with
code analysis and transformation
– supports C and Fortran95
Flow of Compilation
sample.c
C with OpenACC
directives
a.out
Omni
Frontend
sample.xml
OpenACC
translator
XcodeML
Omni compiler
infrastructure
acc runtime
sample_tmp.o
C
compiler
sample
_tmp.c
C with ACC API
sample.gpu.o
nvcc
sample.cu
CUDA
Performance Evaluation
• Benchmark
– Matrix multiplication
– N-body problem
– NAS Parallel Benchmarks – CG
• Evaluation environment
– 1 node of Cray XK6m-200
• CPU : AMD Opteron Processor 6272 (2.1GHz)
• GPU : NVIDIA X2090 (MatMul, N-body)
: NVIDIA K20 (NPB CG)
Performance Comparison
• Cray compiler
• Our compiler
• Hand written CUDA
– The code is written in CUDA and compiled by NVCC
– The code doesn’t use shared memory of GPU
• Our compiler (2D-blocking)
– The code uses 2D blocking and is compiled by our
compiler
– This is applied to only matrix multiplication
Relative performance against
CPU
Matrix multiplication
6
5
5.5x
4.6x
1.4x
Cray compiler
41.5x
3
Hand-written CUDA
2
Our compiler
1
Our compiler, 2DBlocking
0
1K
2K
4K
Matrix size
8K
The performance of our compiler using 2D-blocking
and hand-written CUDA are slightly lower
Matrix multiplication
• Our compiler achieves better performance than that of
Cray compiler
– The PTX code directly generated by Cray compiler has more
operations in the innermost loop
– Our compiler outputs CUDA code, and NVCC generates more
optimized PTX code
• 2D-blocking is lower performance
– default 2D block size (16x16) is not
adequate to this program
– the best block size was 512x2
– Hand-written CUDA code also uses
16x16 block
Relative performance against
CPU
N-body
35
1.2x
31x
30
25
20
Cray compiler
15
10
0.95x
5
Hand-written CUDA
5.4x
Our compiler
0
1K
2K
4K
8K
16K
The number of particles
32K
At the small problem size, the performance of our
compiler is lower than that of Cray compiler
N-body
• At small problem size, the performance became worse
– Decline in the utilization of Streaming
Multiprocessors(SMs)
• A kernel is executed by SMs per thread block
– If the number of blocks is smaller than that of SMs, the
performance of the kernel becomes low.
• Default block size
– Cray compiler : 128 threads / block
– Our compiler : 256 threads / block
Relative perfomance against CPU
NPB CG
12
9.7x
10
8
2.1x
6
4
0.66x
Cray compiler
2
0.74x
0
Our compiler
Class(Matrix size)
the performance is lower than
that of CPU and Cray compiler
NPB CG
• At class S, the performance of GPU is lower than that of
CPU
– Overheads are larger compared with kernel execution time
• launching kernel functions
• synchronization with device
• data allocation / release / transfer
• The overhead is larger than
that of Cray compiler
– large overhead of reduction
• The performance of GPU kernels
are better than that of Cray compiler
Conclusion
• We implemented a source-to-source OpenACC
compiler for CUDA
– C with OpenACC directives → C with CUDA API
– Using Omni compiler infrastructure
• In most case, the performance of GPU code by our
compiler is higher than that of CPU single core
– Speedup of up to 31 times at N-body
• Our compiler makes use of CUDA backend successfully
by source-to-source approach
– the performance is often better than that of Cray compiler
• There is room for performance improvement
– using suitable grid size and block size
– reducing overhead of synchronization and reduction
Future Work
• Optimization
– tuning block size at compile time
– reducing overhead from synchronization and
reduction
• Support the full set of directives for
conforming to OpenACC specification in our
compiler
– We will release our compiler at next SC
Download