What could go wrong?

OpenACC for Fortran
PGI Compilers for Heterogeneous Supercomputing
Advanced GPU Programming
Data management API routines
Multiple devices
Atomic operations
Derived types
Managed memory
Conditional GPU code
Multicore as a target
Interoperability with OpenMP
Interoperability with CUDA C and CUDA Libraries
Interoperability with CUDA Fortran
Data Management API
API routine                       Equivalent directive
acc_copyin(a(:))                  !$acc enter data copyin(a(:))
acc_create(b(:))                  !$acc enter data create(b(:))
acc_copyout(a(:))                 !$acc exit data copyout(a(:))
acc_delete(b(:))                  !$acc exit data delete(b(:))
acc_is_present(a(2:n-1))
acc_update_host(a(2:n-1))         !$acc update host(a(2:n-1))
acc_update_device(b(2:n))         !$acc update device(b(2:n))
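A minimal sketch of driving data movement with these API routines instead of directives (hypothetical program and array names; assumes the openacc module for the interfaces):

program data_api_demo
   use openacc
   implicit none
   integer, parameter :: n = 1000
   integer :: i
   real, allocatable :: a(:), b(:)
   allocate(a(n), b(n))
   a = 1.0
   call acc_copyin(a)          ! same effect as !$acc enter data copyin(a)
   call acc_create(b)          ! same effect as !$acc enter data create(b)
   !$acc parallel loop default(present)
   do i = 1, n
      b(i) = 2.0*a(i)
   enddo
   call acc_update_host(b)     ! same effect as !$acc update host(b)
   call acc_delete(b)          ! same effect as !$acc exit data delete(b)
   call acc_copyout(a)         ! same effect as !$acc exit data copyout(a)
   print *, a(1), b(1)
end program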
Multiple Devices
Environment Variable ACC_DEVICE_NUM
API routine acc_set_device_num
- call acc_set_device_num( 1, acc_device_nvidia )
OpenMP, based on thread number
- nd = acc_get_num_devices( acc_device_nvidia )
- ign = mod(omp_get_thread_num(),nd)
- call acc_set_device_num( ign, acc_device_nvidia )
MPI, based on rank
- nd = acc_get_num_devices( acc_device_nvidia )
  call mpi_comm_rank( mpi_comm_world, irank, ierror )
  ign = mod(irank,nd)
  call acc_set_device_num( ign, acc_device_nvidia )
Multiple devices with OpenMP
nd = acc_get_num_devices( acc_device_nvidia )
!$omp parallel private(ign)
ign = mod(omp_get_thread_num(),nd)
call acc_set_device_num( ign, acc_device_nvidia )
!$omp end parallel
...
!$acc data copy(a(:,:))
!$omp parallel do
do j = 1, n
   !$acc parallel loop
   do i = 1, n
      a(i,j) = ...
   enddo
enddo
!$acc end data
What could go wrong?
Multiple devices with OpenMP
...
!$omp parallel
!$acc data copy( a(:,:) )
!$omp do
do j = 1, n
   !$acc parallel loop present(a)
   do i = 1, n
      a(i,j) = ...
   enddo
enddo
!$acc end data
!$omp end parallel
What could go wrong?
Multiple devices with one thread
nd = acc_get_num_devices( acc_device_nvidia )
nchunk = (n+nd-1)/nd
do ign = 0, nd-1
   call acc_set_device_num( ign, acc_device_nvidia )
   jlow = ign*nchunk + 1
   jhigh = min(n, (ign+1)*nchunk)
   !$acc enter data copyin(a(:,jlow:jhigh)) async
   !$acc parallel loop async
   do j = jlow, jhigh
      do i = 1, n
         a(i,j) = ...
      enddo
   enddo
   !$acc exit data copyout(a(:,jlow:jhigh)) async
enddo
!$acc wait
What could go wrong?
Multiple devices with MPI
No sharing between ranks, even on same GPU
Can run out of memory (no virtual memory on GPU)
Atomic Operations
OpenACC atomic construct, like OpenMP atomic construct
- some constructs will generate hardware atomic operations
!$acc atomic update
x = x + a(i)
!$acc atomic update
y = min(y,b(i))
!$acc atomic capture
ix = ix + 1
ime = ix
!$acc end atomic
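A minimal sketch of a common use of atomic capture: handing each iteration a unique slot in an output list (hypothetical routine and names):

subroutine compact_positive( b, list, nfound, n )
   ! append the indices of positive entries of b to list, in no particular order
   integer :: n, i, ime
   real    :: b(n)
   integer :: list(n), nfound
   nfound = 0
   !$acc parallel loop copyin(b) copyout(list) copy(nfound) private(ime)
   do i = 1, n
      if( b(i) > 0.0 )then
         !$acc atomic capture
         nfound = nfound + 1
         ime = nfound          ! captured value is this iteration's slot
         !$acc end atomic
         list(ime) = i
      endif
   enddo
end subroutine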
Fortran Derived Types
Arrays of derived type work just like arrays
Derived types with fixed-size array members should just work
Derived types with allocatable array members
- Deep copy not implemented (or defined)
- Workaround for PGI
type dt
   integer :: n
   real, dimension(:), allocatable :: xm
end type
type(dt) :: x
...
!$acc enter data copyin(x)
!$acc enter data copyin(x%xm)
...
!$acc exit data copyout(x%xm)
!$acc exit data delete(x)
type dt
   integer :: n
   real, dimension(:), allocatable :: xm
end type
type(dt), allocatable :: x(:)
...
!$acc enter data copyin(x)
do i = 1, n
   !$acc enter data copyin(x(i)%xm)
enddo
....
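The matching teardown would presumably mirror the scalar case, members first and parent last (a sketch of the same PGI workaround):

do i = 1, n
   !$acc exit data copyout(x(i)%xm)
enddo
!$acc exit data delete(x)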
Managed Memory
Compile and link with -ta=tesla:managed
Allocate statements will allocate in CUDA Unified Memory
Advantages
- Most data clauses can be skipped, and in fact are ignored
- If locality works, most data stays on the GPU
- Data transfers use fast pinned data transfers
- Good for initial porting
- Derived type allocatable members automatically work
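A minimal sketch of what that looks like in practice (hypothetical names; build with something like pgfortran -acc -ta=tesla:managed):

program managed_demo
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real, allocatable :: a(:), b(:)   ! allocate() places these in CUDA Unified Memory
   allocate(a(n), b(n))
   a = 1.0
   !$acc parallel loop               ! no data clauses needed; they would be ignored anyway
   do i = 1, n
      b(i) = 2.0*a(i)
   enddo
   print *, b(n)
end program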
Managed Memory
Disadvantages
- All managed memory is moved to the GPU for each kernel launch
- No prefetch, no asynchronous data movement
- Only works for dynamically allocated memory
  - local variables, module variables, static symbols are not managed
- Limited to the memory size of the GPU
- Allocate and deallocate are expensive
- Kepler only
- Only one device
- Your program can segfault(!) if the host code accesses managed data while the GPU is busy
Conditional GPU code
if clause on acc parallel / acc kernels
acc_on_device(acc_device_...)
subroutine host_or_device( a, ongpu )
   use openacc
   real, dimension(:) :: a
   logical :: ongpu
   !$acc parallel loop if(ongpu) default(present)
   do i = 1, ubound(a,1)
      if( acc_on_device( acc_device_nvidia ) )then
         a(i) = devfoo( a(i) )
      else
         a(i) = hostfoo( a(i) )
      endif
   enddo
end subroutine
Compile for GPU and Host
-ta=tesla,host
- compiles each compute region for Tesla and sequential host code
ACC_DEVICE_TYPE
- nvidia or host
acc_set_device_type( acc_device_nvidia | acc_device_host )
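A minimal sketch of picking the target at run time in a binary built with -ta=tesla,host (hypothetical routine name; assumes the openacc module):

subroutine select_target( use_gpu )
   use openacc
   logical :: use_gpu
   if( use_gpu )then
      call acc_set_device_type( acc_device_nvidia )
   else
      call acc_set_device_type( acc_device_host )
   endif
end subroutine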
Compile for Multicore
-ta=multicore
- compiles each compute region for parallel multicore host execution
- -ta=tesla,multicore will work in 2016
Currently being beta tested
- only one outer parallel loop is run in parallel
- no tuning for multicore execution (yet)
- no data movement (data clauses ignored)
Useful for initial code development
Useful for multi-target code deployment
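A sketch of the single-source idea (hypothetical saxpy routine); the source is unchanged, only the build line differs:

! pgfortran -acc -ta=tesla     saxpy.f90   -> GPU
! pgfortran -acc -ta=multicore saxpy.f90   -> parallel multicore host
subroutine saxpy( y, x, a, n )
   integer :: n, i
   real    :: y(n), x(n), a
   !$acc parallel loop
   do i = 1, n
      y(i) = y(i) + a*x(i)
   enddo
end subroutine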
Interoperability with OpenMP
-acc -mp to enable OpenACC and OpenMP
Threads can share a GPU
Shared data on host will be shared on the GPU as well
Data regions can overlap for shared data
- data created / copied in at entry to first data region
- data copied out / deleted at exit from last data region, even if different thread
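A minimal sketch of overlapping data regions on shared data (hypothetical routine; built with -acc -mp, all threads sharing one GPU):

subroutine scale_shared( a, n )
   integer :: n, i, j
   real    :: a(n,n)
   !$omp parallel private(i, j)
   !$acc data copy(a)               ! first thread to arrive creates/copies the device copy
   !$omp do
   do j = 1, n
      !$acc parallel loop
      do i = 1, n
         a(i,j) = 2.0*a(i,j)
      enddo
   enddo
   !$acc end data                   ! copied out/deleted when the last thread exits
   !$omp end parallel
end subroutine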
Interoperability with OpenMP 4
No existing implementation of OpenMP 4 and OpenACC (Cray?)
Data management of both are coherent
- copy == map(tofrom), copyin == map(to), copyout == map(from), create == map(alloc)
- OpenACC defines two copies kept coherent by program
- OpenMP defines mapping a single copy from host to device and back
Parallelism management is very different
- OpenMP has teams, threads, SIMD lanes
- OpenACC has gangs, workers, vector lanes
- OpenMP is strictly prescriptive: a parallel loop is a loop that runs in parallel
- OpenACC is more descriptive: a parallel loop asserts the loop is truly parallel
Same runtime should be able to handle both in a single program
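A minimal sketch of the data-clause correspondence and the different parallelism spellings (hypothetical routines; the OpenMP 4 version is shown for comparison only):

subroutine scale_acc( a, n )
   integer :: n, i
   real    :: a(n)
   !$acc data copy(a)                     ! copy == map(tofrom)
   !$acc parallel loop
   do i = 1, n
      a(i) = 2.0*a(i)
   enddo
   !$acc end data
end subroutine

subroutine scale_omp( a, n )
   integer :: n, i
   real    :: a(n)
   !$omp target data map(tofrom: a)       ! map(tofrom) == copy
   !$omp target teams distribute parallel do
   do i = 1, n
      a(i) = 2.0*a(i)
   enddo
   !$omp end target data
end subroutine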
Interoperability with CUDA
The device kernels are CUDA kernels, the data is CUDA data
Data interoperability
- Calling OpenACC C with data from CUDA C
- Calling OpenACC Fortran with data from CUDA C
- Calling OpenACC Fortran with data from CUDA Fortran
- Calling CUDA C with OpenACC data
- Calling CUDA Fortran with OpenACC data
Compute interoperability
- Calling CUDA C device routines from OpenACC
- Calling CUDA Fortran device routines from OpenACC
OpenACC data in CUDA C
#pragma acc data copyin(a[0:n]) copy(x[0:n])
{
...
#pragma acc host_data use_device(a)
{
cuda_routine( a );
}
...
#pragma acc parallel loop
for( j = 0; j < n; ++j ) a[j] = ...
}
CUDA C data in OpenACC
float *a;
cudaMalloc( &a, sizeof(float)*n );
...
openacc_routine( a );
...
void openacc_routine( float* a ){
...
#pragma acc parallel loop deviceptr(a)
for( j = 0; j < n; ++j ) a[j] = ...
}
CUDA C data in OpenACC
float *a;
cudaMalloc( &a, sizeof(float)*n );
...
openacc_routine_( a );
...
subroutine openacc_routine( a )
real a(*)
!$acc parallel loop deviceptr(a)
do j = 1, n
a(j) = ...
enddo
end subroutine
CUDA Fortran data in OpenACC
real, allocatable, device :: a(:)
allocate(a(n))
...
call openacc_routine(a)
...
subroutine openacc_routine( a )
real, device :: a(*)
!$acc parallel loop
do j = 1, n
a(j) = ...
enddo
end subroutine
OpenACC data in CUDA Fortran
real, allocatable :: a(:)
allocate(a(n))
!$acc data copyin(a)
...
call cuf_routine(a)
...
!$acc end data
...
subroutine cuf_routine( a )
real, device :: a(*)
!$cuf kernel do <<<*,64>>>
do i = 1, n
....
OpenACC data in CUDA Fortran
real, allocatable :: a(:)
allocate(a(n))
!$acc data copyin(a)
...
call cuf_kernel<<<n/64,64>>>(a)
...
!$acc end data
...
attributes(global) subroutine cuf_kernel( a )
real, device :: a(*)
....
CUDA device routines in OpenACC
interface
subroutine cudadev( a, i, x ) bind(c)
real a(*)
real, value :: x
integer, value :: i
!$acc routine seq
end subroutine
end interface
...
!$acc parallel loop gang vector present(a)
do i = 1, n
call cudadev( a, i, x )
enddo
CUDA device routines in OpenACC
__device__ void cudadev( float* a, int i, float x ){
a[i] *= x;
}
CUDA device routines in OpenACC
module mm
contains
attributes(device) subroutine cudadev( a, i, x )
real a(*)
real, value :: x
integer, value :: i
a(i) = x*a(i)
end subroutine
end module
use mm
!$acc parallel loop gang vector present(a)
do i = 1, n
call cudadev( a, i, x )
enddo
CUDA Fortran and OpenACC
Data with the device attribute can be used in OpenACC regions
Data transfers with the pinned attribute will be faster
OpenACC compute regions may call CUDA library routines
OpenACC compute regions may call user device procedures
OpenACC data may be passed to arguments with the device attribute
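A minimal sketch of the pinned point (hypothetical routine; requires CUDA Fortran, e.g. building with -Mcuda alongside -acc):

subroutine pinned_demo( n )
   integer :: n, i
   real, allocatable, pinned :: a(:)   ! page-locked host allocation
   allocate(a(n))
   a = 1.0
   !$acc data copyin(a)                ! transfers take the faster pinned path
   !$acc parallel loop
   do i = 1, n
      a(i) = 2.0*a(i)
   enddo
   !$acc update host(a)
   !$acc end data
   deallocate(a)
end subroutine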
CUDA Libraries
-Mcudalib=cublas|cufft|curand|cusparse
CUBLAS
- use cublas or use cublasxt or use openacc_cublas
CUFFT
- use cufft
- cufftSetStream(plan,acc_get_cuda_stream(acc_async_sync))
CURAND
- use curand or use openacc_curand
THRUST
- interface blocks, acc_get_cuda_stream(acc_async_sync)
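A sketch of calling cuBLAS on OpenACC data through host_data (hypothetical routine; the cublasSaxpy interface and its initialization behavior are assumed to come from the PGI cublas module, built with -acc -Mcudalib=cublas):

subroutine acc_saxpy( y, x, alpha, n )
   use cublas                        ! assumed to provide cublasSaxpy for device arrays
   integer :: n
   real    :: y(n), x(n), alpha
   !$acc data copyin(x) copy(y)
   !$acc host_data use_device(x, y)
   call cublasSaxpy( n, alpha, x, 1, y, 1 )   ! operates on the device copies
   !$acc end host_data
   !$acc end data
end subroutine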
OpenACC and OpenMP 4
OpenACC                                 OpenMP
Focused on accelerated computing        General-purpose parallelism
More agile                              More measured
Performance portability                 Performance portability a challenge
Descriptive                             Prescriptive
Parallel loops                          Loops that run across threads
Extensive interoperability              Limited interoperability
More mature for accelerators            More mature for multi-core
Modern HPC Node
[Diagram: dual-socket x86 node -- two X86 CPUs connected by HT/QPI, each with many cores behind a shared cache and attached to high-capacity memory]
Modern HPC Node
[Diagram: X86 CPU (cores, shared cache, high-capacity memory) connected via PCIe 3 to a GPU accelerator (many cores, shared cache, high-bandwidth memory)]
Latency- vs Throughput-Optimized Cores
CPU (LOC)
- Fast clock (2.5-3.5 GHz)
- More work per clock
  - deep pipelining
  - 3-5 wide multiscalar instruction issue
  - 4-16 wide SIMD instructions
  - 4-24 cores
- Fewer stalls
  - large 10-24 MB cache
  - complex branch prediction
  - out-of-order execution
  - 2-4 wide multithreading
Accelerator / GPU (TOC)
- Slow clock (0.8-1.2 GHz)
- More work per clock
  - shallow pipelining
  - 1-2 wide multiscalar instruction issue
  - 16-64 wide SIMD instructions
  - 24-72 cores
- Fewer stalls
  - small 0.25-2 MB cache
  - little branch prediction
  - in-order execution
  - 15-32 wide multithreading
Modern HPC Node
[Diagram: X86 CPU (cores, shared cache, high-capacity memory) connected via PCIe 3 to a Xeon Phi coprocessor (many cores, shared cache, high-bandwidth memory)]
Modern HPC Node
[Diagram: APU -- CPU and GPU cores on one chip behind a shared cache, attached to high-capacity memory]
Modern HPC Node
[Diagram: Knights Landing -- self-hosted many-core chip with a shared cache, high-capacity memory, and on-package high-bandwidth memory]
Modern HPC Node
[Diagram: APU with both high-bandwidth memory and high-capacity memory behind a shared cache]
Modern HPC Node
[Diagram: POWER CPU (cores, shared cache, high-capacity memory) connected via NVLink to a Tesla accelerator (many cores, shared cache, high-bandwidth memory)]
Modern HPC Node
[Diagram: ARM CPU (cores, shared cache, high-capacity memory) connected via PCIe 3 to a Tesla accelerator (many cores, shared cache, high-bandwidth memory)]
Modern HPC Node
[Diagram: ARM CPU (cores, shared cache, high-capacity memory) connected via NVLink to a Tesla accelerator (many cores, shared cache, high-bandwidth memory)]
Performance Portability
The same program runs, and runs well, across multiple targets
Performance Portability
program      seq(s)    gpu(s)   speedup   m-core(s)   speedup
clvrleaf       2698    161.73      16.7      511.1        5.2
md            13463    115.43     116.0      400.03      33.6
minighost      1062    146.13       7.3      319.93       3.3
olbm            449    305.97       1.4       95.75       4.7
ostencil      13835     60.13     230.0     2276.57       6.0
swim            537     83.98       6.4      121.20       4.4

These are SPECAccel benchmark estimates, measured on a dual-processor Intel Haswell (32 cores) with an NVIDIA K80 GPU.
OpenACC Course – Starts Oct 1st
A Free Online Course
Experienced Instructors
OpenACC Toolkit
GPU Access
4 Classes
4 Office Hours
Hands-on Labs
Register at https://developer.nvidia.com/openacc_course
Performance Portable Programming
Challenges and Opportunities
- high core count devices
- large system memories, smaller high-bandwidth memories
OpenACC is demonstrating performance portability
Data management: as important as parallelism
- data location as well as data layout
Parallelism: Expose, Express, Exploit
- performance, not parallelism
https://www.pgroup.com/userforum
https://developer.nvidia.com/openacc
https://developer.nvidia.com/openacc_course
openacc@nvidia.com
OpenACC 1, 2, 2.5, 3, ...
OpenACC 1.0: data region, compute region, update, async
OpenACC 2.0: +routine, +atomics, +enter data/exit data
OpenACC 2.5: +default(present), -present_or_, +profile interface
OpenACC 3.0 (planned): deep copy, shared memory options
Backup Slides
!$acc data copyin(a(:,:), v(:)) copy(x(:))
!$acc parallel
!$acc loop gang
do j = 1, n
   sum = 0.0
   !$acc loop vector reduction(+:sum)
   do i = 1, n
      sum = sum + a(i,j) * v(i)
   enddo
   x(j) = sum
enddo
!$acc end parallel
!$acc end data
!$acc data copyin(a(:,:), v(:)) copy(x(:))
call matvec( a, v, x, n )
!$acc end data
...
subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel present(m,v,r)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j) * v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine
!$acc data copyin(a(:,:), v(:)) copy(x(:))
call matvec( a, v, x, n )
!$acc end data
...
subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel default(present)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j) * v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine
call init( v, n )
call fill( a, n )
!$acc data copy( x )
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...
subroutine init( v, n )
   real, allocatable :: v(:)
   allocate(v(n))
   v(1) = 0
   do i = 2, n
      v(i) = ....
   enddo
   !$acc enter data copyin(v)
end subroutine