Dmitry Pekurovsky
San Diego Supercomputer Center
UC San Diego dmitry@sdsc.edu
Presented at XSEDE’13, July 22-25, San Diego
SAN DIEGO SUPERCOMPUTER CENTER
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Introduction: Fast Fourier Transforms and related spectral transforms
• Project scope: algorithms operating on structured grids in three dimensions that are computationally demanding, and process data in each dimension independently of others.
• Examples: Fourier, Chebyshev, finite difference high-order compact schemes
• Heavily used in many areas of computational science
• Computationally demanding
• Typically not a cache-friendly algorithm
• Memory bandwidth is stressed
• Communication intense
• All-to-all exchange is an expensive operation, stressing bisection bandwidth of the host’s network
[Figure: 1D decomposition and 2D decomposition of a 3D grid; axes x, y, z]
Algorithm scalability
• 1D decomposition: concurrency is limited to N (linear grid size).
• Not enough parallelism for O(10^4)-O(10^5) cores
• This is the approach of most libraries to date (FFTW 3.3, PESSL)
• 2D decomposition: concurrency is up to N^2
• Scaling to ultra-large core counts is possible
• The answer to the petascale challenge
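The concurrency limits above can be illustrated with a small sketch. The function name `max_tasks` is hypothetical, and the model simply assumes an N^3 grid in which every task must own at least one plane (1D) or one grid line (2D); a real library may impose tighter limits.

```python
def max_tasks(n: int, decomposition: str) -> int:
    """Upper bound on the task count for an n^3 grid (illustrative model only)."""
    if decomposition == "1D":
        return n        # slabs: at most one plane per task
    if decomposition == "2D":
        return n * n    # pencils: at most one grid line per task
    raise ValueError("decomposition must be '1D' or '2D'")


for n in (1024, 4096):
    print(n, max_tasks(n, "1D"), max_tasks(n, "2D"))
```

For N = 4096 the 1D bound falls well short of 10^5 cores, while the 2D bound leaves ample headroom.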
Need for general-purpose scalable library for spectral transforms
Requirements for the library:
1. Scalable to large core counts on significantly large problem sizes (implies 2D decomposition)
2. Achieves performance and scalability reasonably close to upper hardware capability
3. Has a simple user interface
4. Is sufficiently versatile to be of use to many research groups
P3DFFT
• Open source library for efficient, highly scalable spectral transforms on parallel platforms
• Uses 2D decomposition
• Includes 1D option.
• Available at http://code.google.com/p/p3dfft
• Historically grew out of a TeraGrid Advanced User Support Project (now called ECSS)
P3DFFT 2.6.1: features
• Currently implements:
1. Real-to-complex (R2C) and complex-to-real (C2R) 3D FFT transforms
2. Complex-to-complex 3D FFT transforms
3. Cosine/sine/Chebyshev transforms in third dimension (FFT in first two dimensions)
4. Empty transform in the third dimension
• User can substitute their custom algorithm
• Fortran and C interfaces
• Single or double precision
P3DFFT 2.6.1 features (contd)
• Arbitrary dimensions
• Handles many uneven cases (the grid size N_i does not have to be a multiple of the processor grid dimension M_j)
• Can do either in-place or out-of-place transform
• Can do pruned input/output (when only a subset of output or input modes is needed). This can save substantial time, as shown later.
• Includes installation instructions, extensive documentation, example programs in Fortran and C
3D FFT algorithm with 2D decomposition
[Figure: pencil layout of the (X, Y, Z) grid at each stage]
1. Perform 1D FFT in X
2. X-Y plane exchange in row subgroups
3. Perform 1D FFT in Y
4. Y-Z plane exchange in column subgroups
5. Perform 1D FFT in Z
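The reason the algorithm can proceed one dimension at a time is that the 3D transform is separable. The serial sketch below, which is not P3DFFT code, applies a naive 1D DFT along X, then Y, then Z and compares the result with a direct triple-sum 3D DFT. The helper names are hypothetical, and the naive O(n^2) DFT stands in for a library FFT call. In the parallel algorithm, the two exchanges exist only to make each of these passes local to a task.

```python
import cmath
from itertools import product


def dft1d(v):
    """Naive O(n^2) 1D DFT, standing in for a library FFT call."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def dft3d_by_axes(a):
    """Transform a[x][y][z] with 1D DFTs along X, then Y, then Z."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    out = [[[complex(a[x][y][z]) for z in range(nz)]
            for y in range(ny)] for x in range(nx)]
    for y, z in product(range(ny), range(nz)):      # pass 1: lines along X
        line = dft1d([out[x][y][z] for x in range(nx)])
        for x in range(nx):
            out[x][y][z] = line[x]
    for x, z in product(range(nx), range(nz)):      # pass 2: lines along Y
        line = dft1d([out[x][y][z] for y in range(ny)])
        for y in range(ny):
            out[x][y][z] = line[y]
    for x, y in product(range(nx), range(ny)):      # pass 3: lines along Z
        out[x][y] = dft1d(out[x][y])
    return out


def dft3d_direct(a):
    """Direct triple-sum 3D DFT, used only as a reference."""
    nx, ny, nz = len(a), len(a[0]), len(a[0][0])
    return [[[sum(a[x][y][z] * cmath.exp(
                  -2j * cmath.pi * (x * kx / nx + y * ky / ny + z * kz / nz))
                  for x in range(nx) for y in range(ny) for z in range(nz))
              for kz in range(nz)] for ky in range(ny)] for kx in range(nx)]


# Small 2 x 2 x 3 test array with arbitrary values.
a = [[[float(x + 2 * y + 3 * z) for z in range(3)]
      for y in range(2)] for x in range(2)]
fast = dft3d_by_axes(a)
ref = dft3d_direct(a)
max_err = max(abs(fast[x][y][z] - ref[x][y][z])
              for x in range(2) for y in range(2) for z in range(3))
print("max difference:", max_err)
```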
P3DFFT implementation
• Baseline version implemented in Fortran90 with MPI
• 1D FFT: call FFTW or ESSL
• Transpose implementation in 2D decomposition:
• Set up 2D Cartesian subcommunicators, using MPI_COMM_SPLIT (rows and columns)
• Two transposes are needed:
1. in rows
2. in columns
• Baseline version: exchange data using MPI_Alltoall or MPI_Alltoallv
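To make the row and column subgroups concrete, the sketch below models in plain Python how a split by "color" partitions the ranks of an M_1 x M_2 grid. No MPI calls are made. The function name `split_groups` and the rank-to-coordinate mapping (rank = i1 + M_1 * i2) are assumptions for illustration and may differ from what P3DFFT does internally.

```python
from collections import defaultdict


def split_groups(p, m1):
    """Group ranks 0..p-1 of an m1 x (p // m1) grid into rows and columns."""
    if p % m1 != 0:
        raise ValueError("m1 must divide p")
    rows, cols = defaultdict(list), defaultdict(list)
    for rank in range(p):
        i1, i2 = rank % m1, rank // m1   # assumed mapping of rank to grid coordinates
        rows[i2].append(rank)            # same i2: same row subgroup (size m1)
        cols[i1].append(rank)            # same i1: same column subgroup (size m2)
    return dict(rows), dict(cols)


rows, cols = split_groups(p=8, m1=2)
print("row subgroups:   ", rows)
print("column subgroups:", cols)
```

Each all-to-all exchange then involves only the M_1 tasks of one row or the M_2 tasks of one column, rather than all P tasks.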
Computation performance
• 1D FFT, three times:
1. Stride-1
2. Small stride
3. Large stride (out of cache)
• Strategy:
• Use an established library (ESSL, FFTW)
• An option to keep data in original layout, or transpose so that the stride is always 1
• The results are then laid out as (Z,Y,X) instead of (X,Y,Z)
• Use loop blocking to optimize cache use
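The three stride regimes follow from the memory layout. As a minimal sketch, assume a column-major (Fortran-style) local array of shape (nx, ny, nz) that is kept in its original layout; the function name and the example sizes are hypothetical.

```python
def strides_column_major(nx, ny, nz):
    """Element strides between consecutive points along each axis."""
    return {"X": 1, "Y": nx, "Z": nx * ny}


BYTES_PER_ELEMENT = 16  # assumption: double-precision complex values

for axis, stride in strides_column_major(256, 256, 256).items():
    print(axis, "stride:", stride, "elements =", stride * BYTES_PER_ELEMENT, "bytes")
```

Transposing the data between stages trades this growing stride for an extra memory copy, which is the choice offered by the layout option above.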
Communication performance
• A large portion of total time (up to 80%) is all-to-all
• Highly dependent on optimal implementation of MPI_Alltoall (varies with vendor)
• Buffers for exchange are close in size
• Good load balance, predictable pattern
• Performance can be sensitive to choice of 2D virtual processor grid (M_1, M_2)
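Because the best processor grid is platform dependent, a common approach is to enumerate the possible shapes and time each one. The sketch below only lists the candidates. The function name `candidate_grids` is hypothetical, and the constraint that neither dimension exceeds N is an assumption of this sketch, not a documented P3DFFT rule.

```python
def candidate_grids(p, n):
    """All (m1, m2) with m1 * m2 == p and both dimensions at most n."""
    return [(m1, p // m1)
            for m1 in range(1, p + 1)
            if p % m1 == 0 and m1 <= n and p // m1 <= n]


print(candidate_grids(p=16, n=8))
```

Each candidate would then be benchmarked on the target machine, since the preferred shape depends on how tasks map onto nodes and on the interconnect.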
Performance dependence on processor grid shape M_1 x M_2
Communication scaling and networks
• All-to-all exchanges are directly affected by bisection bandwidth of the interconnect
• Increasing P decreases buffer size
• Expect 1/P scaling on fat-trees and other networks with full bisection bandwidth (until buffer size gets below the latency threshold).
• On torus topology (Cray XT5, XE6) bisection bandwidth scales as P^(2/3)
• Expect P^(-2/3) scaling
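The two scaling expectations can be read off a simple bandwidth-bound model. This is a sketch under stated assumptions, not a measured result: the total exchanged volume V is fixed (strong scaling), a constant fraction of it crosses the bisection in an all-to-all, and latency is ignored. The symbols V, B(P) and T_comm are introduced here for illustration.

```latex
% V: total data volume exchanged; B(P): bisection bandwidth on P cores
\begin{align*}
T_{\mathrm{comm}}(P) &\propto \frac{V}{B(P)} \\
\text{full bisection (e.g. fat tree):}\quad B(P) \propto P
  &\;\Rightarrow\; T_{\mathrm{comm}} \propto P^{-1} \\
\text{3D torus:}\quad B(P) \propto P^{2/3}
  &\;\Rightarrow\; T_{\mathrm{comm}} \propto P^{-2/3}
\end{align*}
```

The torus exponent reflects that a bisecting cut of a 3D torus is a plane, so the number of links crossing it grows like the square of the linear machine size, that is, like P^(2/3).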
Strong scaling on Cray XT5 (Kraken) at NICS/ORNL
4096^3 grid, double precision, best M_1/M_2 combination
Weak Scaling (Kraken)
N^3 grid, double precision
2D vs. 1D decomposition
Applications of P3DFFT
P3DFFT has already been applied in a number of codes, in science fields including the following:
• Turbulence
• Astrophysics
• Oceanography
Other potential areas include
• Material Science
• Chemistry
• Aerospace engineering
• X-ray crystallography
• Medicine
• Atmospheric science
DNS turbulence
• Direct Numerical Simulations (DNS) code from Georgia Tech (P.K. Yeung et al.) to simulate isotropic turbulence on a cubic periodic domain
• Turbulence is characterized by disorderly, nonlinear fluctuations in 3D space and time that span a wide range of interacting scales
• DNS is an important tool for first-principles understanding of turbulence in great detail
• Vital for new concepts and models as well as improved engineering devices
• Areas of application include aeronautics, environment, combustion, meteorology, oceanography
• One of three Model Problems for NSF’s Track 1 solicitation
DNS algorithm
• It is crucial to simulate grids with high resolution to minimize discretization effects, and study a wide range of length scales.
• Uses 2nd- or 4th-order Runge-Kutta for time-stepping
• Uses pseudospectral method to solve Navier-Stokes equations
• 3D FFT is the most time-consuming part
• 2D decomposition based on P3DFFT framework has been implemented.
DNS performance (Cray XT5)
[Figure: DNS performance vs. number of cores for 4096^3 and 8192^3 grids]
P3DFFT Development: Motivation
• 3D FFT is a very common algorithm. In pseudospectral algorithms simulating turbulence it is mostly used for cases with isotropic conditions and in homogeneous domains with periodic boundary conditions. In this case only real-to-complex and complex-to-real 3D FFT are needed.
• For many researchers it is more interesting to study inhomogeneous systems, for example wall-bounded flows (Dirichlet or other non-periodic boundary conditions in one dimension). Chebyshev transforms or finite difference higher-order compact schemes are more appropriate in that case.
• In simulating compressible turbulence, higher-order compact schemes are again used.
• In other applications, a complex-to-complex transform may be needed; or a custom user transform.
P3DFFT Development: Motivation (cont’d)
• Many CFD/turbulence codes use the 2/3 dealiasing technique, where only 2/3 of the modes in each dimension are kept after the forward Fourier transform. This is a potential time- and memory-saving opportunity.
• Many codes have several independent arrays (variables) that need to be transformed. This can be implemented in a staggered fashion so as to overlap communication with computation.
• Some codes employ 3D rather than 2D domain decomposition. They need utilities to go between 2D and 3D.
• In some cases the usage scenario does not fit the common patterns, and the user might need access to isolated transposes.
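The potential saving from 2/3 dealiasing mentioned above can be estimated with simple arithmetic. The sketch below assumes, purely for illustration, that the unwanted modes can be discarded right after the 1D transform in each dimension, so the data volume shrinks by a factor of 2/3 per stage; whether a given implementation prunes at exactly these points is an assumption. The function name is hypothetical.

```python
from fractions import Fraction


def kept_fraction(keep_per_dim, stages):
    """Fraction of the full data volume left after pruning `stages` dimensions."""
    return keep_per_dim ** stages


keep = Fraction(2, 3)
for stages in (1, 2, 3):
    f = kept_fraction(keep, stages)
    print(f"after {stages} pruned dimension(s): {f} of the data ({float(f):.3f})")
```

Under this model the final output holds 8/27, or roughly 30%, of the full set of modes, and the exchanges that follow a pruned dimension move correspondingly less data.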
P3DFFT - Ongoing and planned work
Part 1: Interface and Flexibility
1. Added other types of transform (e.g. complex-to-complex, Chebyshev, empty) (DONE in P3DFFT 2.5.1)
2. Added pruned input/output feature, which enables 2/3 dealiasing (DONE in P3DFFT 2.6.1)
3. Expanding the memory layout options, including 3D decomposition utilities.
4. Adding ability to isolate transposes so the user can design their own transform
P3DFFT 2.6.1 performance for a large problem (8192^3)
Input N^3, output M^3; N=8192, M variable
[Figure: P3DFFT 2.6.1 on Blue Waters, log-scale vertical axis vs. number of cores (16384 to 131072), one curve per output size M = N, N*2/3, N/2, N/4, N/8, N/16]
P3DFFT - Ongoing and planned work
Part 2: Performance improvements
1. One-sided/nonblocking communication
• MPI-2, MPI-3
• OpenSHMEM
• Co-Array Fortran
2. Communication/computation overlap (requires RDMA)
3. Hybrid MPI/OpenMP implementation
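[Figure: timelines of four communication stages (Comm. 1-4) and four computation stages (Comp. 1-4). Default: communication and computation stages alternate, with idle periods in between. Overlap: Comm. 2, 3 and 4 run alongside Comp. 1, 2 and 3 respectively, leaving idle time only at the start.]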
Coarse-grain overlap
• Suitable for computing several FFTs at once
• Independent variables, e.g. velocity components
• Overlap communication stage of one variable with computation stage of another variable
• Uses large send buffers due to message aggregation
• Uses pairwise exchange algorithm, implemented through either MPI-2, SHMEM or Co-Array Fortran
• Alternatively, MPI-3 nonblocking collectives have recently become available
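The benefit of coarse-grain overlap can be estimated with a simple two-stage pipeline model. This is an idealized sketch: it assumes communication for variable i+1 can proceed fully in the background while variable i is being computed, with no interference between the two. The function names and the timings are hypothetical.

```python
def serial_time(comm, comp):
    """Default schedule: every communication and computation stage runs in turn."""
    return sum(comm) + sum(comp)


def overlapped_time(comm, comp):
    """Pipelined schedule: communications run back to back, and each
    computation starts once its data has arrived and the previous
    computation has finished."""
    comm_done = 0.0
    comp_done = 0.0
    for c, w in zip(comm, comp):
        comm_done += c
        comp_done = max(comm_done, comp_done) + w
    return comp_done


comm = [3.0, 3.0, 3.0, 3.0]   # hypothetical communication times, four variables
comp = [2.0, 2.0, 2.0, 2.0]   # hypothetical computation times
print("default:", serial_time(comm, comp))
print("overlap:", overlapped_time(comm, comp))
```

In this model the overlapped time approaches the larger of the total communication and total computation times, so the gain is bounded by the smaller of the two.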
Coarse-grain overlap, results on Mellanox ConnectX-2 cluster (64 and 128 cores)
K. Kandalla, H. Subramoni, K. Tomko, D. Pekurovsky, S. Sur, D. Panda, “High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT”, ISC’11, Germany. Computer Science - Research and Development, v. 26, i. 3, 237-246 (2011)
Hybrid MPI/OpenMP preliminary results (Kraken)
4096 nodes Kraken, 8 cores/node
P = (Thr * M_1) * M_2
[Figure: results for 1 thread and 2 threads at M_2 = 2048 and M_2 = 4096; vertical axis from 0 to 4.5]
Conclusions
• P3DFFT is an efficient, scalable and versatile library (available as open source at http://code.google.com/p/p3dfft)
• Performance consistent with hardware capability is achieved on leading platforms
• Great potential for enabling petascale science
• An excellent testing tool for future platforms’ capabilities
• Bisection bandwidth
• MPI implementation
• One-sided protocols implementation
• MPI/OpenMP hybrid performance
• An example of a project that came out of an Advanced User Support Collaboration, now benefiting a wider community
• Incorporated into a number of codes (~25 citations as of today, hundreds of downloads)
• A future XSEDE community code
• Work under way to expand capability and improve parallel performance even further
WHAT ARE YOUR PETASCALE ALGORITHMIC NEEDS?
Send me an e-mail: dmitry@sdsc.edu
Acknowledgements
• P.K. Yeung
• D.A. Donzis
• G. Chukkappalli
• J. Goebbert
• G. Brethouser
• N. Prigozhina
• K. Tomko
• K. Kandalla
• H. Subramoni
• S. Sur
• D. Panda
Work supported by XSEDE, NSF grants OCI-0850684 and CCF-0833155
Benchmarks run on TeraGrid resources Ranger (TACC) and Kraken (NICS), DOE resources Jaguar (NCCS/ORNL) and Hopper (NERSC), and Blue Waters (NCSA)