Porting Scientific Applications to OpenPOWER
Dirk Pleiter, Forschungszentrum Jülich / JSC
#OpenPOWERSummit

JSC's HPC Strategy
• Highly scalable systems: IBM Blue Gene/L JUBL (45 TFlop/s) → IBM Blue Gene/P JUGENE (1 PFlop/s) → IBM Blue Gene/Q JUQUEEN (5.9 PFlop/s)
• General-purpose clusters: IBM Power 6 JUMP (9 TFlop/s) → Intel Nehalem JUROPA (300 TFlop/s) → JURECA (~2 PFlop/s) + Booster (~10 PFlop/s)
• File servers: Lustre, GPFS

Achieving Scalability
Need for research on
• Architectures and technologies
• Applications and algorithms
→ Ingredients for HPC co-design
Provide incentives to users: the High-Q Club, a showcase for codes that can utilize the 28-rack Blue Gene/Q at JSC
Selected Club members
• dynQCD: simulation of particle theories
• KKRnano: DFT-based condensed matter
• PEPC: tree-based N-body code
• ...

Why OpenPOWER? Answer from a customer point of view
• An increasing share of the Top500 is based on CPUs from a single vendor
  • Pure market observation, no statement about technology
• Lack of competition in processor technologies
  • Usually higher prices
  • Less incentive for innovation
• Need for promoting alternative technologies
  → OpenPOWER

Why OpenPOWER? Answer from an architectural point of view
• Tight integration of a high-performance processor and low-clocked, highly parallel compute devices
  • Enables a drastic improvement in power efficiency
  • Preserves usability at a tremendously increased level of parallelism
  • Opportunity to improve the overall balance of the system
• Integration of non-volatile memory into fat compute nodes
  • Increased reliability through a reduced number of components and support for resilience
• Addresses exascale challenges

POWER Acceleration and Design Center
The PADC is a collaboration between
• IBM R&D Labs in Böblingen and Zürich
• Forschungszentrum Jülich
• NVIDIA Europe
Mission statement: support scientists and engineers in targeting the grand challenges facing society using OpenPOWER technologies
Grand challenges
• Energy and environment, e.g. plasma physics
• Information, e.g. condensed matter physics
• Healthcare, e.g. brain research
http://www.fz-juelich.de/ias/jsc/padc

Applications
The PADC takes an application-driven approach and builds on previous work in the NVIDIA Application Lab.
Previously targeted applications
• Regional Flood Model
• B-CALM
• PANDA
• ...
Ongoing and future applications
• KKRnano
• BigBrain
• ...

Performance Analysis
• Performance analysis of the POWER8 memory hierarchy
• Performance analysis of GPU-GPU data transport

Performance Characterization
Characterization of applications on given hardware
Methodology
• Identification of performance-critical kernels
• Optimization of the kernels at best effort within the given constraints
• Performance characterization
• Measurement of extensive performance metrics
• Architectural analysis
Questions addressed in the architectural analysis
• How does performance change with clock speed?
• How does it depend on the memory hierarchy?
• ...
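The following minimal sketch (not part of the original talk; the function name and all numbers are illustrative assumptions) shows the kind of roofline-style estimate such an architectural analysis relies on: given an arithmetic intensity and a measured memory bandwidth, it computes the attainable performance bound and compares it with the device peak. The placeholder values are of the same order as those reported for the K20X on the next slide, but are not measured data.

# Minimal roofline-style bound estimate (illustrative values only).
def roofline_bound(arith_intensity, peak_flops, mem_bandwidth):
    """Attainable performance (Flop/s) for a kernel with the given
    arithmetic intensity (Flop/Byte) on a device with the given
    peak compute rate (Flop/s) and memory bandwidth (Byte/s)."""
    return min(peak_flops, arith_intensity * mem_bandwidth)

# Placeholder numbers, roughly of K20X order of magnitude (assumptions):
ai = 0.5        # Flop/Byte, cf. the AI_acc reported for the flood-model kernel
peak = 1.31e12  # ~1.31 TFlop/s double precision
bw = 170e9      # ~170 GByte/s combined read+write bandwidth

bound = roofline_bound(ai, peak, bw)
print(f"memory-bandwidth bound: {bound/1e9:.0f} GFlop/s "
      f"({100*bound/peak:.1f}% of peak)")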
Performance Characterization
Example: Regional Flood Model
Key kernel: solver for the Saint-Venant equations
• Computes water flow in 2 dimensions
Selected performance metrics (on K20X)
• Arithmetic intensity AI_acc(T) = 0.5
• Memory read/write bandwidth = 80/86 GByte/s
• Warp execution efficiency ε_warp = 80%
Example analysis for a changing boost clock

Performance Modelling
Semi-empirical performance modelling methodology
• On the basis of prior knowledge, formulate scaling formulae describing the dependence of the execution time Δt(W) on the workload W
• Measure Δt(W) for different W and fit the scaling formulae to the results
• Check the fitted parameters for plausibility
Considered example: B-CALM
• 1-dimensionally parallelized Finite-Difference Time-Domain approach for electromagnetic simulations

Performance Modelling
Semi-empirical performance modelling for B-CALM
Model ansatz
• Calculation of boundary sites: Δt_bnd ~ Nx·Ny
• Calculation of bulk sites: Δt_bulk ~ Nx·Ny·(Nz/P)
• Communication of the boundary: Δt_net ~ Nx·Ny
• Overlapping calculations and communications: Δt = Δt_bnd + max(Δt_bulk, Δt_net)
Weak scaling measured for fixed Nx·Ny using P = 2 GPUs attached to a single processor, with non-optimized MPI
(A sketch of fitting this ansatz to measured run times is given after the Conclusions.)

Future Opportunities
Challenging applications with large memory capacity and high bandwidth requirements
• High-bandwidth, smaller-capacity memory attached to the GPU
• Large-capacity, smaller-bandwidth memory attached to the CPU
Example: the BigBrain project at FZ Jülich
• Goal: 3d brain model reconstructed from 2d slices
• Computational challenge: image registration
• Compute-intensive calculation of the mutual information metric
• Large capacity required for storing high-resolution images
A significant slow-down was found when using host memory on today's architectures [A. Adinets et al., HeteroPar 2013]; a large benefit is expected from NVLink.

Conclusions
• OpenPOWER opens important opportunities for HPC infrastructure providers
  • Exascale challenges are addressed
• No problems porting GPU-enabled applications to OpenPOWER
  • Room for optimizations
• Support for porting more applications is required
  • Based on performance characterization and modelling
  • Optimization and code restructuring within the PADC
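Backup: the sketch below (Python, not from the talk; the timing numbers are synthetic placeholders) illustrates the semi-empirical modelling step for the B-CALM ansatz Δt = Δt_bnd + max(Δt_bulk, Δt_net). For fixed Nx·Ny and Nz this reduces to Δt(P) = a + max(b/P, c), which is fitted to run-time measurements, after which the fitted parameters are checked for plausibility.

# Sketch: fit the B-CALM-style scaling ansatz dt(P) = a + max(b/P, c)
# to (synthetic) timing data, then inspect the fitted parameters.
import numpy as np
from scipy.optimize import curve_fit

def model(P, a, b, c):
    # a: boundary computation, b/P: bulk computation, c: halo communication
    return a + np.maximum(b / P, c)

P = np.array([1, 2, 4, 8, 16, 32], dtype=float)
# Synthetic "measurements" in seconds (placeholders, not data from the talk)
t_meas = np.array([10.4, 5.5, 2.9, 1.7, 1.3, 1.3])

popt, pcov = curve_fit(model, P, t_meas, p0=[0.5, 10.0, 1.0])
a, b, c = popt
print(f"a = {a:.2f} s, b = {b:.2f} s, c = {c:.2f} s")
# Plausibility check: the bulk term should dominate at small P, and the
# crossover P* = b / c marks where communication stops being hidden.
print(f"crossover at P* ≈ {b/c:.1f} GPUs")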