The PRACE project and the Application Development Programme (WP8-2IP)
Claudio Gheller (ETH-CSCS)

PRACE - Partnership for Advanced Computing in Europe
• PRACE has the aim of creating a European Research Infrastructure providing world-class systems and services and coordinating their use throughout Europe.

PRACE History – An Ongoing Success Story
• Timeline 2004–2013 (figure): HPC enters the ESFRI Roadmap and a vision involving 15 European countries is created (HPCEUR); the Scientific Case is created and the MoU is signed (HET); the PRACE Initiative follows, leading to the creation of the PRACE Research Infrastructure (PRACE RI) and to the PRACE Preparatory Phase, PRACE-1IP, PRACE-2IP and PRACE-3IP projects.

PRACE-2IP
• 22 partners (21 countries), funding 18 million €
• Preparation/coordination: FZJ/JSC/PRACE PMO
• 1.9.2011 – 31.8.2013, extended to 31.8.2014 (only selected WPs)
• Main objectives:
– Provision of access to HPC resources
– Refactoring and scaling of major user codes
– Tier-1 integration (DEISA → PRACE)
– Consolidation of the Research Infrastructure

PRACE-3IP
• Funding 20 million €
• Started: summer 2013
• Objectives:
– Provision of access to HPC resources
– Planned: pre-commercial procurement exercise
– Planned: industry application focus

Access to Tier-0 supercomputers
• Open Call for Proposals (~2 months)
• Technical Peer Review by technical experts in PRACE systems and software, and Scientific Peer Review by researchers with expertise in the scientific field of the proposal (~3 months)
• Prioritisation and resource allocation by the Access Committee; the PRACE director decides on the proposal of the Access Committee
• Project (~1 year) + final report

Distribution of resources
• Pie chart of awarded resources by scientific domain: Astrophysics, Chemistry and Materials, Earth Sciences and Environment, Engineering and Energy, Fundamental Physics, Mathematics and Computer Science, Medicine and Life Sciences.

PRACE-2IP WP8: Enabling Scientific Codes for the Next Generation of HPC Systems

PRACE-2IP work packages
– WP1 Management
– WP2 Framework for Resource Interchange
– WP3 Dissemination
– WP4 Training
– WP5 Best Practices for HPC Systems Commissioning
– WP6 European HPC Infrastructure Operation and Evolution
– WP7 Scaling Applications for Tier-0 and Tier-1 Users
– WP8 Community Code Scaling (ETH leading the WP)
– WP9 Industrial Application Support
– WP10 Advancing the Operational Infrastructure
– WP11 Prototyping
– WP12 Novel Programming Techniques

WP8: involved centers (figure)

WP8 objectives
• Initiate a sustainable programme of application development for the coming generation of supercomputing architectures, with a selection of community codes targeted at problems of high scientific impact that require HPC.
• Refactor community codes in order to optimally map applications to future supercomputing architectures.
• Integrate and validate these new developments within the existing application communities.
WP8 principles
• The scientific communities, with their high-end research challenges, are the main drivers for software development;
• Synergy between HPC experts and application developers from the communities;
• Supercomputing centers have to recast their service activities in order to support, guide and enable scientific program developers and researchers in refactoring codes and re-engineering algorithms;
• A strong commitment from the scientific community has to be secured.

WP8 workflow
• Task 1: scientific domains and communities selection, scientific communities engagement, codes screening, codes performance analysis and modelling, communities build-up, codes and kernels selection
• Task 2: codes refactoring, communities consolidation, prototypes experimentation
• Task 3: code validation and re-integration

Communities selection (task 1)
• The candidate community must have a high impact on science and/or society;
• The candidate community must rely on and leverage high performance computing;
• WP8 can have a high impact on the candidate community;
• The candidate community must be willing to actively invest in software refactoring and algorithm re-engineering.
Selected domains:
o Astrophysics
o Climate
o Material Science
o Particle Physics
o Engineering

Codes and kernels selection methodology (task 1)
• Performance modelling methodology
• An objective and quantitative way to select codes and estimate possible performance improvements
– The goal of performance modelling is to gain insight into an application's performance on a given computer system.
– This is achieved first by measurement and analysis, and then by the synthesis of the application and computing-system characteristics.
– The model also represents a predictive tool, estimating the behaviour on a different computing architecture and identifying the most promising areas for performance improvement.
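To make the performance-modelling idea concrete, the minimal sketch below combines measured per-kernel time fractions with assumed speed-up factors on a target architecture, following the "measure and analyse, then synthesise and predict" scheme described above. The kernel names, time fractions and speed-up values are purely hypothetical placeholders, not WP8 measurements.

#include <cstdio>

// Minimal, hypothetical performance-model sketch (not WP8 data): measured
// per-kernel time fractions on the current system are combined with assumed
// speed-up factors on a target architecture to predict the overall gain and
// to identify the most promising kernels to refactor.
int main() {
    const char*  kernels[]  = { "Hydro", "Gravity", "MPI" };
    const double fraction[] = { 0.45, 0.35, 0.20 };  // measured share of total runtime (hypothetical)
    const double speedup[]  = { 4.0, 1.5, 1.0 };     // assumed gain on the target system (hypothetical)

    double predicted = 0.0;  // predicted runtime as a fraction of the original one
    for (int k = 0; k < 3; ++k) {
        const double t = fraction[k] / speedup[k];
        predicted += t;
        printf("%-8s %.2f of runtime -> %.2f after refactoring\n", kernels[k], fraction[k], t);
    }
    printf("predicted overall speed-up: %.2fx\n", 1.0 / predicted);
    return 0;
}

A model of this kind shows at a glance where the remaining bottlenecks would sit after a refactoring effort, which is the kind of quantitative insight used in Task 1 to select codes and kernels.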
Selected codes and institutions in charge (task 1)
Code – Institution
RAMSES – ETH
PFARM – STFC
EAF-PAMR – UC-LCA
OASIS – CEA
I/O – ICHEC
ICON – ETH
NEMO – STFC
Fluidity/ICOM – STFC
ABINIT – CEA
QuantumESPRESSO – CINECA
YAMBO – UC-LCA
SIESTA – BSC
OCTOPUS – UC-LCA
EXCITING/ELK – ETH
PLQCD – CASTORC
ELMER – VSB-TUO
CODE_SATURN – STFC
ALYA – BSC
ZFS – HLRS

Codes refactoring (task 2)
• Still running (last few weeks)
• Specific code kernels are being re-designed and re-implemented according to the work plans defined in task 1
• Each group works independently
• Check points at face-to-face workshops and all-hands meetings
• A dedicated wiki web site was implemented to report progress, to collect and exchange information and documents, and to manage and release the implemented code: http://prace2ip-wp8.hpcforge.org

Codes validation and re-introduction (task 3)
• Collaborative work (on a daily basis) involving developers and HPC experts
• Dedicated workshops
• Face-to-face meetings
• Participation and contribution to conferences
• In this way, no special code re-integration procedure was actually needed

Case study: RAMSES
• The RAMSES code was developed to study the evolution of the large-scale structure of the universe and the process of galaxy formation.
• Adaptive mesh refinement (AMR) multi-species code: baryons (hydrodynamics) plus dark matter (N-body)
• Gravity couples the two components and is solved by a multigrid approach
• Other components are supported (e.g. MHD, radiative transfer), but they are not the subject of our analysis

Performance analysis example
• Parallel profiling, large test (512^3 base grid, 9 refinement levels, 250 GB): strong scaling
• For this test, communication becomes the most relevant part, and it is dominated by synchronizations, due to the difficulties in load balancing the AMR-multigrid algorithms
• Strong improvements can be obtained by tuning the load balance among computational elements (nodes)

Performance analysis: conclusions
The performance analysis identified the critical kernels of the code:
• Hydro: all the functions needed to solve the hydrodynamic problem. These include the functions that collect, from grids at different resolutions, the data necessary to update each single cell, those that calculate the fluxes to solve the conservation equations, the Riemann solvers and the finite-volume solvers.
• Gravity: the functions needed to calculate the gravitational potential at different resolutions using a multigrid relaxation approach.
• MPI: all the communication-related MPI calls (data transfer, synchronisation, management).

HPC architectures model (figure)

Performance improvements
Two main objectives:
1. Hybrid OpenMP+MPI parallelisation, to exploit systems with distributed nodes, each consisting of cores with shared memory
2. Exploitation of accelerators, in particular GPUs, adopting different paradigms (CUDA, OpenCL, directives)
From the analysis of the performance and of the characteristics of the kernels under investigation we can say that:
• The Hydro kernel is suitable for both approaches. Specific care must be taken with memory access issues.
• The Gravity kernel can benefit from the hybrid implementation.
• Due to the multigrid structure, however, an efficient GPU version can be particularly challenging, so it will be considered only if time and resources permit.

Performance modelling
• Hybrid version (trivial modelling): T_hybrid = T_MPI · e_MPI(N_tot) / (e_OMP(N_cores) · e_MPI(N_nodes)), where e_MPI and e_OMP are the MPI and OpenMP parallel efficiencies
• GPU version: T_tot = T_CPU + T_CPU-GPU + T_GPU-GPU + T_GPU (CPU compute, CPU-GPU transfer, GPU-GPU communication and GPU compute times)

Performance model example (figure)

Results: hybrid code (OpenMP+MPI) (figure)

GPU implementation – approach 1
• Step 1: the input data for a vector of NVECTOR cells is prepared on the CPU and copied to the GPU
• Step 2: solve the Hydro equations for each cell i,j,k
• Step 3: compose the results array (new Hydro variables)
• Step 4: copy the results back to the CPU

Sedov blast wave test (hydro only, unigrid), times in seconds:
ACC          NVECTOR   Ttot    Tacc    Ttransf   Eff. speed-up
OFF (1 PE)   10        94.54   0       0         –
ON           512       55.83   38.22   9.2       2.01
ON           1024      45.66   29.27   9.2       2.67
ON           2048      42.08   25.36   9.2       3.07
ON           4096      41.32   23.20   9.2       3.29
ON           8192      41.19   23.15   9.2       3.30
20 GB transferred in/out (constant overhead)

Performance pitfalls
• Amount of transferred data
– Overhead increasing linearly with the data size
• Data structure, irregular data distribution
– Prevents any asynchronous operation: no overlap of computation and data transfer
– Ineffective memory access
– Prevents coalesced memory access
• Low flops-per-byte ratio
– This is intrinsic to the algorithm
• Asynchronous operations not permitted
– See above
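The sketch below illustrates this first, fully synchronous offload scheme in CUDA: for every batch of NVECTOR cells the input is copied to the GPU, a kernel updates the cells, and the results are copied back, with no possibility of overlapping transfers and computation. The kernel body, the number of variables per cell and the batch count are hypothetical placeholders, not the actual RAMSES implementation.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define NVECTOR 2048   // cells per batch (hypothetical, cf. the NVECTOR column in the table)
#define NVAR    5      // hydro variables per cell (hypothetical)

// Placeholder "solver": one thread updates the variables of one cell.
// The real RAMSES kernels (flux computation, Riemann solver, ...) are far more involved.
__global__ void solve_hydro(const double* uin, double* uout, int ncell) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= ncell) return;
    for (int v = 0; v < NVAR; ++v)
        uout[i + v * ncell] = uin[i + v * ncell];   // dummy update
}

int main() {
    const int    ncell = NVECTOR;
    const size_t bytes = (size_t)ncell * NVAR * sizeof(double);

    double* h_in  = (double*)malloc(bytes);
    double* h_out = (double*)malloc(bytes);
    for (size_t i = 0; i < (size_t)ncell * NVAR; ++i) h_in[i] = 1.0;

    double *d_in, *d_out;
    cudaMalloc((void**)&d_in,  bytes);
    cudaMalloc((void**)&d_out, bytes);

    // Approach 1: every batch performs a blocking copy-in, a kernel launch and a
    // blocking copy-out, so transfers and computation can never overlap.
    for (int batch = 0; batch < 4; ++batch) {                            // 4 hypothetical batches
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);           // Step 1: input to the GPU
        solve_hydro<<<(ncell + 255) / 256, 256>>>(d_in, d_out, ncell);   // Steps 2-3: update cells
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);         // Step 4: results back
    }
    cudaDeviceSynchronize();
    printf("processed 4 batches of %d cells\n", ncell);

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}

The blocking cudaMemcpy calls are what the performance pitfalls above refer to: the transfer cost grows with the data volume and can never be hidden behind computation.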
GPU implementation – approach 2
• Data is moved to and from the GPU in chunks
• Data transfer and computation can be overlapped
• Step 1: compose the data chunks (hydro variables, gravitational forces and other quantities) in CPU memory; data chunks are the basic building blocks of the RAMSES AMR hierarchy: OCTs and their refinements
• Step 2: copy multiple data chunks to the GPU
• Step 3: solve the Hydro equations for chunks N, M, …
• Step 4: compose the results array (new Hydro variables) and copy it back to the CPU

Advantages over the previous implementation
• Data is regularly distributed within each chunk and its access is efficient; improved flops-per-byte ratio
• Effective usage of the GPU computing architecture
• Data re-organisation is performed on the CPU and its overhead is hidden by asynchronous processes
• Data transfer overhead almost completely hidden
• AMR naturally supported
• Drawback: a much more complex implementation

Conclusions
• PRACE is providing European scientists with top-level HPC services
• PRACE-2IP WP8 successfully introduced a methodology for code development relying on a close synergy between scientists, community code developers and HPC experts
• Many community codes have been re-designed and re-implemented to exploit novel HPC architectures (see http://prace2ip-wp8.hpcforge.org/ for details)
• Most of the WP8 results are already available to the communities
• WP8 is going to be extended for one more year (there is no similar activity in PRACE-3IP)
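As a closing illustration of the chunk-based approach 2 described above, the sketch below shows the standard CUDA-streams pattern that enables the overlap of data transfer and computation: chunks are staged in pinned host memory and moved with asynchronous copies on independent streams, so one chunk can be computed while another is still in flight. The chunk size, number of streams and kernel body are hypothetical placeholders, not the actual RAMSES code.

#include <cstdio>
#include <cuda_runtime.h>

#define NSTREAMS   4      // concurrent streams (hypothetical)
#define CHUNKCELLS 4096   // cells per chunk (hypothetical)
#define NCHUNKS    16     // chunks to process (hypothetical)

// Placeholder per-chunk "solver" standing in for the hydro update of one chunk.
__global__ void solve_chunk(const double* in, double* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // dummy update
}

int main() {
    const size_t chunk_bytes = CHUNKCELLS * sizeof(double);

    // Pinned (page-locked) host buffers are required for truly asynchronous copies.
    double *h_in, *h_out;
    cudaMallocHost((void**)&h_in,  NCHUNKS * chunk_bytes);
    cudaMallocHost((void**)&h_out, NCHUNKS * chunk_bytes);
    for (int i = 0; i < NCHUNKS * CHUNKCELLS; ++i) h_in[i] = 1.0;

    // One device buffer pair and one stream per "lane"; chunks are cycled over them.
    double *d_in[NSTREAMS], *d_out[NSTREAMS];
    cudaStream_t stream[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) {
        cudaMalloc((void**)&d_in[s],  chunk_bytes);
        cudaMalloc((void**)&d_out[s], chunk_bytes);
        cudaStreamCreate(&stream[s]);
    }

    // While one stream computes a chunk, other streams can already be copying
    // the next chunks in or earlier results out: transfer/computation overlap.
    for (int c = 0; c < NCHUNKS; ++c) {
        const int s = c % NSTREAMS;
        double* src = h_in  + (size_t)c * CHUNKCELLS;
        double* dst = h_out + (size_t)c * CHUNKCELLS;
        cudaMemcpyAsync(d_in[s], src, chunk_bytes, cudaMemcpyHostToDevice, stream[s]);
        solve_chunk<<<(CHUNKCELLS + 255) / 256, 256, 0, stream[s]>>>(d_in[s], d_out[s], CHUNKCELLS);
        cudaMemcpyAsync(dst, d_out[s], chunk_bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    printf("processed %d chunks on %d streams\n", NCHUNKS, NSTREAMS);

    for (int s = 0; s < NSTREAMS; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFree(d_in[s]); cudaFree(d_out[s]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}

In a real AMR code the chunk composition on the CPU (Step 1 above) can proceed concurrently with these asynchronous operations, which is how its overhead is hidden, as noted in the advantages list.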