ENZO AND EXTREME SCALE AMR FOR HYDRODYNAMIC COSMOLOGY
Michael L. Norman, UC San Diego and SDSC
mlnorman@ucsd.edu

WHAT IS ENZO?
- A parallel AMR application for astrophysics and cosmology simulations
- Under continuous development since 1994 (Greg Bryan and Mike Norman @ NCSA)
- Hybrid physics: fluid + particle + gravity + radiation
- Block-structured AMR
- MPI or hybrid parallelism: shared memory -> distributed memory -> hierarchical memory
- C++/C/Fortran, >185,000 lines of code
- Community code in widespread use worldwide: hundreds of users, dozens of developers
- Version 2.0 @ http://enzo.googlecode.com

TWO PRIMARY APPLICATION DOMAINS
- Astrophysical fluid dynamics: supersonic turbulence
- Hydrodynamic cosmology: large scale structure

ENZO PHYSICS
Physics | Equations | Math type | Algorithm(s) | Communication
Dark matter | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter
Gravity | Poisson | Elliptic | FFT, multigrid | Global
Gas dynamics | Euler | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor
Magnetic fields | Ideal MHD | Nonlinear hyperbolic | Explicit finite volume | Nearest neighbor
Radiation transport | Flux-limited radiation diffusion | Nonlinear parabolic | Implicit finite difference, multigrid solves | Global
Multispecies chemistry | Kinetic equations | Coupled stiff ODEs | BE, explicit/implicit | None
Inertial, tracer, source, and sink particles | Newtonian N-body | Numerical integration | Particle-mesh | Gather-scatter
Physics modules can be used in any combination in 1D, 2D, and 3D, making ENZO a very powerful and versatile code.

ENZO MESHING
- Berger-Colella structured AMR
- Cartesian base grid and subgrids
- Hierarchical timestepping
- AMR = collection of grids (patches) at Levels 0, 1, 2, ...; each grid is a C++ object
- Unigrid = collection of Level 0 grid patches

EVOLUTION OF ENZO PARALLELISM
Shared memory (PowerC) parallel (1994-1998)
- SMP and DSM architectures (SGI Origin 2000, Altix)
- Parallel DO across grids at a given refinement level, including the block-decomposed base grid
- O(10,000) grids

Distributed memory (MPI) parallel (1998-2008)
- MPP and SMP cluster architectures (e.g., IBM PowerN)
- Level 0 grid partitioned across processors
- Level >0 grids within a processor executed sequentially
- Dynamic load balancing by messaging grids to underloaded processors (greedy load balancing)
- O(100,000) grids
- 1 MPI task per processor; task = a Level 0 grid patch and all associated subgrids, processed sequentially across and within levels
[Figure: projection of refinement levels; 160,000 grid patches at 4 refinement levels]

EVOLUTION OF ENZO PARALLELISM
Hierarchical memory (MPI+OpenMP) parallel (2008-)
- SMP and multicore cluster architectures (Sun Constellation, Cray XT4/5)
- Level 0 grid partitioned across shared memory nodes/multicore processors
- Parallel DO across grids at a given refinement level within a node
- Dynamic load balancing less critical because of larger MPI task granularity (statistical load balancing)
- O(1,000,000) grids
- N MPI tasks per SMP, M OpenMP threads per task
- Task = a Level 0 grid patch and all associated subgrids, processed concurrently within levels and sequentially across levels
- Each grid is processed by an OpenMP thread

ENZO ON PETASCALE PLATFORMS

ENZO ON CRAY XT5 (NON-AMR)
- Non-AMR 6400³, 80 Mpc box
[Figure: 1% of the 6400³ simulation]
- 15,625 (25³) MPI tasks, 256³ root grid tiles
- 6 OpenMP threads per task -> 93,750 cores
- 30 TB per checkpoint/restart/data dump
- >15 GB/sec read, >7 GB/sec write
- Benefit of threading: reduce MPI overhead and improve disk I/O
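The MPI+OpenMP organization described above (one MPI task per level-0 root-grid tile, a parallel DO over that task's grids within each refinement level, subcycled timesteps on finer levels) can be sketched as below. This is a minimal illustration, not ENZO source: Grid, Hierarchy, and EvolveLevel are simplified placeholders for ENZO's actual classes and call structure, and ghost-zone exchange is only indicated by a comment.

// Minimal sketch (not ENZO source) of the MPI+OpenMP pattern: each MPI task
// owns one level-0 tile plus its subgrids; within a refinement level the
// task's grids are advanced concurrently by OpenMP threads.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

struct Grid {                       // stand-in for ENZO's C++ grid object
  int level;
  void Advance(double dt) { (void)dt; /* explicit finite-volume update here */ }
};

using Hierarchy = std::vector<std::vector<Grid*>>;  // locally owned grids, by level

// Advance one level, then subcycle the finer levels (hierarchical timestepping:
// level L+1 takes two half-size steps per level-L step).
void EvolveLevel(Hierarchy& h, int level, double dt) {
  auto& grids = h[level];
  #pragma omp parallel for schedule(dynamic)  // parallel DO across grids on a level
  for (long i = 0; i < (long)grids.size(); ++i)
    grids[i]->Advance(dt);
  // (nearest-neighbor ghost-zone exchange with other MPI tasks would go here)
  if (level + 1 < (int)h.size()) {
    EvolveLevel(h, level + 1, 0.5 * dt);
    EvolveLevel(h, level + 1, 0.5 * dt);
  }
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);           // N MPI tasks, one per level-0 root-grid tile
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  Hierarchy local(3);               // levels 0..2 owned by this task (empty here)
  EvolveLevel(local, 0, 1.0);       // one root-level timestep

  if (rank == 0)
    std::printf("OpenMP threads per task: %d\n", omp_get_max_threads());
  MPI_Finalize();
  return 0;
}

In this pattern the OpenMP thread count per task is a free parameter (6 threads per task in the 6400³ run above), which is how threading reduces the number of MPI ranks without reducing the core count.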
ENZO ON CRAY XT5 (AMR)
- AMR 1024³, 50 Mpc box, 7 levels of refinement (10⁵ spatial dynamic range)
- 4,096 (16³) MPI tasks, 64³ root grid tiles
- 1 to 6 OpenMP threads per task -> 4,096 to 24,576 cores
- Thread count increases with memory growth
- Benefit of threading: reduce replication of grid hierarchy data
- Using MPI+threads to access more RAM as the AMR calculation grows in size

ENZO-RHD ON CRAY XT5: COSMIC REIONIZATION
- Including radiation transport (10x more expensive)
- Non-AMR 1024³, 8 and 16 Mpc boxes
- 4,096 (16³) MPI tasks, 64³ root grid tiles
- LLNL Hypre multigrid solver dominates run time; near-ideal scaling to at least 32K MPI tasks

BLUE WATERS TARGET SIMULATION: RE-IONIZING THE UNIVERSE
- Cosmic reionization is a weak-scaling problem: large volumes at a fixed resolution to span the range of scales
- Non-AMR 4096³ with ENZO-RHD
- Hybrid MPI and OpenMP; SMT and SIMD tuning
- 128³ to 256³ root grid tiles
- 4-8 OpenMP threads per task
- 4-8 TB per checkpoint/restart/data dump (HDF5)
- In-core intermediate checkpoints (?)
- 64-bit arithmetic, 64-bit integers and pointers
- Aiming for 64-128K cores, 20-40M hours (?)

PETASCALE AND BEYOND
- ENZO's AMR infrastructure limits scalability to O(10⁴) cores
- We are developing a new, extremely scalable AMR infrastructure called Cello: http://lca.ucsd.edu/projects/cello
- ENZO-P will be implemented on top of Cello to scale to millions of cores

CURRENT CAPABILITIES: AMR VS TREECODE

CELLO EXTREME AMR FRAMEWORK: DESIGN PRINCIPLES
- Hierarchical parallelism and load balancing to improve localization
- Relax global synchronization to a minimum
- Flexible mapping between data structures and concurrency
- Object-oriented design
- Build on the best available software for fault-tolerant, dynamically scheduled concurrent objects (Charm++)

CELLO EXTREME AMR FRAMEWORK: APPROACH AND SOLUTIONS
1. Hybrid replicated/distributed octree-based AMR approach, with novel modifications to improve AMR scaling in terms of both size and depth (see the sketch after this list)
2. Patch-local adaptive time steps
3. Flexible hybrid parallelization strategies
4. Hierarchical load balancing approach based on actual performance measurements
5. Dynamical task scheduling and communication
6. Flexible reorganization of AMR data in memory to permit independent optimization of computation, communication, and storage
7. Variable AMR grid block sizes while keeping parallel task sizes fixed
8. Address numerical precision and range issues that arise in particularly deep AMR hierarchies
9. Detect and handle hardware or software faults during run-time to improve software resilience and enable software self-management
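Item 1 refers to octree-based AMR. The following is a minimal, hypothetical sketch of an octree of fixed-size blocks with a pluggable refinement criterion; it is illustrative only and does not reflect Cello's actual classes, which are built on Charm++ concurrent objects. All names (Block, adapt, needs_refinement) are invented for this sketch.

// Illustrative octree-of-blocks sketch (not Cello code): each node is a
// fixed-size block of cells; refining a node creates 8 child blocks, one per
// octant of the parent's volume.
#include <array>
#include <cstdio>
#include <functional>
#include <memory>

struct Block {
  int level;                                    // depth in the octree (0 = root)
  double x, y, z, width;                        // block extent in the domain
  std::array<std::unique_ptr<Block>, 8> child;  // all empty for a leaf block

  bool is_leaf() const { return !child[0]; }

  // Split this block into 8 children covering its octants.
  void refine() {
    const double h = 0.5 * width;
    for (int i = 0; i < 8; ++i)
      child[i] = std::make_unique<Block>(Block{
          level + 1,
          x + h * (i & 1),
          y + h * ((i >> 1) & 1),
          z + h * ((i >> 2) & 1),
          h,
          {}});
  }
};

// Recursively refine wherever a user-supplied criterion asks for it,
// down to a maximum depth.
void adapt(Block& b, int max_level,
           const std::function<bool(const Block&)>& needs_refinement) {
  if (b.is_leaf() && b.level < max_level && needs_refinement(b)) b.refine();
  if (!b.is_leaf())
    for (auto& c : b.child) adapt(*c, max_level, needs_refinement);
}

int count_leaves(const Block& b) {
  if (b.is_leaf()) return 1;
  int n = 0;
  for (const auto& c : b.child) n += count_leaves(*c);
  return n;
}

int main() {
  Block root{0, 0.0, 0.0, 0.0, 1.0, {}};
  // Toy criterion: refine blocks touching the x < 0.25 edge of the unit domain.
  adapt(root, 3, [](const Block& b) { return b.x < 0.25; });
  std::printf("leaf blocks: %d\n", count_leaves(root));
  return 0;
}

Keeping each block a fixed-size array of cells is what allows parallel task sizes to stay constant while the mesh deepens, the idea behind item 7 in the list above.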
IMPROVING THE AMR MESH: PATCH COALESCING

IMPROVING THE AMR MESH: TARGETED REFINEMENT

IMPROVING THE AMR MESH: TARGETED REFINEMENT WITH BACKFILL

CELLO SOFTWARE COMPONENTS
http://lca.ucsd.edu/projects/cello

ROADMAP

ENZO RESOURCES
- Enzo website (code, documentation): http://lca.ucsd.edu/projects/enzo
- 2010 Enzo User Workshop slides: http://lca.ucsd.edu/workshops/enzo2010
- yt website (analysis and visualization): http://yt.enzotools.org
- Jacques website (analysis and visualization): http://jacques.enzotools.org/doc/Jacques/Jacques.html

BACKUP SLIDES

GRID HIERARCHY DATA STRUCTURE
[Figure: Level 0, 1, and 2 grid patches and the corresponding tree of (level, index) nodes: (0,0); (1,0); (2,0), (2,1)]

[Figure: scaling the AMR grid hierarchy in depth (level) and breadth (# siblings)]

1024³, 7-LEVEL AMR STATS
Level | Grids   | Memory (MB) | Work = Mem*(2^level)
0     |     512 |     179,029 |  179,029
1     | 223,275 |     114,629 |  229,258
2     |  51,522 |      21,226 |   84,904
3     |  17,448 |       6,085 |   48,680
4     |   7,216 |       1,975 |   31,600
5     |   3,370 |       1,006 |   32,192
6     |   1,674 |         599 |   38,336
7     |     794 |         311 |   39,808
Total | 305,881 |     324,860 |  683,807

Current MPI Implementation
- Real grid object: grid metadata + physics data
- Virtual grid object: grid metadata only

SCALING AMR GRID HIERARCHY
- The flat MPI implementation is not scalable because grid hierarchy metadata is replicated in every processor
- For very large grid counts, this metadata dominates the memory requirement (not the physics data!)
- The hybrid parallel implementation helps a lot: hierarchy metadata is replicated only in every SMP node instead of every processor
- We would prefer fewer SMP nodes (8,192-4,096) with bigger core counts (32-64) (= 262,144 cores)
- The communication burden is partially shifted from MPI to intra-node memory accesses

CELLO EXTREME AMR FRAMEWORK
- Targeted at fluid, particle, or hybrid (fluid + particle) simulations on millions of cores
- Generic AMR scaling issues:
  - Small AMR patches restrict available parallelism
  - Dynamic load balancing
  - Maintaining data locality for deep hierarchies
  - Re-meshing efficiency and scalability
  - Inherently global multilevel elliptic solves
  - Increased range and precision requirements for deep hierarchies
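To make the metadata-replication issue in the backup slides concrete, here is a hypothetical C++ sketch of the real/virtual grid-object split from the "Current MPI Implementation" slide: every task holds lightweight metadata for all grids in the hierarchy, but allocates the bulky physics arrays only for the grids it owns. Class and field names (GridMetadata, attach_data, owner_rank) are illustrative, not ENZO's.

// Hypothetical sketch: hierarchy metadata is replicated on every task, while
// physics arrays exist only on the owning task ("real" grids); everywhere else
// the grid is a metadata-only "virtual" object.
#include <cstdio>
#include <vector>

struct GridMetadata {        // small, replicated on every task
  int level;
  int dims[3];               // cells per axis
  double left[3], right[3];  // grid extent
  int owner_rank;            // which MPI task holds the physics data
};

struct Grid {
  GridMetadata meta;
  std::vector<double> physics_data;   // stays empty for "virtual" grids

  bool is_real(int my_rank) const { return meta.owner_rank == my_rank; }

  // Allocate field storage only if this task owns the grid.
  void attach_data(int my_rank, int num_fields) {
    if (is_real(my_rank)) {
      long cells = (long)meta.dims[0] * meta.dims[1] * meta.dims[2];
      physics_data.resize(cells * num_fields, 0.0);
    }
  }
};

int main() {
  const int my_rank = 0;               // pretend we are MPI rank 0
  // Two grids in the (replicated) hierarchy; rank 0 owns only the first one.
  std::vector<Grid> hierarchy = {
      {{0, {64, 64, 64}, {0, 0, 0}, {1, 1, 1}, 0}, {}},
      {{1, {64, 64, 64}, {0, 0, 0}, {0.5, 0.5, 0.5}, 1}, {}},
  };
  for (auto& g : hierarchy) g.attach_data(my_rank, 5);   // e.g. 5 field variables

  for (const auto& g : hierarchy)
    std::printf("level %d grid: %s, %zu doubles of physics data\n",
                g.meta.level, g.is_real(my_rank) ? "real" : "virtual",
                g.physics_data.size());
  return 0;
}

With hundreds of thousands of grids, the replicated metadata alone becomes the memory bottleneck, which is why the hybrid MPI+OpenMP version replicates it once per SMP node rather than once per core.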