ENZO AND EXTREME SCALE AMR FOR HYDRODYNAMIC COSMOLOGY

Michael L. Norman, UC San Diego and SDSC
mlnorman@ucsd.edu
WHAT IS ENZO?

- A parallel AMR application for astrophysics and cosmology simulations
  - Block-structured AMR
  - Hybrid physics: fluid + particle + gravity + radiation
  - MPI or hybrid parallelism
- Under continuous development since 1994
  - Greg Bryan and Mike Norman @ NCSA
  - Shared memory → distributed memory → hierarchical memory
  - C++/C/Fortran, >185,000 lines of code
- Community code in widespread use worldwide
  - Hundreds of users, dozens of developers
  - Version 2.0 @ http://enzo.googlecode.com
TWO PRIMARY APPLICATION DOMAINS

- Astrophysical fluid dynamics: supersonic turbulence
- Hydrodynamic cosmology: large-scale structure
ENZO PHYSICS

Physics                                      | Equations                                         | Math type             | Algorithm(s)                                 | Communication
Dark matter                                  | Newtonian N-body                                  | Numerical integration | Particle-mesh                                | Gather-scatter
Gravity                                      | Poisson                                           | Elliptic              | FFT, multigrid                               | Global
Gas dynamics                                 | Euler                                             | Nonlinear hyperbolic  | Explicit finite volume                       | Nearest neighbor
Magnetic fields                              | Ideal MHD                                         | Nonlinear hyperbolic  | Explicit finite volume                       | Nearest neighbor
Radiation transport                          | Flux-limited radiation diffusion in 1D, 2D and 3D | Nonlinear parabolic   | Implicit finite difference, multigrid solves | Global
Multispecies chemistry                       | Kinetic equations                                 | Coupled stiff ODEs    | BE, explicit, implicit                       | None
Inertial, tracer, source, and sink particles | Newtonian N-body                                  | Numerical integration | Particle-mesh                                | Gather-scatter

Physics modules can be used in any combination, making ENZO a very powerful and versatile code.
ENZO MESHING

- Berger-Colella structured AMR
- Cartesian base grid and subgrids
- Hierarchical timestepping
- AMR = collection of grids (patches); each grid is a C++ object (see the sketch below)
- Unigrid = collection of Level 0 grid patches

[Figure: nested Level 0 / Level 1 / Level 2 grid patches]
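Since each grid is an object and finer levels subcycle in time, the meshing scheme can be pictured with a short sketch. This is illustrative only (not Enzo's actual classes) and assumes a fixed refinement ratio of 2 in time:

```cpp
// Illustrative sketch only -- not Enzo's actual classes.
// Assumes a refinement ratio of 2 in both space and time.
#include <vector>

struct Grid {                    // one AMR patch; in Enzo every grid is a C++ object
  int level = 0;                 // refinement level (0 = root)
  void Advance(double /*dt*/) { /* explicit finite-volume update of this patch */ }
};

struct Hierarchy {
  std::vector<std::vector<Grid>> grids;   // grids[level] = all patches on that level

  // Hierarchical timestepping: each finer level takes two sub-steps per parent
  // step, so level L advances with dt / 2^L relative to the root grid.
  void EvolveLevel(int level, double dt) {
    for (Grid& g : grids[level]) g.Advance(dt);
    if (level + 1 < (int)grids.size()) {
      EvolveLevel(level + 1, dt / 2);     // first fine sub-step
      EvolveLevel(level + 1, dt / 2);     // second fine sub-step
      // ...then flux correction and restriction of fine data back onto this level
    }
  }
};
```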
EVOLUTION OF ENZO PARALLELISM

- Shared memory (PowerC) parallel (1994-1998)
  - SMP and DSM architectures (SGI Origin 2000, Altix)
  - Parallel DO across grids at a given refinement level, including the block-decomposed base grid
  - O(10,000) grids
- Distributed memory (MPI) parallel (1998-2008)
  - MPP and SMP cluster architectures (e.g., IBM PowerN)
  - Level 0 grid partitioned across processors
  - Level >0 grids within a processor executed sequentially
  - Dynamic load balancing by messaging grids to underloaded processors (greedy load balancing; sketched below)
  - O(100,000) grids
  - 1 MPI task per processor; task = a Level 0 grid patch and all associated subgrids, processed sequentially across and within levels

[Figure: projection of refinement levels — 160,000 grid patches at 4 refinement levels]
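A minimal sketch of greedy load balancing in the sense used above: hand the most expensive remaining grid to the currently least-loaded processor. The types, the work metric, and the assumption that grid ids run 0..N-1 are illustrative, not Enzo's implementation:

```cpp
// Illustrative greedy load balancing: assign each grid to the least-loaded rank.
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct GridInfo { int id; double work; };   // 'work' = stand-in cost metric (e.g., cell count)

std::vector<int> GreedyBalance(std::vector<GridInfo> grids, int nranks) {
  // Process the most expensive grids first.
  std::sort(grids.begin(), grids.end(),
            [](const GridInfo& a, const GridInfo& b) { return a.work > b.work; });

  using Load = std::pair<double, int>;      // (accumulated work, rank)
  std::priority_queue<Load, std::vector<Load>, std::greater<Load>> ranks;
  for (int r = 0; r < nranks; ++r) ranks.push({0.0, r});

  std::vector<int> owner(grids.size());     // assumes grid ids 0..N-1
  for (const GridInfo& g : grids) {
    Load least = ranks.top(); ranks.pop();  // least-loaded rank so far
    owner[g.id] = least.second;             // Enzo would message the grid's data to this rank
    least.first += g.work;
    ranks.push(least);
  }
  return owner;
}
```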
EVOLUTION OF ENZO PARALLELISM

- Hierarchical memory (MPI+OpenMP) parallel (2008-)
  - SMP and multicore cluster architectures (Sun Constellation, Cray XT4/5)
  - Level 0 grid partitioned across shared-memory nodes/multicore processors
  - Parallel DO across grids at a given refinement level within a node (see the sketch below)
  - Dynamic load balancing less critical because of larger MPI task granularity (statistical load balancing)
  - O(1,000,000) grids
  - N MPI tasks per SMP node, M OpenMP threads per task; each grid is processed by an OpenMP thread
  - Task = a Level 0 grid patch and all associated subgrids, processed concurrently within levels and sequentially across levels
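A minimal sketch of the hybrid scheme, assuming one MPI task per level-0 tile: OpenMP threads execute a parallel DO across the grids of one level, and levels are still advanced sequentially. Names are illustrative, not Enzo's actual EvolveLevel:

```cpp
// Hybrid MPI+OpenMP sketch: within one MPI task, the grids of a given refinement
// level are processed concurrently by OpenMP threads; levels remain sequential.
#include <omp.h>
#include <vector>

struct Grid {
  void Advance(double /*dt*/) { /* explicit finite-volume update of one patch */ }
};

void EvolveLevelHybrid(std::vector<Grid>& level_grids, double dt) {
  // Each grid update is independent, so a parallel DO across grids is safe.
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < (int)level_grids.size(); ++i)
    level_grids[i].Advance(dt);            // each grid handled by one OpenMP thread
  // MPI ghost-zone exchange with neighboring level-0 tiles would follow here,
  // and the hierarchy metadata exists once per task rather than once per core.
}
```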
ENZO ON PETASCALE PLATFORMS

ENZO ON CRAY XT5

- Non-AMR 6400³ simulation, 80 Mpc box
  - 15,625 (25³) MPI tasks, 256³ root grid tiles
  - 6 OpenMP threads per task → 93,750 cores
  - 30 TB per checkpoint/restart/data dump
  - >15 GB/sec read, >7 GB/sec write
- Benefit of threading: reduce MPI overhead and improve disk I/O

[Figure: 1% of the 6400³ simulation]
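The numbers above are mutually consistent; a compile-time spot check:

```cpp
// Compile-time check of the 6400^3 run arithmetic quoted above.
static_assert(25 * 256 == 6400,      "25^3 tiles of 256^3 cells tile the 6400^3 root grid");
static_assert(25 * 25 * 25 == 15625, "25^3 = 15,625 MPI tasks");
static_assert(15625 * 6 == 93750,    "15,625 tasks x 6 OpenMP threads = 93,750 cores");
```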
ENZO ON CRAY XT5

- AMR 1024³ simulation, 50 Mpc box, 7 levels of refinement
  - 10⁵ spatial dynamic range
  - 4096 (16³) MPI tasks, 64³ root grid tiles
  - 1 to 6 OpenMP threads per task → 4096 to 24,576 cores
- Benefit of threading
  - Thread count increases with memory growth
  - Reduces replication of grid hierarchy data
  - Using MPI+threads to access more RAM as the AMR calculation grows in size
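The quoted figures follow from the configuration: 16³ tiles of 64³ cells cover the 1024³ root grid, and 7 levels of factor-2 refinement give 1024 × 2⁷ ≈ 1.3 × 10⁵ spatial dynamic range. A compile-time spot check:

```cpp
// Compile-time check of the AMR 1024^3 run arithmetic quoted above.
static_assert(64 * 16 == 1024,           "16^3 tiles of 64^3 cells tile the 1024^3 root grid");
static_assert(16 * 16 * 16 == 4096,      "16^3 = 4,096 MPI tasks");
static_assert(4096 * 6 == 24576,         "4,096 tasks x 6 threads = 24,576 cores");
static_assert(1024 * (1 << 7) == 131072, "root cells x 2^7 refinement ~ 10^5 dynamic range");
```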
ENZO-RHD ON CRAY XT5

- Including radiation transport: 10x more expensive
  - LLNL Hypre multigrid solver dominates run time
  - Near-ideal scaling to at least 32K MPI tasks
- Non-AMR 1024³, 8 and 16 Mpc boxes
  - 4096 (16³) MPI tasks, 64³ root grid tiles

[Figure: cosmic reionization]
BLUE WATERS TARGET SIMULATION: RE-IONIZING THE UNIVERSE

- Cosmic reionization is a weak-scaling problem
  - Large volumes at fixed resolution to span the range of scales
- Non-AMR 4096³ with ENZO-RHD
  - Hybrid MPI and OpenMP
  - SMT and SIMD tuning
  - 128³ to 256³ root grid tiles
  - 4-8 OpenMP threads per task
  - 4-8 TB per checkpoint/restart/data dump (HDF5)
  - In-core intermediate checkpoints (?)
  - 64-bit arithmetic, 64-bit integers and pointers
  - Aiming for 64-128 K cores
  - 20-40 M hours (?)
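A rough, assumption-laden estimate of where the 4-8 TB dump size comes from: one double-precision field on a 4096³ grid is about 550 GB, so a dump of that size holds on the order of 7-15 field variables (the actual field count is not stated here):

```cpp
// Rough checkpoint-size estimate for the 4096^3 target (assumptions noted inline).
#include <cstdio>

int main() {
  const double cells           = 4096.0 * 4096.0 * 4096.0;        // ~6.9e10 cells
  const double bytes_per_value = 8.0;                             // 64-bit arithmetic, per the slide
  const double gb_per_field    = cells * bytes_per_value / 1.0e9; // ~550 GB per field
  // A 4-8 TB dump would then correspond to roughly 7-15 such field variables
  // (an illustrative estimate only; the actual field count is not given).
  std::printf("one 4096^3 double-precision field ~ %.0f GB\n", gb_per_field);
  return 0;
}
```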
PETASCALE AND BEYOND

- ENZO's AMR infrastructure limits scalability to O(10⁴) cores
- We are developing a new, extremely scalable AMR infrastructure called Cello
  - http://lca.ucsd.edu/projects/cello
- ENZO-P will be implemented on top of Cello to scale to the petascale and beyond
CURRENT CAPABILITIES: AMR VS TREECODE
CELLO EXTREME AMR FRAMEWORK: DESIGN PRINCIPLES

- Hierarchical parallelism and load balancing to improve localization
- Relax global synchronization to a minimum
- Flexible mapping between data structures and concurrency
- Object-oriented design
- Build on the best available software for fault-tolerant, dynamically scheduled concurrent objects (Charm++)
CELLO EXTREME AMR FRAMEWORK: APPROACH AND SOLUTIONS

1. Hybrid replicated/distributed octree-based AMR approach, with novel modifications to improve AMR scaling in terms of both size and depth (see the sketch after this list)
2. Patch-local adaptive time steps
3. Flexible hybrid parallelization strategies
4. Hierarchical load balancing approach based on actual performance measurements
5. Dynamical task scheduling and communication
6. Flexible reorganization of AMR data in memory to permit independent optimization of computation, communication, and storage
7. Variable AMR grid block sizes while keeping parallel task sizes fixed
8. Address numerical precision and range issues that arise in particularly deep AMR hierarchies
9. Detect and handle hardware or software faults during run-time to improve software resilience and enable software self-management
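For item 1, a generic illustration of octree-based block indexing, not Cello's actual implementation: a block is identified by its level and the octant chosen at each step down from the root, packed into an integer key that is cheap to replicate or distribute:

```cpp
// Generic octree block index (illustrative only; not Cello's implementation).
#include <cstdint>

struct BlockIndex {
  int           level = 0;   // depth in the octree (0 = root block)
  std::uint64_t path  = 0;   // 3 bits per level: the (x,y,z) child octant at each step

  // Index of the child block in octant (cx, cy, cz), each 0 or 1.
  BlockIndex child(int cx, int cy, int cz) const {
    BlockIndex c;
    c.level = level + 1;
    c.path  = (path << 3) | static_cast<std::uint64_t>((cz << 2) | (cy << 1) | cx);
    return c;
  }

  // Index of the parent block (not meaningful for the root).
  BlockIndex parent() const {
    BlockIndex p;
    p.level = level - 1;
    p.path  = path >> 3;
    return p;
  }
};
```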
IMPROVING THE AMR MESH: PATCH COALESCING

IMPROVING THE AMR MESH: TARGETED REFINEMENT

IMPROVING THE AMR MESH: TARGETED REFINEMENT WITH BACKFILL
CELLO SOFTWARE COMPONENTS
http://lca.ucsd.edu/projects/cello
ROADMAP
ENZO RESOURCES

- Enzo website (code, documentation)
  - http://lca.ucsd.edu/projects/enzo
- 2010 Enzo User Workshop slides
  - http://lca.ucsd.edu/workshops/enzo2010
- yt website (analysis and visualization)
  - http://yt.enzotools.org
- Jacques website (analysis and visualization)
  - http://jacques.enzotools.org/doc/Jacques/Jacques.html
BACKUP SLIDES
GRID HIERARCHY DATA STRUCTURE
[Figure: grid hierarchy tree — Level 0, Level 1, and Level 2 patches with their (level, index) labels]

Scaling the AMR grid hierarchy in depth (level) and breadth (# siblings)
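A hedged sketch of the kind of bookkeeping the diagram suggests, with links for walking the hierarchy in breadth (siblings on a level) and depth (children on the next level). The names are illustrative, not Enzo's exact types:

```cpp
// Illustrative hierarchy bookkeeping (names are assumptions, not Enzo's exact code).
struct Grid;                                  // the patch data (fields, particles, ...)

struct HierarchyEntry {
  Grid*           grid_data        = nullptr; // physics data lives here
  HierarchyEntry* parent           = nullptr; // the coarser patch containing this one
  HierarchyEntry* next_this_level  = nullptr; // breadth: next sibling on the same level
  HierarchyEntry* next_finer_level = nullptr; // depth: first child on the next level
};

// Example walk in breadth: count the grids on one level.
inline int CountLevel(const HierarchyEntry* first) {
  int n = 0;
  for (const HierarchyEntry* e = first; e != nullptr; e = e->next_this_level) ++n;
  return n;
}
```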
1024³, 7-LEVEL AMR STATS

Level | Grids   | Memory (MB) | Work = Mem × 2^level
0     | 512     | 179,029     | 179,029
1     | 223,275 | 114,629     | 229,258
2     | 51,522  | 21,226      | 84,904
3     | 17,448  | 6,085       | 48,680
4     | 7,216   | 1,975       | 31,600
5     | 3,370   | 1,006       | 32,192
6     | 1,674   | 599         | 38,336
7     | 794     | 311         | 39,808
Total | 305,881 | 324,860     | 683,807
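The Work column is the memory on each level weighted by its 2^level timestep subcycles; a compile-time spot check of a few rows:

```cpp
// Spot-check of the Work = Mem * 2^level column (compile-time arithmetic).
static_assert(114629 *   2 == 229258, "level 1: 114,629 MB x 2^1");
static_assert( 21226 *   4 ==  84904, "level 2: 21,226 MB x 2^2");
static_assert(  6085 *   8 ==  48680, "level 3: 6,085 MB x 2^3");
static_assert(   311 * 128 ==  39808, "level 7: 311 MB x 2^7");
```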
Current MPI Implementation

[Figure: a real grid object holds grid metadata plus physics data; a virtual grid object holds grid metadata only]
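A minimal sketch of the distinction in the figure: metadata is replicated everywhere, while physics data is allocated only on the owning rank. Field names are illustrative assumptions:

```cpp
// Illustrative real/virtual grid objects: metadata is replicated on all ranks,
// physics data is allocated only on the rank that owns the grid.
#include <memory>

struct GridMetadata {                 // replicated everywhere: extents, level, owner rank
  int level;
  int owner_rank;
  double left_edge[3], right_edge[3];
  int dims[3];
};

struct GridObject {
  GridMetadata              meta;     // present in both real and virtual grid objects
  std::unique_ptr<double[]> fields;   // allocated only on the owning rank ("real" grid)

  bool is_real(int my_rank) const { return meta.owner_rank == my_rank; }
};
```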
SCALING AMR GRID HIERARCHY

- Flat MPI implementation is not scalable because the grid hierarchy metadata is replicated in every processor
  - For very large grid counts, this dominates the memory requirement (not the physics data!)
- Hybrid parallel implementation helps a lot!
  - Hierarchy metadata is now replicated only in every SMP node instead of every processor
  - We would prefer fewer SMP nodes (8192-4096) with bigger core counts (32-64) (= 262,144 cores)
  - The communication burden is partially shifted from MPI to intra-node memory accesses
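A back-of-envelope illustration of why replicated metadata dominates, assuming (purely for illustration) about 1 KB of metadata per grid and the roughly 3 × 10⁵ grids of the AMR run above:

```cpp
// Back-of-envelope metadata replication estimate (assumed sizes; illustrative only).
#include <cstdio>

int main() {
  const double n_grids        = 3.0e5;   // ~order of the 1024^3 AMR run's grid count
  const double bytes_per_grid = 1024.0;  // ASSUMED metadata size per grid (~1 KB)
  const double per_task_gb    = n_grids * bytes_per_grid / 1.0e9;
  const int    cores_per_node = 32;      // hybrid: one metadata copy per node, not per core
  std::printf("flat MPI : ~%.2f GB of hierarchy metadata replicated in every task\n",
              per_task_gb);
  std::printf("hybrid   : that copy shared by %d cores -> %dx less total replication\n",
              cores_per_node, cores_per_node);
  return 0;
}
```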
CELLO EXTREME AMR FRAMEWORK

- Targeted at fluid, particle, or hybrid (fluid+particle) simulations on millions of cores
- Generic AMR scaling issues:
  - Small AMR patches restrict available parallelism
  - Dynamic load balancing
  - Maintaining data locality for deep hierarchies
  - Re-meshing efficiency and scalability
  - Inherently global multilevel elliptic solves
  - Increased range and precision requirements for deep hierarchies