FASTMath-unstructured-mesh

Unstructured Mesh Technologies
Mark S. Shephard for the FASTMath
Unstructured Meshing Team
FASTMath SciDAC Institute
Unstructured Mesh Technologies to Support Massively
Parallel High-Order Simulations Over General Domains
• Parallel mesh infrastructures: Underlying mesh structures and services for application developers
• Dynamic load balancing: Fast methods to achieve load balance for all steps in a simulation workflow
• Mesh adaptation and quality control: Tools to modify meshes so that analysis codes reliably produce results of desired accuracy
• Parallel performance on unstructured meshes: Scalability at the highest core counts
• Architecture aware implementations: Methods that take advantage of high core count, heterogeneous nodes
Application developers need tools to take advantage of the ability of unstructured meshes to produce results of a given accuracy with a minimum number of unknowns
The FASTMath team includes experts from four national
laboratories and five universities
Lawrence Berkeley
National Laboratory
Mark Adams
Ann Almgren
Phil Colella
Anshu Dubey
Dan Graves
Sherry Li
Lin Lin
Terry Ligocki
Mike Lijewski
Peter McCorquodale
Esmond Ng
Brian Van Straalen
Chao Yang
Subcontract: Jim Demmel
(UC Berkeley)
Lawrence Livermore
National Laboratory
Barna Bihari
Lori Diachin
Milo Dorr
Rob Falgout
Mark Miller
Jacob Schroder
Carol Woodward
Ulrike Yang
Subcontract: Carl Ollivier-Gooch
(Univ of British Columbia)
Subcontract: Dan Reynolds
(Southern Methodist)
Rensselaer Polytechnic Inst.
E. Seegyoung Seol
Onkar Sahni
Mark Shephard
Cameron Smith
Subcontract: Ken Jansen
(UC Boulder)
Argonne National Laboratory
Jed Brown
Lois Curfman McInnes
Todd Munson
Vijay Mahadevan
Barry Smith
Subcontract: Jim Jiao
(SUNY Stony Brook)
Subcontract: Paul Wilson
(Univ of Wisconsin)
Sandia National
Laboratories
Karen Devine
Glen Hansen
Jonathan Hu
Vitus Leung
Siva Rajamanickam
Michael Wolf
Andrew Salinger
Parallel Structures and Services to Support Development
of Unstructured Mesh Simulation Workflows


Development goals
• Architecture aware parallel mesh libraries & services to meet application needs
• Solution field support for adaptive multiphysics simulation workflows
Advances made to date
• MOAB library developments
– Supporting applications on >32K cores
– Supporting the MeshKit meshing library
– Improved memory access and parallel I/O
• PUMI developments
– Support for adaptively evolving mixed meshes
– Combined use of MPI and threads
– Meshes to 92B elements on ¾ million parts
[Fig: distributed mesh parts P1 and P2 with intra-process and inter-process part boundaries]
• Attached Parallel Fields (APF) development underway (see the field sketch below)
– Effective storage of solution fields on meshes
– Supports operations on the fields
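To make the APF idea concrete, here is a minimal sketch of attaching and filling a nodal scalar field through the SCOREC/core APF interface; the field name, the linear Lagrange shape, and the use of a coordinate as a stand-in value are illustrative assumptions, not taken from the slides.

#include <apf.h>
#include <apfMesh2.h>

// Sketch: attach a nodal scalar field to a PUMI mesh and fill it.
void attachPressureField(apf::Mesh2* m)
{
  // one scalar value per mesh vertex (linear Lagrange shape functions)
  apf::Field* pressure = apf::createLagrangeField(m, "pressure", apf::SCALAR, 1);

  apf::MeshIterator* it = m->begin(0);        // iterate dimension-0 entities (vertices)
  while (apf::MeshEntity* v = m->iterate(it)) {
    apf::Vector3 x;
    m->getPoint(v, 0, x);                     // vertex coordinates
    apf::setScalar(pressure, v, 0, x[0]);     // store a value at node 0 of the vertex
  }
  m->end(it);
  // the field can be read back later, e.g. double p = apf::getScalar(pressure, someVertex, 0);
}

Because the field is attached to the mesh entities themselves, it travels with the mesh through migration and adaptation, which is the kind of support the adaptive workflows above rely on.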
Parallel Structures and Services to Support Development
of Unstructured Mesh Simulation Workflows


Plans for FY 14-16
• Supporting additional unstructured mesh needs
– Combined structured and unstructured meshes
– Support of scale/dimension change of meshes
• More efficient array-based implementations to
support evolving meshes
Results/Impact
• Unstructured mesh methods provided to multiple applications
– ACES4BGC (climate), XGC (fusion), M3D-C1 (fusion), ACE3P (accelerators), Plasma-Surface Interaction (PSI), NEAMS (NE), CESAR (ASCR), Albany (multiple applications), Athena VMS (NNSA)
• Being used in the creation of efficient in-memory adaptive simulations for multiple applications – ACE3P, M3D-C1, Albany, PHASTA, Athena VMS, FUN3D, Proteus
Highlight: Unstructured Mesh Techniques for
Edge Plasma Fusion Simulations


EPSI PIC coupled to mesh simulation
requires high quality meshes meeting a
strict set of layout constraints
• Existing method took >11 hours and
mesh did not have desired quality
• FASTMath meshing technologies put
together and extended – produce better
quality meshes that meet constraints
• Run time reduced by a factor of >60 to
well under 10 minutes for finest mesh
Particle-in-Cell with distributed mesh
• Current XGC copies entire mesh on
each process
• PUMI distributed mesh being extended to support a parallel mesh with particles that can move through the mesh (see the sketch below)
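As a rough illustration of particle handling on a distributed mesh, the sketch below shows the migration step: after a push, particles whose new position falls outside the locally owned region are routed to the owning part. Every name here (Particle, LocalMesh, destinationPart, sendParticles) and the toy 1-D slab decomposition are hypothetical placeholders, not XGC or PUMI code.

#include <vector>

struct Particle { double x[3]; double v[3]; };

struct LocalMesh {
  double xlo, xhi, slabWidth;   // toy 1-D decomposition into equal-width slabs
  // rank owning the particle's new position, or -1 if it stays local
  int destinationPart(const Particle& p) const {
    if (p.x[0] >= xlo && p.x[0] < xhi) return -1;
    return static_cast<int>(p.x[0] / slabWidth);
  }
};

// hypothetical communication step: a real code would pack each bucket and ship
// it to its destination rank (e.g. with MPI point-to-point messages)
void sendParticles(const std::vector<std::vector<Particle> >& /*outgoing*/) {}

void pushAndMigrate(std::vector<Particle>& particles, const LocalMesh& mesh,
                    int numRanks, double dt)
{
  std::vector<std::vector<Particle> > outgoing(numRanks);
  std::vector<Particle> kept;
  for (Particle& p : particles) {
    for (int d = 0; d < 3; ++d)        // simple explicit push (placeholder physics)
      p.x[d] += dt * p.v[d];
    int dest = mesh.destinationPart(p);
    if (dest < 0) kept.push_back(p);   // still inside the locally owned slab
    else outgoing[dest].push_back(p);  // crossed an inter-part boundary
  }
  particles.swap(kept);
  sendParticles(outgoing);             // particles arriving from other ranks are appended elsewhere
}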
Highlight: Mesh Intersections and Tracer Transport
for BioGeochemical Cycles in ACES4BGC





• Computing intersections and interpolating key quantities between two meshes covering the same domain is a vital algorithm for many applications
• MOAB parallel infrastructure and moving-mesh intersection algorithm implementations are used to track multi-tracer transport
• Applied computationally efficient schemes for biogeochemical cycles, with enhancements to implement a scalable (linear complexity), conservative, 2-D remapping algorithm (see the sketch below)
• Efficient and balanced re-distribution of meshes implemented internally
• A collaborative effort between computational scientists in the ACES4BGC project, the FASTMath Institute, and performance engineers in the SUPER Institute
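A minimal sketch of the first-order conservative remapping step follows, assuming the source-target cell intersection areas have already been computed (which is what the mesh intersection algorithm provides); the Overlap struct and remapTracer function are illustrative names, not the MOAB/ACES4BGC implementation.

#include <cstddef>
#include <vector>

struct Overlap { std::size_t src, tgt; double area; };  // one source-target intersection polygon

// per-cell-average tracer values on the source mesh -> values on the target mesh
std::vector<double> remapTracer(const std::vector<double>& srcValue,   // per source cell
                                const std::vector<double>& tgtArea,    // per target cell
                                const std::vector<Overlap>& overlaps)
{
  std::vector<double> tgtValue(tgtArea.size(), 0.0);
  // accumulate area-weighted contributions; this conserves the tracer integral
  for (const Overlap& o : overlaps)
    tgtValue[o.tgt] += srcValue[o.src] * o.area;
  for (std::size_t i = 0; i < tgtValue.size(); ++i)
    tgtValue[i] /= tgtArea[i];          // convert the integral back to a cell average
  return tgtValue;
}

The remap cost is linear in the number of intersection polygons, which is the sense in which the scheme above achieves linear complexity once the intersections are known.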
Highlight: Unstructured Mesh Infrastructure for the
M3D-C1 MHD Code for Fusion Plasma Simulations

Provide the mesh infrastructure for M3D-C1
• Geometric model interface defined by
analytic expressions with B-splines
• Distributed mesh management including
– process grouping to define planes (see the communicator-splitting sketch below)
– each plane loaded with the same distributed 2D mesh, then
– 3D mesh and corresponding partitioning topology constructed
• Mesh adaptation and load balancing
• Adjacency-based node ordering
• Mapping of mesh to PETSc structures
and control of assembly processes
Fig: 3D mesh constructed from 64
2D planes on 12288 processes [1]
(only the mesh between selected
planes shown)
[1] S.C. Jardin, et al., Multiple timescale calculations of sawteeth and other macroscopic dynamics of tokamak plasmas, Computational Science and Discovery 5 (2012) 014002
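The process-grouping step can be pictured with plain MPI: split the world communicator into one communicator per poloidal plane so that each group loads the same distributed 2D mesh. The plane count, the contiguous blocks of ranks per plane, and the variable names below are assumptions for illustration, not the M3D-C1/PUMI code.

#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int numPlanes = 64;             // e.g. 64 planes on 12288 processes
  int ranksPerPlane = size / numPlanes; // assumes size is a multiple of numPlanes
  int plane = rank / ranksPerPlane;     // which plane this rank belongs to

  MPI_Comm planeComm;                   // communicator holding one plane's 2D mesh
  MPI_Comm_split(MPI_COMM_WORLD, plane, rank, &planeComm);

  // each planeComm now loads the same distributed 2D mesh; the 3D mesh and its
  // partitioning are then built by linking corresponding parts across planes

  MPI_Comm_free(&planeComm);
  MPI_Finalize();
  return 0;
}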
Dynamic Load Balancing to Ensure the Ability of
Applications to Achieve and Maintain Scalability



Development goals
• Architecture aware dynamic load balancing library
• Fast dynamic load balancing as needed by component
operations in adaptive simulation workflows
Advances made to date
• Hybrid (MPI/threads) dynamic load balancing tools (Zoltan2, ParMA)
• Scalable Multi-Jagged (MJ) geometric (coordinate) partitioner (sketched below)
• Partition improvement accounting for multiple mesh entity types
• Multilevel/multimethod fast partitioning to quickly partition to larger core counts
Plans for FY 14-16
• Partition refinement and mesh data incorporated in Zoltan2
• ParMA for fast load balancing during mesh adaptation
• Fast dynamic load balancing for meshes of many billions of
elements on millions of cores – provide capability as part
of Zoltan2
A nine-part 3x3 MJ partition
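To convey how MJ differs from recursive bisection, here is a toy serial sketch of the multi-jagged idea on 2D points: one sweep cuts along x into px equal-count slabs, and a second sweep cuts each slab along y into py parts, giving px*py parts in two passes. This is only an illustration of the concept, not the parallel Zoltan2 implementation.

#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { double x, y; int part; };

void multiJagged2D(std::vector<Point>& pts, int px, int py)
{
  // first sweep: order by x and assign slabs with (nearly) equal point counts
  std::sort(pts.begin(), pts.end(),
            [](const Point& a, const Point& b) { return a.x < b.x; });
  const std::size_t n = pts.size();
  for (int i = 0; i < px; ++i) {
    std::size_t lo = n * i / px, hi = n * (i + 1) / px;
    // second sweep: within the slab, order by y and split into py parts
    std::sort(pts.begin() + lo, pts.begin() + hi,
              [](const Point& a, const Point& b) { return a.y < b.y; });
    std::size_t m = hi - lo;
    for (std::size_t j = 0; j < m; ++j)
      pts[lo + j].part = i * py + static_cast<int>(j * py / m);
  }
}

Making many cuts per dimension in a single sweep reduces the number of partitioning levels, and with it the data movement, relative to recursive coordinate bisection.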

Dynamic Load Balancing to Ensure the Ability of
Applications to Achieve and Maintain Scalability
Results/Impact
• Scalable geometric partitioning of meshes with 8 billion elements on 64K parts
• Improved scalability of unstructured mesh simulations obtained by accounting for multiple entity types
• Effective use of predictive load balancing to avoid memory problems and increase performance of parallel mesh adaptation
• Multilevel/multimethod fast dynamic partitioning to partition meshes of 92 billion elements to 3.1 million parts on ¾ million cores
[Fig: execution time normalized w.r.t. serial RCB vs. number of cores (1 to 6144) for MJ and RCB. Reduced data movement in MJ enables better scaling than Recursive Coordinate Bisection on NERSC’s Hopper.]
[Fig: X+ParMA partition quality. Local ParMETIS and Local RIB were executed in less than two minutes; 131072 RIB and 16384 ParMETIS required over three minutes]
Highlight: Using architecture-aware unstructured
grid tools to improve multigrid scalability

Goal: Reduce MueLu multigrid execution time by reducing communication costs within
and between multigrid levels

Two-step approach:
• Bipartite graph matching reduces communication
between fine operator (on all cores) and coarse
operator (on subset of cores)
• Zoltan2’s geometric task mapping algorithm places dependent fine-level tasks on “nearby” cores in the machine’s network (see the sketch below)
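A toy sketch of the geometric task mapping idea: tasks (represented by the centroid of the data they own) and compute nodes (represented by a network coordinate) are each ordered along one coordinate and then matched in order, so geometrically nearby tasks land on nearby nodes. Zoltan2's algorithm is multi-dimensional and far more sophisticated; the single-coordinate ordering, equal task/node counts, and function names below are assumptions made only to convey the idea.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// returns assignment[i] = node given to task i (assumes one node per task)
std::vector<int> mapTasksToNodes(const std::vector<double>& taskCentroid,
                                 const std::vector<double>& nodeCoord)
{
  auto orderOf = [](const std::vector<double>& key) {
    std::vector<int> idx(key.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::sort(idx.begin(), idx.end(),
              [&](int a, int b) { return key[a] < key[b]; });
    return idx;
  };
  std::vector<int> taskOrder = orderOf(taskCentroid);
  std::vector<int> nodeOrder = orderOf(nodeCoord);
  std::vector<int> assignment(taskCentroid.size());
  for (std::size_t r = 0; r < taskOrder.size(); ++r)
    assignment[taskOrder[r]] = nodeOrder[r];   // r-th task in order gets the r-th node
  return assignment;
}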
Results:
• Bipartite graph matching reduces MueLu’s
setup and solve time
• Task placement further reduces solve time
with only small overhead in setup
Potential Impact:
• Ice-sheet modeling in PISCEES
• Solvers for SPH in CM4
• Finite element codes SIERRA & Alegra


[Figs: weak scaling experiments with MueLu on NERSC Hopper, showing time for multigrid V-cycle setup and time for one multigrid solve]
Mesh Adaptation and Quality Improvement to Ensure
Unstructured Mesh Simulation Reliability


Development goals are to provide
• General purpose parallel mesh modification methods to create anisotropically
adapted meshes as needed by a broad range of applications (MeshAdapt)
• Mesh quality improvement tool that optimizes element shapes using target-matrix quality metrics, objective functions, and optimization solvers (Mesquite); a metric sketch follows below
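As a concrete illustration of a target-matrix style shape metric of the kind such a tool optimizes, the sketch below forms T = A * W^-1 from an element corner Jacobian A and a target matrix W and evaluates |T|_F^2 / (2 det T), which equals 1 for an ideally shaped 2D element and grows as the shape degrades. The specific metric and the names are illustrative assumptions, not Mesquite's API.

#include <array>

using Mat2 = std::array<std::array<double, 2>, 2>;

double det(const Mat2& M) { return M[0][0] * M[1][1] - M[0][1] * M[1][0]; }

Mat2 mulInverse(const Mat2& A, const Mat2& W)   // returns T = A * W^-1
{
  const double d = det(W);
  const Mat2 Winv = {{{  W[1][1] / d, -W[0][1] / d },
                      { -W[1][0] / d,  W[0][0] / d }}};
  Mat2 T{};
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j)
      for (int k = 0; k < 2; ++k)
        T[i][j] += A[i][k] * Winv[k][j];
  return T;
}

double shapeMetric(const Mat2& A, const Mat2& W)
{
  const Mat2 T = mulInverse(A, W);
  double frob2 = 0.0;                            // squared Frobenius norm of T
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j)
      frob2 += T[i][j] * T[i][j];
  return frob2 / (2.0 * det(T));                 // >= 1; an objective function sums this over elements
}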
Advances made to date
• Mixed mesh adaptation for boundary layers
• Adapting meshes to 92 billion elements
• Boundary layer thickness adaptation
• Scaling of mesh quality improvement
(Mesquite) to 125,000 cores
• In-memory integration of MeshAdapt into
multiple simulation workflows
• Support for curved mesh geometry
(complete for quadratic)
Mesh Adaptation and Quality Improvement to Ensure
Unstructured Mesh Simulation Reliability


Plans for FY 14-16
• Curved mesh adaptation with mesh geometry
greater than quadratic
• Improved scalability through more frequent
local load balancing
• Continued scalability of mesh quality algorithms; exploration of transactional memory
[Fig: Mesquite mesh quality algorithms improve ALE simulation meshes, allowing simulations to run longer than with the original scheme]
• Better anisotropic mesh control for additional
applications
Results/Impact
• Meshes to 92 billion elements
• Integrated into Albany/Trilinos
• Being applied for fusion, accelerator, nuclear,
turbulent flows, active flow control, solid
mechanics, and ALE hydromechanics
Highlight: Parallel Mesh Adaptation with Curved Mesh
Geometry for High-Order Accelerator EM Simulations




Development goal is to provide a parallel mesh modification procedure capable of creating/adapting unstructured meshes with curved mesh geometry
Parallel mesh adaptation procedure has been
developed (as part of MeshAdapt) that
supports quadratic curved meshes
Ongoing efforts include supporting mesh
geometry of higher than quadratic order for
unstructured meshes in 3D
The procedure has been integrated with the high-order electromagnetic solver ACE3P to support linear accelerator simulation applications at the SLAC National Accelerator Laboratory
Parallel Performance on Unstructured Grids –
Scaling a 92 Billion Element Mesh to 3/4M Cores


Development goal is to demonstrate the ability to
scale adapted anisotropic unstructured meshes
to full machine sizes
Advances made to date
• Created a workflow using FASTMath unstructured mesh tools to build well-balanced unstructured meshes approaching 100B elements
• Provided a general-purpose Navier-Stokes solver that scales when solving turbulent flow problems on general geometries with anisotropic meshes
• Demonstrated effectiveness of using multiple
processes per core on BG/Q – excellent
scaling for a 92B element problem on ¾
million cores (full Mira) with 3.1M parts
Parallel Performance on Unstructured Grids –
Scaling a 92 Billion Element Mesh to 3/4M Cores

Plans for FY 14-16
• Effective integration of other FASTMath
solver technologies
Impact
• Provided clear demonstration of the ability
to solve problems on general anisotropic
meshes on complex geometries
• Being applied to active flow control problems over complex geometries, with applications in wind energy and in the design of more energy-efficient aircraft
• Applied to nuclear reactor flows for accident scenario planning
• Led to a joint Colorado/Boeing INCITE project that performed relevant simulations on Mira – can reach full flight scale in a day
[Fig: equation solution scaling (32 dB NL residual reduction): scaling factor vs. thousands of cores (8 to 128K), comparing PETSc and native solvers at 1, 2, and 4 MPI ranks per core]
Architecture Aware Implementations of
Unstructured Mesh Operations



Development goal: provide tools and methods for operating on unstructured
meshes that effectively use high core-count, hybrid parallel compute nodes
Advances made to date
• Architecture-aware task placement to reduce
congestion and communication costs
• Hybrid (MPI/threads) dynamic load balancing
(Zoltan2, ParMA) and partitioned meshes (PUMI)
• A parallel control utility (PCU) that supports
hybrid operation on unstructured meshes
Plans for FY 14-16
• More effective use of hybrid methods in unstructured mesh operations
• Consideration of the use of accelerators (Xeon Phis) for parallel mesh adaptation
• Investigation of task-placement strategies for Dragonfly networks
• Provide efficient tools for hybrid MPI/thread parallelism for evolving mesh
applications that employ partitioned meshes
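A minimal sketch of the hybrid MPI-plus-threads execution model referenced above: one MPI process per node owns several mesh parts, a thread works on each part, and MPI is initialized so that threads are allowed to communicate. The part count and the per-part work are placeholders; this is not the PCU or PUMI API.

#include <mpi.h>
#include <cstdio>
#include <thread>
#include <vector>

int main(int argc, char** argv)
{
  int provided = 0;
  // threads may make MPI calls concurrently, so request MPI_THREAD_MULTIPLE
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int partsOnThisProcess = 4;           // e.g. one mesh part per hardware thread
  std::vector<std::thread> workers;
  for (int p = 0; p < partsOnThisProcess; ++p)
    workers.emplace_back([rank, p] {
      // placeholder for per-part mesh work (adaptation, load balancing, assembly, ...)
      std::printf("rank %d working on local part %d\n", rank, p);
    });
  for (std::thread& t : workers) t.join();

  MPI_Finalize();
  return 0;
}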
Architecture Aware Implementations of Unstructured
Mesh Operations
Impact
• More efficient dynamic load balancing using hybrid methods
• 34% reduction in stencil-based computation times using task placement
• Hybrid support for unstructured mesh operations (and other evolving graphs)
[Fig: congestion (max. messages on a link) vs. number of processors (4K to 64K) for default placement (None), LibTopoMap, and geometric task placement]
[Fig: improvement in performance using hybrid MPI/threads in going from N to M parts, where M = (multiple)N]
Architecture-aware geometric task placement reduces
congestion (and, thus, communication time) in stencil-based
computation compared to default task placement and mapping
from LibTopoMap (Hoefler, ETH-Zurich)
Unstructured Mesh Technologies Support Massively
Parallel High-Order Simulations Over General Domains
To take full advantage of unstructured mesh methods to accurately solve problems on general domains, application developers need the FASTMath unstructured mesh technologies
FASTMath unstructured mesh technologies provide:
• Parallel mesh infrastructures to support developing applications without the
need to develop the underlying mesh structures and services
• Dynamic load balancing as needed to achieve scalability for the operational steps in a simulation workflow
• Mesh adaptation and quality improvement tools to modify meshes so that
analysis programs can reliably produce results of the desired accuracy
• Architecture aware implementations to take advantage of the high core
count, heterogeneous nodes
SciDAC and other DOE applications are taking advantage of the FASTMath unstructured mesh technologies to develop scalable simulation workflows such as those indicated in this and the FASTMath Integrated Technologies presentation