Unstructured Mesh Technologies
Mark S. Shephard for the FASTMath Unstructured Meshing Team
FASTMath SciDAC Institute


Unstructured Mesh Technologies to Support Massively Parallel High-Order Simulations Over General Domains

Application developers need tools to take advantage of the ability of unstructured meshes to produce results of a given accuracy with a minimum number of unknowns. The FASTMath unstructured mesh technologies address five areas:
• Parallel mesh infrastructures: underlying mesh structures and services for application developers
• Dynamic load balancing: fast methods to achieve load balance for all steps in a simulation workflow
• Mesh adaptation and quality control: tools to modify meshes so that analysis codes reliably produce results of desired accuracy
• Parallel performance on unstructured meshes: scalability at the highest core counts
• Architecture aware implementations: methods that take advantage of high core-count, heterogeneous nodes


The FASTMath team includes experts from four national laboratories and five universities

Lawrence Berkeley National Laboratory: Mark Adams, Ann Almgren, Phil Colella, Anshu Dubey, Dan Graves, Sherry Li, Lin Lin, Terry Ligocki, Mike Lijewski, Peter McCorquodale, Esmond Ng, Brian Van Straalen, Chao Yang; subcontract: Jim Demmel (UC Berkeley)
Lawrence Livermore National Laboratory: Barna Bihari, Lori Diachin, Milo Dorr, Rob Falgout, Mark Miller, Jacob Schroder, Carol Woodward, Ulrike Yang; subcontracts: Carl Ollivier-Gooch (Univ of British Columbia), Dan Reynolds (Southern Methodist)
Rensselaer Polytechnic Institute: E. Seegyoung Seol, Onkar Sahni, Mark Shephard, Cameron Smith; subcontract: Ken Jansen (Univ of Colorado Boulder)
Argonne National Laboratory: Jed Brown, Lois Curfman McInnes, Todd Munson, Vijay Mahadevan, Barry Smith; subcontracts: Jim Jiao (SUNY Stony Brook), Paul Wilson (Univ of Wisconsin)
Sandia National Laboratories: Karen Devine, Glen Hansen, Jonathan Hu, Vitus Leung, Siva Rajamanickam, Michael Wolf, Andrew Salinger


Parallel Structures and Services to Support Development of Unstructured Mesh Simulation Workflows

Development goals
• Architecture aware parallel mesh libraries and services to meet application needs
• Solution field support for adaptive multiphysics simulation workflows

Advances made to date
• MOAB library developments
  – Supporting applications on >32K cores
  – Supporting the MeshKit meshing library
  – Improved memory access and parallel I/O
• PUMI developments
  – Support for adaptively evolving mixed meshes
  – Combined use of MPI and threads
  – Meshes to 92B elements on ¾ million parts
• Attached Parallel Fields (APF) development underway
  – Effective storage of solution fields on meshes (see the sketch below)
  – Supports operations on the fields
Fig: two mesh parts, P1 and P2, showing inter-process and intra-process part boundaries


Parallel Structures and Services to Support Development of Unstructured Mesh Simulation Workflows

Plans for FY 14-16
• Supporting additional unstructured mesh needs
  – Combined structured and unstructured meshes
  – Support for scale/dimension change of meshes
• More efficient array-based implementations to support evolving meshes

Results/Impact
• Unstructured mesh methods provided to multiple applications – ACES4BGC (climate), XGC (fusion), M3D-C1 (fusion), ACE3P (accelerators), Plasma-Surface Interaction (PSI), NEAMS (NE), CESAR (ASCR), Albany (multiple applications), Athena VMS (NNSA)
• Being used in the creation of efficient in-memory adaptive simulations for multiple applications: ACE3P, M3D-C1, Albany, PHASTA, Athena VMS, FUN3D, Proteus
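The Attached Parallel Fields work above concerns storing solution fields directly on mesh entities so that adaptive workflows can carry fields through mesh modification. The following is a minimal sketch of that idea only: it associates degree-of-freedom values with mesh vertices through a hypothetical NodalField/VertexId container and does not represent the actual APF or PUMI API.

```cpp
// Illustrative sketch only: associating nodal solution values with mesh
// vertices, in the spirit of an attached-field service. The types below
// (VertexId, NodalField) are hypothetical and not the APF API.
#include <cstddef>
#include <unordered_map>
#include <vector>

using VertexId = long;  // assumed global identifier for a mesh vertex

class NodalField {
public:
  explicit NodalField(std::size_t components) : ncomp_(components) {}

  // Attach (or overwrite) the values stored at a vertex.
  void setNodeValues(VertexId v, const std::vector<double>& vals) {
    data_[v] = vals;
  }

  // Retrieve the values stored at a vertex (empty if none attached).
  const std::vector<double>& getNodeValues(VertexId v) const {
    static const std::vector<double> empty;
    auto it = data_.find(v);
    return it == data_.end() ? empty : it->second;
  }

  std::size_t components() const { return ncomp_; }

private:
  std::size_t ncomp_;
  std::unordered_map<VertexId, std::vector<double>> data_;
};

int main() {
  NodalField pressure(1);                // one scalar component per vertex
  pressure.setNodeValues(42, {101325.0}); // attach a value to vertex 42
  return pressure.getNodeValues(42).empty() ? 1 : 0;
}
```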
Highlight: Unstructured Mesh Techniques for Edge Plasma Fusion Simulations

EPSI PIC coupled to mesh simulation requires high quality meshes meeting a strict set of layout constraints
• The existing method took >11 hours and the mesh did not have the desired quality
• FASTMath meshing technologies were combined and extended to produce better quality meshes that meet the constraints
• Run time reduced by a factor of >60, to well under 10 minutes for the finest mesh

Particle-in-cell with distributed mesh
• The current XGC copies the entire mesh on each process
• The PUMI distributed mesh is being extended to support a parallel mesh with particles that can move through the mesh


Highlight: Mesh Intersections and Tracer Transport for BioGeochemical Cycles in ACES4BGC

• Computing intersections and interpolating key quantities between two meshes covering the same domain is a vital algorithm for many applications
• The MOAB parallel infrastructure and moving-mesh intersection algorithm implementations are used to track multi-tracer transport
• Applied computationally efficient schemes for biogeochemical cycles, with enhancements to implement a scalable (linear complexity), conservative, 2-D remapping algorithm
• Efficient and balanced redistribution of meshes implemented internally
• A collaborative effort between computational scientists in the ACES4BGC project, the FASTMath Institute, and performance engineers in the SUPER Institute


Highlight: Unstructured Mesh Infrastructure for the M3D-C1 MHD Code for Fusion Plasma Simulations

Provide the mesh infrastructure for M3D-C1
• Geometric model interface defined by analytic expressions with B-splines
• Distributed mesh management, including process grouping to define planes: each plane is loaded with the same distributed 2D mesh, then the 3D mesh and corresponding partitioning topology are constructed
• Mesh adaptation and load balancing
• Adjacency-based node ordering
• Mapping of the mesh to PETSc structures and control of the assembly processes
Fig: 3D mesh constructed from 64 2D planes on 12288 processes [1] (only the mesh between selected planes shown)
[1] S.C. Jardin, et al., Multiple timescale calculations of sawteeth and other macroscopic dynamics of tokamak plasmas, Computational Science and Discovery 5 (2012) 014002


Dynamic Load Balancing to Ensure the Ability of Applications to Achieve and Maintain Scalability

Development goals
• Architecture aware dynamic load balancing library
• Fast dynamic load balancing as needed by component operations in adaptive simulation workflows

Advances made to date
• Hybrid (MPI/threads) dynamic load balancing tools (Zoltan2, ParMA)
• Scalable Multi-Jagged (MJ) geometric (coordinate) partitioner (the coordinate-bisection idea behind such partitioners is sketched below)
• Partition improvement accounting for multiple mesh entity types
• Multilevel/multimethod fast partitioning to quickly partition to larger core counts

Plans for FY 14-16
• Partition refinement and mesh data incorporated in Zoltan2
• ParMA for fast load balancing during mesh adaptation
• Fast dynamic load balancing for meshes of many billions of elements on millions of cores – provide the capability as part of Zoltan2
Fig: a nine-part 3x3 MJ partition
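Geometric partitioners such as RCB and Multi-Jagged assign mesh elements to parts using only their coordinates (e.g., element centroids), with MJ cutting each dimension into many slabs at once rather than bisecting recursively. The sketch below illustrates the simpler recursive coordinate bisection idea on a small point set; it is a simplified stand-in for exposition, not the Zoltan2 RCB or MJ implementation.

```cpp
// Simplified recursive coordinate bisection (RCB) on element centroids.
// An illustration of coordinate-based partitioning, not Zoltan2 code.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Point { double x[3]; };

// Assign part ids to the points indexed by order[begin, end), splitting
// recursively along the coordinate direction with the largest extent.
void rcb(const std::vector<Point>& pts, std::vector<std::size_t>& order,
         std::vector<int>& part, std::size_t begin, std::size_t end,
         int firstPart, int nparts) {
  if (nparts == 1) {
    for (std::size_t i = begin; i < end; ++i) part[order[i]] = firstPart;
    return;
  }
  // Find the direction with the largest spatial extent in this range.
  int dim = 0;
  double best = -1.0;
  for (int d = 0; d < 3; ++d) {
    double lo = pts[order[begin]].x[d], hi = lo;
    for (std::size_t i = begin; i < end; ++i) {
      lo = std::min(lo, pts[order[i]].x[d]);
      hi = std::max(hi, pts[order[i]].x[d]);
    }
    if (hi - lo > best) { best = hi - lo; dim = d; }
  }
  // Split the range so each side gets points in proportion to its parts.
  int leftParts = nparts / 2;
  std::size_t mid =
      begin + (end - begin) * static_cast<std::size_t>(leftParts) / nparts;
  std::nth_element(order.begin() + begin, order.begin() + mid,
                   order.begin() + end,
                   [&](std::size_t a, std::size_t b) {
                     return pts[a].x[dim] < pts[b].x[dim];
                   });
  rcb(pts, order, part, begin, mid, firstPart, leftParts);
  rcb(pts, order, part, mid, end, firstPart + leftParts, nparts - leftParts);
}

int main() {
  std::vector<Point> pts = {{{0, 0, 0}}, {{1, 0, 0}}, {{2, 0, 0}}, {{3, 0, 0}}};
  std::vector<std::size_t> order = {0, 1, 2, 3};
  std::vector<int> part(pts.size(), -1);
  rcb(pts, order, part, 0, pts.size(), 0, 2);  // two parts along x
  return 0;
}
```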
Dynamic Load Balancing to Ensure the Ability of Applications to Achieve and Maintain Scalability

Results/Impact
• Scalable geometric partitioning of meshes with 8 billion elements on 64K parts
• Improved scalability of unstructured mesh simulations obtained by accounting for multiple entity types
• Effective use of predictive load balancing to avoid memory problems and increase the performance of parallel mesh adaptation
• Multilevel/multimethod fast dynamic partitioning to partition meshes of 92 billion elements to 3.1 million parts on ¾ million cores
Fig: execution time (normalized to serial RCB) for MJ and RCB from 1 to 6144 cores; reduced data movement in MJ enables better scaling than Recursive Coordinate Bisection on NERSC's Hopper
Fig: X+ParMA partition quality; Local ParMETIS and Local RIB were executed in less than two minutes, while 131072 RIB and 16384 ParMETIS required over three minutes


Highlight: Using architecture-aware unstructured grid tools to improve multigrid scalability

Goal: reduce MueLu multigrid execution time by reducing communication costs within and between multigrid levels

Two-step approach:
• Bipartite graph matching reduces communication between the fine operator (on all cores) and the coarse operator (on a subset of cores)
• Zoltan2's geometric task mapping algorithm places dependent fine-level tasks on "nearby" cores in the machine's network

Results:
• Bipartite graph matching reduces MueLu's setup and solve time
• Task placement further reduces solve time with only a small overhead in setup

Potential impact:
• Ice-sheet modeling in PISCEES
• Solvers for SPH in CM4
• Finite element codes SIERRA and Alegra
Fig: weak scaling experiments with MueLu on NERSC Hopper – time for multigrid V-cycle setup and time for one multigrid solve


Mesh Adaptation and Quality Improvement to Ensure Unstructured Mesh Simulation Reliability

Development goals are to provide
• General purpose parallel mesh modification methods to create anisotropically adapted meshes as needed by a broad range of applications (MeshAdapt)
• A mesh quality improvement tool that optimizes element shapes using target-matrix quality metrics, objective functions, and optimization solvers (Mesquite)

Advances made to date
• Mixed mesh adaptation for boundary layers
• Adapting meshes to 92 billion elements
• Boundary layer thickness adaptation
• Scaling of mesh quality improvement (Mesquite) to 125,000 cores
• In-memory integration of MeshAdapt into multiple simulation workflows
• Support for curved mesh geometry (complete for quadratic)


Mesh Adaptation and Quality Improvement to Ensure Unstructured Mesh Simulation Reliability

Plans for FY 14-16
• Curved mesh adaptation with mesh geometry of order greater than quadratic
• Improved scalability through more frequent local load balancing
• Continued scalability of mesh quality improvement algorithms; exploration of transactional memory
• Better anisotropic mesh control for additional applications (see the metric edge-length sketch below)

Results/Impact
• Meshes to 92 billion elements
• Integrated into Albany/Trilinos
• Being applied to fusion, accelerators, nuclear, turbulent flows, active flow control, solid mechanics, and ALE hydromechanics
Fig: Mesquite mesh quality algorithms improve ALE simulation meshes for longer than the original scheme


Highlight: Parallel Mesh Adaptation with Curved Mesh Geometry for High-Order Accelerator EM Simulations

• The development goal is to provide a parallel mesh modification procedure capable of creating/adapting unstructured meshes with curved mesh geometry
• A parallel mesh adaptation procedure has been developed (as part of MeshAdapt) that supports quadratic curved meshes
• Ongoing efforts include supporting mesh geometry of higher than quadratic order for unstructured meshes in 3D
• The procedure has been integrated with the high-order electromagnetic solver ACE3P to support linear accelerator simulation applications at the SLAC National Accelerator Laboratory
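The anisotropic adaptation capabilities above are commonly driven by a metric (size) field: a symmetric positive-definite tensor at each point prescribes the desired edge length in each direction, and an edge is flagged for refinement when its length measured in the metric exceeds a threshold. The sketch below shows this standard edge-length test with a hard-coded diagonal metric; it is a generic illustration of the criterion, not the MeshAdapt interface.

```cpp
// Metric-based edge length test used to drive anisotropic refinement.
// For a symmetric positive-definite metric M, an edge e has metric length
// L_M(e) = sqrt(e^T M e); a common rule refines when L_M > sqrt(2) and
// coarsens when L_M < 1/sqrt(2). Generic illustration only.
#include <array>
#include <cmath>
#include <cstdio>

using Vec3 = std::array<double, 3>;
using Metric = std::array<std::array<double, 3>, 3>;  // symmetric 3x3

// Length of the edge from a to b measured in the metric M.
double metricEdgeLength(const Vec3& a, const Vec3& b, const Metric& M) {
  Vec3 e = {b[0] - a[0], b[1] - a[1], b[2] - a[2]};
  double q = 0.0;
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      q += e[i] * M[i][j] * e[j];
  return std::sqrt(q);
}

int main() {
  // M = diag(1/h_x^2, 1/h_y^2, 1/h_z^2) requests target edge sizes of
  // roughly 0.1 in x and 1.0 in y and z (an anisotropic size field).
  Metric M = {{{100.0, 0.0, 0.0}, {0.0, 1.0, 0.0}, {0.0, 0.0, 1.0}}};
  Vec3 a = {0.0, 0.0, 0.0}, b = {0.5, 0.0, 0.0};
  double L = metricEdgeLength(a, b, M);  // = 5.0 for this edge
  std::printf("metric length %.3f -> %s\n", L,
              L > std::sqrt(2.0) ? "refine" : "keep");
  return 0;
}
```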
Parallel Performance on Unstructured Grids – Scaling a 92 Billion Element Mesh to 3/4M Cores

The development goal is to demonstrate the ability to scale adapted anisotropic unstructured meshes to full machine sizes

Advances made to date
• Created a workflow using FASTMath unstructured mesh tools to produce well-balanced unstructured meshes approaching 100B elements
• Provided a general purpose Navier-Stokes solver that scales when solving turbulent flow problems on general geometries with anisotropic meshes
• Demonstrated the effectiveness of using multiple processes per core on BG/Q – excellent scaling for a 92B element problem on ¾ million cores (full Mira) with 3.1M parts


Parallel Performance on Unstructured Grids – Scaling a 92 Billion Element Mesh to 3/4M Cores

Plans for FY 14-16
• Effective integration of other FASTMath solver technologies

Impact
• Provided a clear demonstration of the ability to solve problems on general anisotropic meshes on complex geometries
• Being applied to active flow control problems over complex geometries, with applications in wind energy and in the design of more energy-efficient aircraft
• Applied to nuclear reactor flows for accident scenario planning
• Led to a joint Colorado/Boeing INCITE project that performed relevant simulations on Mira – can reach full flight scale in a day
Fig: equation solution scaling (32 dB NL residual reduction) from 8K to 128K cores for PETSc and native solvers at 1, 2, and 4 MPI ranks per core


Architecture Aware Implementations of Unstructured Mesh Operations

Development goal: provide tools and methods for operating on unstructured meshes that effectively use high core-count, hybrid parallel compute nodes

Advances made to date
• Architecture-aware task placement to reduce congestion and communication costs
• Hybrid (MPI/threads) dynamic load balancing (Zoltan2, ParMA) and partitioned meshes (PUMI); a node-local MPI sketch is given below
• A parallel control utility (PCU) that supports hybrid operation on unstructured meshes

Plans for FY 14-16
• More effective use of hybrid methods in unstructured mesh operations
• Consideration of the use of accelerators (Xeon Phi) for parallel mesh adaptation
• Investigation of task-placement strategies for Dragonfly networks
• Provide efficient tools for hybrid MPI/thread parallelism for evolving mesh applications that employ partitioned meshes
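Hybrid MPI/thread execution and architecture-aware placement both start from knowing which ranks share a compute node. The sketch below uses the MPI-3 call MPI_Comm_split_type to build a node-local communicator, one common building block on which such strategies can be layered; it is a generic example, not the PCU or Zoltan2 implementation.

```cpp
// Discovering node-local ranks with MPI-3: a typical first step for
// architecture-aware placement and hybrid MPI/thread execution.
// Generic building block; not the PCU or Zoltan2 implementation.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int worldRank, worldSize;
  MPI_Comm_rank(MPI_COMM_WORLD, &worldRank);
  MPI_Comm_size(MPI_COMM_WORLD, &worldSize);

  // Group ranks that can share memory, i.e., ranks on the same node.
  MPI_Comm nodeComm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, /*key=*/0,
                      MPI_INFO_NULL, &nodeComm);

  int nodeRank, nodeSize;
  MPI_Comm_rank(nodeComm, &nodeRank);
  MPI_Comm_size(nodeComm, &nodeSize);

  // With this information an application can, for example, run fewer MPI
  // ranks per node with threads filling the cores, or map nearby mesh
  // parts to ranks on the same node to reduce off-node communication.
  std::printf("world %d/%d is node-local rank %d/%d\n",
              worldRank, worldSize, nodeRank, nodeSize);

  MPI_Comm_free(&nodeComm);
  MPI_Finalize();
  return 0;
}
```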
Architecture Aware Implementations of Unstructured Mesh Operations

Impact
• More efficient dynamic load balancing using hybrid methods
• 34% reduction in stencil-based computation times using task placement
• Hybrid support for unstructured mesh operations (and other evolving graphs)
Fig: congestion (maximum messages on a link) from 4K to 64K processors for no mapping, LibTopoMap, and geometric task placement; architecture-aware geometric task placement reduces congestion (and thus communication time) in stencil-based computation compared to default task placement and to mapping from LibTopoMap (Hoefler, ETH Zurich)
Fig: improvement in performance using hybrid MPI/threads in going from N to M parts, where M = (multiple)·N


Unstructured Mesh Technologies Support Massively Parallel High-Order Simulations Over General Domains

To take full advantage of unstructured mesh methods to accurately solve problems on general domains, application developers need the FASTMath unstructured mesh technologies.

FASTMath unstructured mesh technologies provide:
• Parallel mesh infrastructures that support developing applications without the need to develop the underlying mesh structures and services
• Dynamic load balancing as needed to achieve scalability for the operational steps in a simulation workflow
• Mesh adaptation and quality improvement tools to modify meshes so that analysis programs can reliably produce results of the desired accuracy
• Architecture aware implementations that take advantage of high core-count, heterogeneous nodes

SciDAC and other DOE applications are taking advantage of the FASTMath unstructured mesh technologies to develop scalable simulation workflows, such as those indicated in this and the FASTMath Integrated Technologies presentation.