NYS High Performance Computation Consortium (HPC2)
- Funded by NYSTAR at $1M/year for 3 years.
- Goal: provide New York State users support in the application of HPC technologies in research and discovery, product development, and improved engineering and manufacturing processes.
- The HPC2 is a distributed activity; participants are Rensselaer, Stony Brook/Brookhaven, SUNY Buffalo, and NYSERNET.
- Industrial partners: Xerox, Corning, ITT Fluid Technologies (Goulds Pumps), Global Foundries.

Project: two-phase flow with fluid-structure coupling
Objectives
- Demonstrate end-to-end solution of two-phase flow problems.
- Couple with a structural mechanics boundary condition.
- Provide an interfaced, efficient, and reliable software suite for guiding design.
Tools
- Simmetrix SimAppS graphical interface - mesh generation and problem definition.
- PHASTA - two-phase level-set flow solver.
- PhParAdapt - solution transfer and mesh adaptation driver.
- Kitware ParaView - visualization.
Systems
- CCNI BG/L, CCNI Opterons cluster.
Results
- Fluid ejected into air; ran on 4,000 CCNI BG/L cores.
- Six iterations of mesh adaptation on the two-phase simulation ran autonomously on 128 cores of the CCNI Opterons for approximately 4 hours.
Coupling approach
- Initial work interfaces the simulations through serial file formats for displacement and pressure data.
- The structural mechanics simulation runs in serial; the PHASTA simulation runs in parallel.
- Serial displacement data is distributed to the partitioned PHASTA mesh; partitioned PHASTA nodal pressure data is aggregated to a serial input file (see the sketch below).
- Required modifications to the automated mesh adaptation Perl script.
[Figures: structural mechanics mesh of the input face; PHASTA partitioned mesh of the input face.]

Project: highly viscous sheet flows and twin screw extrusion
Objectives
- Demonstrate the capability of available computational tools and resources for parallel simulation of highly viscous sheet flows.
- Solve a model sheet flow problem relevant to the actual process and geometry.
- Develop and define processes for high-fidelity twin screw extruder parallel CFD simulation.
Investigated tools (to date)
- ACUSIM AcuConsole and AcuSolve, Simmetrix MeshSim, Kitware ParaView.
Systems
- CCNI Opterons cluster.
High aspect ratio sheet
- Aspect ratio: 500:1; element count: 1.85 million.
- 7 minutes on 512 cores versus 300 minutes on 8 cores.
- Mesh generation in the Simmetrix SimAppS graphical interface.
- Gaps that are ~1/180 of the large feature dimension.
[Figures: conceptual rendering of a single screw extruder assembly*; single screw extruder CAD**.]
* http://en.wikipedia.org/wiki/Plastics_extrusion
** https://sites.google.com/site/oscarsalazarcespedescaddesign/project03

Project: pump flow simulation
Objectives
- Apply HPC systems and software to set up and run 3D pump flow simulations in hours instead of days.
- Provide automated mesh generation for fluid geometries with rotating components.
Tools
- ACUSIM suite, PHASTA, ANSYS CFX, FMDB, Simmetrix MeshSim, Kitware ParaView.
Systems
- CCNI Opterons cluster.
AcuConsole interface
- Problem definition, mesh generation, runtime monitoring, and data visualization.
- Simmetrix provided a customized mesh generation and problem definition GUI after iterating with the industrial partner.
- Supports automated identification of pump geometric model features and application of attributes.
- Problem definition supports exporting data for multiple CFD analysis tools.
- Reduced mesh generation time frees engineers to focus on simulation and design optimization, leading to improved products.
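As an illustration of the serial-to-partitioned data exchange used in the fluid-structure coupling above, the following is a minimal MPI sketch. It assumes a contiguous node numbering with equal-size blocks per rank; the node counts, placeholder values, and output file name are invented, and the actual workflow moves this data through PHASTA's file formats and the modified Perl scripts rather than through MPI collectives.

```cpp
// Hypothetical sketch: rank 0 holds the serial structural-mechanics displacement
// field; each rank owns a block of PHASTA interface nodes.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Assume each rank owns nLocal interface nodes with known global IDs
  // (in a real coupling these come from the partitioned PHASTA mesh).
  const int nLocal = 4;
  std::vector<int> globalId(nLocal);
  for (int i = 0; i < nLocal; ++i) globalId[i] = rank * nLocal + i;
  const int nGlobal = size * nLocal;

  // 1) Rank 0 "reads" the serial displacement file (one value per node here).
  std::vector<double> serialDisp;
  if (rank == 0) {
    serialDisp.resize(nGlobal);
    for (int i = 0; i < nGlobal; ++i) serialDisp[i] = 0.001 * i; // placeholder data
  }

  // 2) Distribute the serial displacement data to the partitioned mesh.
  std::vector<double> localDisp(nLocal);
  MPI_Scatter(serialDisp.data(), nLocal, MPI_DOUBLE,
              localDisp.data(), nLocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  // ... the flow solver would advance with these boundary displacements ...
  std::vector<double> localPressure(nLocal);
  for (int i = 0; i < nLocal; ++i) localPressure[i] = 101325.0 + globalId[i]; // placeholder

  // 3) Aggregate the partitioned nodal pressure data back to a serial file on rank 0.
  std::vector<double> serialPressure(rank == 0 ? nGlobal : 0);
  MPI_Gather(localPressure.data(), nLocal, MPI_DOUBLE,
             serialPressure.data(), nLocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  if (rank == 0) {
    if (FILE* f = std::fopen("pressure_serial.dat", "w")) {  // hypothetical file name
      for (int i = 0; i < nGlobal; ++i) std::fprintf(f, "%d %g\n", i, serialPressure[i]);
      std::fclose(f);
    }
  }

  MPI_Finalize();
  return 0;
}
```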
Goal: Develop simulation technologies that allow practitioners to evaluate systems of interest.
To meet this goal we:
- Develop adaptive methods for reliable simulations.
- Develop methods to do all computation on massively parallel computers.
- Develop multiscale computational methods.
- Develop interoperable technologies that speed simulation system development.
- Partner on the construction of simulation systems for specific applications in multiple areas.

Software available (http://www.scorec.rpi.edu/software.php); some tools are not yet linked - email shephard@rpi.edu with any questions.

Simulation model and data management
- Geometric model interface to interrogate CAD models.
- Parallel mesh topological representation.
- Representation of tensor fields.
- Relationship manager.
Parallel control
- Neighborhood-aware message packing - IPComMan.
- Iterative mesh partition improvement with multiple criteria - ParMA.
- Processor mesh entity reordering to improve cache performance.
Adaptive meshing
- Adaptive mesh modification.
- Mesh curving.
Adaptive control
- Support for executing parallel adaptive unstructured mesh flow simulations with PHASTA.
- Adaptive multimodel simulation infrastructure.
Analysis
- Parallel Hierarchic Adaptive Stabilized Transient Analysis (PHASTA) software for compressible or incompressible, laminar or turbulent, steady or unsteady flows on 3D unstructured meshes (with U. Colorado).
- Parallel hierarchic multiscale modeling of soft tissues.

Interoperable Technologies for Advanced Petascale Simulations (ITAPS)
- Petascale integrated tools (AMR front tracking, shape optimization, solution adaptive loop, solution transfer, petascale mesh generation)
- build on component tools (front tracking, smoothing, mesh adapt, swapping, interpolation kernels, dynamic services)
- and are unified by common interfaces (mesh, geometry, relations, field, geometry/mesh services).

Excellent strong scaling
- Implicit time integration.
- Employs the partitioned mesh for system formulation and solution.
- A specific number of ALL-REDUCE communications is also required (see the sketch after the scaling tables).

105M vertex mesh (CCNI Blue Gene/L):
#Proc.    El./core    t (sec)    scale
512       204,800     2120       1
1,024     102,400     1052       1.01
2,048     51,200      529        1.00
4,096     25,600      267        0.99
8,192     12,800      131        1.02
16,384    6,400       64.5       1.03
32,768    3,200       35.6       0.93

1 billion element anisotropic mesh on Intrepid (Blue Gene/P):
# of cores    Rgn imb    Vtx imb    Time (s)    Scaling
16k           2.03%      7.13%      222.03      1
32k           1.72%      8.11%      112.43      0.987
64k           1.6%       11.18%     57.09       0.972
128k          5.49%      17.85%     31.35       0.885

AAA 5B elements: full-system scaling on Jugene (IBM BG/P system). Without ParMA partition improvement the strong scaling factor is 0.88 (time is 70.5 secs). This can yield 43 cpu-years of savings for production runs.
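As a small illustration of the ALL-REDUCE communications the implicit solve depends on, the sketch below computes a global residual norm across a partitioned vector with a single MPI_Allreduce. The residual values are placeholders; this is a hedged example of the collective pattern, not PHASTA's actual implementation.

```cpp
// Minimal sketch: each rank owns part of the residual vector; the on-part dot
// products are combined with one MPI_Allreduce, the kind of collective the
// implicit solve uses alongside partition-boundary exchanges.
#include <mpi.h>
#include <cmath>
#include <cstdio>
#include <vector>

// Global 2-norm of a distributed residual vector.
double globalNorm(const std::vector<double>& localResidual, MPI_Comm comm) {
  double localSum = 0.0;
  for (double r : localResidual) localSum += r * r;
  double globalSum = 0.0;
  MPI_Allreduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, comm);
  return std::sqrt(globalSum);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<double> residual(1000, 1.0e-3);  // placeholder on-part residual
  double norm = globalNorm(residual, MPI_COMM_WORLD);
  if (rank == 0) std::printf("global residual norm = %g\n", norm);

  MPI_Finalize();
  return 0;
}
```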
Parallel adaptive unstructured mesh simulation requires functional support for:
- Mesh distribution.
- Mesh-level inter-processor communications.
- Parallel mesh modification.
- Dynamic load balancing.
Parallel implementations exist for each; the focus is on increasing scalability.

Example: mesh size field of air bubbles distributing in a tube (a segment of the model; 64 bubbles total).
- Initial mesh: uniform, 17 million mesh regions.
- Adapted mesh: 160 air bubbles, 2.2 billion mesh regions.
- Multiple predictive load balance steps were used to make the adaptation possible; larger meshes are possible (not out of memory).
[Figure: initial and adapted mesh (zoom of a bubble), colored by magnitude of the mesh size field.]

Strong scaling tests:

Uniform refinement on Ranger, 4.3M to 2.2B elements:
# of parts    Time (s)    Scaling
2048          21.5        1.0
4096          11.2        0.96
8192          5.67        0.95
16384         2.73        0.99

Nonuniform field-driven refinement (with mesh optimization) on Ranger, 4.2M to 730M elements (time for dynamic load balancing not included):
# of parts    Time (s)    Scaling
2048          110.6       1.0
4096          57.4        0.96
8192          35.4        0.79

Nonuniform field-driven refinement (with mesh optimization operations) on Blue Gene/P, 4.2M to 730M elements (time for dynamic load balancing not included):
# of parts    Time (s)    Scaling
4096          173         1.0
8192          105         0.82
16384         66.1        0.65
32768         36.1        0.60

Adaptive loop construction
- Tightly coupled. Advantage: computationally efficient. Disadvantage: more complex code development. Example: explicit solution of cannon blasts.
- Loosely coupled. Advantage: ability to use existing analysis codes. Disadvantage: overhead of multiple structures and data conversion. Example: implicit high-order active flow control modeling.
[Figures: cannon blast solution snapshots at t = 2e-4 and t = 5e-4; active flow control case at t = 0.0.]

Adaptive loop components
- Adaptive loop driver (C++): coordinates API calls to execute the solve-adapt loop.
- phSolver (Fortran 90): flow solver scalable to 288k cores of BG/P; Field API.
- phParAdapt (C++): invokes parallel mesh adaptation via SCOREC FMDB and MeshAdapt, and Simmetrix MeshSim and MeshSimAdapt.
[Diagram: the adaptive loop driver exchanges control with phSolver and phParAdapt; compact mesh and solution data, field data, mesh data, and base solution fields pass through the Field API.]

IPComMan
- General-purpose communication package built on top of MPI.
- Architecture-independent, neighborhood-based inter-processor communication.
Neighborhood in parallel applications
- Subset of processors exchanging messages during a specific communication round.
- Bounded by a constant, typically under 40, independent of the total number of processors.
Useful features of the library
- Automatic message packing.
- Management of sends and receives with non-blocking MPI functions.
- Asynchronous behavior unless otherwise specified.
- Support for a dynamically changing neighborhood during communication steps.
Buffer memory management
- Assembles messages in pre-allocated buffers for each destination.
- Sends each package out when its buffer size is reached.
- Provides memory allocation for both sending and receiving buffers.
- Handles constant or arbitrary message sizes.
Processor-neighborhood-domain concept
- Supports efficient communication to processor neighbors based on knowledge of neighborhoods.
- No collective call verifications if neighbors are fixed.
- If new neighbors are encountered, a collective call is performed to verify the correctness of communication.
Communication paradigm
- No need to verify and send the number of packages to neighbors; it is wrapped in the last buffer.
- If a processor has nothing to send to a neighbor, a constant is sent notifying that communication is done.
- No message ordering rule, saving communication time by processing the first available buffer.
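The following is a minimal sketch, in the spirit of IPComMan's neighborhood exchange, of packing several small messages per neighbor into one buffer and moving them with non-blocking sends and receives to known neighbors only. The ring neighborhood, message counts, and payloads are invented for illustration; the real library adds automatic packing, buffer memory management, and support for changing neighborhoods.

```cpp
// Hedged sketch of neighborhood-based, non-blocking exchange with per-neighbor
// message packing (not the IPComMan API itself).
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each rank talks only to its ring neighbors here; real neighborhoods come
  // from the mesh partition and stay bounded (typically under 40 ranks).
  std::vector<int> neighbors = { (rank + 1) % size, (rank + size - 1) % size };

  // Pack several small messages per neighbor into one buffer (message packing).
  const int msgsPerNeighbor = 3;
  std::vector<std::vector<double>> sendBuf(neighbors.size()),
                                   recvBuf(neighbors.size());
  for (size_t n = 0; n < neighbors.size(); ++n) {
    for (int m = 0; m < msgsPerNeighbor; ++m)
      sendBuf[n].push_back(100.0 * rank + m);  // placeholder payload
    recvBuf[n].resize(msgsPerNeighbor);
  }

  // Post non-blocking receives and sends to the known neighbors only;
  // no collective call is needed while the neighborhood stays fixed.
  std::vector<MPI_Request> reqs;
  for (size_t n = 0; n < neighbors.size(); ++n) {
    MPI_Request r;
    MPI_Irecv(recvBuf[n].data(), msgsPerNeighbor, MPI_DOUBLE,
              neighbors[n], 0, MPI_COMM_WORLD, &r);
    reqs.push_back(r);
    MPI_Isend(sendBuf[n].data(), msgsPerNeighbor, MPI_DOUBLE,
              neighbors[n], 0, MPI_COMM_WORLD, &r);
    reqs.push_back(r);
  }
  MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);

  if (rank == 0)
    std::printf("rank 0 received %zu packed buffers\n", recvBuf.size());

  MPI_Finalize();
  return 0;
}
```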
Tiling patterns to test the message flow control in a pseudo-unstructured neighborhood environment on 1024 cores:
- N/4 processors have 2 neighbors, N/8 have 3 neighbors, N/4 have 4 neighbors, 3N/16 have 5 neighbors, N/16 have 9 neighbors, N/16 have 14 neighbors, and N/16 have 36 neighbors.
- Sending and receiving 8-byte messages without buffering.

Predictive load balancing
- Mesh modification before load balancing can lead to memory problems; predictive load balancing performs a weighted dynamic load balance.
- The mesh metric field at any point P is decomposed into three unit directions (e1, e2, e3) and desired lengths (h1, h2, h3) in each corresponding direction.
- The volume of the desired element (tetrahedron) is h1*h2*h3/6, so the number of elements to be generated from a region e of volume V(e) is estimated as n(e) ~ 6*V(e)/(h1*h2*h3); these estimates serve as the balancing weights (a small sketch of this estimate appears below).
- Incremental redistribution of mesh entities improves the overall balance.

Partitioning using mesh adjacencies - ParMA
- Designed to improve balance for multiple entity types.
- Uses mesh adjacencies directly to determine the best candidates for movement.
- Current implementation is based on neighborhood diffusion.
- Table: region and vertex imbalance for an 8.8 million region uniform mesh on a bifurcation pipe model partitioned to different numbers of parts.
- Selection of vertices to be migrated: ones bounding a small number of elements; only vertices with one remote copy are considered, to avoid creating poorly shaped part boundaries.
- Vertex imbalance: from 14.3% to 5%; region imbalance: from 2.1% to 5%.

Enabling co-design of multi-layer exascale storage architectures
- Using the Rensselaer Optimistic Simulation System (ROSS) as a parallel simulation framework, we are building a highly detailed and accurate model of the BG/L torus network, enabling us to investigate contention of I/O and compute network traffic in potential exascale architectures.
- Do our simulations scale on today's leadership-class systems? [Figure: event rate scalability - event rate as a function of BG/L processors.]
- Do our models accurately reflect the behavior of existing hardware? [Figure: comparison of network torus latency, Blue Gene/L versus simulation.]

Mesh curving applied to 8-cavity cryomodule simulations
- 2.97 million curved regions; 1,583 invalid elements corrected, which leads to a stable simulation that executes 30% faster.
[Figure: mesh close-up before and after correcting invalid mesh regions, marked in yellow.]

FETD for short-range wakefield calculations
- Adaptively refined meshes have 1-1.5 million curved regions; a uniformly refined mesh using a small mesh size has 6 million curved regions.
[Figure: electric fields on the three refined curved meshes.]

Boundary-layer-based mesh adaptation
- The initial mesh has 7.1 million regions and is isotropic outside the boundary layer.
- The adapted mesh has 42.8 million regions (7.1M -> 10.8M -> 21.2M -> 33.0M -> 42.8M) and is anisotropic.

Multiscale simulation
- Links a microscale network model to a macroscale finite element continuum model.
- Collaborating with experimentalists at the University of Minnesota.
[Figures: macroscale model; microscale model.]
- Nano-void subjected to hydrostatic tension. [Figure: finite element discretization of the problem domain and dislocation structures.]
- Nano-indentation of a thin film. [Figure: concurrent model configuration at the 60th load step (3 Å indentation displacement).]
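A minimal sketch of the predictive load-balance weight estimate described above, assuming stand-in element data; in the real workflow the current volumes and desired sizes come from the mesh database and the mesh metric field, and the per-part totals feed the weighted dynamic load balancer.

```cpp
// Hedged sketch: estimate how many elements adaptation will create per region
// and accumulate the per-part weight used by predictive load balancing.
#include <cstdio>
#include <vector>

// Stand-in element record: current volume plus desired sizes at its centroid.
struct Element { double volume; double h1, h2, h3; };

// Estimated number of elements the adaptation will create from element e:
// current volume divided by the desired tetrahedron volume h1*h2*h3/6.
double predictedCount(const Element& e) {
  return 6.0 * e.volume / (e.h1 * e.h2 * e.h3);
}

int main() {
  // Placeholder elements for one part of the distributed mesh.
  std::vector<Element> part = { {1.0e-3, 0.05, 0.05, 0.05},
                                {1.0e-3, 0.02, 0.02, 0.10} };
  double weight = 0.0;
  for (const Element& e : part) weight += predictedCount(e);
  // This per-part weight is what the weighted dynamic load balance consumes.
  std::printf("predicted element count for this part: %.1f\n", weight);
  return 0;
}
```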
[Diagram: sub-domains of the nanoelectronics effort, spanning size scales from atoms/carriers to devices to circuits and the design, manufacture, and use/performance stages; colored regions mark 1st-principles CMOS modeling, simulation automation components, super-resolution lithography tools, mechanics of damage nucleation in devices, device simulation, reactive ion etching, variation-aware circuit design, and parallel computing methods, grouped into modeling/simulation development and technology development.]

Nanoelectronic device modeling
- As Si CMOS devices shrink, nanoelectronic effects emerge; input to the circuit level comes from atomic-level physics.
- Fermi-function based analysis gives way to quantum energy-level analysis: the Poisson and Schrödinger equations are reconciled iteratively, allowing for current predictions.
- Carrier dynamics respond to strain in increasingly complex ways, from mobility changes to tunneling effects.
- New functionalities might be exploited: single-electron transistors, graphene semiconductors, carbon nanotube conductors, and spintronics (encoding information into the charge carrier's spin).
[Figure: band diagram showing the Fermi level and the iterated Poisson/Schrödinger solutions.]

Super-resolution lithography
- Motivation: reducing feature size has made modeling of the underlying physics critical.
- In projective lithography, simple biases are not adequate; in holographic lithography, near-field phenomena are predominant.
- The modeling approach must be based on Maxwell's equations.
[Figures: projective lithography; holographic lithography.]
- Goal: develop unified computational algorithms for the design and analysis of super-resolution lithographic processes that model the underlying physics with high fidelity.

Reactive ion etching simulation
- To handle SRAM-scale systems, we expect much larger computational systems, e.g., 10^5-10^6 surface elements.
- Transport tracking scales O(n^2) with the number of surface elements n. It parallelizes well: every view factor can be computed completely independently of every other view factor, giving almost linear speedup (a small sketch appears at the end of this section).
- The computational complexity of the chemistry solver depends on the particular chemical mechanisms associated with the etch recipe; these also tend to be O(n^2).
[Figure: cut-away view of a reactive ion etch simulation of an aspect ratio 1.4 via into a dielectric substrate with 7% porosity and complete selectivity with respect to the underlying etch stop; a generic ion-radical etch model was used, with ~10^3 surface elements. Bloomfield et al., SISPAD 2003, IEEE.]

Strained silicon and dislocation nucleation
- At 90 nm and below, devices have come to rely on increased carrier mobility produced by strained silicon; as devices scale down, the relative importance of scattering centers increases.
- Can we have our cake and eat it too? How much strain can be built into a given device before processing variations and thermo-mechanical loads during use cause critical dislocation shedding?
- Continuum FEM calculations automatically identify critical high-stress regions; a local atomistic problem is constructed and an MD simulation is run, looking for criticality; results feed back to the continuum model.

Meshing support
- Advanced meshing tools and expertise exist at RPI and its associated spin-off.
- Leverage these tools to support CCNI projects such as the advanced device modeling.
- Local refinement and adaptivity can help carry the computational resources further: "more bang for the buck."
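As a sketch of why the view-factor (transport-tracking) step parallelizes so well, the example below block-distributes the rows of the n x n view-factor matrix across MPI ranks and evaluates each entry independently. The geometry-free viewFactor() kernel, problem size, and checksum are invented for illustration; a production code would use the real surface discretization and visibility tests.

```cpp
// Hedged sketch of embarrassingly parallel view-factor evaluation.
#include <mpi.h>
#include <algorithm>
#include <cstdio>

// Placeholder pairwise kernel; stands in for the geometric view-factor integral.
double viewFactor(int i, int j) {
  return (i == j) ? 0.0 : 1.0 / (1.0 + (i - j) * (i - j));
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int n = 1000;  // number of surface elements (toy size)
  // Block-distribute the rows of the n x n view-factor matrix; every entry is
  // independent of every other, which is why the O(n^2) tracking step scales.
  const int rowsPerRank = (n + size - 1) / size;
  const int rowBegin = rank * rowsPerRank;
  const int rowEnd = std::min(n, rowBegin + rowsPerRank);

  double localSum = 0.0;  // simple checksum over this rank's block
  for (int i = rowBegin; i < rowEnd; ++i)
    for (int j = 0; j < n; ++j)
      localSum += viewFactor(i, j);

  double globalSum = 0.0;
  MPI_Reduce(&localSum, &globalSum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("checksum over all %d x %d view factors: %g\n", n, n, globalSum);

  MPI_Finalize();
  return 0;
}
```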