Tools - Scientific Computation Research Center

NYS High Performance Computation Consortium (HPC2), funded by NYSTAR at $1M/year for 3 years
 Goal is to provide NY State users with support in the application of HPC technologies in:

 Research and discovery
 Product development
 Improved engineering and manufacturing
processes

The HPC2 is a distributed activity - participants
 Rensselaer, Stony Brook/Brookhaven, SUNY
Buffalo, NYSERNET




Xerox
Corning
ITT Fluid Technologies: Goulds Pumps
Global Foundries
Objectives
 Demonstrate end-to-end solution of two-phase flow
problems.
 Couple with structural mechanics boundary condition.
 Provide interfaced, efficient and reliable software suite
for guiding design.
Tools
 Simmetrix SimAppS Graphical Interface – mesh
generation and problem definition
 PHASTA – two-phase level set flow solver
 PhParAdapt – solution transfer and mesh adaptation
driver
 Kitware Paraview – visualization
Systems
 CCNI BG/L, CCNI Opterons Cluster
Fluid ejected into air; ran on 4,000 CCNI BG/L cores.
Six iterations of mesh adaptation on the two-phase simulation.
Ran autonomously on 128 cores of the CCNI Opterons for approximately 4 hours.





Initial work interfaces the simulations through serial file formats for displacement and pressure data.
The structural mechanics simulation runs in serial; the PHASTA simulation runs in parallel.
Serial displacement data is distributed to the partitioned PHASTA mesh.
Partitioned PHASTA nodal pressure data is aggregated into a serial input file (a sketch of this distribute/aggregate step is given below).
Modifications were made to the automated mesh adaptation Perl script.
Structural Mechanics Mesh of Input Face
PHASTA Partitioned Mesh of Input Face
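A minimal sketch of that distribute/aggregate step, assuming a plain-text serial file and a per-part local-to-global node map (both assumptions; this is not the project's coupling code):

#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  // Stand-in for the part's local-to-global node map (read from the partition in practice).
  std::vector<int> localToGlobal = {2 * rank, 2 * rank + 1};
  const int nGlobal = 2 * nranks;

  // Distribute: rank 0 reads the serial displacement file (stand-in values here),
  // broadcasts it, and each part keeps only the entries for its own nodes.
  std::vector<double> globalDisp(nGlobal, 0.0);
  if (rank == 0)
    for (int i = 0; i < nGlobal; ++i) globalDisp[i] = 0.001 * i;
  MPI_Bcast(globalDisp.data(), nGlobal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  std::vector<double> localDisp;
  for (int g : localToGlobal) localDisp.push_back(globalDisp[g]);

  // Aggregate: each part places its nodal pressures (stand-in values) into the
  // global slots it owns; a reduce assembles the serial array on rank 0,
  // assuming each node is written by exactly one owning part.
  std::vector<double> contrib(nGlobal, 0.0), globalP(nGlobal, 0.0);
  for (int i = 0; i < (int)localToGlobal.size(); ++i)
    contrib[localToGlobal[i]] = 100.0 + rank;
  MPI_Reduce(contrib.data(), globalP.data(), nGlobal, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    FILE* f = std::fopen("pressure_serial.txt", "w");  // assumed output file name
    if (f) {
      for (int i = 0; i < nGlobal; ++i) std::fprintf(f, "%d %g\n", i, globalP[i]);
      std::fclose(f);
    }
  }
  MPI_Finalize();
  return 0;
}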
Objectives
 Demonstrate capability of available computational
tools/resources for parallel simulation of highly
viscous sheet flows.
 Solve a model sheet flow problem relevant to the
actual process/geometry.
 Develop and define processes for high fidelity twin
screw extruder parallel CFD simulation.
Investigated Tools (to date)
 ACUSIM AcuConsole and AcuSolve, Simmetrix
MeshSim, Kitware Paraview
Systems
 CCNI Opterons Cluster

High Aspect Ratio Sheet
 Aspect ratio: 500:1
 Element count: 1.85 Million
 7 mins on 512 cores
 300 mins on 8 cores
 Mesh generation in Simmetrix
SimAppS graphical interface.
 Gaps that are ~1/180 of the large feature dimension.
Conceptual Rendering
of Single Screw Extruder
Assembly*
Single Screw Extruder CAD**
* http://en.wikipedia.org/wiki/Plastics_extrusion
** https://sites.google.com/site/oscarsalazarcespedescaddesign/project03
Objectives
 Apply HPC systems and software to set up and run 3D pump flow simulations in hours instead of days.
 Provide automated mesh generation for fluid
geometries with rotating components.
Tools
 ACUSIM Suite, PHASTA, ANSYS CFX,
FMDB, Simmetrix MeshSim, Kitware
Paraview
Systems
 CCNI Opterons Cluster
 AcuConsole Interface
 Problem definition, mesh
generation, runtime monitor,
and data visualization
Simmetrix provided a customized mesh generation and problem definition GUI after iterating with the industrial partner.
 Supports automated identification of pump
geometric model features and application of
attributes
 Problem definition with support for exporting data
for multiple CFD analysis tools.
 Reduced mesh generation time frees engineers to focus on simulation and design optimizations → improved products



Goal: Develop simulation technologies that
allow practitioners to evaluate systems of
interest.
To meet this goal we
 Develop adaptive methods for reliable simulations
 Develop methods to do all computation on
massively parallel computers
 Develop multiscale computational methods
 Develop interoperable technologies that speed
simulation system development
 Partner on the construction of simulation systems
for specific applications in multiple areas

Software available (http://www.scorec.rpi.edu/software.php)
 Some tools not yet linked – email shephard@rpi.edu with any
questions

Simulation Model and Data Management
 Geometric model interface to interrogate CAD models
 Parallel mesh topological representation
 Representation of tensor fields
 Relationship manager

Parallel Control
 Neighborhood aware message packing - IPComMan
 Iterative mesh partition improvement with multiple criteria - ParMA
 Processor mesh entity reordering to improve cache performance

Adaptive Meshing
 Adaptive mesh modification
 Mesh curving

Adaptive Control
 Support for executing parallel adaptive unstructured mesh
flow simulations with PHASTA
 Adaptive multimodel simulation infrastructure

Analysis
 Parallel Hierarchic Adaptive Stabilized Transient Analysis
software for compressible or incompressible, laminar or
turbulent, steady or unsteady flows on 3D unstructured
meshes (with U. Colorado)
 Parallel hierarchic multiscale modeling of soft tissues
Interoperable Technologies for Advanced
Petascale Simulations (ITAPS)
Petascale integrated tools (AMR front tracking, shape optimization, solution adaptive loop, solution transfer, petascale mesh generation) build on component tools (front tracking, smoothing, mesh adapt, swapping, interpolation kernels, dynamic services), which are unified by common interfaces (mesh, geometry, relations, field) and geometry/mesh services.
Excellent strong scaling
 Implicit time integration
 Employs the partitioned mesh for system formulation and solution
 A specific number of ALL-REDUCE communications is also required
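The role of those global reductions can be seen in a minimal sketch (not PHASTA source): the on-part work is local, while each iteration of the implicit solve needs a small, fixed number of all-reduces for dot products and norms.

#include <mpi.h>
#include <vector>
#include <numeric>
#include <cstdio>

// Global dot product over a distributed vector: a local sum plus one MPI_Allreduce.
double dotGlobal(const std::vector<double>& a, const std::vector<double>& b) {
  double local = std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
  double global = 0.0;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  return global;
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  std::vector<double> r(1000, 1.0), p(1000, 0.5);   // stand-in per-part vectors
  double rr = dotGlobal(r, r);                      // one all-reduce per dot product
  double rp = dotGlobal(r, p);
  if (rank == 0) std::printf("r.r = %g, r.p = %g\n", rr, rp);
  MPI_Finalize();
  return 0;
}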
105M vertex mesh (CCNI Blue Gene/L)
#Proc.    El./core    t (sec)    scale
512       204,800     2120       1
1,024     102,400     1052       1.01
2,048     51,200      529        1.00
4,096     25,600      267        0.99
8,192     12,800      131        1.02
16,384    6,400       64.5       1.03
32,768    3,200       35.6       0.93

1 billion element anisotropic mesh on Intrepid Blue Gene/P
# of cores    Rgn imb    Vtx imb    Time (s)    Scaling
16k           2.03%      7.13%      222.03      1
32k           1.72%      8.11%      112.43      0.987
64k           1.6%       11.18%     57.09       0.972
128k          5.49%      17.85%     31.35       0.885

AAA 5B elements: full-system scale on Jugene (IBM BG/P system)
Without ParMA partition improvement, the strong scaling factor is 0.88 (time is 70.5 secs).
Can yield 43 cpu-years of savings for production runs!

Requires functional support for
 Mesh distribution
 Mesh level inter-processor communications
 Parallel mesh modification
 Dynamic load balancing
Have parallel implementations for each – focusing on increasing scalability
Mesh size field of air bubbles distributing in a tube (segment of the model – 64 bubbles total)
 Initial mesh: uniform, 17 million mesh regions
 Adapted mesh: 160 air bubbles, 2.2 billion mesh regions
 Multiple predictive load balance steps used to make the adaptation possible
 Larger meshes possible (memory was not exhausted)
Initial and adapted mesh (zoom of a bubble), colored by the magnitude of the mesh size field



Test of strong scaling: uniform refinement on Ranger, 4.3M to 2.2B elements
# of Parts    Time (s)    Scaling
2048          21.5        1.0
4096          11.2        0.96
8192          5.67        0.95
16384         2.73        0.99

Nonuniform field driven refinement (with mesh optimization) on Ranger, 4.2M to 730M elements (time for dynamic load balancing not included)
# of Parts    Time (s)    Scaling
2048          110.6       1.0
4096          57.4        0.96
8192          35.4        0.79

Nonuniform field driven refinement (with mesh optimization operations) on Blue Gene/P, 4.2M to 730M elements (time for dynamic load balancing not included)
# of Parts    Time (s)    Scaling
4096          173         1.0
8192          105         0.82
16384         66.1        0.65
32768         36.1        0.60
Adaptive Loop Construction
 Tightly coupled
 Adv: Computationally efficient
 Disadv: More complex code development
 Example: Explicit solution of cannon blasts
 Loosely coupled
 Adv: Ability to use existing analysis codes
 Disadv: Overhead of multiple structures and data conversion
 Example: Implicit high-order active flow control modeling
Solution snapshots at t = 0.0, 2e-4, and 5e-4 (figure annotations)

Adaptive Loop Driver – C++
 Coordinates API calls to execute solve-adapt loop

phSolver – Fortran 90
 Flow solver scalable to 288k cores of BG-P, Field API

phParAdapt – C++
 Invokes parallel mesh adaptation
▪ SCOREC FMDB and MeshAdapt, Simmetrix MeshSim and
MeshSimAdapt
Component diagram: the Adaptive Loop Driver issues control calls to phSolver and phParAdapt; field data and base solution fields pass through each component's Field API, and compact mesh and solution data and mesh data are exchanged between the components.
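A hedged sketch of the solve-adapt loop the driver coordinates (the function names are illustrative placeholders, not the actual phSolver/phParAdapt APIs):

#include <cstdio>

struct Mesh {};            // partitioned mesh handle (placeholder)
struct Fields {};          // solution/size field handle (placeholder)

Fields runFlowSolver(Mesh&, Fields in) { return in; }              // phSolver step (stub)
bool   estimateError(const Mesh&, const Fields&) { return true; }  // does the mesh need adapting?
void   adaptMesh(Mesh&, const Fields&) {}                          // parallel mesh adaptation (stub)
Fields transferSolution(const Mesh&, const Fields& f) { return f; } // map fields to the new mesh

int main() {
  Mesh mesh;
  Fields fields;
  const int maxAdaptCycles = 6;                    // e.g., six adaptation iterations as above
  for (int cycle = 0; cycle < maxAdaptCycles; ++cycle) {
    fields = runFlowSolver(mesh, fields);          // advance the flow solution
    if (!estimateError(mesh, fields)) break;       // mesh already adequate
    adaptMesh(mesh, fields);                       // adapt the partitioned mesh in parallel
    fields = transferSolution(mesh, fields);       // transfer the solution to the adapted mesh
    std::printf("completed solve-adapt cycle %d\n", cycle + 1);
  }
  return 0;
}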

General-purpose communication package built on top of MPI
 Architecture-independent, neighborhood-based inter-processor communications.

 Neighborhood in parallel applications
 Subset of processors exchanging messages during a specific communication round.
 Bounded by a constant, typically under 40, independent of the total number of processors.

 Several useful features of the library
 Automatic message packing.
 Management of sends and receives with non-blocking MPI functions.
 Asynchronous behavior unless otherwise specified.
 Support for a dynamically changing neighborhood during communication steps.

 Buffer Memory Management
 Assemble messages in pre-allocated buffers for each destination.
 Send each package out when its buffer size is reached.
 Provide memory allocation for both sending and receiving buffers.
 Deal with constant or arbitrary message sizes.
Processor-Neighborhood-Domain Concept
 Supports efficient communication to processor neighbors based on knowledge of the neighborhood.
 No collective-call verification is needed if the neighbors are fixed.
 If new neighbors are encountered, a collective call is performed to verify the correctness of the communication.

Communication Paradigm
 No need to verify and send the number of packages to each neighbor; it is wrapped in the last buffer.
 If there is nothing to send to a neighbor, a constant is sent notifying that the communication is done.
 No message ordering is enforced, which saves communication time by processing the first available buffer.
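A minimal sketch of the packing idea (not the IPComMan API; the 4 KB buffer size and the message tag are assumptions): small messages bound for the same neighbor are appended to one buffer and shipped with non-blocking MPI calls, with a buffer sent early if it fills. A matching probe/receive loop on the destination (not shown) unpacks the bytes.

#include <mpi.h>
#include <cstddef>
#include <cstring>
#include <map>
#include <vector>

class NeighborPacker {
  static constexpr std::size_t kBufBytes = 4096;     // assumed per-neighbor buffer size
  std::map<int, std::vector<char>> buf_;             // destination rank -> bytes being packed
  std::vector<std::vector<char>> inFlight_;          // buffers kept alive until Waitall
  std::vector<MPI_Request> reqs_;

 public:
  // Append one small message; send this neighbor's buffer early if it fills.
  void pack(int dest, const void* msg, std::size_t bytes) {
    auto& b = buf_[dest];
    if (b.size() + bytes > kBufBytes) flush(dest);
    const std::size_t old = b.size();
    b.resize(old + bytes);
    std::memcpy(b.data() + old, msg, bytes);
  }
  // Ship one neighbor's buffer with a non-blocking send.
  void flush(int dest) {
    auto& b = buf_[dest];
    if (b.empty()) return;
    inFlight_.emplace_back(std::move(b));            // keep the bytes valid during Isend
    b.clear();
    reqs_.emplace_back();
    MPI_Isend(inFlight_.back().data(), static_cast<int>(inFlight_.back().size()),
              MPI_BYTE, dest, /*tag=*/99, MPI_COMM_WORLD, &reqs_.back());
  }
  // Flush all remaining buffers and complete the outstanding sends.
  void finish() {
    for (auto& kv : buf_) flush(kv.first);
    MPI_Waitall(static_cast<int>(reqs_.size()), reqs_.data(), MPI_STATUSES_IGNORE);
    inFlight_.clear();
    reqs_.clear();
  }
};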
Tiling patterns to test the message flow control in a pseudo-unstructured neighborhood environment on 1024 cores:
N/4 processors have 2 neighbors, N/8 have 3 neighbors, N/4 have 4 neighbors, 3N/16 have 5 neighbors, N/16 have 9 neighbors, N/16 have 14 neighbors, and N/16 have 36 neighbors.
Sending and receiving 8-byte messages without buffering.


Mesh modification before load balancing can lead to memory problems – predictive load balancing performs a weighted dynamic load balance
 The mesh metric field at any point P is decomposed into three unit directions (e1, e2, e3) and a desired length (h1, h2, h3) in each corresponding direction.
 The volume of the desired element (tetrahedron) is h1·h2·h3/6.
 Estimate the number of elements to be generated:
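A plausible form of this estimate (an assumption; the original expression is not reproduced in the text above): each current element contributes its volume divided by the locally desired element volume,

N_{\mathrm{est}} \;\approx\; \sum_{e \,\in\, \mathrm{mesh}} \frac{V_e}{\tfrac{1}{6}\,\bar h_1 \bar h_2 \bar h_3}

where V_e is the current volume of element e and \bar h_i are the desired lengths h_i evaluated over e. Summing per part then gives the weights used in the weighted dynamic load balance.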


Incremental redistribution of mesh entities to improve overall balance
Partitioning using Mesh Adjacencies – ParMA
 Designed to improve the balance of multiple entity types
 Uses mesh adjacencies directly to determine the best candidates for movement
 Current implementation is based on neighborhood diffusion
Table: Region and vertex imbalance for an 8.8 million region uniform mesh on a bifurcation pipe model partitioned to different numbers of parts
 Selection of vertices to be migrated: ones bounding a small number of elements
 Only vertices with a single remote copy are considered, to avoid creating poor part boundaries
 Vertex imbalance: from 14.3% to 5%
 Region imbalance: from 2.1% to 5%
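A hedged sketch of the selection heuristic just described (the BoundaryVertex fields stand in for mesh adjacency queries; this is not ParMA source):

#include <algorithm>
#include <cstddef>
#include <vector>

struct BoundaryVertex {
  int id;                  // local vertex id
  int numRemoteCopies;     // parts (other than this one) holding a copy
  int numBoundedElements;  // on-part elements adjacent to the vertex
  int neighborPart;        // the single remote part when numRemoteCopies == 1
};

// Returns migration candidates targeting 'lightNeighbor', best candidates first.
std::vector<BoundaryVertex> selectForMigration(
    const std::vector<BoundaryVertex>& boundary, int lightNeighbor, std::size_t maxToMove) {
  std::vector<BoundaryVertex> candidates;
  for (const auto& v : boundary)
    if (v.numRemoteCopies == 1 && v.neighborPart == lightNeighbor)
      candidates.push_back(v);                 // one remote copy keeps part boundaries clean
  std::sort(candidates.begin(), candidates.end(),
            [](const BoundaryVertex& a, const BoundaryVertex& b) {
              return a.numBoundedElements < b.numBoundedElements;  // fewest bounded elements first
            });
  if (candidates.size() > maxToMove) candidates.resize(maxToMove);
  return candidates;
}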
Enabling Co-Design of Multi-Layer Exascale
Storage Architectures
Using the Rensselaer Optimistic Simulation System (ROSS) as a parallel simulation framework, we
are building a highly detailed and accurate model of the BG/L Torus network, enabling us to investigate
contention of I/O and compute network traffic in potential exascale architectures.
Do our simulations scale on today’s
leadership-class systems?
Event Rate Scalability: Event rate as
a function of BG/L processors
Do our models accurately reflect
behavior of existing hardware?
Comparison of Network Torus Latency:
Blue Gene/L versus Simulation
• Mesh curving applied to 8-cavity cryomodule simulations
• 2.97 million curved regions
• 1,583 invalid elements corrected – leads to a stable simulation that executes 30% faster
Mesh close-up before and after correcting invalid mesh regions (marked in yellow)
• FETD for short-range wakefield calculations
▪ Adaptively refined meshes have 1–1.5 million curved regions
▪ A uniformly refined mesh using a small mesh size has 6 million curved regions
Electric fields on the three refined curved meshes (figure)
Boundary layer based mesh adaptation: the initial mesh has 7.1 million regions and is isotropic outside the boundary layer; the adapted mesh is anisotropic with 42.8 million regions (7.1M → 10.8M → 21.2M → 33.0M → 42.8M).
•
Multiscale simulation
linking microscale network
model to a macroscale
finite element continuum
model.
• Collaborating with
experimentalists at the
University of Minnesota
Macroscale Model
Microscale Model
Nano-void subjected to hydrostatic
tension. Finite element discretization
of the problem domain and
dislocation structures.
Nano-indentation of a thin film.
Concurrent model configuration at the 60th load step (3 Å indentation displacement).
Colors represent the sub-domains in which
Overview diagram: efforts span the size scale from atoms/carriers through devices to circuits, and the design / manufacture / use-performance cycle; labeled elements include 1st principles CMOS modeling, Simulation Automation Components, super-resolution lithography tools, mechanics of damage nucleation in devices, device simulation, reactive ion etching, variation-aware circuit design, parallel computing methods, modeling/simulation development, and technology development.

As Si CMOS devices shrink, nanoelectronic effects emerge; input to the circuit level comes from atomic-level physics
 Fermi-function based analysis gives way to quantum energy-level analysis.
 Poisson and Schrödinger equations are reconciled iteratively, allowing for current predictions (a generic iteration is sketched below).
 Carrier dynamics respond to strain in increasingly complex ways, from mobility changes to tunneling effects.
 New functionalities might be exploited
▪ Single-electron transistors
▪ Graphene semiconductors
▪ Carbon nanotube conductors
▪ Spintronics – encoding information into the charge carrier’s spin
Diagram: iterative solution of the Poisson and Schrödinger equations about the Fermi level (energy E).
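A generic sketch of that iterative reconciliation (a standard self-consistent Poisson–Schrödinger loop, not necessarily the exact scheme used in this work):

1. Solve Schrödinger with the current potential: \; -\frac{\hbar^2}{2m^*}\nabla^2\psi_i + V^{(k)}\psi_i = E_i\,\psi_i
2. Build the carrier density from the occupied states: \; n^{(k)} = \sum_i N_i\,|\psi_i|^2
3. Solve Poisson for the updated potential: \; \nabla\cdot\big(\varepsilon\,\nabla\phi^{(k+1)}\big) = -q\,\big(p - n^{(k)} + N_D^+ - N_A^-\big), with V^{(k+1)} built from \phi^{(k+1)}
4. Repeat until \|V^{(k+1)} - V^{(k)}\| falls below a tolerance; currents are then predicted from the converged states.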

Motivation:
 Reducing feature size has made the modeling of the underlying physics critical.
 In projective lithography, simple biases are not adequate
 In holographic lithography, near-field phenomena are predominant
 The modeling approach must be based on Maxwell’s equations

Projective Lithography
Holographic Lithography
Goal:
 Develop unified computational
algorithms for the design and analysis of
super-resolution lithographic processes
that model the underlying physics
with high fidelity

To handle SRAM-scale systems, we expect much larger computational systems, e.g., 10^5 - 10^6 surface elements.
 Transport tracking scales as O(n^2) with the number of surface elements n.
▪ Parallelizes well – every view factor can be computed completely independently of every other view factor, giving almost linear speed-up (sketched below).
 The computational complexity of the chemistry solver depends upon the particular chemical mechanisms associated with the etch recipe; these tend to be O(n^2).
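A minimal sketch of that parallel structure (viewFactor() and the round-robin row distribution are placeholders, not the actual transport code): the n*n view factors are mutually independent, so rows can simply be divided among ranks.

#include <mpi.h>
#include <cstdio>
#include <vector>

double viewFactor(int i, int j) {                 // placeholder geometric kernel
  return (i == j) ? 0.0 : 1.0 / (1.0 + (i - j) * (i - j));
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  const int n = 1000;                             // number of surface elements (toy size)
  double localSum = 0.0;                          // stand-in for storing F(i,j)
  for (int i = rank; i < n; i += nranks)          // round-robin row distribution
    for (int j = 0; j < n; ++j)
      localSum += viewFactor(i, j);               // each entry independent of the rest

  double total = 0.0;
  MPI_Reduce(&localSum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) std::printf("sum of view factors: %g\n", total);
  MPI_Finalize();
  return 0;
}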
Cutaway view of a reactive ion etch simulation of an aspect ratio 1.4 via into a dielectric substrate with 7% porosity and complete selectivity with respect to the underlying etch stop. A generic ion-radical etch model was used. ~10^3 surface elements. [Bloomfield et al., SISPAD 2003, IEEE.]
 At 90 nm and below, devices have come to rely on increased carrier mobility
produced by strained silicon.
 As devices scale down, the relative importance of scattering centers
increases.
 Can we have our cake and eat it too? How much strain can be built into a
given device before processing variations and thermo-mechanical load
during use cause critical dislocation shedding?
Continuum FEM calculations
automatically identify critical
high-stress regions.
A local atomistic problem is constructed and
an MD simulation is run, looking for criticality.
Results feed back to continuum.
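A hedged sketch of that feedback workflow (all types and functions here are illustrative placeholders, not the project's code): a continuum FE step flags high-stress regions, a local atomistic (MD) problem is built and run there, and the result is fed back into the continuum model.

#include <cstdio>
#include <vector>

struct Region { int id; double stress; };

std::vector<Region> feSolveStep(double load) {               // continuum FE step (stub)
  return { {0, 0.4 * load}, {1, 1.3 * load}, {2, 0.7 * load} };
}
bool mdFindsDislocation(const Region& r) {                   // local MD run (stub)
  return r.stress > 1.5;                                     // "criticality" check
}
void feedBackToContinuum(const Region& r) {                  // update the continuum model (stub)
  std::printf("  region %d: dislocation shed, relaxing continuum stress\n", r.id);
}

int main() {
  const double stressThreshold = 1.0;                        // assumed flagging criterion
  for (int step = 1; step <= 3; ++step) {                    // thermo-mechanical load steps
    std::vector<Region> regions = feSolveStep(step * 0.6);
    for (const Region& r : regions)
      if (r.stress > stressThreshold && mdFindsDislocation(r))  // flag, then run local MD
        feedBackToContinuum(r);
  }
  return 0;
}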
 Advanced meshing tools and expertise exist at RPI and an associated spin-off
 Leverage these tools to support CCNI projects such as advanced device modeling.
 Local refinement and
adaptivity can help carry the
computation resources
further. “More bang for the
buck.”