
Preparing for Science at the Petascale:
Real Science on up to Tens of Thousands of Cores
Radhika S. Saksena1, Bruce Boghosian2, Luis Fazendeiro1, Owain A. Kenway, Steven Manos1,
Marco Mazzeo1, S. Kashif Sadik1, James L. Suter1, David Wright1 and Peter V. Coveney1
1. Centre for Computational Science, UCL, UK
2. Tufts University, Boston, USA
Contents
• Moving towards the petascale
- HECToR (XT4 and X2)
- Ranger (TACC)
- Intrepid (ALCF)
• Scientific fields of research:
- Turbulence
- Liquid crystalline rheology
- Clay-polymer nanocomposites
- HIV drug resistance
- Patient-specific haemodynamics
• Conclusions
2
New era of petascale machines
Ranger (TACC) - NSF funded Sun cluster
• 0.58 petaflops (theoretical) peak: ~10 times HECToR (59 Tflops)
• "Bigger" than all other TeraGrid resources combined
• Linpack speed 0.31 petaflops; 123 TB memory
• Architecture: 82 racks; 1 rack = 4 chassis; 1 chassis = 12 nodes
• 1 node = Sun Blade x6420 (four quad-core AMD Opteron processors, i.e. 16 cores per node)
• 3,936 nodes = 62,976 cores
Intrepid (ALCF) - DOE funded BlueGene/P
• 0.56 petaflops (theoretical) peak
• 163,840 cores; 80 TB memory
• Linpack speed 0.45 petaflops
• "Fastest" machine available for open science, and third fastest overall1
3
1. http://www.top500.org/lists/2008/06
New era of petascale machines
• US firmly committed to the path to petascale (and beyond)
• NSF: Ranger (5 years, $59 million award)
• University of Tennessee to build a system with just under 1 PF peak performance ($65 million, 5-year project)1
• "Blue Waters" will come online in 2011 at NCSA ($208 million grant), using IBM technology, to deliver 10 Pflops peak performance (~200K cores, 10 PB of disk)
1. http://www.nsf.gov/news/news_summ.jsp?cntn_id=109850
4
New era of petascale machines
• We wish to do new science at this scale – not just incremental
advances
• Applications that scale linearly up to tens of thousands of cores
(large system sizes, many time steps) – capability computing at
petascale
• High throughput for “intermediate scale” applications
(in the 128 – 512 core range)
5
Intercontinental HPC grid environment
[Diagram: intercontinental grid spanning the US TeraGrid (NCSA, SDSC, PSC, ANL/Intrepid, TACC/Ranger), the UK NGS (Leeds, Manchester, Oxford, RAL), HECToR, HPCx and DEISA, federated through the AHE]
• Massive data transfers
• Lightpaths
• Advanced reservation/co-scheduling
• Emergency/pre-emptive access
6
Lightpaths - Dedicated 1 Gb UK/US network


• JANET Lightpath is a centrally managed service which supports large research projects on the JANET network by providing end-to-end connectivity, from 100s of Mbit/s up to whole fibre wavelengths (10 Gbit/s).
• Typical usage:
– Dedicated 1 Gbit/s network connecting to national and international HPC infrastructure
– Shifting TB datasets between the UK and US
– Real-time visualisation
– Interactive computational steering
– Cross-site MPI runs (e.g. between NGS2 Manchester and NGS2 Oxford)
7
Advanced reservations
• Plan in advance to have access to the resources
- Process of reserving multiple resources for use by a single application
- HARC1 - Highly Available Resource Co-Allocator
- GUR2 - Grid Universal Remote
• Can reserve the resources:
– For the same time:
• Distributed MPIg/MPICH-G2 jobs
• Distributed visualization
• Booking equipment (e.g. visualization facilities)
– Or some coordinated set of times
– Computational workflows
• Urgent computing and pre-emptive access (SPRUCE)
1. http://www.realitygrid.org/middleware.shtml#HARC
2. http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/TGIA64LinuxCluster/Doc/coschedule.html
8
Application Hosting Environment
• Middleware which simplifies access to distributed resources and manages workflows
• Wrestling with middleware can't be a limiting step for scientists - it hides the complexities of the 'grid' from the end user
• Applications are stateful Web services
• An application can consist of a coupled model, a parameter sweep, a steerable application, or a single executable
9
HYPO4D - Hydrodynamic periodic orbits in 4D
• Scientific goal: to identify and characterize unstable periodic orbits (UPOs) in driven turbulent fluid flow, from which exact time averages can be computed
• Approach parallelizes space and time
• A well-known formalism exists through which observables can be computed from the dynamical zeta function
• Accuracy is high and converges quickly with the number of known lower-period UPOs
• No need to redo the initial value problem every time we wish to compute the average of some quantity
• Observables are no longer stochastic in nature -- we have an exact expansion to compute them
10
HYPO4D
• Uses the lattice-Boltzmann method: stream and collide (local collision operator + linear streaming operator)
• Full locality with nearest-neighbour communications
• Code written in C, using MPI for communications; halo-swapping done strictly through non-blocking (asynchronous) MPI send/receive calls, as in the sketch below
• No aggressive optimization pursued, so as to run on as many platforms as possible
11
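Below is a minimal C/MPI sketch of the non-blocking halo-swapping pattern described above. It is not taken from the HYPO4D source: the one-dimensional slab decomposition, the array layout and all names (NLOCAL, NQ, f, halo.c) are illustrative assumptions.

```c
/* Minimal sketch of non-blocking halo swapping along one decomposed
 * dimension, in the spirit of the communication pattern described above.
 * Not HYPO4D code: names and the 1D decomposition are illustrative.
 * Compile with: mpicc halo.c -o halo */
#include <mpi.h>
#include <stdlib.h>

#define NLOCAL 64   /* lattice sites owned by this rank (1D slab) */
#define NQ     19   /* distribution functions per site (e.g. D3Q19) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local slab plus one halo layer on each side */
    double *f = calloc((size_t)(NLOCAL + 2) * NQ, sizeof(double));

    int left  = (rank - 1 + size) % size;   /* periodic neighbours */
    int right = (rank + 1) % size;

    MPI_Request req[4];

    /* post asynchronous receives into the two halo layers first */
    MPI_Irecv(&f[0],                 NQ, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(&f[(NLOCAL + 1) * NQ], NQ, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);

    /* send the outermost owned layers to the neighbours, also asynchronously */
    MPI_Isend(&f[1 * NQ],      NQ, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(&f[NLOCAL * NQ], NQ, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);

    /* collision on interior sites could overlap with communication here */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    /* ... the streaming step can now read valid halo data ... */

    free(f);
    MPI_Finalize();
    return 0;
}
```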
HYPO4D
• HYPO4D performance: linear scaling up to 16K cores; 70% speed-up from 16,384 to 32,768 cores
12
HYPO4D
• HYPO4D performance: linear scaling up to 33K cores; 38% speed-up from 32,768 to 65,536 cores
13
HYPO4D
• The HECToR XT4 results are the best so far in terms of timing, i.e. code performance: approximately 50% faster than Ranger
• The value of SUPS (lattice-site updates per second) remains practically constant, and higher than on Intrepid or even Ranger
14
HYPO4D
• Amdahl's Law for vectorization: the performance of a program is dominated by its slowest component, which in the case of a vectorized program is the scalar code
(http://www.pcc.qub.ac.uk/tec/courses/cray/ohp/CRAY-slides_3.html)
• Amdahl's Law - for vector code which runs Rv times faster than scalar code, the maximum expected speedup is:

  Sv = 1 / (Fs + Fv / Rv)

• Sv = maximum expected speedup
• Fv = fraction of the program that is vectorized
• Fs = fraction of the program that is scalar = 1 - Fv
• Rv = ratio of scalar to vector processing time
15
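As a rough worked example (the numbers are illustrative, taken from the typical ranges quoted on the next slide rather than from HYPO4D measurements): with Fv = 0.75 and Rv = 15,

\[
S_v = \frac{1}{F_s + F_v/R_v} = \frac{1}{0.25 + 0.75/15} = \frac{1}{0.30} \approx 3.3,
\]

which is consistent with the 2-4 times speedups at which vectorization efforts are typically stopped.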
HYPO4D
• For Cray Research systems Rv ranges from 10 to 20.
• It is common for programs to be 70% to 80% vectorized, i.e. 70% to 80% of their running time is spent executing vector instructions.
• It is not always easy to reach 70% to 80% vectorization in a program, and vectorizing beyond this level becomes increasingly difficult, normally requiring major changes to the algorithm.
• Many users stop their vectorization efforts once the vectorized code is
running 2 to 4 times faster than scalar code.
16
HYPO4D
• X2 (the Black Widow component) compared with the XT4
• Excellent performance with some simple code changes (such as changing the order of loops and better management of I/O calls); still room for improvement (work in progress)
• For 16 and 32 cores, already a 64% increase in SUPS, with NAG's support (Ning Li)
17
HYPO4D
• Novel approach to turbulence studies: efficiently parallelizes time and space
• Algorithm is extremely memory-intensive: full spacetime trajectories are numerically relaxed to a nearby minimum (unstable periodic orbit)
• Ranger is currently the best resource for this work (123 TB of aggregate RAM)
• During the early-user period, millions of time steps for different systems were simulated and then compared: ~10 TB of data
18
LB3D -- complex condensed matter physics
• LB3D -- three-dimensional lattice-Boltzmann solver for multi-component fluid dynamics, in particular amphiphilic systems
• Mature code - 9 years in development; it has been extensively used on the US TeraGrid, UK NGS, HECToR and HPCx machines
• Largest model simulated to date is 2048³ (needs Ranger)
19
Lattice-Boltzmann Simulations of Amphiphilic Liquid
Crystals - Scientific Objectives
• Mesophase self-assembly - cubics: gyroid, diamond and primitive; non-cubics: lamellar and hexagonal; exotic multi-lamellar onion and vesicular structures
• Cubic phase rheology
• Pressure- and shear-induced phase transitions: Cubic -> Cubic and Lamellar/Hexagonal -> Cubic
• Defect dynamics
• Finite-size effects
[Images: rheological simulations under shear; amphiphilic fluid, cubic mesophase; influence of defects]
LB3D benchmarks on Ranger
LB3D benchmarks on HECToR XT4
Description: HECToR is a Cray XT4 machine in the UK with a current
theoretical peak of 59 Tflops and 33.2 TB of memory overall.
LB3D benchmarks on HECToR X2
No optimization has been applied to LB3D for X2. We expect vector performance
to improve after vectorisation of the control flow.
Lattice-Boltzmann Simulations of Amphiphilic Liquid
Crystals - Self-Assembly Results
Cubic, lamellar and HPL phases
Gyroid phase simulation
Gyroid phase - periodic nodal approximation:
sin(2πx)cos(2πy) + sin(2πy)cos(2πz) + sin(2πz)cos(2πx) = 0
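For illustration only (this is not LB3D code), the nodal approximation quoted above can be evaluated directly on a lattice to build a reference gyroid against which a self-assembled structure may be compared; the grid size N and all names are assumptions, and the same pattern applies to the diamond and primitive approximations on the following slides.

```c
/* Evaluate the gyroid nodal approximation on an N^3 lattice and count the
 * sites on each side of the surface (f ~ 0 marks the surface itself).
 * Illustrative sketch only. Compile with: cc gyroid.c -lm */
#include <math.h>
#include <stdio.h>

#define N 64   /* lattice sites per unit cell (illustrative) */

int main(void)
{
    long positive = 0, negative = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                double x = (double)i / N, y = (double)j / N, z = (double)k / N;
                double f = sin(2*M_PI*x)*cos(2*M_PI*y)
                         + sin(2*M_PI*y)*cos(2*M_PI*z)
                         + sin(2*M_PI*z)*cos(2*M_PI*x);
                if (f > 0.0) positive++; else negative++;
            }
    printf("sites with f > 0: %ld, f <= 0: %ld\n", positive, negative);
    return 0;
}
```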
Lattice-Boltzmann Simulations of Amphiphilic Liquid
Crystals - Self-Assembly Results
Cubic, lamellar and HPL phases
Diamond phase simulation
Diamond phase - periodic nodal approximation:
cos(2πx)cos(2πy)cos(2πz) + cos(2πx)sin(2πy)sin(2πz) + sin(2πx)cos(2πy)sin(2πz) + sin(2πx)sin(2πy)cos(2πz) = 0
Lattice-Boltzmann Simulations of Amphiphilic Liquid
Crystals - Self-Assembly Results
Cubic, lamellar and HPL phases
Primitive phase simulation
Primitive phase - periodic nodal approximation:
cos(2πx) + cos(2πy) + cos(2πz) = 0
Lattice-Boltzmann Simulations of Amphiphilic Liquid
Crystals - Self-Assembly Results
Cubic, lamellar and HPL phases
[Images: lamellar phase; ripple phase; hexagonal perforated lamellar (HPL) phase]
Lattice-Boltzmann Simulations of Amphiphilic Liquid
Crystals - Self-Assembly by varying pressure/concentration
Variation of the scalar pressure gives morphologies in agreement with those obtained in experiments. In order of decreasing pressure:
lamellar -> ripple -> primitive -> hexagonal
Variation of the relative concentration of amphiphile is also in agreement with experiment. The following sequence is observed with increasing relative amphiphile concentration:
sponge -> gyroid -> lamellar
R. S. Saksena and P. V. Coveney, “Self-Assembly of Ternary Cubic, Hexagonal and Lamellar
Mesophases using the Lattice-Boltzmann Kinetic Method”, J. Phys. Chem. B, 112(10), 2950-2957
(2008).
Cubic Phase Rheology Results1
• Recent results include the tracking of large time-scale defect dynamics in 1024³ lattice-site systems; only possible on Ranger, due to the sustained core count and disk storage requirements
[Image: 256³ lattice-site gyroidal system with multiple domains]
• Regions of high stress magnitude are localized in the vicinity of defects
• The fluid is viscoelastic
1. See R. S. Saksena et al., "Real science at the petascale", preprint (2008)
29
Clay-Polymer Nanocomposites
• Aluminosilicate clay sheets carry a negative charge.
• This is charge-balanced by cations such as Na+ and K+.
• Water of hydration is also present.
• The nano-scale clay sheets may be dispersed within a polymer in three ways:
– As tactoids
– With the polymer intercalated
– Exfoliated
• Research shows a mixture of the latter two is typical.
Our Interest in Clay-Polymer Nanocomposites
• Drilling fluid additives to prevent clay swelling during bore-hole drilling.
• Significant interest from the oilfield industry - US Patent 6,787,507, "Stabilizing clayey formations".
• Low clay-fraction clay-polymer nanocomposites are receiving much attention due to the improvement in material and thermochemical properties of plastics for a small increase in weight.
• To understand the mechanical properties of the composites we require accurate knowledge of the material properties of each component.
• Experimentally, determining the material properties of clays (such as montmorillonite) has proved very difficult. Can simulation provide an answer?
• £1.6M TSB project with MI-SWACO, including HECToR access.
Simulation work
• The small thickness and separation of the clay sheets (~1nm) and the changes that
occur due to intermolecular interactions require that we simulate the clay sheet in
atomistic detail.
• Currently we work with the following software:
– MD code = LAMMPS (5K to >10⁷ atom models)
Large-scale Atomic/Molecular Massively Parallel Simulator. Uses spatial-decomposition techniques to partition the simulation cell into 3D sub-domains, one of which is assigned to each processor.
LAMMPS uses FFTW to calculate the electrostatic interactions in reciprocal space. It keeps track of nearby particles for the real-space part of the electrostatics and for pairwise interactions (e.g. Lennard-Jones) through neighbour lists.
– Builder = Accelrys Inc. Cerius2
– MC code = Cerius2 (C2) Discover
– MD code = C2 Discover (small, <5K atom models)
– Forcefield = ClayFF
– Visualizer = Chimera (small models)
– Visualizer = AtomEye (large models)
Large-Scale MD: sheet undulations
Case-Study: Non-swelling amine-clay composites
• Using very large simulation super-cells (350,840 atoms; 28 nm x 50 nm x 3 nm), the clay sheets were able to flex, resulting in a much broader distribution of atom density.
• This distortion may be more evident at even larger super-cell sizes, or using 2-dimensional boundary conditions.
• Small models suffer from unphysical finite-size effects.
Large-Scale MD:
Reaching the Limits of Current Computing Power
How can we simulate clay materials on a large enough scale to capture this "mesoscopic" motion of the undulating clay sheets, while still capturing all atomistic information?
Answer: We federate international grids in the US, UK and Europe, allowing us transparent access to unprecedented resources.
Large-Scale MD
• Using our federated grid, we can use LAMMPS and the fully flexible force-field ClayFF to simulate in detail the long-length-scale motions which emerge with increasing clay sheet size. These undulations are prohibited at the small system sizes commonly used in MD simulations.
• Such simulations give information on the mechanical properties of the composite, allowing comparison with experiment and prediction of material properties.
• The largest system we have simulated is a 10-million-atom clay-water system of dimensions 150 x 270 x 25 nm. This approaches a realistic platelet size.
Clay-Polymer Nanocomposites: Elastic Properties
• We want to find the wavelengths and amplitudes of the thermal undulations of the clay sheet.
• We therefore need to calculate the central position of the clay sheet on a regular grid from the atomic positions.
• We then Fourier transform this height function h(x,y) to find the wavelengths and amplitudes (see the sketch below).
[Image: height function h(x,y) of a 1-million-atom montmorillonite clay sheet discretised onto a 100 x 150 grid, with grid spacing 0.45 nm x 0.45 nm]
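Below is a minimal C sketch of how a gridded height function h(x,y), like the one in the figure, can be turned into the undulation power spectrum analysed on the following slides, using FFTW (already used elsewhere in this work via LAMMPS). It is not the group's analysis code: the grid dimensions, spacing, the toy input and all names are illustrative assumptions.

```c
/* Sketch: 2D FFT of a gridded height function h(x,y) -> undulation power
 * spectrum |h(k)|^2 versus |k|. Illustrative only; compile with:
 *   cc spectrum.c -lfftw3 -lm */
#include <fftw3.h>
#include <math.h>
#include <stdio.h>

#define NX 100
#define NY 150
#define DX 0.45   /* grid spacing in nm */

int main(void)
{
    static double h[NX][NY];   /* in practice, filled from an MD frame */

    /* toy input: a single undulation mode, so the spectrum has one clear peak */
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            h[i][j] = 0.1 * sin(2.0 * M_PI * 3.0 * i / NX);

    fftw_complex *hk = fftw_malloc(sizeof(fftw_complex) * NX * (NY/2 + 1));
    fftw_plan plan = fftw_plan_dft_r2c_2d(NX, NY, &h[0][0], hk, FFTW_ESTIMATE);
    fftw_execute(plan);

    /* print |h(k)|^2 against |k|; fitting the long-wavelength tail to
     * kB*T / [A (kc k^4 + gamma k^2)] (next slides) gives the bending modulus kc */
    double Lx = NX * DX, Ly = NY * DX;
    for (int i = 0; i < NX; i++) {
        int m = (i <= NX/2) ? i : i - NX;   /* map to signed wavevector index */
        for (int j = 0; j < NY/2 + 1; j++) {
            double k = 2.0 * M_PI * sqrt((double)m*m/(Lx*Lx) + (double)j*j/(Ly*Ly));
            double re = hk[i*(NY/2+1) + j][0], im = hk[i*(NY/2+1) + j][1];
            printf("%g %g\n", k, (re*re + im*im) / ((double)NX*NY*NX*NY));
        }
    }

    fftw_destroy_plan(plan);
    fftw_free(hk);
    return 0;
}
```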
Clay-Polymer Nanocomposites: Elastic Properties
• The amplitudes can be related to the elastic properties of the clay
• The free energy per unit area for the undulations is related to the bending of the surface:

  G_und(x,y) = 0.5 kc |∇²h(x,y)|² + 0.5 γ |∇h(x,y)|²

where kc is the bending modulus, i.e. how easy it is for the sheet to bend, and γ is the surface tension, which resists any increase in the surface area of the sheet.
If we convert to Fourier space, the above expression becomes, for wavevectors k:

  G_und(k) = 0.5 (kc k⁴ + γ k²) |h(k)|²
Clay-Polymer Nanocomposites: Elastic Properties
If we assume each mode of vibration has the same average energy of 0.5 kB T (the equipartition principle, with kB the Boltzmann constant), then setting 0.5 A (kc k⁴ + γ k²) <|h(k)|²> = 0.5 kB T gives the amplitude of the undulations:

  <|h(k)|²> = kB T / [A (kc k⁴ + γ k²)]

where A is the clay sheet area.
[Plot of <|h(k)|²> versus k: we see k⁻⁴ behaviour at long wavelengths, from which the gradient yields kc; the short-wavelength end shows an artifact due to the imposed periodicity of the model. The collective motion exhibiting k⁻⁴ behaviour is only apparent at wavelengths greater than 15 nm.]
How Large Can We Go With LAMMPS?
We have benchmarked LAMMPS using our clay simulations (full interactions including bonds, angles, Lennard-Jones and electrostatic interactions) on Ranger at TACC, University of Texas - the largest computing system in the world for open science research.
The Sun Constellation Linux Cluster, Ranger, is configured with 3,936 16-way SMP compute nodes (blades).
We find a significant tailing-off in performance when the number of atoms per core is < 10,000.
Compiler options: mpicc -x0 -O3
LAMMPS scaling on Intrepid (BlueGene/P)
• Time per timestep of simulation is approximately 4x slower than on Ranger
• Need access to more processors for equivalent wall-clock time
LAMMPS Scaling on HECToR XT4
• 9.5 million atom (left) and 85 million atom systems
• LAMMPS is a supported code on HECToR XT4
• Approximately 33% faster per timestep than Ranger
LAMMPS on the X2?
• LAMMPS currently uses spatial sub-division methods with explicit communication for parallelization.
• A large amount of code reorganisation would be required to utilise the vector architecture effectively.
• Successful vectorisation is possible when identical operations are repeated over comparatively large data items organized as linear arrays. LAMMPS uses neighbour-list methods for calculating pair potentials; the innermost loops have few iterations and the data are not independent1 (see the sketch below).
• LAMMPS uses FFTW to calculate electrostatics; FFTW does not vectorise, and the FFTs would have to be rewritten to use Cray FFTs.
1. Rapaport, D.C., Comput. Phys. Commun. 174 (2006) 521-529
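To make the vectorisation difficulty concrete, here is an illustrative C sketch (not LAMMPS source; all names are invented) of a typical neighbour-list inner loop for a Lennard-Jones pair force. The indirect accesses through jlist[] and the scatter update of f[j] create the short, data-dependent inner loops referred to above, which straightforward compiler vectorisation cannot handle.

```c
/* Illustrative neighbour-list Lennard-Jones force loop (not LAMMPS code). */
#include <stddef.h>

void lj_forces(size_t nlocal,
               const int *numneigh, int **firstneigh,   /* neighbour lists */
               const double (*x)[3], double (*f)[3],
               double epsilon, double sigma, double rcut)
{
    double rc2 = rcut * rcut;
    double s6  = sigma*sigma*sigma*sigma*sigma*sigma;

    for (size_t i = 0; i < nlocal; i++) {
        const int *jlist = firstneigh[i];
        for (int jj = 0; jj < numneigh[i]; jj++) {  /* short, variable-length loop */
            int j = jlist[jj];                      /* indirect (gather) access    */
            double dx = x[i][0] - x[j][0];
            double dy = x[i][1] - x[j][1];
            double dz = x[i][2] - x[j][2];
            double r2 = dx*dx + dy*dy + dz*dz;
            if (r2 >= rc2) continue;

            double inv2  = 1.0 / r2;
            double inv6  = inv2 * inv2 * inv2 * s6;   /* (sigma/r)^6 */
            double fpair = 24.0 * epsilon * inv6 * (2.0*inv6 - 1.0) * inv2;

            f[i][0] += fpair * dx;  f[i][1] += fpair * dy;  f[i][2] += fpair * dz;
            /* scatter update of the neighbour's force: a potential dependence
             * between iterations that defeats naive vectorisation */
            f[j][0] -= fpair * dx;  f[j][1] -= fpair * dy;  f[j][2] -= fpair * dz;
        }
    }
}
```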
How Large Can We Go With LAMMPS?
We cannot go above 2 billion (2³¹) atoms with LAMMPS, because atom IDs, which are required to find bond partners, are stored as positive 32-bit integers.
Therefore, major code revision is required to access systems of more than 2 billion atoms.
In the meantime, we are investigating coarse-grained molecular dynamics, built using details from atomistic molecular dynamics.
[Images: atomistic MD; coarse-grained MD]
Clay-Polymer Nanocomposites:
Future directions
• The question of whether periodic boundary conditions become limiting at large length scales remains to be addressed.
• Answer: remove the periodic boundary conditions!
• Simulating "life-sized" clay platelets (approx. 50 nm across):
– Will remove artificial constraints
– Will allow us to examine unexplored behaviours, such as interactions at clay sheet edges and undulations
• Clay sheets in a simulation box of water or a polymer matrix
• Requires very large systems to ensure our models are realistic: between 10⁶ and 10⁸ atoms
• We can explore even larger systems for longer timescales if we use coarse-grained molecular dynamics
Clay-Polymer Nanocomposites: Future Directions
Isolated Clay Sheets - Interactions Over Many Length Scales
With our very large atomistic and coarse-grained simulations of clay sheets, we will be able to investigate the important chemistry of clays that occurs at the edges of the clay sheets.
On a short length scale this can affect ion adsorption, interactions with polymers, acid-base properties, etc.
On a long length scale, it may affect the mesoscopic structure of clay platelets.
Mesophases of oblate uniaxial particles dispersed in polymer: a) isotropic b) nematic c)
smectic A, d) columnar, e) house of cards plastic solid f) crystal. From Ginsburg et al 2000
HIV-1 Protease
• Enzyme of HIV responsible for protein maturation
• Target for anti-retroviral inhibitors
• Example of structure-assisted drug design
• Several inhibitors of HIV-1 protease exist
[Structure figure labels: Monomer A (residues 1-99), Monomer B (residues 101-199), flaps, glycines 48 and 148, bound saquinavir]
So what's the problem?
• Emergence of drug-resistant mutations in the protease
• These render the drug ineffective
• Drug-resistant mutants have emerged for all inhibitors
[Structure figure labels: P2 subsite, leucines 90 and 190, catalytic aspartic acids 25 and 125, N- and C-termini]
Molecular Dynamics Simulations of HIV-1 Protease
AIMS
• Study the differential interactions of wild-type and mutant proteases with an inhibitor
• Gain insight at the molecular level into the dynamical cause of drug resistance
• Determine conformational differences of the drug in the active site
Inhibitor: Saquinavir
Mutant 1: G48V (Glycine to Valine)
Mutant 2: L90M (Leucine to Methionine)
HIV-1 drug resistance
• Goal: to study the effect of anti-retroviral inhibitors (targeting proteins in the HIV lifecycle, such as the viral protease and reverse-transcriptase enzymes)
• High-end computational power to confer clinical decision support
• On Ranger, up to 100 replicas (configurations) simulated, for the first time, in some cases going to 100 ns
• 3.5 TB of trajectory and free energy analysis
[Figure: binding energy differences compared with experimental results for wild-type and MDR proteases with the inhibitors LPV and RTV, using 10 ns trajectories]
• 3.4 microseconds of simulation in four weeks (average 120 ns/day, peak 300 ns/day)
• AHE-orchestrated workflows
48
GENIUS project
• Grid Enabled Neurosurgical Imaging Using Simulation (GENIUS)
• Scientific goal: to perform real-time patient-specific medical simulation
• Combines blood flow simulation with clinical data
• Fitting the computational time scale to the clinical time scale:
– Capture the clinical workflow
– Get results which will influence clinical decisions: 1 day? 1 week?
– GENIUS: 15 to 30 minutes
49
GENIUS project
• Blood flow is simulated using the lattice-Boltzmann method (HemeLB)
• A parallel ray tracer performs real-time in situ visualization
• Sub-frames are rendered on each MPI rank and composited before being sent over the network to a (lightweight) viewing client, as in the sketch below
• The addition of volume rendering reduces the scalability of the fluid solver, due to the global communications required
• Even so, datasets are rendered at more than 30 frames per second (1024² pixel resolution)
50
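A minimal C/MPI sketch of the "render a sub-frame on each rank, composite, then ship one frame to the client" pattern described above is given below. It is not HemeLB code: the disjoint-tile layout, the simple gather-to-rank-0 compositing and names such as send_to_client are illustrative assumptions, and HemeLB's actual compositing strategy may differ.

```c
/* Sketch: per-rank sub-frame rendering, compositing on rank 0, then a single
 * send to a lightweight viewing client. Illustrative only.
 * Compile with: mpicc composite.c -o composite */
#include <mpi.h>
#include <stdlib.h>

#define TILE_PIXELS (256 * 256)   /* pixels in each rank's sub-frame */

/* stand-in for pushing the finished frame down a socket to the viewing client */
static void send_to_client(const unsigned char *rgb, long npixels)
{
    (void)rgb; (void)npixels;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* each rank "ray-traces" its own disjoint tile of the image */
    unsigned char *tile = malloc(3 * TILE_PIXELS);
    for (long p = 0; p < 3 * TILE_PIXELS; p++)
        tile[p] = (unsigned char)(rank * 7 + p);   /* placeholder rendering */

    /* composite: gather all tiles onto rank 0, which assembles the full frame */
    unsigned char *frame = NULL;
    if (rank == 0)
        frame = malloc((size_t)3 * TILE_PIXELS * size);

    MPI_Gather(tile, 3 * TILE_PIXELS, MPI_UNSIGNED_CHAR,
               frame, 3 * TILE_PIXELS, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    /* only rank 0 talks to the network, keeping the viewing client lightweight */
    if (rank == 0) {
        send_to_client(frame, (long)TILE_PIXELS * size);
        free(frame);
    }

    free(tile);
    MPI_Finalize();
    return 0;
}
```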
CONCLUSIONS
• A wide range of scientific research activities was presented that make effective use of the new petascale resources available in the USA
• These demonstrate the emergence of new science that is not possible without access to resources of this scale
• These applications have shown linear scaling up to tens of thousands of cores
• Vectorisation of lattice-Boltzmann codes can be expected to produce 2-4 times faster performance
• Future prospects: we are well placed to move onto the next machines coming online in the US and Japan
51
Acknowledgements
JANET/David Salmon
NGS staff
TeraGrid Staff
Simon Clifford (CCS)
Jay Boisseau (TACC)
Lucas Wilson (TACC)
Pete Beckman (ANL)
Ramesh Balakrishnan (ANL)
Brian Toonen (ANL)
Prof. Nicholas Karonis (ANL)
Kevin Roy (Cray)
Ning Li (NAG)
52