TeraGyroid
HPC Applications ready for UKLight
Stephen Pickles <stephen.pickles@man.ac.uk>
http://www.realitygrid.org
http://www.realitygrid.org/TeraGyroid.html
UKLight Town Meeting, NeSC, Edinburgh, 9/9/2004
The TeraGyroid Project
• Funded by EPSRC (UK) & NSF (USA) to join the UK e-Science Grid and US TeraGrid
– application from RealityGrid, a UK e-Science Pilot Project
– 3-month project including work exhibited at SC'03 and SC Global, Nov 2003
– thumbs up from TeraGrid mid-September; funding from EPSRC approved later
• Main objective was to deliver high-impact science that would not be possible without the combined resources of the US and UK grids
• Study of defect dynamics in liquid crystalline surfactant systems using lattice-Boltzmann methods
– featured the world's largest lattice-Boltzmann simulation
– 1024^3-cell simulation of the gyroid phase demands terascale computing
• hence "TeraGyroid"
Networking
[Diagram: two HPC engines, a visualization engine, and storage, linked by distinct network flows: checkpoint files over non-realtime TCP, steering control and status over near-realtime TCP, visualization data sent to the visualization engine, and compressed video streamed over realtime UDP.]
LB3D: 3-dimensional
Lattice-Boltzmann simulations

• LB3D code is written in Fortran90 and parallelized using MPI (a generic collide-and-stream sketch follows below)
• Scales linearly on all available resources (Lemieux, HPCx, CSAR, Linux/Itanium II clusters)
• Data produced during a single run can range from hundreds of gigabytes to terabytes
• Simulations require supercomputers
• High-end visualization hardware (e.g. SGI Onyx, dedicated viz clusters) and parallel rendering software (e.g. VTK) needed for data analysis
[Figure: 3D datasets showing snapshots from a simulation of spinodal decomposition. A binary mixture of water and oil phase-separates; 'blue' areas denote high water densities and 'red' visualizes the interface between both fluids.]
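For orientation only, here is a minimal sketch of the generic lattice-Boltzmann collide-and-stream update pattern. It is a single-phase D2Q9 example in NumPy, not LB3D itself (which is a 3D, multi-component amphiphilic model in Fortran90/MPI); the relaxation time and lattice size are arbitrary illustrative choices.

```python
import numpy as np

# D2Q9 lattice velocities and weights
c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
w = np.array([4/9] + [1/9]*4 + [1/36]*4)
tau = 0.8  # BGK relaxation time (illustrative value)

def equilibrium(rho, u):
    """Second-order equilibrium distribution for density rho and velocity u."""
    cu = np.einsum('qd,dxy->qxy', c, u)        # c_i . u at every site
    usq = np.einsum('dxy,dxy->xy', u, u)       # |u|^2 at every site
    return rho * w[:, None, None] * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)

def collide_and_stream(f):
    """One BGK collision followed by streaming on a periodic lattice."""
    rho = f.sum(axis=0)                          # density
    u = np.einsum('qd,qxy->dxy', c, f) / rho     # momentum / density
    f += (equilibrium(rho, u) - f) / tau         # relax towards equilibrium
    for i, (cx, cy) in enumerate(c):             # shift each population along c_i
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    return f

# Example: a uniform fluid at rest on a 64x64 lattice
f = np.ones((9, 64, 64)) * w[:, None, None]
for _ in range(100):
    f = collide_and_stream(f)
```

At production scale the same collide-and-stream step is domain-decomposed across MPI ranks, which is what gives LB3D its near-linear scaling.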
Computational Steering of
Lattice Boltzmann Simulations
• LB3D instrumented for steering using the RealityGrid steering library (an illustrative instrumentation sketch follows below)
• Malleable checkpoint/restart functionality allows 'rewinding' of simulations and runtime job migration across architectures
• Steering reduces storage requirements because the user can adapt data-dumping frequencies
• CPU time can be saved because users do not have to wait for jobs to finish if they can already see that nothing relevant is happening
• Instead of doing "task farming", parameter searches are accelerated by "steering" through parameter space
• Analysis time is significantly reduced because less irrelevant data is produced
Applied to study of the gyroid mesophase of amphiphilic liquid crystals at unprecedented space and time scales
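To make the steering pattern concrete, here is a minimal sketch of a steerable main loop. The RealityGrid steering library itself is a C/Fortran API and is not reproduced here; the file-based control channel, function names and parameters below are hypothetical stand-ins showing only the register/poll/checkpoint/stop flow.

```python
import json, os, pickle

CONTROL_FILE = "steer_control.json"   # stand-in control channel (hypothetical)
params = {"output_frequency": 100}    # steerable parameters

def poll_steering():
    """Non-blocking check for pending steering commands."""
    if not os.path.exists(CONTROL_FILE):
        return {}
    with open(CONTROL_FILE) as fh:
        cmds = json.load(fh)
    os.remove(CONTROL_FILE)           # consume the message
    return cmds

def simulate_one_step(state):
    """Placeholder for the real simulation update (e.g. one LB3D timestep)."""
    return state

def run(state, max_steps=1000):
    for step in range(max_steps):
        state = simulate_one_step(state)
        cmds = poll_steering()
        params.update(cmds.get("set_params", {}))     # e.g. change dump frequency at runtime
        if cmds.get("checkpoint"):                    # checkpoints enable rewind and migration
            with open(f"chkpt_{step}.pkl", "wb") as fh:
                pickle.dump(state, fh)
        if cmds.get("stop"):                          # scientist sees nothing relevant happening
            break
        if step % params["output_frequency"] == 0:
            print("status at step", step)             # status reported back to steering clients

run({}, max_steps=500)
```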
Parameter space exploration
[Diagram: parameter-space exploration. Initial condition: a random water/surfactant mixture; self-assembly starts. Rewinding and restarting from a checkpoint lets different regions of parameter space be explored, reaching a cubic micellar phase with a high surfactant density gradient, a cubic micellar phase with a low surfactant density gradient, or a lamellar phase (surfactant bilayers between water layers).]
Strategy
• Aim: use federated resources of the US TeraGrid and UK e-Science Grid to accelerate the scientific process
• Rapidly map out parameter space using a large number of independent "small" (128^3) simulations
– use job cloning and migration to exploit available resources and save equilibration time
• Monitor their behaviour using on-line visualization
• Hence identify parameters for high-resolution simulations on HPCx and Lemieux
– 1024^3 on Lemieux (PSC) – takes 0.5 TB to checkpoint!
– create initial conditions by stacking smaller simulations with periodic boundary conditions (see the tiling sketch below)
• Selected 128^3 simulations were used for long-time studies
• All simulations monitored and steered by a geographically distributed team of computational scientists
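A hedged sketch of the "stacking" idea: replicating an equilibrated, periodic 128^3 box to seed a 1024^3 run. The file names, array layout and single in-memory array are illustrative simplifications, not LB3D's actual checkpoint format.

```python
import numpy as np

# Assume the equilibrated 128^3 run was saved as a (128, 128, 128, ncomp) array.
small = np.load("eq_128.npy")              # hypothetical file name
reps = 8                                   # 8 x 128 = 1024 cells per side

# Because the small box uses periodic boundaries, tiling it yields a seamless
# (initially artificially periodic) 1024^3 starting field. In practice this is
# done per MPI rank or out of core: the full field is ~0.5 TB, far too large
# for a single node's memory.
large = np.tile(small, (reps, reps, reps, 1))
assert large.shape[:3] == (1024, 1024, 1024)
np.save("init_1024.npy", large)
```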
The Architecture of Steering
[Diagram: the steering architecture. Simulation and visualization components, each linked against the steering library, start independently and attach/detach dynamically. Each publishes a Steering Grid Service (GS) into an OGSI middle tier and registers it; a steering client finds the service in the Registry, binds to it, and connects to steer the component. Bulk data transfer between simulation and visualization uses Globus-IO. Multiple steering clients exist: Qt/C++, .NET on PocketPC, and a GridSphere Portlet (Java).]
• Computations run at HPCx, CSAR, SDSC, PSC and NCSA
• Visualizations run at Manchester, UCL, Argonne, NCSA, Phoenix
• Scientists in 4 sites steer calculations, collaborating via Access Grid
• Visualizations viewed remotely
• Grid services run anywhere
Remote visualization is delivered through SGI VizServer, Chromium, and/or streamed to the Access Grid for display.
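To illustrate the publish/find/bind flow in the diagram above, here is a purely hypothetical, in-process sketch: a plain dictionary stands in for the OGSI Registry, and ordinary method calls stand in for remote Grid Service bindings and Globus-IO transfers. None of the class or method names come from the real middleware.

```python
class Registry:
    """Stand-in for the service registry in the middle tier."""
    def __init__(self):
        self.services = {}                     # name -> steering service handle
    def publish(self, name, service):
        self.services[name] = service          # simulation publishes its Steering GS
    def find(self, name):
        return self.services.get(name)         # client discovers it later

class SteeringService:
    """Stand-in for a per-simulation Steering Grid Service."""
    def __init__(self):
        self.commands = []
    def bind(self):
        return self                            # a real client would bind remotely
    def send(self, command):
        self.commands.append(command)          # e.g. 'pause', 'checkpoint'

registry = Registry()
sim_gs = SteeringService()
registry.publish("lb3d-run-42", sim_gs)        # simulation side: publish at start-up

client = registry.find("lb3d-run-42")          # client side: find, bind, then steer
if client is not None:
    client.bind().send("checkpoint")
```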
SC Global ’03 Demonstration
TeraGyroid Testbed
[Network map of the TeraGyroid testbed: US sites (ANL, Caltech, NCSA, PSC, SDSC) and the SC'03 show floor in Phoenix; UK sites (Manchester, Daresbury, UCL); interconnected via Starlight (Chicago) and Netherlight (Amsterdam) at 10 Gbps, a BT-provisioned 2 x 1 Gbps link, the SJ4 production network, and MB-NG. The legend marks visualization and computation resources, Access Grid nodes, network PoPs, the Service Registry, and dual-homed systems.]
Trans-Atlantic
Network
Collaborators:
• Manchester Computing
• Daresbury Laboratory Networking Group
• MB-NG and UKERNA
• UCL Computing Service
• BT
• SURFnet (NL)
• Starlight (US)
• Internet2 (US)
TeraGyroid:
Hardware Infrastructure
Computation (using more than 6000 processors) including:
• HPCx (Daresbury), 1280 procs IBM Power4 Regatta, 6.6 Tflops peak, 1.024 TB
• Lemieux (PSC), 3000 procs HP/Compaq, 3 TB memory, 6 Tflops peak
• TeraGrid Itanium2 cluster (NCSA), 256 procs, 1.3 Tflops peak
• TeraGrid Itanium2 cluster (SDSC), 256 procs, 1.3 Tflops peak
• Green (CSAR), SGI Origin 3800, 512 procs, 0.512 TB memory (shared)
• Newton (CSAR), SGI Altix 3700, 256 Itanium 2 procs, 384 GB memory (shared)
Visualization:
• Bezier (Manchester), SGI Onyx 300, 6x IR3, 32 procs
• Dirac (UCL), SGI Onyx 2, 2x IR3, 16 procs
• SGI loan machine (Phoenix), SGI Onyx, 1x IR4, 1x IR3, commissioned on site
• TeraGrid Visualization Cluster (ANL), Intel Xeon
• SGI Onyx (NCSA)
Service Registry:
• Frik (Manchester), Sony PlayStation 2
Storage:
• 20 TB of science data generated in project
• 2 TB moved to long-term storage for ongoing analysis: Atlas Petabyte Storage System (RAL)
Access Grid nodes at Boston University, UCL, Manchester, Martlesham, Phoenix (4)
Network lessons
• Less than three weeks to debug networks
– applications people and network people nodded wisely but didn't understand each other
– middleware such as GridFTP is infrastructure to applications folk, but an application to network folk
– rapprochement necessary for success
• Grid middleware not designed with dual-homed systems in mind (see the source-address sketch below)
– HPCx, CSAR (Green) and Bezier are busy production systems
– had to be dual-homed on SJ4 and MB-NG
– great care with routing
– complication: we needed to drive everything from laptops that couldn't see the MB-NG network
• Many other problems encountered
– but nothing that can't be fixed once and for all given persistent infrastructure
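One recurring detail behind "great care with routing" is controlling which network a dual-homed host uses for a given transfer. As a hedged, application-side sketch (the addresses and endpoint are placeholders, and the real fix also involved host routing tables and middleware configuration), a client can at least pin its source address before connecting:

```python
import socket

# Placeholders: not the project's real SJ4 / MB-NG addresses or endpoints.
MBNG_LOCAL_ADDR = "192.0.2.10"            # this host's address on the research network
REMOTE = ("transfer.example.org", 2811)   # illustrative GridFTP-style endpoint

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(10)
try:
    # Binding the source address chooses which "home" the traffic appears to
    # come from; the host's routing tables still have to send it out of the
    # matching interface, which is where most of the care was needed.
    sock.bind((MBNG_LOCAL_ADDR, 0))
    sock.connect(REMOTE)
    print("connected from", sock.getsockname())
except OSError as exc:
    print("transfer setup failed:", exc)
finally:
    sock.close()
```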
Measured Transatlantic
Bandwidths during SC’03
TeraGyroid: Summary
• Real computational science...
– Gyroid mesophase of amphiphilic liquid crystals
– Unprecedented space and time scales
– investigating phenomena previously out of reach
• ...on real Grids...
– enabled by high-bandwidth networks
• ...to reduce time to insight
[Images: dislocations; interfacial surfactant density.]
TeraGyroid: Collaborating
Organisations
Our thanks to hundreds of individuals at:...
Argonne National Laboratory (ANL)
Boston University
BT
BT Exact
Caltech
CSC
Computing Services for Academic Research (CSAR)
CCLRC Daresbury Laboratory
Department of Trade and Industry (DTI)
Edinburgh Parallel Computing Centre
Engineering and Physical Sciences Research Council (EPSRC)
Forschungszentrum Jülich
HLRS (Stuttgart)
HPCx
IBM
Imperial College London
National Center for Supercomputing Applications (NCSA)
Pittsburgh Supercomputing Center
San Diego Supercomputer Center
SCinet
SGI
SURFnet
TeraGrid
Tufts University, Boston
UKERNA
UK Grid Support Centre
University College London
University of Edinburgh
University of Manchester
The TeraGyroid Experiment
S. M. Pickles1, R. J. Blake2, B. M. Boghosian3, J. M. Brooke1,
J. Chin4, P. E. L. Clarke5, P. V. Coveney4,
N. González-Segredo4, R. Haines1, J. Harting4, M. Harvey4,
M. A. S. Jones1, M. Mc Keown1, R. L. Pinning1,
A. R. Porter1, K. Roy1, and M. Riding1.
1. Manchester Computing, University of Manchester
2. CLRC Daresbury Laboratory, Daresbury
3. Tufts University, Massachusetts
4. Centre for Computational Science, University College London
5. Department of Physics & Astronomy, University College London
http://www.realitygrid.org
http://www.realitygrid.org/TeraGyroid.html
New Application at AHM2004
“Exact” calculation of peptide-protein binding
energies by steered thermodynamic integration
using high-performance computing grids.
Philip Fowler, Peter Coveney, Shantenu Jha and Shunzhou Wan
UK e-Science All Hands Meeting
31 August – 3 September 2004
Why are we studying this
system?
• Measuring binding energies is vital for, e.g., designing new drugs
• Calculating a peptide-protein binding energy can take weeks to months
• We have developed a grid-based method to accelerate this process
Goal: to compute ΔGbind during the AHM 2004 conference, i.e. in less than 48 hours, using the federated resources of the UK National Grid Service and US TeraGrid
Thermodynamic Integration on
Computational Grids
[Diagram: thermodynamic integration workflow on computational grids. Starting from a single conformation, independent simulations are run on the Grid at a ladder of coupling values λ = 0.1, 0.2, 0.3, ..., 0.9 (10 sims, each 2 ns), with successive simulations seeded from earlier ones. Steering is used to launch, spawn and terminate λ jobs. ⟨∂H/∂λ⟩ is monitored against time at each λ and checked for convergence; the windows are then combined and the integral calculated.]
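The "combine and calculate integral" step is standard thermodynamic integration, ΔG = ∫ ⟨∂H/∂λ⟩ dλ over the coupling parameter. A minimal sketch of that quadrature step, with made-up placeholder numbers standing in for the per-window averages (these are not the project's data):

```python
import numpy as np

# Illustrative lambda ladder and window averages of <dH/dlambda> in kcal/mol.
lambdas = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
dH_dl   = np.array([-35.0, -20.0, -12.0, -6.0, -2.0, 1.5, 4.0, 6.0, 7.5])

# Each window's time series would first be checked for convergence, e.g. by
# comparing block averages over the first and second halves of the trajectory.

delta_G = np.trapz(dH_dl, lambdas)   # trapezoidal quadrature over lambda
print(f"Delta G ~= {delta_G:.1f} kcal/mol over the sampled lambda range")
```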
[Figure: checkpointing; steering and control; monitoring.]
We successfully ran many
simulations…
• This is the first time we have completed an entire calculation
– Insight gained will help us improve the throughput
• The simulations were started at 5pm on Tuesday and the data was collated at 10am Thursday
• 26 simulations were run
• At 4.30pm on Wednesday, we had nine simulations in progress (140 processors)
– 1x TG-SDSC, 3x TG-NCSA, 3x NGS-Oxford, 1x NGS-Leeds, 1x NGS-RAL
• We simulated over 6.8 ns of classical molecular dynamics in this time
Very preliminary results
[Plot: dE/dλ against λ from 0 to 1 for the sampled thermodynamic integration windows.]
ΔG (kcal/mol):
– Thermodynamic integration, "quick and dirty" analysis (as at 41 hours): -1.0 ± 0.3
– Experiment: -9 to -12
We expect our value to improve with further analysis around the endpoints.
Conclusions
• We can harness today's grids to accelerate high-end computational science
• On-line visualization and job migration require high-bandwidth networks
• Need persistent network infrastructure
– else set-up costs are too high
• QoS: would like the ability to reserve bandwidth
– and processors, graphics pipes, AG rooms, virtual venues, node ops... (but that's another story)
• Hence our interest in UKLight