TeraGyroid: HPC Applications ready for UKLight
Stephen Pickles <stephen.pickles@man.ac.uk>
http://www.realitygrid.org
http://www.realitygrid.org/TeraGyroid.html
UKLight Town Meeting, NeSC, Edinburgh, 9/9/2004

The TeraGyroid Project
Funded by EPSRC (UK) and NSF (USA) to join the UK e-Science Grid and the US TeraGrid
– an application from RealityGrid, a UK e-Science Pilot Project
– a 3-month project, including work exhibited at SC'03 and SC Global, November 2003
– thumbs up from TeraGrid mid-September; funding from EPSRC approved later
The main objective was to deliver high-impact science that would not have been possible without the combined resources of the US and UK grids
Study of defect dynamics in liquid crystalline surfactant systems using lattice-Boltzmann methods
– featured the world's largest lattice-Boltzmann simulation
– a 1024^3-cell simulation of the gyroid phase demands terascale computing, hence "TeraGyroid"

Networking
[Diagram: two HPC engines exchange checkpoint files over TCP (non-realtime); steering control/status messages and visualization data flow over TCP (near-realtime) between the HPC engines, a visualization engine and storage; compressed video is streamed over UDP (realtime).]

LB3D: 3-dimensional lattice-Boltzmann simulations
The LB3D code is written in Fortran 90 and parallelized using MPI
It scales linearly on all available resources (Lemieux, HPCx, CSAR, Linux/Itanium 2 clusters)
Data produced during a single run can range from hundreds of gigabytes to terabytes
Simulations require supercomputers
High-end visualization hardware (e.g. SGI Onyx, dedicated viz clusters) and parallel rendering software (e.g. VTK) are needed for data analysis
[Figure: 3D datasets showing snapshots from a simulation of spinodal decomposition: a binary mixture of water and oil phase-separates. Blue areas denote high water densities; red visualizes the interface between the two fluids.]
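LB3D itself is a Fortran 90/MPI code and is not reproduced here. Purely as an illustration of the lattice-Boltzmann idea behind it, the sketch below implements a minimal serial single-relaxation-time (BGK) collide-and-stream step on a D2Q9 lattice in Python. Everything here is a toy stand-in for the real code: the 2D lattice (LB3D is three-dimensional and multicomponent), the grid size, the relaxation time tau, and the initial density blob are all made up for the example.

```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their quadrature weights.
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, u):
    """Discrete Maxwell-Boltzmann equilibrium, 2nd order in velocity."""
    cu = np.einsum('id,xyd->ixy', C, u)                 # c_i . u per site
    usq = np.einsum('xyd,xyd->xy', u, u)                # |u|^2 per site
    return W[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)

def lb_step(f, tau):
    """One BGK collide-and-stream update; returns the new populations."""
    rho = f.sum(axis=0)                                 # density
    u = np.einsum('id,ixy->xyd', C, f) / rho[..., None] # velocity
    f = f + (equilibrium(rho, u) - f) / tau             # BGK collision
    for i, (cx, cy) in enumerate(C):                    # periodic streaming
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    return f

# Toy run: relax a small density blob on a 32x32 periodic lattice.
nx = ny = 32
rho0 = np.ones((nx, ny))
rho0[10:20, 10:20] += 0.1
f = equilibrium(rho0, np.zeros((nx, ny, 2)))
for _ in range(100):
    f = lb_step(f, tau=0.8)
# Total mass is conserved by construction (collision and streaming
# both preserve the sum of the populations).
```

The production code applies the same collide-and-stream pattern per MPI rank over a 3D domain decomposition, with halo exchange replacing the periodic `np.roll`.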
Computational Steering of Lattice-Boltzmann Simulations
LB3D is instrumented for steering using the RealityGrid steering library
Malleable checkpoint/restart functionality allows 'rewinding' of simulations and runtime job migration across architectures
Steering reduces storage requirements because the user can adapt data-dumping frequencies
CPU time is saved because users need not wait for jobs to finish if they can already see that nothing relevant is happening
Instead of task farming, parameter searches are accelerated by "steering" through parameter space
Analysis time is significantly reduced because less irrelevant data is produced
Applied to the study of the gyroid mesophase of amphiphilic liquid crystals at unprecedented space and time scales

Parameter space exploration
[Figure: steering through parameter space. Initial condition: a random water/surfactant mixture from which self-assembly starts. Rewinding and restarting from checkpoints leads to different phases: a cubic micellar phase with a high surfactant density gradient, a cubic micellar phase with a low surfactant density gradient, and a lamellar phase with surfactant bilayers between water layers.]

Strategy
Aim: use the federated resources of the US TeraGrid and the UK e-Science Grid to accelerate the scientific process
Rapidly map out parameter space using a large number of independent "small" (128^3) simulations
– use job cloning and migration to exploit available resources and save equilibration time
Monitor their behaviour using on-line visualization
Hence identify parameters for high-resolution simulations on HPCx and Lemieux
– 1024^3 on Lemieux (PSC) – takes 0.5 TB to checkpoint!
– create initial conditions by stacking smaller simulations with periodic boundary conditions
Selected 128^3 simulations were used for long-time studies
All simulations were monitored and steered by a geographically distributed team of computational scientists

The Architecture of Steering
[Diagram: an OGSI middle tier. Each simulation and visualization component carries the steering library and binds to a Steering Grid Service, which publishes itself to a Registry; steering clients find components via the Registry and connect to them; data transfer between simulation and visualization uses Globus-IO; visualizations publish displays.]
Components start independently and attach/detach dynamically
Multiple clients: Qt/C++, .NET on PocketPC, GridSphere portlet (Java)
Remote visualization through SGI VizServer, Chromium, and/or streamed to Access Grid
– computations ran at HPCx, CSAR, SDSC, PSC and NCSA
– visualizations ran at Manchester, UCL, Argonne, NCSA and Phoenix
– scientists at 4 sites steered calculations, collaborating via Access Grid
– visualizations were viewed remotely
– Grid services can run anywhere

SC Global '03 Demonstration

TeraGyroid Testbed
[Network map: 10 Gbps transatlantic connectivity between Starlight (Chicago) and Netherlight (Amsterdam), with BT-provisioned 2 x 1 Gbps circuits; the SJ4 production network and MB-NG link Manchester, Daresbury and UCL (dual-homed systems) to ANL, NCSA, Caltech, SDSC, PSC and the Phoenix show floor. The map marks visualization and computation hosts, Access Grid nodes, network PoPs and the service registry.]

Trans-Atlantic Network
Collaborators:
– Manchester Computing
– Daresbury Laboratory Networking Group
– MB-NG and UKERNA
– UCL Computing Service
– BT
– SURFnet (NL)
– Starlight (US)
– Internet2 (US)

TeraGyroid: Hardware Infrastructure
Computation (using more than 6000 processors), including:
– HPCx (Daresbury): IBM Power4 Regatta, 1280 processors, 6.6 Tflops peak, 1.024 TB memory
– Lemieux (PSC): HP/Compaq, 3000 processors, 3 TB memory, 6 Tflops peak
– TeraGrid Itanium 2 cluster (NCSA): 256 processors, 1.3 Tflops peak
– TeraGrid Itanium 2 cluster (SDSC): 256 processors, 1.3 Tflops peak
– Green (CSAR): SGI Origin 3800, 512 processors, 0.512 TB shared memory
– Newton (CSAR): SGI Altix 3700, 256 Itanium 2 processors, 384 GB shared memory
Visualization:
– Bezier (Manchester): SGI Onyx 300, 6x IR3, 32 processors
– Dirac (UCL): SGI Onyx 2, 2x IR3, 16 processors
– SGI loan machine (Phoenix): SGI Onyx, 1x IR4, 1x IR3, commissioned on site
– TeraGrid Visualization Cluster (ANL): Intel Xeon
– SGI Onyx (NCSA)
Service registry:
– Frik (Manchester): Sony PlayStation 2
Storage:
– 20 TB of science data generated in the project
– 2 TB moved to long-term storage for on-going analysis: Atlas Petabyte Storage System (RAL)
Access Grid nodes at Boston University, UCL, Manchester, Martlesham and Phoenix (4)

Network lessons
Less than three weeks to debug the networks
– applications people and network people nodded wisely but didn't understand each other
– middleware such as GridFTP is infrastructure to applications folk, but an application to network folk
– rapprochement was necessary for success
Grid middleware is not designed with dual-homed systems in mind
– HPCx, CSAR (Green) and Bezier are busy production systems
– they had to be dual-homed on SJ4 and MB-NG
– great care was needed with routing
– complication: we needed to drive everything from laptops that couldn't see the MB-NG network
Many other problems were encountered
– but nothing that can't be fixed once and for all, given persistent infrastructure

Measured Transatlantic Bandwidths during SC'03
[Chart: measured transatlantic bandwidths during SC'03]

TeraGyroid: Summary
Real computational science...
– gyroid mesophase of amphiphilic liquid crystals
– unprecedented space and time scales
– investigating phenomena previously out of reach
[Figure: dislocations]
...on real Grids...
– enabled by high-bandwidth networks
...to reduce time to insight
[Figure: interfacial surfactant density]

TeraGyroid: Collaborating Organisations
Our thanks to hundreds of individuals at: Argonne National Laboratory (ANL), Boston University, BT, BT Exact, Caltech, CSC, Computing Services for Academic Research (CSAR), CCLRC Daresbury Laboratory, Department of Trade and Industry (DTI), Edinburgh Parallel Computing Centre, Engineering and Physical Sciences Research Council (EPSRC), Forschungszentrum Jülich, HLRS (Stuttgart), HPCx, IBM, Imperial College London, National Center for Supercomputing Applications (NCSA), Pittsburgh Supercomputing Center, San Diego Supercomputer Center, SCinet, SGI, SURFnet, TeraGrid, Tufts University (Boston), UKERNA, UK Grid Support Centre, University College London, University of Edinburgh, and the University of Manchester

The TeraGyroid Experiment
S. M. Pickles(1), R. J. Blake(2), B. M. Boghosian(3), J. M. Brooke(1), J. Chin(4), P. E. L. Clarke(5), P. V. Coveney(4), N. González-Segredo(4), R. Haines(1), J. Harting(4), M. Harvey(4), M. A. S. Jones(1), M. Mc Keown(1), R. L. Pinning(1), A. R. Porter(1), K. Roy(1), and M. Riding(1)
1. Manchester Computing, University of Manchester
2. CLRC Daresbury Laboratory, Daresbury
3. Tufts University, Massachusetts
4. Centre for Computational Science, University College London
5. Department of Physics & Astronomy, University College London
http://www.realitygrid.org
http://www.realitygrid.org/TeraGyroid.html

New Application at AHM2004
"Exact" calculation of peptide-protein binding energies by steered thermodynamic integration using high-performance computing grids
Philip Fowler, Peter Coveney, Shantenu Jha and Shunzhou Wan
UK e-Science All Hands Meeting, 31 August – 3 September 2004

Why are we studying this system?
Measuring binding energies is vital, e.g. for designing new drugs
Calculating a peptide-protein binding energy can take weeks to months
We have developed a grid-based method to accelerate this process
– to compute G_bind during the AHM 2004 conference, i.e. in less than 48 hours
– using the federated resources of the UK National Grid Service and the US TeraGrid

Thermodynamic Integration on Computational Grids
[Diagram: the workflow. Starting from one conformation, use steering to launch, spawn and terminate jobs; seed successive simulations at lambda = 0.1, 0.2, 0.3, ..., 0.9 (10 simulations, each 2 ns); run each independent job on the Grid; check the per-window averages of dH/dlambda for convergence; combine and calculate the integral.]

[Screenshots: checkpointing; steering and control; monitoring]

We successfully ran many simulations...
This is the first time we have completed an entire calculation
– the insight gained will help us improve the throughput
The simulations were started at 5pm on Tuesday and the data was collated at 10am on Thursday
26 simulations were run
At 4.30pm on Wednesday we had nine simulations in progress (140 processors)
– 1x TG-SDSC, 3x TG-NCSA, 3x NGS-Oxford, 1x NGS-Leeds, 1x NGS-RAL
We simulated over 6.8 ns of classical molecular dynamics in this time

Very preliminary results
[Plot: dE/dlambda (kcal/mol) versus lambda from 0 to 1, as at 41 hours.]
– thermodynamic integration ("quick and dirty" analysis, as at 41 hours): ΔG = -1.0 ± 0.3 kcal/mol
– experiment: -9 to -12 kcal/mol
We expect our value to improve with further analysis around the endpoints

Conclusions
We can harness today's grids to accelerate high-end computational science
On-line visualization and job migration require high-bandwidth networks
Need persistent network infrastructure – else set-up costs are too high
QoS: would like the ability to reserve bandwidth
– and processors, graphics pipes, AG rooms, virtual venues, nodops... (but that's another story)
Hence our interest in UKLight
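The thermodynamic-integration workflow shown earlier reduces to a short numerical recipe: run an independent simulation at each fixed value of the coupling parameter lambda, average dH/dlambda within each window, then integrate those averages over lambda. A minimal sketch in Python, with made-up window averages standing in for the real 2 ns molecular-dynamics runs (the function name and toy data are illustrative, not the project's actual analysis code):

```python
import numpy as np

def delta_g(lambdas, dh_dlam):
    """Thermodynamic integration: Delta G = integral over [0, 1] of
    <dH/dlambda>, estimated from per-window averages by the
    trapezoidal rule."""
    lambdas = np.asarray(lambdas, dtype=float)
    dh_dlam = np.asarray(dh_dlam, dtype=float)
    return float(np.sum(0.5 * (dh_dlam[1:] + dh_dlam[:-1])
                        * np.diff(lambdas)))

# 11 windows at lambda = 0.0, 0.1, ..., 1.0; in the real workflow each
# <dH/dlambda> value comes from an independent MD job on the Grid.
lam = np.linspace(0.0, 1.0, 11)
dh = -4.0 * np.ones_like(lam)   # toy data: constant integrand
dG = delta_g(lam, dh)           # -> -4.0 for this toy case
```

The "check for convergence" step in the slide corresponds to extending each window until its running average of dH/dlambda stabilizes before the integral is combined; steering is what lets those window jobs be launched, spawned and terminated on demand.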