Preparing for Science at the Petascale: Real Science on up to Tens of Thousands of Cores

Radhika S. Saksena (1), Bruce Boghosian (2), Luis Fazendeiro (1), Owain A. Kenway, Steven Manos (1), Marco Mazzeo (1), S. Kashif Sadik (1), James L. Suter (1), David Wright (1) and Peter V. Coveney (1)
1. Centre for Computational Science, UCL, UK
2. Tufts University, Boston, USA

Contents
• Moving towards the petascale
  - HECToR (XT4 and X2)
  - Ranger (TACC)
  - Intrepid (ALCF)
• Scientific fields of research:
  – Turbulence
  – Liquid crystalline rheology
  – Clay-polymer nanocomposites
  – HIV drug resistance
  – Patient-specific haemodynamics
• Conclusions

New era of petascale machines
Ranger (TACC) - NSF-funded Sun cluster
• 0.58 petaflops (theoretical) peak: ~10 times HECToR (59 Tflops); “bigger” than all other TeraGrid resources combined
• Linpack speed 0.31 petaflops; 123 TB memory
• Architecture: 82 racks; 1 rack = 4 chassis; 1 chassis = 12 nodes
• 1 node = Sun Blade x6420 (four 64-bit quad-core AMD Opteron processors)
• 3,936 nodes = 62,976 cores
Intrepid (ALCF) - DOE-funded Blue Gene/P
• 0.56 petaflops (theoretical) peak
• 163,840 cores; 80 TB memory
• Linpack speed 0.45 petaflops
• “Fastest” machine available for open science and third overall [1]
1. http://www.top500.org/lists/2008/06

New era of petascale machines
The US is firmly committed to the path to the petascale (and beyond):
• NSF: Ranger (5 years, $59 million award)
• University of Tennessee, to build a system with just under 1 PF peak performance ($65 million, 5-year project) [1]
• “Blue Waters” will come online in 2011 at NCSA ($208 million grant), using IBM technology, to deliver 10 Pflops peak performance (~200K cores, 10 PB of disk)
1. http://www.nsf.gov/news/news_summ.jsp?cntn_id=109850

New era of petascale machines
• We wish to do new science at this scale - not just incremental advances
• Applications that scale linearly up to tens of thousands of cores (large system sizes, many time steps) - capability computing at the petascale
• High throughput for “intermediate scale” applications (in the 128 - 512 core range)

Intercontinental HPC grid environment
[Diagram: a federated environment spanning the US TeraGrid (NCSA, SDSC, ANL/Intrepid, TACC/Ranger, PSC), the UK NGS (HECToR, HPCx, Leeds, Manchester, Oxford, RAL) and DEISA, fronted by the AHE and supporting massive data transfers, lightpaths, advanced reservation/co-scheduling and emergency/pre-emptive access.]

Lightpaths - Dedicated 1 Gb UK/US network
JANET Lightpath is a centrally managed service which supports large research projects on the JANET network by providing end-to-end connectivity, from 100s of Mb up to whole fibre wavelengths (10 Gb).
Typical usage:
– Dedicated 1 Gb network to connect to national and international HPC infrastructure
– Shifting TB datasets between the UK and US
– Real-time visualisation
– Interactive computational steering
– Cross-site MPI runs (e.g. between NGS2 Manchester and NGS2 Oxford)

Advanced reservations
• Plan in advance to have access to the resources - the process of reserving multiple resources for use by a single application
  - HARC [1] - Highly Available Resource Co-Allocator
  - GUR [2] - Grid Universal Remote
• The resources can be reserved:
  – For the same time (a minimal illustration of this common-window requirement is sketched below):
    • Distributed MPIg/MPICH-G2 jobs
    • Distributed visualization
    • Booking equipment (e.g. visualization facilities)
  – Or for some coordinated set of times:
    • Computational workflows
• Urgent computing and pre-emptive access (SPRUCE)
1. http://www.realitygrid.org/middleware.shtml#HARC
2. http://www.ncsa.uiuc.edu/UserInfo/Resources/Hardware/TGIA64LinuxCluster/Doc/coschedule.html
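To make the co-scheduling requirement concrete, the snippet below is a minimal, illustrative C sketch of the check a co-allocator has to perform before a cross-site MPIg run can start: every per-site advance reservation must share a common time window. The struct and function names are hypothetical and do not correspond to the HARC or GUR interfaces.

    /* Illustrative sketch only: finding a common time window across several
     * per-site advance reservations, the basic requirement for a co-scheduled
     * cross-site MPIg run.  Names are hypothetical, not the HARC/GUR API. */
    #include <stdio.h>

    typedef struct {
        const char *site;   /* e.g. "HECToR", "Ranger" */
        long start;         /* reservation start, seconds since epoch */
        long end;           /* reservation end, seconds since epoch   */
    } reservation_t;

    /* Returns 1 and fills [*win_start, *win_end] if all reservations overlap. */
    static int common_window(const reservation_t *r, int n,
                             long *win_start, long *win_end)
    {
        long s = r[0].start, e = r[0].end;
        for (int i = 1; i < n; i++) {
            if (r[i].start > s) s = r[i].start;
            if (r[i].end   < e) e = r[i].end;
        }
        if (s >= e) return 0;      /* no overlap: co-scheduling impossible */
        *win_start = s;
        *win_end   = e;
        return 1;
    }

    int main(void)
    {
        /* Arbitrary illustrative times, not real booking data. */
        reservation_t res[] = { { "HECToR", 1000, 5000 },
                                { "Ranger", 2000, 6000 } };
        long s, e;
        if (common_window(res, 2, &s, &e))
            printf("co-scheduling window: %ld to %ld\n", s, e);
        else
            printf("no common window: reservations must be renegotiated\n");
        return 0;
    }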
Application Hosting Environment
• Middleware which simplifies access to distributed resources and manages workflows
• Wrestling with middleware cannot be a limiting step for scientists - the AHE hides the complexities of the ‘grid’ from the end user
• Applications are stateful Web services
• An application can consist of a coupled model, a parameter sweep, a steerable application, or a single executable

HYPO4D - Hydrodynamic periodic orbits in 4D
• Scientific goal: to identify and characterize unstable periodic orbits (UPOs) in driven turbulent fluid flow, from which exact time averages can be computed
• The approach parallelizes both space and time
• A well-known formalism exists from which observables can be computed through the dynamical zeta function
• The degree of accuracy is high and converges quickly with the number of known lower-period UPOs
• No need to redo the initial value problem every time we wish to compute the average of some quantity
• Observables are no longer stochastic in nature - we have an exact expansion with which to compute them

HYPO4D
• Uses the lattice-Boltzmann method: stream and collide (local collision operator + linear streaming operator)
• Full locality with nearest-neighbour communications
• Code written in C, using MPI for communications; halo-swapping is done strictly through non-blocking (asynchronous) MPI SEND/RECV
• No aggressive optimization pursued, so as to run on as many platforms as possible

HYPO4D
• HYPO4D performance: linear scaling up to 16K cores, 70% speed-up from 16,384 to 32,768 cores

HYPO4D
• HYPO4D performance: linear scaling up to 33K cores, 38% speed-up from 32,768 to 65,536 cores

HYPO4D
• HECToR results on the XT4 are the best so far in terms of timing, i.e. code performance: approximately 50% faster than Ranger
• The value of SUPS (lattice-site updates per second) remains practically constant and higher than on Intrepid or even Ranger

HYPO4D
• Amdahl's Law for vectorization: the performance of a program is dominated by its slowest component, which in the case of a vectorized program is the scalar code (http://www.pcc.qub.ac.uk/tec/courses/cray/ohp/CRAY-slides_3.html)
• Amdahl's Law - the formulation for vector code which is Rv times faster than scalar code is:
    Sv = 1 / (Fs + Fv / Rv)
  where
  • Sv = maximum expected speedup
  • Fv = fraction of the program that is vectorized
  • Fs = fraction of the program that is scalar = 1 - Fv
  • Rv = ratio of scalar to vector processing time

HYPO4D
• For Cray Research systems, Rv ranges from 10 to 20.
• It is common for programs to be 70% to 80% vectorized, i.e. 70% to 80% of their running time is spent executing vector instructions.
• It is not always easy to reach 70% to 80% vectorization in a program, and vectorizing beyond this level becomes increasingly difficult, normally requiring major changes to the algorithm.
• Many users stop their vectorization efforts once the vectorized code is running 2 to 4 times faster than the scalar code (the sketch below works through these numbers).
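The figures just quoted can be checked directly against the formula above. The short C sketch below (an illustration only, not HYPO4D code) evaluates Sv = 1 / (Fs + Fv / Rv) for Fv = 0.7-0.8 and Rv = 10-20, reproducing the roughly 2-4x speedup at which users typically stop vectorizing.

    /* Sketch: Amdahl's Law for vectorization, Sv = 1 / (Fs + Fv / Rv),
     * evaluated for the ranges quoted on the previous slides
     * (Fv = 0.7-0.8, Rv = 10-20).  Illustration only. */
    #include <stdio.h>

    static double amdahl_vector_speedup(double fv, double rv)
    {
        double fs = 1.0 - fv;          /* scalar fraction */
        return 1.0 / (fs + fv / rv);   /* maximum expected speedup */
    }

    int main(void)
    {
        const double fv[] = { 0.70, 0.80 };
        const double rv[] = { 10.0, 20.0 };
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                printf("Fv = %.2f, Rv = %.0f  ->  Sv = %.2f\n",
                       fv[i], rv[j], amdahl_vector_speedup(fv[i], rv[j]));
        return 0;
    }

For example, Fv = 0.8 with Rv = 10 gives Sv = 1 / (0.2 + 0.08) ≈ 3.6, consistent with the 2-4x rule of thumb.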
HYPO4D
• X2 (Black Widow component) comparison with the XT4
• Excellent performance with some simple code changes (such as changing the order of loops and better managing I/O calls); still room for improvement (work in progress)
• For 16 and 32 cores, already a 64% increase in SUPS, with NAG's support (Ning Li)

HYPO4D
• Novel approach to turbulence studies: efficiently parallelizes time and space
• The algorithm is extremely memory-intensive: full space-time trajectories are numerically relaxed to a nearby minimum (an unstable periodic orbit)
• Ranger is currently the best resource for this work (123 TB of aggregated RAM)
• During the early-user period, millions of time steps were simulated for different systems and then compared, producing ~10 TB of data

LB3D -- complex condensed matter physics
• LB3D is a three-dimensional lattice-Boltzmann solver for multi-component fluid dynamics, in particular amphiphilic systems
• Mature code - 9 years in development. It has been used extensively on the US TeraGrid, UK NGS, HECToR and HPCx machines
• The largest model simulated to date is 2048^3 (which needs Ranger)

Lattice-Boltzmann Simulations of Amphiphilic Liquid Crystals - Scientific Objectives
• Mesophase self-assembly - cubics: gyroid, diamond and primitive; non-cubics: lamellar and hexagonal; and exotic multi-lamellar onion and vesicular structures
• Cubic phase rheology
• Pressure- and shear-induced phase transitions: cubic -> cubic and lamellar/hexagonal -> cubic
• Defect dynamics
• Finite-size effects
[Figures: rheological simulations under shear; an amphiphilic fluid cubic mesophase; the influence of defects.]

LB3D benchmarks on Ranger, HECToR XT4 and HECToR X2
[Benchmark plots.] Description: HECToR is a Cray XT4 machine in the UK with a current theoretical peak of 59 Tflops and 33.2 TB of memory overall. No optimization has been applied to LB3D for the X2; we expect vector performance to improve after vectorisation of the control flow.

Lattice-Boltzmann Simulations of Amphiphilic Liquid Crystals - Self-Assembly Results
Cubic, lamellar and HPL phases:
• Gyroid phase simulation. Gyroid phase - periodic nodal approximation (a small C sketch evaluating these nodal approximations follows at the end of this set of results):
    sin(2πx)cos(2πy) + sin(2πy)cos(2πz) + sin(2πz)cos(2πx) = 0
• Diamond phase simulation. Diamond phase - periodic nodal approximation:
    cos(2πx)cos(2πy)cos(2πz) + cos(2πx)sin(2πy)sin(2πz) + sin(2πx)cos(2πy)sin(2πz) + sin(2πx)sin(2πy)cos(2πz) = 0
• Primitive phase simulation. Primitive phase - periodic nodal approximation:
    cos(2πx) + cos(2πy) + cos(2πz) = 0
• Lamellar phase, ripple phase and hexagonal perforated lamellar (HPL) phase

Lattice-Boltzmann Simulations of Amphiphilic Liquid Crystals - Self-Assembly by varying pressure/concentration
• Variation of the scalar pressure is in agreement with the morphology obtained in experiments. In order of decreasing pressure: lamellar -> ripple -> primitive -> hexagonal.
• Variation of the relative concentration of amphiphile is also in agreement with experiment. The following sequence, observed with increasing relative amphiphile concentration, agrees with experiments: sponge -> gyroid -> lamellar.
R. S. Saksena and P. V. Coveney, “Self-Assembly of Ternary Cubic, Hexagonal and Lamellar Mesophases using the Lattice-Boltzmann Kinetic Method”, J. Phys. Chem. B, 112(10), 2950-2957 (2008).
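For reference, the periodic nodal approximations quoted above are straightforward to evaluate numerically. The C sketch below (an illustration only, not LB3D code) computes the gyroid, diamond and primitive nodal functions at a point in the unit cell; the sign of the result indicates which side of the interface the point lies on, and the zero level set approximates the corresponding surface. The test point is arbitrary.

    /* Sketch: evaluating the periodic nodal approximations for the gyroid,
     * diamond and primitive surfaces at a point (x, y, z) in the unit cell.
     * Illustration only; compile with: cc nodal.c -lm */
    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    static double gyroid(double x, double y, double z)
    {
        return sin(2*M_PI*x)*cos(2*M_PI*y)
             + sin(2*M_PI*y)*cos(2*M_PI*z)
             + sin(2*M_PI*z)*cos(2*M_PI*x);
    }

    static double diamond(double x, double y, double z)
    {
        return cos(2*M_PI*x)*cos(2*M_PI*y)*cos(2*M_PI*z)
             + cos(2*M_PI*x)*sin(2*M_PI*y)*sin(2*M_PI*z)
             + sin(2*M_PI*x)*cos(2*M_PI*y)*sin(2*M_PI*z)
             + sin(2*M_PI*x)*sin(2*M_PI*y)*cos(2*M_PI*z);
    }

    static double primitive(double x, double y, double z)
    {
        return cos(2*M_PI*x) + cos(2*M_PI*y) + cos(2*M_PI*z);
    }

    int main(void)
    {
        double x = 0.1, y = 0.2, z = 0.3;   /* arbitrary test point */
        printf("gyroid    f = %+.3f\n", gyroid(x, y, z));
        printf("diamond   f = %+.3f\n", diamond(x, y, z));
        printf("primitive f = %+.3f\n", primitive(x, y, z));
        return 0;
    }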
Cubic Phase Rheology Results [1]
• Recent results include the tracking of large-time-scale defect dynamics in 1024^3 lattice-site systems; this is only possible on Ranger, due to the sustained core count and disk storage requirements
• A 256^3 lattice-site gyroidal system with multiple domains
• Regions of high stress magnitude are localized in the vicinity of defects
• The fluid is viscoelastic
1. See R. S. Saksena et al., “Real science at the petascale”, preprint (2008)

Clay-Polymer Nanocomposites
• Aluminosilicate clay sheets carry a negative charge.
• This is charge-balanced by cations such as Na+ and K+.
• Water of hydration is also present.
• The nano-scale clay sheets may be dispersed within a polymer in three ways:
  – As tactoids
  – With the polymer intercalated
  – Exfoliated
  Research shows a mixture of the latter two is typical.

Our Interest in Clay-Polymer Nanocomposites
• Drilling-fluid additives to prevent clay swelling during borehole drilling.
• Significant interest from the oilfield industry - US Patent 6,787,507, “Stabilizing clayey formations”.
• Low-clay-fraction clay-polymer nanocomposites are receiving much attention due to the improvement in the materials and thermochemical properties of plastics for a small increase in weight.
• To understand the mechanical properties of the composites, we require accurate knowledge of the materials properties of each component.
• Experimentally, determining the material properties of clays (such as montmorillonite) has proved very difficult. Can simulation provide an answer?
• £1.6M TSB project with MI-SWACO, including HECToR access

Simulation work
• The small thickness and separation of the clay sheets (~1 nm), and the changes that occur due to intermolecular interactions, require that we simulate the clay sheet in atomistic detail.
• We currently work with the following software:
  – MD code = LAMMPS (5K to >10^7 atom models), the Large-scale Atomic/Molecular Massively Parallel Simulator. It uses spatial-decomposition techniques to partition the simulation cell into 3D sub-domains, one of which is assigned to each processor (a minimal illustration of this decomposition is sketched below). LAMMPS uses FFTW to calculate the electrostatic interactions in reciprocal space, and keeps track of nearby particles for the real-space part of the electrostatics and for pairwise interactions (i.e. Lennard-Jones) through neighbour lists.
  – Builder = Accelrys Inc. Cerius2
  – MC code = Cerius2 (C2) Discover
  – MD code = C2 Discover (small, <5K atom models)
  – Force field = ClayFF
  – Visualizer = Chimera (small models)
  – Visualizer = AtomEye (large models)

Large-Scale MD: sheet undulations
Case study: non-swelling amine-clay composites
• Using very large simulation super-cells (350,840 atoms; 28 nm x 50 nm x 3 nm), the clay sheets are able to flex, resulting in a much broader distribution of atom density.
• This distortion may be more evident at even larger super-cell sizes, or using 2-dimensional boundary conditions.
• Small models suffer from unphysical finite-size effects.

Large-Scale MD: Reaching the Limits of Current Computing Power
• How can we simulate clay materials on a large enough scale to capture this “mesoscopic” motion of the undulating clay sheets, while still capturing all atomistic information?
• Answer: we federate international grids in the US, UK and Europe, allowing us transparent access to unprecedented resources.

Large-Scale MD
• Using our federated grid, we can use LAMMPS and the fully flexible force field ClayFF to simulate in detail the long-length-scale motions which emerge with increasing size of clay sheet. These undulations are prohibited at the small sizes commonly used in MD simulations.
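As a concrete picture of the spatial decomposition described in the Simulation work slide, the C sketch below maps an atom position to the rank of the processor whose sub-domain contains it, for a regular Px x Py x Pz grid over the simulation cell. This is a hand-rolled illustration of the general technique, not LAMMPS source code; the box dimensions reuse the 28 nm x 50 nm x 3 nm amine-clay super-cell quoted above, and the 4 x 8 x 1 processor grid is an arbitrary example.

    /* Sketch of spatial decomposition: the simulation cell is split into a
     * Px x Py x Pz grid of equal sub-domains and each atom is owned by the
     * processor whose sub-domain contains it.  Illustration only, not
     * LAMMPS source code. */
    #include <stdio.h>

    typedef struct { double x, y, z; } vec3;

    /* Map a position inside box [0, L) to a rank on a Px*Py*Pz processor grid. */
    static int owner_rank(vec3 pos, vec3 box, int px, int py, int pz)
    {
        int ix = (int)(pos.x / box.x * px);
        int iy = (int)(pos.y / box.y * py);
        int iz = (int)(pos.z / box.z * pz);
        if (ix >= px) ix = px - 1;   /* guard against positions exactly on the edge */
        if (iy >= py) iy = py - 1;
        if (iz >= pz) iz = pz - 1;
        return (ix * py + iy) * pz + iz;   /* simple row-major rank ordering */
    }

    int main(void)
    {
        vec3 box  = { 28.0, 50.0, 3.0 };   /* nm, amine-clay super-cell from above */
        vec3 atom = { 13.7, 42.1, 1.2 };   /* nm, arbitrary test position */
        printf("atom owned by rank %d of %d\n",
               owner_rank(atom, box, 4, 8, 1), 4 * 8 * 1);
        return 0;
    }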
• Such simulations give information regarding the mechanical properties of the composite, allowing comparison with experiment and the prediction of material properties.
• The largest system we have simulated is a 10-million-atom clay-water system of dimensions 150 x 270 x 25 nm. This approaches the realistic size of a platelet.

Clay-Polymer Nanocomposites: Elastic Properties
• We want to find out the wavelengths and amplitudes of the thermal undulations of the clay sheet.
• We therefore need to calculate the central position of the clay sheet on a regular grid from the atomic positions.
• We then Fourier transform this height function h(x,y) to find out the wavelengths and amplitudes.
[Figure: height function h(x,y) of a 1-million-atom montmorillonite clay sheet discretised onto a 100 x 150 grid, with grid spacing 0.45 nm x 0.45 nm.]

Clay-Polymer Nanocomposites: Elastic Properties
• The amplitudes can be related to the elastic properties of the clay.
• The free energy per unit area for the undulations is related to the bending of the surface:
    Gund(x,y) = 0.5 kc |∇^2 h(x,y)|^2 + 0.5 γ |∇h(x,y)|^2
  where kc is the bending modulus, i.e. how easy it is for the sheet to bend, and γ is the surface tension, which resists increasing the surface area of the sheet.
• If we convert to Fourier space, the above expression becomes, for wavevectors k:
    Gund(k) = 0.5 (kc k^4 + γ k^2) |h(k)|^2

Clay-Polymer Nanocomposites: Elastic Properties
• If we assume each mode of vibration has the same energy of (1/2) kB T (the equipartition principle, with kB the Boltzmann constant), the amplitude of the undulations becomes:
    <h^2(k)> = kB T / [A (kc k^4 + γ k^2)]
  where A is the clay sheet area.
• We see k^-4 behaviour at long wavelengths; the gradient gives kc.
• An artifact is present due to the imposed periodicity of the model.
• The collective motion exhibiting k^-4 behaviour is only apparent at wavelengths greater than 15 nm.

How Large Can We Go With LAMMPS?
• We have benchmarked LAMMPS using our clay simulations (full interactions including bonds, angles, Lennard-Jones and electrostatic interactions) on Ranger at TACC, University of Texas - the largest computing system in the world for open science research.
• The Sun Constellation Linux cluster, Ranger, is configured with 3,936 16-way SMP compute nodes (blades).
• We find a significant tailing off in performance when the number of atoms per core is < 10,000 (a short sketch applying this rule of thumb follows after the scaling results below).
• Compiler options: mpicc -x0 -O3

LAMMPS scaling on Intrepid (BlueGene/P)
• The time per simulation timestep is approximately 4x slower than on Ranger
• Access to more processors is needed for equivalent wall-clock time

LAMMPS Scaling on HECToR XT4
• Benchmarked with 9.5-million-atom and 85-million-atom systems
• LAMMPS is a supported code on HECToR XT4
• Approximately 33% faster per timestep than Ranger

LAMMPS on the X2?
• LAMMPS currently uses spatial sub-division methods with explicit communication for parallelization.
• A large amount of code reorganisation would be required to utilise the vector architecture effectively.
• Successful vectorisation is possible when identical operations are repeated over comparatively large data items organized as linear arrays. LAMMPS uses neighbour-list methods for calculating pair potentials, giving a low number of iterations in the innermost loops, and the data are not independent [1].
• LAMMPS uses FFTW to calculate the electrostatics; FFTW does not vectorise, and the FFTs would have to be rewritten to use Cray FFTs.
1. Rapaport, D. C., Comput. Phys. Commun. 174 (2006) 521-529
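The “atoms per core” rule of thumb from the Ranger benchmarks can be applied directly to the system sizes mentioned in this talk. The C sketch below does only that arithmetic: the 10,000 atoms-per-core threshold is the empirical figure quoted above, not a property of LAMMPS itself, and the resulting core counts are indicative rather than measured.

    /* Sketch: applying the empirical "tailing off below ~10,000 atoms per
     * core" observation to the system sizes quoted in this talk.
     * Illustration only. */
    #include <stdio.h>

    #define ATOMS_PER_CORE_MIN 10000L   /* empirical efficiency threshold */

    static long max_useful_cores(long natoms)
    {
        return natoms / ATOMS_PER_CORE_MIN;
    }

    int main(void)
    {
        /* amine-clay super-cell, HECToR benchmarks, largest clay-water system */
        long systems[] = { 350840L, 9500000L, 10000000L, 85000000L };
        for (int i = 0; i < 4; i++)
            printf("%9ld atoms: efficient up to ~%ld cores\n",
                   systems[i], max_useful_cores(systems[i]));
        return 0;
    }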
How Large Can We Go With LAMMPS?
• We cannot go above 2 billion atoms (2^31) with LAMMPS, because the atom IDs required to find bond partners are stored as positive 32-bit integers.
• Therefore, major code revision is required to access systems of more than 2 billion atoms.
• In the meantime, we are investigating coarse-grained molecular dynamics, built using details from atomistic molecular dynamics.
[Figures: atomistic MD and coarse-grained MD representations.]

Clay-Polymer Nanocomposites: Future directions
• The question of whether periodic boundary conditions become limiting at large length scales remains to be addressed.
• Answer: remove the periodic boundary conditions!
• Simulating “life-sized” clay platelets (approx. 50 nm across):
  – will remove artificial constraints
  – will allow us to examine unexplored behaviours, such as interactions at clay sheet edges and undulations
• Clay sheets in a simulation box of water or a polymer matrix
• Requires very large systems to ensure our models are realistic: between 10^6 and 10^8 atoms
• We can explore even larger systems for longer timescales if we use coarse-grained molecular dynamics

Clay-Polymer Nanocomposites: Future Directions
Isolated Clay Sheets - Interactions Over Many Length Scales
• With our very large atomistic and coarse-grained simulations of clay sheets, we will be able to investigate the important chemistry of clays that occurs at the edges of the clay sheets.
• On a short length scale this can affect ion absorption, interactions with polymers, acid-base properties, etc.
• On a long length scale, it may affect the mesoscopic structure of the clay platelets.
[Figure: mesophases of oblate uniaxial particles dispersed in polymer: a) isotropic, b) nematic, c) smectic A, d) columnar, e) house-of-cards plastic solid, f) crystal. From Ginsburg et al., 2000.]

HIV-1 Protease
• Enzyme of HIV responsible for protein maturation
• Target for anti-retroviral inhibitors
• An example of structure-assisted drug design
• Several inhibitors of HIV-1 protease
[Figure: the protease dimer with saquinavir bound in the active site, labelling Monomer A (residues 1-99), Monomer B (residues 101-199), the flaps, the P2 subsite, glycine 48/148, leucine 90/190, the catalytic aspartic acids 25/125, and the N- and C-termini.]

So what's the problem?
• Emergence of drug-resistant mutations in the protease
• These render the drug ineffective
• Drug-resistant mutants have emerged for all inhibitors

Molecular Dynamics Simulations of HIV-1 Protease
Aims:
• Study the differential interactions between wild-type and mutant proteases with an inhibitor
• Gain insight at the molecular level into the dynamical cause of drug resistance
• Determine conformational differences of the drug in the active site
Mutant 1: G48V (glycine to valine); Mutant 2: L90M (leucine to methionine); Inhibitor: saquinavir

HIV-1 drug resistance
• Goal: to study the effect of antiretroviral inhibitors (targeting proteins in the HIV lifecycle, such as the viral protease and reverse-transcriptase enzymes)
• High-end computational power to confer clinical decision support
• On Ranger, up to 100 replicas (configurations) were simulated, for the first time, in some cases going to 100 ns
• 3.5 TB of trajectory and free energy analysis
• Binding energy differences compared with experimental results for wild-type and MDR proteases with the inhibitors LPV and RTV, using 10 ns trajectories
• 3.4 microseconds in four weeks (average 120 ns/day, peak 300 ns/day); a quick check of this arithmetic is sketched below
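As a quick sanity check of the throughput figure just quoted, the sketch below converts the sustained rate of 120 ns/day over four weeks into total simulated time, giving approximately the 3.4 microseconds reported. The peak-rate line is included only for comparison; that rate was not sustained. Illustration only.

    /* Sketch: checking the aggregate simulation throughput quoted above.
     * 28 days x 120 ns/day = 3360 ns, i.e. ~3.4 microseconds. */
    #include <stdio.h>

    int main(void)
    {
        const double avg_ns_per_day  = 120.0;
        const double peak_ns_per_day = 300.0;
        const double days            = 28.0;   /* four weeks */

        printf("sustained: %.2f microseconds over %g days\n",
               avg_ns_per_day * days / 1000.0, days);
        printf("peak rate, if sustained (for comparison only): %.2f microseconds\n",
               peak_ns_per_day * days / 1000.0);
        return 0;
    }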
• AHE-orchestrated workflows

GENIUS project
Grid Enabled Neurosurgical Imaging Using Simulation (GENIUS)
• Scientific goal: to perform real-time, patient-specific medical simulation
• Combines blood flow simulation with clinical data
• Fitting the computational time scale to the clinical time scale:
  – capture the clinical workflow
  – get results which will influence clinical decisions: 1 day? 1 week?
  – GENIUS: 15 to 30 minutes

GENIUS project
• Blood flow is simulated using the lattice-Boltzmann method (HemeLB)
• A parallel ray tracer performs real-time in situ visualization
• Sub-frames are rendered on each MPI processor/rank and composited before being sent over the network to a (lightweight) viewing client
• The addition of volume rendering reduces the scalability of the fluid solver, due to the global communications required
• Even so, datasets are rendered at more than 30 frames per second (1024^2 pixel resolution)

CONCLUSIONS
• A wide range of scientific research activities was presented that makes effective use of the new range of petascale resources available in the USA
• These demonstrate the emergence of new science not possible without access to resources of this scale
• These applications have shown linear scaling up to tens of thousands of cores
• Vectorisation of lattice-Boltzmann codes can be expected to yield performance 2-4 times faster
• Future prospects: we are well placed to move onto the next machines coming online in the US and Japan

Acknowledgements
JANET/David Salmon
NGS staff
TeraGrid staff
Simon Clifford (CCS)
Jay Bousseau (TACC)
Lucas Wilson (TACC)
Pete Beckmann (ANL)
Ramesh Balakrishnan (ANL)
Brian Toonen (ANL)
Prof. Nicholas Karonis (ANL)
Kevin Roy (Cray)
Ning Li (NAG)