RealityGrid
A testbed for computational condensed matter science
Peter Coveney
Paris, 31 March 2003

Academic Partners
University College London, Queen Mary, University of London, Imperial College, University of Manchester, University of Edinburgh, University of Oxford, Loughborough University
www.realitygrid.org

RealityGrid – Collaborating institutions
Schlumberger, Edward Jenner Institute for Vaccine Research, Silicon Graphics Inc, Computation for Science Consortium, Advanced Visual Systems, Fujitsu, British Telecommunications

RealityGrid—the Concept
• A RealityGrid generalises the concept of a Reality Centre across a network of computational resources managed by Grid middleware.
• It optimises the scientific discovery process by integrating simulation, visualization and data from experimental facilities – real-time data mining.
• It builds on and extends the functionality of a DataGrid.
• New middleware issues arise because a RealityGrid must address the synchronicity of resources and their interaction.
http://www.realitygrid.org

UK and US Grid Technologies
[Diagram: Grid middleware links instruments (XMT devices, LUSI, …), HPC resources, storage devices, visualization engines, VR and/or Access Grid nodes, and a user with a laptop; scalable MD, MC and mesoscale modelling with performance control/monitoring and steering.]
Moving the bottleneck out of the hardware and into the human mind…

The TeraGyroid Project
• Funded by EPSRC (UK) and NSF (USA) to join the UK e-Science Grid and the US TeraGrid.
• The main objective was to deliver high-impact science which would not be possible without the combined resources of the US and UK grids.
• Study of defect dynamics in liquid crystalline surfactant systems using lattice-Boltzmann methods.
• Three-month project, including work exhibited at Supercomputing 2003 and SC Global.
• TRICEPS was the HPC-Challenge aspect of this work.

Collaborating Institutions
...and hundreds of individuals at: Argonne National Laboratory (ANL), Boston University, BT, BT Exact, Caltech, CSC, Computing Services for Academic Research (CSAR), Daresbury Laboratory, Department of Trade and Industry (DTI), Edinburgh Parallel Computing Centre, Engineering and Physical Sciences Research Council (EPSRC), Forschungszentrum Jülich, HLRS (Stuttgart), HPCx, IBM, Imperial College London, National Center for Supercomputing Applications (NCSA), Pittsburgh Supercomputing Center, San Diego Supercomputer Center, SCinet, SGI, SURFnet, TeraGrid, Tufts University (Boston), UKERNA, UK Grid Support Centre, University College London, University of Edinburgh, University of Manchester

LB3D: Three-dimensional Lattice-Boltzmann simulations
• The LB3D code is written in Fortran90 and parallelized using MPI.
• Scales linearly on all available resources (CSAR, HPCx, Lemieux, Linux/Itanium).
• Fully steerable.
• Uses the parallel data format PHDF5 (parallel HDF5; see the sketch below).
• The data produced during a single large-scale simulation can range from hundreds of gigabytes to terabytes.
• Simulations require supercomputers.
• High-end visualization hardware and parallel rendering software (e.g. VTK) are needed for data analysis.
[Figure: 3D datasets showing snapshots from a simulation of spinodal decomposition. A binary mixture of water and oil phase separates; 'blue' areas denote high water densities and 'red' marks the interface between the two fluids.]
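As a rough illustration of the parallel-I/O pattern LB3D relies on (the production code itself is Fortran90 writing PHDF5), the sketch below shows how each MPI rank can write its sub-domain of a 3D density field into a single shared HDF5 file. It is a minimal Python sketch using mpi4py and an MPI-enabled h5py build, with a hypothetical one-dimensional slab decomposition and invented array sizes; it is not the actual LB3D output routine.

```python
# Minimal sketch of parallel HDF5 output (assumes h5py built with MPI support).
# Decomposition, dataset name and sizes are illustrative, not those used by LB3D.
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

nx, ny, nz = 64, 64, 64                      # global lattice size (hypothetical)
local_nx = nx // nprocs                      # simple 1D slab split; assumes nprocs divides nx
density = np.random.rand(local_nx, ny, nz)   # stand-in for one rank's density field

# Every rank opens the same file via the MPI-IO driver.
with h5py.File("density_t000100.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("density", shape=(nx, ny, nz), dtype="f8")
    # Each rank writes only its own slab of the global dataset.
    x0 = rank * local_nx
    dset[x0:x0 + local_nx, :, :] = density
```

Keeping such dumps infrequent, or limited to the fields of current interest, is exactly what the steering-controlled output frequency described next makes possible.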
Computational Steering with LB3D
• All simulation parameters are steerable using the RealityGrid steering library.
• Checkpointing/restarting functionality allows 'rewinding' of simulations and run-time job migration across architectures.
• Steering reduces storage requirements because the user can adapt data-dumping frequencies.
• CPU time can be saved because users do not have to wait for jobs to finish if they can already see that nothing relevant is happening.
• Instead of "task farming", i.e. launching many simulations at the same time, parameter searches can be done by "steering" through the parameter space of a bigger simulation.
• Analysis time is significantly reduced because less irrelevant data is produced.

Parameter Space Exploration
[Diagram: starting from an initial condition of a random water/surfactant mixture, self-assembly starts; rewinding and restarting from a checkpoint explores different phases – a cubic micellar phase with a high surfactant density gradient, a cubic micellar phase with a low surfactant density gradient, or a lamellar phase of surfactant bilayers between water layers.]

Distributed knowledge discovery
The visualised output is streamed to a distributed set of collaborators located at Access Grid nodes across the USA and UK, who also interact with the simulations. One of RealityGrid's central aims—to provide Grid-enabled collaboratories—is now being realised. This distributed, collaborative activity further accelerates the discovery of new scientific phenomena hidden in terabytes of data.
[Figure: liquid crystalline gyroid.]
N. González-Segredo and P. V. Coveney, "Self-assembly of the gyroid cubic mesophase: lattice-Boltzmann simulations", Europhys. Lett., 65 (6), 795-801 (2004); "…the most important contribution to computational condensed matter science in 2003…"

SC Global Demonstration
[Figure: Access Grid session as seen at BT.]

TRICEPS Grid
[Network diagram: UK sites (Manchester, Daresbury, UCL) connected over MB-NG and the SJ4 production network, with BT-provisioned 2 x 1 Gbps links, to US sites (ANL, PSC, NCSA, SDSC, Caltech, Phoenix) via Starlight (Chicago) and Netherlight (Amsterdam) at 10 Gbps; the legend distinguishes visualization and computation resources, Access Grid nodes, network PoPs, service registries and dual-homed systems.]

Hardware Infrastructure
Computation (using more than 6000 processors), including:
• HPCx (Daresbury): 1280-processor IBM Power4 Regatta, 6.6 Tflops peak, 1.024 TB memory
• Lemieux (PSC): 3000-processor HP/Compaq, 3 TB memory, 6 Tflops peak
• TeraGrid Itanium2 cluster (NCSA): 256 processors, 1.3 Tflops peak
• TeraGrid Itanium2 cluster (SDSC): 256 processors, 1.3 Tflops peak
• Green (CSAR): SGI Origin 3800, 512 processors, 0.512 TB memory (shared)
• Newton (CSAR): SGI Altix 3700, 256 Itanium2 processors, 384 GB memory (shared)
Visualization:
• Bezier (Manchester): SGI Onyx 300, 6 x IR3, 32 processors
• Dirac (UCL): SGI Onyx2, 2 x IR3, 16 processors
• SGI loan machine (Phoenix): SGI Onyx, 1 x IR4 and 1 x IR3, commissioned on site
• TeraGrid Visualization Cluster (ANL): Intel Xeon
• SGI Onyx (NCSA)
Service Registry:
• Frik (Manchester): Sony PlayStation2
Storage:
• 20 TB of science data generated in the project; Atlas Petabyte Storage System (RAL)
Access Grid nodes at Boston University, UCL, Manchester, Martlesham and Phoenix (4).

Steering in the OGSA
[Architecture diagram: simulation and visualization components each carry the steering library and bind to Steering Grid Services; a steering client finds them via a registry and connects; data transfer between simulation and visualization uses Globus-IO; output is published to displays. Simulation components start independently and attach/detach dynamically – a toy sketch of the simulation-side loop follows the bullets below.]
• Computations run at HPCx, CSAR, SDSC, PSC and NCSA.
• Visualizations run at Manchester, UCL, Argonne, NCSA and Phoenix.
• Scientists steer calculations from UCL, BT and Boston, collaborating via Access Grid.
• Visualizations are viewed remotely.
• Grid services run anywhere.
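To make the attach/steer/detach pattern concrete, here is a minimal, hypothetical sketch of the simulation side of a steering loop: every few timesteps the code publishes its monitored values and polls for parameter changes or commands (checkpoint, stop). The SteeringChannel class and its methods are invented for illustration; the real RealityGrid steering library has its own API and is called from the Fortran/C codes directly.

```python
# Hypothetical steering loop, illustrating the attach/steer/detach pattern only.
# SteeringChannel is an invented stand-in for a real steering library/service.
import json, pathlib

class SteeringChannel:
    """Toy 'channel': a JSON file a remote client could edit while the job runs."""
    def __init__(self, path="steer.json"):
        self.path = pathlib.Path(path)

    def poll(self):
        # Return any pending parameter changes/commands; empty dict if none.
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def publish(self, monitored):
        # Expose monitored quantities so an attached client can watch the run.
        self.path.with_suffix(".status").write_text(json.dumps(monitored))

params = {"coupling": 0.1, "output_every": 100, "command": "run"}
channel = SteeringChannel()

for step in range(1, 100001):
    # ... advance the lattice-Boltzmann (or other) solver by one timestep ...
    if step % 50 == 0:                       # steering check interval
        params.update(channel.poll())        # apply any steered changes
        channel.publish({"step": step, "coupling": params["coupling"]})
        if params["command"] == "checkpoint":
            pass                             # write a restart file here
        elif params["command"] == "stop":
            break
    if step % params["output_every"] == 0:
        pass                                 # dump data at the (steerable) frequency
```

The point of the design is that the simulation never blocks on the steering side: if no client is attached, poll() simply returns nothing and the run proceeds unperturbed.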
Software Infrastructure
• Globus Toolkit, versions 2.2.3, 2.2.4, 2.4.3 and 3.1 – we use GRAM (job management), GridFTP (migration of checkpoint files) and Globus-IO (inter-component communication)
• 6 Certificate Authorities involved
• OGSI::Lite – a Perl implementation of OGSI, used for the Steering Grid Services, Registry and Checkpoint Metadata Tree
• RealityGrid steering library and toolkit
• Visualization – VTK 4.0 plus patches, SGI OpenGL VizServer, Chromium, VIC, ...
• Malleable checkpoint/restart – HDF5, XDR, RealityGrid performance control software
• Port-forwarding software from EPCC

TeraGyroid Summary
• Distributed, collaborative, bleeding-edge technology.
• Applied to deliver new results in condensed matter science:
– the gyroid mesophase of amphiphilic liquid crystals
– unprecedented spatial and temporal scales enable us to investigate phenomena which have hitherto been totally out of reach

Hybrid Modelling Overview
• Objective:
– To construct a computational scheme to couple two descriptions of matter with very different characteristic length and time scales.
• Applications: dynamical processes near interfaces governed by the interplay between micro- and macro-dynamics.
– Examples: complex fluids near surfaces (polymers, colloids, surfactants, etc.), lipid membranes, complex boundary conditions in nano-channels (fluidics), wetting, crystal growth from the vapour phase, melting, critical fluids near surfaces, etc.
• Key to integrative/systems modelling approaches.
• The Grid is the natural architecture for such modelling.

Oscillatory-wall shear flow
Application: rheology in nano-slots, complex boundary conditions.
[Set-up diagram: one wall oscillates with velocity u_wall(t) = u_max sin(2πωt) while the opposite solid wall is fixed; the resulting velocity profile u = u(x, t) decays away from the moving wall over a penetration length δ_u = (2πν/ω)^(1/2).]

General coupling framework aims
• Portability of models via conformance rules
– Support exchange between developers
– Promote ease of composition
– Ability to export models to other frameworks
– Ability to import models into a composition
• Flexibility in deployment
– Neutral to the actual communication mechanism
– Flexible mapping of models to machines
http://www.cs.man.ac.uk/cnc/projects/gcf

Examples of deployment: coupling an MD code with a CFD code
• Single machine: MD and CFD run under a sequential control code, communicating through a shared buffer (see the sketch below).
• Multiple machines: MD and CFD are deployed concurrently as two executables, communicating via distributed MPI, Web services, GT3, etc.
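The single-machine deployment can be pictured as a sequential control loop that alternates the two solvers and exchanges interface data through a shared buffer. The sketch below uses toy placeholder solvers and an invented buffer layout; it illustrates only the control flow of such a coupling, not the General Coupling Framework API.

```python
# Toy single-machine coupling driver: alternate an "MD" and a "CFD" step and
# swap interface data through a shared buffer. Both solvers are placeholders.
import numpy as np

class SharedBuffer:
    """Holds the quantities each model exposes to the other at the interface."""
    def __init__(self, n_interface_cells):
        self.md_to_cfd = np.zeros(n_interface_cells)   # e.g. stresses measured by MD
        self.cfd_to_md = np.zeros(n_interface_cells)   # e.g. velocities predicted by CFD

def md_step(boundary_velocity):
    # Placeholder: advance particles with the imposed boundary velocity,
    # return the measured interface quantity (here: a fake, noisy stress).
    return 0.5 * boundary_velocity + np.random.normal(0.0, 0.01, boundary_velocity.size)

def cfd_step(boundary_stress):
    # Placeholder: advance the continuum solver with the imposed stress,
    # return the velocity it predicts at the interface.
    return 0.9 * boundary_stress

buf = SharedBuffer(n_interface_cells=16)
for it in range(1000):                      # coupling iterations
    buf.md_to_cfd = md_step(buf.cfd_to_md)  # MD uses CFD's last interface state
    buf.cfd_to_md = cfd_step(buf.md_to_cfd) # CFD uses the fresh MD interface state
# In the multi-machine deployment the same exchange would travel over MPI,
# Web services or GT3 instead of an in-memory buffer.
```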
Loughborough University: PDA Thin Steering Client for Grid Applications
Wired/wireless connection and ADSL. Point-and-click interaction; handwriting recognition via the PDA OS. Frees the application scientist from the desktop. The registry screen provides access to multiple registries of simulations. The jobs screen allows access to any job on a grid registry. The parameter screen allows the simulation to be steered via the wireless network.

Loughborough Mathematical Sciences
• Terascaled Loughborough's MD and ab initio codes.
• Grid-based real-time visualization and computational steering developed for the MD code.
• Visualization suite developed for the MD code.
• Nanoindentation simulations performed, illustrating the link between dislocation emission and "pop-ins".
• Used routinely in research work.

Edinburgh Physics & EPCC
• Development of lattice-Boltzmann codes
– Colloidal particles in binary fluids
– Lattice-Boltzmann code with thermal fluctuations
• Terascaling
– LB3D, Plato, Ludwig (with HPCx)
• Portal
– Web-based RealityGrid steering (see demo)
– Checkpoint management (under design)
• Visualisation / animation / ReG integration
[Summary diagram: LB algorithmic developments (colloids in binary fluids, LB with thermal fluctuations); terascaling of ReG applications (LB3D, Plato, Ludwig); web-based steering portal; visualisation/integration (integration of Ludwig in the ReG environment, visualisations and animations). See live demo!]

ICENI: An Integrated Grid Middleware to support e-Science
• Point-and-click install through Java Web Start to build the infrastructure and populate it with resources.
• Visual composition to define and configure an abstract simulation for execution.
• The best LB3D instance (at IC, UCL or Manchester) is selected from the Grid by the ICENI services (scheduler and performance database), considering predicted performance and resource capability (a toy selection sketch follows below).
• Later, an extended workflow is built around the running application to dynamically steer it, view data (streamed to the Access Grid and visualisation clients), and interact with collaborators.
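As a cartoon of the kind of decision an ICENI-style scheduler makes, the snippet below picks the "best" instance from a small table of candidate resources by combining a capability check with a predicted run time. The resource names, numbers and scoring rule are all invented for illustration; they are not ICENI's actual data model or algorithm.

```python
# Toy resource selection: filter candidates by capability, then pick the one
# with the lowest predicted run time. All values below are made up.
candidates = [
    {"name": "LB3D@IC",  "free_procs": 128, "mem_gb": 256, "predicted_hours": 6.5},
    {"name": "LB3D@UCL", "free_procs": 64,  "mem_gb": 128, "predicted_hours": 9.0},
    {"name": "LB3D@MC",  "free_procs": 256, "mem_gb": 512, "predicted_hours": 4.2},
]

def select_best(candidates, min_procs, min_mem_gb):
    capable = [c for c in candidates
               if c["free_procs"] >= min_procs and c["mem_gb"] >= min_mem_gb]
    if not capable:
        raise RuntimeError("no resource satisfies the job's requirements")
    return min(capable, key=lambda c: c["predicted_hours"])

best = select_best(candidates, min_procs=128, min_mem_gb=200)
print("launching on", best["name"])   # -> launching on LB3D@MC
```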
Summary
• We've tried to use the Grid for serious scientific work.
• While we've had some notable successes, we've been held back in many ways by the state of existing middleware
– Globus obstructs access to and use of resources (cf. ssh, etc.)
• This has been discouraging to people who could otherwise make very good use of Grid technologies.
• A simpler, more lightweight system would be more useful to us.
J. Chin and P. V. Coveney, "Towards tractable toolkits for the Grid: a plea for lightweight, usable middleware", UK e-Science Technical Report UKeS-2004-01.

Our requirements
• Computer-centric: simulations range from workstation-class codes to high-end HPC systems.
• Data-centric: we must keep track of, and mine, terabytes of geographically distributed simulation and experimental data.
• Community-centric: building cutting-edge visualization and collaboration systems.

Middleware sysadmin problems
• Extremely tedious installation procedure.
• Uses customized versions of open-source libraries such as OpenSSL, instead of detecting and using installed versions.
• Uses its own package manager (GPT), which can conflict with existing systems.
• Too many prerequisites: relational database, servlet container, ...

Middleware user problems
• The Level 2 Grid just about offers a job submission service, but far from transparently: it requires significant customization on each machine.
• Very poorly documented.
• Little to no support for scientists with existing Fortran/C/C++ codes who wish to deploy them across an often Java-oriented Grid.
• Significant functionality in the OGSI spec hasn't been used by scientists.
• Others have made similar comments (e.g. the Grid2003 Production Grid):
– significant job failure rates due to the (Globus) middleware itself
– "We spent inordinate amounts of time on configuration issues and fixing the same or similar problems" over and over again

Experience with a lightweight toolkit
• OGSI::Lite, by Mark McKeown of Manchester University.
• A lightweight OGSI implementation: uses existing software (e.g. for SSL) where possible.
• Many of the book-keeping Grid services were implemented using it.
• Grid-based job submission and steering were retrofitted onto the LB2D simulation code within a week.
• Standards compliance: we were able to steer simulations from a web browser, with no custom client software needed (a minimal illustration appears below).

A possible way forward?
• Better collaboration with application scientists, please.
• Documentation: HOWTOs detailing how to take an existing code and Grid-enable it.
• A useful toolkit would allow a user to install only the components (job submission, resource discovery and allocation, RPC, …) which are actually required.
• We hope the UK e-Science programme will fund work on lightweight toolkits: it needs to be "bottom up" rather than "top down".
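To make the "lightweight" argument concrete: steering a running simulation from a web browser needs nothing more than plain HTTP. The toy sketch below exposes a couple of steerable parameters via Python's standard-library HTTP server, so any browser (or curl) can read and change them while a calculation runs in a background thread. It illustrates the idea only; it is neither OGSI::Lite nor the RealityGrid steering service, and it omits all security, registry and standards machinery.

```python
# Toy browser-steerable "simulation": GET / shows the current state,
# GET /set?output_every=500 changes a parameter while the loop keeps running.
import threading, time, json
from http.server import HTTPServer, BaseHTTPRequestHandler
from urllib.parse import urlparse, parse_qs

state = {"step": 0, "coupling": 0.1, "output_every": 100}

def simulate():
    while True:                      # stand-in for the real timestep loop
        state["step"] += 1
        time.sleep(0.01)

class SteerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        url = urlparse(self.path)
        if url.path == "/set":
            for key, values in parse_qs(url.query).items():
                if key in ("coupling", "output_every"):
                    state[key] = type(state[key])(values[0])  # keep original type
        body = json.dumps(state).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

threading.Thread(target=simulate, daemon=True).start()
HTTPServer(("localhost", 8080), SteerHandler).serve_forever()
```

Pointing a browser at http://localhost:8080/set?output_every=500 changes the dump frequency on the fly; a production version would add authentication, a registry and a standard service interface on top of the same simple pattern.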