Performance Considerations within RealityGrid

Andrew Porter
22nd June 2005
Grid Performance Workshop, NeSC
Combining the strengths of UMIST and The Victoria University of Manchester
(c) 2004 The University of Manchester

Overview

- Applications in RealityGrid
- Performance criteria
- Hardware resource monitoring
- Failure management
- Managing performance
- What tool would we like…
- Summary

Applications in RealityGrid

- RealityGrid: employ Grid technology to aid computational studies of condensed matter
- Simulations of:
  - Oil/water/surfactant mixtures
  - Macromolecules (proteins)
  - Semiconductor surfaces
- Hybrid (coupled) models:
  - Fluid flow (atomic scale + continuum)
  - Tip interaction with surface (atomic scale + continuum)
- Steering + online visualization
  - Give the human some control back

[Figure: diagram with labels: HPC engine; User launches jobs and steers them; HPC engine; checkpoint files; data for visualization; compressed video; visualization engine]

Steering using Grid services layer

- Components start independently & attach/detach dynamically
- Multiple clients: Qt/C++, .NET on PocketPC, GridSphere Portlet (Java)
- Remote visualization through SGI VizServer, Chromium, and/or streamed to Access Grid

[Figure: architecture diagram with labels: GS middle tier; Simulation, Visualization and Client, each with a steering library; Steering GS (shown twice); Registry; Display; arrows labelled publish, find, bind, connect and data transfer (sockets/files)]
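The steering slide says that components start independently, publish themselves to a registry, and let clients find them and attach or detach dynamically. The sketch below illustrates only that pattern with an in-memory toy. It is not the RealityGrid steering library or its Grid-service interface: the `Registry` and `SteeringService` classes, their methods, and the client names are all hypothetical, chosen for this example.

```python
"""Toy publish/find/attach registry.

All class, method and client names are hypothetical; this is not the
RealityGrid steering API, only an illustration of the pattern.
"""
from dataclasses import dataclass, field


@dataclass
class SteeringService:
    """Stands in for one steering service bound to one component."""
    name: str
    clients: set = field(default_factory=set)

    def attach(self, client_id: str) -> None:
        self.clients.add(client_id)

    def detach(self, client_id: str) -> None:
        # discard() so that detaching twice is harmless
        self.clients.discard(client_id)


class Registry:
    """In-memory lookup table of published steering services."""

    def __init__(self) -> None:
        self._services = {}

    def publish(self, service: SteeringService) -> None:
        self._services[service.name] = service

    def find(self, name: str):
        """Return the named service, or None if it is not published."""
        return self._services.get(name)


registry = Registry()
# Components start independently and publish themselves when ready.
registry.publish(SteeringService("simulation"))
registry.publish(SteeringService("visualization"))

# A client that starts later finds a service and attaches to it.
sim = registry.find("simulation")
sim.attach("qt-client")
sim.attach("pda-client")   # several clients may be attached at once
sim.detach("qt-client")    # detaching does not affect the component
```

Because clients hold no reference to a component until they look it up, a component that has not yet started (or has gone away) simply is not found, which is what lets the pieces be launched in any order.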
Performance criteria

- Will my job run?
  - Sufficient memory
  - Sufficient number of processors
  - Sufficient max. wall clock
- How soon will my job start executing?
  - Queue length
  - Capacity of machine
  - Amount of checkpoint data
  - Network connectivity between checkpoint host and compute machine

Criteria continued…

- How long will my job take?
  - No. of processors available
  - Type of processors/interconnect
- Can I steer the job?
  - Improve performance by allowing human to monitor job
  - Can decide to migrate it…
  - Network connectivity
  - Will job run at a convenient time (scheduling)
- Will on-line visualization be possible?
  - Again, allow for more detailed monitoring
  - Network connectivity
  - Location of visualization engine

Resource monitoring

- Many performance criteria depend upon:
  - The application being run
  - The configuration of queuing systems
  - A ‘feel’ for how quickly a job gets through the queues
  - How important steering/visualization/migration is
- However, have few useable resources
  - Require HPC resources
  - Require that application has been installed
- User maintains list of resources (including queue information)
- Is up to user to check on resource status…

Monitoring continued…

- Manual monitoring of resources:
  - gsissh + qstat/bjobs etc.
  - Web interfaces, e.g.:

Failure management

- We have no sophisticated tools for managing failure
  - User simply repeats, hopefully after establishing reason for failure
- Possible modes of failure:
  - Transfer of job files (input or checkpoint data)
  - Job submission (to queuing system)
    - User sees globus-related error messages
    - Debugging can be difficult and requires expertise
  - Job execution
    - Observed by steering client or, potentially, from globus job status
    - User must examine stdout/err

Failure management continued…

- Steering communication
  - Steering client unable to attach/fails to get status messages
  - Normally indicates job has failed
  - Can query the Grid service to gain more information
- Data transfer to visualization
  - Often indicates networking problem (e.g. firewall, lack of network connectivity)
  - Again, requires job stdout/err to debug
  - User can attempt visualization elsewhere
- Display of visualization output
  - X server not listening on port
  - Firewall
  - Incorrect configuration of VizServer…

Managing performance

[Figure: diagram with labels: Simulation, Client and Visualization, each with a steering library; Render; steering messages (ASCII); data transfer (binary)]

Performance management continued…

- Most important/expensive component is the simulation
  - User must capitalise on the (considerable) computational resource being used
  - Use standard HPC tools to ensure code is performing well on chosen number of processors (scaling)
  - Again, stdout often contains simulation-generated timing information
  - Simulation migration
- Impact of other components must be minimised
  - Computational steering
  - On-line visualization
  - File transfer

Performance management - migration

- Can be triggered automatically (work by CNC, Manchester)
- Or manually by the user using RealityGrid launching client:
  - User chooses job to migrate
  - Client issues “checkpoint” and “stop” instructions using steering interface
  - Job takes checkpoint, registers it and stops
  - User chooses new resource
  - Checkpoint data copied over
  - Job restarted on new resource

Performance management - steering

- Steering allows some performance management
  - Check that job is running as required, e.g. monitoring of CPU time/step
  - Check on progress of calculation
- Steering also has performance implications
  - Is a code serialization point
  - Delay for SOAP exchange - location of Grid Service infrastructure
  - Tune the frequency of message generation

Performance management - visualization

- Visualization allows user to check validity of results
- At a cost
  - Collection of data from multiple nodes of parallel job
  - Transfer of data to visualization
- Only collect data when a visualization is present
  - Don’t do it every time step
- Ensure visualization does not become bottleneck
  - Throttle data output: only attempt to send more data when previous acknowledged
  - Some of our vis. software reports the bandwidth it sees
  - Currently limited to single TCP/IP socket

Performance management - file transfer

- Only an issue at the beginning of a job/job migration
  - Copy job input files
  - Potentially copy checkpoint data too
- Use GridFTP
  - Have to tune flags for maximum performance, e.g. number of parallel streams to use
  - For TeraGyroid we encapsulated optimum flags in a script
  - Transfer statistics logged to file

Grid ‘top’ or ‘GridStat’

- Provide a quick, graphical snapshot of the load/queue length etc. on a number of machines
- Ideally, indication of accessibility (e.g. globus-job-run working?)
- Ability to drill down to get detail on specific machine

Summary

- RealityGrid’s focus is High Performance Computing
- The performance of the simulation is the rate-determining element
  - Important to choose the right machine for the job
  - Although can migrate subsequently
  - Minimise impact of other components
- Steering an important way of monitoring and giving user manual control (e.g. to migrate job)
- We have work to do
  - Especially on failure management

Acknowledgements

Academic
- University College London
- Queen Mary, University of London
- Imperial College
- University of Manchester
- University of Edinburgh
- University of Oxford
- University of Loughborough

Industrial
- Schlumberger
- Edward Jenner Institute for Vaccine Research
- Silicon Graphics Inc
- Computation for Science Consortium
- Advanced Visual Systems
- Fujitsu
- BT Exact

How to Contact Us

- http://www.realitygrid.org
- http://www.sve.man.ac.uk/Research/AtoZ/RealityGrid/
- andrew.porter@manchester.ac.uk
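A note on the visualization slide's advice to throttle data output, i.e. to attempt to send more data only when the previous send has been acknowledged. The sketch below is one possible reading of that rule, not code from RealityGrid: the `ThrottledSender` class and its methods are hypothetical, and the choice to skip a frame (rather than block the simulation) while an acknowledgement is outstanding is an assumption made for this example.

```python
"""Acknowledgement-gated sender.

Hypothetical sketch, not RealityGrid code. Assumption: a frame offered
while the previous one is still unacknowledged is skipped, so the
simulation never waits on the visualization.
"""


class ThrottledSender:
    def __init__(self, send):
        self._send = send            # callable that ships one frame
        self._awaiting_ack = False
        self.sent = 0
        self.skipped = 0

    def offer(self, frame) -> bool:
        """Send `frame` only if the previous one was acknowledged."""
        if self._awaiting_ack:
            self.skipped += 1
            return False
        self._send(frame)
        self._awaiting_ack = True
        self.sent += 1
        return True

    def acknowledge(self) -> None:
        """Called when the receiver confirms the last frame."""
        self._awaiting_ack = False


wire = []                            # stands in for the socket
sender = ThrottledSender(wire.append)
for step in range(6):
    sender.offer(f"frame-{step}")
    if step % 3 == 2:                # pretend an ack arrives every third step
        sender.acknowledge()
```

With a slow receiver the simulation keeps running at full speed and the visualization simply sees fewer frames, which keeps it from becoming the bottleneck.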