Performance Considerations within
RealityGrid
Andrew Porter
22nd June 2005
Grid Performance Workshop
NeSC
Combining the strengths of UMIST and
The Victoria University of Manchester
Overview
Applications in RealityGrid
Performance criteria
Hardware resource monitoring
Failure management
Managing performance
What tool would we like…
Summary
(c) 2004 The University of Manchester
Applications in RealityGrid
RealityGrid: employ Grid technology to aid computational studies of condensed matter
Simulations of:
 Oil/water/surfactant mixtures
 Macromolecules (proteins)
 Semiconductor surfaces
Hybrid (coupled) models:
 Fluid flow (atomic scale + continuum)
 Tip interaction with surface (atomic scale + continuum)
Steering + online visualization
 Give the human some control back
[Diagram: the user launches jobs and steers them on HPC engines. Labels: HPC engine (x2); checkpoint files; data for visualization; visualization engine; compressed video.]
Steering using Grid services layer
[Diagram: GS middle tier. Simulation, Visualization and Client components each use the steering library; Steering GS instances and a Registry sit between them. Arrow labels: publish, find, connect, bind, data transfer (sockets/files). Display outputs are attached to the visualization.]
Components start independently & attach/detach dynamically
Multiple clients: Qt/C++, .NET on PocketPC, GridSphere Portlet (Java)
Remote visualization through SGI VizServer, Chromium, and/or streamed to Access Grid
Performance criteria

Will my job run?
 Sufficient memory
 Sufficient number of processors
 Sufficient max. wall clock

How soon will my job start executing?
 Queue length
 Capacity of machine
 Amount of checkpoint data
 Network connectivity between checkpoint host and compute machine
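The first two questions above reduce to simple checks and arithmetic. The following is a minimal sketch, not part of RealityGrid: the `Queue` and `Job` classes, their field names, and the bandwidth figure are all assumptions made for illustration. It tests whether a job fits a queue's limits and estimates how long staging the checkpoint data would take.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    """Limits of one batch queue (all fields are illustrative assumptions)."""
    max_memory_gb: float
    max_processors: int
    max_wallclock_hours: float


@dataclass
class Job:
    """Resources requested by one job (illustrative assumptions)."""
    memory_gb: float
    processors: int
    wallclock_hours: float
    checkpoint_gb: float = 0.0


def will_run(job: Job, queue: Queue) -> bool:
    """True if the job fits within every limit of the queue."""
    return (job.memory_gb <= queue.max_memory_gb
            and job.processors <= queue.max_processors
            and job.wallclock_hours <= queue.max_wallclock_hours)


def staging_seconds(job: Job, bandwidth_mbit_per_s: float) -> float:
    """Time to copy the checkpoint to the compute machine, ignoring latency."""
    megabits = job.checkpoint_gb * 8 * 1000  # 1 GB taken as 8000 megabits
    return megabits / bandwidth_mbit_per_s
```

Queue waiting time is not modelled here; as the later slides note, in practice it comes from the user's 'feel' for each machine.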
Criteria continued…
How long will my job take?
 No. of processors available
 Type of processors/interconnect
Can I steer the job?
 Improve performance by allowing human to monitor job
 Can decide to migrate it…
 Network connectivity
 Will job run at a convenient time (scheduling)
Will on-line visualization be possible?
 Again, allow for more detailed monitoring
 Network connectivity
 Location of visualization engine
Resource monitoring
Many performance criteria depend upon:
 The application being run
 The configuration of queuing systems
 A ‘feel’ for how quickly a job gets through the queues
 How important steering/visualization/migration is
However, we have few usable resources
 Require HPC resources
 Require that application has been installed
User maintains list of resources (including queue information)
It is up to the user to check on resource status…
Monitoring continued…

Manual monitoring of resources:
 gsissh + qstat/bjobs etc.
 Web interfaces, e.g.:
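The manual gsissh check above could be scripted along the following lines. This is a sketch under stated assumptions: the host names are placeholders, the resource list stands in for the one the user maintains by hand, and the raw queue listing is returned unparsed because qstat and bjobs output formats differ between sites. gsissh is invoked in the same "host command" form as ssh.

```python
import subprocess
from typing import Callable, Dict, List

# Hypothetical resource list: host name -> queue-status command on that host.
RESOURCES: Dict[str, str] = {
    "hpc1.example.org": "qstat",   # PBS-style system
    "hpc2.example.org": "bjobs",   # LSF-style system
}


def build_command(host: str, status_cmd: str) -> List[str]:
    """gsissh takes the same 'host command' form as ssh."""
    return ["gsissh", host, status_cmd]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout, or an error marker on failure."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"UNREACHABLE: {exc}"
    if done.returncode != 0:
        return f"ERROR: {done.stderr.strip()}"
    return done.stdout


def poll(resources: Dict[str, str],
         runner: Callable[[List[str]], str] = run) -> Dict[str, str]:
    """Collect the raw queue listing from every resource in the list."""
    return {host: runner(build_command(host, cmd))
            for host, cmd in resources.items()}
```

Calling `poll(RESOURCES)` needs a valid Grid proxy and gsissh on the path; the `runner` argument exists so the logic can be exercised without either.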
Failure management
We have no sophisticated tools for managing failure
User simply repeats, hopefully after establishing reason for failure
Possible modes of failure:
 Transfer of job files (input or checkpoint data)
 Job submission (to queuing system)
    User sees Globus-related error messages
    Debugging can be difficult and requires expertise
 Job execution
    Observed by steering client or, potentially, from Globus job status
    User must examine stdout/err
Failure management continued…
 Steering communication
    Steering client unable to attach/fails to get status messages
    Normally indicates job has failed
    Can query the Grid service to gain more information
 Data transfer to visualization
    Often indicates networking problem (e.g. firewall, lack of network connectivity)
    Again, requires job stdout/err to debug
    User can attempt visualization elsewhere
 Display of visualization output
    X server not listening on port
    Firewall
    Incorrect configuration of VizServer…
Managing performance
[Diagram: Simulation, Client and Visualization components, each with a steering library. Labels: steering messages (ASCII); data transfer (binary); render.]
Performance management continued…
Most important/expensive component is the simulation
 User must capitalise on the (considerable) computational resource being used
 Use standard HPC tools to ensure code is performing well on chosen number of processors (scaling)
 Again, stdout often contains simulation-generated timing information
 Simulation migration
Impact of other components must be minimised
 Computational steering
 On-line visualization
 File transfer
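The scaling check mentioned above can be done with the timing information a simulation writes to stdout. The sketch below computes speedup and parallel efficiency relative to a reference run; the function names and the timings in the usage lines are invented for illustration, not measurements from any RealityGrid code.

```python
def speedup(t_ref: float, t_p: float) -> float:
    """How many times faster a step runs than in the reference run."""
    return t_ref / t_p


def parallel_efficiency(p_ref: int, t_ref: float, p: int, t_p: float) -> float:
    """Fraction of ideal scaling achieved when going from p_ref to p processors.

    1.0 means time per step fell in exact proportion to the processor count.
    """
    return (t_ref * p_ref) / (t_p * p)


# Hypothetical seconds per time step at each processor count.
timings = {16: 8.0, 32: 4.4, 64: 2.5}
for procs, seconds in timings.items():
    eff = parallel_efficiency(16, timings[16], procs, seconds)
    print(f"{procs:4d} processors: efficiency {eff:.2f}")
```

A falling efficiency suggests the chosen processor count is too high for the problem size, which is one input to the choice of machine.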
Performance management - migration
Can be triggered automatically (work by CNC, Manchester)
Or manually by the user using RealityGrid launching client:
 User chooses job to migrate
 Client issues “checkpoint” and “stop” instructions using steering interface
 Job takes checkpoint, registers it and stops
 User chooses new resource
 Checkpoint data copied over
 Job restarted on new resource
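The manual migration sequence above can be written down as a short procedure. The sketch below is not the RealityGrid launching client: `SteerableJob`, `Resource` and their methods are hypothetical interfaces introduced only to show the order of operations.

```python
from typing import Protocol


class SteerableJob(Protocol):
    """Hypothetical handle on a running, steerable job."""

    def checkpoint(self) -> str:
        """Take and register a checkpoint; return its identifier."""
        ...

    def stop(self) -> None:
        ...


class Resource(Protocol):
    """Hypothetical handle on a compute resource chosen by the user."""

    def copy_checkpoint(self, checkpoint_id: str) -> None:
        ...

    def restart(self, checkpoint_id: str) -> None:
        ...


def migrate(job: SteerableJob, target: Resource) -> str:
    """Checkpoint and stop the job, then restart it on the target resource."""
    checkpoint_id = job.checkpoint()        # job takes a checkpoint and registers it
    job.stop()                              # job stops on the old resource
    target.copy_checkpoint(checkpoint_id)   # checkpoint data copied over
    target.restart(checkpoint_id)           # job restarted on the new resource
    return checkpoint_id
```

Each step is a possible point of failure, so a real client would check the outcome of every call before moving on.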
Performance management - steering

Steering allows some performance management
 Check that job is running as required
 e.g. Monitoring of CPU time/step
 Check on progress of calculation

Steering also has performance implications
 Is a code serialization point
 Delay for SOAP exchange - location of Grid Service infrastructure
 Tune the frequency of message generation
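One way to tune the frequency of message generation is to enter the steering exchange only every N time steps. The sketch below assumes a generic main loop; `advance`, `steering_exchange` and `steer_interval` are illustrative names, not calls from the RealityGrid steering library.

```python
from typing import Callable


def run_simulation(n_steps: int,
                   steer_interval: int,
                   advance: Callable[[int], None],
                   steering_exchange: Callable[[int], None]) -> None:
    """Advance the simulation, pausing for steering every steer_interval steps."""
    if steer_interval < 1:
        raise ValueError("steer_interval must be at least 1")
    for step in range(1, n_steps + 1):
        advance(step)
        if step % steer_interval == 0:
            # Serialization point: the whole job waits here while
            # steering messages are exchanged.
            steering_exchange(step)
```

A larger interval reduces the cost of the serialization point and the SOAP delay, at the price of slower response to steering commands.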
Performance management - visualization
Visualization allows user to check validity of results
At a cost
 Collection of data from multiple nodes of parallel job
 Transfer of data to visualization
Only collect data when a visualization is present
 Don’t do it every time step
Ensure visualization does not become bottleneck
 Throttle data output: only attempt to send more data when previous acknowledged
 Some of our vis. software reports the bandwidth it sees
 Currently limited to single tcp/ip socket
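The throttling rule above (send more data only once the previous set has been acknowledged) can be sketched as follows. The class and method names are hypothetical, and dropping data sets that arrive while an acknowledgement is pending is an assumption made here; the slides do not say what happens to them.

```python
class ThrottledSender:
    """Send a new data set only when the previous one has been acknowledged."""

    def __init__(self, send):
        self._send = send            # callable that ships one data set
        self._awaiting_ack = False
        self.skipped = 0             # data sets dropped while waiting

    def offer(self, data) -> bool:
        """Try to send data; return False and drop it if an ack is pending."""
        if self._awaiting_ack:
            self.skipped += 1
            return False
        self._send(data)
        self._awaiting_ack = True
        return True

    def acknowledge(self) -> None:
        """Call when the visualization confirms receipt of the last data set."""
        self._awaiting_ack = False
```

With this pattern the simulation never blocks on a slow visualization; it simply emits fewer frames.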
Performance management - file transfer
Only an issue at the beginning of a job/job migration
 Copy job input files
 Potentially copy checkpoint data too
Use GridFTP
Have to tune flags for maximum performance
 e.g. number of parallel streams to use
 For TeraGyroid we encapsulated optimum flags in a script
 Transfer statistics logged to file
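Encapsulating tuned GridFTP flags in a script might look like the sketch below. It builds a globus-url-copy command line using the `-p` (parallel streams), `-tcp-bs` (TCP buffer size) and `-vb` (performance reporting) options. The default values and the URLs in the usage line are placeholders, not the flags used for TeraGyroid, which the slides do not give.

```python
from typing import List


def gridftp_command(src_url: str, dst_url: str,
                    parallel_streams: int = 4,
                    tcp_buffer_bytes: int = 1048576) -> List[str]:
    """Build a globus-url-copy command line with tuned transfer flags."""
    return [
        "globus-url-copy",
        "-vb",                              # report transfer performance
        "-p", str(parallel_streams),        # number of parallel data streams
        "-tcp-bs", str(tcp_buffer_bytes),   # TCP buffer size in bytes
        src_url,
        dst_url,
    ]


# Placeholder endpoints; the best stream count depends on the network path.
print(" ".join(gridftp_command("gsiftp://src.example.org/data/ckpt",
                               "gsiftp://dst.example.org/scratch/ckpt",
                               parallel_streams=8)))
```

Redirecting the output of the resulting command to a file would give the transfer log mentioned above.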
Grid ‘top’ or ‘GridStat’
Provide a quick, graphical snapshot of the load/queue length etc. on a number of machines
 Ideally, indication of accessibility (e.g. globus-job-run working?)
 Ability to drill down to get detail on specific machine
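As a rough illustration of what such a tool would report, the sketch below renders a plain-text summary with one line per machine. No such tool exists in the source; the `MachineStatus` fields and the sort order are assumptions, and a text table stands in for the graphical display the slide asks for.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MachineStatus:
    """Snapshot of one machine (all fields are illustrative assumptions)."""
    name: str
    queued_jobs: int
    load_percent: float
    accessible: bool   # e.g. whether a trivial test job succeeded


def snapshot(machines: List[MachineStatus]) -> str:
    """One line per machine: accessible machines first, shortest queue first."""
    ordered = sorted(machines, key=lambda m: (not m.accessible, m.queued_jobs))
    lines = [f"{'machine':<12}{'queued':>8}{'load %':>8}  status"]
    for m in ordered:
        status = "ok" if m.accessible else "UNREACHABLE"
        lines.append(
            f"{m.name:<12}{m.queued_jobs:>8}{m.load_percent:>8.0f}  {status}")
    return "\n".join(lines)
```

Drilling down to a single machine would then amount to fetching the full queue listing for the selected row.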
Summary
RealityGrid’s focus is High Performance Computing
The performance of the simulation is the rate-determining element
Important to choose the right machine for the job
 Although can migrate subsequently
Minimise impact of other components
Steering is an important way of monitoring and giving the user manual control (e.g. to migrate job)
We have work to do
 Especially on failure management
Acknowledgements
Academic
 University College London
 Queen Mary, University of London
 Imperial College
 University of Manchester
 University of Edinburgh
 University of Oxford
 University of Loughborough
Industrial
 Schlumberger
 Edward Jenner Institute for Vaccine Research
 Silicon Graphics Inc
 Computation for Science Consortium
 Advanced Visual Systems
 Fujitsu
 BT Exact
How to Contact Us
http://www.realitygrid.org
http://www.sve.man.ac.uk/Research/AtoZ/RealityGrid/
andrew.porter@manchester.ac.uk