Building the e-Science Grid in the UK: Providing a software toolkit
to enable operational monitoring and Grid integration
David Baker1, Simon Cox2, Jon Hillier3, Mark McKeown4 and Ron Fowler5
1 University of Southampton Information Systems Services, Email: D.J.Baker@soton.ac.uk
2 University of Southampton Regional e-Science Centre
3 Formerly of University of Oxford
4 University of Manchester
5 CCLRC Rutherford Appleton Laboratory
Abstract
We have, over the past eighteen months, been developing a software toolkit to enable the sites
participating in the UK e-Science Grid to carry out operational monitoring of Grid resources and to
facilitate the integration of the Grid. In this paper we present and discuss the design and
implementation of the software tools throughout the development of the UK e-Science Grid, and
comment on our experience of using these tools to do integration testing and operational monitoring.
Finally we highlight current problems and suggest potential improvements to the tools, and additionally
make suggestions for ways in which the project could be developed and diversified in the future.
1. Background
The Grid Engineering Task Force (GETF) [1] was formed in October 2001 to guide the
construction, testing and demonstration of a prototype (Level 1) e-Science Grid in the UK.
It contains members from each of the 10 regional e-Science Centres. The GETF initially
operated through working groups which put the key elements of the Grid infrastructure in
place.
The integration of the developing software/hardware infrastructure is a vital aspect of
building an e-Science Grid. To address this issue, UK e-Science participants at
Southampton University (Simon Cox and David Baker) set up the Grid Integration
Working Group (GIWG) [2], under the coordination of David Baker, with the
following objectives:
• Enable all regional e-Science centres to successfully deploy the Globus Toolkit (GT) [3]
on their local resources.
• Enable all regional e-Science centres to successfully contribute Grid-enabled resources
to the UK wide Grid infrastructure.
• Ensure that the UK Grid is a useful and usable resource, so that users collaborating on
e-Science projects will have access to the hardware and software resources to enable
them to successfully carry out their research.
With these objectives in mind, the activities of this group were as follows:
• Define a set of point to point “integration tests” appropriate to the Level 1 Grid. For
example it was important to ensure that simple jobs (e.g. Hello World) could be submitted
from one site to another, and to ensure that files could be transferred “across” the Grid.
• Develop the software to enable the integration tests to be performed from each site on
the Grid. In this respect the Grid Integration Test Script (GITS) [4] was written.
• Facilitate the integration of the Grid by encouraging sites to carry out the point to point
tests to other sites on the Grid, and to publish their results on the GETF web site. This
activity, of course, involved providing some basic troubleshooting advice to sites
experiencing problems, and generally encouraging sites to become active members of the
developing UK wide Grid.
During the second phase of the development of
the UK Grid, the Level 2 Grid (L2G), the
GIWG continued its activities (“recast” as the
Operational Monitoring Work Package).
During this period we concentrated on
developing, improving and supplementing the
prototype software with a view to providing a
robust and reliable toolkit to enable sites to
carry out daily integration testing and to enable
the Grid Support Centre (GSC) to monitor the
status of the UK Grid (also known as
“Operational Monitoring”).
The remainder of this paper is organised as
follows. Section 2 describes the key features,
and design of the GITS developed by Baker et
al. In Section 3 we outline the tests provided
by the GITS. In Section 4 we present some
experiences of carrying out operational
monitoring at the GSC (Fowler) using the data
provided by the GITS. In Section 5 we present
and describe some potential technical
improvements to the GITS. In Section 6 we
describe the GITS Web Service [5], a tool
developed by McKeown to enable users
to interpret and analyse the data provided by
the GITS. Finally in Section 7 we present and
discuss potential future challenges, and
opportunities in this area of work.
2. Development of the Grid Integration Test Script (GITS)
The automation of the integration tests was a high priority. As the number of machines in
the Grid grows, performing the tests from each site would become progressively more
time consuming and burdensome. In reality, without the GITS, sites would probably either
not carry out the tests or write their own scripts to automate the task.
The development of the GITS was begun during the deployment of the Level 1 Grid, and
for portability reasons perl was selected as the language of choice.
During the development of the L2G the focus at Southampton has been on improving the
GITS to ensure that it is a more robust and usable piece of software. The most important
features of the GITS are:
• A comprehensive set of integration/operation monitoring tests. The GITS now provides
a wide range of tests that are appropriate to the services supported on the UK Grid. This
includes point to point tests to check job submission and file transfers, and tests to check
the Monitoring and Discovery Service (MDS) functionality of hosts on the Grid. MDS is
the information services component of the GT [6]. It is envisaged, however, that the GITS
should be regarded as a “toolkit”. In this respect users are able to select the most
appropriate subset of tests for their needs. It is worth noting at this juncture that the
present version of the GITS is only fully compatible with version 2.x of the GT; however,
it may provide some functionality with GT version 1.
• Test execution is done using Proc::Reliable. Most of the tests (interoperability tests plus
GRIS test) are performed using the perl module Proc::Reliable [7]. This is a perl module
that ensures simple, reliable and configurable subprocess execution, and in particular
ensures that process timeouts can be handled reliably in the GITS. It is, however, worth
noting that while timeouts are handled reliably by the script, they occasionally leave
behind dead processes in their wake, and this issue does need to be addressed. This is
discussed in Section 5 of this paper.
• The GITS can output test results in html format. By default the GITS sends the test
results to the screen, and the generation of html output is optional. Users of the GITS are,
however, strongly recommended to use this feature since it provides them with an
excellent way of presenting their results, with easy access to diagnostic information (see
below).
• The GITS can output test results in XML format. This is another optional feature, and it
provides the user with a means of publishing their results in a format that is appropriate
for use with the GITS Web Service. At this juncture it is worth noting that the GITS Web
Service is an extremely useful “companion” to the GITS. It enables users to store the
results generated by the GITS. The use of historical data is invaluable in terms of
interpreting and analysing the information provided by the GITS. The GITS Web Service
is described in detail in Section 6 of this document.
• The GITS provides data for both fork and batch jobmanagers. Most of the machines in
the L2G run batch jobmanagers, and so this is an important feature of the GITS. It is,
however, worth noting that some of the point to point tests may return a “failed” state for
batch jobmanagers even when all is well. The batch system in question may not be able to
return results fast enough for a test to succeed. At present the GITS is not yet able to
check the status of batch queues on remote hosts, and determine if the queues are busy at
the time of the test.
• Useful and accessible diagnostics. When a test fails it is important to have some clue as
to why the test failed. Relevant diagnostic information is included in the html output for
easy access. In this respect the functionality of the GITS is limited by the diagnostic
information provided by the GT.
It is important to mention that, despite its name, the GITS was written with two important
uses in mind. These are:
• Performing Integration tests. As discussed above the GITS allows sites to perform the
point to point integration tests, the primary motivation for this work.
• Operational Monitoring. The GITS also provides a toolkit to allow sites to monitor the
operational status of their own local Grid software infrastructure. The requirement for
local Grid administrators to monitor the “dynamic status” or “heartbeat” of their own Grid
machines cannot be overstated. In this respect it is envisaged that the GSC will in due
course adopt the GITS to provide the “dynamic status” of the UK Grid to its users.
3. Tests performed by the GITS
The GITS is available for download, and it is packaged with some basic documentation
and the Proc::Reliable perl module. The script may be copied and modified freely by any
site under the terms of the GNU General Public License. In this section we briefly
describe the tests done by the script; readers interested to know more should consult the
documentation [4].
a) Interoperability tests
The following “point to point” tests are used to check connectivity between hosts on the
grid.
• Ping verifies that you can successfully authenticate to the jobmanager(s) on a remote
host using the following Globus command:
globusrun -a -r hostname/jobmanager
Successfully authenticating to a specific jobmanager (fork or batch service) is a
prerequisite to running the interoperability tests using that particular jobmanager.
• RSL-Hello verifies that you can run the command /bin/echo Hello World on a remote
host using the globusrun command.
• Hello World verifies that you can run the command /bin/echo Hello World on a remote
host using globus-job-run.
• Stage verifies that you can stage a simple shell script to a remote host and then execute
it using globus-job-run.
• RSL-Shell verifies that a remote shell script or executable can be run on a host using the
globus-sh-exec command.
• Batch-Submit verifies that you can submit the batch job, /bin/sleep 600, to a remote host
using the globus-job-submit command.
• Batch-Query verifies that you can query the status of the batch job submitted in
“Batch-Submit” using the globus-job-status command.
• Batch-Cancel verifies that you can cancel the batch job submitted in “Batch-Submit”
using the globus-job-clean command.
• Batch-Retrieve verifies that you can retrieve the output from a short batch job,
/bin/echo Hello there, submitted to a remote machine using the globus-job-submit
command. The GITS uses the command globus-job-get-output to perform this test.
• GASS verifies that the Globus Access to Secondary Storage (GASS) can be used to
transfer a file to/from a remote host. This is a difficult test to automate, and to simplify
the process the GITS implements a “reverse GASS” transfer in which the GASS server is
started on the user’s client machine. The GITS tests the service by executing two
commands; the first to transfer a file to the remote host, and the second to transfer the file
back to the client host.
• GridFTP verifies that the GSI-enabled FTP service can be used to transfer a file to/from
a remote host. The GITS tests the service by executing two globus-url-copy commands;
one to transfer a file to a remote host, and one to retrieve the file.
• GSIssh verifies that you can use the GSI-enabled ssh service to login and run the
command /bin/echo Hello on the remote machine.
• GSIscp verifies that the GSI-enabled scp service can be used to transfer a file to/from a
remote host. The GITS tests the service by executing two gsiscp commands; first to
transfer a file to a remote host, and then to retrieve the file.
b) MDS tests
The following tests are used to check the MDS status for hosts on the grid.
• GRIS performs a test query against the host's Grid Resource Information Service
(GRIS). The GITS executes the command grid-info-search -x -h hostname to perform
this test.
• UK GIIS verifies that the host is advertising its Grid Resource Information Service
(GRIS) to the UK National Grid Index Information Service (ginfo.grid-support.ac.uk).
• Jobmanagers verifies that the host is advertising its jobmanagers to the UK National
Grid Index Information Service (ginfo.grid-support.ac.uk). In addition if the test is
successful, the jobmanagers advertised to the GIIS are listed.
c) MDS tests for administrators
The following MDS tests are useful to Grid/system administrators who need to debug
their local MDS setup.
• Local GIIS verifies that the host is advertising its Grid Resource Information Service
(GRIS) to the UK National Grid Index Information Service (ginfo.grid-support.ac.uk).
• Comparison verifies that the information advertised by a host's Grid Resource
Information Service (GRIS) is being faithfully reported by the UK National GIIS. This
test queries the GRIS of a host at a specified site GIIS and at the UK National GIIS, and
compares the results of the two queries in order to perform this test.
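The dependency of the interoperability tests on a successful Ping can be sketched as follows. This is a minimal illustration in Python rather than the perl used by the GITS. The function names, the `runner` callable and the "passed", "failed" and "skipped" labels are assumptions made for the example, as is the idea of marking dependent tests as skipped. Only the Ping command line is taken from the text.

```python
from typing import Callable, Dict, List


def ping_command(host: str, jobmanager: str) -> List[str]:
    """Command line for the Ping test, as quoted in the text."""
    return ["globusrun", "-a", "-r", f"{host}/{jobmanager}"]


def run_interoperability_tests(
    host: str,
    jobmanager: str,
    tests: Dict[str, List[str]],
    runner: Callable[[List[str]], bool],
) -> Dict[str, str]:
    """Run Ping first; run the other tests only if authentication worked.

    `tests` maps a test name to its command line and `runner` executes a
    command and reports success. Both are hypothetical hooks for this sketch.
    """
    results: Dict[str, str] = {}
    ping_ok = runner(ping_command(host, jobmanager))
    results["Ping"] = "passed" if ping_ok else "failed"
    for name, command in tests.items():
        if not ping_ok:
            # Without authentication the remaining results carry no information.
            results[name] = "skipped"
        else:
            results[name] = "passed" if runner(command) else "failed"
    return results
```

Passing the runner in as a parameter keeps the ordering logic separate from the mechanics of executing a command, so the same function can be exercised with a stub in place of a real Grid host.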
4. Some experiences of using the GITS to monitor the L2G
During the course of the L2G sites have been
working very hard together by running the
GITS on a daily basis with a view to
integrating the resources on the Grid.
Interested readers may wish to go to the GIWG
homepage [2] which provides pointers to all
the L2G integration results. Using all these
integration test results to identify and resolve
problems at sites is by no means a simple task.
The GSC maintain a “watching brief” over the
daily integration test results to identify issues
on the Grid, and when appropriate sites are
contacted so that persistent integration
problems can be discussed. This section details
the key issues that can be drawn from our
experience of integrating/monitoring the Grid
using the software tools developed in this
project.
On the whole the problems shown up by the
daily integration test results tend to be fairly
mundane. Over half the problems raised during
monitoring the L2G were related to the
“testing infrastructure” maintained by
individual sites. Sites need to install and
maintain up to date versions of the GT and the
GITS, and ensure that the correct set of remote
hosts is tested each day. Although building this
“testing infrastructure” is not technically
demanding, it can however be a time
consuming and error prone operation. A
number of issues have surfaced.
• Each individual involved in testing has to apply for an account on each remote host
separately. This complicates getting full test coverage. A simplified means of applying
for accounts across all L2G systems would help here. A common gridmap-file, as used by
GridLab, may be the best solution, although this requires a unified set of terms and
conditions of use.
• The set of test hosts changes with time. It would be useful to get the names of the test
hosts from a central location, such as the GITS Web Service.
• There is no single package of software to install for L2G sites. This means that new
versions of the GITS are installed by hand when people have the time. With the recently
released VOM tools, the GITS Web Service tool (gqec.pl) and guidelines for setting up
Grid enabled ssh, NTP, etc., it might be time to consider a single software distribution
that must be installed on all L2G machines.
• Running the tests automatically requires either frequent manual intervention to create a
proxy, or the use of very long lived proxies, which represent a security problem.
Another common source of problems was
issues related to firewalls and port range
usage. Some of these are easily resolved by
requests to system administrators or by
setting the Globus port range environment
variable. One technical issue
that has not been resolved is the frequent
timeouts that occur when testing between
systems that both limit the TCP port range.
This issue is currently under investigation.
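As an illustration of the environment variable approach, the sketch below prepares an environment for a test command with a restricted port range. The text does not name the variable; the name `GLOBUS_TCP_PORT_RANGE` and its "min,max" value format are assumptions based on the Globus Toolkit 2 documentation, and the helper `globus_environment` is hypothetical.

```python
import os
from typing import Dict, Optional


def globus_environment(
    min_port: int, max_port: int, base: Optional[Dict[str, str]] = None
) -> Dict[str, str]:
    """Return a copy of the environment with the Globus port range set.

    Assumption: GLOBUS_TCP_PORT_RANGE takes the form "min,max".
    """
    if not (0 < min_port <= max_port <= 65535):
        raise ValueError("invalid port range")
    env = dict(os.environ if base is None else base)
    env["GLOBUS_TCP_PORT_RANGE"] = f"{min_port},{max_port}"
    return env
```

The returned dictionary could be passed as the `env` argument of `subprocess.run` so that the restriction applies only to the test commands and not to the whole login session. The range chosen must of course match the ports opened on the site firewall.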
Monitoring the GITS results can be a
difficult and time consuming process. While
the HTML output page for the tests provides
a clear summary of the results from one site
for one day, even viewing this needs a small
font and a large screen. An example of the
html output from the GITS is shown in
Figure 1. The task becomes much harder
when trying to compare results from
different sites and for various dates. At this
juncture it’s not clear how this issue could
be satisfactorily resolved. A more
sophisticated interface, such as the GITS
Web Service view, could be very useful in
this respect. Alternatively if all sites
uploaded their daily integration results into
the GITS Web Service it might be possible
to devise an automated means of analysing
and correlating the results from all the
participating sites to form an overview of the
status of the UK e-Science Grid on a
particular day.
Figure 1: An example of the html output from GITS
It’s worth highlighting that the GITS Web
Service has already proved to be invaluable
with respect to carrying out operational
monitoring. The historical data from the
tests stored in the database can help to
identify which issues are temporary or
persistent. In addition being able to work out
when something stopped working can help
locate the change that caused it in the first
place.
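The use of historical data described above can be sketched in a few lines. The example assumes that the history of one test against one host is available as a list of (day, passed) pairs; this layout, the function names and the three day threshold for calling a problem "persistent" are assumptions for illustration, not features of the GITS Web Service.

```python
from datetime import date
from typing import List, Optional, Tuple

History = List[Tuple[date, bool]]  # (day of the test run, whether it passed)


def failing_since(history: History) -> Optional[date]:
    """Return the first day of the current unbroken run of failures.

    Returns None when the most recent result is a pass or there is no data.
    """
    start: Optional[date] = None
    for day, passed in sorted(history):
        if passed:
            start = None  # a pass ends any earlier run of failures
        elif start is None:
            start = day
    return start


def classify(history: History, persistent_days: int = 3) -> str:
    """Label the current state as 'ok', 'temporary' or 'persistent'."""
    start = failing_since(history)
    if start is None:
        return "ok"
    run_length = sum(1 for day, _ in history if day >= start)
    return "persistent" if run_length >= persistent_days else "temporary"
```

The date returned by `failing_since` is the natural starting point when looking for the configuration change that caused a failure.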
5. Potential improvements to the GITS
At the time of writing the UK e-Science
community is reasonably happy with the GITS
in terms of the tests that it offers, and its
operation. After a year of development most of
the bugs have been resolved and it’s quite a
robust piece of software. There are, however, a
number of areas in which improvements could
be made. These are:
a) Process timeouts
An important issue that still needs to be
investigated in more detail is the way that the
GITS kills processes. A number of users have
noted that the GITS leaves a number of
hanging processes in its wake, and while this
does not prevent the GITS from functioning
correctly it is annoying and should be
addressed. Unfortunately this is one of the
most difficult areas of perl programming, and
it may take a while to resolve.
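One common way to avoid leftover processes on POSIX systems is to start each test command in its own process group and, when the timeout expires, signal the whole group so that any children are terminated as well. The sketch below shows the idea in Python. It does not describe how Proc::Reliable or the GITS work, and the function name `run_with_timeout` is an assumption.

```python
import os
import signal
import subprocess
from typing import List, Optional, Tuple


def run_with_timeout(command: List[str], timeout: float) -> Tuple[Optional[int], str, bool]:
    """Run a command and return (exit_code, output, timed_out).

    On timeout the whole process group is killed and exit_code is None.
    """
    proc = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        start_new_session=True,  # child leads a new session and process group
    )
    try:
        output, _ = proc.communicate(timeout=timeout)
        return proc.returncode, output, False
    except subprocess.TimeoutExpired:
        try:
            # The group id equals the child's pid because it leads the group.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group had already exited
        output, _ = proc.communicate()  # reap the child and collect output
        return None, output, True
```

Killing the group rather than the single child matters for commands that spawn helpers of their own, since a signal sent only to the parent can leave those helpers running.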
b) GT versioning
Some method of detecting the version of the
GT installed on local and remote hosts would
be very useful. This is by no means a simple
task (given the complex structure of the
Globus software distribution). In a grid in
which sites are running assorted
versions/releases of the GT this functionality
would be useful (if not essential) in terms of
immediately identifying potential
incompatibilities.
c) Batch jobmanager support
More complete support of batch jobmanagers
could be included in the GITS. An example is
the ability to check the status of the batch
queues on a remote host, and to take an
appropriate course of action if the queues are
busy. Currently the GITS is unable to make
allowance for timeouts due to full queues.
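One possible "appropriate course of action" is to report a timed out batch test as inconclusive, rather than failed, when the remote queue is known to be busy. The sketch below illustrates this. The `queue_is_busy` callable is a hypothetical hook, since the GITS currently has no means of querying remote queues, and the "inconclusive" label is likewise an assumption.

```python
from typing import Callable


def interpret_batch_result(
    passed: bool, timed_out: bool, queue_is_busy: Callable[[], bool]
) -> str:
    """Classify a batch test result, allowing for busy queues."""
    if passed:
        return "passed"
    # Query the queue only when it could change the verdict.
    if timed_out and queue_is_busy():
        return "inconclusive"
    return "failed"
```

Calling the queue check lazily means that hosts whose tests pass are never asked for their queue status.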
d) The results interface
Although the html output provided by the
GITS is useful in identifying issues, it is, as
noted in Section 4, somewhat cumbersome.
A more sophisticated interface, such as the Web
Service view, could represent a worthwhile
improvement to the existing script.
e) Parallelisation
As more sites join the UK e-Science Grid,
performing daily integration tests at each site
becomes a more time consuming process. To
circumvent this issue it’s possible to
simultaneously execute several GITS client
processes each testing hosts at a specific site,
and combine the results files when all the
client processes have completed. A wrapper
script is currently under development which
offers users of the GITS this functionality.
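The approach can be sketched as follows. This is not the wrapper script mentioned above: it is a minimal Python illustration in which `site_runner` is a hypothetical callable standing in for one GITS client process, and results are assumed to be a mapping from host name to test outcomes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Results = Dict[str, Dict[str, str]]  # host -> test name -> outcome


def run_sites_in_parallel(
    hosts_by_site: Dict[str, List[str]],
    site_runner: Callable[[List[str]], Results],
    max_workers: int = 4,
) -> Results:
    """Test the hosts of several sites concurrently and merge the results."""
    combined: Results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each worker handles the hosts of exactly one site.
        for site_results in pool.map(site_runner, hosts_by_site.values()):
            combined.update(site_results)
    return combined
```

Threads are sufficient here because each worker would spend its time waiting on an external process or on the network. The merge step assumes that no host appears under more than one site.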
6. The GITS Web Service
The GITS Web Service allows the various
sites running the GITS to upload their results
in XML format to a central database using
Web Service technology. The results stored in
the database can be queried using various predefined search criteria through the Web
Service. Access to the Web Service is
controlled so that only certain users can upload
results; anyone, however, can download the
results. Proxy certificates are used for
authentication allowing users to use batch
scripts to upload results. The results are
returned from the Web Service in the same
XML format as the GITS produces, so
applications can use results directly from the
GITS or from the Web Service. The Web
Service also provides the functionality to
convert the results from XML format to
HTML format. The XML schema that
describes the GITS results allows new tests to
be added without modification to the schema.
Various clients for the Web Service have been
provided including a Perl script, a C++/Qt GUI
tool that also acts as an interface for the GITS
and a Perl CGI script. It is hoped that the GITS
Web Service will be useful in a number of
areas: making it easier to monitor the GITS
results by collating all the results in one place
and by providing useful search tools; maintaining
a log of the state of the Grid, which allows users
to discover why a Grid application may have failed
in the past; and allowing applications to discover
whether a resource is working correctly before
attempting to use it.
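The property that new tests can be added without changing the schema can be illustrated with a small example. The element and attribute names below (`results`, `host`, `test`, `name`, `outcome`) are invented for this sketch and are not the actual GITS schema. The point is only that when a test name is carried as data rather than as an element name, a new test needs no schema change.

```python
import xml.etree.ElementTree as ET
from typing import Dict

Results = Dict[str, Dict[str, str]]  # host -> test name -> outcome


def results_to_xml(source: str, results: Results) -> str:
    """Serialise results using a hypothetical, test-agnostic layout."""
    root = ET.Element("results", {"source": source})
    for host, tests in results.items():
        host_el = ET.SubElement(root, "host", {"name": host})
        for test_name, outcome in tests.items():
            # The test name is an attribute value, so any name is valid.
            ET.SubElement(host_el, "test", {"name": test_name, "outcome": outcome})
    return ET.tostring(root, encoding="unicode")


def xml_to_results(document: str) -> Results:
    """Parse a document produced by results_to_xml back into a mapping."""
    root = ET.fromstring(document)
    return {
        host.get("name"): {t.get("name"): t.get("outcome") for t in host.findall("test")}
        for host in root.findall("host")
    }
```

Because the same format is used both by the script and by the Web Service, a client written against `xml_to_results` would not need to know which of the two produced the document.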
7. Future work
In this section we present and discuss some
strategic areas which we feel need to be
addressed. There are at least four potential
development areas.
a) The GITS
We would argue that a well defined set of
point to point tests, as provided by the existing
GITS, will continue to be a useful tool for
integration testing and operational monitoring
irrespective of the “level” of the UK e-Science
Grid. In addition it is worth noting that the
existing production UK Grid will continue to
be deployed for at least another 15 months. For
these reasons it is important to continue to
support and maintain the current GITS. Of
particular note in this respect are the suggested
technical improvements to the GITS identified
in Section 5.
The current GITS is only fully compatible with
version 2.x of the GT. Over the next year or so
a new prototype Grid will be developed and
deployed in the UK, and this will employ the
“next-generation” GT 3 based on Open Grid
Service Architecture mechanisms. An
interesting challenge will be to port the
existing integration software for use with a GT
3 Grid.
b) An applications test suite
It could be argued that even if your Grid
resources pass the GITS tests, there is no
evidence that they can run significant or “real”
applications without issues. An application test
suite would essentially take the testing process
from the simple “point to point” level,
provided by the GITS, to something more
realistic.
A reasonable starting point could be a subset
(or all) of the applications identified in the
L2G Applications Work Package. A typical
test application might, for example, consist of
a Makefile, source codes, input data files, and
output files to allow the user to verify the test
run.
c) Automating Operational Monitoring
It was noted in Section 4 that the current
method of carrying out operational monitoring
of the Grid is cumbersome and time
consuming. As the number of participating
sites increases, this situation is exacerbated.
Perhaps the best solution that has been
suggested is to provide a software tool capable
of analysing and correlating all the data
uploaded into the GITS Web Service, and thus
provide an automatically generated overview
of the status of the UK e-Science Grid each
day.
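A first step towards such a tool might be to correlate the results from all testing sites for one day and separate problems with a tested host from problems with a site's own testing infrastructure. The sketch below assumes the day's data has been reduced to a mapping from testing site to tested host to a single pass or fail flag. This layout and the function name `summarise` are assumptions for illustration.

```python
from typing import Dict, List

Reports = Dict[str, Dict[str, bool]]  # testing site -> tested host -> all tests passed


def summarise(reports: Reports) -> Dict[str, List[str]]:
    """Flag hosts that fail from everywhere and sites that fail to everywhere."""
    outcomes_by_target: Dict[str, List[bool]] = {}
    for by_target in reports.values():
        for target, ok in by_target.items():
            outcomes_by_target.setdefault(target, []).append(ok)
    # A host that no site can reach probably has a problem of its own.
    suspect_targets = sorted(
        t for t, oks in outcomes_by_target.items() if not any(oks)
    )
    # A site whose tests fail against every host probably has a local
    # problem with its testing infrastructure.
    suspect_sources = sorted(
        s for s, by_target in reports.items()
        if by_target and not any(by_target.values())
    )
    return {"suspect_targets": suspect_targets, "suspect_sources": suspect_sources}
```

This distinction matters because, as noted in Section 4, over half the problems raised during monitoring were related to the testing infrastructure at individual sites rather than to the hosts being tested.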
d) Parallel versions of the GITS tests
The development and implementation of
parallel versions of some of the GITS tests was
suggested some time ago. They would form a
really useful addition to the present test set,
and given that the base infrastructure is in
place would hopefully not require too much
effort to design and implement.
Acknowledgements
I would like to thank my colleagues involved
in integration testing at the GSC and at the
Regional e-Science Centres for their ongoing
support, contributions and suggestions.
References
1. The Grid Engineering Task Force, http://www.grid-support.ac.uk/etf/
2. The Grid Integration Working Group,
http://www.gridsupport.ac.uk/etf/wg/integration-tests.html
3. The Globus project homepage, http://www.globus.org
4. The homepage for the Grid Integration Test Script,
http://www.soton.ac.uk/~djb1/gits.html
5. The homepage for the GITS Web Service, http://vermont.mvc.mcc.ac.uk/gqec/
6. The Globus project Monitoring and Discovery Service, http://www.globus.org/mds/
7. The Proc::Reliable perl module, http://www.cpan.org/authors/id/D/DG/DGOLD/