Building the e-Science Grid in the UK: Providing a software toolkit to enable operational monitoring and Grid integration

David Baker1, Simon Cox2, Jon Hillier3, Mark McKeown4 and Ron Fowler5

1 University of Southampton Information Systems Services
Email: D.J.Baker@soton.ac.uk
2 University of Southampton Regional e-Science Centre
3 Formerly of University of Oxford
4 University of Manchester
5 CCLRC Rutherford Appleton Laboratory

Abstract

Over the past eighteen months we have been developing a software toolkit that enables the sites participating in the UK e-Science Grid to carry out operational monitoring of Grid resources and that facilitates the integration of the Grid. In this paper we present and discuss the design and implementation of the software tools throughout the development of the UK e-Science Grid, and comment on our experience of using these tools for integration testing and operational monitoring. Finally, we highlight current problems, suggest potential improvements to the tools, and suggest ways in which the project could be developed and diversified in the future.

1. Background

The Grid Engineering Task Force (GETF) [1] was formed in October 2001 to guide the construction, testing and demonstration of a prototype (Level 1) e-Science Grid in the UK. It contains members from each of the 10 regional e-Science Centres. The GETF initially operated through working groups which put the key elements of the Grid infrastructure in place. The integration of the developing software/hardware infrastructure is a vital aspect of building an e-Science Grid.

To address this issue, UK e-Science participants at Southampton University (Simon Cox and David Baker) set up the Grid Integration Working Group (GIWG) [2], under the coordination of David Baker, with the following objectives:

• Enable all regional e-Science centres to successfully deploy the Globus Toolkit (GT) [3] on their local resources.
• Enable all regional e-Science centres to successfully contribute Grid-enabled resources to the UK-wide Grid infrastructure.
• Ensure that the UK Grid is a useful and usable resource, so that users collaborating on e-Science projects will have access to the hardware and software resources to enable them to successfully carry out their research.

With these objectives in mind, the activities of this group were as follows:

• Define a set of point to point “integration tests” appropriate to the Level 1 Grid. For example, it was important to ensure that simple jobs (e.g. Hello World) could be submitted from one site to another, and that files could be transferred “across” the Grid.
• Develop the software to enable the integration tests to be performed from each site on the Grid. To this end the Grid Integration Test Script (GITS) [4] was written.
• Facilitate the integration of the Grid by encouraging sites to carry out the point to point tests to other sites on the Grid, and to publish their results on the GETF web site. This activity involved providing some basic troubleshooting advice to sites experiencing problems, and generally encouraging sites to become active members of the developing UK-wide Grid.

During the second phase of the development of the UK Grid, the Level 2 Grid (L2G), the GIWG continued its activities (“recast” as the Operational Monitoring Work Package).
During this period we concentrated on developing, improving and supplementing the prototype software with a view to providing a robust and reliable toolkit to enable sites to carry out daily integration testing and to enable the Grid Support Centre (GSC) to monitor the status of the UK Grid (also known as “Operational Monitoring”).

The remainder of this paper is organised as follows. Section 2 describes the key features and design of the GITS developed by Baker et al. In Section 3 we outline the tests provided by the GITS. In Section 4 we present some experiences of carrying out operational monitoring at the GSC (Fowler) using the data provided by the GITS. In Section 5 we describe some potential technical improvements to the GITS. In Section 6 we describe the GITS Web Service [5], developed by McKeown, a tool that enables users to interpret and analyse the data provided by the GITS. Finally, in Section 7 we discuss potential future challenges and opportunities in this area of work.

2. Development of the Grid Integration Test Script (GITS)

The automation of the integration tests was a high priority. As the number of machines in the Grid grows, performing the tests from each site would become progressively more time consuming and burdensome. In reality, without the GITS sites would probably either not carry out the tests or write their own scripts to automate the task. The development of the GITS began during the deployment of the Level 1 Grid, and for reasons of portability perl was selected as the language of choice. During the development of the L2G the focus at Southampton has been on improving the GITS to make it a more robust and usable piece of software. The most important features of the GITS are:

• A comprehensive set of integration/operational monitoring tests. The GITS now provides a wide range of tests that are appropriate to the services supported on the UK Grid. This includes point to point tests to check job submission and file transfers, and tests to check the Monitoring and Discovery Service (MDS) functionality of hosts on the Grid. MDS is the information services component of the GT [6]. It is envisaged, however, that the GITS should be regarded as a “toolkit”, in the sense that users are able to select the most appropriate subset of tests for their needs. It is worth noting that the present version of the GITS is only fully compatible with version 2.x of the GT; however, it may provide some functionality with GT version 1.
• Test execution is done using Proc::Reliable. Most of the tests (interoperability tests plus GRIS test) are performed using the perl module Proc::Reliable [7]. This module provides simple, reliable and configurable subprocess execution, and in particular ensures that process timeouts can be handled reliably in the GITS. It is, however, worth noting that while timeouts are handled reliably by the script, they occasionally leave dead processes behind, and this issue does need to be addressed. This is discussed in Section 5 of this paper.
• The GITS can output test results in HTML format. By default the GITS sends the test results to the screen, and the generation of HTML output is optional. Users of the GITS are, however, strongly recommended to use this feature since it provides a clear way of presenting their results, with easy access to diagnostic information (see below).
• The GITS can output test results in XML format. This is another optional feature, and it provides the user with a means of publishing their results in a format that is appropriate for use with the GITS Web Service. The GITS Web Service is an extremely useful “companion” to the GITS. It enables users to store the results generated by the GITS, and this historical data is invaluable for interpreting and analysing the information provided by the GITS. The GITS Web Service is described in detail in Section 6 of this document.
• The GITS provides data for both fork and batch jobmanagers. Most of the machines in the L2G run batch jobmanagers, and so this is an important feature of the GITS. It is, however, worth noting that some of the point to point tests may return a “failed” state for batch jobmanagers even when all is well, because the batch system in question may not be able to return results fast enough for a test to succeed. At present the GITS is not able to check the status of batch queues on remote hosts and determine whether the queues are busy at the time of the test.
• Useful and accessible diagnostics. When a test fails it is important to have some clue as to why the test failed. Relevant diagnostic information is included in the HTML output for easy access. In this respect the functionality of the GITS is limited by the diagnostic information provided by the GT.

It is important to mention that, despite its name, the GITS was written with two important uses in mind. These are:

• Performing integration tests. As discussed above, the GITS allows sites to perform the point to point integration tests, which was the primary motivation for this work.
• Operational monitoring. The GITS also provides a toolkit to allow sites to monitor the operational status of their own local Grid software infrastructure. The importance of local Grid administrators monitoring the “dynamic status” or “heartbeat” of their own Grid machines cannot be overstated. In this respect it is envisaged that the GSC will in due course adopt the GITS to provide the “dynamic status” of the UK Grid to its users.

The GITS is available for download, and it is packaged with some basic documentation and the Proc::Reliable perl module. The script may be copied and modified freely by any site under the terms of the GNU General Public License.

3. Tests performed by the GITS

In this section we briefly describe the tests done by the script; readers interested in knowing more should consult the documentation [4].

a) Interoperability tests

The following “point to point” tests are used to check connectivity between hosts on the grid.

• Ping verifies that you can successfully authenticate to the jobmanager(s) on a remote host using the following Globus command:
    globusrun -a -r hostname/jobmanager
  Successfully authenticating to a specific jobmanager (fork or batch service) is a prerequisite to running the interoperability tests using that particular jobmanager.
• RSL-Hello verifies that you can run the command /bin/echo Hello World on a remote host using the globusrun command.
• Hello World verifies that you can run the command /bin/echo Hello World on a remote host using globus-job-run.
• Stage verifies that you can stage a simple shell script to a remote host and then execute it using globus-job-run.
• RSL-Shell verifies that a remote shell script or executable can be run on a host using the globus-sh-exec command.
• Batch-Submit verifies that you can submit the batch job, /bin/sleep 600, to a remote host using the globus-job-submit command.
• Batch-Query verifies that you can query the status of the batch job submitted in “Batch-Submit” using the globus-job-status command.
• Batch-Cancel verifies that you can cancel the batch job submitted in “Batch-Submit” using the globus-job-clean command.
• Batch-Retrieve verifies that you can retrieve the output from a short batch job, /bin/echo Hello there, submitted to a remote machine using the globus-job-submit command. The GITS uses the command globus-job-get-output to perform this test.
• GASS verifies that the Globus Access to Secondary Storage (GASS) can be used to transfer a file to/from a remote host. This is a difficult test to automate, and to simplify the process the GITS implements “reverse GASS” transfers in which the GASS server is started on the user’s client machine. The GITS tests the service by executing two commands: the first to transfer a file to the remote host, and the second to transfer the file back to the client host.
• GridFTP verifies that the GSI-enabled FTP service can be used to transfer a file to/from a remote host. The GITS tests the service by executing two globus-url-copy commands: one to transfer a file to a remote host, and one to retrieve the file.
• GSIssh verifies that you can use the GSI-enabled ssh service to log in and run the command /bin/echo Hello on the remote machine.
• GSIscp verifies that the GSI-enabled scp service can be used to transfer a file to/from a remote host. The GITS tests the service by executing two gsiscp commands: first to transfer a file to a remote host, and then to retrieve the file.

b) MDS tests

The following tests are used to check the MDS status for hosts on the grid.

• GRIS performs a test query against the host's Grid Resource Information Service (GRIS). The GITS executes the command grid-info-search -x -h hostname to perform this test.
• UK GIIS verifies that the host is advertising its Grid Resource Information Service (GRIS) to the UK National Grid Index Information Service (ginfo.grid-support.ac.uk).

c) MDS tests for administrators

The following MDS tests are useful to Grid/system administrators who need to debug their local MDS setup.

• Local GIIS verifies that the host is advertising its Grid Resource Information Service (GRIS) to the UK National Grid Index Information Service (ginfo.grid-support.ac.uk).
• Jobmanagers verifies that the host is advertising its jobmanagers to the UK National Grid Index Information Service (ginfo.grid-support.ac.uk). In addition, if the test is successful, the jobmanagers advertised to the GIIS are listed.
• Comparison verifies that the information advertised by a host's Grid Resource Information Service (GRIS) is being faithfully reported by the UK National GIIS. This test queries the GRIS of a host at a specified site GIIS and at the UK National GIIS, and compares the results of the two queries.

4. Some experiences of using the GITS to monitor the L2G

During the course of the L2G, sites have been working very hard together by running the GITS on a daily basis with a view to integrating the resources on the Grid. Interested readers may wish to go to the GIWG homepage [2], which provides pointers to all the L2G integration results. Using all these integration test results to identify and resolve problems at sites is by no means a simple task. The GSC maintains a “watching brief” over the daily integration test results to identify issues on the Grid, and when appropriate sites are contacted so that persistent integration problems can be discussed. This section details the key issues that can be drawn from our experience of integrating/monitoring the Grid using the software tools developed in this project.

On the whole the problems shown up by the daily integration test results tend to be fairly mundane. Over half the problems raised while monitoring the L2G were related to the “testing infrastructure” maintained by individual sites. Sites need to install and maintain up to date versions of the GT and the GITS, and ensure that the correct set of remote hosts is tested each day. Although building this “testing infrastructure” is not technically demanding, it can be a time consuming and error prone operation. A number of issues have surfaced.

• Each individual involved in testing has to apply for an account on each remote host separately. This complicates getting full test coverage. A simplified means of applying for accounts across all L2G systems would help here.
A common gridmap-file, as used by GridLab, may be the best solution, although this requires a unified set of terms and conditions of use.
• The set of test hosts changes with time. It would be useful to get the names of the test hosts from a central location, such as the GITS Web Service.
• There is no single package of software to install for L2G sites. This means that new versions of the GITS are installed by hand when people have the time. With the recently released VOM tools, the GITS Web Service tool (gqec.pl) and guidelines for setting up Grid-enabled ssh, NTP, etc., it might be time to consider a single software distribution that must be installed on all L2G machines.
• Running the tests automatically requires either frequent manual intervention to create a proxy, or the use of very long-lived proxies, which represent a security problem.

Another common source of problems was firewalls and port range usage. Some of these problems are easily resolved by requests to system administrators or by getting Globus to set the port range environment variable. One technical issue that has not been resolved is the frequent timeouts that occur when testing between systems that both limit the TCP port range. This issue is currently under investigation.

Monitoring the GITS results can be a difficult and time consuming process. While the HTML output page for the tests provides a clear summary of the results from one site for one day, even viewing this needs a small font and a large screen. An example of the HTML output from the GITS is shown in Figure 1. The task becomes much harder when trying to compare results from different sites and for various dates. It is not yet clear how this issue can be satisfactorily resolved. A more sophisticated interface, such as the GITS Web Service view, could be very useful in this respect. Alternatively, if all sites uploaded their daily integration results into the GITS Web Service, it might be possible to devise an automated means of analysing and correlating the results from all the participating sites to form an overview of the status of the UK e-Science Grid on a particular day.

Figure 1: An example of the HTML output from the GITS

It is worth highlighting that the GITS Web Service has already proved to be invaluable for operational monitoring. The historical data from the tests stored in the database can help to identify which issues are temporary and which are persistent. In addition, being able to work out when something stopped working can help locate the change that caused the failure in the first place.

5. Potential improvements to the GITS

At the time of writing the UK e-Science community is reasonably happy with the GITS in terms of the tests that it offers, and its operation. After a year of development most of the bugs have been resolved and it is quite a robust piece of software. There are, however, a number of areas in which improvements could be made. These are:

a) Process timeouts

An important issue that still needs to be investigated in more detail is the way that the GITS kills processes. A number of users have noted that the GITS leaves a number of hanging processes in its wake, and while this does not prevent the GITS from functioning correctly it is annoying and should be addressed. Unfortunately this is one of the most difficult areas of perl programming, and it may take a while to resolve.

b) GT versioning

Some method of detecting the version of the GT installed on local and remote hosts would be very useful. This is by no means a simple task, given the complex structure of the Globus software distribution. In a grid in which sites are running assorted versions/releases of the GT, this functionality would be useful (if not essential) in terms of immediately identifying potential incompatibilities.
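As an illustration of the hanging-process problem described in (a), the following is a minimal sketch of one common approach on POSIX systems: run each test command in its own process group and, on timeout, signal the whole group rather than only the direct child. It is written in Python purely for illustration; it is not the GITS's Perl code, nor the way Proc::Reliable works internally, and the function name run_with_timeout and the status strings are assumptions made for this example.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments) with a time limit.

    Returns (status, output) where status is "ok", "failed" or "timeout".
    Hypothetical helper for illustration only; not part of the GITS.
    """
    # start_new_session=True makes the child the leader of a new session and
    # process group, so anything it spawns can be signalled together with it.
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,
        text=True,
    )
    try:
        output, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            # Kill the whole process group, not just the direct child, so
            # that grandchildren are not left hanging after the timeout.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group had already exited
        # Collect remaining output and reap the child so no zombie is left.
        output, _ = proc.communicate()
        return "timeout", output
    return ("ok" if proc.returncode == 0 else "failed"), output
```

A test such as Hello World could then be wrapped as run_with_timeout(["globus-job-run", host, "/bin/echo", "Hello World"], 60), where host and the 60 second limit are placeholders. In Perl the corresponding building blocks would be setsid from the POSIX module and the built-in kill applied to a process group, although whether that fits the structure of the existing script is untested here.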
c) Batch jobmanager support

More complete support of batch jobmanagers could be included in the GITS. For example, the GITS could check the status of the batch queues on a remote host and take an appropriate course of action if the queues are busy. Currently the GITS is unable to make allowance for timeouts due to full queues.

d) The results interface

Although the HTML output provided by the GITS is useful in identifying issues, it is, as noted in Section 4, rather cumbersome. A more sophisticated interface, such as the Web Service view, could represent a worthwhile improvement to the existing script.

e) Parallelisation

As more sites join the UK e-Science Grid, performing daily integration tests at each site becomes a more time consuming process. To circumvent this issue it is possible to execute several GITS client processes simultaneously, each testing hosts at a specific site, and to combine the results files when all the client processes have completed. A wrapper script that offers users of the GITS this functionality is currently under development.

6. The GITS Web Service

The GITS Web Service allows the various sites running the GITS to upload their results in XML format to a central database using Web Service technology. The results stored in the database can be queried through the Web Service using various predefined search criteria. Access to the Web Service is controlled so that only certain users can upload results; however, anyone can download the results. Proxy certificates are used for authentication, allowing users to use batch scripts to upload results. The results are returned from the Web Service in the same XML format as the GITS produces, so applications can use results directly from the GITS or from the Web Service. The Web Service also provides the functionality to convert the results from XML format to HTML format. The XML schema that describes the GITS results allows new tests to be added without modification to the schema. Various clients for the Web Service have been provided, including a Perl script, a C++/Qt GUI tool that also acts as an interface for the GITS, and a Perl CGI script.

It is hoped that the GITS Web Service will be useful in a number of areas: making it easier to monitor the GITS results by collating all the results in one place and providing useful search tools; maintaining a log of the state of the Grid, allowing users to discover why some Grid application may have failed in the past; and allowing applications to discover whether a resource is working correctly before attempting to use it.

7. Future work

In this section we present and discuss some strategic areas which we feel need to be addressed. There are at least four potential development areas.

a) The GITS

We would argue that a well defined set of point to point tests, as provided by the existing GITS, will continue to be a useful tool for integration testing and operational monitoring irrespective of the “level” of the UK e-Science Grid. In addition, it is worth noting that the existing production UK Grid will continue to be deployed for at least another 15 months. For these reasons it is important to continue to support and maintain the current GITS. Of particular note in this respect are the suggested technical improvements to the GITS identified in Section 5.

The current GITS is only fully compatible with version 2.x of the GT. Over the next year or so a new prototype Grid will be developed and deployed in the UK, and this will employ the “next-generation” GT 3 based on Open Grid Service Architecture mechanisms. An interesting challenge will be to port the existing integration software for use with a GT 3 Grid.
b) An applications test suite

It could be argued that even if your Grid resources pass the GITS tests, there is no evidence that they can run significant or “real” applications without issues. An application test suite would essentially take the testing process from the simple “point to point” level provided by the GITS to something more realistic. A reasonable starting point could be a subset (or all) of the applications identified in the L2G Applications Work Package. A typical test application might, for example, consist of a Makefile, source codes, input data files, and output files to allow the user to verify the test run.

c) Automating Operational Monitoring

It was noted in Section 4 that the current method of carrying out operational monitoring of the Grid is cumbersome and time consuming. As the number of participating sites increases, this situation is exacerbated. Perhaps the best solution that has been suggested is to provide a software tool capable of analysing and correlating all the data uploaded into the GITS Web Service, and thus provide an automatically generated overview of the status of the UK e-Science Grid each day.

d) Parallel versions of the GITS tests

The development and implementation of parallel versions of some of the GITS tests was suggested some time ago. They would form a useful addition to the present test set and, given that the base infrastructure is in place, would hopefully not require too much effort to design and implement.

Acknowledgements

I would like to thank my colleagues involved in integration testing at the GSC and at the Regional e-Science Centres for their ongoing support, contributions and suggestions.

References

1. The Grid Engineering Task Force, http://www.grid-support.ac.uk/etf/
2. The Grid Integration Working Group, http://www.grid-support.ac.uk/etf/wg/integration-tests.html
3. The Globus project homepage, http://www.globus.org
4. The homepage for the Grid Integration Test Script, http://www.soton.ac.uk/~djb1/gits.html
5. The homepage for the GITS Web Service, http://vermont.mvc.mcc.ac.uk/gqec/
6. The Globus project Monitoring and Discovery Service, http://www.globus.org/mds/
7. The Proc::Reliable perl module, http://www.cpan.org/authors/id/D/DG/DGOLD/