UK e-Science Technical Report Series
ISSN 1751-5971

Report on Lightweight SRM Evaluation

UK Grid Engineering Task Force

David McBride, Imperial College, London
Steven Young, Oxford e-Research Centre
Tim Parkinson, University of Southampton
David Wallom, Oxford e-Research Centre

03-May-2007

Abstract: The Lightweight SRM Evaluation is a project operated by the UK Grid Engineering Taskforce (ETF). Its purpose is to evaluate a lightweight Storage Resource Manager (SRM) implementation, namely the Disk Pool Manager (DPM) software developed by CERN, for suitability for production deployments on the UK National Grid Service infrastructure. DPM is lightweight in that it implements SRM protocol services to stand in front of pools of disk-based file systems rather than more “heavyweight” Mass Storage Systems (MSS), which might include tape archive systems as well as disk pools. It is also lightweight in that it is intended to be low-maintenance and easy to run with good performance. This evaluation finds that deployment of the software in the NGS environment should be straightforward, and recommends that the NGS avail itself of the expertise that exists within the GridPP Storage group in planning, executing and supporting the rollout of SRM within the NGS. The NGS DPM services could be “plugged into” the GridPP Storage group testing infrastructure and publish themselves to appropriate Index Information servers; storage accounting work has been done and should be adopted by the NGS. The only successful deployments of DPM were via RPMs on Red Hat Enterprise Linux/Scientific Linux installations. The best approach for NGS sites might be to provision additional servers on which to deploy DPM services to stand in front of disk resources to be managed as pools.

UK e-Science Technical Report Series
Report UKeS-2007-03
Available from http://www.nesc.ac.uk/technical_papers/UKeS-2007-03.pdf
Copyright © 2007 The University of Edinburgh. All rights reserved.

UK Grid Engineering Taskforce (ETF): Report on Lightweight SRM Evaluation
David McBride, Steven Young, Tim Parkinson, David Wallom

Version   Date        Comments
0.1       2006-08-17  Initial Draft
0.2       2007-02-21  Second Draft
0.3       2007-05-02  Minor corrections following comments
0.4       2007-05-03  Final revisions

1 Introduction

The Lightweight SRM Evaluation is a project operated by the UK Grid Engineering Taskforce (ETF) [1]. Its purpose is to evaluate a lightweight SRM implementation, namely the Disk Pool Manager (DPM) [2] software developed by CERN [3], for suitability for production deployments on the UK National Grid Service [4] infrastructure. The Storage Resource Manager (SRM) interface is a specification managed by the Storage Resource Management Working Group [5], also known as the OGF (formerly GGF) Grid Storage Management Working Group.

DPM is lightweight in that it implements SRM protocol services to stand in front of pools of disk-based file systems rather than more “heavyweight” Mass Storage Systems (MSS), which might include tape archive systems as well as disk pools. It is also lightweight in that it is intended to be low-maintenance and easy to run with good performance. Other SRM implementations include CASTOR [6] and dCache [7].

We use the ETF Grid Middleware Evaluation Criteria document [8] as the basis for this evaluation, together with further evaluation criteria specific to the NGS.

2 General Information

This section captures non-technical information relating to the product and how it may be used.

2.1 Provider

What is the source of the software (commercial/research)?

The software is being developed at CERN as part of the gLite project [9]. The source is available from a CVS repository, though the main distribution method for the software is via RPM packages that are built and distributed as part of gLite system installations.

How mature is the provider and their development process (if this can be objectively determined)?

The provider is well established, and their development processes appear to be basically sound; for example, they make use of source-code control systems appropriately and provide an (albeit somewhat antiquated, imake-based) unattended build mechanism.

Do they have a model for a sustainable future? Will they exist in 3 years' time?

It is highly likely that CERN and the DPM developers will still exist in 3 years' time.

Does the system currently have a large install base? How widely used is the system?

DPM has been deployed by many LCG [10] sites. A GridPP wiki page [11] provides a script for querying the WLCG top-level BDII, which yields information about usage of a set of different SRM implementations; the page also has example output from the script dating from 15 Dec 2006. About 2/3 of GridPP sites run DPM.
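Purely as an illustration of the kind of survey the GridPP script performs, the following Python sketch queries a top-level BDII over anonymous LDAP (using the ldap3 library) and counts storage elements by advertised implementation. The BDII hostname is a placeholder and the GLUE attribute queried is our assumption of what such a survey would count; the authoritative script and its output are on the wiki page [11].

    from collections import Counter
    from ldap3 import ALL, Connection, Server

    # Placeholder top-level BDII endpoint; WLCG BDIIs serve anonymous
    # LDAP on port 2170 with base DN "o=grid" (GLUE 1.x schema).
    server = Server('lcg-bdii.example.org', port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)

    # Count storage elements by their advertised SRM implementation.
    # GlueSEImplementationName is our assumption of the attribute used.
    conn.search(search_base='o=grid',
                search_filter='(objectClass=GlueSE)',
                attributes=['GlueSEImplementationName'])

    counts = Counter(str(e.GlueSEImplementationName) for e in conn.entries)
    for impl, n in counts.most_common():
        print(impl, n)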
2.2 Licensing

Which license is the system distributed under?

The precise license terms are not entirely clear. Specific license documentation is not included in the DPM CVS repository [12]. However, the RPM specification files stored there indicate that the software is licensed under the terms of the GNU General Public License (GPL); a license version string is not specified. This may not indicate the intended license terms for the software: it could simply be an artifact left over from another template file. That said, binary packages distributed by the gLite project themselves advertise in their package headers that the software is provided under the GPL. The EGEE Software License [13] could also potentially apply, as DPM is distributed as part of the gLite software distribution provided by the EGEE project. For the time being it seems reasonable to assume that the code is available under the terms of the GNU General Public License, v2 [14], until evidence is found to the contrary.

Does this limit users (e.g. commercial users, research councils, ...)?

The GPL is a re-distribution license, not an end-user license. Thus, there are no restrictions on how DPM may itself be used.

Does the licensing impose restrictions on any developed software (e.g. restrictive nature of GPL)?

Yes; derivative works of the DPM software must be released under the terms of the same license.

2.3 Supported Platforms

Is the software available on a variety of platforms?

No. The source of the software is available, but the software is distributed for use under one specific Linux distribution (Scientific Linux) and currently works only on one specific version (version 3).

Which platforms have been tested in the evaluation?

32-bit and 64-bit Linux, specifically Red Hat Enterprise Linux, Scientific Linux and SUSE. The evaluation team also put effort into attempting to build the software for Solaris [15].

What is the structure of the testbed on which the software was evaluated (e.g. number of sites, number of machines, number of processors per machine, architecture, operating system)?
London e-Science Centre
  32/64-bit (bi-arch) Red Hat Enterprise Linux: binary RPM installation: successful
  Attempted build of DPM on Solaris: unsuccessful

Southampton
  64-bit Red Hat Enterprise Linux: binary RPM installation by hand: problems with library dependencies from 32-bit RPMs and with a MySQL database version dependency: successful (after much effort)

Belfast
  SUSE 10 (two systems: one 32-bit, one 64-bit): binary RPM installation: unsuccessful
  32-bit Scientific Linux: binary RPM installation: successful

Oxford
  32-bit Scientific Linux: binary RPM installation using YAIM: successful

2.4 Support

How is the product supported, if at all? Is commercial support available? How much does it cost? Is there a support community outside of the commercial organisation? Is there support targeted at end-users/system administrators/developers?

Support channels were not investigated; specifically, we did not investigate what levels of support the DPM software development team at CERN provide, and we had no knowledge of any commercial support channels. Upstream support for the NGS could be an issue: the developers' main priority is LCG/EGEE, and support for other architectures and non-LCG/gLite installations may be a lower priority. Within the UK, the GridPP Storage group and GridPP sites generally have a wealth of experience and knowledge to share about DPM server deployments, and do provide a support community. The community support provided by GridPP is probably best suited for system administrators.

3 Systems Management

We consider here the issues that would impact on the administrators of the particular software product and on the administration of the resources upon which the product will be deployed.

3.1 Documentation for System Managers

Is it available in multiple formats? Does it support keyword searches?

We do not consider these factors to be significant in general. Instead, we would deviate from the current ETF Evaluation Criteria document and suggest that having documentation in at least one format for which viewing and manipulation tools are readily available is sufficient. In the case of the DPM software, a quite detailed DPM Administrators Guide [16] exists. It is clearly tailored for those systems administrators who are deploying DPM as part of a larger LCG or gLite installation; however, it should be sufficient for anyone using the standard distributed binaries.

Is it comprehensive? Is it accurate?

The DPM Administrators Guide is comprehensive; although it suggests using the gLite YAIM tool to handle installation and setup, it also enumerates the specific steps that are required if a systems administrator is installing by hand. So far as we can determine, the documentation is both accurate and up-to-date.

Does it provide a quickstart guide?

No explicit quick-start guide is provided as such; however, the documentation available should be sufficient.

3.2 Server Deployment

How complex/easy is the software to install? What are the system requirements? What other products and libraries does the system depend on? How flexible are the installation requirements (e.g. does it need a specific OS version and release, or a Java environment) and are these easily supported? Do major changes need to be made to the systems environment (e.g. environment variables) to support the software? How well do these new requirements co-exist with other installed software? Will the product make use of any pre-installed libraries, etc. upon which it depends?

The YAIM installation is straightforward, with adequate documentation. It does require that the system is deployed with a Scientific Linux (v3) installation, Java and NTP. The YAIM installation probably assumes that the system is only intended to run the DPM services specified in the YAIM configuration; other services and activity could probably be run on top of the DPM systems, but this does not line up with working practice for gLite deployments.

An RPM installation on Red Hat Enterprise Linux and Scientific Linux is fairly straightforward, though we encountered issues on 64-bit installations. Great difficulties were experienced in attempts to install the software onto other RPM-based distributions (i.e. SUSE) using the SL RPMs. We did not attempt to build the software from source on other Linux platforms. DPM supports the use of MySQL or Oracle as a backend database for the DPM services, though we did not investigate DPM usage of Oracle.

How complex is the software to manage (e.g. how easy is it to add new users)?

New users are managed via grid-mapfiles, which are configured to be downloaded from VO servers (LDAP VO or VOMS). This method of managing users is straightforward.

How does the software co-exist with the individual server's and institutional firewalls if it needs external networking access (outgoing connections) or needs to be externally visible (incoming connections)? Are the software's required ports (fixed/dynamic ranges) and protocols (UDP/TCP) well defined? Can the required ports be altered on deployment?

The firewall requirements for the software are well documented and understood. There is a small set of ports for the services running on a DPM head node or DPM pool node. We did not investigate whether these ports are configurable.

Does the documentation explain the purpose of these ports?

Yes.

Is the configuration of these ports to enable co-existence with a firewall straightforward?

Yes.
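As a quick deployment aid, the sketch below probes a head node for the TCP ports commonly associated with DPM services (the DPNS name server, the DPM daemon, the SRM v1 and v2.2 endpoints, RFIO and GridFTP control). The hostname is a placeholder and the port numbers are the conventional defaults as we understand them; the DPM Administrators Guide [16] remains the authoritative reference.

    import socket

    # Conventional default ports for DPM head-node services; verify
    # against the DPM Administrators Guide [16] before relying on them.
    PORTS = {'dpns': 5010, 'dpm': 5015, 'srmv1': 8443,
             'srmv2.2': 8446, 'rfio': 5001, 'gridftp': 2811}

    def reachable(host, port, timeout=5.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    HOST = 'dpm-head.example.ac.uk'  # placeholder head-node name
    for name, port in sorted(PORTS.items()):
        state = 'open' if reachable(HOST, port) else 'closed/filtered'
        print('%-8s %5d  %s' % (name, port, state))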
Based on this information, how easily could site administrators be persuaded to accommodate the software's requirements?

Quite easily, we expect: the port requirements are small in number and well documented.

How stable is the software?

The stability of the successful installations was good, and no instabilities were observed, though usage and testing were not exhaustive. A testing architecture was discussed by the evaluation team.

3.3 Client Deployment

Is there a client distribution? How lightweight is it?

DPM client commands are available via the LCG/gLite User Interface (UI) installation. This could take the form of a directly installed LCG/gLite UI machine, or a relocatable tarball installation which can be installed on an x86 Linux system. We did not investigate on what range of Linux installations the relocatable UI tarball actually works, nor did we investigate compiling the DPM client commands from source. It should be noted that client interfaces can also be generated from the WSDL descriptions of the services.

3.4 Account Management

How are users added to the server? Is there a management interface? How does the user apply for access? What is the probable impact of this on users' management and sharing of credentials?

The software supports GSI authentication and VOMS. For the purposes of the NGS, the software is able to support NGS users by including references to the NGS VO server, which allows the software to create appropriate grid-mapfiles for the services.
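For illustration, the sketch below reads a grid-mapfile of the kind these services consume: each non-comment line maps a quoted certificate DN to a local account (or, by LCG convention, a leading-dot pool-account prefix). The path and the example DN are placeholders.

    import re

    # A grid-mapfile line looks like:
    #   "/C=UK/O=eScience/OU=SomeSite/CN=Some User" .ngs
    LINE = re.compile(r'^"(?P<dn>[^"]+)"\s+(?P<account>\S+)')

    def parse_gridmap(path='/etc/grid-security/grid-mapfile'):
        """Return a dict mapping certificate DNs to mapped accounts."""
        mappings = {}
        with open(path) as f:
            for raw in f:
                line = raw.strip()
                if not line or line.startswith('#'):
                    continue
                m = LINE.match(line)
                if m:
                    mappings[m.group('dn')] = m.group('account')
        return mappings

    for dn, account in sorted(parse_gridmap().items()):
        print(account, dn)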
Does the software record accounting (usage) information? Are there tools to monitor/review this information? Can the accounting data be exported to other management systems?

There are tools for accounting of disk pool usage by VO, which are part of DPM information publishing [17]. The evaluation activity did not investigate accounting information in any depth.

3.5 Reliability

Are there any high availability features in the product? Can these be provided by simple server replication?

High availability and server replication were not considered by the evaluation.

How stable is the product in a production environment? How stable is the product under load? How reliable is the product under load?

No service instabilities were observed.

3.6 Distributed Management

What facilities are provided for managing a Grid of machines (as opposed to managing an individual machine contributed to the Grid)? What support is there for monitoring, policy specification, system upgrades, and backwards compatibility between versions?

A YAIM/SL installation is maintained by RPM repositories against which the system updates itself on a (by default) daily basis.

3.7 External Services

How easy is it to integrate and deploy third party services from other organisations or developers into the server infrastructure? Can client libraries be supplied so that they can be easily integrated into the existing client environment?

Not considered by the evaluation.

3.8 Scalability

How scalable is the infrastructure? Is system activity/management coordinated through a central point? During the evaluation, how far was the scalability tested? How might the introduction of further sites and machines alter performance?

The questions of scalability (i.e. the number of pools, the size of file storage that can be supported by DPM, disk/file-system performance limits, network bandwidth limits, limits on the number of concurrent queries, and the scalability of the information provider structure) were not considered in depth by the evaluation team. At the recent WLCG workshop [18], stress-testing results from Glasgow were presented.

4 User Experience

Examine how the user interacts with the established grid infrastructure.

SRM client commands could be used in a normal user's work, but examination of how SRM services fit into higher-level file/replica catalogue services probably needs to be considered more fully; this was outside the scope of the DPM evaluation.
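To give a flavour of that normal usage, the sketch below drives the standard DPM client commands from Python. Hostnames and paths are placeholders; DPNS_HOST and DPM_HOST are the environment variables the client tools use to locate the head node, and the /dpm/<domain>/home/<vo>/ layout follows the usual DPM namespace convention. Consult the UI documentation for the definitive command set and options.

    import os
    import subprocess

    # Placeholder head node; the client tools find it via these variables.
    os.environ['DPNS_HOST'] = 'dpm-head.example.ac.uk'
    os.environ['DPM_HOST'] = 'dpm-head.example.ac.uk'

    def run(*cmd):
        """Echo and run one of the DPM client commands from the UI."""
        print('$', ' '.join(cmd))
        subprocess.run(cmd, check=True)

    # Make a directory in the name server, copy a local file into the
    # pool over RFIO, then list the result.
    run('dpns-mkdir', '/dpm/example.ac.uk/home/ngs/demo')
    run('rfcp', 'local-data.dat',
        '/dpm/example.ac.uk/home/ngs/demo/local-data.dat')
    run('dpns-ls', '-l', '/dpm/example.ac.uk/home/ngs/demo')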
5 Developer Experience

Not considered.

6 Technical

Any software product will build upon a set of established technologies that may have established industrial support. The stability of the proposed solution needs to be examined with a view to deployment.

6.1 Architecture

How well does the software map to a SOA?

SRM is a web services protocol and the interfaces are implemented as Web Services, but the question of how well the software maps to a Service Oriented Architecture was not addressed by the evaluation. DPM does have a management service, which is intended only for local support.

6.2 Standards & Specifications

The evaluation activity did not consider questions about the details of the SRM standard specification, or investigate DPM's support for different versions of the SRM specification (DPM provides SRM1 and SRM2). It would be interesting to know about differences between the SRM implementations provided by DPM, CASTOR and dCache, but again this was outside the scope of the evaluation activity. SRM v1.1 is widely used; v2.2 is agreed and is supposed to be used by LCG from 2007 Q2; v3.0 is defined but not yet implemented.

6.3 Security

The DPM/SRM services use GSI and httpg.

7 Specific evaluation criteria for DPM with respect to the NGS

How does SRM fit into the NGS middleware ecology?

SRM fits into data services for the NGS. It could be seen as a replacement for SRB (there is discussion, and possibly initial work, on an SRM interface to SRB), or as a service which can coexist with SRB, the two providing 'islands' of grid data services.

8 Conclusions

Summary that describes the key features of the product, its perceived strengths and weaknesses/drawbacks, and the principal issues that would need to be addressed before deployment within the NGS.

Can the software be deployed within the NGS environment?

Yes. DPM services are able to support the NGS as a VO, so deployment of the software in the NGS environment should be straightforward.

Can outgoing and incoming TCP ports be restricted to a specific range or identified ports?

Yes.

Would deployment of this software on the NGS require any changes to the existing software or introduce new software dependencies beyond the deployment of the new middleware?

The only successful deployments of DPM were via RPMs on Red Hat Enterprise Linux/Scientific Linux installations. The best approach for NGS sites might be to provision additional servers on which to deploy DPM services to stand in front of disk resources to be managed as pools.

The NGS should avail itself of the expertise that exists within the GridPP Storage group in planning, executing and supporting the rollout of SRM within the NGS. The NGS DPM services could be “plugged into” the GridPP Storage group testing infrastructure and publish themselves to appropriate Index Information servers. Storage accounting work has been done and should be adopted by the NGS.

References

[1] UK Grid Engineering Task Force (ETF), http://www.grids.ac.uk/ETF/
[2] Disk Pool Manager (DPM), https://twiki.cern.ch/twiki/bin/view/LCG/DpmGeneralDescription
[3] European Organization for Nuclear Research (CERN), http://cern.ch/
[4] UK National Grid Service (NGS), http://www.ngs.ac.uk/
[5] Storage Resource Management Working Group, http://sdm.lbl.gov/srm-wg/
[6] CERN Advanced STORage manager (CASTOR), http://castor.web.cern.ch/castor/
[7] dCache, http://www.dcache.org/
[8] Steven Newhouse, Alex Hardisty, David Berry, Bruce Beckles, Jonathan Giddy, Mark McKeown, David Wallom, and Neil Geddes, ETF Grid Middleware Evaluation Criteria, version 1.0, February 2005.
[9] gLite, http://glite.web.cern.ch/glite/
[10] LHC Computing Grid (LCG), http://lcg.web.cern.ch/lcg/
[11] WLCG SRM usage page on the GridPP wiki, http://www.gridpp.ac.uk/wiki/WLCG_SRM_usage
[12] Disk Pool Manager CVS Repository, http://isscvs.cern.ch:8180/cgi-bin/cvsweb.cgi/LCG-DM/?cvsroot=lcgware
[13] EGEE Software License, http://public.eu-egee.org/license/license2.html
[14] GNU General Public License, version 2 (GPLv2), http://www.gnu.org/copyleft/gpl.html
[15] DPM-on-Solaris page on the GridPP wiki, http://www.gridpp.ac.uk/wiki/DPM-on-Solaris
[16] DPM Administrators Guide, https://twiki.cern.ch/twiki/bin/view/LCG/DpmAdminGuide
[17] DPM Information Publishing page on the GridPP wiki, http://www.gridpp.ac.uk/wiki/DPM_Information_Publishing
[18] WLCG Collaboration Workshop, 22-26 January 2007, http://indico.cern.ch/sessionDisplay.py?sessionId=13&slotId=0&confId=3738#2007-01-23