GridPP: A Project Overview

David Britton, Imperial College (on behalf of the GridPP collaboration)

Abstract

UK particle physicists are responding to the challenges set by the turn-on of the Large Hadron Collider (LHC) in 2007 by developing a Computing Grid in the UK. Increasingly large amounts of Monte Carlo data are required in preparation for Physics and Computing Technical Design Reports and to assure readiness for the stream of real data. The running experiments in the US are already providing UK physicists with unprecedented amounts of data, which provides both an immediate demand for the level of computing available from a Grid and an excellent arena in which to prepare for the LHC. GridPP is a £17m, 3-year project with the goal of establishing a prototype Grid for the UK Particle Physics community in close collaboration with the European DataGrid and the LHC Computing Grid project at CERN. Now, at the mid-point of the project, the move "from Web to Grid" is well underway and a pervasive prototype Grid testbed has been established. The GridPP project is a complex synthesis of hardware infrastructure, middleware and application development, and grid deployment, all coordinated in an international context. Many hurdles remain in the areas of scalability, robustness, security, accessibility, and functionality before the prototype Grid is complete. A second three-year project, GridPP2, has been proposed to then develop the Grid from "prototype to production" in the run-up to the start of the LHC.

Overview

GridPP [1] may be described in terms of the seven high-level areas shown at the top of Figure-1. Each area has a number of component elements shown by the lower boxes in the figure.

Figure-1: Project Areas and Elements. (The GridPP goal, "to develop and deploy a large scale science Grid in the UK for the use of the Particle Physics community", is broken down into seven areas, CERN, DataGrid, Applications, Infrastructure, Interoperability, Dissemination, and Resources, each with its numbered component elements.)

The first four areas (CERN, DataGrid, Applications, and Infrastructure) represent areas in which GridPP invests significant resources. The latter three areas (Interoperability, Dissemination, and Resources) are management areas monitored at a high level to ensure the project evolves optimally in the national and international context. The pie chart in Figure-2 shows the distribution of project resources. The Operations category accounts for Management and Travel costs.
Figure-2: Project Resource Distribution as of 6 May 2003 (CERN £5.67m, DataGrid £3.8m, Applications £2.08m, Infrastructure £3.67m, Operations £1.78m).

CERN/LCG

GridPP funding was one of the major stimuli that started the LHC Computing Grid project (LCG) at CERN, funding twenty-four three-year posts and £1.3m of hardware. The LCG project [2] is a Grid deployment project, organized internally in four areas: Applications, Fabric, Technology, and Deployment.

The Applications area covers development of those parts of the LHC applications that are common across the experiments. Notable projects in this area are POOL (a common persistency framework); PI (a Physicist Interface project that encompasses the interfaces and tools by which the physicists will use the software); SEAL (providing the core libraries and services); SIMU (putting together a generic simulation framework and infrastructure); and SPI (a Software Process and Infrastructure project that will provide a common environment for physics software development).

The LCG Fabric area is charged with prototyping the Tier-0/1 centre at CERN. Computing for the LHC follows a hierarchical model of Tier centres, starting with a Tier-0 centre at CERN where data are reconstructed, complemented by a network of national Tier-1 centres with substantial computing resources for selection, simulation and archiving of data, and then by more numerous Tier-2 centres that serve as regional analysis centres. In practice, there will be a combined Tier-0/1 centre at CERN. The main issues for the LCG Fabric area are associated with scaling hardware and software management up to the level of thousands of servers and hundreds of Terabytes of disk.

The LCG Grid Technology group determines the overall project requirements, tracks external technology and middleware developments, and makes recommendations. In particular, a strong relationship with the European DataGrid (EDG) project has been established and the first LCG release is largely based on EDG middleware. This first LCG release also contains elements of the Virtual Data Toolkit (VDT) from the US.

The LCG Deployment group is responsible for deploying and operating the LHC computing environment. This includes system support, Grid operations, and user support for a worldwide Grid. The LCG project will produce a series of deployments, the first based upon the EDG middleware and later ones likely to rely on the FP6 EGEE project. The timeline is illustrated in Figure-3.

Figure-3: Deployment Time-Line (middleware, deployment, and operations milestones: EDG 1.x prototypes based on GT2 and Condor-G, LCG-1 based on EDG-2 in 2003, followed by LCG-2 and LCG-3, with middleware from the EGEE project from 2004 and a migration towards OGSA and GGF standards for the production systems of 2006-2008).

In the future, GridPP expects to rely on the LCG releases to provide the Grid middleware in the UK. This policy ensures that the UK Grid is genuinely part of the global Grid infrastructure envisaged for the LHC and will give UK physicists the most immediate and sophisticated access to LHC data. The current status is that the first LCG release (LCG-1) is being rolled out. The initial LCG Grid Operations Centre is at Rutherford Appleton Laboratory, which is also one of the initial set of ten sites worldwide deploying the release.
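To give a flavour of what access to such a Grid looks like from the user side, the sketch below builds a minimal job description in the EDG Job Description Language (JDL) and hands it to the EDG command-line tools packaged in these releases. It is illustrative only: the script and sandbox file names are invented, it assumes a User Interface machine with the edg-job-submit command installed, and the exact options (for example, specifying the virtual organisation) vary between releases; real jobs would normally also carry Requirements and Rank expressions for the resource broker to match against the information system.

```python
"""Illustrative sketch only: submitting a simple job through an EDG-style
resource broker from a User Interface machine. The script and file names
are hypothetical; command options differ between middleware releases."""
import subprocess
import tempfile

# A minimal JDL document: run a small script shipped in the input sandbox
# and retrieve its stdout/stderr in the output sandbox afterwards.
JDL_TEXT = """\
Executable    = "analysis.sh";
StdOutput     = "analysis.out";
StdError      = "analysis.err";
InputSandbox  = {"analysis.sh"};
OutputSandbox = {"analysis.out", "analysis.err"};
"""


def submit(jdl: str) -> str:
    """Write the JDL to a temporary file and submit it, returning the
    broker's reply, which contains the job identifier used for later
    status queries and output retrieval."""
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(jdl)
        jdl_path = f.name
    # The resource broker matches the job against resources published in
    # the information system and forwards it to a suitable computing element.
    result = subprocess.run(["edg-job-submit", jdl_path],
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(submit(JDL_TEXT))
```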
European DataGrid

The UK is one of six major partners (with an additional 15 associate partners) in the European DataGrid (EDG) project [3], which started in January 2001 with the goal of developing and testing a globally distributed technological infrastructure in the form of a computing Grid. Over the last year the EDG has increased its focus on quality and successfully passed the second EU review in February. Currently, the EDG2.0 release is in the final stages of integration and forms the basis of the LCG-1 release. One final release is planned before the project is completed in early 2004.

The project is organized into a number of workpackages, of which eight (WP1-WP8) have relevance to High Energy Physics. The UK, through GridPP, currently provides the leader or deputy leader of five of these eight workpackages and is active in all of them.

In WP1 (Workload Management), effort in the UK is directed at installing and supporting resource brokers for the UK testbeds. There is also active work on defining quality assurance criteria, and a link to work, within the context of the core e-Science programme, on making an OGSA-compliant resource broker.

WP2 (Data Management) provides the Replica Manager in EDG2.0, which offers three services: a (local) replica metadata catalogue, a replica location service, and a replica optimization service.

Figure-4: Replica Manager in EDG2.0 (the Replica Manager on a user interface or worker node interacting with the replica metadata catalogue, replica location service, and replica optimization service, together with the storage elements, SE and network monitors, the information service, the resource broker, and the Virtual Organization Membership Service).

The UK has contributed to WP2 in two main areas. Firstly, a package, OptorSim, that simulates the DataGrid architecture and allows optimization of strategies for file replication. This has shown that, with minimal tuning, an "Economic" model provides at least as good, and frequently better, performance than the best simple file replication strategies. Secondly, a package called Spitfire has been developed that allows secure access to metadata for Grid middleware.

WP3 (Information and Monitoring Services) is a UK-led and UK-dominated workpackage that has developed R-GMA, a relational implementation of the Grid Monitoring Architecture from the GGF. Information producers register themselves in a registry that is then consulted by information consumers, giving the impression of a single RDBMS for the whole virtual organization. This has been implemented within EDG2.0 with LDAP interface classes, allowing information from LDAP information providers to flow via R-GMA to LDAP information servers. R-GMA has been used in WP7 to provide network-monitoring information and has also been implemented in the CMS application to provide monitoring.

WP4 (Fabric Management) in EDG is based upon the LCFG configuration software from the University of Edinburgh (EDG2.0 uses a modified version called LCFGng).

Figure-5: The LCFG Architecture (LCFG source files compiled by mkxprof into XML profiles published via a web server, fetched by rdxprof into a DBM file on each client node, and acted upon by the LCFG components).

WP5 (Mass Storage Management) is another UK-led and UK-dominated workpackage. The issue here is to provide transparent access for Grid clients to data storage. This is implemented by a Storage Element contained in EDG2.0, which has interfaces to a number of mass storage systems. The core of the Storage Element is flexible and extensible, making it easy to support new protocols, features, and mass storage systems.
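That flexibility comes from separating the Grid-facing core from the mass-storage-specific back ends. The following minimal sketch is not EDG code (all class and method names are invented for illustration), but it shows one way such a pluggable design can be expressed: supporting a new mass storage system only means writing a new back end, not modifying the core.

```python
"""Illustrative sketch of a pluggable storage interface in the spirit of the
WP5 Storage Element: a thin Grid-facing core delegating to interchangeable
mass-storage back ends. All names are hypothetical."""
from abc import ABC, abstractmethod


class MassStorageBackend(ABC):
    """What the Storage Element core needs from any mass storage system."""

    @abstractmethod
    def stage_in(self, filename: str) -> str:
        """Bring a file online and return a local path or transfer URL."""

    @abstractmethod
    def stage_out(self, local_path: str, filename: str) -> None:
        """Store a file into the mass storage system."""


class DiskPoolBackend(MassStorageBackend):
    """Trivial disk-only back end."""

    def __init__(self, root: str):
        self.root = root

    def stage_in(self, filename: str) -> str:
        return f"{self.root}/{filename}"

    def stage_out(self, local_path: str, filename: str) -> None:
        print(f"copy {local_path} -> {self.root}/{filename}")


class TapeBackend(MassStorageBackend):
    """Placeholder for a tape system, where staging may be slow."""

    def stage_in(self, filename: str) -> str:
        print(f"recall {filename} from tape ...")
        return f"/cache/{filename}"

    def stage_out(self, local_path: str, filename: str) -> None:
        print(f"migrate {local_path} to tape as {filename}")


class StorageElement:
    """Grid-facing core: adding a new mass storage system requires only a
    new MassStorageBackend implementation, not changes to this class."""

    def __init__(self, backend: MassStorageBackend):
        self.backend = backend

    def get(self, filename: str) -> str:
        return self.backend.stage_in(filename)

    def put(self, local_path: str, filename: str) -> None:
        self.backend.stage_out(local_path, filename)


if __name__ == "__main__":
    se = StorageElement(TapeBackend())
    print(se.get("run123/event_data.root"))
```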
Figure-6: WP5 Storage Element (Grid clients exchanging control, information, and data-transfer traffic with a Storage Element that fronts several mass storage systems).

WP6 is the testbed workpackage, responsible for deploying and testing the middleware releases. At any time in the UK there are typically a number of overlapping testbeds (GridPP, EDG, and LCG, together with experiment-specific testbeds). These are served by resource brokers located at Imperial College and, in the case of the EDG testbed, a resource broker in Lyon. Currently, the main issue is the transition from the application testbed running the EDG1.4x release to the LCG-1 testbed based on EDG2.0. A snapshot (25 July 2003) of the UK testbed status is shown below.

Figure-7: The UK Testbed.

The variable status of the nodes reflects the fact that this is a testbed, run largely on a best-effort basis, and not a production Grid. Nevertheless, all the testbed sites can be said to be truly on a Grid by virtue of their registration in a comprehensive resource information publication scheme, their accessibility via a set of globally enabled resource brokers, and the use of one of the first scalable mechanisms to support distributed virtual communities (VOMS). There are few such Grids in operation in the world today.

WP7 (Network Services) in the strict EDG context covers the areas of testbed infrastructure, network and transport services, network monitoring, and Grid security. From a UK perspective there has been a wide range of activities, including participation in joint projects and demonstrations, Grid middleware development, high-speed data transport, provision of network performance monitoring using R-GMA and diagnostic services, and piloting the benefits of "better than best efforts" IP services. There has also been involvement in pivotal work that has led to the inception of UKLIGHT, a new optical network research infrastructure.

Figure-8: TCP throughput monitoring using R-GMA.

The EDG HEP applications workpackage, WP8, is led by the UK and works with the LHC experiments to grid-enable the high-energy physics applications. The UK developments are covered in the next section, where the application developments funded by GridPP, involving both LHC and non-LHC experiments, are described.

Applications

The ultimate goal of GridPP is to provide a Grid for use by particle physicists in the UK. The experiment collaborations to which they belong are typically worldwide enterprises, often involving tens of countries, hundreds of institutions and up to several thousand physicists. The global nature and sheer scale of these collaborations mean that GridPP must ensure that the UK Grid is fully compatible with our partners (and thus "Interoperability" appears at a high level in the Project Map shown in Figure-1). GridPP also cannot expect to develop applications of this type single-handedly, but has tried to link into as many of the applications as possible, primarily by funding the development of interfaces to the Grid [4]. GridPP has also encouraged, with some success, joint work between different application groups.

One such initiative has been the GANGA project, a joint ATLAS/LHCb user-Grid interface that will allow configuration, submission, monitoring, bookkeeping, output collection, and reporting of Grid jobs. Implemented using a Python software bus, this is a layer that sits on top of the Grid middleware and communicates with the individual experiment applications.
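The software-bus idea is that experiment-specific knowledge (how to configure a Gaudi- or Athena-based job, say) and Grid-specific knowledge (how to submit to a particular middleware or batch system) live in separate plug-in components, with a thin Python layer passing job objects between them. The sketch below is a much-simplified illustration of that layering; it is not GANGA code and all class names are invented.

```python
"""Illustrative sketch of a GANGA-style Python 'software bus': a job object
is configured by an experiment-specific application plug-in and handed to an
interchangeable submission back end. Names are hypothetical, not GANGA's API."""
from dataclasses import dataclass, field


@dataclass
class Job:
    """What the user manipulates: application settings plus a chosen back end."""
    executable: str = ""
    arguments: list = field(default_factory=list)
    backend: "Backend" = None
    status: str = "new"


class Application:
    """Experiment-specific plug-in: turns user options into an executable
    and its arguments (e.g. for a Gaudi-based job)."""

    def __init__(self, options_file: str):
        self.options_file = options_file

    def configure(self, job: Job) -> None:
        job.executable = "run_experiment_app.sh"
        job.arguments = ["--options", self.options_file]


class Backend:
    """Grid- or batch-specific plug-in: knows how to submit and monitor."""

    def submit(self, job: Job) -> None:
        raise NotImplementedError


class LocalBackend(Backend):
    def submit(self, job: Job) -> None:
        print(f"running {job.executable} {' '.join(job.arguments)} locally")
        job.status = "submitted"


class GridBackend(Backend):
    def submit(self, job: Job) -> None:
        # In reality this step would write a job description and call the
        # middleware's resource broker on the user's behalf.
        print(f"submitting {job.executable} to the Grid via a resource broker")
        job.status = "submitted"


def submit(job: Job, app: Application) -> Job:
    """The 'bus': route the job through the application plug-in, then the
    chosen back end. Swapping back ends needs no change to the user's script."""
    app.configure(job)
    job.backend.submit(job)
    return job


if __name__ == "__main__":
    j = Job(backend=GridBackend())
    submit(j, Application("myAnalysis.opts"))
    print(j.status)
```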
CMS in the UK have taken a different route, producing a lightweight portal demonstrator, GUIDO, that allowed the submission of CMS jobs to the Grid testbed. Whilst the original portal was specific to the pre-Grid CMS applications, it has now been generalized to provide a simple submission portal for any self-contained job.

UK effort on the three LHC experiments has also been directed at various other development areas. In ATLAS, major contributions have been made to ATCOM, the ATLAS Commander Monte Carlo production tool that provides users with the ability to perform on-demand Monte Carlo production of specific data sets; this tool will eventually be based on GANGA, described earlier. In LHCb, there have been contributions to the DIRAC personal interface and to the LCG persistency framework POOL. CMS have used the WP3 product R-GMA to enable monitoring in the CMS applications and are currently leading the development of the CMS analysis framework. A major effort for all three LHC experiments has been the ongoing data challenges and, as will be described in a later section, the UK has made dominant contributions to the early data challenges for all three experiments.

Turning to the non-LHC applications, a very successful joint initiative has been the CDF/D0 SAM Grid project. Originally used by the D0 experiment, SAM allowed users worldwide to schedule data transfers from a central repository at Fermilab for local analysis jobs that would, on completion of the transfer, execute automatically. GridPP funded joint work to adapt this tool for use by CDF and to make it more Grid-like. SAM Grid is now based on Globus/Condor (vital elements of the VDT used in the LCG-1 release) and allows both experiments to move either jobs to data or, as in the original SAM, data to jobs. Deployed on three continents, SAM Grid provides a genuinely functional Grid for the Fermilab experiments.

Figure-9: SAM Grid for the CDF and D0 experiments.

The BaBar experiment, based at SLAC, now has large amounts of data and a pressing need for computing resources. Here, the complication is to move in an adiabatic manner to a Grid without disrupting the ongoing data processing. The experiment is in the process of re-defining its data model, to which GridPP effort is contributing, and there is also a joint post with CMS funded to work on the POOL persistency framework. Work is also in progress testing BaBar job submission via a resource broker at Imperial.

The UKQCD collaboration aims to use the Grid for QCD calculations and, with the help of GridPP, is developing both the application, based on the EDG middleware, and its Grid interface. Currently, sites at Swansea, Liverpool, Edinburgh and RAL are connected in a Grid.

Infrastructure

GridPP is developing two levels of resources: a prototype Tier-1 centre at RAL [5] and four distributed Tier-2 centres that will involve practically all HEP institutes in the UK. This structure reflects both the hierarchy of Tier centres described earlier and the UK context. The Tier-1 resource is totally managed by GridPP and allows current international commitments to be met. In contrast, the Tier-2 centres rely on resources funded from non-PPARC sources (primarily SRIF and JREI) and, typically, are shared with other disciplines. Eventually, the integrated Tier-2 resources are likely to be considerably larger than the Tier-1 centre, but from a management point of view there is considerable risk, and worse, uncertainty, associated with them.
This illustrates one of the fundamental challenges of the Grid: the need to pull resources on to the Grid in a managed way, in addition to pushing out wholly owned resources.

The current resources at RAL (~500 CPUs, ~80 TB of usable disk, and 180 TB of tape) provide an integrated service as a Tier-1 centre for the LHC and a Tier-A centre for BaBar. The weekly usage of the Tier-1/A centre this calendar year is shown in the following figure.

Figure-10: Weekly CPU usage at the Tier-1/A. The purple area represents the BaBar Tier-A usage; the green area shows the tail end of the LHCb data challenge; the large brown component in the penultimate bar represents the start of the latest CMS production.

All three LHC experiments have completed their first major data challenge and CMS is currently preparing data for a second. The UK has made significant contributions to all of these, not only through the Tier-1 resources but also using the Tier-2 resources. For LHCb, one third of all the events were produced in the UK, with the largest single contribution coming from the Tier-2 resources at Imperial. For the very early CMS data challenge, the UK was the third largest contributor after CERN and the combined US, again with Imperial producing the largest contribution. In the second phase of the recent ATLAS data challenge, the UK was the largest producer worldwide. As can be seen in the figure below, the Tier-1 centre provided the largest contribution but four Tier-2 sites made very significant additions.

Figure-11: UK contributions to phase-2 of the ATLAS data challenge.

To date, the data challenges, like the BaBar Tier-A usage, have not been performed in a Grid-like manner. However, the UK testbed presented earlier has been developed in parallel and is a real, functional Grid, albeit very much a prototype. We now enter a transition period in which the LCG-1 release will be more widely deployed and the experiments will increasingly rely on Grid technology to perform their ongoing work. GridPP is very conscious of the need to roll out this Grid in a controlled manner and with the necessary support mechanisms. This will be the prime focus of the second half of the GridPP project.

GridPP2

The current GridPP project will end in September 2004, but a follow-on project, GridPP2 [6], has recently been proposed to cover the period up to September 2007. At that point the LHC should be turning on and a production Grid will be needed. At a high level, GridPP2 is designed to move the UK Grid from prototype to production in phase with the releases planned from the current and future LCG projects (LCG2 will follow on from LCG in 2005). The speed at which UK physicists will be able to access data from the LHC in an efficient way will depend on continuing the close relationship between GridPP and LCG. The proposed investment in LCG from GridPP2 is less than half of that from GridPP1, but matches both the needs of the LCG2 project and the level at which the UK might be expected to contribute based on CERN membership.

The hardware requirements for the LHC experiments in the UK have been profiled (albeit with considerable uncertainty) and a plan developed to meet these needs within the context of GridPP2. About half of these requirements will be met through a totally managed Tier-1 resource at RAL. The remainder is assumed to become available through the Tier-2 centres described earlier.
Although GridPP2 will not be directly investing in Tier-2 hardware, it will provide funding for posts to integrate these resources into the Grid and some support for operations and maintenance. The total hardware resources planned for 2007 are shown in Figure-12.

Figure-12: Hardware planned by 2007.

As at present, Applications will be developed in collaboration with the experiments, with the emphasis in GridPP2 on providing the interface to the Grid and support for cross-experiment projects.

The greatest challenge for GridPP2 will be to successfully move from a prototype to a production Grid, challenging the boundaries of scale, functionality, and robustness. A dedicated Production Team will be set up, with a member in each of the Tier centres, under the leadership of an Operations Manager whose prime function will be to oversee the technical deployment of the Grid. A provisional outline of the GridPP2 project, using the Project Map format, is shown below.

Figure-13: A provisional map of the GridPP2 Project. (The GridPP2 goal, "to develop and deploy a large scale production quality Grid in the UK for the use of the Particle Physics community", is broken down into seven high-level areas, LCG, Development, LHC Applications, Non-LHC Applications, Production Grid, Management, and Dissemination, each with its component elements.)

References

[1] http://www.gridpp.ac.uk/
[2] http://lcg.web.cern.ch/LCG/
[3] http://eu-datagrid.web.cern.ch/eu-datagrid/
[4] See the links available from http://www.gridpp.ac.uk/eb/applications.html
[5] http://www.gridpp.ac.uk/tier1a/
[6] http://www.gridpp.ac.uk/docs/gridpp2/