GridPP: A Project Overview
David Britton, Imperial College.
(On behalf of the GridPP collaboration)
Abstract
UK particle physicists are responding to the challenges set by the turn-on of the Large Hadron Collider
(LHC) in 2007 by developing a Computing Grid in the UK. Increasingly large amounts of Monte Carlo
data are required in preparation for Physics and Computing Technical Design Reports and to assure
readiness for the stream of real data. The running experiments in the US are already providing UK
physicists with unprecedented amounts of data, which both creates an immediate demand for the level of
computing available from a Grid and provides an excellent arena in which to prepare for the LHC.
GridPP is a £17m, 3-year project with the goal of establishing a prototype Grid for the UK Particle
Physics community in close collaboration with the European DataGrid and the LHC Computing Grid
project at CERN. Now, at the mid-point of the project, the move "from Web to Grid" is well underway
and a pervasive prototype Grid testbed has been established. The GridPP project is a complex synthesis of
hardware infrastructure, middleware and application development, and grid deployment, all coordinated
in an international context. Many hurdles remain in the areas of scalability, robustness, security,
accessibility, and functionality before the prototype Grid is complete. A second three-year project,
GridPP2, has been proposed to then develop the Grid from "prototype to production" in the run-up to the
start of the LHC.
Overview
GridPP [1] may be described in terms of the seven high-level areas shown at the top of Figure-1. Each
area has a number of component elements shown by the lower boxes in the figure.
Figure-1: Project Areas and Elements (the GridPP project map: the goal "To develop and deploy a large scale science Grid in the UK for the use of the Particle Physics community" is broken down into seven high-level areas, CERN, DataGrid, Applications, Infrastructure, Interoperability, Dissemination and Resources, each with its component elements, such as LCG creation, the EDG workpackages, the experiment interfaces (ATLAS/LHCb GANGA/Gaudi, CMS Monte Carlo system, BaBar data analysis, CDF/D0 SAM framework, UKQCD), the Tier-1, Tier-2 and Tier-A centres, the UK testbed and rollout, and the data challenges).
The first four areas (CERN, DataGrid,
Applications, and Infrastructure) represent areas
in which GridPP invests significant resources.
The latter three areas (Interoperability,
Dissemination, and Resources) are management
areas monitored at a high level to ensure the
project evolves optimally in the national and
international context. The pie chart in Figure-2
shows the distribution of project resources. The
Operations category accounts for Management
and Travel costs.
Figure-2: Project Resource Distribution (CERN £5.67m, DataGrid £3.8m, Infrastructure £3.67m, Applications £2.08m, Operations £1.78m).
CERN/LCG

GridPP funding was one of the major stimuli that started the LHC Computing Grid Project (LCG) at CERN, funding twenty-four three-year posts and £1.3m of hardware. The LCG project [2] is a Grid deployment project, organized internally in four areas: Applications, Fabric, Technology, and Deployment. The Applications area covers development of those parts of the LHC applications that are common across the experiments. Notable projects in this area are POOL (a common persistency framework); PI (a Physicist Interface project that encompasses the interfaces and tools by which the physicists will use the software); SEAL (providing the core libraries and services); SIMU (putting together a generic simulation framework and infrastructure); and SPI (a Software Process and Infrastructure project that will provide a common environment for physics software development).

The LCG Fabric area is charged with prototyping the Tier-0/1 centre at CERN. Computing for the LHC follows a hierarchical model of Tier centres, starting with a Tier-0 centre at CERN where data are reconstructed, complemented by a network of national Tier-1 centres with substantial computing resources for selection, simulation and archiving of data, and then by more numerous Tier-2 centres that serve as regional analysis centres. In practice, there will be a combined Tier-0/1 centre at CERN. The main issues for the LCG Fabric area are associated with scaling hardware and software management up to the level of thousands of servers and hundreds of Terabytes of disk.

The LCG Grid Technology group determines the overall project requirements, tracks external technology and middleware developments, and makes recommendations. In particular, a strong relationship with the European DataGrid (EDG) project has been established and the first LCG release is largely based on EDG middleware. This first LCG release also contains elements of the Virtual Data Toolkit (VDT) from the US.

The LCG Deployment group is responsible for deploying and operating the LHC computing environment. This includes system support, Grid operations, and user support for a worldwide Grid. The LCG project will produce a series of deployments, the first based upon the EDG middleware and later ones likely to rely on the FP6 EGEE project. The timeline is illustrated in Figure-3.

Figure-3: Deployment Time-Line (the middleware, deployment and operations tracks from 2003 to 2008: releases progress from Prototype-1, based on EDG 1.x with GT2 and Condor-G, through Prototype-2/LCG-1 (EDG-2), Prototype-3/LCG-2 and Production-1/LCG-3 to Production-2 and Production-3; middleware provision migrates from EDG to the EGEE project, user communities broaden from HEP to BIOMED and others, and the middleware migrates towards OGSA following GGF standards).
In the future, GridPP expects to rely on the LCG
releases to provide the Grid middleware in the
UK. This policy ensures that the UK Grid is
genuinely part of the global Grid infrastructure
envisaged for the LHC and will enable UK
physicists the most immediate and sophisticated
access to LHC data. The current status is that
the first LCG release (LCG-1) is being rolled
out. The initial LCG Grid Operations Centre is
at Rutherford Appleton Lab, which is also one
of the initial set of ten sites worldwide
deploying the release.
European DataGrid
The UK is one of 6 major partners (with an
additional 15 associate partners) in the
European DataGrid project [3] that started in
January 2001 with the goal of developing and
testing a globally distributed technological
infrastructure in the form of a computing Grid.
Over the last year the EDG has increased focus
on quality and successfully passed the 2nd EU
review in February. Currently, the EDG2.0
release is in the final stages of integration and
forms the basis of the LCG-1 release. One final
release is planned before the project is
completed in early 2004.
The project is organized into a number of
workpackages, of which eight (WP1-WP8) have
relevance to High Energy Physics. The UK,
through GridPP, currently provides the leader or deputy leader of five of these eight workpackages and is active in all of them.
In WP1 (Resource Management), effort in the
UK is directed at installing and supporting
resource brokers for the UK testbeds. There is
also active work on defining quality assurance criteria and a link to work, within the context of the core e-science programme, on making an OGSA-compliant resource broker.
Provided by WP2 (Data Management), the
Replica Manager in EDG2.0 provides three
services: A (local) replica metadata catalogue, a
replica location service, and a replica
optimization service.
Figure-4: Replica Manager in EDG2.0 (the Replica Manager ties together the Replica Metadata Catalog, the Replica Location Service and the Replica Optimization Service, and interacts with Storage Elements, the SE and Network Monitors, the Information Service and the Virtual Organization Membership Service).
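As a purely illustrative sketch of how the three replica services divide the work, the Python fragment below mimics the split between catalogue, location and optimization; every class, method and URL here is invented for this example and does not correspond to the real EDG interfaces.

# Conceptual sketch of the EDG2.0 Replica Manager's three services.
# All names are illustrative; they are not the actual EDG API.

class ReplicaMetadataCatalogue:
    """Maps logical file names (LFNs) to globally unique identifiers (GUIDs)."""
    def __init__(self):
        self.lfn_to_guid = {}
    def register(self, lfn, guid):
        self.lfn_to_guid[lfn] = guid
    def lookup(self, lfn):
        return self.lfn_to_guid[lfn]

class ReplicaLocationService:
    """Maps a GUID to the physical replicas (storage URLs) that hold it."""
    def __init__(self):
        self.replicas = {}
    def add_replica(self, guid, surl):
        self.replicas.setdefault(guid, []).append(surl)
    def list_replicas(self, guid):
        return self.replicas.get(guid, [])

class ReplicaOptimizationService:
    """Picks the 'best' replica, e.g. by estimated network cost to the client."""
    def best_replica(self, surls, cost):
        return min(surls, key=cost)

class ReplicaManager:
    """Facade seen by Grid clients, coordinating the three services."""
    def __init__(self, rmc, rls, ros):
        self.rmc, self.rls, self.ros = rmc, rls, ros
    def best_copy(self, lfn, cost):
        guid = self.rmc.lookup(lfn)
        return self.ros.best_replica(self.rls.list_replicas(guid), cost)

# Usage: register one logical file with two replicas and ask for the cheapest.
rmc, rls, ros = ReplicaMetadataCatalogue(), ReplicaLocationService(), ReplicaOptimizationService()
rm = ReplicaManager(rmc, rls, ros)
rmc.register("lfn:/atlas/dc1/higgs.root", "guid-0001")
rls.add_replica("guid-0001", "gsiftp://ral.example.ac.uk/higgs.root")
rls.add_replica("guid-0001", "gsiftp://cern.example.ch/higgs.root")
print(rm.best_copy("lfn:/atlas/dc1/higgs.root", cost=lambda surl: 1 if "ral" in surl else 10))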
The UK has contributed to WP2 in two main areas. Firstly, a package called OptorSim simulates the architecture of the EDG and allows optimization of strategies for file replication. This has shown that, with minimal tuning, an “Economic” model provides at least as good, and frequently better, performance than the best simple file replication strategies. Secondly, a package called Spitfire has been developed that allows secure access to metadata for Grid middleware.
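The flavour of an “Economic” replication strategy can be conveyed by a toy decision rule, sketched below: replicate a file only when its predicted future value to the site exceeds the cost of transferring it plus the value of the least useful file that would be evicted. The functions and numbers are invented for illustration and are not OptorSim's actual algorithm.

# Toy version of an "economic" replication decision; not OptorSim's real model.

def predicted_value(access_history, window=100):
    """Crude prediction: the value of holding a file ~ its recent local access count."""
    return sum(access_history[-window:])

def should_replicate(candidate_accesses, transfer_cost, cached_files):
    """Replicate if the candidate is worth more than the transfer plus what we would evict."""
    gain = predicted_value(candidate_accesses)
    eviction_cost = min(predicted_value(h) for h in cached_files.values()) if cached_files else 0
    return gain > transfer_cost + eviction_cost

# Example: a file accessed 12 times recently, a modest transfer cost, one cold cache entry.
cache = {"lfn:/dc1/old_sample": [0, 0, 1]}
print(should_replicate([1] * 12, transfer_cost=5, cached_files=cache))  # True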
WP3 (Information and Monitoring Services) is a UK-led and UK-dominated workpackage that has developed R-GMA, a relational implementation of the Grid Monitoring Architecture from the GGF. Information producers register themselves in a registry that is then consulted by information consumers, giving the impression of a single RDBMS for the whole virtual organization. This has been implemented within EDG2.0 with LDAP interface classes, allowing information from LDAP information providers to flow via R-GMA to LDAP information servers. R-GMA has been used in WP7 to provide network-monitoring information and has also been implemented in the CMS application to provide monitoring.
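The producer/registry/consumer pattern behind R-GMA can be mocked up in a few lines of Python. This is only a sketch of the idea of a virtual-organization-wide relational view; the class names, table name and rows are invented and this is not the R-GMA client API.

# Minimal mock-up of the R-GMA idea: producers publish tuples for a named table,
# and consumers query via the registry as if all tuples lived in one RDBMS.
# Illustrative only; not the real R-GMA API.

from collections import defaultdict

class Registry:
    """Global registry that producers advertise to and consumers consult."""
    def __init__(self):
        self.tables = defaultdict(list)   # table name -> list of producers
    def register_producer(self, table, producer):
        self.tables[table].append(producer)
    def producers_for(self, table):
        return self.tables[table]

class Producer:
    """An information producer, e.g. a network monitor publishing throughput rows."""
    def __init__(self, registry, table):
        self.rows = []
        registry.register_producer(table, self)
    def insert(self, row):
        self.rows.append(row)

class Consumer:
    """A consumer sees the union of all producers' rows, like a single table."""
    def __init__(self, registry):
        self.registry = registry
    def select(self, table, where=lambda row: True):
        return [row for p in self.registry.producers_for(table)
                    for row in p.rows if where(row)]

# Usage: two sites publish throughput measurements; a consumer queries them all at once.
reg = Registry()
ral, ic = Producer(reg, "NetworkThroughput"), Producer(reg, "NetworkThroughput")
ral.insert({"src": "RAL", "dst": "CERN", "mbps": 310})
ic.insert({"src": "IC", "dst": "RAL", "mbps": 540})
print(Consumer(reg).select("NetworkThroughput", where=lambda r: r["mbps"] > 400))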
WP4 (Fabric Management) in EDG is based upon the LCFG configuration software from the University of Edinburgh (EDG2.0 uses a modified version called LCFGng).

Figure-5: The LCFG Architecture (LCFG source files are compiled by mkxprof on a server into XML profiles published via a web server; rdxprof on each client node, whether a user interface, worker node or resource broker, fetches that node's profile into a local DBM file, from which the LCFG components configure the machine).
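The pull model of Figure-5 can be caricatured in a few lines; the function names, profile format and component below are invented for this sketch and do not reproduce the real LCFG tools.

# Caricature of an LCFG-style pull model: a server compiles declarative source
# into per-node XML profiles; each client fetches its profile and hands the
# relevant settings to local components. Names are invented for illustration.

import xml.etree.ElementTree as ET

def compile_profile(node, resources):
    """Server side (mkxprof-like): turn per-node key/value resources into an XML profile."""
    root = ET.Element("profile", name=node)
    for component, settings in resources.items():
        comp = ET.SubElement(root, component)
        for key, value in settings.items():
            ET.SubElement(comp, key).text = str(value)
    return ET.tostring(root, encoding="unicode")

def apply_profile(xml_profile, components):
    """Client side (rdxprof-like): parse the fetched profile and configure each component."""
    root = ET.fromstring(xml_profile)
    for comp in root:
        settings = {child.tag: child.text for child in comp}
        components[comp.tag](settings)   # each component knows how to apply its part

# Usage: one worker node with a single batch-client component.
profile = compile_profile("wn001", {"pbsclient": {"server": "tier1.example.ac.uk"}})
apply_profile(profile, {"pbsclient": lambda s: print("configuring batch client:", s)})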
WP5 (Mass Storage Management) is another
UK led and dominated workpackage. The issue
here is to provide transparent access for Grid
clients to data storage. This is implemented by a
Storage Element contained in EDG2.0, which has interfaces to a number of mass storage systems. The core of the Storage Element is flexible and extensible, making it easy to support new protocols, features, and mass storage systems.
Figure-5: WP5 Storage Element (Grid clients talk to the Storage Element through control, information and data-transfer interfaces, and the Storage Element in turn interfaces to multiple mass storage systems, MSS 1 and MSS 2).
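This kind of flexibility amounts to a plugin-style design: a single Grid-facing front end routing requests to interchangeable storage back ends. The sketch below illustrates that idea only, with invented class names rather than the actual WP5 code.

# Plugin-style sketch of a Storage Element hiding different mass storage
# systems behind one interface. Illustrative only; not the WP5 design.

from abc import ABC, abstractmethod

class MassStorageBackend(ABC):
    """Interface each mass storage system (tape store, disk pool, ...) implements."""
    @abstractmethod
    def get(self, path): ...
    @abstractmethod
    def put(self, path, data): ...

class DiskPoolBackend(MassStorageBackend):
    def __init__(self):
        self.files = {}
    def get(self, path):
        return self.files[path]
    def put(self, path, data):
        self.files[path] = data

class StorageElement:
    """Grid-facing front end: routes control and data requests to the right backend."""
    def __init__(self):
        self.backends = {}
    def register_backend(self, scheme, backend):
        self.backends[scheme] = backend          # supporting a new MSS is one registration
    def put(self, surl, data):
        scheme, path = surl.split("://", 1)
        self.backends[scheme].put(path, data)
    def get(self, surl):
        scheme, path = surl.split("://", 1)
        return self.backends[scheme].get(path)

# Usage: register a disk-pool backend and store/retrieve a file through the SE.
se = StorageElement()
se.register_backend("disk", DiskPoolBackend())
se.put("disk://dc2/events.root", b"...")
print(se.get("disk://dc2/events.root"))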
WP6 is the testbed group responsible for
deploying and testing the middleware releases.
At any time in the UK there are typically a
number of overlapping testbeds (GridPP, EDG,
LCG, together with experiment-specific
testbeds). These are served by resource brokers
located at Imperial College and, in the case of
the EDG testbed, a resource broker in Lyon.
Currently, the main issue is the transition from
the application testbed running the EDG1.4x
release to the LCG-1 testbed based on EDG2.0.
A snapshot (Jul 25th 2003) of the UK testbed
status is shown below.
Figure-6: The UK Testbed
The variable status of the nodes reflects the fact
that this is a testbed, run largely on a best-effort
basis, and not a production Grid. Nevertheless,
all the testbed sites can be said to be truly on a
Grid by virtue of their registration in a
comprehensive resource information publication
scheme, their accessibility via a set of globally
enabled resource brokers, and the use of one of
the first scalable mechanisms to support
distributed virtual communities (VOMS). There are few such Grids in operation in the world today.
WP7 (Network Services) in the strict EDG
context covers the areas of testbed
infrastructure, network and transport services,
network monitoring, and Grid security. From a
UK perspective there have been a wide range of
activities including participation in joint
projects and demonstrations, Grid middleware
development, high speed data transport,
provision of network performance monitoring
using R-GMA and diagnostic services, and
piloting the benefits of “better than best efforts
IP” services. There has also been involvement
in pivotal work that has led to the inception of
UKLIGHT, a new optical network research
infrastructure.
Figure-7: TCP Throughput monitoring using R-GMA.
The EDG HEP applications workpackage, WP8, is led by the UK and works with the LHC experiments to grid-enable the high-energy physics applications. The UK developments are covered in the next section, where the application developments funded by GridPP, involving both LHC and non-LHC experiments, are described.
Applications
The ultimate goal of GridPP is to provide a Grid
for use by particle physicists in the UK. The
experiment collaborations to which they belong
are typically worldwide enterprises, often
involving tens of countries, hundreds of
institutions and up to several thousand
physicists. The global nature and sheer scale of
these collaborations means that GridPP must
ensure that the UK Grid is fully compatible with
our partners (and thus “Interoperability” appears
at a high level in the ProjectMap shown in
Figure-1). GridPP also cannot expect to develop
applications of this type single-handedly but has
tried to link into as many of the applications as
possible by funding primarily the development
of interfaces to the Grid [4]. GridPP has also encouraged, with some success, joint work between different application groups.
One such initiative has been the GANGA
project, a joint ATLAS/LHCb user-Grid
interface that will allow configuration,
submission, monitoring, bookkeeping, output
collection, and reporting of Grid jobs.
Implemented using a Python software bus, this
is a layer that sits on top of the Grid middleware
and communicates with the individual
experiment applications.
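The following toy example suggests what such a layer can look like from the user's side; the Job class, its fields and the backend are invented for illustration and are not the GANGA API.

# Toy job-interface sketch in the spirit of a Python "software bus" layered over
# Grid middleware. The class and backend are invented and do not reproduce GANGA.

class Job:
    """One configured piece of work: an application plus a submission backend."""
    def __init__(self, application, backend, inputs=None):
        self.application = application      # e.g. an experiment executable plus options
        self.backend = backend              # e.g. local batch, or a Grid resource broker
        self.inputs = inputs or []
        self.status = "new"

    def submit(self):
        self.job_id = self.backend.submit(self.application, self.inputs)
        self.status = "submitted"

    def monitor(self):
        self.status = self.backend.status(self.job_id)
        return self.status

class LocalBackend:
    """Trivial stand-in for a real middleware backend."""
    def submit(self, application, inputs):
        print(f"running {application} on {len(inputs)} input files")
        return 1
    def status(self, job_id):
        return "done"

# Usage: the same Job description could be pointed at a different backend
# (for example a resource broker) without changing the experiment-specific parts.
job = Job(application="analysis -o ntuple.root", backend=LocalBackend(),
          inputs=["lfn:/lhcb/dc03/file001"])
job.submit()
print(job.monitor())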
CMS in the UK have taken a different route,
producing a lightweight portal demonstrator
GUIDO that allowed the submission of CMS
jobs to the Grid testbed. Whilst the original portal was specific to the pre-Grid CMS applications, it has now been generalized to provide a simple submission portal for any self-contained job.
UK effort on the three LHC experiments has also gone into various other development areas. In ATLAS, major contributions have been made to ATCOM, the Atlas Commander Monte Carlo production tool, which gives users the ability to perform on-demand Monte Carlo production of specific data sets; this tool will eventually be based on the GANGA framework described earlier. In LHCb, there have been contributions
to the DIRAC personal interface and to the LCG
persistency framework POOL. CMS have used
the WP3 product R-GMA to enable monitoring
in the CMS applications and are currently
leading the development of the CMS analysis
framework. A major effort for all three LHC
experiments has been the ongoing data
challenges and, as will be described in a later
section, the UK has made dominant
contributions to the early data challenges for all
three experiments.
Turning to the non-LHC applications, a very successful joint initiative has been the CDF/D0 SAM Grid project. Originally used by the D0
experiment, SAM allowed users worldwide to
schedule data transfers from a central repository
at Fermilab for local analysis jobs that would,
on completion of the transfer, execute
automatically. GridPP funded joint work to
adapt this tool for use by CDF and to make it
more Grid-like. SAM Grid is now based on
Globus/Condor (vital elements of VDT used in
the LCG-1 release) and allows both experiments
to move either jobs to data or, as in the original
SAM, data to jobs. Deployed on three
continents, SAM Grid provides a genuinely
functional Grid for the Fermilab experiments.
Figure-8: SAM Grid for the CDF and D0 experiments.
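The choice between moving data to jobs (as in the original SAM) and moving jobs to data can be made concrete with a toy cost comparison; the function and numbers below are purely illustrative and are not the SAM Grid scheduling logic.

# Toy illustration of "jobs to data" versus "data to jobs"; not SAM Grid's logic.

def schedule(job_dataset_gb, data_site_free_cpus, wan_gbps=1.0):
    """Prefer running where the data already are; otherwise ship the data first."""
    if data_site_free_cpus > 0:
        return "jobs to data: run at the site holding the dataset"
    transfer_hours = job_dataset_gb * 8 / (wan_gbps * 3600)
    return f"data to jobs: transfer first (~{transfer_hours:.1f} h), then run locally"

print(schedule(job_dataset_gb=500, data_site_free_cpus=0))
print(schedule(job_dataset_gb=500, data_site_free_cpus=32))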
The BaBar experiment based at SLAC now has
large amounts of data and a pressing need for
computing resources. Here, the complication is
to move in an adiabatic manner to a Grid
without disrupting the on-going data processing.
The experiment is in the process of re-defining
its data model, to which GridPP effort is
contributing, and there is also a joint post with
CMS funded to work on the POOL persistency
framework. Work is also in progress testing
BaBar job submission via a resource broker at
Imperial.
The UKQCD collaboration aims to use the Grid for QCD calculations and, with the help of GridPP, is developing both the application, based on the EDG middleware, and its Grid interface. Currently, sites at Swansea, Liverpool, Edinburgh and RAL are connected in a Grid.
Infrastructure
GridPP is developing two levels of resources: A
prototype Tier-1 centre at RAL [5] and four
distributed Tier-2 centres that will involve
practically all HEP institutes in the UK. This
structure reflects both the hierarchy of Tier centres described earlier and the UK context. The Tier-1 resource is totally managed by GridPP and allows current international commitments to be met. In contrast, the Tier-2 centres rely on resources funded from non-PPARC sources (primarily SRIF and JREI) and, typically, are shared with other disciplines.
Eventually, the integrated Tier-2 resources are
likely to be considerably larger than the Tier-1
centre but from a management point of view
there is considerable risk, and worse,
uncertainty, associated with them. This
illustrates one of the fundamental challenges of
the Grid: The need to pull resources on to the
Grid in a managed way, in addition to pushing
out wholly owned resources.
The current resources at RAL (~500 CPUs,
~80TB of usable disk, and 180TB of tape)
provide an integrated service as a Tier-1 centre
for LHC and a Tier-A centre for BaBar. The
weekly usage for the Tier-1/A centre this
calendar year is shown in the following figure:
Figure-9: Weekly CPU usage at the Tier-1/A.
The purple area represents the BaBar Tier-A
usage; the green area shows the tail end of the
LHCb data challenge; the large brown
component in the penultimate bar represents the
start of the latest CMS production.
All three LHC experiments have completed
their first major data challenge and CMS is
currently preparing data for a second. The UK
has made significant contributions to all of
these, not only through the Tier-1 resources but
also using the Tier-2 resources. For LHCb, 1/3
of all the events were produced in the UK, with
the largest single contribution coming from the
Tier-2 resources at Imperial. For the very early
CMS data-challenge, the UK was the 3rd largest
contributor after CERN and the combined US,
again with Imperial producing the largest
contribution. In the second phase of the recent
ATLAS data challenge, the UK was the largest
producer worldwide. As can be seen in the
figure below, the Tier-1 centre provided the largest contribution but four Tier-2 sites made very significant additions.

Figure-10: UK contributions to phase-2 of the ATLAS data challenge.
To date, the data challenges, as with the BaBar Tier-A usage, have not been performed in a Grid-like manner. However, the UK testbed
presented earlier has been developed in parallel
and is a real functional Grid, albeit very much a
prototype. Now we enter a transition period
where the LCG-1 release will be more widely
deployed and the experiments will increasingly
rely on grid-technology to perform the ongoing
work. GridPP is very conscious of the need to
roll out this Grid in a controlled manner and with
the necessary support mechanisms. This will be the prime focus of the second half of the GridPP project.
GridPP2
The current GridPP project will end in
September 2004 but a follow on project,
GridPP2 [6], has recently been proposed to
cover the period up to September 2007. At that
point the LHC should be turning on and a
production Grid will be needed. At a high level,
GridPP2 is designed to move the UK grid from
Prototype to Production in phase with the
releases planned from the current and future
LCG projects (LCG2 will follow on from LCG
in 2005). The speed at which UK physicists will
be able to access data from the LHC in an
efficient way will depend on continuing the
close relationship between GridPP and LCG.
The proposed investment in LCG from GridPP2
is less than half of that from GridPP1 but
matches both the needs of the LCG2 project and
the level at which the UK might be expected to
contribute based on CERN membership.
The hardware requirements for the LHC experiments in the UK have been profiled (albeit with considerable uncertainty) and a plan
developed to meet these needs within the
context of GridPP2. About half of these
requirements will be met through a totally
managed Tier-1 resource at RAL. The
remainder is assumed to become available
through the Tier-2 centres described earlier.
Although GridPP2 will not be directly investing
in Tier-2 hardware, it will provide funding for posts to integrate these resources into the Grid and some support for operations and maintenance. The total hardware resources planned for 2007 are shown in Figure-11.
As at present, Applications will be developed in collaboration with the experiments, with the emphasis in GridPP2 on providing the interface to the Grid and support for cross-experiment projects.
Figure-11: Hardware planned by 2007.
The greatest challenge for GridPP2 will be to
successfully move from a prototype to a
production Grid, challenging the boundaries of
Scale, Functionality, and Robustness. A
dedicated Production Team will be set up with a
member in each of the Tier centres under the
leadership of an Operations Manager with the
prime function of overseeing the technical
deployment of the Grid. A provisional outline of
the GridPP2 project, using the Project Map
format, is shown below.
Figure-12: A provisional map of the GridPP2 project (the GridPP2 goal, "To develop and deploy a large scale production quality Grid in the UK for the use of the Particle Physics community", is broken down into seven areas, LCG, Development, LHC Apps, Non-LHC Apps, Production Grid, Management and Dissemination, each with its component elements).
References
[1] http://www.gridpp.ac.uk/
[2] http://lcg.web.cern.ch/LCG/
[3] http://eu-datagrid.web.cern.ch/eu-datagrid/
[4] See the links available from:
http://www.gridpp.ac.uk/eb/applications.html
[5] http://www.gridpp.ac.uk/tier1a/
[6] http://www.gridpp.ac.uk/docs/gridpp2/