WP8 HEP Applications
Final Project evaluation of EDG middleware, and
summary of workpackage achievements
F HARRIS (OXFORD/CERN)
Email: f.harris@cern.ch
DataGrid is a project funded by the European Commission under contract IST-2000-25182
GridPP, Edinburgh, 4 Feb 2004
Outline
• Overview of objectives and achievements
• Achievements by the experiments and the experiment-independent team
• Lessons learned
• Summary of the exploitation of WP8 work and future HEP applications activity in LCG/EGEE
• Concluding comments
• Questions and discussion
Overview of objectives and achievements

OBJECTIVES and ACHIEVEMENTS
• Objective: Continue work in the ATF for internal EDG architecture understanding and development.
  – Achievement: Active participation by WP8; walkthroughs of HEP use cases helped to clarify interfacing problems.
• Objective: Reconstitute the Application Working Group (AWG), with the principal aim of developing use cases and defining a common high-level application layer.
  – Achievement: This has worked very well. Extension of use cases covering key areas in Biomedicine and Earth Sciences; basis of the first proposal for common application work in EGEE.
• Objective: Work with LCG/GAG (Grid Applications Group) on further refinement of HEP requirements.
  – Achievement: This group has produced a new HEPCAL use case document concentrating on 'chaotic' use of the grid by thousands of individual users, and has further refined the original HEPCAL document.
• Objective: Development of tutorials and documentation for the user community.
  – Achievement: WP8 has played a substantial role in course design, implementation and delivery.
OBJECTIVES and ACHIEVEMENTS (continued)
• Objective: Evaluate the EDG Application Testbed, and integrate it into experiment tests as appropriate.
  – Achievement: Further successful evaluation of 1.4.n throughout the summer; evaluation of 2.0 since October 20.
• Objective: Liaise with LCG regarding EDG/LCG integration and the development of the LCG service.
  – Achievement: Very active cooperation. EIPs helped test EDG components on the LCG Certification TB prior to the LCG-1 start in late September, and did stress tests on LCG-1.
• Objective: Continue work with experiments on data challenges throughout the year.
  – Achievement: All 6 experiments have conducted data challenges of different scales throughout 2003 on EDG- or LCG-flavoured testbeds, in an evolutionary manner.
Comments on experiment work
• Experiments are living in an international multi-grid world, using other grids (US Grid3, NorduGrid, ...) in addition to EDG and LCG, and have coped with the interfacing problems.
  – The DataTAG project is very important for inter-operability (GLUE schema).
• Having 2 running experiments (in addition to the 4 LHC experiments) involved in the evaluations has proved very positive:
  – BaBar and work on Grid.it
  – D0 and work on the EDG App TB
• Used EDG software in a number of grids:
  – EDG Application Testbed
  – LCG service (LCG-1 evolving to LCG-2)
  – Italian Grid.it (identical with the LCG-1 release)
Evolution in the use of the EDG App TB and the LCG service (and Grid.it)
• Mar–June 2003: EDG 1.4 evolved, with production use by ATLAS, CMS and LHCb at efficiencies ranging from 40–90% (higher in self-managed configurations).
  – LHCb: 300K events, Mar/Apr
  – ATLAS: 500K events, May
  – CMS: 2M events in the LCG-0 configuration, in summer
• Sep 29 2003: LCG-1 service opened (and Grid.it installed the LCG-1 release).
  – Used EDG 2.0 job and data management and partitioned MDS (+VDT)
  – All experiments installed their software on LCG-1 and accomplished positive tests
  – ATLAS, ALICE, BaBar and CMS performed 'production' runs on LCG-1/Grid.it
• Oct 20 2003: EDG 2.0 was released on the App TB.
  – R-GMA instead of 'partitioned' MDS
  – D0 and CMS evaluated the 'monitoring' features of R-GMA
  – D0 did 'production' running
• December 2003: Move to EDG 2.1 on the EDG App TB.
  – Fixing known problems
  – Enhanced data management
• Feb 2004: LCG-2 service release.
  – Enhanced data management
  – Will include R-GMA for job and network monitoring
  – All experiments will use it for their data challenges
ALICE evaluation on LCG-1 and Grid.it (September–November 2003)
• Summary
  – Significant improvement in stability with respect to the tests in Spring 2003.
  – Performance was generally a step function for batches (either close to 0 or close to 100%). With long jobs and multiple files it was very sensitive to overall system stability (good days and bad days!).
  – Jobs were sensitive to the available space on worker nodes.
• Simulation job (event) characteristics
  – 80K primary particles generated
  – 12 hr on a 1 GHz CPU
  – 2 GB of data generated, in 8 files
• Projected load on LCG-1 during the ALICE DC (starting Feb 2004), when LCG-2 will be used (a back-of-envelope check follows below):
  – 10^4 events (1 event/job)
  – Submit 1 job every 3 minutes (20 jobs/h; 480 jobs/day)
  – Run 240 jobs in parallel
  – Generate 1 TB of output/day
  – Test LCG mass storage
  – Parallel data analysis (AliEN/PROOF) including LCG
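The projected-load figures above are mutually consistent; as a back-of-envelope check, here is a minimal Python sketch that derives the quoted rates from the per-job numbers on this slide (the variable names are illustrative only):

```python
# Back-of-envelope check of the projected ALICE DC load quoted above,
# using only the per-job figures from this slide (2 GB output, 12 h on a 1 GHz CPU).
SECONDS_PER_DAY = 24 * 60 * 60

submission_interval_s = 3 * 60                            # 1 job every 3 minutes
jobs_per_hour = 3600 / submission_interval_s              # -> 20 jobs/h
jobs_per_day = SECONDS_PER_DAY / submission_interval_s    # -> 480 jobs/day

job_duration_h = 12                                       # 12 hr per simulation job
jobs_in_parallel = jobs_per_hour * job_duration_h         # -> 240 concurrent jobs

output_per_job_gb = 2.0                                   # 2 GB in 8 files per job
output_per_day_tb = jobs_per_day * output_per_job_gb / 1024   # ~0.94 TB/day (~1 TB)

total_events = 10_000                                     # 10^4 events, 1 event/job
campaign_days = total_events / jobs_per_day               # ~21 days of submission

print(jobs_per_hour, jobs_per_day, jobs_in_parallel)
print(f"{output_per_day_tb:.2f} TB/day over ~{campaign_days:.0f} days")
```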
[Plot: performance of job batches, Sep–Nov 2003 – number of jobs vs. number of successes per batch, on a 0–100 scale]
ATLAS evolution from 1.4 to LCG-1 and the new DC2 production system
• Use of EDG 1.4.11 (modified for RH 7.3)
  – In May, reconstructed 500K events in 250 jobs with 85% first-pass efficiency.
  – Used a privately managed configuration of 7 sites in Italy, Lyon and Cambridge.
• LCG-1 (+ Grid.it) production
  – Have simulated 30,000 events in 150 jobs of 200 events each (the jobs required ~20 hours each, with efficiency ~80%).
• LCG-2 plans
  – Start around April.
• Main features of the new DC2 system (a schematic sketch follows below):
  – Common production database for all of ATLAS
  – Common ATLAS supervisor run by all facilities/managers
  – Common data management system a la Magda
  – Executors developed by middleware experts (LCG, NorduGrid, US)
  – Final verification of data done by the supervisor
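To make the division of labour concrete, here is a purely illustrative Python sketch of the supervisor/executor pattern listed above; the class and method names are hypothetical and are not the actual ATLAS DC2 implementation:

```python
# Hypothetical sketch of the DC2 pattern described above: a common supervisor
# pulls job definitions from the shared production DB and hands them to
# grid-specific executors; all names and interfaces are illustrative only.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class JobDef:
    job_id: int
    events: int
    transformation: str          # e.g. a simulation or reconstruction step


class Executor(Protocol):
    """One executor per grid flavour (LCG, NorduGrid, US grids)."""
    def submit(self, job: JobDef) -> str: ...
    def status(self, handle: str) -> str: ...


class Supervisor:
    def __init__(self, prod_db, executors: dict, data_mgmt):
        self.prod_db = prod_db       # common production database for all of ATLAS
        self.executors = executors   # mapping: grid flavour -> Executor
        self.data_mgmt = data_mgmt   # common data management system (a la Magda)

    def run_once(self) -> None:
        for job in self.prod_db.fetch_pending_jobs():
            grid = self.prod_db.assign_grid(job)        # choose a grid flavour
            handle = self.executors[grid].submit(job)   # executor does the grid work
            self.prod_db.mark_submitted(job.job_id, grid, handle)

    def verify(self, job: JobDef) -> bool:
        # final verification of the produced data is done by the supervisor
        return self.data_mgmt.outputs_registered(job.job_id)
```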
[Diagram: DC2 production system – production DB, supervisor, production manager, executors, data management system, replica catalog]
CMS work on LCG-0 and LCG-1
• LCG-0 (summer 2003)
  – Components from VDT 1.1.6 and EDG 1.4.11
  – VOMS
  – RLS
  – Monitoring: GridICE by DataTAG
  – R-GMA (for specific monitoring functions)
  – GLUE schemas and info providers (DataTAG)
  – 14 sites configured and managed by CMS
  – Physics data produced: 500K Pythia events (2000 jobs, ~8 hr each); 1.5M CMSIM events (6000 jobs, ~10 hr each)
• LCG-1
  – Ran for 9 days on LCG-1 (20–31/12/2003 and 7–12/1/2004)
  – In total 600,000 events (30–40 hr jobs) were produced (ORCA), using UIs in Padova and Bari
  – Sites used were mainly in Italy and Spain
  – Efficiency around 75% over the Christmas holidays (pretty pleased!)
  – Used the GENIUS portal
• LCG-2
  – Will be used within the data challenge starting in March
• Comments on performance
  – Substantial improvements in efficiency compared to the first EDG stress test (~80%)
  – Networking and site configuration were problems, as was the first version of RLS
LHCb results and status on EDG and LCG
• Tests on the EDG 1.4 application testbed (Feb 2003)
  – Standard LHCb production tasks; 300K events produced
  – ~35% success rate (TB support running down)
  – Software installation done by the running job
• EDG 2.0 tests (November 2003)
  – Simple test jobs as well as standard LHCb production tasks: 200 events per job, ~10 hours of CPU
  – Submission of the jobs:
    - To the EDG RB
    - Directly to a CE, with the CE status information obtained from the CE GRIS server; this increased the stability of the system significantly, up to a 90% success rate (a schematic sketch of this strategy follows below)
• LCG tests (now)
  – Software installation/validation either centralized or in a special LHCb job
  – The basic functionality tests are OK:
    - LHCb production jobs with the output stored on an SE and registered in the RLS catalog
    - Subsequent analysis jobs with input taken from previously stored files
  – Scalability tests with large numbers of jobs are prepared, to be run on the LCG-2 platform
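A schematic, hypothetical sketch of the 'direct to CE' strategy described above: query the status each CE publishes (via its GRIS), pick a lightly loaded CE and submit there. The functions are placeholders, not EDG commands:

```python
# Hypothetical sketch of the RB-bypass strategy above: use the CE status
# published by each CE's GRIS to choose where to send the next production job.
# query_ce_status() and submit_to_ce() are placeholders, not real EDG tools.

def query_ce_status(ce: str) -> dict:
    """Placeholder for an LDAP query against the CE's GRIS server."""
    return {"free_cpus": 0, "waiting_jobs": 0}


def submit_to_ce(ce: str, job_description: str) -> str:
    """Placeholder for the actual direct submission to the chosen CE."""
    return f"job-at-{ce}"


def pick_ce(ces: list) -> str:
    # prefer the CE with the most free CPUs, then the shortest queue
    def score(ce: str) -> tuple:
        status = query_ce_status(ce)
        return (status["free_cpus"], -status["waiting_jobs"])
    return max(ces, key=score)


def submit_production_job(job_description: str, ces: list) -> str:
    return submit_to_ce(pick_ce(ces), job_description)
```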

BaBar work on EDG and Grid.it
• Integrating grid tools in an experiment running since 1999
  – Have used grid tools for data distribution (SLAC to IN2P3)
  – Hope to gain access to more resources via the grid paradigm (e.g. Grid.it)
• Strategy for first integration
  – Created a 'simulation' RPM to be installed at sites
  – Data output stored on the closest SE
  – Data copied to a Tier-1 or SLAC using edg-copy
  – Logfiles returned via the sandbox
• Scheme first tested with EDG 1.4.11 on 5 Italian sites
  – 95% success at Ferrara (the site with the central DB)
  – 60% success elsewhere; 33% failures due to network saturation from simultaneous requests to the remote Objectivity database (looking at distributed solutions)
• Operation on Grid.it with the LCG-1 release
  – RB at CNAF
  – Farms at 8 Italian sites
  – 1 week test with ~500 jobs
• Positive experience with use of the GENIUS portal (https://genius.ct.infn.it)
• Analysis applications have been successfully tested on the EDG App TB
  – Input through the sandbox
  – Executable taken from the Close SE
  – Read Objy or Root data from the local server
  – Produce an n-tuple and store it on the Close SE
  – Send back the output log file through the sandbox
D0 evaluations on the EDG App TB
• Ran part of the D0 Run 2 re-processing on EDG resources
• Job requirements (a quick check of these figures follows below)
  – Input file is 700 MB (2800 events)
  – Each event needs 25 GHz·s of CPU (Pentium series assumed)
  – A job takes 19.4 hours on a 1 GHz machine
• Frequent software updates, so RPMs are not used
  – Compressed tar archives are registered in RLS as grid files
  – Jobs retrieve and install them before starting execution
• Use R-GMA for monitoring
  – Allows users and programs to publish information for inspection by other users, and for archiving in the production database
  – Found it invaluable for matching output files to jobs in real production with some jobs failing
• Results
  – Found EDG software generally satisfactory for the task (with caveats)
  – Had to modify some EDG RM commands due to their (Globus) dependence on inbound IP connectivity; should be fixed in 2.1.x
  – Used the 'Classic' SE software while waiting for mass storage developments
  – Very sensitive to R-GMA instabilities. Recently good progress with R-GMA; can run at ~90% efficiency when R-GMA is up
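A quick check of the job-length figure above, assuming the '25 GHz·s per event' cost scales inversely with processor clock speed (a sketch, not part of the D0 software):

```python
# Check of the D0 job-length figure quoted above:
# 2800 events x 25 GHz.s per event on a 1 GHz machine.
events_per_input_file = 2800
cpu_per_event_ghz_s = 25.0
cpu_clock_ghz = 1.0

job_hours = events_per_input_file * cpu_per_event_ghz_s / cpu_clock_ghz / 3600
print(f"{job_hours:.1f} hours")   # ~19.4 hours, as quoted
```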
Summary of middleware evaluations
• Workload Management
  – Tests have shown that the software is more robust and scalable
  – Stress tests were successful with up to 1600 jobs in multiple streams – efficiencies over 90%
  – Problems with new sites during tests – VOs not set up properly (though the site accepted the job)
• Data Management
  – Has worked well with respect to functionality and scalability (~100K files registered in ongoing tests)
  – Tests so far with only 1 LRC per VO implemented
  – Performance needs enhancement: registrations and simple queries can take up to 10 seconds
  – We have lost (with GDMP) the bulk transfer functions
  – Some functions needed inbound IP connectivity (Globus); D0 had to program round this (problem since fixed)
Summary of middleware evaluations (2)
• Information System
  – Partitioned MDS has worked well for LCG, following on from work accomplished within EDG (BDII work) (? limit ~100 sites)
  – R-GMA work is very promising for 'life after MDS', but needs 'hardening'
    - Ongoing work using D0 requirements for testing – how will this continue in the LCG/EGEE framework?
• Mass Storage support (crucial for data challenges)
  – We await an 'accepted' uniform interface to disk and tape systems
    - Solution coming this week(?) with the SRM/GFAL software
    - WP5 have made an important contribution to the development of the SRM interface
  – EDG 2.0 had mass storage access to CERN (Castor) and RAL (ADS)
  – The 'Classic SE' has been a useful fallback (a gridftp server) while waiting for commissioning of these developments
Other important grid system management issues
• Site certification (a hypothetical sketch of an automated procedure follows below)
  – Tests are intimately tied to what is published in the information service, so the relevant information must be checked
  – Tests should be run on a regular basis, and sites that fail should be excluded automatically
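A minimal, hypothetical sketch of the automated procedure advocated above: run the certification tests against every published site on a regular schedule and exclude the sites that fail. All functions here are placeholders, not existing EDG/LCG tools:

```python
# Hypothetical site-certification loop: periodically test every site published
# in the information service and exclude those that fail until they recover.
import time


def published_sites() -> list:
    """Placeholder: query the information service for the sites it advertises."""
    return ["site-a.example.org", "site-b.example.org"]


def run_certification_tests(site: str) -> bool:
    """Placeholder: submit the standard test jobs and cross-check what the site
    publishes (CPUs, storage, installed software) against what is actually found."""
    return True


def exclude_site(site: str) -> None:
    """Placeholder: remove the site from the set of schedulable resources."""
    print(f"excluding {site} until it passes certification again")


def certification_loop(period_s: int = 6 * 3600) -> None:
    while True:
        for site in published_sites():
            if not run_certification_tests(site):
                exclude_site(site)
        time.sleep(period_s)      # re-run the tests on a regular basis
```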
• Site configuration
  – The configuration of the software is very complex to do manually
  – The only way to check it has been to run jobs, pray, and then look at the errors
  – Must have an official, standard procedure as part of the release
  – Could middleware providers not expose fewer parameters, supply defaults, and also check parameters while running?
• Space management and publishing
  – Running out of space on SEs and WNs is still a problem; jobs need to check availability before running (an illustrative pre-flight check is sketched below)
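An illustrative pre-flight check of the kind a job wrapper could make before starting its payload on a worker node; the 10 GB threshold and scratch path are arbitrary examples, not WP8 recommendations:

```python
# Illustrative pre-flight check: abort before the payload starts if the worker
# node's scratch area does not have enough free space for the expected output.
import shutil
import sys

REQUIRED_FREE_GB = 10        # arbitrary example threshold
SCRATCH_DIR = "/tmp"         # wherever this job will write its output


def enough_space(path: str, required_gb: float) -> bool:
    free_gb = shutil.disk_usage(path).free / 1024**3
    return free_gb >= required_gb


if not enough_space(SCRATCH_DIR, REQUIRED_FREE_GB):
    sys.exit("not enough scratch space on this worker node; aborting before the payload runs")
```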
The deliverables + 'extra' outputs from WP8
• The formal EU deliverables
  – D8.1: The original HEP requirements document
  – D8.2: 'Evaluation by experiments after 1st year'
  – D8.3: 'Evaluation by experiments after 2nd year'
  – D8.4: 'Evaluation after 3rd year'
• Extra key documents (being used as input to EGEE)
  – HEPCAL use cases – May 2002 (revised Oct 2003)
  – AWG recommendations for middleware – June 2003
  – AWG enhanced use cases (for Biomed, ESA)
  – HEPCAL2 use cases for analysis (several WP8 people) – Sep 2003
• Generic HEP test suite used by EDG/LCG
• Interfacing of 6 experiment systems to the middleware
• Ongoing consultancy from the 'loose cannons' to all applications
Main lessons learned
• Globally, HEP applications feel it would have been 'better' to start with simpler prototype software and develop incrementally in functionality and complexity, with close working relations with the middleware developers.
• Applications should have played a larger role in the architecture, in defining interfaces (so we could all learn together!). The 'average' user interfaces are at too low a level; defining high-level APIs from the beginning would have been very useful (a hypothetical sketch of such an interface follows this list).
• The formation of Task Forces (applications + middleware) was a very important step midway through the project.
• The Loose Cannons (a team of 5) were crucial to all developments and worked across experiments. This team comprised all the funded effort of WP8.
• The information system is the nerve centre of the grid. We look to the R-GMA developments for a long-term solution to the scaling problems (LCG think partitioned MDS could work up to ~100 sites).
• We look to SRM/GFAL as the solution to uniform mass storage interfacing.
• Must have flexible application software installation not coupled to grid software installation (experiments are working with LCG; see the work of LHCb, ALICE, D0, ...).
• We observe that site configuration, site certification and space management on SEs and WNs are still outstanding problems that account for most of the inefficiencies (down times).
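To illustrate the 'high-level API' point above, here is a purely hypothetical sketch of the kind of application-facing layer that would hide middleware-level detail; every name in it is invented for illustration:

```python
# Purely hypothetical sketch of a high-level application layer of the kind
# argued for above: experiments would code against a few calls like these,
# with RB/CE submission, SE selection and catalog registration hidden behind them.
from typing import Protocol


class GridSession(Protocol):
    def submit_job(self, executable: str, input_lfns: list, events: int) -> str:
        """Submit a job; return an opaque job identifier."""

    def job_status(self, job_id: str) -> str:
        """Return e.g. 'submitted', 'running', 'done' or 'failed'."""

    def register_output(self, local_path: str, lfn: str) -> None:
        """Store the file on a suitable SE and register its logical file name."""

    def open_dataset(self, dataset_name: str) -> list:
        """Resolve a dataset name into the logical files it contains."""
```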
Exploitation of the work of WP8, and future HEP applications work in LCG/EGEE
• All experiments have exploited the EDG middleware using WP8 effort.
• The HEPCAL and AWG documents are essential inputs to the future LCG/EGEE work.
• Future developments will be in the context of the LCG/EGEE infrastructure.
  – There will be a natural carrying over of experience and personnel within the HEP community.
• Experiments will use LCG/EGEE for their data challenges in 2004:
  – ALICE: now
  – CMS: Mar 1
  – ATLAS and LHCb: April
• In parallel, within the ARDA project, middleware (including EDG components) will be 'hardened' and evaluated in a development TB.
• The NA4 activity in EGEE will include dedicated people for interfacing middleware to the experiments (8 people at CERN + others distributed in the community).
Concluding comments
• Over the past 3 years the HEP community has moved from an R&D approach to the grid paradigm, to the use of grid services in physics production systems in world-wide configurations.
• Experiments are using several managed grids (LCG/EGEE, US grids, NorduGrid), so inter-operability is crucial.
• Existing software can be used in production services, with parallel 'hardening' taking advantage of the lessons learned (the ARDA project in LCG/EGEE).
• THANKS
  – To the members of WP8 for all their efforts over 3 years
  – And to all the other WPs (middleware, testbed, networking, project office) for their full support and cooperation
Questions and discussion