Performance of the
LHC Computing Grid (LCG)
David Colling
Imperial College
Performance Workshop
NeSC
June 22nd 2005
Thanks:
Slides/pictures/text taken from several people including Les Robertson,
Jamie Shears, Bob Jones, Gabriel Zaquine, Jeremy Coles, Gidon Moont …
Caveat:
LCG means different things to different people and funding bodies.
Contents:
• Description of the LCG, what the targets are and how it works
• The monitoring that is currently in place
• The current and future metrics.
• The Service Challenges
• The testing and release procedure
The LHC
[Aerial view of the LHC site: Mont Blanc (4810 m), downtown Geneva, the CERN sites, and the four experiments ALICE, ATLAS, CMS and LHCb]
What is the LHC?
• The LHC will collide beams of protons at an energy of 14 TeV
• Using the latest superconducting technologies, it will operate at about -270°C, just above absolute zero
• The LHC is due to switch on in 2007
• Four experiments, with detectors 'as big as cathedrals': ALICE, ATLAS, CMS, LHCb
• With its 27 km circumference, the accelerator will be the largest superconducting installation in the world
• The largest terrestrial scientific endeavour ever undertaken
• Due to start taking data in 2007
• Four detectors constructed and operated by international collaborations of thousands of physicists, engineers and technicians
Data Volume
Data accumulating at ~15 PetaBytes/year, equivalent to writing a CD every 2 seconds.
[Figure: a stack of CDs holding one year of LHC data (~20 km high, at 50 CD-ROMs = 35 GB per 6 cm) compared with a balloon at 30 km, Concorde at 15 km and Mont Blanc at 4.8 km]
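As a quick sanity check of that analogy (back-of-the-envelope arithmetic only, using the 50 CDs = 35 GB figure from the figure legend):

# Back-of-the-envelope check of the "a CD every 2 seconds" analogy.
# Inputs taken from the slide: ~15 PB/year of data, 50 CDs ~= 35 GB.
PB, GB = 1e15, 1e9

data_per_year = 15 * PB
cd_capacity = 35 * GB / 50              # ~0.7 GB per CD
seconds_per_year = 365 * 24 * 3600

cds_per_year = data_per_year / cd_capacity
print(f"{cds_per_year:.2e} CDs/year, one every "
      f"{seconds_per_year / cds_per_year:.1f} s")
# -> roughly one CD every 1.5 s, the same order as the quoted 2 seconds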
The Role of LCG
• LCG is the system on which this data will be analysed, and on which similar volumes of MC simulation will be generated
• High Energy Physics jobs have particular characteristics, e.g. they are "thankfully parallel"
• However, LCG and EGEE are very closely linked, and EGEE has a more general remit, covering biomedical, earth observation and other applications as well as HEP
Middleware and
Deployment
• Current Middleware based on EDG … but
hardened and extended
• New middleware being developed with the
EGEE project
• Deployment and monitoring is also done
jointly with EGEE
The System (ATLAS Case)
PC (2004) = ~1 kSpecInt2k
• Detector → Event Builder at ~PB/sec; Event Builder → Event Filter at ~100 Gb/sec
• Event Filter (~7.5 MSI2k) sends ~3 Gb/sec of raw data to the Tier 0; some data for calibration and monitoring goes to the institutes, and calibrations flow back
• Tier 0 (~5 MSI2k, ~5 PB/year, no simulation) sends ~75 MB/s of raw data per Tier-1 for ATLAS
• Tier 1: 10 regional centres (e.g. the UK centre at RAL, plus French, Dutch and US centres), ~2 MSI2k and ~2 PB/year each, on 622 Mb/s links; they reprocess, house simulation and host group analysis
• Tier 2: ~30 centres (e.g. a Northern Tier of Lancaster, Liverpool, Manchester and Sheffield, ~0.25 TIPS), ~200 kSI2k and ~200 TB/year each, on 622 Mb/s links; Tier 2s do the bulk of simulation
• Each of the ~30 Tier 2s has ~20 physicists (range) working on one or more channels, and each Tier 2 should hold the full AOD, TAG and relevant Physics Group summary data
• Desktop workstations with a physics data cache, on 100-1000 Mb/s links
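Taking the per-site figures above at face value, a rough aggregation gives the scale of the distributed system (a back-of-the-envelope sketch; the counts of 10 Tier-1s and ~30 Tier-2s are the ones quoted on the slide):

# Rough aggregation of the ATLAS computing-model figures quoted above.
n_tier1, n_tier2 = 10, 30

tier1_cpu_msi2k, tier1_store_pb = 2.0, 2.0     # per Tier-1, per year
tier2_cpu_msi2k, tier2_store_tb = 0.2, 200     # per Tier-2, per year

print(f"Tier-1 total: {n_tier1 * tier1_cpu_msi2k:.0f} MSI2k, "
      f"{n_tier1 * tier1_store_pb:.0f} PB/year")
print(f"Tier-2 total: {n_tier2 * tier2_cpu_msi2k:.0f} MSI2k, "
      f"{n_tier2 * tier2_store_tb / 1000:.0f} PB/year")
# -> ~20 MSI2k and ~20 PB/year across the Tier-1s, ~6 MSI2k and ~6 PB/year
#    across the Tier-2s, alongside ~5 MSI2k and ~5 PB/year at the Tier 0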
edg-job-submit myjob.jdl
myjob.jdl:
JobType = "Normal";
Executable = "$(CMS)/exe/sum.exe";
InputData = "LF:testbed0-00019";
ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g,dc=cnaf,dc=infn,dc=it";
DataAccessProtocol = "gridftp";
InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
OutputSandbox = {"sim.err", "test.out", "sim.log"};
Requirements = other.GlueHostOperatingSystemName == "linux" &&
               other.GlueHostOperatingSystemRelease == "Red Hat 6.2" &&
               other.GlueCEPolicyMaxWallClockTime > 10000;
Rank = other.GlueCEStateFreeCPUs;
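The Requirements and Rank expressions drive the Resource Broker's matchmaking: only Computing Elements whose published GLUE attributes satisfy Requirements are kept, and the job goes to the one with the highest Rank. A minimal sketch of that selection logic in Python (illustrative only; the CE names and attribute values are invented, and this is not the broker's actual code):

# Illustrative Resource-Broker-style matchmaking: keep the CEs whose
# published attributes satisfy the JDL Requirements, then pick the one
# with the highest Rank (free CPUs in the JDL above).
computing_elements = [           # hypothetical published GLUE attributes
    {"name": "ce01.example.org", "GlueHostOperatingSystemName": "linux",
     "GlueHostOperatingSystemRelease": "Red Hat 6.2",
     "GlueCEPolicyMaxWallClockTime": 20000, "GlueCEStateFreeCPUs": 12},
    {"name": "ce02.example.org", "GlueHostOperatingSystemName": "linux",
     "GlueHostOperatingSystemRelease": "Red Hat 7.3",
     "GlueCEPolicyMaxWallClockTime": 50000, "GlueCEStateFreeCPUs": 40},
]

def satisfies_requirements(ce):
    return (ce["GlueHostOperatingSystemName"] == "linux"
            and ce["GlueHostOperatingSystemRelease"] == "Red Hat 6.2"
            and ce["GlueCEPolicyMaxWallClockTime"] > 10000)

matches = [ce for ce in computing_elements if satisfies_requirements(ce)]
best = max(matches, key=lambda ce: ce["GlueCEStateFreeCPUs"])
print("Job would be submitted to", best["name"])   # -> ce01.example.org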
The World as seen by the EDG
This is the world without Grids: sites are not identical, with different computers, different storage, different files and different system usage policies, leaving a confused and unhappy user. What is needed is a Grid infrastructure, so let's introduce some automated systems:
• Security and a Virtual Organisation (VO)
• An information system, so the user knows what machines are out there and can communicate with them; however, where to submit the job is too complex a decision for the user alone
• A Replica Location Service (Replica Catalogue, RC)
• A Workload Management System (WMS, the Resource Broker), which decides on the execution location
Each site consists of a compute element and a storage element. The user submits the job and its input sandbox with edg-job-submit myjob.jdl, the Logging & Bookkeeping service tracks it, and edg-job-get-output <dg-job-id> retrieves the results. Now a happy user.
So what is actually
there now?
• Currently, 138 sites in 36 countries
• ~14K CPUs, ~10 PB of storage
• ~1000 registered users (>100 active users)
Monitoring LCG/EGEE
Four forms of monitoring (+demos):
1. What is the state of a given site
2. What is currently being used
3. Accounting… how many resources have
been used by a given Virtual Organisation.
4. EGEE quality assurance
These different activities are not always well connected
What is the state of site…
• A series of site functional tests runs automatically at every site; some work by querying the site, others by running jobs there
• These tests are defined as critical or non-critical. If a site consistently fails critical tests, automated messages are sent to the site, and it will be removed from the information system if the error is not corrected (the decision logic is sketched below)
• The tests also analyse the information published by each site
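A minimal sketch of that decision logic (the test names, the three-failure threshold and the actions are illustrative assumptions, not the actual Site Functional Test implementation):

# Illustrative handling of critical vs non-critical site functional tests.
# Test names, the 3-failure threshold and the actions are assumptions.
CRITICAL_TESTS = {"job-submission", "replica-management", "ca-certificates"}

def evaluate_site(site, results, consecutive_failures):
    """results maps test name -> passed?; returns the action to take."""
    failed_critical = [t for t, ok in results.items()
                       if t in CRITICAL_TESTS and not ok]
    if not failed_critical:
        consecutive_failures[site] = 0
        return "ok"
    consecutive_failures[site] = consecutive_failures.get(site, 0) + 1
    if consecutive_failures[site] >= 3:        # "consistently fails"
        return "remove site from the information system"
    return "send automated warning to the site"

history = {}
action = evaluate_site("gridsite.example.org",
                       {"job-submission": False, "version": True}, history)
print(action)                                  # -> automated warning first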
What is the state of site…
Information gathered at two GOCs
http://goc.grid.sinica.edu.tw/ and
http://goc.grid-support.ac.uk/gridsite/gocmain/
What is the state of site…
Maps as well…
What is currently being used… GridIce
http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
Kind of like the gstat asked for earlier today…
Accounting … APEL
Uses the local batch system
logs and publishes information over RGMA
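A rough sketch of the accounting flow just described (the log format and field names are invented for illustration; APEL's real parsers and schema differ):

# Sketch of APEL-style accounting: parse batch-system log records and
# aggregate CPU time per Virtual Organisation before publishing.
# The record layout and the publish step are illustrative assumptions.
from collections import defaultdict

batch_log = [
    # (local job id, VO of the mapped grid user, CPU seconds consumed)
    ("12345", "atlas", 5400),
    ("12346", "cms",   7200),
    ("12347", "atlas", 1800),
]

usage = defaultdict(lambda: {"jobs": 0, "cpu_seconds": 0})
for job_id, vo, cpu_s in batch_log:
    usage[vo]["jobs"] += 1
    usage[vo]["cpu_seconds"] += cpu_s

for vo, record in usage.items():
    # In the real system these records are published over RGMA to the
    # central accounting database at the GOC.
    print(f"publish: VO={vo} jobs={record['jobs']} "
          f"cpu_hours={record['cpu_seconds'] / 3600:.1f}")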
Quality assurance…
Interrogates the logging and bookkeeping
• Overall Job success,
from January 2005
• Job success rate = Done(OK) / (Submitted - Cancelled)
• Results should be
validated
http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php
Quality assurance…
VO job throughput and success rate, from January to May 2005

VO        Registered   Cancelled   Aborted   Done OK   Success rate %   Done OK throughput %   Done OK/month   Done OK/day
ATLAS         376314        1358     26251    306060             81.6                  21.47           76515          2551
BABAR           2132          40        64      2004             95.8                   0.14             501            17
BIOMED        174550        6138     18357    142075             84.4                   9.96           35519          1184
CDF              556           7        76       165             30.1                   0.01              41             1
CMS            56968         915     17076     31464             56.1                   2.21            7866           262
DTEAM        1153972        3763    404882    556157             48.4                  39.01          139039          4635
DZERO          26281         597      2008     21823             85.0                   1.53            5456           182
ESR              693           3       106       513             74.3                   0.04             128             4
GILDA           8540         480       764      7039             87.3                   0.49            1760            59
LHCB          432968        4015     99308    301063             70.2                  21.12           75266          2509
MAGIC           6941           0       374      6171             88.9                   0.43            1543            51
OTHERS        106635        1494     39386     51285             48.8                   3.60           12821           427
Total        2346550       18810    608652   1425819             61.3                 100.00          356455         11882
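The success-rate column follows the formula quoted earlier; a quick check against the table's totals (plain arithmetic on the numbers above):

# Check the quoted formula, success rate = Done(OK) / (Submitted - Cancelled),
# against the Total row and the ATLAS row of the table above.
registered, cancelled, done_ok = 2346550, 18810, 1425819
print(f"{100 * done_ok / (registered - cancelled):.1f}%")     # -> 61.3%

print(f"{100 * 306060 / (376314 - 1358):.1f}%")               # -> 81.6% (ATLAS)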
Quality assurance…
The next step is to understand these failures.
By the end of June we will also measure the overhead of running via the LCG, by comparing each job's running time with its total time in the system.
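As a hypothetical illustration of that ratio (example numbers only, not a real measurement):

# Hypothetical illustration of the overhead metric: the fraction of a job's
# total time in the system that is spent actually running.
running_time_s = 3600      # time the job spent executing (example value)
total_time_s = 4500        # submission to output retrieval (example value)

efficiency = running_time_s / total_time_s
print(f"efficiency {efficiency:.0%}, grid overhead {1 - efficiency:.0%}")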
Quality assurance…
Many other metrics have been suggested (especially in the UK), including:
• Number of users (from different communities)
• Training quality
• Maintenance and reliability (already measured)
• Upgrade time, etc.
[Chart: number of sites at each release (LCG-2_3_0, LCG-2_3_1, LCG-2_4_0) by date, 24/01/2005 to 06/06/2005]
UK sites only; the target for upgrading was 3 weeks.
Demos…
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
How will we know if we are going to get there?
There is an ongoing series of Service Challenges:
• Each Service Challenge grows in complexity, approaching the full production service
• Currently we are between SC2 and SC3
• SC2 involved only the T0 and T1s
• SC3 will involve 5 T2s as well, and SC4 will involve all T2 sites
Service Challenge 2
• The goal of a >600 MB/s daily average throughput sustained for 10 days was achieved, from midday 23rd March to midday 2nd April
• Not without outages, but the system showed it could recover the rate after outages
• Load was reasonably evenly divided over the sites (given the network bandwidth constraints of the Tier-1 sites)
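For a rough sense of scale (simple arithmetic on the quoted target, not a figure from the challenge itself), a 600 MB/s daily average sustained for 10 days corresponds to roughly half a petabyte moved:

# Rough scale of the SC2 target: 600 MB/s daily average for 10 days.
rate_mb_per_s = 600
duration_s = 10 * 24 * 3600

total_tb = rate_mb_per_s * duration_s / 1e6    # MB -> TB
print(f"~{total_tb:.0f} TB (~{total_tb / 1000:.1f} PB) transferred")
# -> ~518 TB, roughly half a petabyte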
Service Challenge 3
and beyond
Apr05 – SC2 Complete
June05 - Technical Design Report
Jul05 – SC3 Throughput Test
Sep05 - SC3 Service Phase
Dec05 – Tier-1 Network operational
Apr06 – SC4 Throughput Test
May06 – SC4 Service Phase starts
Sep06 – Initial LHC Service in stable operation
Apr07 – LHC Service commissioned
[Timeline, 2005-2008: SC2; SC3 (preparation, setup, service); SC4; cosmics; LHC Service Operation; first beams; first physics; full physics run]
Testing and deployment
• Multi-stage release process
• New components are first tested on the testing testbed, with rapid feedback to developers. This testing includes performance/scalability testing. Currently this is only at 4 (5) sites: CERN, NIKHEF, RAL and Imperial (two installations)
• Then the pre-production testbed
• Releases onto production every 3 months
Conclusions
• Very hard deadline by which this must
work
• We are monitoring as much as we can to
try to understand where our current failures
come from.
• We have a release process that hopefully
will improve performance of future releases
http://goc.grid-support.ac.uk/gridsite/monitoring/
http://goc.grid.sinica.edu.tw/gstat/
http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html