Performance of the LHC Computing Grid (LCG)
David Colling, Imperial College
Performance Workshop, NeSC, June 22nd 2005

Thanks: slides/pictures/text taken from several people including Les Robertson, Jamie Shiers, Bob Jones, Gabriel Zaquine, Jeremy Coles, Gidon Moont …
Caveat: LCG means different things to different people and funding bodies.

Contents:
• Description of the LCG, what the targets are, how it works
• The monitoring that is currently in place
• The current and future metrics
• The Service Challenges
• The testing and release procedure

The LHC
[Aerial photograph of the Geneva region showing the LHC ring, the CERN sites and the four experiment sites (ALICE, ATLAS, CMS, LHCb), with downtown Geneva and Mont Blanc (4810 m) in the background.]

What is the LHC?
• The LHC will collide beams of protons at an energy of 14 TeV
• Using the latest superconducting technologies, it will operate at about –270 ºC, just above absolute zero
• The LHC is due to switch on in 2007
• With its 27 km circumference, the accelerator will be the largest superconducting installation in the world: the largest terrestrial scientific endeavour ever undertaken
• Four experiments, with detectors 'as big as cathedrals': ALICE, ATLAS, CMS and LHCb
• The four detectors are constructed and operated by international collaborations of thousands of physicists, engineers and technicians
• Due to start taking data in 2007

Data Volume
• Data accumulating at ~15 PetaBytes/year
• Equivalent to writing a CD every 2 seconds
[Scale comparison: a CD stack holding one year of LHC data would be ~20 km high, compared with a balloon at 30 km, Concorde at 15 km and Mt. Blanc at 4.8 km; 50 CD-ROMs (a 6 cm stack) = 35 GB.]
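A quick back-of-the-envelope check of the "CD every 2 seconds" figure (a minimal sketch, not from the original slides; the 700 MB CD capacity is inferred from the "50 CD-ROM = 35 GB" comparison above, and decimal petabytes are assumed):

# Back-of-the-envelope check of the LHC data-rate comparison above.
PB = 1e15                       # bytes per petabyte (decimal convention assumed)
year = 365 * 24 * 3600          # seconds per year

data_rate = 15 * PB / year      # ~15 PB/year expressed in bytes per second
cd_capacity = 35e9 / 50         # 50 CD-ROMs = 35 GB  =>  ~700 MB per CD

print(f"average data rate: {data_rate / 1e6:.0f} MB/s")            # ~476 MB/s
print(f"one CD filled every {cd_capacity / data_rate:.1f} seconds") # ~1.5 s, i.e. roughly every 2 s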
The Role of LCG
• LCG is the system on which this data will be analysed and on which similar volumes of Monte Carlo simulation will be generated
• High Energy Physics jobs have particular characteristics, e.g. they are "thankfully parallel"
• However, LCG and EGEE are very closely linked, and EGEE has a more general remit, e.g. biomedical, earth observation and other applications as well as HEP

Middleware and Deployment
• Current middleware is based on EDG … but hardened and extended
• New middleware is being developed within the EGEE project
• Deployment and monitoring are also done jointly with EGEE

The System (ATLAS Case)
[Diagram of the ATLAS computing model; 1 PC (2004) = ~1 kSpecInt2k.]
• The Event Builder receives ~PB/s from the detector; the Event Filter (~7.5 MSI2k) receives ~100 Gb/s and sends ~3 Gb/s of raw data to Tier 0
• Tier 0 at CERN (~5 MSI2k, ~5 PB/year, no simulation): some data for calibration and monitoring goes out to institutes, and calibrations flow back; raw data is shipped to the Tier-1s at ~75 MB/s per Tier-1 for ATLAS
• Tier 1: 10 regional centres (e.g. the US, Dutch, UK (RAL), French and Northern Tier regional centres), each ~2 MSI2k and ~2 PB/year, connected by 622 Mb/s links; the Tier-1s reprocess data, house simulation and support group analysis
• Tier 2: ~30 centres, each ~200 kSI2k and ~200 TB/year, connected by 622 Mb/s links (e.g. Lancaster, Liverpool, Manchester and Sheffield, together ~0.25 TIPS); each Tier 2 has ~20 physicists (range) working on one or more channels, should hold the full AOD, TAG and relevant physics group summary data, and the Tier 2s do the bulk of the simulation
• Physics data caches and desktop workstations connect over 100–1000 Mb/s links

Job submission:
edg-job-submit myjob.jdl

myjob.jdl:
  JobType = "Normal";
  Executable = "$(CMS)/exe/sum.exe";
  InputData = "LF:testbed0-00019";
  ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog, dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
  DataAccessProtocol = "gridftp";
  InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
  OutputSandbox = {"sim.err", "test.out", "sim.log"};
  Requirements = other.GlueHostOperatingSystemName == "linux" && other.GlueHostOperatingSystemRelease == "Red Hat 6.2" && other.GlueCEPolicyMaxWallClockTime > 10000;
  Rank = other.GlueCEStateFreeCPUs;

The World as seen by the EDG
• This is the world without Grids: a confused and unhappy user. Sites are not identical: different computers, different storage, different files, different usage policies
• What is needed is a grid infrastructure… so let's introduce some: security and an information system. Now the user knows what machines are out there and can communicate with them… however, deciding where to submit the job is too complex a decision for the user alone
• Each site consists of a compute element (CE) and a storage element (SE)
• So let's introduce some automated systems: the Workload Management System (WMS, the Resource Broker) decides on the execution location, and the Replica Location Service (Replica Catalogue, RC) server knows where replicas of the data are held
• The job and its input sandbox go to the WMS, which records progress in the Logging & Bookkeeping service; the output is retrieved with edg-job-get-output <dg-job-id>
• Now a happy user
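To illustrate the matchmaking that the Resource Broker performs with the Requirements and Rank expressions in myjob.jdl, here is a toy sketch. It is not the actual WMS/Resource Broker code: the compute-element records and attribute names below are simplified, hypothetical stand-ins for the Glue schema attributes published in the information system.

# Toy matchmaking sketch: filter compute elements (CEs) on the JDL
# Requirements expression, then rank the survivors by the Rank expression.
# The CE list is illustrative, not real information-system output.

compute_elements = [
    {"name": "ce01.example.org", "os": "linux", "release": "Red Hat 6.2",
     "max_wall_clock": 172800, "free_cpus": 34},
    {"name": "ce02.example.org", "os": "linux", "release": "Red Hat 7.3",
     "max_wall_clock": 86400, "free_cpus": 120},
    {"name": "ce03.example.org", "os": "linux", "release": "Red Hat 6.2",
     "max_wall_clock": 3600, "free_cpus": 250},
]

def requirements(ce):
    # Mirrors the JDL Requirements expression in myjob.jdl
    return (ce["os"] == "linux"
            and ce["release"] == "Red Hat 6.2"
            and ce["max_wall_clock"] > 10000)

def rank(ce):
    # Mirrors Rank = other.GlueCEStateFreeCPUs (more free CPUs is better)
    return ce["free_cpus"]

matches = sorted((ce for ce in compute_elements if requirements(ce)),
                 key=rank, reverse=True)
print("job would be submitted to:", matches[0]["name"] if matches else "no match")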
So what is actually there now?
• Currently, 138 sites in 36 countries
• ~14K CPUs, ~10 PB of storage
• ~1000 registered users (>100 active users)

Monitoring LCG/EGEE
Four forms of monitoring (plus demos):
1. What is the state of a given site
2. What is currently being used
3. Accounting: how many resources have been used by a given Virtual Organisation
4. EGEE quality assurance
These different activities are not always well connected.

What is the state of a site…
• A series of site functional tests is run automatically at every site … some involve asking a site questions, some involve running jobs
• These tests are defined as critical or non-critical. If a site consistently fails critical tests, automated messages are sent to the site, and it will be removed from the information system if the error is not corrected
• The information published by a site is also analysed

What is the state of a site…
• Information is gathered at two Grid Operations Centres (GOCs): http://goc.grid.sinica.edu.tw/ and http://goc.grid-support.ac.uk/gridsite/gocmain/

What is the state of a site…
• Maps as well…

What is currently being used… GridIce
• http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
• Kind of like the gstat asked for earlier today…

Accounting… APEL
• Uses the local batch system logs and publishes the information over R-GMA

Quality assurance…
• Interrogates the Logging and Bookkeeping service
• Overall job success, from January 2005
• Job success rate = Done (OK) / (Submitted − Cancelled)
• Results should be validated
• http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php

Quality assurance…
Per-VO job throughput and success rate, from January until May 2005:

VO       Registered  Cancelled  Aborted  Done OK  Success rate %  % of all Done OK  Done OK/month  Done OK/day
ATLAS        376314       1358    26251   306060            81.6             21.47          76515         2551
BABAR          2132         40       64     2004            95.8              0.14            501           17
BIOMED       174550       6138    18357   142075            84.4              9.96          35519         1184
CDF             556          7       76      165            30.1              0.01             41            1
CMS           56968        915    17076    31464            56.1              2.21           7866          262
DTEAM       1153972       3763   404882   556157            48.4             39.01         139039         4635
DZERO         26281        597     2008    21823            85.0              1.53           5456          182
ESR             693          3      106      513            74.3              0.04            128            4
GILDA          8540        480      764     7039            87.3              0.49           1760           59
LHCB         432968       4015    99308   301063            70.2             21.12          75266         2509
MAGIC          6941          0      374     6171            88.9              0.43           1543           51
OTHERS       106635       1494    39386    51285            48.8              3.60          12821          427
Total       2346550      18810   608652  1425819            61.3                            356455        11882
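As a cross-check of the table, the success rate and the share of all Done OK jobs can be recomputed from the raw counts (a minimal sketch using two rows of the table; "Registered" is taken as the submitted-job count in the formula above):

# Recompute the QA metrics for two VOs from the raw job counts,
# using: success rate = Done OK / (Submitted - Cancelled)
rows = {                 # (registered, cancelled, done_ok) -- values from the table above
    "ATLAS": (376314, 1358, 306060),
    "DTEAM": (1153972, 3763, 556157),
}
total_done_ok = 1425819  # "Total" row, Done OK column

for vo, (registered, cancelled, done_ok) in rows.items():
    success = 100 * done_ok / (registered - cancelled)
    share = 100 * done_ok / total_done_ok
    print(f"{vo}: success rate {success:.1f}%, share of all Done OK {share:.2f}%")
# ATLAS: 81.6% and 21.47%; DTEAM: 48.4% and 39.01% -- matching the table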
Quality assurance…
• The next step is to understand these failures
• By the end of June we will also measure the overhead caused by running via the LCG, by measuring running time / total time

Quality assurance…
Many other metrics have been suggested (especially in the UK), including:
• Number of users (from different communities)
• Training quality
• Maintenance and reliability (already measured)
• Upgrade time, etc.
[Chart: number of UK sites at each middleware release (LCG-2_3_0, LCG-2_3_1, LCG-2_4_0) against date, 24/01/2005 to 06/06/2005. UK sites only; the target for upgrading was 3 weeks.]

Demos…
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html

How will we know if we are going to get there?
There is an ongoing set of Service Challenges:
• Each Service Challenge grows in complexity, approaching the full production service
• Currently we are between SC2 and SC3
• SC2 involved only the T0 and the T1s
• SC3 will involve 5 T2s as well, and SC4 will involve all T2 sites

Service Challenge 2
• The goal for throughput, a daily average of >600 MB/s sustained for 10 days, was achieved: midday 23rd March to midday 2nd April
• Not without outages, but the system showed it could recover its rate again after outages
• Load was reasonably evenly divided over the sites (given the network bandwidth constraints of the Tier-1 sites)

Service Challenge 3 and beyond
• Apr 05 – SC2 complete
• Jun 05 – Technical Design Report
• Jul 05 – SC3 throughput test
• Sep 05 – SC3 service phase
• Dec 05 – Tier-1 network operational
• Apr 06 – SC4 throughput test
• May 06 – SC4 service phase starts
• Sep 06 – initial LHC service in stable operation
• Apr 07 – LHC service commissioned
[Timeline 2005–2008: SC2; SC3 preparation, setup and service; SC4; LHC Service Operation; cosmics; first beams; first physics; full physics run.]

Testing and deployment
• Multi-stage release process
• New components are first tested on the testing testbed, with rapid feedback to developers. This testing is to include performance/scalability testing. Currently this involves only 4 (5) sites: CERN, NIKHEF, RAL and Imperial (two installations)
• Pre-production testbed
• Releases go onto production every 3 months

Conclusions
• There is a very hard deadline by which this must work
• We are monitoring as much as we can, to try to understand where our current failures come from
• We have a release process that will hopefully improve the performance of future releases

http://goc.grid-support.ac.uk/gridsite/monitoring/
http://goc.grid.sinica.edu.tw/gstat/
http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html