ATLAS Grid Computing and Data Challenges
Nurcan Ozturk, University of Texas at Arlington
Recent Progresses in High Energy Physics, Bolu, Turkey, June 23-25, 2004

Outline
• Introduction
• ATLAS Experiment
• ATLAS Computing System
• ATLAS Computing Timeline
• ATLAS Data Challenges
• DC2 Event Samples
• Data Production Scenario
• ATLAS New Production System
• Grid Flavors in Production System
• Windmill-Supervisor
• An Example of XML Messages
• Windmill-Capone Screenshots
• Grid Tools
• Conclusions

Introduction
• Why Grid computing?
  • Scientific research is becoming more and more complex, and international teams of scientists are growing larger and larger.
  • Grid technologies enable scientists to use remote computers and data storage systems around the world to retrieve and analyze data.
  • Grid computing power will be a key to the success of the LHC experiments.
  • Grid computing is a challenge not only for particle physics experiments but also for biologists, astrophysicists and gravitational-wave researchers.

ATLAS Experiment
• The ATLAS (A Toroidal LHC ApparatuS) experiment at the Large Hadron Collider at CERN will start taking data in 2007: proton-proton collisions at a 14 TeV center-of-mass energy.
• ATLAS will study:
  • the SM Higgs boson
  • SUSY states
  • SM QCD, EW and heavy-quark physics
  • new physics?
• Total amount of "raw" data: ~1 PB/year.
• ATLAS needs the Grid to reconstruct and analyze this data, with a complex "Worldwide Computing Model" and "Event Data Model":
  • raw data at CERN
  • reconstructed data "distributed"
  • all members of the collaboration must have access to "ALL" public copies of the data.
• ~2000 collaborators, ~150 institutes, 34 countries.

ATLAS Computing System (R. Jones)
[Diagram: the tiered ATLAS computing model. The Event Builder (~PB/s off the detector, 10 GB/s out) feeds the Event Filter (~159 kSI2k), which sends ~300 MB/s per experiment per Tier 1 to the Tier 0 centre at CERN (~5 MSI2k, no simulation). Tier 1 regional centres (US, French, Italian, UK (RAL), ...) provide ~7.7 MSI2k each, with per-Tier-1 data volumes quoted at ~2 PB/year and ~9 PB/year; some data for calibration and monitoring flows to institutes at 450 Mb/s, and calibrations flow back. Tier 2 centres (e.g. a Northern Tier of Lancaster, Liverpool, Manchester and Sheffield) provide ~200 kSI2k and ~200 TB/year each over 622 Mb/s links, serving physics data caches and desktop workstations (100-1000 MB/s; ~0.25 TIPS). PC (2004) = ~1 kSpecInt2k.]
• Each Tier 2 has ~25 physicists working on one or more channels.
• Each Tier 2 should have the full AOD, TAG and relevant physics-group summary data.
• Tier 2s do the bulk of the simulation.
(A back-of-envelope conversion of these rates into yearly volumes is sketched after the timeline slide.)

ATLAS Computing Timeline (D. Barberis)
2003
• POOL/SEAL release (done)
• ATLAS release 7 (with POOL persistency) (done)
• LCG-1 deployment (done)
2004
• ATLAS completes Geant4 validation (done)
• ATLAS release 8 (done)
• DC2 Phase 1: simulation production <-- NOW
2005
• DC2 Phase 2: intensive reconstruction (the real challenge!)
• Combined test beams (barrel wedge)
• Computing Model paper
2006
• Computing Memorandum of Understanding
• ATLAS Computing TDR and LCG TDR
• DC3: produce data for the PRR and test LCG-n
2007
• Physics Readiness Report
• Start of the commissioning run
• GO!
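To make the bandwidth figures on the computing-model diagram concrete, the short Python sketch below (an illustration added here, not part of the original talk) converts a sustained transfer rate into a yearly data volume. Assuming the quoted rates are sustained for a full year of running, 300 MB/s corresponds to roughly 9 PB/year and a fully used 622 Mb/s link to roughly 2 PB/year, the same order as the per-Tier-1 volumes quoted on the diagram; the pairing of specific rates with specific tiers is only my reading of the figure.

    SECONDS_PER_YEAR = 3.15e7  # order-of-magnitude length of a running year, in seconds

    def yearly_volume_pb(rate_mb_per_s: float) -> float:
        """Convert a sustained rate in MB/s into PB accumulated over one year."""
        return rate_mb_per_s * SECONDS_PER_YEAR / 1e9  # 1 PB = 1e9 MB

    print(f"300 MB/s -> ~{yearly_volume_pb(300):.0f} PB/year")      # Tier 0 -> Tier 1 flow
    print(f"622 Mb/s -> ~{yearly_volume_pb(622 / 8):.1f} PB/year")  # one 622 Mb/s link, fully used
    print(f"10 GB/s  -> ~{yearly_volume_pb(10_000):.0f} PB/year")   # Event Builder output (not all stored)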
ATLAS Data Challenges
• Data Challenges --> generate and analyze simulated data of increasing scale and complexity, using the Grid as much as possible.
• Goals:
  • validate the Computing Model, the software and the data model, and ensure the correctness of the technical choices to be made;
  • provide simulated data to design and optimize the detector;
  • the experience gained in these Data Challenges will be used to formulate the ATLAS Computing Technical Design Report.
• Status:
  • DC0 (December 2001 - June 2002) and DC1 (July 2002 - March 2003) completed
  • DC2 ongoing
  • DC3, DC4 planned (one per year)

DC2 Event Samples (G. Poulard)
[Table: the DC2 event samples, labelled A0-A11, B1-B5, H1-H9 and M1. The physics channels include top (and a mis-aligned-detector top sample), Z, W, Z + jet, W + 4 jets, jets, gamma + jet, bb, dijets, QCD, minimum bias, SM Higgs at masses of 115-180 GeV, MSSM Higgs (b-b-A(300), b-b-A(115)), and SUSY and DC1 SUSY samples. Decay modes include e-e, mu-mu, tau-tau, leptons, mu6-mu6, 4 leptons, gamma-gamma and W-W, with cuts ranging from no pT cut up to pT > 600. Individual samples contain between 0.015 and 1 million events; the total before filtering is about 9.4 million events.]

Data Production Scenario (G. Poulard)
• Event generation: no input -> generated events (< 2 GB files).
• G4 simulation: input "part of" the generated events -> hits + MCTruth (< 2 GB files). Job duration limited to 24 h! ~2000 jobs/day, ~500 GB/day, ~5 MB/s.
• Detector response: input hits + MCTruth (1 file, plus the generated events) -> digits + MCTruth, RDO (or BS; no MCTruth if BS). ~2000 jobs/day.
• Pile-up: input "signal" hits + MCTruth and "min. bias" hits -> digits + MCTruth, RDO (or BS). Input ~10 GB/job, ~10 TB/day, ~150 MB/s.
• Byte-stream ("pile-up" data): input RDO (1 or a few files) -> BS. Still some work needed.
• Event mixing: input RDO or BS (several files) -> BS. Still some work needed.
• Reconstruction: input RDO or BS -> ESD (1 file).
• AOD production: input ESD -> AOD (several ~10 files). Streaming?

ATLAS New Production System
[Diagram: the new ATLAS production system. The Windmill supervisor talks to the production database (prodDB), the AMI catalog and the Don Quijote data management system (dms), and drives grid-specific executors over Jabber and SOAP: Capone for Grid3, Dulcinea for NorduGrid (NG) and Lexor for LCG, plus a legacy LSF executor; each grid flavour has its own replica location service (RLS). See http://www.nordugrid.org/applications/prodsys/]
(An illustrative code sketch of this supervisor-executor pattern appears below, after the list of grid flavors.)

Grid Flavors in Production System (L. Perini)
• LCG: LHC Computing Grid, > 40 sites
• Grid3: USA Grid, 27 sites
• NorduGrid: Denmark, Sweden, Norway, Finland, Germany, Estonia, Slovenia, Slovakia, Australia, Switzerland; 35 sites
Regional centres connected to the LCG Grid (status 07-May-04; ** = not yet in LCG-2):
• Austria: UIBK
• Canada: TRIUMF (Vancouver), Univ. Montreal, Univ. Alberta
• Czech Republic: CESNET (Prague), University of Prague
• France: IN2P3 (Lyon)**
• Germany: FZK (Karlsruhe), DESY, University of Aachen, University of Wuppertal
• Greece: GRNET (Athens)
• Holland: NIKHEF (Amsterdam)
• Hungary: KFKI (Budapest)
• Israel: Tel Aviv University**, Weizmann Institute
• Italy: CNAF (Bologna), INFN Torino, INFN Milano, INFN Roma, INFN Legnaro
• Japan: ICEPP (Tokyo)**
• Poland: Cyfronet (Krakow)
• Portugal: LIP (Lisbon)
• Russia: SINP (Moscow)
• Spain: PIC (Barcelona), IFIC (Valencia), IFCA (Santander), University of Barcelona, Uni. Santiago de Compostela, CIEMAT (Madrid), UAM (Madrid)
• Switzerland: CERN, CSCS (Manno)**
• Taiwan: Academia Sinica (Taipei), NCU (Taipei)
• UK: RAL, Cavendish (Cambridge), Imperial (London), Lancaster University, Manchester University, Sheffield University, QMUL (London)
• USA: FNAL, BNL**
Sites in the process of being connected: China (IHEP, Beijing), India (TIFR, Mumbai), Pakistan (NCP, Islamabad). Hewlett-Packard is to provide "Tier 2-like" services for LCG, initially in Puerto Rico.
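The following minimal Python sketch illustrates the supervisor-executor split described on the "ATLAS New Production System" slide. It is an illustration only: the class names, method signatures and the prod_db interface are invented for this example and are not the actual Windmill, Capone, Dulcinea or Lexor code; only the interaction verbs (numJobsWanted, executeJobs, getStatus) follow the talk.

    from abc import ABC, abstractmethod
    from dataclasses import dataclass

    @dataclass
    class JobDef:
        """One production job as it might be stored in prodDB (illustrative fields only)."""
        job_id: int
        transformation: str          # e.g. "JobTransforms-8.0.1.2"
        input_files: list[str]

    class Executor(ABC):
        """Grid-flavour-specific back end (Capone/Grid3, Dulcinea/NorduGrid, Lexor/LCG)."""
        @abstractmethod
        def num_jobs_wanted(self) -> int: ...
        @abstractmethod
        def execute_jobs(self, jobs: list[JobDef]) -> None: ...
        @abstractmethod
        def get_status(self, job_id: int) -> str: ...

    class Supervisor:
        """Pulls job definitions from the production database and feeds one executor."""
        def __init__(self, prod_db, executor: Executor):
            self.prod_db = prod_db   # assumed to expose fetch_jobs(n) and record_status(id, state)
            self.executor = executor

        def cycle(self) -> None:
            n = self.executor.num_jobs_wanted()      # "numJobsWanted" negotiation
            jobs = self.prod_db.fetch_jobs(n)        # reserve up to n jobs from prodDB
            if jobs:
                self.executor.execute_jobs(jobs)     # "executeJobs"
            for job in jobs:                         # "getStatus" feeds final verification
                self.prod_db.record_status(job.job_id, self.executor.get_status(job.job_id))

One supervisor instance per executor, each polling in a loop, matches the picture of several supervisors on the diagram, with the grid-specific details (data management, replica catalogs) hidden behind the executor interface.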
Windmill-Supervisor
• Supervisor development team at UTA: Kaushik De, Nurcan Ozturk, Mark Sosebee.
• Supervisor-executor communication is via the Jabber protocol, originally developed for instant messaging.
• XML (Extensible Markup Language) messages are passed between supervisor and executor.
• Supervisor-executor interaction:
  • numJobsWanted
  • executeJobs
  • getExecutorData
  • getStatus
  • fixJob
  • killJob
[Diagram: the supervisor sits between the production database and the production manager on one side and the executor, the data management system and the replica catalog on the other.]
• Final verification of jobs is done by the supervisor.
• Windmill webpage: http://www-hep.uta.edu

An Example of XML Messages
numJobsWanted: the supervisor-executor negotiation of the number of jobs to process (a small Python sketch of this exchange is given after the Conclusions).

Supervisor's request (annotations from the slide in parentheses):

  <?xml version="1.0" ?>
  <windmill type="request" user="supervisor" version="0.6">
    <numJobsWanted>
      <minimumResources>
        <transUses>JobTransforms-8.0.1.2 Atlas-8.0.1</transUses>   (software version)
        <cpuConsumption>
          <count>100000</count>                                    (minimum CPU required for a production job)
          <unit>specint2000seconds</unit>                           (unit of CPU usage)
        </cpuConsumption>
        <diskConsumption>
          <count>500</count>                                       (maximum output file size)
          <unit>MB</unit>
        </diskConsumption>
        <ipConnectivity>no</ipConnectivity>                         (whether an IP connection from the CE is required)
        <minimumRAM>
          <count>256</count>                                       (minimum physical memory requirement)
          <unit>MB</unit>
        </minimumRAM>
      </minimumResources>
    </numJobsWanted>
  </windmill>

Executor's response:

  <?xml version="1.0" ?>
  <windmill type="respond" user="executor" version="0.8">
    <numJobsWanted>
      <availableResources>
        <jobCount>5</jobCount>
        <cpuMax>
          <count>100000</count>
          <unit>specint2000</unit>
        </cpuMax>
      </availableResources>
    </numJobsWanted>
  </windmill>

Windmill-Capone Screenshots
[Screenshots of the Windmill supervisor and the Capone executor in operation.]

Grid Tools
What tools are needed for a Grid site? An example: Grid3, the USA Grid.
• A joint project of USATLAS, USCMS, iVDGL, PPDG and GriPhyN.
• Components:
  • VDT based
  • Classic SE (GridFTP)
  • Monitoring: Grid site catalog, Ganglia, MonALISA
  • Two RLS servers and a VOMS server for ATLAS
• Installation:
  • pacman -get iVDGL:Grid3
  • Takes ~4 hours to bring up a site from scratch.
• VDT (Virtual Data Toolkit) version 1.1.14 provides:
  • Virtual Data System 1.2.3
  • Class Ads 0.9.5
  • Condor 6.6.1
  • EDG CRL Update 1.2.5
  • EDG Make Gridmap 2.1.0
  • Fault Tolerant Shell (ftsh) 2.0.0
  • Globus 2.4.3 plus patches
  • GLUE Information Providers
  • GLUE Schema 1.1, extended version 1
  • GPT 3.1
  • GSI-Enabled OpenSSH 3.0
  • Java SDK 1.4.1
  • KX509 2031111
  • MonALISA 0.95
  • MyProxy 1.11
  • Netlogger 2.2
  • PyGlobus 1.0
  • PyGlobus URL Copy 1.1.2.11
  • RLS 2.1.4
  • UberFTP 1.3

Conclusions
• The Grid paradigm works: opportunistic use of existing resources; run anywhere, from anywhere, by anyone...
• Grid computing is a challenge and needs worldwide collaboration.
• Data production using the Grid is possible, and successful so far.
• Data Challenges are the way to test the ATLAS computing model before the real experiment starts.
• Data Challenges also provide data for the physics groups.
• The Data Challenges are a learning and improving experience.
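The sketch below (added as an illustration; not the actual Windmill or Capone code) shows how the numJobsWanted exchange above could be handled with Python's standard xml.etree.ElementTree: it parses the supervisor's request and builds an executor-style response. The answer_num_jobs_wanted function, its free_slots/cpu_per_slot inputs and the simple decision rule are assumptions made for this example.

    import xml.etree.ElementTree as ET

    # The supervisor's request from the "An Example of XML Messages" slide (annotations removed).
    REQUEST = """<?xml version="1.0" ?>
    <windmill type="request" user="supervisor" version="0.6">
      <numJobsWanted>
        <minimumResources>
          <transUses>JobTransforms-8.0.1.2 Atlas-8.0.1</transUses>
          <cpuConsumption><count>100000</count><unit>specint2000seconds</unit></cpuConsumption>
          <diskConsumption><count>500</count><unit>MB</unit></diskConsumption>
          <ipConnectivity>no</ipConnectivity>
          <minimumRAM><count>256</count><unit>MB</unit></minimumRAM>
        </minimumResources>
      </numJobsWanted>
    </windmill>"""

    def answer_num_jobs_wanted(request_xml: str, free_slots: int, cpu_per_slot: int) -> str:
        """Build an executor-style 'respond' message for a numJobsWanted request."""
        req = ET.fromstring(request_xml)
        min_cpu = int(req.findtext(".//cpuConsumption/count", default="0"))

        # Offer jobs only if a worker slot meets the supervisor's minimum CPU requirement.
        job_count = free_slots if cpu_per_slot >= min_cpu else 0

        resp = ET.Element("windmill", type="respond", user="executor", version="0.8")
        avail = ET.SubElement(ET.SubElement(resp, "numJobsWanted"), "availableResources")
        ET.SubElement(avail, "jobCount").text = str(job_count)
        cpu_max = ET.SubElement(avail, "cpuMax")
        ET.SubElement(cpu_max, "count").text = str(cpu_per_slot)
        ET.SubElement(cpu_max, "unit").text = "specint2000"
        return ET.tostring(resp, encoding="unicode")

    print(answer_num_jobs_wanted(REQUEST, free_slots=5, cpu_per_slot=100000))

Running it with free_slots=5 and cpu_per_slot=100000 reproduces the shape of the response message shown on the slide (jobCount 5, cpuMax 100000 specint2000).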