ATLAS Goals and Status
Jim Shank
US LHC OSG Technology Roadmap, May 4-5th, 2005

Outline
• High-level US ATLAS goals
• Milestone summary and goals
• Identified Grid3 shortcomings:
  - Workload management related
  - Distributed data management related
  - VO service related

US ATLAS High-Level Requirements
• Ability to transfer large datasets between CERN, BNL, and Tier2 centers
• Ability to store and manage access to large datasets at BNL
• Ability to produce Monte Carlo at Tier2 and other sites and deliver it to BNL
• Ability to store, serve, catalog, manage, and discover ATLAS-wide datasets, both by users and by ATLAS file transfer services
• Ability to create and submit user analysis jobs to data locations
• Ability to opportunistically access non-dedicated ATLAS resources
• Deliver all of these capabilities at a scale commensurate with the available resources (CPU, storage, network, and human) and the number of users

ATLAS Computing Timeline
2003
• POOL/SEAL release (done)
• ATLAS release 7 (with POOL persistency) (done)
• LCG-1 deployment (done)
2004
• ATLAS completes Geant4 validation (done)
• ATLAS release 8 (done)
• DC2 Phase 1: simulation production (done)
2005 (NOW)
• DC2 Phase 2: intensive reconstruction (the real challenge!) LATE!
• Combined test beams (barrel wedge) (done)
• Computing Model paper (done)
• Rome Physics Workshop
2006
• Computing Memorandum of Understanding (in progress)
• ATLAS Computing TDR and LCG TDR (starting)
• DC3: produce data for PRR and test LCG-n
• Physics Readiness Report
2007
• Start cosmic ray run
• GO!

Grid Pressure
• Grid3/DC2 was a very valuable exercise.
• There is now a lot of pressure on me to develop backup plans in case the "grid middleware" does not come through with the required functionality on our timescale.
• Since manpower is short, this could mean pulling manpower out of our OSG effort.
5 Schedule & Organization Grid Tools & Services 2.3.4.1 Grid Service Infrastructure 2.3.4.2 Grid Workload Management 2.3.4.3 Grid Data Management 2.3.4.4 Grid Integration & Validation 2.3.4.5 Grid User Support ATLAS milestones CSC PRR CRR Start 2007 2006 TDR NOW 2005 DC2 6 Milestone Summary (I) 2.3.4.1 Grid Service Infrastructure May 2005 ATLAS software management service upgrade June 2005 Deployment of OSG 0.2 (expected increments after this) June 2005 ATLAS-wide monitoring and accounting service June 2005 ATLAS site certification service June 2005 LCG interoperability (OSG LCG) July 2005 SRM/dCache deployed on Tier2 Centers Sep 2005 LCG interoperability services for SC05 7 Milestone Summary (II) 2.3.4.2 Grid Workload Management June 2005: Defined metrics for submit host scalability met June 2005: Capone recovery possible without job loss July 2005: Capone2 WMS for ADA Aug 2005: Pre-production Capone2 delivered to Integration team Integration with SRM Job scheduling Provision of Grid WMS for ADA Integration with DDMS Sep 2005: Validated Capone2 + DDMS Oct 2005: Capone2 WMS with LCG Interoperability components 8 Milestone Summary (III) 2.3.4.3 Grid Data Management April 2005: Interfaces and functionality for OSG-based DDM June 2005: Integrate with OSG-based storage services with benchmarks Aug 2005: Interfaces to the ATLAS OSG workload management system (Capone) Nov 2005: Implementing storage authorization and policies 9 Milestone Summary (IV) 2.3.4.4 Grid Integration & Validation March 2005: Deployment of ATLAS on OSG ITB (Integration Testbed) June 2005: Pre-production Capone2 deployed and validated OSG ITB July 2005: Distributed analysis service implemented with WMS August 2005: Integrated DDM+WMS service challenges on OSG Sept 2005: CSC (formerly DC3) full functionality for production service Oct 2005: Large scale distributed analysis challenge on OSG Nov 2005: OSG-LCG Interoperability exercises with ATLAS Dec 2005: CSC full functionality pre-production validation 10 Capone + Grid3 Performance ATLAS DC2 Overview of Grids as of 2005-02-24 18:11:30 Grid submitted pending running finished failed efficiency Grid3 36 3 814 153028 46943 77 % NorduGrid 29 130 1105 114264 70349 62 % LCG 60 528 610 145692 242247 38 % 125 661 2529 412984 359539 53 % TOTAL Capone submitted & managed ATLAS jobs on Grid3 > 150K In 2004, 1.2M CPU-hours Grid3 sites with more than 1000 successful DC2 jobs: 20 Capone instances > 1000 jobs: 13 11 ATLAS DC2 and Rome Production on Grid3 <Capone Jobs/day> = 350 Max # jobs/day = 1020 12 Most Painful Shortcomings “Unprotected” Grid services GT2 Gram and GridFTP vulnerable to multiple users and VOs Frequent manual intervention of site administrators No reliable file transfer service or other data management services such as space management No policy-based authorization infrastructure No distinction between production and individual grid users Lack of a reliable information service Overall robustness and reliability (on client and server) poor 13 ATLAS Production System DDMS data management prodDB (CERN) AMI (Metadata) Don Quijote “DQ” Windmill Requirements for super super super VO and jabber soap soap LCG exe NG exe Core Services Dulcinea Lexor RLS LCG super Nordu Grid G3 exe Capone RLS jabber Grid3 Legacy exe RLS LSF WMS 15 Grid Data Management ATLAS Distributed Data Management System (DDMS) US ATLAS GTS role in this project: Provide input to design: expected interfaces, functionality, and scalability performance and metrics, based on experience with Grid3 and DC2 and 
Grid Data Management
ATLAS Distributed Data Management System (DDMS). The US ATLAS GTS role in this project:
• Provide input to the design: expected interfaces, functionality, scalability, performance and metrics, based on experience with Grid3 and DC2 and on compatibility with OSG services
• Integrate with OSG-based storage services
• Benchmark OSG implementation choices against ATLAS standards
• Specify and develop, as needed, OSG-specific components required for integration with the overall ATLAS system
• Introduce new middleware services as they mature
• Interfaces to the OSG workload management system (Capone)
• Implement storage authorization and policies for role-based usage (reservation, expiration, cleanup, connection to VOMS, etc.) consistent with ATLAS data management tools and services

DDMS Issues: File Transfers
• Dataset discovery and distribution for both production and analysis services
• AOD distribution from CERN to Tier1
• Monte Carlo produced at Tier2 sites delivered to the Tier1
• Support role-based priorities for transfers and space

Reliable File Transfer in DDMS
• MySQL backend; services for agents and clients; transfer requests are queued and scheduled for later execution (DQ evolving, M. Branco); monitoring and accounting of resources (a minimal sketch of this queued-transfer pattern appears after the Scalable Remote Data Access slide below)
• Security and policy issues affecting the core infrastructure:
  - Authentication and authorization infrastructure needs to be in place
  - Priorities for production managers and end users need to be settable
  - Roles to set group and user quotas and permissions

DDMS Interfacing to Storage
LCG-required SRM functions (LCG Deployment Group; CMS input/comments not included yet):
• SRM v1.1 is insufficient, mainly for lack of pinning
• SRM v3 is not required, and its timescale is too late
• Require Volatile and Permanent space; Durable is not practical
• Global space reservation: reserve, release, update (mandatory for LHCb, useful for ATLAS and ALICE); CompactSpace not needed
• Permissions on directories are mandatory; preferably based on roles and not on DN (SRM integrated with VOMS is desirable, but on what timescale?)
• Directory functions (except mv) should be implemented as soon as possible
• Pin/unpin is high priority
• srmGetProtocols is useful but not mandatory
• Abort, suspend, resume request: all low priority
• Relative paths in SURLs are important for ATLAS and LHCb, not for ALICE

Managing Persistent Services
General summary: larger sites and ATLAS-controlled sites should let us run services such as the LRC, space management, and replication. Other sites can be handled by being part of a multi-site domain managed by one of our sites, i.e. their persistent-service needs are covered by services running at our sites, and site-local actions such as space management happen via submitted jobs (working with the remote service, and therefore requiring some remote connectivity or gateway proxying) rather than via local persistent services.

Scalable Remote Data Access
• ATLAS reconstruction and analysis jobs require access to remote database servers at CERN, BNL, and elsewhere
• This presents additional traffic on the network, especially for large sites
• Suggest the use of local mechanisms, such as web proxy caches, to minimize this impact
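As a concrete illustration of the web-proxy-cache suggestion above, the following minimal Python sketch routes HTTP-based catalog or conditions lookups through a site-local cache (e.g. a Squid instance), so that many jobs at a large site do not each hit the central servers at CERN or BNL. The proxy address, environment variable, and URL are placeholders, not part of any ATLAS tool.

    # Minimal sketch, assuming HTTP-based metadata/conditions lookups and a
    # site-local cache; the proxy host, variable name, and URL are placeholders.
    import os
    import urllib.request

    SITE_PROXY = os.environ.get("ATLAS_SITE_PROXY", "http://proxy.mysite.example:3128")

    def open_via_site_cache(url: str) -> bytes:
        """Fetch a cacheable document through the local web proxy cache."""
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": SITE_PROXY}))
        with opener.open(url, timeout=30) as response:
            return response.read()

    if __name__ == "__main__":
        # The first job at the site populates the cache; later jobs are served
        # locally instead of hitting the central server again.
        data = open_via_site_cache("http://conditions.example.org/geometry/DC2.xml")
        print(len(data), "bytes")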
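And here is the queued-transfer sketch referenced under "Reliable File Transfer in DDMS": clients record transfer requests in a database table, and a separate agent claims, executes, and retries them. The table layout, state names, and the use of sqlite3 and plain cp (standing in for the MySQL backend and a grid copy tool such as globus-url-copy) are assumptions made to keep the example self-contained; this is not the DQ implementation.

    # Sketch of the queued-transfer pattern: enqueue requests, let an agent copy
    # the data and retry on failure. Schema and states are illustrative.
    import sqlite3
    import subprocess

    DB = sqlite3.connect("transfer_queue.db")
    DB.execute("""CREATE TABLE IF NOT EXISTS transfers (
                    id INTEGER PRIMARY KEY,
                    source TEXT,
                    destination TEXT,
                    state TEXT DEFAULT 'QUEUED',   -- QUEUED / ACTIVE / DONE / FAILED
                    attempts INTEGER DEFAULT 0)""")

    def request_transfer(source: str, destination: str) -> None:
        """Client side: just enqueue; the agent moves the data later."""
        DB.execute("INSERT INTO transfers (source, destination) VALUES (?, ?)",
                   (source, destination))
        DB.commit()

    def agent_cycle(max_attempts: int = 3) -> None:
        """Agent side: claim queued requests, copy, and record the outcome."""
        queued = DB.execute("SELECT id, source, destination, attempts "
                            "FROM transfers WHERE state = 'QUEUED'").fetchall()
        for row_id, src, dst, attempts in queued:
            DB.execute("UPDATE transfers SET state = 'ACTIVE' WHERE id = ?", (row_id,))
            DB.commit()
            # A real agent would invoke a grid copy tool here; cp keeps the sketch runnable.
            ok = subprocess.call(["cp", src, dst]) == 0
            if ok:
                DB.execute("UPDATE transfers SET state = 'DONE' WHERE id = ?", (row_id,))
            else:
                new_state = "QUEUED" if attempts + 1 < max_attempts else "FAILED"
                DB.execute("UPDATE transfers SET state = ?, attempts = ? WHERE id = ?",
                           (new_state, attempts + 1, row_id))
            DB.commit()

    if __name__ == "__main__":
        request_transfer("/etc/hostname", "/tmp/hostname.copy")   # illustrative paths
        agent_cycle()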
Workload Management
• For LCG, core services contributions are focused on the LCG-RB (Resource Broker), etc.
• Here we address the scalability and stability problems experienced on Grid3.
• Defined metrics, to be achieved by job batching or by DAG appends:
  - 5000 active jobs from a single submit host
  - Submission rate: 30 jobs/minute
  - >90% job efficiency for accepted jobs
  - All of US production managed by one person
• Job state persistency mechanism: Capone recovery possible without job loss

Other Considerations
• Require a reliable information system
• Integration with managed storage:
  - Several resources in OSG will be SRM/dCache, which does not support third-party transfers (used heavily by Capone)
  - NeST is an option for space management
• Job scheduling:
  - Static at present; change to matchmaking based on data location, job queue depth, and policy (a toy ranking sketch appears after the appendix)
• Integration with the ATLAS data management system (evolving):
  - The current system works directly with the file-based RLS
  - The new system will interface with the new ATLAS dataset model and the POOL file catalog interface

Grid Interoperability
Interoperability with LCG:
• A number of proof-of-principle exercises completed:
  - Progress on the GLUE schema with LCG
  - Publication of US ATLAS resources (compute, storage) to the LCG index service (BDII)
  - Use of the LCG-developed Generic Information Provider
  - Submission of ATLAS jobs from LCG to OSG (via an LCG Resource Broker)
  - Submission from an OSG site to LCG services directly via Condor-G
• Demonstration challenge under discussion with LCG in the OSG Interoperability activity
Interoperability with TeraGrid:
• Initial discussions begun in the OSG Interoperability Activity (F. Luehring, US ATLAS, co-chair)
• Initial issues identified: authorization, allocations, platform dependencies

Summary
• We have been busy almost continuously with production since July 2004.
• We uncovered many shortcomings in the Grid3 core infrastructure and in our own services (e.g. WMS scalability, no DDMS).
• Expect to see more ATLAS services and agents on gatekeeper nodes, especially for DDMS.
• We need VO-scoped management of resources by the WMS and DDMS until middleware services supply these capabilities.
• To be compatible with ATLAS requirements and OSG principles, we see resources partitioned (hosted dedicated-VO and guest-VO resources, plus general OSG opportunistic use) to achieve reliability and robustness.
• Time is of the essence: a grid component that fully meets our specs but is delivered late is the same as a failed component; we will already have implemented a workaround.

Appendix

Grid3 System Architecture
[Diagram: elements of the execution environment used to run ATLAS DC2 on Grid3: RLS, DonQuijote, monitoring servers (MonALISA, GridCat, MDS), SE, ProdDB, Windmill, and the Capone submit host (Chimera VDC, Pegasus, Condor-G schedd and GridManager), submitting via GRAM and gsiftp to Grid3 sites (CE, local job scheduler, worker nodes). The dashed (red) box indicates processes executing on the Capone submit host; yellow boxes indicate VDT components. "Capone inside."]

Capone Workload Manager
• Message interface: Web Service and Jabber
• Translation layer: Windmill (Prodsys) schema; Distributed Analysis (ADA) TBD
• Process execution: Process eXecution Engine (PXE), SBIR (FiveSight)
• Processes: grid interface; stub (local shell-script testing); data management TBD
[Diagram: the Capone architecture, showing three common component layers (message protocols via Web Service and Jabber; translation for ADA and Windmill; process execution via the Capone finite state machine, CPE-FSM, and PXE) above selectable modules (Grid via the GCE client, Stub, Data Management) that interact with local and remote services.]
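To illustrate the "Capone recovery possible without job loss" milestone and the CPE-FSM shown in the appendix, here is a minimal sketch of job-state persistency: every state transition is written to disk, so a restarted submit host can reload its jobs rather than lose them. The state names and the shelve-based store are illustrative assumptions, not Capone's actual design.

    # Minimal sketch of job-state persistency, assuming a simplified linear
    # lifecycle; the real Capone FSM has more states and error paths.
    import shelve

    TRANSITIONS = {
        "received": "translated",
        "translated": "submitted",
        "submitted": "running",
        "running": "finished",
    }

    class PersistentJobStore:
        def __init__(self, path: str = "capone_jobs.db"):
            self.store = shelve.open(path)

        def add(self, job_id: str) -> None:
            self.store[job_id] = "received"
            self.store.sync()              # persist before acting on the job

        def advance(self, job_id: str) -> str:
            next_state = TRANSITIONS.get(self.store[job_id], self.store[job_id])
            self.store[job_id] = next_state
            self.store.sync()              # every transition hits disk
            return next_state

        def recover(self) -> dict:
            """After a submit-host restart: reload all jobs and their last state."""
            return dict(self.store)

    if __name__ == "__main__":
        jobs = PersistentJobStore()
        jobs.add("dc2.000042")
        jobs.advance("dc2.000042")         # received -> translated
        print(jobs.recover())              # state survives a restart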
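Finally, the toy ranking sketch referenced under "Other Considerations / Job scheduling": candidate sites are scored by data locality, queue depth, and a policy weight. The site names, weights, and scoring formula are hypothetical; a real matchmaker would take these values from the information service and the data management catalogs.

    # Toy sketch of matchmaking by data location, queue depth, and policy.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Site:
        name: str
        queued_jobs: int
        has_input_data: bool
        policy_weight: float = 1.0     # e.g. dedicated ATLAS site vs. opportunistic

    def rank_sites(sites: List[Site]) -> List[Site]:
        def score(site: Site) -> float:
            data_bonus = 1000.0 if site.has_input_data else 0.0   # strongly prefer data locality
            return (data_bonus - site.queued_jobs) * site.policy_weight
        return sorted(sites, key=score, reverse=True)

    if __name__ == "__main__":
        candidates = [
            Site("BNL_Tier1", queued_jobs=250, has_input_data=True),
            Site("Midwest_Tier2", queued_jobs=10, has_input_data=False),
            Site("Opportunistic_X", queued_jobs=0, has_input_data=False, policy_weight=0.5),
        ]
        print([site.name for site in rank_sites(candidates)])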