Workshop on Spatiotemporal Databases for Geosciences, Biomedical Sciences and Physical Sciences
"Replica Management in LCG"
James Casey, Grid Deployment Group, CERN
E-Science Institute, Edinburgh, 2nd November 2005
James.Casey@cern.ch

Talk Overview
• LHC and the Worldwide LCG Project
• LHC Data Management Architecture
• Replica Management Components
  • Storage
  • Catalog
  • Data Movement
  • User APIs & tools

The LHC Experiments
[Figure: the LHC experiments – CMS, ATLAS, LHCb]

The ATLAS Detector
• The ATLAS collaboration is:
  • ~2000 physicists
  • from ~150 universities and labs
  • from ~34 countries
  • distributed resources
  • remote development
• The ATLAS detector:
  • is 26 m long
  • stands 20 m high
  • weighs 7000 tons
  • has 200 million readout channels

Data Acquisition
[Figure: the ATLAS detector]
• Multi-level trigger
  • Filters out background
  • Reduces data volume
• Record data 24 hours a day, 7 days a week
  • Equivalent to writing a CD every 2 seconds

Worldwide LCG Project – Rationale
• Satisfies the common computing needs of the LHC experiments
• Need to support 5000 scientists at 500 institutes
• Estimated project lifetime: 15 years
• Processing requirements: 100,000 CPUs (2004 units)
• A traditional, centralised approach was ruled out in favour of a globally distributed grid for data storage and analysis:
  • Costs of maintaining and upgrading a distributed system are more easily handled – individual institutes and organisations can fund local computing resources and retain responsibility for them, while still contributing to the global goal.
  • No single points of failure. Multiple copies of data and automatic reassignment of tasks to available resources ensure optimal use of resources. Spanning all time zones also facilitates round-the-clock monitoring and support.
(From http://lcg.web.cern.ch/LCG/overview.html)

LCG Service Deployment Schedule
• Apr 05 – SC2 complete
• Jun 05 – Technical Design Report
• Jul 05 – SC3 throughput test
• Sep 05 – SC3 service phase
• Dec 05 – Tier-1 network operational
• Apr 06 – SC4 throughput test
• May 06 – SC4 service phase starts
• Sep 06 – initial LHC service in stable operation
• Apr 07 – LHC service commissioned
[Timeline figure, 2005–2008: SC2; SC3 (preparation, setup, service); SC4; LHC service operation; cosmics, first beams, first physics, full physics run]

Data Handling and Computation for Physics Analysis
[Figure (Les Robertson): data flows from the detector through the event filter (selection & reconstruction) to raw data; reconstruction produces event summary data; event reprocessing, event simulation and batch physics analysis produce analysis objects (extracted by physics topic), which feed interactive physics analysis]

WLCG Service Hierarchy
• Tier-0 – the accelerator centre
  • Data acquisition & initial processing
  • Long-term data curation
  • Distribution of data to the Tier-1 centres
• Tier-1 – "online" to the data acquisition process, high availability
  • Managed mass storage – grid-enabled data service
  • Data-intensive analysis
  • National, regional support
  • Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Didcot); US – FermiLab (Illinois) and Brookhaven (NY)
• Tier-2 – ~100 centres in ~40 countries
  • Simulation
  • End-user analysis – batch and interactive
(Les Robertson)

How much data in one year?
• Storage space
  • Data produced is ~15 PB/year
  • Space provided at all tiers is ~80 PB
• Network bandwidth
  • 70 Gb/s to the big centres
  • Direct dedicated lightpaths to all centres
  • Used only for Tier-0 → Tier-1 data distribution
• Number of files
  • ~40 million files, assuming 2 GB files and a 15-year run
[Figure: a CD stack holding one year of LHC data would be ~20 km high – compared with a stratospheric balloon (30 km), Concorde (15 km) and Mont Blanc (4.8 km)]

Data Rates to Tier-1s for p-p Running

Centre                      ALICE   ATLAS   CMS    LHCb    Rate into T1 (pp), MB/s
ASGC, Taipei                  -      8%     10%     -        100
CNAF, Italy                  7%      7%     13%    11%       200
PIC, Spain                    -      5%      5%    6.5%      100
IN2P3, Lyon                  9%     13%     10%    27%       200
GridKA, Germany             20%     10%      8%    10%       200
RAL, UK                       -      7%      3%    15%       150
BNL, USA                      -     22%      -      -        200
FNAL, USA                     -      -      28%     -        200
TRIUMF, Canada                -      4%      -      -         50
NIKHEF/SARA, NL              3%     13%      -     23%       150
Nordic Data Grid Facility    6%      6%      -      -         50
Totals                        -      -       -      -      1,600

These rates must be sustained to tape 24 hours a day, 100 days a year (for scale, 1,600 MB/s sustained over 100 days amounts to roughly 14 PB). Extra capacity is required to cater for backlogs and peaks. This is currently our biggest data management challenge.

Problem Definition in One Line…
• "…to distribute, store and manage the high volume of data produced as the result of running the LHC experiments and allow subsequent 'chaotic' analysis of the data"
• The data comprises:
  • Raw data ~90%
  • Processed data ~10%
  • "Relational" metadata ~1%
  • "Middleware-specific" metadata ~0.001%
• The main problem is movement of raw data
  • To the Tier-1 sites as an "online" process – volume
  • To the analysis sites – chaotic access pattern
• We are really dealing with the non-analysis use cases right now

Replica Management Model
• Write-once/read-many files
  • Avoids the issue of replica consistency
  • No mastering
• Users access data via a logical name
  • The actual filename on the storage system is irrelevant
• No strong authorization on the storage itself
  • All users in a VO are considered the same
    – No use of user identity on the MSS
  • Storage uses Unix permissions
    – Different users represent different "roles", e.g. experiment production managers
    – group == VO
• Simple user-initiated replication model
  • upload/replicate/download cycle (sketched below, after the components overview)

Replica Management Model (continued)
• All replicas are considered the same
• A replica is "close" if it is
  • in the same network domain, or
  • explicitly made close to a particular cluster
    – by the information system
    – or by local environment variables
• This is basically the model inherited from the European DataGrid (EDG) Data Management software
  • Although all the software has been replaced!

Replica Management Components
• Each file has a unique Grid ID (GUID); the locations corresponding to the GUID are kept in the Replica Catalog.
• The file transfer service provides reliable, asynchronous third-party file transfer.
• Users select data via metadata, held in the Experiment Metadata Catalog.
• The client interacts with the grid via the experiment framework and the LCG APIs.
• Files have replicas stored at many Grid sites on Storage Elements.
[Figure: client tools interacting with the Experiment Metadata Catalog, the Replica Catalog, the Transfer Service and Storage Elements]
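The upload/replicate/download cycle described above maps directly onto the lcg-util command line tools. Below is a minimal sketch of that cycle, assuming a valid grid proxy and the lcg-util tools on the PATH; the VO name, LFN and storage element hostnames are hypothetical, and option syntax can differ between LCG releases.

```python
#!/usr/bin/env python
# Sketch of the user-initiated upload/replicate/download cycle from the
# "Replica Management Model" slide, driven through the lcg-util CLI tools.
# The VO, LFN and storage element names are hypothetical placeholders.
import subprocess

VO = "atlas"                                          # hypothetical VO
LFN = "lfn:/grid/atlas/user/jdoe/run123/hits.root"    # hypothetical logical name
SRC_SE = "srm.cern.ch"                                # hypothetical "close" SE
DST_SE = "srm.gridka.de"                              # hypothetical remote SE

def run(cmd):
    """Echo a command line and fail loudly if it returns non-zero."""
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# 1. Upload: copy a local file to a Storage Element and register it in the
#    replica catalog under a logical file name (a GUID is created for it).
run(["lcg-cr", "--vo", VO, "-d", SRC_SE, "-l", LFN, "file:/tmp/hits.root"])

# 2. Replicate: create a second replica of the same logical file at another site.
run(["lcg-rep", "--vo", VO, "-d", DST_SE, LFN])

# 3. List replicas: the catalog now holds one SURL per Storage Element.
run(["lcg-lr", "--vo", VO, LFN])

# 4. Download: fetch whichever replica is "close" back to local disk.
run(["lcg-cp", "--vo", VO, LFN, "file:/tmp/hits_copy.root"])
```

Because files are write-once/read-many, registering the replica at upload time is the only catalog update the cycle needs; replication and download never modify the file, so no consistency protocol between replicas is required.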
Software Architecture
• Layered architecture
  • Experiments hook in at whatever layer they require
• Focus on core services
  • Experiments integrate them into their own replication frameworks
  • It is not possible to provide a generic data management model for all four experiments
  • We provide C/Python/Perl APIs and some simple CLI tools
• The data management model is still based on the EDG model
  • The main suggested change is the introduction of a better security model
  • But our users don't really care about it, only about the performance penalty it gives them!

Software Architecture (continued)
• The LCG software model is heavily influenced by EDG
  • The first LCG middleware releases came directly out of the EDG project
• Globus 2.4 is used as a basic lower layer
  • GridFTP for data movement
  • The Globus GSI security model and httpg for web-service security
• We are heavily involved in EGEE
  • We take components out of the EGEE gLite release and integrate them into the LCG release
  • And we write our own components when we need to
    – But that should be a very last resort!
    – (The LCG Data Management team is ~2 FTE)

Layered Data Management APIs
[Figure: layered stack – the Experiment Framework and User Tools sit on top of lcg_utils (data management: replication, indexing, querying), which sits on GFAL, which sits on the component-specific APIs: cataloguing (EDG, LFC), storage (SRM, Classic SE) and data transfer (Globus GridFTP, File Transfer Service)]

Summary: What can we do?
• Store the data
  • Managed grid-accessible storage
  • Including an interface to MSS
• Find the data
  • Experiment metadata catalog
  • Grid replica catalogs
• Access the data
  • LAN "POSIX-like" protocols
  • GridFTP on the WAN
• Move the data
  • Asynchronous, high-bandwidth data movement
  • Throughput more important than latency

Storage Model
• We must manage storage resources in a large, distributed, heterogeneous and unreliable system
• We must make the MSS at Tier-0/Tier-1 and the disk-based storage appear the same to users
• Long-lasting, data-intensive transactions
  • Can't afford to restart jobs
  • Can't afford to lose data, especially data from the experiments
• Heterogeneity
  • Operating systems
  • MSS – HPSS, Enstore, CASTOR, TSM
  • Disk systems – system-attached, network-attached, parallel
• Management issues
  • Need to manage more storage with fewer people

Storage Resource Manager (SRM)
• A collaboration between LBNL, CERN, FNAL, RAL and Jefferson Lab
  • Became the GGF Grid Storage Management Working Group, http://sdm.lbl.gov/srm-wg/
• Provides a common interface to grid storage
  • Exposed as a web service
  • Negotiable transfer protocols (GridFTP, gsidcap, RFIO, …); the resulting request/poll/transfer pattern is sketched after the table below
• We use three different implementations
  • CERN CASTOR SRM – for the CASTOR MSS
  • DESY/FNAL dCache SRM
  • LCG DPM – a disk-only, lightweight SRM for Tier-2s

SRM / MSS by Tier-1

Centre                      SRM      MSS                 Tape H/W
Canada, TRIUMF              dCache   TSM
France, CC-IN2P3            dCache   HPSS                STK
Germany, GridKA             dCache   TSM                 LTO3
Italy, CNAF                 CASTOR   CASTOR              STK 9940B
Netherlands, NIKHEF/SARA    dCache   DMF                 STK
Nordic Data Grid Facility   DPM      N/A                 N/A
Spain, PIC Barcelona        CASTOR   CASTOR              STK
Taipei, ASGC                CASTOR   CASTOR              STK
UK, RAL                     dCache   ADS → CASTOR(?)     STK
USA, BNL                    dCache   HPSS                STK
USA, FNAL                   dCache   Enstore             STK
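SRM hides the difference between disk-only and tape-backed storage behind an asynchronous interface: the client asks for a file, polls the request while the storage system stages it to disk, and only then receives a transfer URL (TURL) for one of the negotiated protocols. The sketch below illustrates only that interaction pattern; StubSRM and its methods are hypothetical stand-ins, not the real SRM web-service operations or any existing client API.

```python
# Illustration of the asynchronous "request / poll / transfer" pattern used by
# SRM-managed storage. StubSRM is a hypothetical in-memory stand-in for a real
# SRM endpoint; no actual web-service calls or real SRM method names are used.
import itertools
import time

class StubSRM:
    """Pretends to stage a file from tape to disk over a few status polls."""
    def __init__(self):
        self._requests = {}
        self._ids = itertools.count(1)

    def request_get(self, surl, protocols):
        req_id = next(self._ids)
        # Pretend the file needs three polls before it is staged to disk.
        self._requests[req_id] = {"surl": surl, "polls_left": 3,
                                  "protocol": protocols[0]}
        return req_id

    def status(self, req_id):
        req = self._requests[req_id]
        if req["polls_left"] > 0:
            req["polls_left"] -= 1
            return "Pending", None
        # A real SRM would return a TURL for the negotiated protocol.
        turl = req["surl"].replace("srm://", req["protocol"] + "://")
        return "Ready", turl

srm = StubSRM()
surl = "srm://srm.example.org/castor/grid/atlas/run123/hits.root"   # hypothetical SURL
req = srm.request_get(surl, protocols=["gsiftp", "gsidcap", "rfio"])

while True:
    state, turl = srm.status(req)
    print("request", req, "state:", state)
    if state == "Ready":
        print("transfer URL:", turl)   # hand this TURL to a GridFTP/dcap client
        break
    time.sleep(0.1)                    # a real client would back off between polls
```

The same pattern works whether the back end is CASTOR, dCache or the disk-only DPM, which is what lets the MSS at Tier-0/Tier-1 and plain disk pools at Tier-2 look the same to users.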
Catalog Model
• Experiments own and control the metadata catalog
  • All interaction with grid files is via a GUID (or LFN) obtained from their metadata catalog
• Two models for tracking replicas
  • A single global replica catalog – LHCb
  • A central metadata catalog storing pointers to site-local catalogs which contain the replica information – ALICE/ATLAS/CMS
• Different implementations are used
  • LHC File Catalog (LFC), Globus RLS, experiment-developed catalogs
• This is a "simple" problem, but we keep revisiting it

Accessing the Data
• Grid File Access Layer (GFAL)
  • Originally a low-level I/O interface to grid storage
    – Provides a "POSIX-like" I/O abstraction
  • Now also provides:
    – A file catalog abstraction
    – An information system abstraction
    – A storage element abstraction (EDG SE, EDG "Classic" SE, SRM v1)
• lcg_util
  • Provides a replacement for the EDG Replica Manager
  • Provides both direct C library calls and CLI tools
  • Is a thin wrapper on top of GFAL
  • Has extra experiment-requested features compared to the EDG Replica Manager

Managed Transfers
• The gLite File Transfer Service (FTS) is a fabric service
  • It provides point-to-point movement of SURLs
  • It aims to provide reliable file transfer between sites, and that's it!
  • It allows sites to control their resource usage
  • It does not do "routing"
  • It does not deal with GUIDs, LFNs, datasets or collections
• It provides
  • Sites with a reliable and manageable way of serving file-movement requests from their VOs
  • Users with an asynchronous, reliable data-movement interface
  • VO developers with a pluggable agent framework to monitor and control the data movement for their VO
  • (A submit-and-poll sketch follows the summary below)

Summary
• LCG will require a large amount of data movement
  • Production use cases demand high-bandwidth distribution of data to many sites in a well-known pattern
  • Analysis use cases will produce chaotic, unknown replica access patterns
• We have a solution for the first problem
  • This is our main focus
  • Tier-1s are "online" to the experiment
• The second is under way
• The accelerator is nearly upon us
  • And then it's full service until 2020!
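An experiment's transfer agent typically drives FTS through its command line clients: submit a source/destination SURL pair, record the returned job identifier, and poll until the job reaches a terminal state. The sketch below assumes the glite-transfer-* clients are installed and a valid proxy exists; the FTS endpoint URL, the SURLs and the exact state names are illustrative and vary between deployments and releases.

```python
# Sketch of driving the gLite File Transfer Service (FTS) from a VO agent via
# the glite-transfer-* command line tools. Endpoint, SURLs and state names are
# hypothetical/illustrative; option details differ between gLite releases.
import subprocess
import time

FTS = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"  # hypothetical endpoint
SRC = "srm://srm.cern.ch/castor/grid/atlas/run123/hits.root"       # hypothetical source SURL
DST = "srm://srm.gridka.de/pnfs/gridka.de/atlas/run123/hits.root"  # hypothetical destination SURL

# Submit a point-to-point SURL -> SURL copy; FTS returns a job identifier.
job_id = subprocess.check_output(
    ["glite-transfer-submit", "-s", FTS, SRC, DST], text=True).strip()
print("submitted FTS job", job_id)

# Poll the job until it reaches a terminal state. Retries, channel scheduling
# and per-VO shares are handled on the server side by the site.
TERMINAL = {"Done", "Finished", "Failed", "Canceled"}   # indicative state names
while True:
    state = subprocess.check_output(
        ["glite-transfer-status", "-s", FTS, job_id], text=True).strip()
    print("job", job_id, "->", state)
    if state in TERMINAL:
        break
    time.sleep(30)
```

Note that the agent only ever hands FTS concrete SURLs: resolving GUIDs or LFNs to SURLs, and registering the new replica afterwards, stays in the experiment framework or the replica catalog layer, exactly as the slide states.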
Thank you
http://lcg.web.cern.ch/LCG/

Backup Slides

Computing Models

Data Replication

Types of Storage in LCG-2
• Three "classes" of storage at sites
  • Integration of a large (tape) MSS (at Tier-1s etc.)
    – It is the site's responsibility to make the integration
  • Large Tier-2s – sites with large disk pools (100s of terabytes, many file servers) need a flexible system
    – dCache provides a good solution
    – Needs effort to integrate and manage
  • Sites with smaller disk pools (1–10 terabytes) and less available management effort
    – Need a lightweight (to install and manage) solution
    – The LCG Disk Pool Manager is a solution for this problem

Catalogs

Catalog Model
• It is the experiment's responsibility to keep the metadata catalog and the replica catalog (either local or global) in sync
  • LCG tools only deal with the global case, since each local case is different
• The LFC can be used as either a local or a global catalog component
• Workload Management picks sites holding a replica by querying a global Data Location Interface (DLI)
  • This can be provided by either
    – the experiment metadata catalog, or
    – a global grid replica catalog (e.g. the LFC)

LCG File Catalog
• Provides a filesystem-like view of grid files (namespace operations are sketched below)
  • Hierarchical namespace and namespace operations
• Integrated GSI authentication + authorization
  • Access control lists (Unix permissions and POSIX ACLs)
  • Fine-grained (file-level) authorization
• Checksums
• User-exposed transaction API
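A brief sketch of the filesystem-like view the LFC exposes, driven through its command line tools. It assumes a valid grid proxy and that LFC_HOST points at the catalog server; the host name and the /grid/atlas paths are hypothetical, and some option details may differ between releases.

```python
# Sketch of the hierarchical namespace offered by the LCG File Catalog (LFC),
# using its command line tools. The catalog host and the /grid/atlas paths are
# hypothetical placeholders.
import os
import subprocess

env = dict(os.environ, LFC_HOST="lfc.example.org")   # hypothetical LFC server

def lfc(*args):
    """Echo and run an LFC command with LFC_HOST set."""
    cmd = list(args)
    print("+", " ".join(cmd))
    subprocess.check_call(cmd, env=env)

# Hierarchical namespace operations, analogous to mkdir/ls on a local filesystem.
lfc("lfc-mkdir", "-p", "/grid/atlas/user/jdoe/run123")
lfc("lfc-ls", "-l", "/grid/atlas/user/jdoe")

# Unix-style permissions apply at directory and file level; entries created by
# lcg-cr under this directory would appear here with their own ACLs.
lfc("lfc-chmod", "750", "/grid/atlas/user/jdoe/run123")
```

The point of the filesystem metaphor is that fine-grained authorization and namespace management can be done with familiar operations, while the GUIDs and SURLs behind each entry remain hidden from the end user.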