Data Management at CERN's Large Hadron Collider (LHC)
Dirk Düllmann, CERN IT/DB, Switzerland
http://cern.ch/db  http://pool.cern.ch

Outline
• Short introduction to CERN & the LHC
• Data management challenges
• The LHC Computing Grid (LCG)
• LCG data management components
• Object persistency and the POOL project
• Connecting to the Grid – the LCG Replica Location Service

CERN – The European Organization for Nuclear Research
The European Laboratory for Particle Physics
• Fundamental research in particle physics
• Designs, builds & operates large accelerators
• Financed by 20 European countries (member states) plus others (US, Canada, Russia, India, …)
• ~€650M budget for operation and new accelerators
• 2000 staff plus 6000 users (researchers) from all over the world
• Next major research project: the LHC, starting ~2007
• 4 LHC experiments, each with ~2000 physicists from ~150 universities; apparatus costing ~€300M, computing ~€250M to set up and ~€60M/year to run
• 10–15 year lifetime
[Aerial photo: the 27 km LHC ring near Geneva, with the CERN computer centre marked.]

The LHC Machine
• Two counter-circulating proton beams
• Collision energy 7+7 TeV
• 27 km of magnets with a field of 8.4 Tesla
• Superfluid helium cooled to 1.9 K
• The world's largest superconducting structure

Online System
• Multi-level trigger
• Filters out background
• Reduces the data volume from 40 TB/s to 500 MB/s

LHC Data Challenges
• 4 large experiments, 10–15 year lifetime
• Data rates: 500 MB/s – 1.5 GB/s
• Total data volume: 12–14 PB/year, several hundred PB in total!
• Analysed by thousands of users world-wide
• Data is reduced from "raw data" to "analysis data" in a small number of well-defined steps

Data Handling and Computation for Physics Analysis
[Diagram (les.robertson@cern.ch): data flows from the detector through the event filter (selection & reconstruction) to raw data and event summary data; event reprocessing, event simulation and batch physics analysis produce analysis objects (extracted by physics topic) for interactive physics analysis.]

Planned Capacity Evolution at CERN
[Charts: estimated disk capacity (TeraBytes), mass storage (PetaBytes) and CPU capacity (K SI95) at CERN for 1998–2010, each split between LHC and other experiments; CPU growth is compared against Moore's law.]

Multi-Tiered Computing Models – Computing Grids
[Diagram (les.robertson@cern.ch): the LHC computing model – CERN as the central Tier 1 centre, regional Tier 1 centres (UK, France, Italy, Germany, USA, …), Tier 2 regional group centres, Tier 3 physics department resources, and desktops at labs and universities.]

LHC Data Models
• LHC data models are complex!
  • typically hundreds (500–1000) of structure types (classes in OO)
  • many relations between them
  • different access patterns
• LHC experiments rely on OO technology
  • OO applications deal with networks of objects
  • pointers (or references) are used to describe inter-object relations
• Need to support this navigational model in our data store (see the sketch below)
[Diagram: an example object network – an Event referencing Tracker and Calorimeter objects, a TrackList holding Track objects, and a HitList holding Hit objects.]
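To make the navigational model concrete, here is a minimal C++ sketch. The class and member names are hypothetical illustrations, not taken from any experiment framework; real models contain hundreds of such types with far richer relations.

```cpp
#include <memory>
#include <vector>

// Hypothetical, heavily simplified event-model classes.
struct Hit {
    double x, y, z;  // a measured space point
};

struct Track {
    std::vector<std::shared_ptr<Hit>> hits;  // a track references its hits
    double momentum = 0.0;
};

struct Tracker {
    std::vector<std::shared_ptr<Track>> tracks;
};

struct Calorimeter {
    std::vector<std::shared_ptr<Hit>> hits;
};

struct Event {
    long id = 0;
    std::shared_ptr<Tracker> tracker;    // inter-object relations form a
    std::shared_ptr<Calorimeter> calor;  // network rooted at the event
};

// Navigational access: follow references downwards from the event
// instead of scanning flat records.
double highestMomentum(const Event& ev) {
    double best = 0.0;
    if (!ev.tracker) return best;
    for (const auto& trk : ev.tracker->tracks)
        if (trk->momentum > best) best = trk->momentum;
    return best;
}
```

A persistent store for such data must preserve these references so that a later job can re-navigate the same object network.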
What is POOL?
• POOL is the common persistency framework for physics applications at the LHC
  • POOL = "Pool Of persistent Objects for LHC"
• A hybrid store: object streaming & relational databases
  • e.g. ROOT I/O for object streaming – complex data, simple consistency model (write once)
  • e.g. an RDBMS for consistent meta-data handling – simple data, transactional consistency
• Initiated in April 2002; ramped up over the last year from 1.5 FTE to ~10 FTE
• A common effort between the LHC experiments and the CERN Database group
  • joint project scope, architecture and development
  • => rapid feedback cycles between the project and its users
• First larger data productions are starting now!

Component Architecture
• POOL (like most other LCG software) is based on a strict component software approach
  • components provide technology-neutral APIs
  • and communicate with other components only via abstract component interfaces
• Goal: insulate the very large experiment software systems from the concrete implementation details and technologies used today
• POOL user code does not depend on any implementation libraries
  • no link-time dependency on any implementation packages (e.g. MySQL, ROOT, Xerces-C, …)
  • component implementations are loaded at runtime via a plug-in infrastructure
• The POOL framework consists of three major, weakly coupled domains

POOL Components
[Diagram: the POOL API sits on top of three domains – the Storage Service (ROOT I/O storage service, RDBMS storage service), the FileCatalog (XML catalog, MySQL catalog, EDG Replica Location Service) and Collections (explicit and implicit collections).]

POOL Generic Storage Hierarchy
• An application may access databases (e.g. streaming files) from one or more file catalogs
• Each database is structured into containers of one specific technology (e.g. ROOT trees or RDBMS tables)
• POOL provides a "smart pointer" type, pool::Ref<UserClass>, which
  • transparently loads objects from the back end into a client-side cache
  • defines persistent inter-object associations across file or technology boundaries
[Diagram: the storage hierarchy – POOL context → FileCatalog → Database → Container → Object.]

Data Dictionary & Storage
[Diagram: dictionary generation and use – C++ headers (or an abstract DDL) are parsed with GCC-XML, a code generator produces LCG dictionary code, and the resulting LCG dictionary serves reflection-based, technology-dependent data I/O, with a gateway to the CINT dictionary for ROOT I/O and other clients.]

POOL File Catalog
• Files are referred to inside POOL via a unique and immutable file identifier (FileID), generated by the system at file creation time
  • this provides stable inter-file references
• FileIDs are implemented as Globally Unique Identifiers (GUIDs)
  • this allows consistent sets of files with internal references to be created without requiring a central ID allocation service
  • catalog fragments created independently can later be merged without modifying the corresponding data files
[Diagram: the FileID at the centre, mapping logical file names (LFN1…LFNn, logical naming) on one side to physical file names with their technology (PFN1…PFNn, object lookup) on the other, together with file identity and metadata.]

EDG Replica Location Services – Basic Functionality
• Each file has a unique GUID; users may assign aliases to the GUIDs, which are kept in the Replica Metadata Catalog
• Files have replicas stored at many Grid sites on Storage Elements; the locations corresponding to a GUID are kept in the Replica Location Service
• The Replica Manager provides atomicity for file operations, ensuring consistency of Storage Element and catalog contents
[Diagram (james.casey@cern.ch): the Replica Manager coordinating the Replica Metadata Catalog, the Replica Location Service and the Storage Elements.]
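The two mappings just described can be pictured with a small sketch. This is a hedged illustration with hypothetical type and function names, not the actual POOL or EDG catalog API: the system-generated GUID is the pivot, with user-assigned aliases (LFNs) on one side and per-site replica locations (PFNs) on the other.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the two catalog mappings:
// GUID -> aliases (Replica Metadata Catalog) and
// GUID -> physical replicas (Replica Location Service).
struct Replica {
    std::string pfn;         // physical file name on a Storage Element
    std::string technology;  // e.g. "ROOT" or an RDBMS back end
};

struct Catalog {
    std::map<std::string, std::vector<std::string>> aliases;   // GUID -> LFNs
    std::map<std::string, std::vector<Replica>>     replicas;  // GUID -> PFNs

    // Registering a replica never touches the stable GUID, so file sets
    // written at independent sites can be merged without rewriting data.
    void addReplica(const std::string& guid, Replica r) {
        replicas[guid].push_back(std::move(r));
    }
};

int main() {
    Catalog cat;
    // Shown truncated; real GUIDs are 128-bit values.
    const std::string guid = "A1B2C3D4-...";
    cat.aliases[guid].push_back("lfn:/grid/run42/events.root");
    cat.addReplica(guid, {"srm://se1.cern.ch/data/events.root", "ROOT"});
    cat.addReplica(guid, {"srm://se2.example.org/data/events.root", "ROOT"});

    for (const auto& rep : cat.replicas[guid])
        std::cout << guid << " -> " << rep.pfn << '\n';
}
```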
Interactions with other Grid Middleware Components
• Applications and users interface to data through the Replica Manager, either directly or through the Resource Broker
[Diagram (james.casey@cern.ch): the Replica Manager interacting with the Resource Broker, the Virtual Organization Membership Service, the Information Service, the Replica Metadata Catalog, the Replica Location Service, the Replica Optimization Service, the Network Monitor, the Storage Elements, and the User Interface or Worker Node.]

RLS Service Goals
• To offer production-quality services for LCG 1 that meet the requirements of forthcoming (and current!) data challenges
  • e.g. CMS PCP/DC04, ALICE PDC-3, ATLAS DC2, LHCb CDC'04
• To provide distribution kits, scripts and documentation to assist other sites in offering production services
• To leverage the many years' experience in running such services at CERN and other institutes
  • monitoring, backup & recovery, tuning, capacity planning, …
• To understand the experiments' requirements for how these services should be established and extended, and to clarify current limitations
• Not targeting small-to-medium-scale database applications that need to be run and administered locally (close to the user)

Conclusions
• Data management at the LHC remains a significant challenge because of the data volume, the project lifetime, and the complexity of the software and hardware setups
• The LHC Computing Grid (LCG) builds on middleware projects such as EDG and Globus, and uses a strict component approach for physics application software
• The LCG POOL project has developed a technology-neutral persistency framework which is currently being integrated into the experiments' production systems
• In conjunction with POOL, a data catalog production service is provided to support several upcoming data productions in the hundreds-of-terabytes range

LHC Software Challenges
• Experiment software systems are large and complex
  • developed by teams of expert developers
  • in permanent evolution and improvement for years…
• Analysis is performed by many end-user developers
  • often participating only for a short time
  • usually without a strong computer-science background
  • they need a simple and stable software environment
• Need to manage change over a long project lifetime
  • migration to new software and implementation languages
  • new computing platforms, storage media
  • new computing paradigms???
• The data management system needs to be designed to confine the impact of such unavoidable change during the project (see the component sketch below)
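To close, here is a minimal sketch of the component idea that runs through the talk, assuming hypothetical class and function names rather than POOL's real plug-in API: clients compile only against a technology-neutral interface, and the concrete back end is chosen at runtime, so a later technology migration does not touch user code.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>

// Technology-neutral component interface: clients depend on this
// header only, with no link-time dependency on any back end.
class IStorageSvc {
public:
    virtual ~IStorageSvc() = default;
    virtual void write(const std::string& container,
                       const void* data, std::size_t size) = 0;
};

// Concrete implementations would live in separately loaded plug-in
// libraries; two stubs stand in for them here.
class RootStorageSvc : public IStorageSvc {
    void write(const std::string&, const void*, std::size_t) override {
        // would stream the object via ROOT I/O
    }
};

class RdbmsStorageSvc : public IStorageSvc {
    void write(const std::string&, const void*, std::size_t) override {
        // would store the data in relational tables
    }
};

// Stand-in for a plug-in loader: a real framework would load a shared
// library selected by name at runtime instead of this hard-coded choice.
std::unique_ptr<IStorageSvc> loadStorageSvc(const std::string& tech) {
    if (tech == "ROOT")  return std::make_unique<RootStorageSvc>();
    if (tech == "RDBMS") return std::make_unique<RdbmsStorageSvc>();
    throw std::runtime_error("unknown storage technology: " + tech);
}
```

Swapping "ROOT" for "RDBMS" here changes the storage technology without recompiling the client, which is exactly the insulation the component architecture aims for over a 10–15 year project lifetime.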