Data Management at CERN’s Large Hadron Collider (LHC)
Dirk Düllmann
CERN IT/DB, Switzerland
http://cern.ch/db
http://pool.cern.ch
Outline
• Short Introduction to CERN & LHC
• Data Management Challenges
• The LHC Computing Grid (LCG)
• LCG Data Management Components
• Object Persistency and the POOL Project
• Connecting to the Grid – the LCG Replica Location Service
CERN - The European Organisation for Nuclear Research
The European Laboratory for Particle Physics
• Fundamental research in particle physics
• Designs, builds & operates large accelerators
• Financed by 20 European member states, plus others (US, Canada, Russia, India, …)
  • ~€650M budget, covering operation and new accelerators
  • 2000 staff plus 6000 users (researchers) from all over the world
• Next major research project: the LHC, starting ~2007
  • 4 LHC experiments, each with ~2000 physicists from ~150 universities; apparatus costing ~€300M, computing ~€250M to set up and ~€60M/year to run
  • 10-15 year lifetime
[Aerial view: the 27 km LHC ring near Geneva, with the CERN Computer Centre marked]
The LHC Machine
• Two counter-circulating proton beams
• Collision energy 7+7 TeV
• 27 km of magnets with a field of 8.4 T
• Superfluid helium cooled to 1.9 K
• The world’s largest superconducting structure
Online System
• Multi-level trigger
• Filters out background events
• Reduces the data volume from 40 TB/s to 500 MB/s, a reduction by a factor of about 80,000
LHC Data Challenges
• 4 large experiments, 10-15 year lifetime
• Data rates: 500 MB/s – 1.5 GB/s
• Total data volume: 12-14 PB/year
  • Several hundred PB in total!
• Analysed by thousands of users world-wide
• Data reduced from “raw data” to “analysis data” in a small number of well-defined steps
Data Handling and Computation for Physics Analysis
[Diagram (les.robertson@cern.ch): data flows from the detector through the event filter (selection & reconstruction) into raw data at CERN; event reprocessing produces event summary data; batch physics analysis produces processed data and analysis objects (extracted by physics topic), which feed interactive physics analysis; event simulation contributes additional data]
Planned capacity evolution at CERN
[Charts, 1998-2010: estimated mass storage, disk capacity (TeraBytes) and CPU capacity (K SI95) at CERN, each split into LHC and other experiments, with Moore’s law shown for comparison on the CPU plot]
Multi-Tiered Computing Models - Computing Grids
[Diagram (les.robertson@cern.ch): the multi-tiered LHC computing model, with the CERN Tier 1 computing centre at the core; national Tier 1 centres (UK, USA, France, Italy, Germany, …); Tier 2 regional group centres (labs and universities); Tier 3 physics department resources; and individual desktops]
LHC Data Models
• LHC data models are complex!
  • Typically hundreds (500-1000) of structure types (classes in OO)
  • Many relations between them
  • Different access patterns
• LHC experiments rely on OO technology
  • OO applications deal with networks of objects
  • Pointers (or references) are used to describe inter-object relations
  • Need to support this navigational model in our data store (see the sketch below)
[Diagram: an Event referencing Tracker and Calorimeter data, with a TrackList of Track objects and a HitList of Hit objects]
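A minimal, hypothetical C++ sketch of the kind of object network described above; real experiment models involve hundreds of classes, but the navigational idea is the same: objects refer to each other through pointers or references, and the data store must preserve those links across reads and writes.

```cpp
#include <vector>

struct Hit   { double x = 0, y = 0, z = 0; };

struct Track {
    std::vector<const Hit*> hits;   // references into a HitList, not copies
};

struct Tracker {                    // owns the "TrackList"
    std::vector<Track> tracks;
};

struct Calorimeter {                // owns the "HitList"
    std::vector<Hit> hits;
};

struct Event {                      // the root of the navigational network
    Tracker     tracker;
    Calorimeter calorimeter;
};
```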
What is POOL?
• POOL is the common persistency framework for physics applications at the LHC
  • Pool Of persistent Objects for LHC
• Hybrid store: object streaming & relational database
  • e.g. ROOT I/O for object streaming: complex data, simple consistency model (write once)
  • e.g. an RDBMS for consistent metadata handling: simple data, transactional consistency
• Initiated in April 2002
  • Ramped up over the last year from 1.5 to ~10 FTE
• Common effort between the LHC experiments and the CERN database group
  • Shared project scope, architecture and development
  • => Rapid feedback cycles between the project and its users
• First larger data productions starting now!
Component Architecture
• POOL (like most other LCG software) is based on a strict component software approach
  • Components provide technology-neutral APIs
  • Components communicate with each other only via abstract component interfaces
• Goal: insulate the very large experiment software systems from the concrete implementation details and technologies used today
  • POOL user code does not depend on any implementation libraries
  • No link-time dependency on any implementation packages (e.g. MySQL, ROOT, Xerces-C…)
  • Component implementations are loaded at runtime via a plug-in infrastructure (a minimal sketch of the pattern follows)
• The POOL framework consists of three major, weakly coupled domains
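As a minimal sketch of this component pattern (hypothetical names, not the actual POOL interfaces): user code programs against an abstract API, and a concrete back end is selected and loaded at runtime, so there is no link-time dependency on ROOT, MySQL, etc.

```cpp
#include <iostream>
#include <memory>
#include <stdexcept>
#include <string>

// Technology-neutral API: user code sees only this abstract interface.
class IStorageSvc {
public:
    virtual ~IStorageSvc() = default;
    virtual void writeObject(const std::string& container) = 0;
};

// One concrete back end; in a plug-in architecture this would live in a
// separate shared library that user code never links against.
class RootStorageSvc : public IStorageSvc {
public:
    void writeObject(const std::string& container) override {
        std::cout << "streaming object into " << container << "\n";
    }
};

// Stand-in for a plug-in loader; a real one would dlopen() a shared
// library and look up a factory symbol by component name.
std::unique_ptr<IStorageSvc> loadComponent(const std::string& name) {
    if (name == "ROOT_StorageSvc") return std::make_unique<RootStorageSvc>();
    throw std::runtime_error("unknown component: " + name);
}

int main() {
    auto svc = loadComponent("ROOT_StorageSvc");  // chosen at runtime
    svc->writeObject("/Event/Tracks");            // same call for any back end
}
```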
POOL Components
[Diagram: the POOL API spans three domains, each with pluggable implementations]
• Storage Service: ROOT I/O Storage Svc, RDBMS Storage Svc
• FileCatalog: XML Catalog, MySQL Catalog, EDG Replica Location Service
• Collections: Explicit Collection, Implicit Collection
POOL Generic Storage Hierarchy
• An application may access databases (e.g. streaming files) from one or more file catalogs
• Each database is structured into containers of one specific technology (e.g. ROOT trees or RDBMS tables)
• POOL provides a “smart pointer” type, pool::Ref<UserClass>, to
  • transparently load objects from the back end into a client-side cache
  • define persistent inter-object associations across file or technology boundaries
[Hierarchy: POOL Context → FileCatalog → Database → Container → Object]
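The following toy C++ class illustrates the idea behind such a smart pointer; it is not the real pool::Ref implementation, just a sketch of transparent load-on-dereference with a client-side cache.

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Toy lazy-loading reference: dereferencing fetches the object from the
// "store" on first use and keeps it in a client-side cache afterwards.
template <typename T>
class Ref {
public:
    explicit Ref(std::function<std::unique_ptr<T>()> loader)
        : loader_(std::move(loader)) {}

    T* operator->() const {
        if (!cached_) cached_ = loader_();        // transparent load
        return cached_.get();
    }

private:
    std::function<std::unique_ptr<T>()> loader_;  // knows how to fetch
    mutable std::unique_ptr<T> cached_;           // client-side cache
};

struct Track { double chi2 = 1.7; };

int main() {
    // The loader stands in for reading the object out of a container.
    Ref<Track> ref([] { return std::make_unique<Track>(); });
    std::cout << ref->chi2 << "\n";  // first access triggers the load
}
```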
Data Dictionary & Storage
[Diagram: dictionary generation and data I/O. C++ headers (or an abstract DDL) are parsed with GCC-XML and fed to a code generator, which produces LCG dictionary code; the LCG dictionary drives technology-dependent Data I/O via reflection, connects to the CINT dictionary through a gateway, and serves other clients as well]
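As a rough illustration (hypothetical structures, not the actual LCG dictionary API), a generated reflection dictionary boils down to per-class metadata that a generic storage service can walk at runtime, without compile-time knowledge of the user classes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct FieldInfo {
    std::string name;    // e.g. "chi2"
    std::string type;    // e.g. "double"
    std::size_t offset;  // byte offset of the member within the object
};

struct ClassInfo {
    std::string            name;    // e.g. "Track"
    std::vector<FieldInfo> fields;  // derived from the parsed C++ header
};

struct Track { double chi2; int nHits; };

// A hand-written stand-in for what a code generator would emit:
const ClassInfo trackDict{
    "Track",
    { {"chi2",  "double", offsetof(Track, chi2)},
      {"nHits", "int",    offsetof(Track, nHits)} }
};

// A generic I/O service can iterate over trackDict.fields and read or
// write each member of a Track object purely from this metadata.
```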
POOL File Catalog
• Files are referred to inside POOL via a unique and immutable file identifier, generated by the system at file creation time
  • This provides stable inter-file references
• FileIDs are implemented as Globally Unique Identifiers (GUIDs)
  • Consistent sets of files with internal references can be created without requiring a central ID allocation service
  • Catalog fragments created independently can later be merged without modification to the corresponding data files
[Diagram: logical file names (LFN1, LFN2, …, LFNn) map to a FileID, which carries the file identity and metadata and maps to physical file names (PFN1, PFN2, …, PFNn) with their storage technology; logical naming on one side, object lookup on the other]
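A minimal model of these mappings, using hypothetical types rather than the real POOL interface: logical names resolve to a GUID, and the GUID resolves to one or more physical replicas.

```cpp
#include <map>
#include <string>
#include <vector>

struct PhysicalFile {
    std::string pfn;         // physical file name
    std::string technology;  // e.g. "ROOT", "RDBMS"
};

struct FileCatalog {
    // Logical naming: zero or more LFN aliases per GUID.
    std::map<std::string, std::string> lfnToGuid;
    // Object lookup: GUID to all known physical replicas.
    std::map<std::string, std::vector<PhysicalFile>> guidToPfns;

    PhysicalFile lookup(const std::string& lfn) const {
        const std::string& guid = lfnToGuid.at(lfn);
        return guidToPfns.at(guid).front();  // any replica will do
    }
};
```

Because the GUID is generated at file creation time, independently built catalog fragments can be merged by simply taking the union of both maps, with no renaming and no changes to the data files themselves.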
EDG Replica Location Services - Basic Functionality
• Each file has a unique GUID; the locations corresponding to a GUID are kept in the Replica Location Service
• Users may assign aliases to the GUIDs; these are kept in the Replica Metadata Catalog
• Files have replicas stored at many Grid sites on Storage Elements
• The Replica Manager provides atomicity for file operations, assuring consistency of Storage Element and catalog contents
[Diagram (james.casey@cern.ch): the Replica Manager mediating between the Replica Metadata Catalog, the Replica Location Service and the Storage Elements]
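The division of labour can be sketched as follows (hypothetical types, not the EDG API): the metadata catalog maps user aliases to GUIDs, the location service maps GUIDs to storage elements, and the Replica Manager updates the catalogs together with the file operations.

```cpp
#include <map>
#include <set>
#include <string>

struct ReplicaMetadataCatalog {                  // user alias -> GUID
    std::map<std::string, std::string> aliases;
};

struct ReplicaLocationService {                  // GUID -> sites with a copy
    std::map<std::string, std::set<std::string>> locations;
};

struct ReplicaManager {
    ReplicaMetadataCatalog& rmc;
    ReplicaLocationService& rls;

    void addAlias(const std::string& alias, const std::string& guid) {
        rmc.aliases[alias] = guid;
    }

    // Register a freshly copied replica. In the real service this is done
    // atomically with the copy itself, so the catalogs never point at a
    // file that is absent from the Storage Element.
    void addReplica(const std::string& guid, const std::string& site) {
        rls.locations[guid].insert(site);
    }
};
```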
Interactions with other Grid Middleware Components
[Diagram (james.casey@cern.ch): the Replica Manager, used from a User Interface or Worker Node, interacting with the Resource Broker, Virtual Organization Membership Service, Information Service, Replica Metadata Catalog, Replica Location Service, Replica Optimization Service, Network Monitor and Storage Elements]
• Applications and users interface to data through the Replica Manager, either directly or through the Resource Broker
RLS Service Goals
• To offer production-quality services for LCG 1 that meet the requirements of forthcoming (and current!) data challenges
  • e.g. CMS PCP/DC04, ALICE PDC-3, ATLAS DC2, LHCb CDC’04
• To provide distribution kits, scripts and documentation to assist other sites in offering production services
• To leverage many years’ experience in running such services at CERN and other institutes
  • Monitoring, backup & recovery, tuning, capacity planning, …
• To understand experiments’ requirements for how these services should be established and extended, and to clarify current limitations
• Not targeting small or medium scale database applications that need to be run and administered locally (close to the user)
Conclusions
• Data management at the LHC remains a significant challenge because of the data volume, the project lifetime, and the complexity of the software and hardware setups
• The LHC Computing Grid (LCG) approach builds on middleware projects such as EDG and Globus and uses a strict component approach for physics application software
• The LCG POOL project has developed a technology-neutral persistency framework which is currently being integrated into the experiments’ production systems
• In conjunction with POOL, a data catalog production service is provided to support several upcoming data productions in the hundreds-of-terabytes range
LHC Software Challenges
• Experiment software systems are large and complex
  • Developed by teams of expert developers
  • Permanent evolution and improvement for years
• Analysis is performed by many end-user developers
  • Often participating only for a short time
  • Usually without a strong computer science background
  • Need a simple and stable software environment
• Need to manage change over a long project lifetime
  • Migration to new software and implementation languages
  • New computing platforms and storage media
  • New computing paradigms?
• The data management system needs to be designed to confine the impact of unavoidable change during the project