Nuclear Physics Data Management Needs
Bruce G. Gibbard
SLAC DMW2004 Workshop
16-18 March 2004
Overview
 Addressing a class of Nuclear Physics (NP) experiments utilizing large particle detector systems to study accelerator-produced reactions
o Examples at: BNL (RHIC), JLab, CERN (LHC)
 Technologies & data management needs of this branch of NP are quite similar to those of HEP
 Integrating across its four experiments, the Relativistic Heavy Ion Collider (RHIC) at BNL is currently the most prolific producer of data
o Study of very high energy collisions of heavy ions (up to Au on Au)
o High nucleon count, high energy => high multiplicity
o High multiplicity, high luminosity, and fine detector granularity => very high data rates
o Raw data recording at up to ~250 MBytes/sec
17 March 2004
B. Gibbard
[Figure: Digitized event in STAR at RHIC]
IT Activities of Such NP Experiments
 Support the basic computing infrastructure for the experimental collaboration
o Typically large (100’s of physicists) and internationally distributed
o Manage & distribute code, design, cost, & schedule databases
o Facilitate communication, documentation, and decision making
 Store, process, support analysis of, and serve data
o Online recording of Raw data
o Generation and recording of Simulated data
o Construction of Summary data from Raw and Simulated data
o Iterative generation of Distilled Data Subsets from Summary data
o Serve Distilled Data Subsets and analysis capability to widely distributed individual physicists
Data Intensive Activities
Generic Computing Model
[Diagram: the Detector System writes Raw Data; Reconstruction, using a Calibrations & Conditions DB, produces Summary Data; Data Mining, using a Physics Tag DB and physics-based indices, produces Skimmed / Streamed / Distilled Data; Individual Analysis yields Derived Physics Data and, via Data Display, Final Results. Data Handling throughout is tracked with a limited Bookkeeping DB, Meta-Data, and Provenance.]
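The stages in the diagram above can be sketched as a minimal pipeline. The stage names follow the diagram, but the record fields (`hits`, `trigger`, `tag`) and all function bodies are hypothetical placeholders, not anything from an actual RHIC software framework.

```python
# Sketch of the generic computing model above. Stage names follow the
# diagram; record contents are invented for illustration.

def reconstruct(raw_event, calibrations):
    """Reconstruction: Raw Data + Calibrations & Conditions DB -> Summary Data."""
    return {"tracks": raw_event.get("hits", []),
            "tag": raw_event.get("trigger", "minbias"),
            "calib": calibrations["version"]}

def mine(summary_events, tag_db):
    """Data Mining: keep events whose physics tag is in the Physics Tag DB."""
    return [e for e in summary_events if e["tag"] in tag_db]

def analyze(distilled_events):
    """Individual Analysis: Distilled Data -> Derived Physics Data."""
    return {"n_events": len(distilled_events)}
```

The point of the sketch is the one-way flow: each stage consumes the previous stage's output plus a small database, which is why bookkeeping and provenance tracking sit alongside every arrow in the diagram.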
Data Volumes in Current RHIC Run
 Raw Data (PHENIX)
o Peak rates to 120 MBytes/sec
o First 2 months of ’04 (Jan & Feb):
• 10^9 Events
• 160 TBytes
o Project ~225 TBytes of Raw data for the current run
 Derived Data (PHENIX)
o Construction of Summary Data from Raw Data, then production of distilled subsets from that Summary Data
o Project ~270 TBytes of Derived data
 Total (all of RHIC) = 1.2 PBytes for the current run
o STAR = PHENIX
o BRAHMS + PHOBOS = ~40% of PHENIX
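The totals quoted above are mutually consistent; a quick back-of-envelope check (all figures are from the slide, only the arithmetic is added):

```python
# Check that the per-experiment figures reproduce the ~1.2 PByte RHIC total.
phenix_raw_tb = 225        # projected Raw data, PHENIX, current run
phenix_derived_tb = 270    # projected Derived data, PHENIX
phenix_total_tb = phenix_raw_tb + phenix_derived_tb   # 495 TB

star_total_tb = phenix_total_tb                       # "STAR = PHENIX"
small_expts_tb = 0.4 * phenix_total_tb                # BRAHMS + PHOBOS ~ 40%

rhic_total_pb = (phenix_total_tb + star_total_tb + small_expts_tb) / 1000
print(f"RHIC total: {rhic_total_pb:.2f} PBytes")      # ~1.19, i.e. ~1.2 PB

# Implied mean event size from the first two months of '04:
events = 1e9
raw_bytes = 160e12
print(f"mean event size: {raw_bytes / events / 1e3:.0f} kB")  # 160 kB/event
```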
RHIC Raw Data Recording Rate
[Chart: raw data recording rate vs. time; both PHENIX and STAR peak near 120 MBytes/sec]
Current RHIC Technology
 Tertiary Storage
o StorageTek / HPSS
o 4 Silos – 4.5 PBytes (1.5 PBytes currently filled)
o 1000 MB/sec theoretical native I/O bandwidth
 Online Storage
o Central NFS-served disk
• ~170 TBytes of FibreChannel-connected RAID 5
• ~1200 MBytes/sec served by 32 Sun SMPs
o Distributed disk
• ~300 TBytes of SCSI/IDE
• Locally mounted on Intel/Linux farm nodes
 Compute
o ~1300 dual-processor Red Hat Linux / Intel nodes
o ~2600 CPUs => ~1,400 kSPECint2K (3-4 TFLOPS)
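The farm figures above imply a per-CPU rating; a quick derived check (the per-CPU number is computed here, not stated on the slide):

```python
# Sanity arithmetic on the quoted farm capacity.
nodes = 1300
cpus = 2 * nodes                 # dual-processor nodes -> ~2600 CPUs
total_kspecint2k = 1400

per_cpu = total_kspecint2k * 1000 / cpus
print(f"~{per_cpu:.0f} SPECint2K per CPU")   # ~538
```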
Projected Growth in Capacity Scale
 Moore’s Law effect of component replacement in experiment DAQs & in computing facilities => ~x6 increase in 5 years
[Chart: Disk Volume at RHIC; Disk (TBytes) on a 0-3000 scale vs. Year, 2001-2008]
 Not-yet-fully-specified requirements of the RHIC II and eRHIC upgrades are likely to accelerate growth
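The ~x6-in-5-years figure above corresponds to an annual growth factor and an effective doubling time; both are derived here, not stated on the slide:

```python
import math

# Derive the annual growth rate and doubling time implied by "x6 in 5 years".
factor, years = 6.0, 5.0
annual = factor ** (1 / years)
doubling_years = years * math.log(2) / math.log(factor)

print(f"annual growth: x{annual:.2f}")               # ~x1.43 per year
print(f"doubling time: {doubling_years:.1f} years")  # ~1.9 years
```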
NP Analysis Limitations (1)
 Underlying the Data Management issue:
o Events (interactions) of interest are rare relative to minimum-bias events
• Threshold / phase-space effect for each new energy domain
o Combinatorics of large-multiplicity events of all kinds confound selection of interesting events
o Combinatorics also create backgrounds to signals of interest
 Two analysis approaches
o Topological: typically with
• Many qualitative &/or quantitative constraints on the data sample
• Relatively low background to signal
• Modest number of events in the final analysis data sample
o Statistical: frequently with
• A more poorly constrained sample
• Large background (the signal is a small difference between large numbers)
• Large number of events in the final analysis data sample
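Why a "small difference between large numbers" drives up sample sizes can be illustrated with hypothetical counts (the numbers below are invented for illustration, not from the talk): the Poisson error on S = N_observed - B scales like sqrt(N_observed + B), so a small S sits on a large statistical error.

```python
import math

# Hypothetical statistical analysis: a 1% signal excess over a large background.
background = 1_000_000        # assumed background event count
signal = 10_000               # assumed signal: a 1% excess
observed = background + signal

# Poisson errors on the two large counts add in quadrature.
error = math.sqrt(observed + background)
print(f"S = {signal} +/- {error:.0f} "
      f"({100 * error / signal:.0f}% relative uncertainty)")
```

Shrinking that ~14% uncertainty means recording and repeatedly re-reading many more events, which is exactly the data-management pressure described above.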
NP Analysis Limitations (2)
 It seems that it is less frequently possible to do Topological Analyses in NP than in HEP, so Statistical Analyses are more often required
o Evidence for this is rather anecdotal – not all would agree
o To the extent that it is true, final analysis data sets tend to be large
o These are the data sets accessed very frequently by large numbers of users … thus exacerbating the data management problem
 In any case, the extraction and delivery of distilled data subsets to physicists for analysis currently most limits NP analyses
Grid / Data Management Issues
 Major RHIC experiments are moving (or have moved) complete copies of Summary Data to regional analysis centers
o STAR: to LBNL via Grid tools
o PHENIX: to RIKEN via tape/airfreight
o Evolution toward more sites and full dependence on the Grid
 RHIC, JLab, and NP at the LHC are all very interested and active in Grid development
o Including high-performance, reliable Wide Area data movement / replication / access services
Conclusions
 NP and HEP accelerator/detector experiments have very similar Data Management requirements
 NP analyses of this type currently tend to be more Data-limited than CPU-limited
 “Mining” of Summary Data, and affording end users adequate access (both Local and Wide Area) to the resulting distillate, currently most limits NP analysis
 This is expected to remain the case for the next 4-6 years, through:
o Upgrades of RHIC and JLab
o Start-up of the LHC
with Wide Area access growing in importance relative to Local access