Using distributed resources in
STAR
An Overview of our tools, architecture and
experience …
Jérôme Lauret,
Gabriele Carcassi, Richard Casella,
Efstathiadis Efstratios, Eric Hjort, Doug
Olson, Jeff Porter
STAR overview
The STAR collaboration
450 members, 45 institutions (3 more in a week ... )
Large data sets : 33 M events last year (Au+Au 200 AGeV), 55 M
events this year so far and only halfway through (planned 150 M)
Represents PBytes of data per year (total) and growing ...
A demanding user community : data mining done in a few months ...
also want to cross-compare with past years' data sets ...
6 to 7 FTE for the entire computing staff
Data reduction : pass 1 is a factor 5 to 10 reduction, micro-DST an
extra factor of 2 to 5 ; typical pass 1 storage size ~ 13 TB (last year)
About 20 TB of centralized storage available.
2 main facilities / processing centers : RCF/BNL and PDSF/NERSC
STAR overview
137,000 pads ~ 70 M pixels
If zero-suppressed, ~ 10 M pixels
Event size comparable to an image taken by a digital camera ...
Getting to Distributed Disk ...
The situation then ...
NFS-resident storage did not scale well ($$ issue)
IO bottlenecks reduce overall farm utilization efficiency
Some storage on distributed disk (DD = local to our processing nodes)
~ 15 TB actually ... very tempting ...
Not GRID but …
Problem(s) are the same ...
Sociological : data on DD are not visible from the user standpoint
(nodes are non-interactive). Cannot use find, wild-card ls or recursive
scanning scripts ...
Adiabatic changes needed … (Physics first)
Infrastructure : accurate inventory, easy access with a minimal amount
of knowledge from the user, rapid / real-time deployment of data sets
Analysis : assumed statistical and data-set driven
Where we need(ed) to go ...
Needed
•A good and efficient File Catalog / Replica Catalog
•A user interface to submit jobs on the Grid ...
•Maximizing the usage of the 2 main STAR processing sites
(implies resource brokering, monitoring …)
•Other components needed (VO management, …)
The tools to distribute the data around, but also to make the
results available back to the users … HRM / DRM
What are RMs ?
Hierarchical Resource Managers / Storage Resource
Managers
Grid middleware developed by the Scientific Data Management
Research group (SDM) at LBNL in collaboration with STAR
Includes DRM, TRM, HRM
Software handling the data transfer for you ...
http://sdm.lbl.gov/indexproj.php?ProjectID=SRM
Talk by Alex Sim 4:50 – Center Hall 115
Will not go through the details ... It works great !! We like it ...
Looking to improve and expand transfer capabilities ... Poster P6
Design of a High Performance Data Replication in the Grid
Environment for the STAR Collaboration
Dantong Yu
File Cataloging
STAR had a file Catalog ... in flat table format ; MySQL back-end
Currently, 1.7 M entries (logical names) and 2.4 M replicas
Queries became problematic at ~ 800k entries (of order 10 seconds)
Needed a new (temporary) approach that :
Could support millions of entries
Would not necessarily be centralized ...
Needed to contain the information we had before
Complex MetaData contains information about
The Run : magnetic field, collision, detector configuration, trigger setup
The Production : conditions, library version
...
MUST support our distributed disk approach ...
Nothing on the market at the time
FileCatalog
Schedule
First proof-of-principle in 2001 ...
Nikita Soldatov
Pushed it in early 2002 (had to overcome some inertia) Adam Kisiel
In service for more than a year
The DD program made it an essential tool, breaking the first user
psychological barrier (enforced end of 2002)
The basic design relies on :
MetaData in separate tables we call dictionaries
A main table for the logical names, FileData
One table holding the FileLocations, which includes
site (BNL, LBL, ...), node
type of storage (NFS, HPSS, local ...)
FileCatalog - Design
[Schema diagram] FileData, the table of logical names, links N.1 to the
MetaData dictionaries (RunParams, ProductionConditions, FileTypes) and
1.N to FileLocations, the table of locations / replicas ; each
FileLocations entry links N.1 to StorageTypes (HPSS, NFS, local) and
StorageSites.
Storage, site, node and path form the unique key for FileLocations.
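To make the diagram concrete, a minimal sketch of what the FileLocations
table could look like, driven through perl/DBI ; column names beyond
storage, site, node and path are hypothetical, not STAR's actual schema :

#!/usr/bin/env perl
# Sketch only : one plausible DDL for the replica table described above.
# Column names other than node / path / availability are hypothetical.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=FileCatalog;host=localhost",
                       "admin", "secret", { RaiseError => 1 });
$dbh->do(<<'SQL');
CREATE TABLE FileLocations (
  fileLocationID INT NOT NULL AUTO_INCREMENT,  -- un-referenced index ID
  fileDataID     INT NOT NULL,                 -- N.1 link to FileData
  storageTypeID  INT NOT NULL,                 -- N.1 link to StorageTypes
  storageSiteID  INT NOT NULL,                 -- N.1 link to StorageSites
  node           VARCHAR(32)  NOT NULL,
  path           VARCHAR(255) NOT NULL,
  availability   TINYINT      NOT NULL DEFAULT 1,
  PRIMARY KEY (fileLocationID),
  -- storage, site, node and path form the unique key (see diagram)
  UNIQUE KEY location (storageTypeID, storageSiteID, node, path)
)
SQL
$dbh->disconnect;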
FileCatalog – Learning experience
MySQL-based
Avoid table locking and index rebuilds as much as possible
INSERTs into tables with an un-referenced index ID are DELAYED
All UPDATEs are LOW_PRIORITY (delayed and on the stack)
Instead of deleting an entry, we mark FileLocations availability=0
A replication only adds an entry in FileLocations (optimized)
Supports ancestry (actually needed)
Regression test
150K files distributed on local disks over 100 nodes
No problem with simultaneous connections
Takes less than 10 seconds to get the records, check all files and update
the Catalog ...
Most of the time is spent in …
if ( ! -e $file ) { next ; }
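A minimal sketch of such a spider pass, assuming the FileLocations layout
above and a hypothetical fileLocationID column ; hosts and accounts are
placeholders :

#!/usr/bin/env perl
# Hypothetical spider pass : check that every catalogued local file still
# exists on this node ; missing replicas are marked availability=0 rather
# than deleted, and the UPDATE is LOW_PRIORITY so it never blocks readers.
use strict;
use warnings;
use DBI;

my $node = `hostname`; chomp $node;
my $dbh  = DBI->connect("DBI:mysql:database=FileCatalog;host=fc.star.bnl.gov",
                        "spider", "secret", { RaiseError => 1 });

# All replicas registered as available on this node.
my $rows = $dbh->selectall_arrayref(
    "SELECT fileLocationID, path FROM FileLocations
     WHERE node = ? AND availability = 1", undef, $node);

my $mark = $dbh->prepare(
    "UPDATE LOW_PRIORITY FileLocations SET availability = 0
     WHERE fileLocationID = ?");

for my $row (@$rows) {
    my ($id, $file) = @$row;
    next if -e $file;        # most of the time is spent in this test
    $mark->execute($id);
}
$dbh->disconnect;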
Distributing files ... and Catalog
[Architecture sketch] The DataCarousel server interacts with HPSS
(sorting, threads, restores files on cache) ; files are pftp'd to local
disk on the worker nodes and a client script adds the records to the
Catalog. FileCatalog management runs from the control nodes : update
FileLocations, mark {un-}available, spider and update.
Distributing files ... and Catalog
File distribution has been in production for 6 months and is sturdy ...
•Files added to the Catalog as they are produced and the disks populated
•Spidering done on demand, ucheck once an hour, full check only once
a day (stable)
Learning lessons
•Turns out that not all analyses are statistically driven …
Automate population ?? Magic algorithm ??
Best bet is to look at users' usage patterns – mixed technology ??
•Resource sharing = resource blocking
Facilities are shared … More replicas ?? Pre-emption ??
Smarter distribution ; re-distribution
Mercedes Lopez Noriega
•API : one complete in perl/DBI but a partial C++ interface (does not
expose the full complexity). Should have tried to use schemas ...
(a sketch of such a perl/DBI lookup follows)
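For illustration only : a typical catalog lookup (e.g. production=P03ia,
storage=local, filetype=daq_reco_MuDst) boils down to a join of FileData,
FileLocations and the dictionaries ; all ID / column names beyond the
slides are assumptions, and $dbh is connected as in the spider sketch :

# Hypothetical hand-rolled equivalent of a perl/DBI FileCatalog query.
my $files = $dbh->selectcol_arrayref(q{
    SELECT CONCAT(fl.node, ':', fl.path, '/', fd.fileName)
    FROM   FileData fd
    JOIN   FileLocations        fl ON fl.fileDataID    = fd.fileDataID
    JOIN   StorageTypes         st ON st.storageTypeID = fl.storageTypeID
    JOIN   FileTypes            ft ON ft.fileTypeID    = fd.fileTypeID
    JOIN   ProductionConditions pc ON pc.productionID  = fd.productionID
    WHERE  pc.production    = 'P03ia'
      AND  st.storageType   = 'local'
      AND  ft.fileType      = 'daq_reco_MuDst'
      AND  fl.availability  = 1
    LIMIT  2000
});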
Catalog - Web Front end
[Screenshot of the Catalog web query front end]
FileCatalog - What's next ?
Test the Distributed Catalog in real life (and spare time) ...
Master-Slave replication. Extensive experience here
At any given site, will have N Catalogs, 1 slave per site
One Master-Copy (will) contain the merged records
Interface for connections already set (XML ; one API) ; a minimal
replication setup is sketched below
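A minimal sketch of the corresponding MySQL master / slave configuration
(my.cnf) ; host, account and database names are placeholders :

# master my.cnf
[mysqld]
server-id = 1
log-bin                          # binary log feeds the per-site slaves

# slave my.cnf, one per site
[mysqld]
server-id       = 2              # must be unique per slave
master-host     = fc.star.bnl.gov
master-user     = repl
master-password = secret
replicate-do-db = FileCatalog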
For now
Deploy at PDSF (ongoing)
WAIT for a replacement & a viable GRID-aware solution
… or +1 FTE
Good enough for now … and for …
The STAR Scheduler
Resource Broker
Poster P9
For details, MUST see Poster P9 ...
Gabriele Carcassi
Fully interfaced with our FileCatalog
Flexible XML U-JDL also provides handshaking with (any) db
The scheduler
•Interfaces with the FileCatalog ; Query Resolver ; implementation is
modular
•Splits the job into N sub-jobs according to where the files are OR where
a resource is available ...
<?xml version="1.0" encoding="utf-8" ?>
<job>
<command>root4star -q -b myMacro.C\(\"$FILELIST\"\)</command>
<stdout URL="file:/star/u/xxx/work/$JOBID.out" />
<input URL="catalog:star.bnl.gov?production=P03ia,storage=local,filetype=daq_reco_MuDst"
       nFiles="2000" />
<output fromScratch="*.root" toURL="file:/star/u/xxx/work/" />
</job>
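Assuming the scheduler's command-line front end, the user submits this
description with something like star-submit jobDescription.xml ; the
scheduler substitutes $FILELIST and $JOBID per sub-job.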
Scheduler Architecture Poster P9
[Architecture sketch] The JobInitializer parses the U-JDL ; the Policy
module consults the FileCatalog interface (MySQL server, via a Perl
module) and the monitoring information (Ganglia, MDS) ; the Dispatcher
hands the sub-jobs to LSF / Condor-G.
Does not yet address how files are returned to users … if returned. Next
Ganglia / MDS
Ganglia : a distributed Monitoring System ...
http://ganglia.sourceforge.net/
Ganglia information / MDS
• MDS = Monitoring and Discovery Service
• Why ?
Security Issues ... Not adequate for cross-site propagation
Mesh of dependencies may become complex
• Phase 1 done
Efstathiadis Efstratios
Provider vs Schema issues : debugged and now works
Information pushed into MDS and checked
•Need much much more work to be usable for resource brokering …
publishing delay issues
service availability issues
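As an illustration of the Ganglia-to-MDS path : a GRIS information
provider can slurp the cluster XML that gmond serves on TCP port 8649
and print LDIF for MDS to publish ; the DN suffix and attributes below
are illustrative, not a blessed schema (the provider vs schema issue
above) :

#!/usr/bin/env perl
# Sketch of an MDS (GRIS) information provider fed from Ganglia :
# read the XML dump gmond serves on port 8649, emit one LDIF entry
# per host. Attribute names / DN suffix are illustrative only.
use strict;
use warnings;
use IO::Socket::INET;

my $sock = IO::Socket::INET->new(PeerAddr => 'localhost',
                                 PeerPort => 8649,
                                 Proto    => 'tcp')
    or die "cannot reach gmond: $!";
my $xml = do { local $/; <$sock> };   # gmond dumps its XML and closes

# Crude parse : one LDIF entry per <HOST NAME="..."> element.
while ($xml =~ /<HOST NAME="([^"]+)"/g) {
    print "dn: Mds-Host-hn=$1, Mds-Vo-name=star, o=grid\n";
    print "objectclass: MdsHost\n";
    print "Mds-Host-hn: $1\n\n";
}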
Speaking of security ...
Before long, we will need to address MySQL security issues
(data integrity)
Not only a FileCatalog issue
STAR has ALL Calibrations in MySQL and already 10 mirrors
Millions of records, 10s of GB of (reduced) data
MySQL 4.x - X509 certificates ...
Being investigated (a minimal sketch follows)
Richard Casella, Jeff Porter
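A minimal sketch of the direction under investigation, combining a MySQL
4.x X509 requirement on the server with an SSL connection from perl/DBI ;
hosts, accounts and certificate paths are placeholders :

#!/usr/bin/env perl
# Server side, run once by the DBA (MySQL 4.x syntax) :
#   GRANT SELECT ON FileCatalog.* TO 'reader'@'%.star.bnl.gov'
#       REQUIRE X509;
# Client side : connect over SSL with the user's certificate.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect(
    "DBI:mysql:database=FileCatalog;host=fc.star.bnl.gov"
    . ";mysql_ssl=1"
    . ";mysql_ssl_client_cert=/path/to/usercert.pem"
    . ";mysql_ssl_client_key=/path/to/userkey.pem"
    . ";mysql_ssl_ca_file=/path/to/ca.pem",
    "reader", "", { RaiseError => 1 });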
Strategy for now : use what is really there and what works
What we need is encrypted database replication …
Should investigate GT3 / OGSA
Collaborative effort needed and welcomed ...
Conclusion
We have learned and gained a lot ...
Nicely preparing our users for the Grid (local-scale distributed
resources, fixed U-JDL) without them noticing it.
We learned lessons ourselves on resource sharing & blocking
In a position to refine our resource brokering
We are ready for component swapping …
STAR Scheduler : components can be replaced by Grid middleware
Submission to Condor-G tested ; more experience in the months to come
Waiting for a stable Replica Catalog (but not stuck)
Last but not least : learning to work with one another, its
merits and limitations