Using distributed resources in STAR
An overview of our tools, architecture and experience …
Jérôme Lauret, Gabriele Carcassi, Richard Casella, Eftathiadis Eftratios, Eric Hjort, Doug Olson, Jeff Porter
Jérôme Lauret, CHEP03, March 2003

STAR overview
• The STAR collaboration: 450 members, 45 institutions (3 more in a week ...)
• Large data sets: 33 M events last year (Au+Au 200 AGeV), 55 M events this year so far, and we are only half way through (150 M planned)
• Represents PBytes of data per year (total), and growing ...
• A demanding user community: data mining done in a few months ... users also want to cross-compare with past years' data sets ...
• 6 to 7 FTE for the entire computing staff
• Data reduction: pass 1 gives a reduction factor of 5 to 10, micro-DSTs an extra gain of 2 to 5; typical pass-1 storage size ~13 TB (last year)
• About 20 TB of centralized storage available
• 2 main facilities / processing centers: RCF/BNL and PDSF/NERSC

STAR overview
• 137,000 pads, ~70 M pixels; if zero-suppressed, ~10 M pixels
• Event size comparable to an image taken by a digital camera ...

Getting to Distributed Disk ...
The situation then ...
• NFS-resident storage did not scale well ($$ issue); IO bottlenecks reduce the overall farm utilization efficiency
• Some storage on distributed disk (DD = local to our processing nodes), ~15 TB actually ... very tempting ...
• Not GRID, but … the problems are the same ...
  - Sociological: the data are not visible from the user's standpoint (the nodes are non-interactive); one cannot use find, wildcard ls or recursive scanning scripts ... Adiabatic changes needed … (Physics first)
  - Infrastructure: accurate inventory, easy access with a minimal amount of knowledge from the user, rapid / real-time deployment of data sets
  - Analysis: assumed to be statistically and data-set driven

Where we need(ed) to go ...
Needed:
• A good and efficient file catalog / replica catalog
• A user interface to submit jobs on the Grid ...
• Maximizing the usage of the 2 STAR main processing sites (implies resource brokering, monitoring …)
• Other components (VO management, …)
• The tools to distribute the data around, but also to make the results available back to the users … HRM / DRM

What are RMs ?
• Hierarchical Resource Managers / Storage Resource Managers
• Grid middleware developed by the Scientific Data Management Resource group (SDMR) in collaboration with STAR; includes DRM, TRM, HRM
• Software handling the data transfer for you ...
• http://sdm.lbl.gov/indexproj.php?ProjectID=SRM
• Talk by Alex Sim, 4:50 – Center Hall 115; I will not go through the details here ...
• It works great !! We like it ...
• Looking at improving and expanding the transfer capabilities ... see Poster P6, "Design of a High Performance Data Replication in the Grid Environment for the STAR Collaboration" (Dantong Yu)

File Cataloging
• STAR had a file catalog ... in flat table format, MySQL back-end
• Currently 1.7 M entries (logical names) and 2.4 M replicas
• Queries became problematic at ~800k entries (of the order of 10 seconds)
• Needed a new (temporary) approach that
  - could support millions of entries
  - would not necessarily be centralized ...
  - would contain the information we had before
  - would hold complex MetaData: information about the run (magnetic field, collision, detector configuration, trigger setup) and the production (conditions, library version ...)
  - and MUST support our distributed disk approach ...
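To make these requirements concrete, the sketch below shows the kind of metadata-driven query such a catalog has to answer — all available local-disk replicas of MuDst files from a given production and field setting — written against Perl/DBI, the language of the catalog API mentioned later. It is only an illustration: the table and column names, the connection parameters and the host are placeholders, not the actual STAR schema.

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Hypothetical connection parameters; the real catalog host is not shown here.
my $dbh = DBI->connect(
    "DBI:mysql:database=FileCatalog;host=catalog.example.org",
    "reader", "",
    { RaiseError => 1 }
);

# "All available local-disk replicas of MuDst files from production P03ia
#  taken with full magnetic field" -- run and production metadata combined
#  with replica location information.
my $sth = $dbh->prepare(q{
    SELECT fl.node, fl.path, fd.filename
    FROM   FileData      fd
    JOIN   FileLocations fl ON fl.fileDataID = fd.fileDataID
    WHERE  fd.production    = ?
      AND  fd.filetype      = ?
      AND  fd.magneticField = ?
      AND  fl.storage       = 'local'
      AND  fl.availability  = 1
});
$sth->execute( 'P03ia', 'daq_reco_MuDst', 'FullField' );

while ( my ($node, $path, $file) = $sth->fetchrow_array ) {
    print "$node:$path/$file\n";
}
$dbh->disconnect;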
Nothing on the market at the time met these requirements.

FileCatalog Schedule
• First proof-of-principle in 2001 ... (Nikita Soldatov)
• Pushed in early 2002 (had to fight some momentum) (Adam Kisiel)
• In service for more than a year
• The DD program made it an essential tool and broke the first user psychological barrier (enforced end of 2002)
• Basic design: the MetaData live in separate tables we call dictionaries; a main table, FileData, holds the logical name; one table holds the FileLocations, which include the site (BNL, LBL, ...), the node and the type of storage (NFS, HPSS, local ...)

FileCatalog - Design
[Schema diagram] FileData (MetaData) is linked 1:N to FileLocations (locations / replicas). The dictionaries RunParams, Production Conditions and FileTypes attach to FileData; Storage Types (HPSS, NFS, local) and Storage Sites attach to FileLocations. Storage, site, node and path form the unique key for FileLocations.

FileCatalog – Learning experience
• MySQL based; avoid table locking and index rebuilds as much as possible:
  - INSERTs into tables whose index ID is not referenced elsewhere are delayed
  - all UPDATEs are low priority (delayed and put on the stack)
  - instead of deleting an entry, we mark FileLocations availability=0
  - a replication only adds an entry in FileLocations (optimized)
• Supports ancestry (actually needed)
• Regression test: 150k files distributed on local disks over 100 nodes; no problem with simultaneous connections; it takes less than 10 seconds to get the records, check all files and update the catalog ... Most of the time is spent in … if ( ! -e $file ){ nop;}

Distributing files ... and Catalog
[Architecture diagram] The DataCarousel server interacts with HPSS (sorting, threads, restoring files to cache); files are pftp'd onto the local disks of the nodes; a client script adds the records; FileCatalog management updates FileLocations, marks entries {un-}available, and spiders and updates from the control nodes.

Distributing files ... and Catalog
File distribution has been in production for six months and is sturdy ...
• Files are added to the catalog as they are produced and the disks are populated
• Spidering is done on demand, ucheck once an hour, full check only once a day (stable)
Learning lessons:
• It turns out that not all analyses are statistically driven … Automate population ?? A magic algorithm ?? The best bet is to look at the users' usage patterns – mixed technology ??
• Resource sharing = resource blocking; facilities are shared … More replicas ?? Pre-emption ?? Smarter distribution; re-distribute (Mercedes Lopez Noriega)
• API: one complete in Perl/DBI but only a partial C++ interface (it does not access the full complexity). Should have tried to use schemas ...

Catalog - Web Front end

FileCatalog - What's next ?
• Test the distributed catalog in real life (and spare time) ...
  - master-slave replication (extensive experience here)
  - at any given site, there will be N catalogs, 1 slave per site
  - one master copy (will) contain the merged records
  - the interface for connections is already set (XML; one API)
• For now: deploy at PDSF (on-going)
• WAIT for a replacement and a Grid-aware viable solution … or +1 FTE
• Good enough for now … and for …

The STAR Scheduler — Resource Broker (Poster P9)
• For details, MUST see Poster P9 ... (Gabriele Carcassi)
• Fully interfaced with our FileCatalog
• Flexible XML U-JDL also provides hand-shaking with (any) database
• The scheduler
  - interfaces with the FileCatalog; query resolver; the implementation is modular
  - splits the job into N sub-jobs according to where the files are, OR where a resource is available (sketched below) ...
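A minimal sketch of that splitting step, assuming the catalog query has already been resolved into (node, file) replica pairs. The node names, paths and the grouping code are illustrative only; the actual scheduler logic lives in the implementation described in Poster P9.

#!/usr/bin/env perl
use strict;
use warnings;

# Replicas as (node, file) pairs, as a catalog query resolver might return them.
# Node names and paths are made up for the example.
my @replicas = (
    [ 'rcas6023', '/data1/reco/st_physics_40001.MuDst.root' ],
    [ 'rcas6023', '/data1/reco/st_physics_40002.MuDst.root' ],
    [ 'rcas6107', '/data2/reco/st_physics_40003.MuDst.root' ],
);

# Group the input files by the node that holds them: one sub-job per node.
my %subjob;
push @{ $subjob{ $_->[0] } }, $_->[1] for @replicas;

foreach my $node ( sort keys %subjob ) {
    my $filelist = join ',', @{ $subjob{$node} };
    # Each sub-job would then be dispatched (e.g. through LSF) with $FILELIST
    # set for the user command and pinned to $node so the reads stay local.
    print "sub-job on $node -> $filelist\n";
}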
An example U-JDL job description:

<?xml version="1.0" encoding="utf-8" ?>
<job>
  <command>root4star -q -b myMacro.C\(\"$FILELIST\"\)</command>
  <stdout URL="file:/star/u/xxx/work/$JOBID.out"/>
  <input URL="catalog:star.bnl.gov?production=P03ia,storage=local,filetype=daq_reco_MuDst" nFiles="2000"/>
  <output fromScratch="*.root" toURL="file:/star/u/xxx/work/"/>
</job>

Scheduler Architecture (Poster P9)
[Architecture diagram] Components: the U-JDL and JobInitializer, the FileCatalog interface (MySQL server), monitoring (Ganglia, MDS), the Policy, and the Dispatcher (LSF / Condor-G, via a Perl module). We do not yet address how files are returned to users … if returned.

Ganglia / MDS
• Ganglia: a distributed monitoring system ... http://ganglia.sourceforge.net/
• Ganglia information / MDS
  - MDS = Monitoring and Discovery Service
  - Why? Security issues ... not adequate for cross-site propagation; the mesh of dependencies may become complex
  - Phase 1 done (Eftathiadis Eftratios): provider vs schema issues debugged, and it now works; information is pushed into MDS and checked
  - Much, much more work is needed before it is usable for resource brokering … publishing delay issues, service availability issues

Speaking of security ...
• Before long, we will need to address MySQL security issues (data integrity)
• Not only a FileCatalog issue: STAR has ALL calibrations in MySQL and already 10 mirrors — millions of records, tens of GB of (reduced) data
• MySQL 4.x X509 certificates ... being investigated (Richard Casella, Jeff Porter); the strategy for now is to use what is really there and what works (a sketch of this direction is appended after the conclusion)
• What we need is encrypted database replication … we should investigate GT3 / OGSA
• A collaborative effort is needed and welcomed ...

Conclusion
• We have learned and gained a lot ...
• Nicely preparing our users for the Grid (local-scale distributed resources, fixed U-JDL) without them noticing it
• We learned lessons ourselves on resource sharing & blocking, and are in a position to refine our resource brokering
• We are ready for component swapping … STAR Scheduler: components can be replaced by Grid middleware; submission to Condor-G tested, more experience to come in the next months; waiting for a stable Replica Catalog (but not stuck)
• Last but not least: learning to work with one another, its merits and limitations
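Appendix — a minimal sketch of the MySQL 4.x X509 direction mentioned on the security slide, assuming a MySQL server built with SSL support. The host names, account names, passwords, certificate paths and the SSL option names passed to DBD::mysql are placeholders and assumptions, not the configuration actually deployed in STAR.

#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# On the master: a mirror/replication account that is only usable over a
# verified X509 channel (REQUIRE X509 is a MySQL 4.x GRANT clause).
my $master = DBI->connect(
    "DBI:mysql:database=mysql;host=master.example.org",
    "admin", "secret", { RaiseError => 1 }
);
$master->do(q{
    GRANT REPLICATION SLAVE ON *.*
    TO 'mirror'@'%.example.org' IDENTIFIED BY 'changeme'
    REQUIRE X509
});
$master->disconnect;

# A mirror-side client presenting its certificate through DBD::mysql's SSL
# connection attributes (option names as documented for DBD::mysql).
my $dsn = "DBI:mysql:database=FileCatalog;host=master.example.org"
        . ";mysql_ssl=1"
        . ";mysql_ssl_ca_file=/etc/mysql-certs/ca-cert.pem"
        . ";mysql_ssl_client_cert=/etc/mysql-certs/client-cert.pem"
        . ";mysql_ssl_client_key=/etc/mysql-certs/client-key.pem";
my $dbh = DBI->connect( $dsn, "mirror", "changeme", { RaiseError => 1 } );
print "connected with SSL/X509\n" if $dbh;
$dbh->disconnect;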