STORK: Making Data Placement a First Class Citizen in the Grid
Tevfik Kosar and Miron Livny
University of Wisconsin-Madison
March 25th, 2004, Tokyo, Japan

Some Remarkable Numbers
Characteristics of four physics experiments targeted by GriPhyN:

  Application   First Data   Data Volume (TB/yr)   User Community
  SDSS          1999         10                    100s
  LIGO          2002         250                   100s
  ATLAS/CMS     2005         5,000                 1000s

Source: GriPhyN Proposal, 2000

Even More Remarkable…
“…the data volume of CMS is expected to subsequently increase rapidly, so that the accumulated data volume will reach 1 Exabyte (1 million Terabytes) by around 2015.”
Source: PPDG Deliverables to CMS

More Data Intensive Applications
• Genomic information processing applications
• Biomedical Informatics Research Network (BIRN) applications
• Cosmology applications (MADCAP)
• Methods for modeling large molecular systems
• Coupled climate modeling applications
• Real-time observatories, applications, and data management (ROADNet)

Need for Data Placement
Data placement: locate, move, stage, replicate, and cache data; allocate and de-allocate storage space for data; clean up everything afterwards.

State of the Art
• FedEx
• Manual handling
• Simple scripts
• RFT, LDR, SRM
Data placement activities are still regarded as “second class citizens” in the Grid world.

Our Goal
Make data placement activities “first class citizens” in the Grid, just like computational jobs! They will be queued, scheduled, monitored, managed, and even checkpointed.

Outline
• Introduction
• Methodology
• Grid Challenges
• Stork Solutions
• Case Studies
• Conclusions

Methodology
An individual job consists of three logical steps: stage-in, execute the job, stage-out. Making storage management explicit, each step becomes a separate job:
• Allocate space for input & output data (data placement)
• Stage-in (data placement)
• Execute the job (computation)
• Release input space (data placement)
• Stage-out (data placement)
• Release output space (data placement)
The data placement steps and the computational step are handled as distinct job types.

Methodology
The whole workflow is expressed as a DAG specification that mixes the two job types:

  DaP A A.submit
  DaP B B.submit
  Job C C.submit
  …
  Parent A child B
  Parent B child C
  Parent C child D, E
  …

[Figure: a DAG of jobs A–F. A DAG executor walks the specification, dispatching computational jobs (e.g., C) to the compute job queue and data placement (DaP) jobs (e.g., A, B) to the DaP job queue.]
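To make the decomposition concrete, here is a minimal sketch of a DAG specification for one job, following the DaP/Job/Parent syntax shown above; the node names and submit-file names are hypothetical:

  DaP AllocateSpace  allocate.submit     (allocate space for input & output data)
  DaP StageIn        stage_in.submit     (move input data to the execution site)
  Job Compute        compute.submit      (the computational job itself)
  DaP ReleaseIn      release_in.submit   (release input space)
  DaP StageOut       stage_out.submit    (move output data to its destination)
  DaP ReleaseOut     release_out.submit  (release output space)

  Parent AllocateSpace child StageIn
  Parent StageIn       child Compute
  Parent Compute       child ReleaseIn, StageOut
  Parent StageOut      child ReleaseOut

In such a DAG, every data placement step is a first class node with its own submit file, so it can be queued, scheduled, and retried independently of the computation.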
Grid Challenges
• Heterogeneous resources
• Different job requirements
• Hiding failures from applications
• Overloading of limited resources

Stork Solutions to Grid Challenges
• Interaction with heterogeneous resources
• Flexible job representation and multilevel policy support
• Run-time adaptation
• Failure recovery and efficient resource utilization

Interaction with Heterogeneous Resources
• Modularity and extensibility
• Plug-in protocol support
• A library of “data placement” modules
• Inter-protocol translations

Direct Protocol Translation
[Figure]

Protocol Translation using Disk Cache
[Figure]

Flexible Job Representation and Multilevel Policy Support
• ClassAd job description language
• Flexible and extensible
• Persistent
• Allows both global and job-level policies

Sample Stork submit file:

  [
    Type     = "Transfer";
    Src_Url  = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
    Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
    ……
    Max_Retry  = 10;
    Restart_in = "2 hours";
  ]

Run-time Adaptation
Dynamic protocol selection: a job can list alternative protocols explicitly, or leave the choice entirely to Stork with the “any” scheme:

  [
    dap_type = "transfer";
    src_url  = "drouter://slic04.sdsc.edu/tmp/foo.dat";
    dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/foo.dat";
    alt_protocols = "nest-nest, gsiftp-gsiftp";
  ]

  [
    dap_type = "transfer";
    src_url  = "any://slic04.sdsc.edu/tmp/foo.dat";
    dest_url = "any://quest2.ncsa.uiuc.edu/tmp/foo.dat";
  ]

Failure Recovery and Efficient Resource Utilization
• Hides failures from users: “retry” and “kill and restart” mechanisms
• Controls the number of concurrent transfers to/from any storage system, preventing overloading
• Space allocation and de-allocation make sure space is available before data arrives

Case Study I: SRB-UniTree Data Pipeline
• Transfer ~3 TB of DPOSS data (2,611 files of 1.1 GB each) from SRB at SDSC to UniTree at NCSA
• No direct interface between SRB and UniTree; the only way to transfer is to build a pipeline

SRB-UniTree Data Pipeline
[Figure: from the submit site, files flow from the SRB server into the SDSC cache (“SRB get”), from the SDSC cache to the NCSA cache via third-party GridFTP/DiskRouter transfers, and from the NCSA cache into the UniTree server (“UniTree put”).]
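Each hop of this pipeline maps onto one Stork transfer job. The following sketch shows the three hops for a single file, reusing the URL schemes and hosts from the earlier examples; the file paths are hypothetical, and the file:// and unitree:// schemes on the cache and archive endpoints are assumptions made here for illustration:

  Hop 1, SRB server to SDSC cache (runs on the cache node):
  [
    dap_type  = "transfer";
    src_url   = "srb://ghidorac.sdsc.edu/kosart.condor/foo.dat";
    dest_url  = "file:///tmp/foo.dat";
    max_retry = 10;
  ]

  Hop 2, SDSC cache to NCSA cache, third-party DiskRouter transfer with GridFTP as a fallback:
  [
    dap_type = "transfer";
    src_url  = "drouter://slic04.sdsc.edu/tmp/foo.dat";
    dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/foo.dat";
    alt_protocols = "gsiftp-gsiftp";
  ]

  Hop 3, NCSA cache to UniTree (scheme and host assumed):
  [
    dap_type = "transfer";
    src_url  = "file:///tmp/foo.dat";
    dest_url = "unitree://mss.ncsa.uiuc.edu/archive/foo.dat";
  ]

Chaining the hops per file in a DAG, as in the Methodology section, lets the DAG executor drive the whole pipeline while Stork retries each hop independently.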
Failure Recovery
[Figure: transfer timeline for the SRB-UniTree pipeline. The pipeline recovered automatically from several failures: UniTree not responding, a software problem, an SDSC cache reboot combined with a UW CS network outage, and a DiskRouter reconfiguration and restart.]

Case Study II: Dynamic Protocol Selection
[Figure]

Conclusions
• Regard data placement as real jobs.
• Treat computational and data placement jobs differently.
• Introduce a specialized scheduler for data placement.
• Provide end-to-end automation, fault tolerance, run-time adaptation, and multilevel policy support.

Future Work
• Enhanced interaction between Stork and higher-level planners: co-scheduling of CPU and I/O
• Enhanced authentication mechanisms
• More run-time adaptation

You don’t have to FedEx your data anymore… Stork delivers it for you!
For more information:
• URL: http://www.cs.wisc.edu/condor/stork
• Email: kosart@cs.wisc.edu