Reliable and Efficient Data Placement in a Grid Environment
PhD Research Summary
Tevfik Kosar
IBM TJ Watson Research Center, June 22nd, 2004

Grid Computing
"Distributed computing across networks using open standards supporting heterogeneous resources" - IBM

Motivations for Grid Computing
- Increase capacity
- Improve efficiency / reduce costs
- Reduce "time to results"
- Provide reliability / availability
- Support heterogeneous systems
- Enable collaborations
- ...

Future of Grid
"Grid is hot because it's the right technology for its time and within the next five years it will be a de facto part and parcel of virtually every major financial markets firm's infrastructure.." - Grid Computing in Financial Markets: Moving Beyond Compute Intensive Applications, Tabb Group

Moving Beyond Compute-Intensive Applications
"While the compute-intensive segment is growing, the vast amount of new grid growth will not come from compute-intensive solutions, but from data and service grids whose application we believe to be much wider than traditional compute grids." - Grid Computing in Financial Markets: Moving Beyond Compute Intensive Applications, Tabb Group

What about Science?
- Genomic information processing applications
- Biomedical Informatics Research Network (BIRN) applications
- Cosmology applications (MADCAP)
- Methods for modeling large molecular systems
- Coupled climate modeling applications
- Real-time observatories, applications, and data management (ROADNet)

Some Remarkable Numbers
Characteristics of four physics experiments targeted by GriPhyN:

  Application   First Data   Data Volume (TB/yr)   User Community
  SDSS          1999         10                    100s
  LIGO          2002         250                   100s
  ATLAS/CMS     2005         5,000                 1000s

Source: GriPhyN Proposal, 2000

Even More Remarkable…
"..the data volume of CMS is expected to subsequently increase rapidly, so that the accumulated data volume will reach 1 Exabyte (1 million Terabytes) by around 2015." - Source: PPDG Deliverables to CMS

Access to Remote Data
- Remote I/O
- Move application close to data
- Move data close to application
- Move both data and application
Remote I/O does not scale well for large data sets! Storage sites do not always have sufficient computational power nearby!

Need to move data around
[Figure: terabytes to petabytes of data being moved between distributed sites.]

While doing this..
- Locate the data
- Access heterogeneous resources
- Deal with all kinds of failures
- Allocate and de-allocate storage
- Move the data
- Clean up everything
All of these need to be done reliably and efficiently! (See the sketch below for how these steps look when written as explicit jobs.)
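A rough sketch, using the Stork-style job notation introduced later in the talk. Only the transfer form mirrors the examples shown later; the attribute names used for the reserve and release steps (dest_host, reserve_size, reserve_id) and all hosts and file names are illustrative assumptions, not taken from the actual system.

  [ // reserve space for the incoming data (attribute names are hypothetical)
    dap_type     = "reserve";
    dest_host    = "storage.example.edu";
    reserve_size = "2 GB";
    reserve_id   = 1;
  ]
  [ // move the data once space is available
    dap_type = "transfer";
    src_url  = "gsiftp://remote.example.edu/data/input.dat";
    dest_url = "gsiftp://storage.example.edu/scratch/input.dat";
  ]
  [ // clean up: release the reservation when the data is no longer needed
    dap_type   = "release";
    dest_host  = "storage.example.edu";
    reserve_id = 1;
  ]

Treating each of these steps as its own job is what later lets a scheduler queue, retry, and order them independently of the computation.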
Goal
Data placement is crucial in a Grid environment. Current approaches regard it as a side effect of computation. Data placement must be regarded as a first-class citizen in the Grid, just like the computational jobs.

Approach
Regard data placement activities as full-fledged jobs. Design and implement a system to reliably and efficiently schedule, execute, monitor, and manage them.

Outline: Introduction · Background · The Concept · Data Placement Subsystem · Progress Made · Contributions · Future Work

Background
[Diagram, built up over four slides: at the hardware level, a CPU, memory, I/O processor, controller, and disk connected by a bus; at the operating systems level, an I/O subsystem (CPU scheduler, I/O scheduler, I/O control system, DMA controller) layered on top of the hardware; at the distributed systems level, batch schedulers layered above the operating system; and finally a data placement subsystem added at the distributed systems level alongside the batch schedulers, playing the role there that the I/O subsystem plays inside the operating system.]

Outline: Introduction · Background · The Concept · Data Placement Subsystem · Progress Made · Contributions · Future Work

The Concept
Individual jobs:
- Stage-in
- Execute the job
- Stage-out

The Concept
Looking more closely, each individual job involves:
- Allocate space for input & output data
- Stage-in
- Execute the job
- Release input space
- Stage-out
- Release output space

Traditional Schedulers
Not aware of the characteristics and semantics of data placement jobs:

  Executable = /tmp/foo.exe
  Arguments  = a b c d

  Executable = globus-url-copy
  Arguments  = gsiftp://host1/f1 gsiftp://host2/f2

Any difference?
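To a traditional scheduler, both of the submit descriptions above are just executables with opaque arguments. A minimal sketch of how the same transfer could instead be expressed so that its semantics are visible to the scheduler, using the Stork-style notation shown later in the talk (the hosts and file names are the placeholders from the slide above; the exact attribute set is illustrative):

  [
    // the scheduler now knows this job is a transfer, not an arbitrary executable
    dap_type = "transfer";
    src_url  = "gsiftp://host1/f1";
    dest_url = "gsiftp://host2/f2";
  ]

The point is not the syntax but that the job type, endpoints, and protocols become information the scheduler can act on, which is exactly what the next slide enumerates.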
Understanding Job Characteristics & Semantics
- Job_type = transfer, reserve, release?
- Source and destination hosts, files, protocols to use?
- Determine the concurrency level
- Can select alternate protocols
- Can select alternate routes
- Can tune network parameters (TCP buffer size, I/O block size, number of parallel streams)
- ...

The Concept
- Allocate space for input & output data   (data placement job)
- Stage-in                                 (data placement job)
- Execute the job                          (computational job)
- Release input space                      (data placement job)
- Stage-out                                (data placement job)
- Release output space                     (data placement job)

Outline: Introduction · Background · The Concept · Data Placement Subsystem · Progress Made · Contributions · Future Work

Data Placement Subsystem
[Diagram, built up over several slides: the user submits job descriptions to a planner. The planner hands computational jobs to a computation scheduler, which runs them on compute nodes, and data placement jobs to a data placement scheduler, which acts on storage systems. A resource broker / policy enforcer sits alongside the schedulers. A data miner analyzes the computational and data placement job log files, network monitoring tools observe the network, and a feedback mechanism feeds the resulting information back to the schedulers. An inset plot shows the measured cumulative distribution of transfer times, i.e. the probability that a transfer finishes within T minutes, for T between roughly 4.6 and 15.8 minutes.]

Outline: Background · Related Work · The Concept · Data Placement Subsystem · Progress Made · Contributions · Future Work

Data Placement Subsystem
[Diagram: the same architecture, with the components implemented so far marked "Implemented".]

Separation of Jobs
DAG specification:

  DaP A A.submit
  DaP B B.submit
  Job C C.submit
  .....
  Parent A child B
  Parent B child C
  Parent C child D, E
  .....

[Diagram, built up over several slides: a workflow manager (DAGMan) walks an example DAG with nodes A through F. DaP nodes such as A and B are routed to the data placement (Stork) job queue, and Job nodes such as C to the Condor job queue; the snapshot shows C sitting in the Condor queue and E in the Stork queue.]
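For illustration, the fragment above can be fleshed out into a complete, if tiny, workflow. The DAG lines below are taken from the slide; the contents of the two submit files are a sketch, with the transfer attributes modeled on the Stork examples later in the talk and the file names invented.

workflow.dag (DaP nodes go to Stork, Job nodes go to Condor):

  DaP A A.submit
  Job C C.submit
  Parent A child C

A.submit, the data placement job that stages the input in before C runs (URLs are invented):

  [
    dap_type = "transfer";
    src_url  = "gsiftp://archive.example.edu/data/input.dat";
    dest_url = "gsiftp://compute.example.edu/scratch/input.dat";
  ]

C.submit, the computational job in ordinary Condor submit syntax:

  executable = /tmp/foo.exe
  arguments  = input.dat
  queue

DAGMan enforces the parent/child ordering, so the computation never starts before its input is in place, and a failed transfer can be retried on the data placement side without resubmitting the computation.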
Stork: Data Placement Scheduler
- The most important component of the data placement subsystem.
- Understands the characteristics and semantics of data placement jobs.
- Can make smart scheduling decisions for reliable and efficient data placement.

Support for Heterogeneity
- Protocol translation using the Stork memory buffer.
- Protocol translation using the Stork disk cache.

Flexible Job Representation and Multilevel Policy Support

  [
    Type     = "Transfer";
    Src_Url  = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
    Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
    ......
    Max_Retry  = 10;
    Restart_in = "2 hours";
  ]

Run-time Adaptation
Dynamic protocol selection:

  [
    dap_type = "transfer";
    src_url  = "drouter://slic04.sdsc.edu/tmp/test.dat";
    dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat";
    alt_protocols = "nest-nest, gsiftp-gsiftp";
  ]

  [
    dap_type = "transfer";
    src_url  = "any://slic04.sdsc.edu/tmp/test.dat";
    dest_url = "any://quest2.ncsa.uiuc.edu/tmp/test.dat";
  ]

Run-time Adaptation - 2
Run-time protocol auto-tuning:

  [
    link     = "slic04.sdsc.edu – quest2.ncsa.uiuc.edu";
    protocol = "gsiftp";
    bs       = 1024KB;   //block size
    tcp_bs   = 1024KB;   //TCP buffer size
    p        = 4;        //number of parallel streams
  ]

Failure Recovery and Efficient Resource Utilization
- Fault tolerance: just submit a bunch of data placement jobs, and then go away..
- Control the number of concurrent transfers from/to any storage system: prevents overloading.
- Space allocation and de-allocation: make sure space is available.

Case Study - I
Dynamic protocol selection

Run-time Adaptation
Before tuning:
- parallelism = 1
- block_size = 1 MB
- tcp_bs = 64 KB
After tuning:
- parallelism = 4
- block_size = 1 MB
- tcp_bs = 256 KB

Case Study - II: SRB-UniTree Data Pipeline
- Transfer ~3 TB of DPOSS data from SRB @SDSC to UniTree @NCSA
- No common interface
- Network and storage limitations
- A data pipeline created with Stork

[Diagram: a management site at UW coordinates the pipeline. Data flows from the SRB server at SDSC through an SDSC cache node (20 GB of disk space), across a 100 Mb/s, 66.7 ms wide-area link to an NCSA cache node (20 GB of disk space), and on into the UniTree server at NCSA; the local links at the two sites are 1 Gb/s, 0.4 ms and 100 Mb/s, 0.6 ms. The four transfer steps are labeled A through D. Control flow runs between the management site and every node; data flow follows the server-to-cache-to-cache-to-server path.]

Comparing Pipelines

  Configuration                 End-to-end rate
  1 staging node                40 Mb/s
  2 staging nodes (not tuned)   25.6 Mb/s
  2 staging nodes (tuned)       47.6 Mb/s

Failure Recovery
Failures encountered during the pipeline run:
- UniTree not responding
- SDSC cache reboot & UW CS network outage
- DiskRouter reconfigured and restarted
- Software problem
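For a concrete feel of how a step in this pipeline was expressed so that failures like the ones above could be ridden out automatically, here is a sketch of the wide-area hop between the two caches as a single Stork job. The attributes (dap_type, src_url, dest_url, alt_protocols, Max_Retry, Restart_in) and the two cache hostnames come from earlier slides; the file path, the choice of DiskRouter as the primary protocol for this hop, and the particular retry values are illustrative assumptions, not the actual DPOSS submit files.

  [
    dap_type = "transfer";
    // wide-area hop: SDSC cache to NCSA cache
    src_url  = "drouter://slic04.sdsc.edu/dposs/chunk_0001.dat";
    dest_url = "drouter://quest2.ncsa.uiuc.edu/dposs/chunk_0001.dat";
    // fall back to another protocol pair if the primary one keeps failing
    alt_protocols = "gsiftp-gsiftp";
    // keep retrying through cache reboots, network outages, and server restarts
    Max_Retry  = 10;
    Restart_in = "2 hours";
  ]

Because the retry and fallback policy lives in the data placement job itself, an outage at either end shows up as a delayed transfer rather than a failed pipeline.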
Profiling Data Transfer Protocols and Servers
- Get a better understanding of transfers.
- How is time spent at the kernel level during transfers?
- Profiled the GridFTP and NeST servers using "oprofile", while varying different parameters.

GridFTP (read: 6.5 MB/s, write: 7.8 MB/s), percentage of CPU time:

  Category             Read from GridFTP   Write to GridFTP
  Idle                 15.9                44.5
  Ethernet driver      40.9                1.5
  Interrupt handling   10.3                4.9
  Libc                 8.1                 16.8
  Globus               2.7                 3.8
  Oprofile             2.7                 5.1
  IDE                  4.0                 0.3
  File I/O             2.0                 3.8
  Rest of kernel       13.4                19.3

NeST, percentage of server CPU time:

  Category             Read from NeST   Write to NeST
  Idle                 12.5             57.7
  Ethernet driver      44.2             1.0
  Interrupt handling   10.2             4.3
  Libc                 10.4             12.6
  NeST                 1.1              6.7
  Oprofile             3.5              3.7
  IDE                  3.0              0.3
  File I/O             2.1              1.7
  Rest of kernel       12.9             12.0

Outline: Introduction · Background · The Concept · Data Placement Subsystem · Progress Made · Contributions · Future Work

Contributions
Short term: provide a system for reliable and efficient data placement for the use of the Grid community.
- Already deployed at NCSA, WCER, UW-HEP, NOAO, and OSU (projects: USCMS, DPOSS, SDSS, Blast).
- Soon to be deployed at CERN, ISI, SLAC, Caltech, LOCI, and BMRB (projects: CMS, BaBar, Quest, IBP).
- In the CERN package it will be distributed to 40 countries and 70 institutions.

Contributions
Medium term: introduce a new concept to the distributed systems community: regard data placement as a first-class citizen.
- Profiling work: a better understanding of storage systems and transfer protocols.
- Characterization of data placement jobs.
- Provide and apply a set of policies for storage systems.

Contributions
Long term: serve as a basis for further research in the data placement area.
- Two of our papers are already being studied in classes, e.g. the graduate-level "Scheduling in Distributed Systems" class at OSU.

Future Work
- Get a better understanding of data placement jobs.
- Extend the profiling work.
- Study real workloads.
- Define the set of scheduling decisions specific to them, and apply those decisions.

Future Work - II
- Define a set of policies for storage systems and apply them.
- Use the results of the profiling work.
- Consider user concerns.
- Prevent overloading, avoid failures.
- Provide efficient usage and load balancing.

Future Work - III
- Collect useful information, interpret it, and feed it back to the scheduling system.
- Collect and interpret log files.
- Interact with network monitoring tools.
- Increase reliability and efficiency.
- Run-time adaptation.

Future Work - IV
- Better coordination of computational and data resources.
- Study ways to interact with and integrate the data placement scheduler with higher-level planners and computational schedulers.
- More reliable and efficient data processing systems/pipelines.
Conclusions
- Data placement is crucial in a distributed computing environment.
- Current approaches regard it as a side effect of computation.
- Data placement must be regarded as a first-class citizen, just like the computational jobs.
- Regard data placement activities as full-fledged jobs.

Conclusions
- Distinguish data placement jobs from computational jobs.
- Design and implement a data placement subsystem to reliably and efficiently schedule, execute, monitor, and manage them.

Thank you for listening.
Questions?