Data-Driven Batch Scheduling
John Bent
Ph.D. Oral Exam
Department of Computer Sciences, University of Wisconsin, Madison
May 31, 2005

Batch Computing
Highly successful, widely-deployed technology
• Frequently used across the sciences
• As well as in video production, document processing, data mining, financial services, graphics rendering
Evolved into much more than a job queue
• Manages multiple distributed resources
• Handles complex job dependencies
• Checkpoints and migrates running jobs
• Enables transparent remote execution
Environmental transparency is becoming problematic

Problem with Environmental Transparency
Emerging trends
• Data sets are getting larger [Roselli00, Zadok04, LHC@CERN]
• Data growth is outpacing the increase in processing ability [Gray03]
• More wide-area resources are available [Foster01, Thain04]
• Users push successful technology beyond its design [Wall02]
Batch schedulers are CPU-centric
• Ignore data dependencies
• Data movement happens as a side-effect of job placement
• Great for compute-intensive workloads executing locally
• Horrible for data-intensive workloads executing remotely
How to run data-intensive workloads remotely?

Approaches to Remote Execution
Direct I/O: direct remote data access
• Simple and reliable
• Large throughput hit
Prestaging data: manipulating the environment
• Allows remote execution without the performance hit
• High user burden
• Fault tolerance and restarts are difficult
Distributed file systems: AFS, NFS, etc.
• Designed for wide-area networks and sharing
• Infeasible across autonomous domains
• Policies not appropriate for batch workloads

Common Fundamental Problem
Layers upon CPU-centric scheduling
Data planning is external to the scheduler
• Data is a "second-class citizen" in batch scheduling
Data-aware scheduling policies are needed
• Knowledge of the I/O behavior of batch workloads
• Support from the file system
• A data-driven batch scheduler

Outline
Intro
Profiling data-intensive batch workloads
File system support
Data-driven scheduling
Conclusions and contributions

Selecting Workloads to Profile
What are the general characteristics of batch workloads?
Select a representative suite of workloads:
• BLAST (biology)
• IBIS (ecology)
• CMS (physics)
• Hartree-Fock (chemistry)
• Nautilus (molecular dynamics)
• AMANDA (astrophysics)
• SETI@home (astronomy)

Key Observation: Batch-Pipeline Workloads
Workloads consist of many processes
• Complex job and data dependencies
• Three types of data and data sharing
Vertical relationships: pipelines
• Processes are often vertically dependent
• Output of the parent becomes input to the child
Horizontal relationships: batches
• Users often submit multiple pipelines
• There may be read sharing across siblings

Batch-Pipeline Workloads
[Figure: a batch-pipeline workload; pipelines arranged across the batch width, each reading and writing endpoint data, passing pipeline data between stages, and sharing batch datasets across siblings.]

Observations
Large proportion of random access
• IBIS and CMS are close to 100%; HF is ~80%
No isolated pipeline is resource intensive
• Max memory is 500 MB
• Max disk bandwidth is 8 MB/s
Amdahl and Gray balances are skewed
• Drastically overprovisioned in terms of I/O bandwidth and memory capacity

Workload Observations
These pipelines are NOT run in isolation
• Submitted in batch widths of 100s or 1000s
• Run in parallel on remote CPUs
Large amount of I/O sharing
• Batch and pipeline I/O are shared
• Endpoint I/O is relatively small
Scheduler must differentiate I/O types
• Scalability implications
• Performance implications

Bandwidth Needed for Direct I/O
[Figure: bandwidth needed versus batch width, with reference lines for a storage center (1500 MB/s) and a commodity disk (40 MB/s).]
• Bandwidth needed for all data (i.e. endpoint, batch, and pipeline).
• Only seti, ibis, and cms can scale to batch widths greater than 1000.

Eliminate Pipeline Data
[Figure: same axes and reference lines as above.]
• Endpoint and batch data are accessed at the submit site.
• Better . . .

Eliminate Batch Data
[Figure: same axes and reference lines as above.]
• Bandwidth needed for endpoint and pipeline data.
• Better . . .

Complete I/O Differentiation
[Figure: same axes and reference lines as above.]
• Only endpoint traffic is incurred at archive storage.
• Best. All workloads can scale to batch widths greater than 1000.
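A quick way to see these scaling limits is to compute, for each strategy above, how much wide-area data a single pipeline pulls and how many concurrent pipelines a given server bandwidth can feed. The sketch below is my own back-of-the-envelope construction, not the talk's measurements; the data sizes and runtime are illustrative assumptions.

```python
# Back-of-the-envelope scalability check for the four I/O-scoping
# strategies above. All numbers are illustrative assumptions.

def wan_mb_per_pipeline(endpoint_mb, batch_mb, pipeline_mb, strategy):
    """Wide-area megabytes one pipeline pulls through the home server."""
    if strategy == "direct_io":      # all data traverses the WAN
        return endpoint_mb + batch_mb + pipeline_mb
    if strategy == "no_pipeline":    # pipeline data kept at the remote site
        return endpoint_mb + batch_mb
    if strategy == "no_batch":       # batch data cached at the remote site
        return endpoint_mb + pipeline_mb
    if strategy == "endpoint_only":  # complete I/O differentiation
        return endpoint_mb
    raise ValueError(strategy)

def max_batch_width(server_bw_mbps, wan_mb, runtime_s):
    """Concurrent pipelines the server can feed at a sustained rate."""
    return int(server_bw_mbps / (wan_mb / runtime_s))

# Hypothetical pipeline: 5 MB endpoint, 500 MB batch, 200 MB pipeline
# data, 3000 s runtime.
for strategy in ("direct_io", "no_pipeline", "no_batch", "endpoint_only"):
    mb = wan_mb_per_pipeline(5, 500, 200, strategy)
    for server, bw in (("storage center", 1500), ("commodity disk", 40)):
        print(f"{strategy:13s} {server:14s} width <= {max_batch_width(bw, mb, 3000)}")
```

Under these assumed numbers, only the endpoint-only strategy lets even a commodity disk sustain batch widths well beyond 1000, matching the trend in the plots above.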
Outline
Intro
Profiling data-intensive batch workloads
File system support
• Develop a new distributed file system: BAD-FS
• Develop a capacity-aware scheduler
• Evaluate against CPU-centric scheduling
Data-driven scheduling
Conclusions and contributions

Why a New Distributed File System?
[Figure: the user's home store connected across the Internet to remote compute resources.]
The user needs a mechanism for remote data access
Existing distributed file systems seem ideal
• Easy to use
• Uniform name space
• Designed for wide-area networks
But . . .
• Not practically deployable
• Embedded decisions are wrong

Existing DFSs make bad decisions
Caching
• Must guess what and how to cache
Consistency
• Output: must guess when to commit
• Input: needs a mechanism to invalidate the cache
Replication
• Must guess what to replicate

BAD-FS makes good decisions
Removes the guesswork
• Scheduler has detailed workload knowledge
• Storage layer allows external control
• Scheduler makes informed storage decisions
Retains the simplicity and elegance of a DFS
• Uniform name space
• Location transparency
Practical and deployable

Batch-Aware Distributed File System: Practical and Deployable
User-level; requires no privilege
Packaged as a modified batch system
[Figure: SGE clusters connected across the Internet to the home store, with BAD-FS layered over them.]
A new batch system which includes BAD-FS
General; will work on all batch systems

Modified Batch System
[Figure: compute nodes, each running a CPU manager and a BAD-FS storage manager, connected to home storage and a job queue managed by the data-aware scheduler.]
1) Storage managers
2) Batch-Aware Distributed File System
3) Expanded job description language
4) Data-aware scheduler

Scheduler Requires Knowledge
Remote cluster knowledge
• Storage availability
• Failure rates
Workload knowledge
• Data type (batch, pipeline, or endpoint)
• Data quantity
• Job dependencies

Scheduler Requires Storage Control
BAD-FS exports explicit control via volumes
• An abstraction allowing external storage control
• Guaranteed allocations for workload data
• Specified type: either cache or scratch
Scheduler exerts control through volumes
• Creates volumes to cache input data
• Subsequent jobs can reuse this data
• Creates volumes to buffer output data
• Destroys pipeline data, copies endpoint data
• Configures the workload to access the appropriate volumes

Knowledge Plus Control
Enhanced performance
• I/O scoping
• Capacity-aware scheduling
Improved failure handling
• Cost-benefit replication
Simplified implementation
• No cache consistency protocol

I/O Scoping
A technique to minimize wide-area traffic
Allocate volumes to cache batch data
Allocate volumes for pipeline and endpoint data; extract only endpoint data to the home site
[Figure: the BAD-FS scheduler directing compute nodes. AMANDA: 200 MB pipeline, 500 MB batch, 5 MB endpoint.]
Steady-state: only 5 of 705 MB traverse the wide area.
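The volume abstraction can be sketched in a few lines. This is a hypothetical illustration of the scheduler's side of such an interface, not the actual BAD-FS API; the class and function names are mine, while the data sizes are the AMANDA figures from the slide above.

```python
# A minimal sketch of volume-based I/O scoping; the Volume class and
# plan_volumes() are hypothetical, not the real BAD-FS interface.

from dataclasses import dataclass

@dataclass
class Volume:
    name: str
    kind: str       # "cache" (shared, read-only input) or "scratch" (output buffer)
    size_mb: int

def plan_volumes(batch_mb, pipeline_mb, endpoint_mb):
    """Scope each data type to an appropriately typed volume."""
    return [
        Volume("batch-in", "cache", batch_mb),          # cached once, reused by siblings
        Volume("pipe", "scratch", pipeline_mb),         # destroyed when the pipeline ends
        Volume("endpoint-out", "scratch", endpoint_mb), # copied home, then destroyed
    ]

def steady_state_wan_mb(volumes):
    """Once batch data is warm, only endpoint data crosses the wide area."""
    return sum(v.size_mb for v in volumes if v.name.startswith("endpoint"))

vols = plan_volumes(batch_mb=500, pipeline_mb=200, endpoint_mb=5)
print(f"{steady_state_wan_mb(vols)} of {sum(v.size_mb for v in vols)} MB traverse the WAN")
```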
Capacity-Aware Scheduling
A technique to avoid over-allocations
• Over-allocated batch data causes WAN thrashing
• Over-allocated pipeline data causes write failures
Scheduler has knowledge of
• Storage availability
• Storage usage within the workload
Scheduler runs as many jobs as can fit

Capacity-Aware Scheduling
[Figure: a batch-pipeline workload with four batch datasets; endpoint, pipeline, and batch volumes compete for the same remote storage.]

Capacity-aware scheduling
64 batch-intensive synthetic pipelines
• Four processes within each pipeline
• Four different batch datasets
• Vary the size of the batch data
16 compute nodes

Improved failure handling
Scheduler understands data semantics
• Data is not just a collection of bytes
• Losing data is not catastrophic
• Output can be regenerated by rerunning jobs
Cost-benefit replication
• Replicates only data whose replication cost is cheaper than the cost to rerun the job

Simplified implementation
Data dependencies are known
Scheduler ensures proper ordering
Build a distributed file system
• With cooperative caching
• But without a cache consistency protocol

Real workload experience
Setup
• Batch width of 64
• 16 compute nodes
• Emulated wide-area network
Configurations
• Remote I/O
• AFS-like with /tmp
• BAD-FS
The result is an order of magnitude improvement

Outline
Intro
Profiling data-intensive batch workloads
File system support
Data-driven scheduling
• Codify simplifying assumptions
• Define data-driven scheduling policies
• Develop predictive analytical models
• Evaluate
• Discuss
Conclusions and contributions

Simplifying Assumptions
Predictive models
• Only interested in relative (not absolute) accuracy
Batch-pipeline workloads are "canonical"
• Mostly uniform and homogeneous
• Combine pipeline and endpoint data into private data
Environment
• Assume homogeneous compute nodes
Target scenario
• A single workload on a single compute cluster

Predictive Accuracy: Choosing Relative over Absolute
Develop predictive models to guide the scheduler
• Absolute accuracy is not needed
• Relative accuracy suffices to select between different possible schedules
The predictive model does not consider
• Latencies
• Disk bandwidths
• Buffer cache size or bandwidth
• Other "lower" characteristics such as buses, CPUs, etc.
Simplify the model and retain relative predictive accuracy
• These effects are more uniform across different possible schedules
• Network and CPU utilization make the largest difference
The trade-off? Loss of absolute accuracy

Canonical B-P Workload
[Figure: a canonical batch-pipeline workload: WWidth pipelines of WDepth jobs each; every job runs for WRun +/- WVar, reads and writes WPrivate of private data, and shares one WBatch dataset per depth level.]
Represented using six variables.

Target Compute Platform
[Figure: a storage server connected at bandwidth CRemote to a compute cluster of CCPU nodes with local bandwidth CLocal, aggregate storage CStorage, and failure rate CFailure.]
Represented using five variables.
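The eleven constants translate directly into code. Below is a small sketch of the canonical description; the field names and unit choices (MB, MB/s, seconds) are my assumptions, while the baseline values are those used later in the evaluation.

```python
# The six workload and five environment variables of the canonical model.

from dataclasses import dataclass

@dataclass
class Workload:
    width: int          # WWidth:   number of pipelines submitted
    depth: int          # WDepth:   jobs per pipeline
    batch_mb: float     # WBatch:   batch data
    private_mb: float   # WPrivate: private (pipeline + endpoint) data
    run_s: float        # WRun:     mean job runtime
    var_s: float        # WVar:     runtime variation (+/-)

@dataclass
class Platform:
    cpus: int           # CCPU:     compute nodes in the cluster
    storage_mb: float   # CStorage: storage available at the cluster
    local_mbps: float   # CLocal:   local-area bandwidth
    remote_mbps: float  # CRemote:  wide-area bandwidth to the home server
    failure_rate: float # CFailure: expected fraction of failed pipelines

# Baseline environment from the evaluation section of the talk.
baseline = Platform(cpus=50, storage_mb=250_000, local_mbps=12,
                    remote_mbps=1, failure_rate=0.0)
```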
Scheduling Objective
Maximize throughput for user workloads
Easy without data considerations
• Run as many jobs as possible
Potential problems with data considerations
• WAN overutilization by resending batch data
• Data barriers can reduce CPU utilization
• Concurrency limits can reduce CPU utilization
Need data-driven policies to avoid these problems

Five Scheduling Allocations
Define allocations to avoid these problems
• All: has no constraints; allows CPU-centric scheduling
• AllBatch: avoids overutilizing the WAN and avoids barriers
• AllPrivate: avoids concurrency limits
• Slice: avoids overutilizing the WAN
• Minimal: attempts to maximize concurrency
Each allocation might incur other problems
Which problems are incurred by each?

Workflow for the All allocation
Full concurrency
No batch data refetch
No barriers
Requires the most storage
Minimum storage needed: $W_{Depth} W_{Batch} + W_{Width} W_{Private} (W_{Depth} + 1)$

Workflow for the AllBatch allocation
Limited concurrency
No batch data refetch
No barriers
Minimum storage needed: $W_{Depth} W_{Batch} + 2 W_{Private}$

Workflow for the AllPrivate allocation
Full concurrency
No batch data refetch
Barriers possible
Minimum storage needed: $W_{Batch} + W_{Width} W_{Private} (W_{Depth} + 1)$

Workflow for the Slice allocation
Limited concurrency
No batch data refetch
Barriers possible
Minimum storage needed: $W_{Batch} + W_{Width} W_{Private} + W_{Private}$

Workflow for the Minimal allocation
Limited concurrency
Batch data refetch
Barriers possible
Smallest storage footprint
Minimum storage needed: $W_{Batch} + 2 W_{Private}$

Selecting an allocation
Multiple allocations may be possible
• Dependent on WWidth and WDepth
• Dependent on WBatch and WPrivate
Which allocations are possible?
Which possible allocation is preferable?
[Figure: feasibility regions plotted against the maximum batch and private volume sizes (% of CStorage) for WWidth = 3, WDepth = 3: in one region only Minimal is possible; in another, Minimal, Slice, and AllBatch are possible; in a third, all five allocations are possible.]
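The five minimum-storage formulas translate directly into a feasibility check. A minimal sketch follows, in which the formulas are the slides' and the function and parameter names are mine.

```python
# Minimum storage needed by each allocation, transcribed from the
# workflow slides above; units are arbitrary but consistent (e.g. GB).

def min_storage(alloc, width, depth, batch, private):
    return {
        "All":        depth * batch + width * private * (depth + 1),
        "AllBatch":   depth * batch + 2 * private,
        "AllPrivate": batch + width * private * (depth + 1),
        "Slice":      batch + width * private + private,
        "Minimal":    batch + 2 * private,
    }[alloc]

def feasible(storage, **w):
    """Which allocations fit within CStorage?"""
    return [a for a in ("All", "AllBatch", "AllPrivate", "Slice", "Minimal")
            if min_storage(a, **w) <= storage]

# WWidth = 3, WDepth = 3 as in the plot above; the sizes are made up.
print(feasible(100, width=3, depth=3, batch=10, private=5))
```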
Analytical Modelling
How to select when multiple allocations are possible?
Use analytical models to predict runtimes
Use 6 workload and 5 environmental constants
• WWidth, WDepth, WBatch, WPrivate, WRun, WVar
• CStorage, CFailure, CRemote, CLocal, CCPU

All Predictive Model
$$T_{Total} = T_{ColdPhase} + V_{Warm} \, T_{WarmPhase}$$
$$T_{ColdPhase} = T_{ColdBatch} + T_{PrivateRead} + T_{PrivateWrite} + T_{Compute}$$
$$T_{ColdBatch} = \frac{D_{TotBatch}}{BW(C_{Remote}, C_{Local})}, \quad \frac{1}{BW(C_{Remote}, C_{Local})} = \frac{1}{C_{Remote}} + \frac{1}{C_{Local}}, \quad D_{TotBatch} = W_{Depth} W_{Batch}$$
$$T_{PrivateRead} = \frac{W_{Private}}{BW(C_{Remote}, C_{Local})} + \frac{(W_{Depth} - 1) W_{Private}}{C_{Local}}$$
$$T_{PrivateWrite} = \frac{(W_{Depth} - 1) W_{Private}}{C_{Local}} + \frac{W_{Private}}{BW(C_{Remote}, C_{Local})}$$
$$T_{Compute} = W_{Run} W_{Depth}, \quad V_{Warm} = \left\lceil \frac{W_{Width}}{\min(V_{Exec}, C_{CPU})} \right\rceil - 1, \quad V_{Exec} = W_{Width}$$
$$T_{WarmPhase} = T_{WarmBatch} + T_{PrivateRead} + T_{PrivateWrite} + T_{Compute}, \quad T_{WarmBatch} = \frac{D_{TotBatch}}{C_{Local}}$$

AllBatch Predictive Model
The same equations as the All model, except that concurrency is limited by the storage remaining after the batch data is allocated:
$$V_{Exec} = \frac{C_{Storage} - D_{TotBatch}}{2 W_{Private}}$$

AllPrivate Predictive Model
The same equations as the All model, except that barriers expose the runtime variation and the batch data must be cycled through a limited number of batch volumes:
$$T_{Compute} = (W_{Run} + W_{Var}) W_{Depth}, \quad V_{Batch} = \frac{C_{Storage} - D_{TotPrivate}}{W_{Batch}}$$
$$D_{TotPrivate} = (W_{Depth} + 1) W_{Private} W_{Width}, \quad V_{Exec} = W_{Width}$$

Slice Predictive Model
Slice has two phases of concurrency
• At the start and end of the workload
• Steady state in the middle
Concurrency is limited by the storage remaining after allocating one volume for each pipeline; at the start and end, fewer pipelines are allocated. The other equations are consistent with the All model:
$$V_{Batch} = \frac{C_{Storage} - D_{SliceData}}{W_{Batch}}, \quad D_{SliceData} = W_{Private} W_{Width}, \quad D_{NumJobs} = W_{Depth} W_{Width}$$
$$V_{Warm} = \left\lceil \frac{W_{Width}}{\min(V_{ExecAve}, C_{CPU})} \right\rceil - 1$$
where $V_{ExecAve}$ averages the achievable concurrency over the $D_{NumJobs}$ jobs, each running pipeline holding $2 W_{Private}$ of scratch space alongside the shared batch volume.

Minimal Predictive Model
$$T_{Total} = \sum_{i=1}^{V_{Cycles}} T_{Total}(\mathrm{Slice}, SubWorkload_i)$$
$$V_{Cycles} = \left\lceil \frac{W_{Width}}{V_{Exec}} \right\rceil, \quad V_{Exec} = \frac{C_{Storage} - W_{Batch}}{2 W_{Private}}$$
$$W_{Width}(SubWorkload_i) = \begin{cases} V_{Exec} & i < V_{Cycles} \\ W_{Width} \bmod V_{Exec} & i = V_{Cycles} \end{cases}$$

Variables Used
Workload variables
• WWidth and WDepth
• WBatch and WPrivate
• WRun and WVar
Environmental variables
• CLocal and CRemote
• CCPU and CStorage
CFailure is not considered in the models

Modelling Failure Effect
Almost uniform effect across all allocations
• Estimate TTotal
• Multiply by CFailure to estimate failed pipelines
• Estimate TTotal for the remaining failed pipelines
Only one difference across allocations
• For All and AllBatch, estimates for re-running failed pipelines do not include a "cold" phase
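As a sanity check on the reconstruction, the All model can be transcribed directly into code. This is my own transcription of the equations above, assuming MB, MB/s, and seconds; the AllBatch variant would differ only in the v_exec line, which becomes (storage − d_tot_batch) / (2 · private_mb).

```python
# A sketch of the All predictive model as reconstructed above.

import math

def t_total_all(width, depth, batch_mb, private_mb, run_s,
                local_mbps, remote_mbps, cpus):
    # Cold reads traverse both links in series: 1/BW = 1/CRemote + 1/CLocal.
    cold_bw = 1.0 / (1.0 / remote_mbps + 1.0 / local_mbps)
    d_tot_batch = depth * batch_mb

    # One wide-area private transfer per pipeline (the endpoint), plus
    # WDepth - 1 local transfers of intermediate pipeline data.
    t_priv_read = private_mb / cold_bw + (depth - 1) * private_mb / local_mbps
    t_priv_write = (depth - 1) * private_mb / local_mbps + private_mb / cold_bw
    t_compute = run_s * depth

    # The cold phase pulls batch data across the WAN; warm phases read it locally.
    t_cold = d_tot_batch / cold_bw + t_priv_read + t_priv_write + t_compute
    t_warm = d_tot_batch / local_mbps + t_priv_read + t_priv_write + t_compute

    v_exec = width  # All imposes no concurrency limit
    v_warm = math.ceil(width / min(v_exec, cpus)) - 1
    return t_cold + v_warm * t_warm

# Roughly baseline-shaped inputs: 350 pipelines of depth 5 on 50 CPUs.
print(t_total_all(350, 5, 45_000, 600, 3000, 12, 1, 50))
```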
Evaluation
Winnow the allocations
Define three synthetic workloads
• Batch-intensive, private-intensive, mixed
Simulate workloads and allocations across
• Workload variables (WBatch, etc.)
• Environmental variables (CFailure, etc.)
Compare modelled predictions with results

Winnowing the Allocations: Removing All and AllPrivate
Remove All from consideration
• It lacks constraints
• No possible schedule results in problems
• Scheduling it is uninteresting
Remove AllPrivate from consideration
• It is a strict subset of Slice
• When possible, Slice can run at full CPU utilization
• No reason to prefer AllPrivate over Slice
• AllPrivate is more likely to suffer barriers
• There is no benefit to holding onto "old" pipeline data

Evaluation Framework: Baseline Characteristics
Workloads
• Batch-intensive: WBatch = 883 GB, WPrivate = 420 GB
• Private-intensive: WBatch = 150 GB, WPrivate = 5,250 GB
• Mixed: WBatch = 225 GB, WPrivate = 1,050 GB
• Common across all workloads: WWidth = 350, WDepth = 5, WRun = 3000 s, WVar = 10%
Environment
• CCPU = 50, CStorage = 250 GB
• CLocal = 12 MB/s, CRemote = 1 MB/s
• CFailure = 0%

Synthetic Workloads
[Figure: the three workloads placed on the feasibility plot of maximum batch versus private volume size (% of CStorage): private-intensive is possible only in Minimal and AllBatch; batch-intensive only in Minimal and Slice; mixed in Minimal, AllBatch, and Slice.]

Experimental Evaluation
Nine experiments across 3 workloads using 11 variables
• WWidth:CCPU
• WDepth
• WBatch:CStorage
• WPrivate:CStorage
• WRun
• WVar:WRun
• WVar:(WRun·1000)
• CLocal:CRemote
• CFailure
Example results: compare the throughput loss of the worst allocation to the modelled best
• The worst possible prediction routinely exceeds 50%
• The modelled best is routinely within 5% and never exceeds 30%
[Figure: simulated and modelled throughput (pipes per minute) and throughput loss, showing sensitivity to WWidth:CCPU and to CFailure in the mixed workload, and the resulting predictive accuracy.]

Discussion
Selecting the appropriate allocation is important
• A naïve CPU-centric approach that exceeds storage can lead to orders of magnitude of throughput loss
• Even a poorly chosen data-driven schedule can lead to upwards of 50% throughput loss
Predictive modelling is highly accurate
• Never predicts an allocation that exceeds storage
• Consistently predicts the highest-throughput allocation
• Consistently avoids "bad" allocations

What about non-canonical workloads?
Canonical scheduling is still useful
• Many workloads may be "almost" canonical
• Canonical scheduling will work but might leave unnecessary pockets of CPU / storage
• Similar to PBS, which often elicits overly conservative runtime estimates
Avoids deadlock and failures

CPU-Centric Scheduling Problem
[Figure: two candidate workloads; each pipeline volume needs 20% of storage and each batch volume needs 70%.]
• A naïve CPU-centric scheduler starts both; one will fail.
• A naïve capacity-aware scheduler starts both; it can finish neither.
• A canonical data-driven scheduler correctly starts only one.
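The example above can be rendered as a toy admission loop. This is my own construction, not the talk's scheduler: a data-driven scheduler reserves a workload's full allocation before starting it, so here it starts one workload instead of overcommitting on both.

```python
# Toy data-driven admission: start a workload only if its full storage
# allocation still fits (my illustration of the example above).

def admit(pending, storage=1.0):
    """pending: list of (name, fraction_of_storage_needed) pairs."""
    started, free = [], storage
    for name, needed in pending:
        if needed <= free:       # reserve before starting
            started.append(name)
            free -= needed
        # a naive CPU-centric scheduler would start it regardless and thrash
    return started

# Two workloads, each needing 70% batch plus 20% pipeline storage.
print(admit([("workload-A", 0.9), ("workload-B", 0.9)]))  # ['workload-A']
```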
Outline
Intro
Profiling data-intensive batch workloads
File system support
Data-driven scheduling
Conclusions and contributions

Related Work
Distributed file systems
• GFS [Ghemawat03]
• P2P [Dahlin94, Kubi00, Muthitacharoen02, Rowstron01, Saito02]
• Untrusted storage nodes [Flinn03]
Data management in grid computing
• FreeLoader [Vazhkudai05]
• Caching and locating batch data [Bell03, Bent05, Chervenak02, Lamehehamedi03, Park03, Ranganathan02, Romosan05]
• Stork [Kosar04]
Parallel scheduling
• Gang scheduling with memory constraints [Batat00]
• Backfilling [Lifka95]
• Reserving non-dedicated resources [Snell00]
Query planning
• LEO [Markl03]
• Proactive reoptimization [Babu05]

Conclusions
Batch computing is successful for a reason.
• Simple mechanism
• Easy to understand and use
• Great need for it
• Works well
But. The world is changing.
• Data sets are getting larger
• Increased availability of remote resources
• Existing approaches are fraught with problems
• CPU-centric scheduling is insufficient
New data-driven policies are needed.

Contributions
Profiling
• Batch-pipeline workloads
• A taxonomy to reason about workloads
• A taxonomy to distinguish between data types
• Data differentiation allows data-driven scheduling
File system support
• Transfer of explicit storage control
• Batch scheduler dictates allocation, caching, and replication
• Order of magnitude improvement over CPU-centric scheduling
Data-driven batch scheduling
• Multiple allocations and scheduling policies
• Predictive analytical model

End of Talk

Future Work
Measurement information
• Collecting
• Retrieving
• Mitigating inaccuracies
Multi-*
Non-canonical workloads
Dynamic reallocation
Checkpointing
Partial results