Data-Driven Batch Scheduling
John Bent
Ph.D. Oral Exam
Department of Computer Sciences
University of Wisconsin, Madison
May 31, 2005
Batch Computing
Highly successful, widely-deployed technology
• Frequently used across the sciences
• As well as in video production, document processing, data mining, financial services, graphics rendering
Evolved into much more than a job queue
• Manages multiple distributed resources
• Handles complex job dependencies
• Checkpoints and migrates running jobs
• Enables transparent remote execution
Environmental transparency is becoming problematic
Problem with Environmental Transparency
Emerging trends
• Data sets are getting larger [Roselli00,Zadok04,LHC@CERN]
• Data growth outpacing increase in processing ability [Gray03]
• More wide-area resources are available [Foster01,Thain04]
• Users push successful technology beyond design [Wall02]
Batch schedulers are CPU-centric
• Ignore data dependencies
• Data movement happens as a side-effect of job placement
• Great for compute-intensive workloads executing locally
• Horrible for data-intensive workloads executing remotely
How to run data-intensive workloads remotely?
Approaches to Remote Execution
Direct I/O: Direct remote data access
• Simple and reliable
• Large throughput hit
Prestaging Data: Manipulating environment
• Allows remote execution without performance hit
• High user burden
• Fault tolerance, restarts are difficult
Distributed File Systems: AFS, NFS, etc.
• Designed for wide-area networks and sharing
• Infeasible across autonomous domains
• Policies not appropriate for batch workloads
Common Fundamental Problem
All of these approaches layer upon CPU-centric scheduling
Data planning external to the scheduler
• “Second class citizen in batch scheduling”
Data-aware scheduling policies needed
• Knowledge of I/O behavior of batch workloads
• Support from the file system
• Data-driven batch scheduler
Outline
Intro
Profiling data-intensive batch workloads
File system support
Data-driven scheduling
Conclusions and contributions
Selecting Workloads to Profile
General characteristics of batch workloads?
Select a representative suite of workloads
• BLAST (biology)
• IBIS (ecology)
• CMS (physics)
• Hartree-Fock (chemistry)
• Nautilus (molecular dynamics)
• AMANDA (astrophysics)
• SETI@home (astronomy)
Key Observation: Batch-Pipeline Workloads
Workloads consist of many processes
• Complex job and data dependencies
• Three types of data and data sharing
Vertical relationships: Pipelines
• Processes often are vertically dependent
• Output of parent becomes input to child
Horizontal relationships: Batches
• Users often submit multiple pipelines
• May be read sharing across siblings
Batch-Pipeline Workloads
[Figure: a batch-pipeline workload. Pipelines run vertically from endpoint inputs to endpoint outputs; batch datasets are shared horizontally across the batch width.]
Pipeline Observations
Large proportion of random access
• IBIS, CMS close to 100%, HF ~ 80%
No isolated pipeline is resource intensive
• Max memory is 500 MB
• Max disk bandwidth is 8 MB/s
Amdahl and Gray balances skewed
• Drastically overprovisioned in terms of I/O bandwidth and memory capacity
Workload Observations
These pipelines are NOT run in isolation
• Submitted in batch widths of 100s, 1000s
• Run in parallel on remote CPUs
Large amount of I/O sharing
• Batch and pipeline I/O are shared
• Endpoint I/O is relatively small
Scheduler must differentiate I/O types
• Scalability implications
• Performance implications
Bandwidth needed for Direct I/O
[Figure: wide-area bandwidth needed per workload vs. batch width; reference lines at storage center (1500 MB/s) and commodity disk (40 MB/s).]
• Bandwidth needed for all data (i.e. endpoint, batch, pipeline).
• Only SETI, IBIS, and CMS can scale to batch widths greater than 1000.
Eliminate Pipeline Data
[Figure: same axes and reference lines, with pipeline data eliminated from the wide area.]
• Endpoint and batch data is accessed at the submit site.
• Better . . .
Eliminate Batch Data
[Figure: same axes and reference lines, with batch data eliminated from the wide area.]
• Bandwidth needed for endpoint and pipeline data.
• Better . . .
Complete I/O Differentiation
[Figure: same axes and reference lines, with complete I/O differentiation.]
• Only endpoint traffic is incurred at archive storage.
• Best. All can scale to batch widths greater than 1000.
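To make the scaling argument concrete, here is a back-of-the-envelope sketch (my own illustration with hypothetical per-pipeline numbers, not figures from the profiling study): the batch width at which aggregate wide-area demand saturates the 1500 MB/s storage center grows as more I/O types are kept off the wide area.

CENTER_MB_S = 1500  # storage center bandwidth from the plots above

def max_scalable_width(wan_mb_per_pipe, runtime_s):
    # Width at which width * (wan_mb_per_pipe / runtime_s) hits the center.
    return CENTER_MB_S * runtime_s / wan_mb_per_pipe

# Hypothetical pipeline: 3000 s runtime, 705 MB touched, 5 MB endpoint.
print(int(max_scalable_width(705, 3000)))  # all data over the WAN: ~6,400
print(int(max_scalable_width(5, 3000)))    # endpoint only: ~141x further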
Outline
Intro
Profiling data-intensive batch workloads
File system support
• Develop new distributed file system: BAD-FS
• Develop capacity-aware scheduler
• Evaluate against CPU-centric scheduling
Data-driven scheduling
Conclusions and contributions
Why a New Distributed File System?
[Figure: the user's home store connected to remote compute clusters across the Internet.]
User needs mechanism for remote data access
Existing distributed file systems seem ideal
• Easy to use
• Uniform name space
• Designed for wide-area networks
But . . .
• Not practically deployable
• Embedded decisions are wrong
Existing DFSs make bad decisions
Caching
• Must guess what and how to cache
Consistency
• Output: Must guess when to commit
• Input: Needs mechanism to invalidate cache
Replication
• Must guess what to replicate
BAD-FS makes good decisions
Removes the guesswork
• Scheduler has detailed workload knowledge
• Storage layer allows external control
• Scheduler makes informed storage decisions
Retains simplicity and elegance of DFS
• Uniform name space
• Location transparency
Practical and deployable
Batch-Aware Distributed File System: Practical and Deployable
User-level; requires no privilege
Packaged as a modified batch system
[Figure: BAD-FS spanning two SGE clusters across the Internet, with the user's home store at the submit site.]
A new batch system which includes BAD-FS
General; will work on all batch systems
Modified Batch System
[Figure: each compute node runs a CPU and a storage manager; BAD-FS spans the storage managers; the data-aware scheduler and job queue sit with the home storage.]
1) Storage managers
2) Batch-Aware Distributed File System
3) Expanded job description language
4) Data-aware scheduler
Scheduler Requires Knowledge
Remote cluster knowledge
• Storage availability
• Failure rates
Workload knowledge
• Data type (batch, pipeline, or endpoint)
• Data quantity
• Job dependencies
Requires Storage Control
BAD-FS exports explicit control via volumes
• Abstraction allowing external storage control
• Guaranteed allocations for workload data
• Specified type: either cache or scratch
Scheduler exerts control through volumes
• Creates volumes to cache input data
• Subsequent jobs can reuse this data
• Creates volumes to buffer output data
• Destroys pipeline, copies endpoint
• Configures workload to access appropriate volumes
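As a concrete sketch, the scheduler's side of this control might look like the following (a minimal illustration with a hypothetical storage-manager API; create_volume, copy_out, and destroy are invented names, not the actual BAD-FS interface):

def run_pipeline_job(storage, job):
    # Cache volume for the shared batch input; sibling jobs reuse it.
    batch = storage.create_volume(kind="cache", size_mb=job.batch_mb,
                                  source=job.batch_url)
    # Scratch volume to buffer pipeline output for the child job.
    scratch = storage.create_volume(kind="scratch", size_mb=job.pipeline_mb)
    job.run(mounts={"/batch": batch, "/pipeline": scratch})
    storage.copy_out(scratch, job.endpoint_files)  # endpoint data goes home
    if job.is_last_in_pipeline:
        storage.destroy(scratch)  # pipeline data never leaves the cluster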
Knowledge Plus Control
Enhanced performance
• I/O scoping
• Capacity-aware scheduling
Improved failure handling
• Cost-benefit replication
Simplified implementation
• No cache consistency protocol
I/O Scoping
Technique to minimize wide-area traffic
Allocate volumes to cache batch data
Allocate volumes for pipeline and endpoint
[Figure: the scheduler directs compute nodes to cache batch data locally and extract only endpoint output. AMANDA: 200 MB pipeline, 500 MB batch, 5 MB endpoint. Steady-state: only 5 of 705 MB traverse the wide-area.]
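The steady-state claim in the figure is simple arithmetic over the slide's AMANDA numbers:

pipeline_mb, batch_mb, endpoint_mb = 200, 500, 5   # AMANDA, per the figure
touched = pipeline_mb + batch_mb + endpoint_mb     # 705 MB per pipeline
over_wan = endpoint_mb                             # only endpoint data leaves
print(f"{over_wan} of {touched} MB traverse the wide-area "
      f"({100 * over_wan / touched:.1f}%)")        # 5 of 705 MB (0.7%)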
Capacity-Aware Scheduling
Technique to avoid over-allocations
• Over-allocated batch data causes WAN thrashing
• Over-allocated pipeline data causes write failures
Scheduler has knowledge of
• Storage availability
• Storage usage within the workload
Scheduler runs as many jobs as can fit
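A toy admission loop makes the policy concrete (my sketch of the stated idea, not BAD-FS code; for brevity it charges each job its full batch volume, whereas a real scheduler would count a shared batch volume once per dataset):

def admit(ready_jobs, free_mb):
    # Greedily start jobs whose volume allocations fit in free storage.
    started = []
    for job in ready_jobs:
        need = job["batch_mb"] + job["pipeline_mb"]  # volumes to allocate
        if need <= free_mb:
            free_mb -= need
            started.append(job)
    return started  # remaining jobs wait until volumes are released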
Capacity-Aware Scheduling
[Figure: several batch-pipeline workloads with their endpoint, pipeline, and batch dataset volumes; the scheduler fits as many as storage allows.]
Capacity-aware scheduling
64 batch-intensive synthetic pipelines
• Four processes within each pipeline
• Four different batch datasets
• Vary size of batch data
16 compute nodes
Improved failure handling
Scheduler understands data semantics
• Data is not just a collection of bytes
• Losing data is not catastrophic
• Output can be regenerated by rerunning jobs
Cost-benefit replication
• Replicates only data whose replication cost is
cheaper than cost to rerun the job
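Stated as a predicate (a sketch; weighting the rerun cost by a failure probability is my own addition to make the expected-cost comparison explicit):

def should_replicate(data_mb, bw_mb_s, rerun_s, p_failure):
    cost_to_replicate = data_mb / bw_mb_s  # time spent copying the data
    expected_rerun = p_failure * rerun_s   # expected time lost if we do not
    return cost_to_replicate < expected_rerun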
Simplified implementation
Data dependencies known
Scheduler ensures proper ordering
Build a distributed file system
• With cooperative caching
• But without a cache consistency protocol
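Why no consistency protocol is needed can be shown in a few lines (an illustrative, serial sketch: because the scheduler releases a job only after its producers finish, a cooperative cache can never serve stale data to a consumer):

from graphlib import TopologicalSorter  # Python 3.9+

def run_workload(deps, run_job):
    """deps maps each job to the set of jobs whose output it reads."""
    for job in TopologicalSorter(deps).static_order():
        run_job(job)  # consumers always run after their producers commit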
Real workload experience
Setup
• Batch-width of 64
• 16 compute nodes
• Emulated wide-area
Configuration
• Remote I/O
• AFS-like with /tmp
• BAD-FS
Result: an order of magnitude improvement
Outline
Intro
Profiling data-intensive batch workloads
File system support
Data-driven scheduling
• Codify simplifying assumptions
• Define data-driven scheduling policies
• Develop predictive analytical models
• Evaluate
• Discuss
Conclusions and contributions
Simplifying Assumptions
Predictive models
• Only interested in relative (not absolute) accuracy
Batch-pipeline workloads are “canonical”
• Mostly uniform and homogeneous
• Combine pipeline and endpoint data into a single private type
Environment
• Assume homogeneous compute nodes
Target scenario
• Single workload on a single compute cluster
Predictive Accuracy: Choosing Relative over Absolute
Develop predictive models to guide scheduler
• Absolute accuracy not needed
• Relative accuracy to select between different possible schedules
Predictive model does not consider
• Latencies
• Disk bandwidths
• Buffer cache size or bandwidth
• Other “lower” characteristics such as buses, CPUs, etc.
Simplify model and retain relative predictive accuracy
• These effects are more uniform across different possible schedules
• Network and CPU utilization make the largest difference
The Trade-off? Loss of absolute accuracy
Canonical B-P Workload
[Figure: the canonical workload annotated with its variables: batch width WWidth, pipeline depth WDepth, batch volumes WBatch, private volumes WPrivate, and per-job runtime WRun +/- WVar.]
Represented using six variables.
Target Compute Platform
[Figure: compute cluster and storage server annotated with the platform variables: CFailure, CRemote, CLocal, CStorage, and CCPU.]
Represented using five variables.
Scheduling Objective
Maximize throughput for user workloads
Easy without data considerations
• Run as many jobs as possible
Potential problems with data considerations
• WAN overutilization by resending batch data
• Data barriers can reduce CPU utilization
• Concurrency limits can reduce CPU utilization
Need data-driven policies to avoid these problems
Five Scheduling Allocations
Define allocations to avoid problems
• All: Has no constraints; allows CPU-centric scheduling
• AllBatch: Avoids overutilizing WAN and barriers
• AllPrivate: Avoids concurrency limits
• Slice: Avoids overutilizing WAN
• Minimal: Attempts to maximize concurrency
Each allocation might incur other problems
Which problems are incurred by each?
Workflow for All Allocation
Full concurrency
No batch data refetch
No barriers
Requires most storage
Minimum storage needed:
WDepth × WBatch + WWidth × WPrivate × (WDepth + 1)
Workflow for AllBatch Allocation
Limited concurrency
No batch data refetch
No barriers
Minimum storage needed:
WDepth × WBatch + 2 × WPrivate
Workflow for AllPrivate Allocation
Full concurrency
No batch data refetch
Barriers possible
Minimum storage needed:
WBatch + WWidth × WPrivate × (WDepth + 1)
Workflow for Slice Allocation
Limited concurrency
No batch data refetch
Barriers possible
Minimum storage needed:
WBatch + WWidth × WPrivate + WPrivate
Workflow for Minimal Allocation
Limited concurrency
Batch data refetch
Barriers possible
Smallest storage footprint
Minimum storage needed:
WBatch + 2 × WPrivate
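Taken together, the five minima transcribe directly into one function (a convenience sketch; the arguments stand in for the W variables, in whatever units WBatch and WPrivate use):

def min_storage(width, depth, batch, private):
    # Transcribed from the slide formulas above.
    return {
        "All":        depth * batch + width * private * (depth + 1),
        "AllBatch":   depth * batch + 2 * private,
        "AllPrivate": batch + width * private * (depth + 1),
        "Slice":      batch + width * private + private,
        "Minimal":    batch + 2 * private,
    }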
Selecting an allocation
Multiple allocations may be possible
• Dependent on WWidth and WDepth
• Dependent on WBatch and WPrivate
Which allocations are possible?
Which possible allocation is preferable?
[Figure: which allocations are possible as a function of maximum batch and maximum private volume size (% of CStorage), for WWidth = 3, WDepth = 3. Regions: all five allocations possible; only Minimal, Slice, and AllBatch possible; only Minimal possible.]
Analytical Modelling
How to select when multiple are possible?
Use analytical models to predict runtimes
Use 6 workload and 5 environmental constants
• WWidth, WDepth, WBatch, WPrivate, WRun, WVar
• CStorage, CFailure, CRemote, CLocal, CCPU
All Predictive Model
TTotal = TColdPhase + VWarm × TWarmPhase
TColdPhase = TColdBatch + TPrivateRead + TPrivateWrite + TCompute
TColdBatch = DTotBatch / BW(CRemote, CLocal)
BW(CRemote, CLocal) = 1 / (1/CRemote + 1/CLocal)
DTotBatch = WDepth × WBatch
TPrivateRead = WPrivate / BW(CRemote, CLocal) + WPrivate × (WDepth − 1) / CLocal
TPrivateWrite = WPrivate × (WDepth − 1) / CLocal + WPrivate / BW(CRemote, CLocal)
TCompute = WRun × WDepth
VExec = WWidth
VWarm = ⌈WWidth / min(VExec, CCPU)⌉ − 1
TWarmPhase = TWarmBatch + TPrivateRead + TPrivateWrite + TCompute
TWarmBatch = DTotBatch / CLocal
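Since the extraction mangled the original equation layout, the reading above (in particular, dividing data sizes by bandwidths) is a reconstruction; the sketch below transcribes it into runnable form, with dict keys standing in for the W and C variables (bandwidths in MB/s, data in MB, times in seconds):

from math import ceil

def bw(remote, local):
    return 1.0 / (1.0 / remote + 1.0 / local)  # effective WAN+LAN bandwidth

def t_total_all(w, c):
    d_batch = w["depth"] * w["batch"]
    t_priv = (w["private"] / bw(c["remote"], c["local"])
              + w["private"] * (w["depth"] - 1) / c["local"])  # read = write
    t_compute = w["run"] * w["depth"]
    t_cold = d_batch / bw(c["remote"], c["local"]) + 2 * t_priv + t_compute
    v_exec = w["width"]                      # All runs every pipeline at once
    v_warm = ceil(w["width"] / min(v_exec, c["cpu"])) - 1
    t_warm = d_batch / c["local"] + 2 * t_priv + t_compute  # batch now cached
    return t_cold + v_warm * t_warm

For the AllBatch model below, only the concurrency changes: v_exec = (c["storage"] - d_batch) // (2 * w["private"]).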
AllBatch Predictive Model
Same equations as the All model, except that concurrency is limited by the storage remaining after caching batch data:
VExec = ⌊(CStorage − DTotBatch) / (2 × WPrivate)⌋
AllPrivate Predictive Model
Same equations as the All model (including VExec = WWidth), except that barriers inflate compute time when not all batch volumes fit:
TCompute = (WRun + (WRun × WVar) / VBatch) × WDepth
VBatch = ⌊(CStorage − DTotPrivate) / WBatch⌋
DTotPrivate = (WDepth + 1) × WPrivate × WWidth
Slice Predictive Model
Same cold-phase equations as the All model, with barrier-inflated compute time as in AllPrivate:
TCompute = (WRun + (WRun × WVar) / VBatch) × WDepth
VBatch = ⌊(CStorage − DSliceData) / WBatch⌋
DSliceData = WPrivate × WWidth + VExec × WPrivate
Concurrency is limited by the storage remaining after allocating one volume for each pipeline:
VExec = ⌊(CStorage − WBatch − WPrivate × WWidth) / WPrivate⌋
Slice has two phases of concurrency: fewer pipelines are allocated at the start and end of the workload, with steady-state in the middle (the other allocations are more consistent). The warm-phase count therefore uses the average concurrency VExecAve, which averages ramp-up, steady-state, and ramp-down over the DNumJobs = WDepth × WWidth jobs, with n = ⌊(CStorage − WBatch) / (2 × WPrivate)⌋ pipelines running during the ramps:
VWarm = ⌈WWidth / min(VExecAve, CCPU)⌉ − 1
TWarmPhase = TWarmBatch + TPrivateRead + TPrivateWrite + TCompute
TWarmBatch = DTotBatch / CLocal
Minimal Predictive Model
TTotal = Σ(i = 1 … VCycles) TTotal(Slice, SubWorkloadi)
VCycles = ⌈WWidth / VExec⌉
VExec = ⌊(CStorage − WBatch) / (2 × WPrivate)⌋
WWidth(SubWorkloadi) = VExec if i < VCycles; WWidth mod VExec if i = VCycles
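In code, the decomposition is a loop (a sketch; t_total_slice is assumed to implement the Slice model above, and the last cycle's WWidth mod VExec is taken as a full VExec when the division is exact):

from math import ceil

def t_total_minimal(w, c, t_total_slice):
    v_exec = int((c["storage"] - w["batch"]) // (2 * w["private"]))
    v_cycles = ceil(w["width"] / v_exec)
    total = 0.0
    for i in range(1, v_cycles + 1):
        width_i = v_exec if i < v_cycles else (w["width"] % v_exec or v_exec)
        total += t_total_slice(dict(w, width=width_i), c)  # one sub-workload
    return total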
Variables Used
Workload Variables
• WWidth and WDepth
• WBatch and WPrivate
• WRun and WVar
Environmental Variables
• CLocal and CRemote
• CCPU and CStorage
CFailure not considered in the models
Modelling Failure Effect
Almost uniform effect across all allocations
• Estimate TTotal
• Multiply by CFailure to estimate failed pipelines
• Estimate TTotal for remaining failed pipelines
Only one difference across allocations
• For All and AllBatch, estimates for re-running
failed pipelines do not include a “cold” phase
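A sketch of that adjustment for one round of expected failures (model is any of the predictive models above; pass that model's cold-phase time as cold to drop it from reruns, as All and AllBatch do):

def t_with_failures(model, w, c, cold=0.0):
    t = model(w, c)
    failed = int(w["width"] * c["failure"])  # expected failed pipelines
    if failed:
        t += model(dict(w, width=failed), c) - cold  # rerun only the failures
    return t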
Evaluation
Winnow the allocations
Define three synthetic workloads
• Batch-intensive, private-intensive, mixed
Simulate workloads and allocations across
• Workload variables (WBatch, etc)
• Environmental variables (CFailure, etc)
Compare modelled predictions with results
Winnowing the Allocations:
Removing All and AllPrivate
Remove All from consideration
• Lacks constraint
• No possible schedule results in problems
• Scheduling is uninteresting
Remove AllPrivate from consideration
• Is a strict subset of Slice
• When possible, Slice can run at full CPU utilization
• No reason to prefer AllPrivate over Slice
• AllPrivate more likely to suffer barriers
• No benefit to holding onto “old” pipeline data
Evaluation Framework:
Baseline Characteristics
Workloads
• Batch-intensive
• WBatch = 883 GB, WPrivate = 420 GB
• Private-intensive
• WBatch = 150 GB, WPrivate = 5,250 GB
• Mixed
• WBatch = 225 GB, WPrivate = 1,050 GB
• Common across all workloads
• WWidth = 350, WDepth = 5
• WRun = 3000s, WVar = 10%
Environment
• CCPU = 50, CStorage = 250 GB
• CLocal = 12 MB/s, CRemote = 1 MB/s
• CFailure = 0%
Maximum Private Volume Size (% of CStorage)
Synthetic Workloads
Private-Intensive possible only
in Minimal and AllBatch
Batch-Intensive possible
only in Minimal and Slice
Mixed possible in Minimal,
AllBatch, and Slice
Maximum Batch Volume Size (% of CStorage)
Experimental Evaluation
Nine experiments across 3 workloads using 11 variables
• WWidth:CCPU
• WDepth
• WBatch:CStorage
• WPrivate:CStorage
• WRun
• WVar:WRun
• WVar:(WRun●1000)
• CLocal:CRemote
• CFailure
Compare throughput loss of worst to modelled best
• Worst possible prediction routinely exceeds 50%
• Modelled best routinely within 5%, never exceeds 30%
Sensitivity to WWidth:CCPU in Mixed
[Figure: simulated vs. modelled throughput (pipes per minute) and throughput loss across WWidth:CCPU, showing predictive accuracy.]
Sensitivity to CFailure in Mixed
[Figure: simulated vs. modelled throughput (pipes per minute) and throughput loss across CFailure, showing predictive accuracy.]
Discussion
Selecting appropriate allocation is important
• A naïve CPU-centric approach that exceeds storage
can lead to orders of magnitude throughput loss
• Even a poorly chosen data-driven schedule can lead to upwards of 50% throughput loss
Predictive modelling is highly accurate
• Never predicts an allocation that exceeds storage
• Consistently predicts the highest throughput allocation
• Consistently avoids “bad” allocations
What about non-canonical?
Canonical scheduling still useful
• Many workloads may be “almost” canonical
• Canonical scheduling will work but might leave unnecessary pockets of idle CPU / storage
• Similar to PBS, which often elicits overly conservative runtime estimates
Avoids deadlock or failures
CPU-Centric Scheduling Problem
[Figure: two workloads contend for limited storage; their volumes require 20% and 70% of available storage.]
• Naïve CPU-centric scheduler starts both; one will fail.
• Naïve capacity-aware scheduler starts both; can finish neither.
• Canonical data-driven scheduler correctly starts only one.
Outline
Intro
Profiling data-intensive batch workloads
File system support
Data-driven scheduling
Conclusions and contributions
Related Work
Distributed file systems
• GFS [Ghemawat03]
• P2P [Dahlin94,Kubi00,Muthitacharoen02,Rowstron01,Saito02]
• Untrusted storage nodes [Flinn03]
Data management in grid computing
• FreeLoader [Vazhkudai05]
• Caching and locating batch data [Bell03,Bent05,Chervenak02,Lamehamedi03,Park03,Ranganathan02,Romosan05]
• Stork [Kosar04]
Parallel scheduling
• Gang scheduling with memory constraint [Batat00]
• Backfilling [Lifka95]
• Reserving non-dedicated resources [Snell00]
Query planning
• LEO [Markl03]
• Proactive reoptimization [Babu05]
Conclusions
Batch computing is successful for a reason.
• Simple mechanism
• Easy to understand and use
• Great need for it
• Works well
But. The world is changing.
• Data sets are getting larger
• Increased availability of remote resources
• Existing approaches are fraught with problems
• CPU-centric scheduling is insufficient
New data-driven policies are needed.
Contributions
Profiling
• Batch-pipeline workloads
• Taxonomy to reason about workloads
• Taxonomy to distinguish between data types
• Data differentiation allows data-driven scheduling
File system support
• Transfer of explicit storage control
• Batch scheduler dictates allocation, caching, replication
• Order of magnitude improvement over CPU-centric
Data-driven batch scheduling
• Multiple allocations and scheduling policies
• Predictive analytical model
End of Talk
Future Work
Measurement
Information
• Collecting
• Retrieving
• Mitigating inaccuracies
Multi-*
Non-canonical
Dynamic reallocation
Checkpointing
Partial results