Explicit Control in a
Batch-aware Distributed
File System
John Bent
Douglas Thain
Andrea Arpaci-Dusseau
Remzi Arpaci-Dusseau
Miron Livny
University of Wisconsin, Madison
Grid computing
• Physicists invent distributed computing!
• Astronomers develop virtual supercomputers!
Grid computing
[Figure: remote compute clusters connected over the Internet to home storage]
• If it looks like a duck . . .
Are existing distributed file systems adequate for batch computing workloads?
• NO. Internal decisions inappropriate
• Caching, consistency, replication
• A solution: Batch-Aware Distributed File System (BAD-FS)
• Combines knowledge with external storage control
• Detailed information about the workload is known
• Storage layer allows external control
• External scheduler makes informed storage decisions
• Combining information and control results in
• Improved performance
• More robust failure handling
• Simplified implementation
Outline
• Introduction
• Batch computing
  • Systems
  • Workloads
  • Environment
  • Why not DFS?
• Our answer: BAD-FS
• Design
• Experimental evaluation
• Conclusion
Batch computing
• Not interactive computing
• Job description languages
• Users submit
• System itself executes
• Many different batch systems
  • Condor
  • LSF
  • PBS
  • Sun Grid Engine
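As a concrete illustration of the job description languages mentioned above, here is a minimal sketch in Python; the field names and the submit helper are assumptions for illustration, not the syntax of any particular batch system:

    # Hypothetical declarative job description: the user states what to run
    # and which files it touches; the batch system decides where and when.
    job = {
        "executable": "simulate",                     # assumed program name
        "arguments": ["--config", "params.txt"],
        "input_files": ["params.txt", "dataset.bin"],
        "output_files": ["results.out"],
        "requirements": {"memory_mb": 512, "disk_mb": 1024},
    }

    def submit(job, queue):
        """The user submits; the system itself executes the job, possibly much later."""
        queue.append(job)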
Batch computing
[Figure: a scheduler with a job queue (jobs 1-4) dispatches work over the Internet to compute nodes; each node has a CPU and a manager, and data is staged from home storage]
Batch workloads
“Pipeline and Batch Sharing in Grid Workloads,” Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.
• General properties
• Large number of processes
• Process and data dependencies
• I/O intensive
• Different types of I/O
• Endpoint
• Batch
• Pipeline
• Our focus: Scientific workloads
• More generally applicable
• Many others use batch computing
• video production, data mining, electronic design,
financial services, graphic rendering
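As a small illustration of the three I/O types listed above, this Python sketch tags each file of a hypothetical pipeline stage with its type; the file names and the enum are assumptions for illustration, not BAD-FS interfaces:

    from enum import Enum

    class IOType(Enum):
        ENDPOINT = "endpoint"   # final input/output that must end up at the home site
        BATCH = "batch"         # read-only data shared by many pipelines
        PIPELINE = "pipeline"   # intermediate data passed between stages of one pipeline

    # Hypothetical pipeline stage: only the endpoint file needs to travel home.
    files = {
        "calibration.db": IOType.BATCH,
        "stage1.tmp":     IOType.PIPELINE,
        "events.out":     IOType.ENDPOINT,
    }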
Batch workloads
[Figure: several pipelines of jobs; each pipeline has endpoint inputs and outputs, pipeline data flowing between its own stages, and batch datasets shared across pipelines]
Cluster-to-cluster (c2c)
• Not quite p2p
  • More organized
  • Less hostile
  • More homogeneity
  • Correlated failures
[Figure: multiple compute clusters connected over the Internet to a home store]
• Each cluster is autonomous
  • Run and managed by different entities
• An obvious bottleneck is the wide-area link
How to manage the flow of data into, within, and out of these clusters?
Why not DFS?
[Figure: clusters reach the home store over the Internet through a distributed file system]
• Distributed file system would be ideal
• Easy to use
• Uniform name space
• Designed for wide-area networks
• But . . .
• Not practical
• Embedded decisions are wrong
DFS’s make bad decisions
• Caching
• Must guess what and how to cache
• Consistency
• Output: Must guess when to commit
• Input: Needs mechanism to invalidate cache
• Replication
• Must guess what to replicate
BAD-FS makes good decisions
• Removes the guesswork
• Scheduler has detailed workload knowledge
• Storage layer allows external control
• Scheduler makes informed storage decisions
• Retains simplicity and elegance of DFS
• Practical and deployable
Outline
• Introduction
• Batch computing
  • Systems
  • Workloads
  • Environment
  • Why not DFS?
• Our answer: BAD-FS
• Design
• Experimental evaluation
• Conclusion
Practical and deployable
• User-level; requires no privilege
• Packaged as a modified batch system
[Figure: BAD-FS packaged inside a modified batch system (SGE shown) on each cluster, connected over the Internet to the home store]
• A new batch system which includes BAD-FS
• General; will work on all batch systems
• Tested thus far on multiple batch systems
Contributions of BAD-FS
[Figure: compute nodes, each with a CPU, a batch system manager, and a storage manager exporting BAD-FS; a BAD-FS scheduler with a job queue coordinates them and the home storage]
1) Storage managers
2) Batch-Aware Distributed File System
3) Expanded job description language
4) BAD-FS scheduler
BAD-FS knowledge
• Remote cluster knowledge
• Storage availability
• Failure rates
• Workload knowledge
• Data type (batch, pipeline, or endpoint)
• Data quantity
• Job dependencies
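A minimal sketch of the two kinds of knowledge listed above, expressed as plain Python data; the structure, field names, and numbers are assumptions for illustration, not the actual BAD-FS data model:

    # Remote cluster knowledge: available storage and observed reliability.
    clusters = {
        "cluster-a": {"free_storage_mb": 4000, "failures_per_hour": 0.01},
        "cluster-b": {"free_storage_mb": 1500, "failures_per_hour": 0.05},
    }

    # Workload knowledge: per-job data types, quantities, and dependencies.
    workload = {
        "job1": {"batch_mb": 500, "pipeline_mb": 200, "endpoint_mb": 5, "depends_on": []},
        "job2": {"batch_mb": 500, "pipeline_mb": 200, "endpoint_mb": 5, "depends_on": ["job1"]},
    }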
Control through volumes
• Guaranteed storage allocations
• Containers for job I/O
• Scheduler
• Creates volumes to cache input data
• Subsequent jobs can reuse this data
• Creates volumes to buffer output data
• Destroys pipeline, copies endpoint
• Configures workload to access containers
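The volume operations above can be sketched as follows; the storage-manager calls (create_volume, fetch_from_home, copy_to_home, destroy_volume) are hypothetical names used for illustration, not the real BAD-FS API:

    def schedule_pipeline(scheduler, storage, pipeline):
        # Create a volume to cache the shared batch input; later pipelines reuse it.
        batch_vol = storage.create_volume(size_mb=pipeline.batch_mb)
        storage.fetch_from_home(batch_vol, pipeline.batch_files)

        # Create a volume to buffer pipeline and endpoint output near the jobs.
        scratch_vol = storage.create_volume(size_mb=pipeline.pipeline_mb + pipeline.endpoint_mb)

        # Configure the workload to access the containers instead of home storage.
        for job in pipeline.jobs:
            scheduler.run(job, mounts=[batch_vol, scratch_vol])

        # Copy endpoint output home; pipeline data is simply destroyed with the volume.
        storage.copy_to_home(scratch_vol, pipeline.endpoint_files)
        storage.destroy_volume(scratch_vol)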
Knowledge plus control
• Enhanced performance
• I/O scoping
• Capacity-aware scheduling
• Improved failure handling
• Cost-benefit replication
• Simplified implementation
• No cache consistency protocol
I/O scoping
• Technique to minimize wide-area traffic
• Allocate storage to cache batch data
• Allocate storage for pipeline and endpoint
• Extract endpoint
[Figure: BAD-FS scheduler directing compute nodes running AMANDA; per pipeline: 200 MB pipeline, 500 MB batch, 5 MB endpoint]
• Steady-state: only 5 of 705 MB traverse the wide-area
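The steady-state figure above follows directly from the per-pipeline sizes on the slide; a small worked sketch:

    pipeline_mb, batch_mb, endpoint_mb = 200, 500, 5

    # Without scoping, all of a pipeline's data may cross the wide-area link.
    naive_traffic = pipeline_mb + batch_mb + endpoint_mb   # 705 MB

    # With I/O scoping: batch data is cached remotely after its first fetch,
    # pipeline data never leaves the cluster, and only endpoint data goes home.
    steady_state_traffic = endpoint_mb                     # 5 MB

    print(f"{steady_state_traffic} of {naive_traffic} MB traverse the wide-area")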
Capacity-aware scheduling
• Technique to avoid over-allocations
• Scheduler runs only as many jobs as fit
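A minimal sketch of that admission check (a simplified model with assumed job fields, treating each job's data as unshared; not the BAD-FS scheduler itself):

    def admit_jobs(ready_jobs, free_storage_mb):
        """Dispatch only as many jobs as their volumes fit in available storage."""
        running = []
        for job in ready_jobs:
            need_mb = job["batch_mb"] + job["pipeline_mb"] + job["endpoint_mb"]
            if need_mb <= free_storage_mb:
                free_storage_mb -= need_mb
                running.append(job)      # allocate volumes and run
            # otherwise the job waits; over-allocation would cause cache
            # thrashing or write failures
        return running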
Capacity-aware scheduling
[Figure: pipelines with endpoint and pipeline data and several shared batch datasets, illustrating how allocations are fit to the available storage]
Capacity-aware scheduling
• 64 batch-intensive synthetic pipelines
• Vary size of batch data
• 16 compute nodes
Improved failure handling
• Scheduler understands data semantics
• Data is not just a collection of bytes
• Losing data is not catastrophic
• Output can be regenerated by rerunning jobs
• Cost-benefit replication
• Replicates only data whose replication cost is lower than the cost of rerunning the job
• Results in paper
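The replication rule above reduces to a single comparison; the cost model below is a sketch with assumed inputs (a wide-area bandwidth estimate and an expected failure probability), not the exact formulation in the paper:

    def should_replicate(data_mb, wan_mbps, rerun_seconds, failure_prob):
        """Replicate only if copying the data home is cheaper than the expected
        cost of regenerating it by rerunning jobs after a failure."""
        replication_cost = data_mb * 8 / wan_mbps        # seconds to push one copy
        expected_rerun_cost = failure_prob * rerun_seconds
        return replication_cost < expected_rerun_cost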
Simplified implementation
• Data dependencies known
• Scheduler ensures proper ordering
• No need for cache consistency protocol in cooperative cache
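Because dependencies are known, ordering alone guarantees that a job never reads data its producer has not finished writing, which is what makes a consumer-side consistency protocol unnecessary. A sketch of that ordering (a standard topological sort over the assumed workload structure shown earlier):

    from graphlib import TopologicalSorter   # Python 3.9+

    def run_in_order(workload, run):
        """Run each job only after every job it depends on has completed."""
        deps = {name: set(info["depends_on"]) for name, info in workload.items()}
        for name in TopologicalSorter(deps).static_order():
            run(name)   # by construction, all of its inputs are final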
Real workloads
• AMANDA
• Astrophysics study of cosmic events such as gamma-ray bursts
• BLAST
• Biology search for proteins within a genome
• CMS
• Physics simulation of large particle colliders
• HF
• Chemistry study of non-relativistic interactions between atomic nuclei and electrons
• IBIS
• Ecology global-scale simulation of earth’s climate used
to study effects of human activity (e.g. global warming)
Real workload experience
• Setup
• 16 jobs
• 16 compute nodes
• Emulated wide-area
• Configuration
• Remote I/O
• AFS-like with /tmp
• BAD-FS
• Result is an order-of-magnitude improvement
BAD Conclusions
• Existing DFS’s insufficient
• Schedulers have workload knowledge
• Schedulers need storage control
• Caching
• Consistency
• Replication
• Combining this control with knowledge
• Enhanced performance
• Improved failure handling
• Simplified implementation
For more information
• http://www.cs.wisc.edu/adsl
• http://www.cs.wisc.edu/condor
• Questions?
Why not BAD-scheduler
and traditional DFS?
• Cooperative caching
• Data sharing
• Traditional DFS
  • assumes sharing is the exception
  • provisions for arbitrary, unplanned sharing
• In batch workloads, sharing is the rule
• Sharing behavior is completely known
• Data committal
• Traditional DFS must guess when to commit
• AFS commits on close; NFS flushes every 30 seconds
• Batch workloads precisely define when
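A sketch of the contrast; the commit hook below is a hypothetical illustration, while the AFS and NFS behaviors are the standard ones cited above:

    # Traditional DFS must guess when to commit:
    #   AFS: write back dirty data when the file is closed.
    #   NFS: flush dirty data on a timer (roughly every 30 seconds).
    #
    # A batch-aware system does not guess: the scheduler knows exactly when a
    # job finishes, so its output is committed once, at job completion.
    def on_job_complete(job, storage):
        storage.commit(job.output_volume)   # hypothetical storage-manager call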
Is capacity-aware scheduling important in the real world?
1. Heterogeneity of remote resources
2. Shared disk
3. Workloads are changing; some are very, very large.
Capacity-aware scheduling
• Goal
• Avoid overallocations
• Cache thrashing
• Write failures
• Method
• Breadth-first
• Depth-first
• Idleness
Capacity-aware scheduling evaluation
• Workload
• 64 synthetic pipelines
• Varied pipe size
• Environment
• 16 compute nodes
• Configuration
• Breadth-first
• Depth-first
• BAD-FS
Failures directly correlate with workload throughput.