Operating System Support
for Space Allocation
in Grid Storage Systems
Douglas Thain
University of Notre Dame
IEEE Grid Computing, Sep 2006
Bad News:
Many large distributed systems
fall to pieces under heavy load!
Example: Grid3 (OSG)
Robert Gardner, et al. (102 authors)
The Grid3 Production Grid
Principles and Practice
IEEE HPDC 2004
The Grid2003 Project has deployed a multi-virtual
organization, application-driven grid laboratory
that has sustained for several months the
production-level services required by…
ATLAS, CMS, SDSS, LIGO…
Grid2003: The Details
The good news:
– 27 sites with 2800 CPUs
– 40,985 CPU-days provided over 6 months
– 10 applications with 1300 simultaneous jobs
The bad news on ATLAS jobs:
– 40-70 percent utilization
– 30 percent of jobs failed.
– 90 percent of failures were site problems
– Most site failures were due to disk space!
A Thought Experiment
[Figure: a job stages input from a shared disk and fans tasks out to many CPUs; every task writes its output back to the same shared disk. Now multiply by 1,000,000 jobs.]
1 – Only a problem when load > capacity.
2 – Grids are employed by users with infinite needs!
Need Space Allocation
• Grid storage managers:
– SRB - Storage Resource Broker at SDSC.
– SRM – Storage Resource Manager at LBNL.
– NeST – Networked Storage at UW-Madison.
– IBP – Internet Backplane Protocol at UTK.
• But they get no help from the OS.
– A runaway logfile can invalidate the careful
accounting of the grid storage manager.
Outline
• Grids Need OS Support for Allocation
• A Model of Space Allocation
• Three Implementations
– User-Level Library
– Loopback Devices
– AllocFS: Kernel Filesystem
• Application to a Cluster
A Model of Space Allocation
[Figure: a directory tree in which every allocation carries a size and a used total. The root allocation (size: 1000 GB, used: 700 GB) contains sub-allocations jobs (size: 100 GB) and home (size: 500 GB); jobs holds per-job allocations such as j1 (size: 10 GB, used: 5 GB, containing files data and core) and j2; home holds per-user allocations alice and betty.]
Three commands:
– mkalloc (dir) (size)
– lsalloc (dir)
– rm -rf (dir)
No Built-In Allocation Policy
• In order to make an allocation:
– Must have permission to mkdir.
– New allocation must fit in the available space (see the sketch below).
• Need something more complex?
– Check a remote database for a global quota?
– Delete the allocation after a certain time?
– Send email when the allocation is full?
• Use a storage manager at a higher level.
– SRB, SRM, NeST, IBP, etc...
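Below is a minimal C sketch of that fit check, the only rule the OS itself enforces. The struct and function names are hypothetical; the paper does not publish this code.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical allocation state attached to a directory. */
struct alloc_state {
    uint64_t size;  /* bytes granted to this allocation */
    uint64_t used;  /* bytes consumed by files and sub-allocations */
};

/* mkalloc also requires mkdir permission on the parent; that is
 * checked by ordinary filesystem access control, not shown here. */
bool allocation_fits(const struct alloc_state *parent, uint64_t request)
{
    return parent->used <= parent->size &&
           request <= parent->size - parent->used;
}

Anything more elaborate than this check belongs to a higher-level storage manager, as the bullets above note.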
No Built-In Allocation Policy
[Figure: a user who needs 10 GB asks the grid storage manager; the manager checks a database, charges a credit card, or consults a human, then runs:
mkalloc /jobs/j5 10GB
setacl /jobs/j5 alice write
and replies "ok, use jobs/j5". Inside the new 10 GB allocation (writeable by alice), the user creates 5 GB sub-allocations task1 and task2 and proceeds with ordinary file access.]
Outline
• Grids Need OS Support for Allocation
• A Model of Space Allocation
• Three Implementations
– User-Level Library
– Loopback Devices
– AllocFS: Kernel Filesystem
• Application to a Cluster
User Level Library
[Figure: two applications linked with LibAlloc update a shared allocation tree (root: 1000 GB → jobs: 100 GB → j1: 10 GB, j2). Each write proceeds in three steps: 1 – lock/read the allocation state; 2 – stat/write; 3 – write/unlock.]
User Level Library
• Some details about locking: see the paper (the sketch below illustrates the idea).
• Applicability
– Must modify applications or servers to employ it.
– Fails if non-enabled applications interfere.
– But it can be deployed anywhere without privileges.
• Performance
– Optimization: cache locks until idle for 2 seconds.
– At best, writes double in latency.
– At worst, shared directories ping-pong locks.
• Recovery
– fixalloc: traverses the directory structure and
recomputes current allocations.
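The write path might look like the following C sketch. It assumes, purely for illustration, that the allocation state lives in a hypothetical .alloc file at the allocation root and that mutual exclusion uses flock(2); the real LibAlloc interface may differ.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

struct alloc_state { uint64_t size, used; };

/* Charge nbytes against the allocation before writing them.
 * Returns 0 on success, -1 if the allocation would overflow. */
int alloc_charge(const char *alloc_root, uint64_t nbytes)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/.alloc", alloc_root);

    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    flock(fd, LOCK_EX);                /* 1 - lock */
    struct alloc_state st;
    pread(fd, &st, sizeof(st), 0);     /* 1 - read state */

    if (st.used + nbytes > st.size) {  /* 2 - check the fit */
        flock(fd, LOCK_UN);
        close(fd);
        return -1;
    }

    st.used += nbytes;                 /* 3 - write new state */
    pwrite(fd, &st, sizeof(st), 0);

    /* The real library caches the lock until 2 seconds idle;
     * this sketch releases it eagerly, paying the full latency. */
    flock(fd, LOCK_UN);                /* 3 - unlock */
    close(fd);
    return 0;
}

This is why writes at best double in latency: every application write is preceded by a synchronous lock/read/update cycle.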
Loopback Filesystems
[Figure: each allocation, e.g. /jobs (100 GB) within the 1000 GB root, is a separate file-backed loopback filesystem, created roughly as follows:]
dd if=/dev/zero of=/jobs.fs bs=1M count=102400
losetup /dev/loopN /jobs.fs
mke2fs /dev/loopN
mount /dev/loopN /jobs
Loopback Filesystems
• Applicability
– Works with any standard application.
– Must be root to deploy and manage allocations.
– Limited to approx 10-100 allocations.
• Performance
– Ordinary reads and writes: no overhead.
– Allocations: Must touch every block to reserve!
– Massively increases I/O traffic to disk.
• Recovery
– Must scan hierarchy, fsck and mount every allocation.
– Disastrous for large file systems!
AllocFS: Kernel-Level Filesystem
[Figure: a directory tree root(2) → jobs(3) → j1(4) and j2(6), where j1 holds file(5) and j2 holds file(7). Each inode carries allocation state:]

Inode Table
#  uid  size     used    parent
2  0    1000 GB  700 GB  2
3  0    100 GB   99 GB   2
4  34   10 GB    5 GB    3
5  34   -        -       4
6  56   -        -       3
7  56   -        -       6
1 – To update allocation state, update fields in the in-core inode.
2 – To create/delete an allocation, update the parent's
allocation state, which is already cached for other reasons.
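In C, those two fast paths might look like the sketch below. Field and function names are hypothetical simplifications; the actual AllocFS code operates on EXT2 in-core inodes inside the kernel.

/* Hypothetical in-core inode fields relevant to allocation. */
struct alloc_inode {
    struct alloc_inode *parent;  /* enclosing directory's inode */
    unsigned long long  size;    /* 0 if not an allocation root */
    unsigned long long  used;
};

/* 1 - Charging a write: walk to the nearest enclosing allocation
 * root and update its fields, entirely in memory. */
int charge_bytes(struct alloc_inode *ino, unsigned long long bytes)
{
    struct alloc_inode *a = ino;
    while (a->size == 0)          /* root has size != 0, so this ends */
        a = a->parent;
    if (a->used + bytes > a->size)
        return -1;                /* this allocation is full */
    a->used += bytes;
    return 0;
}

/* 2 - Creating a sub-allocation: charge its full size to the
 * parent, whose inode the mkdir has already brought into memory. */
int create_alloc(struct alloc_inode *dir, struct alloc_inode *sub,
                 unsigned long long size)
{
    if (charge_bytes(dir, size) != 0)
        return -1;
    sub->parent = dir;
    sub->size = size;
    sub->used = 0;
    return 0;
}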
AllocFS: Kernel-Level Filesystem
• Applicability
– Works with any ordinary application.
– Must be root to load the module and install it.
– Binary compatible with the existing EXT2 filesystem.
– Once loaded, ordinary users may employ it.
• Performance
– No measurable overhead on I/O.
– Creating an allocation: touch two inodes.
– Deleting an allocation: same as deleting a directory.
• Recovery
– fixalloc: traverses the directory structure and
recomputes current allocations (see the sketch below).
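Recovery can stay simple because the truth is already in the directory tree. A user-space C sketch of a fixalloc-style walk follows; the helper name is hypothetical, and for brevity it only recomputes byte totals rather than rewriting each allocation's on-disk size/used fields.

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Recompute bytes consumed beneath path by a depth-first walk.
 * A real fixalloc would write the totals back into each
 * allocation's state as the recursion unwinds. */
long long recompute_used(const char *path)
{
    long long used = 0;
    DIR *d = opendir(path);
    if (!d)
        return 0;

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;

        char sub[4096];
        snprintf(sub, sizeof(sub), "%s/%s", path, e->d_name);

        struct stat st;
        if (lstat(sub, &st) != 0)
            continue;

        if (S_ISDIR(st.st_mode))
            used += recompute_used(sub);  /* recurse into children */
        else
            used += st.st_size;           /* charge each file */
    }
    closedir(d);
    return used;
}

One pass over the directory structure suffices, which is why recovery time grows linearly with the number of files.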
Library Adds Latency
[Graph omitted.]
Allocation Performance
• Loopback Filesystem
– 1 second per 25 MB of allocation. (40 sec/GB)
– Must touch every single block.
– Big increase in unnecessary I/O traffic!
• Allocation Library
– 227 usec regardless of size.
– Several synchronous disk ops.
• Kernel Level Filesystem
– 32 usec regardless of size.
– Touch one inode.
Comparison

           Priv. Reqd.       Guarantee?  Max #     Write Perf.  Alloc Perf.   Recovery
Library    any user          no          no limit  2x latency   usec          fixalloc once
Loopback   root to           yes         10-100    no change    secs to mins  fsck and mount
           install, use                                                       each alloc
Kernel     root to install   yes         no limit  no change    usec          fixalloc once
Outline
• Grids Need OS Support for Allocation
• A Model of Space Allocation
• Three Implementations
– User-Level Library
– Loopback Devices
– AllocFS: Kernel Filesystem
• Application to a Cluster
A Physical Experiment
[Figure: the same structure as the thought experiment, run on a real cluster: each job stages data in and out through a shared disk while its tasks run on many CPUs.]
The shared disk only has space for 10 jobs.
Vary load: # of simultaneous jobs.
Four configurations:
1 – No allocations.
2 – Backoff when failures are detected.
3 – Heuristic: don't start a job unless free space > threshold (see the sketch below).
4 – Allocate space for each job.
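Configuration 3's admission check could be as small as the following C sketch, which queries the shared disk with the standard statvfs(3) call; the threshold parameter is illustrative.

#include <stdbool.h>
#include <sys/statvfs.h>

/* Start a job only if the shared disk's free space exceeds a fixed
 * threshold. Nothing is reserved, so two jobs can pass the check
 * simultaneously and still overrun the disk. */
bool heuristic_admit(const char *shared_disk, unsigned long long threshold)
{
    struct statvfs vfs;
    if (statvfs(shared_disk, &vfs) != 0)
        return false;
    return (unsigned long long)vfs.f_bavail * vfs.f_frsize > threshold;
}

Configuration 4 replaces this guess with an explicit mkalloc per job, turning admission from a race into a guarantee.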
Allocations Improve Robustness
[Results graph omitted.]
Summary
• Grids require space allocations in order to
become robust under heavy loads.
• Explicit operating system support for
allocations is needed in order to make
them manageable and efficient.
• User-level approximations are possible,
but have overheads in performance and management.
• AllocFS provides allocations compatible
with EXT2 with no measurable overhead.
Library Implementation
• http://www.cctools.org/chirp
• Solaris, Linux, Mac, Windows
• Start server with -Q 100GB
Kernel Implementation
• http://www.cctools.org/allocfs
• Works with Linux 2.4.21.
• Install over existing EXT2 FS.
– (And, uninstall without loss.)
% mkalloc /mnt/alloctest/adir 25M
mkalloc: /mnt/alloctest/adir allocated 25600 blocks.
% lsalloc -r /mnt/alloctest
USED   TOTAL  PCT PATH
25.01M 87.14M 28% /mnt/alloctest
10.00M 25.00M 39% /mnt/alloctest/adir
A Final Thought
[Some think] traditional OS issues are either solved
problems or minor problems. We believe that
building such vast distributed systems upon the
fragile infrastructure provided by today’s operating
systems is analogous to building castles on sand.
The Persistent Relevance of the Local Operating System to Global Applications
Jay Lepreau, Bryan Ford, and Mike Hibler
SIGOPS European Workshop, September 1996
For More Information:
• Cooperative Computing Lab:
– http://www.cse.nd.edu/~ccl
• Douglas Thain
– dthain@cse.nd.edu
• Related Talks:
– “Grid Deployment of Bioinformatics Apps...”
• Session 4A Friday
– “Cacheable Decentralized Groups...”
• Session 5B Friday
Extra Slides
Existing Tools Not Suitable for the Grid
• User and Group Quotas
– Don’t always correspond to allocation needs!
• User might want one alloc per job.
• Or, many users may want to share an alloc.
• Disk Partitions
– Very expensive to create, change, manage.
– Not hierarchical: only root can manage.
• ZFS Allocations
– Cheap to create, change, manage.
– Not hierarchical: only root can manage.
Library Suffers on Small Writes
[Graph omitted.]
Recovery Linear wrt # of Files
[Graph omitted.]