Sorrento: A Self-Organizing Distributed File System on Large-scale Clusters

Hong Tang, Aziz Gulbeden and Tao Yang
Department of Computer Science, University of California, Santa Barbara
July 2003

Information Management Challenges

- "Disk full (again)!"
  - Cause: increasing storage demand.
  - Options: adding more disks, reorganizing data, removing garbage.
- "Where are the data?"
  - Cause 1: scattered storage repositories.
  - Cause 2: disk corruptions (crashes).
  - Options: exhaustive search; indexing; backup.
- Management headaches!
- Nightmares for data-intensive applications and online services.

A Better World

- A single repository: a virtual disk.
- A uniform hierarchical namespace.
- Expand storage capacity on demand.
- Resilient to disk failures through data redundancy.
- Fast and ubiquitous access.
- Inexpensive storage.

Cluster-based Storage Systems

- Turn a generic cluster into a storage system.

(Figure: clients access the storage cluster over a LAN.)

Why?

- Clusters provide:
  - A cost-effective computing platform.
  - Incremental scalability.
  - High availability.

Design Objectives

- Programmability
  - Virtualization of distributed storage resources.
  - Uniform namespace for data addressing.
- Manageability
  - Incremental expansion.
  - Self-adaptive to node additions and departures.
  - Almost-zero administration.
- Performance
  - Performance monitoring.
  - Intelligent data placement and migration.
- 24/7 Availability
  - Replication support.

Design Choices

- Use commodity components as much as possible.
- Shared-nothing architecture.
- Functionally symmetric servers (serverless).
- User-level file system.
  - Daemons run as user processes.
  - Possible to make it mountable through kernel modules.

Data Organization Model

- User-perceived files are split into variable-length segments (data objects).
- Data objects are linked by index objects.
- Data and index objects are stored in their entirety as files within native file systems.
- Objects are addressed through location-transparent GUIDs.
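
A minimal sketch of this object model, using hypothetical Python types. The names DataObject, IndexObject, object_store, and store_file are illustrative assumptions, and the splitter below uses a fixed segment size for brevity even though Sorrento's segments are variable-length:

    import uuid
    from dataclasses import dataclass, field
    from typing import Dict, List, Union

    @dataclass
    class DataObject:
        """A variable-length segment of a user-perceived file."""
        guid: str
        payload: bytes

    @dataclass
    class IndexObject:
        """Links together the data objects that make up one user-perceived file."""
        guid: str
        segment_guids: List[str] = field(default_factory=list)

    # Location-transparent store keyed by GUID; in Sorrento each object is kept
    # in its entirety as a file in some node's native file system.
    object_store: Dict[str, Union[DataObject, IndexObject]] = {}

    def store_file(content: bytes, segment_size: int = 64 * 1024) -> str:
        """Split a file into segments and return the GUID of its root index object."""
        index = IndexObject(guid=str(uuid.uuid4()))
        for off in range(0, len(content), segment_size):
            seg = DataObject(guid=str(uuid.uuid4()),
                             payload=content[off:off + segment_size])
            object_store[seg.guid] = seg
            index.segment_guids.append(seg.guid)
        object_store[index.guid] = index
        return index.guid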

Multi-level Data Consistency Model

- Level 0: best-effort, without any guarantee. I/O operations may be reordered.
- Level 1: time-ordered I/O operations. Missed writes may still be observed.
- Level 2: open-to-close session consistency. The effects of all I/O operations within an open-to-close session become visible to others either all at once or not at all. A session may be aborted when there is a write/write conflict.
- Level 3: adds file sharing and automatic conflict resolution on top of Level 2.
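
One way to picture the four levels is as an option selected when a file is opened. The enum and the sorrento_open signature below are hypothetical, not Sorrento's actual API; they only illustrate how an application might pick a per-session trade-off between consistency and performance:

    from enum import IntEnum

    class Consistency(IntEnum):
        BEST_EFFORT = 0   # Level 0: no guarantee; I/O may be reordered.
        TIME_ORDERED = 1  # Level 1: operations applied in time order; writes may be missed.
        SESSION = 2       # Level 2: open-to-close sessions are all-or-nothing to others.
        AUTO_RESOLVE = 3  # Level 3: Level 2 plus sharing and automatic conflict resolution.

    class SessionAborted(Exception):
        """Raised at Level 2 when a write/write conflict forces an abort."""

    def sorrento_open(path: str, mode: str, level: Consistency = Consistency.SESSION):
        """Hypothetical open call: the chosen level governs when this session's
        writes become visible to other clients and whether conflicts can abort it."""
        raise NotImplementedError("illustrative signature only")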

System Architecture

- Proxy Module
  - Data location and placement.
  - Monitors the multicast channel.
- Server Module
  - Exports local storage.
- Namespace Server
  - Maintains a global directory tree.
  - Translates filenames to root-object GUIDs.

(Figure: each node runs a server and a proxy; the nodes and the namespace server are connected over a LAN.)
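
At its core the namespace server maintains one mapping: pathname to root-object GUID. A minimal sketch of that translation, assuming a flat path table rather than whatever directory-tree representation Sorrento actually uses:

    from typing import Dict, Optional

    class NamespaceServer:
        """Maintains the global directory tree (flattened here to a path table)
        and translates filenames to root-object GUIDs."""

        def __init__(self) -> None:
            self._root_guid_by_path: Dict[str, str] = {}

        def register(self, path: str, root_guid: str) -> None:
            self._root_guid_by_path[path] = root_guid

        def lookup(self, path: str) -> Optional[str]:
            # Steps 2-3 of the file-access walkthrough on the next slide.
            return self._root_guid_by_path.get(path)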

Accessing a File

1. The client wants to access /foo/bar.
2. Ask the namespace server for /foo/bar's GUID.
3. Get /foo/bar's GUID.
4. Determine the server to contact.
5. Ask that server for the root object.
6. Retrieve the data.
7. Contact other servers if necessary.
8. Close the file.

(Figure: the request flow among the client's proxy, the namespace server, and the storage servers over the LAN.)
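
The eight steps can be strung together into a client-side sketch. It reuses the illustrative names from the earlier object-model and namespace sketches; pick_server stands in for Sorrento's data-location scheme, and fetch_object is an assumed RPC helper supplied by the caller:

    import hashlib

    def pick_server(guid: str, servers: list):
        """Map a GUID to a live server; a simple stand-in for Sorrento's
        consistent-hashing scheme (a fuller ring is sketched later)."""
        digest = int(hashlib.sha1(guid.encode()).hexdigest(), 16)
        return servers[digest % len(servers)]

    def read_file(path: str, namespace_server, servers: list, fetch_object):
        """Illustrative walk through steps 1-8 for a whole-file read.
        fetch_object(server, guid) is an assumed RPC returning the object."""
        # 2-3. Ask the namespace server for the file's root-object GUID.
        root_guid = namespace_server.lookup(path)
        if root_guid is None:
            raise FileNotFoundError(path)
        # 4-5. Determine the server holding the root object and ask it for
        #      the root (index) object.
        index = fetch_object(pick_server(root_guid, servers), root_guid)
        # 6-7. Retrieve the data objects, contacting other servers as needed.
        chunks = [fetch_object(pick_server(g, servers), g).payload
                  for g in index.segment_guids]
        # 8. Close the file; nothing to do in this stateless sketch.
        return b"".join(chunks)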

Project Status

- Distributed data placement and location protocol (to appear in Supercomputing 2003).
- Prototype implementation done by summer 2003.
- Production usage by end of 2003.
- Project Web page: http://www.cs.ucsb.edu/~gulbeden/sorrento/

Evaluation

- We are planning to use trace-driven evaluation.
  - It enables us to find problems without adding much to the system.
  - The performance of various applications can be measured without porting them.
  - It allows us to reproduce and identify any potential problems.
- Applications that can benefit from the system:
  - Web crawler.
  - Protein sequence matching.
  - Parallel I/O applications.
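
A trace-driven harness can be as small as replaying a log of operations against the file system and timing each one. The trace format below (timestamp, op, path, size) and the hook functions are assumptions for illustration; the slides do not describe Sorrento's actual traces:

    import csv
    import time

    def replay_trace(trace_path: str, do_read, do_write):
        """Replay a CSV trace of (timestamp, op, path, size) records and report
        the mean per-operation latency. do_read(path, size) and do_write(path, size)
        are caller-supplied hooks into the file system under test."""
        latencies = []
        with open(trace_path, newline="") as f:
            for ts, op, path, size in csv.reader(f):
                start = time.perf_counter()
                if op == "read":
                    do_read(path, int(size))
                elif op == "write":
                    do_write(path, int(size))
                latencies.append(time.perf_counter() - start)
        return sum(latencies) / len(latencies) if latencies else 0.0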

Project Status and Development Plan

- Most software modules are implemented, including: consistent hashing, UDP request/response management, a persistent hash table, a file block cache, a thread pool, and load statistics collection.
- We are working on building a running prototype.
- Milestones:
  - Bare-bones runtime system.
  - Add dynamic migration.
  - Add version-based data management and replication.
  - Add a kernel VFS switch.

Conclusion

- Project website: http://www.cs.ucsb.edu/~gulbeden/sorrento

Proxy Module

- Consists of:
  - Dispatcher: listens for incoming requests.
  - Thread pool: processes requests from local applications.
  - Subscriber: monitors the multicast channel.
- Stores:
  - The set of live hosts.
  - The address of the Namespace Server.
  - The set of open file handles.
- Locates data by hashing the GUID of the object (a consistent-hashing sketch follows below).
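
The development-plan slide lists consistent hashing among the implemented modules, and this is the natural way to hash a GUID to a server. The ring below is a textbook sketch, not necessarily Sorrento's exact scheme: each server contributes several virtual points on a hash ring, and an object is owned by the first server point at or after the hash of its GUID, so membership changes move only a small fraction of the objects.

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Textbook consistent hashing: object GUID -> server, with virtual nodes."""

        def __init__(self, servers, vnodes: int = 64):
            self._points = []            # sorted list of (hash, server)
            for s in servers:
                for i in range(vnodes):
                    self._points.append((self._hash(f"{s}#{i}"), s))
            self._points.sort()
            self._keys = [h for h, _ in self._points]

        @staticmethod
        def _hash(key: str) -> int:
            return int(hashlib.sha1(key.encode()).hexdigest(), 16)

        def server_for(self, guid: str) -> str:
            """Return the server responsible for this object GUID."""
            idx = bisect.bisect_right(self._keys, self._hash(guid)) % len(self._keys)
            return self._points[idx][1]

    # Example: map an object GUID to one of three nodes.
    ring = ConsistentHashRing(["node1", "node2", "node3"])
    print(ring.server_for("9f2c4e-example-guid"))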

Server Module

- Consists of:
  - Dispatcher: listens for requests (UDP or TCP).
  - Thread pool: handles requests for local operations.
  - Local storage: stores the local data.
- Stores:
  - A partition of the global block table.
  - The INode map.
  - The physical local store.

(Figure: the server dispatcher waits for requests over UDP/TCP and enqueues them; the thread pool dequeues and handles Create, Open, Close, Read, Write, Append, and Remove operations against local storage, then responds over UDP/TCP. A sketch of this pattern follows below.)
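
The dispatcher/thread-pool split in the figure can be sketched with a request queue: the network dispatcher only enqueues, and worker threads dequeue and run the operation handlers. This is a generic illustration of the pattern, not Sorrento's actual server code; only a read handler is registered here, with the remaining operations from the figure left out.

    import queue
    import threading

    HANDLERS = {}   # operation name -> handler function

    def handler(op):
        def register(fn):
            HANDLERS[op] = fn
            return fn
        return register

    @handler("read")
    def handle_read(request):
        # A real server would read the requested block from the physical
        # local store here; Create/Open/Close/Write/Append/Remove are omitted.
        return b""

    class Server:
        """Dispatcher enqueues requests; a thread pool dequeues and handles them."""

        def __init__(self, workers: int = 4):
            self._queue = queue.Queue()
            for _ in range(workers):
                threading.Thread(target=self._worker, daemon=True).start()

        def dispatch(self, request: dict):
            """Called by the network dispatcher (UDP/TCP) for each incoming request."""
            self._queue.put(request)

        def _worker(self):
            while True:
                request = self._queue.get()
                try:
                    result = HANDLERS[request["op"]](request)
                    request["respond"](result)   # send the reply back over UDP/TCP
                finally:
                    self._queue.task_done()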

Choice I: SAN (Storage Area Network)

- Distributed and heterogeneous devices.
- Dedicated fast network.
- Storage virtualization:
  - Volume-based.
  - Each volume is managed by a dedicated server.
  - Volume map.

Choice I: SAN (cont.)

- Disadvantages: cost, scalability, manageability.
- Expanding an existing volume:
  - Change the volume map.
  - Reorganize data on the old volume.
- Handling disk failures:
  - Exclude failed disks from volume maps.
  - Restore data to spare disks.
- Conclusions:
  - Hard to automate.
  - Prone to human errors (at large scale).
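
To see why this administration is error-prone, picture the volume map as nothing more than a table from volume to disks: every expansion or failure means editing that table and then moving data by hand to match it. The sketch below is a generic illustration of that bookkeeping, not any particular SAN product's interface.

    from typing import Dict, List

    # A volume map: each volume name -> the disks that back it.
    volume_map: Dict[str, List[str]] = {
        "vol-home": ["disk-a", "disk-b"],
        "vol-scratch": ["disk-c"],
    }

    def expand_volume(volume: str, new_disk: str) -> None:
        """Expanding a volume: change the volume map; the administrator must
        still reorganize the data already on the old disks."""
        volume_map[volume].append(new_disk)

    def handle_disk_failure(failed_disk: str, spare_disk: str) -> None:
        """Handling a failure: exclude the failed disk from every volume map
        entry and restore its data to a spare disk, for every affected volume."""
        for disks in volume_map.values():
            if failed_disk in disks:
                disks[disks.index(failed_disk)] = spare_disk
                # ...restore the lost data onto spare_disk here...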