Large Scale Sharing
GFS and PAST
Mahesh Balakrishnan
Distributed File Systems
Traditional Definition:
Data and/or metadata stored at remote locations, accessed by clients over the network.
Various degrees of centralization: from NFS to xFS.
GFS and PAST
Unconventional, specialized functionality
Large-scale in data and nodes
The Google File System
Specifically designed for Google's backend needs
Web spiders append to huge files
Application data patterns:
Multiple-producer, multiple-consumer
Many-way merging
GFS vs. Traditional File Systems
Design Space Coordinates
Commodity Components
Very large files – Multi GB
Large sequential accesses
Co-design of Applications and File System
Supports small files and random-access reads and writes, but not efficiently
GFS Architecture
Interface:
Usual: create, delete, open, close, etc.
Special: snapshot, record append
Files divided into fixed-size chunks
Each chunk replicated at chunkservers
Single master maintains metadata
Master, chunkservers, and clients run as user-level processes on Linux workstations
Client File Request
Client finds chunkid for offset within file
Client sends <filename, chunkid> to Master
Master returns chunk handle and chunkserver locations
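A minimal sketch of this lookup path, assuming hypothetical master and chunkserver stand-ins (lookup_chunk and the in-memory tables are illustrative names, not the real GFS interfaces):

```python
# Sketch of the GFS client read path; the master/chunkserver stubs below are
# hypothetical stand-ins for illustration, not the real GFS interfaces.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS chunks are a fixed 64 MB

class FakeMaster:
    """Maps (filename, chunk index) -> (chunk handle, replica locations)."""
    def __init__(self, table):
        self.table = table
    def lookup_chunk(self, filename, chunk_index):
        return self.table[(filename, chunk_index)]

def read(master, chunkservers, filename, offset, length):
    # 1. Translate the byte offset within the file into a chunk index.
    chunk_index = offset // CHUNK_SIZE
    # 2. Ask the master for the chunk handle and chunkserver locations
    #    (real clients cache this answer to keep the master off the data path).
    handle, locations = master.lookup_chunk(filename, chunk_index)
    # 3. Read the data directly from one of the chunkservers.
    data = chunkservers[locations[0]][handle]
    return data[offset % CHUNK_SIZE : offset % CHUNK_SIZE + length]

# Tiny usage example with in-memory stand-ins.
master = FakeMaster({("/crawl/log", 0): ("h42", ["cs1", "cs2", "cs3"])})
chunkservers = {"cs1": {"h42": b"spider output..." + b"\0" * 100}}
print(read(master, chunkservers, "/crawl/log", 7, 9))
```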
Design Choices: Master
Single master maintains all metadata …
+ Simple design
+ Global decision making for chunk replication and placement
- Bottleneck?
- Single point of failure?
Design Choices: Master
Single master maintains all metadata … in memory!
+ Fast master operations
+ Allows background scans of the entire data
- Memory limit?
- Fault tolerance?
Relaxed Consistency Model
File regions are:
Consistent: all clients see the same thing
Defined: after a mutation, all clients see exactly what the mutation wrote
Ordering of concurrent mutations:
For each chunk's replica set, the master gives one replica the primary lease
The primary replica decides the ordering of mutations and sends it to the other replicas
Anatomy of a Mutation
1-2: Client gets the chunkserver locations from the master
3: Client pushes the data to the replicas, in a chain
4: Client sends the write request to the primary; the primary assigns a sequence number to the write and applies it
5-6: Primary tells the other replicas to apply the write; the secondaries acknowledge
7: Primary replies to the client
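A rough sketch of this flow with illustrative classes (not GFS code); the data push of step 3 is omitted, and only the ordering imposed by the primary's sequence numbers is shown:

```python
# Illustrative sketch of GFS-style mutation ordering: the primary replica
# assigns a serial number to each write and all replicas apply writes in
# that order. Class and method names are hypothetical.

class Replica:
    def __init__(self):
        self.chunk = bytearray(1024)   # small stand-in for a 64 MB chunk
        self.applied = []              # serial numbers applied, in order

    def apply(self, serial, offset, data):
        self.chunk[offset:offset + len(data)] = data
        self.applied.append(serial)

class Primary(Replica):
    """Replica currently holding the lease for this chunk."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, offset, data):
        # Step 4: assign a sequence number and apply locally.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, offset, data)
        # Steps 5-6: tell the secondaries to apply in the same order; collect acks.
        acks = []
        for s in self.secondaries:
            s.apply(serial, offset, data)
            acks.append(True)
        # Step 7: reply to the client.
        return all(acks)

secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
print(primary.write(0, b"record-A"), primary.write(8, b"record-B"))
```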
Connection with Consistency Model
If a secondary replica encounters an error while applying the write (step 5), the region is Inconsistent.
If client code breaks up a single large write into multiple smaller writes, the region is Consistent but Undefined.
Special Functionality
Atomic Record Append:
The primary appends the record to its own replica, then tells the other replicas to write it at that same offset
If a secondary replica fails to write the data (step 5): duplicates in the successful replicas, padding in the failed ones
The region is Defined where the append succeeded, Inconsistent where it failed
Snapshot:
Copy-on-write: chunks are copied lazily, with each copy made locally on the same chunkserver
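A small sketch of how a failed secondary leads to duplicates and padding on retry; the Chunk class and record_append function are illustrative, not GFS code:

```python
# Sketch of GFS-style atomic record append at the primary. Names are
# illustrative; real GFS also handles chunk-boundary padding, retries, etc.

class Chunk:
    def __init__(self):
        self.data = bytearray()

    def write_at(self, offset, record):
        # Grow with zero padding if needed, then write at the given offset.
        if len(self.data) < offset:
            self.data.extend(b"\0" * (offset - len(self.data)))
        self.data[offset:offset + len(record)] = record

def record_append(primary, secondaries, record, fail_secondary=None):
    # Primary picks the offset (its current end) and appends locally.
    offset = len(primary.data)
    primary.write_at(offset, record)
    ok = True
    for i, s in enumerate(secondaries):
        if i == fail_secondary:
            ok = False                  # this replica now lacks the record
            continue
        s.write_at(offset, record)      # same offset on every replica
    # On failure the client retries, so successful replicas end up with
    # duplicates and the failed one with padding at the old offset.
    return ok, offset

primary, secs = Chunk(), [Chunk(), Chunk()]
ok, off = record_append(primary, secs, b"rec1", fail_secondary=1)
if not ok:
    record_append(primary, secs, b"rec1")   # retry: duplicates where it succeeded
print(primary.data, secs[0].data, secs[1].data)
```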
Master Internals
Namespace management
Replica Placement
Chunk Creation, Re-replication, Rebalancing
Garbage Collection
Stale Replica Detection
Dealing with Faults
High availability:
Fast master and chunkserver recovery
Chunk replication
Master state replication: read-only shadow replicas
Data integrity:
Each chunk is broken into 64 KB blocks, each with a 32-bit checksum
Checksums are kept in memory and logged to disk
Checksum updates are optimized for appends: the last block's checksum is updated incrementally, so existing data need not be read and verified
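A rough sketch of block-level verification as described above; zlib.crc32 merely stands in for a 32-bit checksum:

```python
# Sketch of chunkserver-style data integrity: each 64 KB block of a chunk
# carries a 32-bit checksum that is verified on reads.
import zlib

BLOCK = 64 * 1024

def checksum_blocks(chunk: bytes):
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk: bytes, checksums, offset: int, length: int) -> bytes:
    # Verify every block that overlaps the requested range before returning data.
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"corrupt block {b}")
    return chunk[offset:offset + length]

chunk = bytes(200 * 1024)                 # a 200 KB chunk of zeros
sums = checksum_blocks(chunk)
print(len(verified_read(chunk, sums, 70 * 1024, 10)))
```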
Micro-benchmarks
Storage Data for ‘real’ clusters
Performance
Workload Breakdown
[Charts: % of operations for a given operation size, and % of bytes transferred for a given operation size]
GFS: Conclusion
Very application-specific: more engineering than research
PAST
Internet-based P2P global storage utility:
Strong persistence
High availability
Scalability
Security
Not a conventional FS:
Files have a unique id
Clients can insert and retrieve files
Files are immutable
PAST Operations
Nodes have random unique nodeIds
No searching, directory lookup, or key distribution
Supported Operations:
Insert: (name, key, k, file) → fileId
Stores the file on the k nodes whose nodeIds are closest to the fileId in the id space
Lookup: (fileId) → file
Reclaim: (fileId, key)
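A sketch of this client interface, assuming a hypothetical Pastry routing stub; deriving fileIds from a SHA-1 hash of the name, owner key, and a salt follows the PAST paper:

```python
# Sketch of the PAST client operations from this slide; `route` and the node
# side are hypothetical stand-ins for the Pastry substrate, not real PAST code.
import hashlib

def make_file_id(name: str, owner_pubkey: bytes, salt: bytes) -> int:
    # fileIds live in the same 160-bit id space as nodeIds (SHA-1 here).
    return int.from_bytes(hashlib.sha1(name.encode() + owner_pubkey + salt).digest(), "big")

class PastClient:
    def __init__(self, pastry):
        self.pastry = pastry                      # assumed routing substrate

    def insert(self, name, owner_pubkey, k, file_bytes, salt=b"0"):
        file_id = make_file_id(name, owner_pubkey, salt)
        # The message is routed towards file_id; the k nodes with nodeIds
        # numerically closest to file_id each store a replica.
        self.pastry.route(file_id, ("INSERT", k, file_bytes))
        return file_id

    def lookup(self, file_id):
        return self.pastry.route(file_id, ("LOOKUP",))

    def reclaim(self, file_id, owner_pubkey):
        self.pastry.route(file_id, ("RECLAIM", owner_pubkey))
```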
Pastry
P2P routing substrate
route(key, msg): routes the message to the node numerically closest to the key, in less than ceil(log_{2^b} N) steps
Routing table size: (2^b - 1) * ceil(log_{2^b} N) + 2l entries
b: determines the tradeoff between per-node state and the number of routing hops
l: failure tolerance; delivery is guaranteed unless l/2 nodes with adjacent nodeIds fail simultaneously
[Figure: state of the Pastry node with nodeId 10233102: a leaf set of the |L|/2 numerically larger and |L|/2 smaller nodeIds, the routing table entries, and a neighborhood set of the |M| closest nodes]
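As a back-of-the-envelope check, plugging in the parameters used later in the evaluation (b = 4, N = 2250) with an assumed leaf-set size of l = 16:

```python
# Back-of-the-envelope check of the Pastry state and hop formulas for the
# parameters used in the PAST evaluation (b = 4, N = 2250 nodes). The leaf-set
# size l = 16 is an assumed example value, not taken from the slides.
import math

b, N, l = 4, 2250, 16
rows = math.ceil(math.log(N, 2 ** b))        # ceil(log_{2^b} N) = 3
routing_entries = (2 ** b - 1) * rows        # (2^b - 1) * rows = 45
total_state = routing_entries + 2 * l        # + 2l = 77 entries
print(rows, routing_entries, total_state)    # routing takes at most 3 hops
```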
PAST operations/security
Insert:
A certificate is created with the fileId, a hash of the file content, and the replication factor, and is signed with the owner's private key
The file and certificate are routed through Pastry
The first node among the k closest accepts the file and forwards it to the other k-1
Security: Smartcards
Hold the public/private key pair
Generate and verify certificates
Ensure the integrity of nodeId and fileId assignments
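A sketch of building and checking such a certificate; HMAC stands in for the smartcard's signature so the example stays self-contained:

```python
# Sketch of a PAST-style file certificate. A real deployment would use the
# smartcard's private key to produce a public-key signature; HMAC with a
# shared secret is used here only so the example runs without a crypto library.
import hashlib, hmac, json

def make_certificate(file_id: str, content: bytes, k: int, signing_secret: bytes) -> dict:
    body = {
        "fileId": file_id,
        "contentHash": hashlib.sha1(content).hexdigest(),
        "replicationFactor": k,
    }
    blob = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hmac.new(signing_secret, blob, hashlib.sha256).hexdigest()
    return body

def verify_certificate(cert: dict, content: bytes, signing_secret: bytes) -> bool:
    body = {key: cert[key] for key in ("fileId", "contentHash", "replicationFactor")}
    blob = json.dumps(body, sort_keys=True).encode()
    ok_sig = hmac.compare_digest(
        cert["signature"], hmac.new(signing_secret, blob, hashlib.sha256).hexdigest())
    # Storing nodes check both the signature and that the content matches the hash.
    return ok_sig and hashlib.sha1(content).hexdigest() == cert["contentHash"]

secret = b"stand-in for the smartcard key"
cert = make_certificate("a1b2c3", b"file bytes", k=5, signing_secret=secret)
print(verify_certificate(cert, b"file bytes", secret))   # True
print(verify_certificate(cert, b"tampered", secret))     # False
```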
Storage Management
Design Goals:
High global storage utilization
Graceful degradation near maximum utilization
PAST tries to:
Balance free storage space amongst nodes
Maintain the invariant that each file is replicated on the k nodes closest to its fileId
Storage Load Imbalance
Variance in the number of files assigned to a node
Variance in the size distribution of inserted files
Variance in the storage capacity of PAST nodes
Storage Management
Large-capacity storage nodes have multiple nodeIds
Replica Diversion
If node A cannot store a file, it stores a pointer to the file at a leaf-set node B that is not among the k closest
What if A or B fails? A duplicate pointer is kept at the (k+1)-closest node
Policies for directing and accepting replicas: tpri and tdiv thresholds on the ratio of file size to free space (see the sketch after this list)
File Diversion
If an insert fails, the client retries with a different fileId
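A sketch of the acceptance policy implied here: a node among the k closest uses the tpri threshold, a node asked to hold a diverted replica uses the stricter tdiv, and the file-size/free-space ratio decides; names and structure are illustrative:

```python
# Sketch of PAST's replica acceptance / diversion policy. A node among the k
# closest uses threshold T_PRI; a node asked to hold a diverted replica uses
# the stricter T_DIV. The values below match the evaluation slides.
T_PRI = 0.1
T_DIV = 0.05

def accepts(file_size: int, free_space: int, primary: bool) -> bool:
    """Reject if file_size / free_space exceeds the applicable threshold."""
    t = T_PRI if primary else T_DIV
    return free_space > 0 and file_size / free_space <= t

def try_store(file_size, node_free, leafset_free):
    if accepts(file_size, node_free, primary=True):
        return "stored locally"
    # Replica diversion: find a leaf-set node (not among the k closest) that
    # will accept the diverted replica; the local node keeps only a pointer.
    for name, free in leafset_free.items():
        if accepts(file_size, free, primary=False):
            return f"diverted to {name} (pointer kept locally)"
    # Otherwise the insert fails and the client falls back to file diversion.
    return "rejected -> file diversion"

print(try_store(5, node_free=100, leafset_free={}))            # stored locally
print(try_store(50, node_free=100, leafset_free={"B": 2000}))  # diverted to B
print(try_store(50, node_free=100, leafset_free={"B": 300}))   # rejected
```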
Storage Management
Maintaining the replication invariant:
Across node failures and joins
Caching:
The k replicas in PAST are for availability
Extra copies are cached to reduce client latency and network traffic
Unused disk space is utilized for the cache
GreedyDual-Size replacement policy
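A sketch of GreedyDual-Size with unit cost, which is how the cache replacement can be read here; illustrative, not PAST code:

```python
# Sketch of a GreedyDual-Size cache with unit cost: each file gets
# H = L + 1/size, the file with the smallest H is evicted first, and L is
# raised to the evicted H (the "inflation" trick). Illustrative only.

class GDSCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.L = 0.0
        self.entries = {}            # file_id -> (H value, size)

    def access(self, file_id):
        if file_id in self.entries:
            _, size = self.entries[file_id]
            self.entries[file_id] = (self.L + 1.0 / size, size)   # refresh H on a hit
            return True
        return False

    def insert(self, file_id, size):
        if size > self.capacity:
            return                               # too big to cache at all
        while self.used + size > self.capacity:
            victim = min(self.entries, key=lambda f: self.entries[f][0])
            h, vsize = self.entries.pop(victim)
            self.used -= vsize
            self.L = h                           # inflate L to the evicted H
        self.entries[file_id] = (self.L + 1.0 / size, size)
        self.used += size

cache = GDSCache(capacity=100)
cache.insert("big", 80)
cache.insert("small", 30)        # evicts "big", whose H = L + 1/80 is smallest
print(list(cache.entries))       # ['small']
```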
Performance
Workloads:
8 web proxy logs
Combined file systems
k = 5, b = 4
Number of nodes = 2250
Node storage sizes drawn from 4 normal distributions
Without replica and file diversion:
51.1% of insertions failed
60.8% global utilization
Effect of Storage Management
Effect of tpri (tdiv = 0.05, tpri varied): lower tpri gives better utilization but more insertion failures
Effect of tdiv (tpri = 0.1, tdiv varied): trend similar to tpri
File and Replica Diversions
[Figures: ratio of replica diversions vs. utilization, and ratio of file diversions vs. utilization]
Distribution of Insertion Failures
[Figures: insertion failure distributions for the file system trace and the web proxy logs trace]
Caching
Conclusion