File Storage Systems in the Cloud

CSE 726 Hot Topics in Cloud Computing
CLOUD COMPUTING FILE
STORAGE SYSTEMS
University at Buffalo
21 Oct 2011
Sindhuja Venkatesh
(sindhuja@buffalo.edu)
Overview

Google File System (GFS)

IBM General Parallel File System (GPFS)

Comparisons
Google File System
[3]
Introduction
 Component failures are the norm
 Files are huge by traditional standards
 Modification to the files happens by appending
 Co-designing applications and API for file system
Design Overview
 The system is built from inexpensive components that fail often.
 The system stores a modest number of large files.
 Two kinds of reads: large streaming reads and small random reads.
 Writes are mostly large, sequential appends to files.
 Efficient support for concurrent appends to the same file.
 High sustained bandwidth is valued over low latency.
Architecture
[3] [5]
Architecture - Contd.
 The client translates the file name and byte offset into a chunk index.
 It sends a request to the master.
 The master replies with the chunk handle and the locations of the replicas.
 The client caches this information.
 The client then sends a request to a close replica, specifying the chunk handle and a byte range.
 Requests to the master are typically batched (multiple chunks asked for in one request).
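A minimal sketch (Python) of the read path just described, assuming hypothetical RPC stubs for the master and chunkservers; the names (find_chunk, read_chunk) are illustrative, not Google's actual API.

    # Hypothetical sketch of the GFS client read path described above.
    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

    class GFSClient:
        def __init__(self, master, chunkservers):
            self.master = master              # assumed RPC stub to the master
            self.chunkservers = chunkservers  # assumed map: address -> RPC stub
            self.cache = {}                   # (filename, chunk_index) -> (handle, replicas)

        def read(self, filename, offset, length):
            # 1. Translate file name + byte offset into a chunk index.
            chunk_index = offset // CHUNK_SIZE
            key = (filename, chunk_index)
            # 2. Ask the master only on a cache miss; otherwise reuse cached info.
            if key not in self.cache:
                handle, replicas = self.master.find_chunk(filename, chunk_index)
                self.cache[key] = (handle, replicas)
            handle, replicas = self.cache[key]
            # 3. Read the byte range from one (ideally nearby) replica.
            chunk_offset = offset % CHUNK_SIZE
            replica = replicas[0]  # a real client would pick the closest replica
            return self.chunkservers[replica].read_chunk(handle, chunk_offset, length)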
Chunk Size
 The chunk size is chosen to be 64 MB.
 Advantages of a large chunk size:
• Less interaction between clients and the master
• Reduced network overhead
• Reduced size of metadata stored at the master
 Disadvantages:
• Small files that fit in a single chunk can become hot spots.
• A higher replication factor for such chunks is used as a solution.
Metadata
 Three major types:
• File and chunk namespaces
• Mapping from files to chunks
• Locations of chunk replicas
 All metadata is stored in the master's memory:
• In-memory data structures
• Chunk locations
• Operation logs
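A minimal sketch (Python) of the three metadata maps kept in the master's memory, assuming simple in-process dictionaries; the real master adds locking, an operation log, and checkpoints.

    class MasterMetadata:
        def __init__(self):
            self.namespace = set()        # file and chunk namespaces (full path names)
            self.file_to_chunks = {}      # path -> ordered list of chunk handles
            self.chunk_locations = {}     # chunk handle -> list of chunkserver addresses

        def create_file(self, path):
            self.namespace.add(path)
            self.file_to_chunks[path] = []

        def add_chunk(self, path, handle):
            # The namespace and file->chunk mapping are the parts that would be
            # recorded in the operation log so they survive a master restart.
            self.file_to_chunks[path].append(handle)
            self.chunk_locations.setdefault(handle, [])

        def report_chunk(self, handle, chunkserver):
            # Chunk locations are not persisted; they are rebuilt at runtime from
            # chunkserver reports, which is why this map is only updated in memory.
            self.chunk_locations.setdefault(handle, []).append(chunkserver)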
Consistency Model - Read
 Consider a set of data modifications and a set of reads, all executed by different clients, and assume that the reads are executed a "sufficient" time after the writes.
 A file region is consistent if all clients see the same data, regardless of which replica they read from.
 A region is defined if it is consistent and all clients see the modification in its entirety (the mutation appears atomic).
Lease and Mutation Order - Write
 Write control and data flow (Figure 2 of [3]):
1. The client asks the master which chunkserver holds the lease for the chunk (the primary) and for the locations of the other replicas.
2. The master replies; the client caches this information.
3. The client pre-pushes the data to all replicas.
4. After all replicas acknowledge receiving the data, the client sends the write request to the primary.
5. The primary assigns the mutation a serial number and forwards the write request to all secondary replicas.
6. The secondaries signal completion to the primary.
7. The primary replies to the client; errors are handled by retrying.
[Figure 2: Write Control and Data Flow - control flows from the client to the primary and then to the secondaries, while data is pushed along a chain of chunkservers.]
 Data flow is decoupled from control flow to use the network efficiently: data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion, rather than distributed in some other topology (e.g., a tree), so that each machine's full outbound bandwidth is used and high-latency links are avoided.
 If mutations fail or concurrent writes interleave, the file region may end up containing fragments from different clients; the replicas remain identical because the individual operations complete in the same order on all replicas, leaving the region consistent but undefined.
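A hedged sketch (Python) of steps 3-7 of the write flow above, assuming hypothetical RPC stubs for the master, primary, and secondaries; this is an illustration of the protocol shape, not GFS's implementation.

    def gfs_write(client, master, chunk_handle, data, max_retries=3):
        for attempt in range(max_retries):
            # Steps 1-2: learn (or reuse cached) primary + secondary locations.
            primary, secondaries = master.get_lease_holder(chunk_handle)
            # Step 3: push the data to all replicas; it is buffered, not yet applied.
            data_id = client.push_data_to_replicas(data, [primary] + secondaries)
            # Step 4: once all replicas hold the data, send the write to the primary.
            # Steps 5-6: the primary assigns a serial number, forwards the request
            # to the secondaries, and collects their completion signals.
            reply = primary.write(chunk_handle, data_id)
            # Step 7: the primary reports success or failure back to the client.
            if reply.ok:
                return True
            # A failed replica leaves the region possibly inconsistent; the client
            # simply retries the mutation.
        return False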
Atomic Record Appends
 Similar to the leasing/mutation mechanism described previously.
 The client pushes the data to all replicas, then sends the request to the primary.
 The primary:
• pads the current chunk if the record does not fit, telling the client to retry on the next chunk;
• otherwise writes the data at an offset it chooses and tells the secondary replicas to write at the same offset.
 Failures may cause the record to be duplicated; duplicates are handled by the client.
 Data may be different at each replica (the record is written at least once as an atomic unit, but replicas need not be byte-wise identical).
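A sketch (Python) of the primary's decision logic for a record append, under the assumption of a simple in-memory chunk model; chunk and secondary objects are illustrative stand-ins.

    CHUNK_SIZE = 64 * 1024 * 1024

    def primary_record_append(chunk, record, secondaries):
        """chunk has .used bytes and .handle; returns the offset or a retry signal."""
        if chunk.used + len(record) > CHUNK_SIZE:
            # Record does not fit: pad the rest of the chunk on every replica and
            # tell the client to retry the append on the next chunk.
            padding = CHUNK_SIZE - chunk.used
            chunk.pad(padding)
            for s in secondaries:
                s.pad(chunk.handle, padding)
            return "RETRY_ON_NEW_CHUNK"
        # Record fits: the primary picks the offset and all replicas write there,
        # which is what makes the append atomic and at-least-once.
        offset = chunk.used
        chunk.write_at(offset, record)
        for s in secondaries:
            s.write_at(chunk.handle, offset, record)
        return offset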
Snapshot
 A snapshot is a copy of a file or a directory tree at an instant in time.
 Used for checkpointing.
 Handled using copy-on-write:
• First revoke all outstanding leases on the affected chunks.
• Then duplicate the metadata, still pointing at the same chunks.
• When a client later requests a write, the master allocates a new chunk handle and has the chunk copied before granting the write.
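A minimal copy-on-write sketch (Python) of the snapshot steps listed above; the per-chunk reference counting and the lease_table object are assumptions used only for illustration.

    class SnapshotMaster:
        def __init__(self):
            self.file_to_chunks = {}   # path -> list of chunk handles
            self.refcount = {}         # chunk handle -> number of files pointing at it

        def snapshot(self, src, dst, lease_table):
            # 1. Revoke outstanding leases so in-flight writes must contact the master.
            for handle in self.file_to_chunks[src]:
                lease_table.revoke(handle)
            # 2. Duplicate only the metadata; both files point at the same chunks.
            self.file_to_chunks[dst] = list(self.file_to_chunks[src])
            for handle in self.file_to_chunks[dst]:
                self.refcount[handle] = self.refcount.get(handle, 1) + 1

        def write_intent(self, path, index, new_handle):
            # 3. On the first write after a snapshot, a chunk referenced by more than
            #    one file is copied and the writer is redirected to the new handle.
            old = self.file_to_chunks[path][index]
            if self.refcount.get(old, 1) > 1:
                self.refcount[old] -= 1
                self.file_to_chunks[path][index] = new_handle
                self.refcount[new_handle] = 1
            return self.file_to_chunks[path][index]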
Master Operation
 Namespace management and locking
 Replica placement
 Chunk creation, re-replication, and rebalancing
 Garbage collection
 Stale replica detection
Fault Tolerance
 High availability
• Fast recovery
• Chunk replication
• Master replication
 Data integrity (checksums on chunk data)
General Parallel File System
[1]
Introduction
 The file system was fundamentally designed for high-performance computing clusters.
 Traditional supercomputing file access involves:
• parallel access from multiple nodes within a single file;
• inter-file parallel access (e.g., files in the same directory).
 GPFS supports fully parallel access to both file data and metadata.
 Even administrative actions are performed in parallel.
GPFS Architecture
 Achieves extreme scalability through its shared-disk architecture.
 File system nodes
• Cluster nodes on which the file system and the applications that use it run
• All nodes have equal access to all disks
 Switching fabric
• Storage area network (SAN)
 Shared disks
• Files are striped across all of the file system's disks
 The switching fabric that connects file system nodes to disks may be a storage area network (SAN), e.g. Fibre Channel or iSCSI; alternatively, individual disks may be attached to some number of I/O server nodes.
GPFS Issues
 Data striping and allocation, prefetch and write-behind
• Large files are divided into equal-sized blocks, and consecutive blocks are placed on different disks.
• Typical block size of 256 KB.
• Data is prefetched into the buffer pool for sequential reads (and dirty data is written behind).
 Large directory support
• Extensible hashing is used for file name lookup in large directories.
 Logging and recovery
• All metadata updates are logged.
• Each node has a separate log for each file system it mounts.
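A sketch (Python) of round-robin striping and readahead for the scheme just described, assuming 256 KB blocks; the function names and the fixed prefetch depth are illustrative, not GPFS internals.

    BLOCK_SIZE = 256 * 1024  # typical GPFS block size

    def block_location(file_offset, num_disks, first_disk=0):
        """Map a byte offset to (disk index, logical block number)."""
        block_no = file_offset // BLOCK_SIZE
        disk = (first_disk + block_no) % num_disks  # consecutive blocks land on different disks
        return disk, block_no

    def blocks_to_prefetch(read_offset, read_len, depth=4):
        """Sequential readers trigger prefetch of the next few striped blocks."""
        last = (read_offset + read_len - 1) // BLOCK_SIZE
        return list(range(last + 1, last + 1 + depth))  # blocks to pull into the buffer pool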
Distributed Locking vs. Centralized Management
 Distributed locking: every file system operation acquires an appropriate read or write lock to synchronize with conflicting operations on other nodes before reading or updating any file system data or metadata.
 Centralized management: all conflicting operations are forwarded to a designated node, which performs the requested read or update.
Distributed Lock Manager
 Uses a centralized lock manager in conjunction with local lock managers on each file system node.
 The global lock manager coordinates locks between the local lock managers by handing out lock tokens.
 Repeated accesses to the same disk object from the same node require only a single message to obtain the right to acquire a lock on the object (the lock token).
 Only when an operation on another node requires a conflicting lock on the same object are additional messages necessary to revoke the lock token from the first node so that it can be granted to the other node.
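A toy sketch (Python) of the token idea above: a node keeps a token after the first acquisition and only gives it up when a conflicting node needs it. Conflict rules are simplified to read/write, and the message flow is abstracted away.

    class TokenManager:
        def __init__(self):
            self.nodes = {}    # node name -> Node, filled in by register()
            self.holders = {}  # object id -> {node name: mode}

        def register(self, node):
            self.nodes[node.name] = node

        def acquire(self, obj, requester, mode):
            holders = self.holders.setdefault(obj, {})
            # Revoke the token from nodes holding it in a conflicting mode.
            for n, m in list(holders.items()):
                if n != requester and (mode == "write" or m == "write"):
                    self.nodes[n].relinquish(obj)
                    del holders[n]
            holders[requester] = mode      # token granted

    class Node:
        def __init__(self, name, tm):
            self.name, self.tm, self.tokens = name, tm, {}
            tm.register(self)

        def lock(self, obj, mode):
            # Repeated accesses to the same object need no further messages:
            # the cached token is sufficient until it is revoked.
            held = self.tokens.get(obj)
            if held != "write" and held != mode:
                self.tm.acquire(obj, self.name, mode)
                self.tokens[obj] = mode

        def relinquish(self, obj):
            self.tokens.pop(obj, None)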
Parallel Data Access
 Certain classes of supercomputer applications require writing to the same file from multiple nodes.
 GPFS uses byte-range locking to synchronize reads and writes to file data.
• The first node to write to a file is granted a byte-range token covering the whole file (zero to infinity).
• As other nodes begin reading or writing the file concurrently, that token is split and narrowed to the ranges each node actually needs.
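A hedged sketch (Python) of byte-range write tokens: the first writer is handed (0, infinity), and a later writer carves the overlapping part out of the existing holder's ranges. The splitting rules are deliberately simplified.

    import math

    class ByteRangeTokenServer:
        def __init__(self):
            self.tokens = {}   # node -> list of (start, end) ranges held for one file

        def request_write(self, node, start, end):
            if not self.tokens:
                # First writer: grant (0, infinity) so all later local writes need
                # no further token traffic while no one else writes.
                self.tokens[node] = [(0, math.inf)]
                return (0, math.inf)
            # Another node already holds tokens: shrink any overlapping ranges.
            for other, ranges in self.tokens.items():
                if other == node:
                    continue
                new_ranges = []
                for s, e in ranges:
                    if e <= start or s >= end:
                        new_ranges.append((s, e))          # no overlap, keep as-is
                    else:
                        if s < start:
                            new_ranges.append((s, start))  # keep the left remainder
                        if e > end:
                            new_ranges.append((end, e))    # keep the right remainder
                self.tokens[other] = new_ranges
            self.tokens.setdefault(node, []).append((start, end))
            return (start, end)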
Parallel Data Access - Contd.
 The measurements show how I/O throughput in GPFS scales when adding more file system nodes and more disks to the system.
 The figure compares reading and writing a single large file from multiple nodes in parallel against each node reading or writing a different file.
 At 18 nodes the write throughput leveled off due to a problem in the switch adapter microcode.
 Writing to a single file from multiple nodes was just as fast as each node writing to a different file, demonstrating the effectiveness of the byte-range token protocol described before.
[Figure 2: Read/Write Scaling - throughput (MB/s) vs. number of nodes (0-24); curves: each node reading a different file, all nodes reading the same file, each node writing to a different file, all nodes writing to the same file.]
Synchronizing Access to Metadata
 Like other file systems, GPFS uses inodes and indirect blocks to store file attributes and data block addresses.
 Write operations in GPFS use a shared write lock on the inode that allows concurrent writers on multiple nodes.
 One of the nodes accessing the file is designated as the metanode for that file; only the metanode reads or writes the inode from or to disk.
 Each writer updates a locally cached copy of the inode and forwards its inode updates to the metanode periodically, or when the shared write token is revoked by a stat() or read() operation on another node.
 The metanode for a particular file is elected dynamically with the help of the token server: when a node first accesses a file, it tries to acquire the metanode token for the file. The token is granted to the first node to do so; other nodes instead learn the identity of the metanode.
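A small sketch (Python) of how per-node inode updates could be merged at the metanode, under the simplifying assumption that file size and mtime only ever move forward under the shared write lock; the structure and method names are illustrative.

    class Metanode:
        def __init__(self):
            self.inode = {"size": 0, "mtime": 0.0}  # the only copy written to disk

        def merge_update(self, cached_inode):
            # Updates from different writers commute: take the largest size and the
            # latest mtime rather than serializing every write through one node.
            self.inode["size"] = max(self.inode["size"], cached_inode["size"])
            self.inode["mtime"] = max(self.inode["mtime"], cached_inode["mtime"])

        def flush_to_disk(self, disk):
            disk.write_inode(self.inode)   # only the metanode performs this I/O

    # Each writer updates its local copy and ships it to the metanode periodically
    # or when its shared write token is revoked by a stat()/read() elsewhere.
    def writer_appends(local_inode, nbytes, now):
        local_inode["size"] += nbytes
        local_inode["mtime"] = now
        return local_inode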
Allocation Maps
 The allocation map records the allocation status (free or in-use) of all disk blocks in the file system.
 Since each disk block can be divided into up to 32 subblocks to store data for small files, the allocation map contains 32 bits per disk block, as well as linked lists for efficiently finding a free disk block or a subblock of a particular size.
 For each GPFS file system, one of the nodes in the cluster is responsible for maintaining free-space statistics about all allocation regions. This allocation manager node initializes the free-space statistics by reading the allocation map when the file system is mounted.
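An illustrative model (Python) of one allocation map entry: 32 bits per disk block, one bit per subblock, so both whole blocks and small-file fragments can be tracked. The allocation policy here is a naive first-fit, not GPFS's actual algorithm.

    SUBBLOCKS_PER_BLOCK = 32

    class AllocationMap:
        def __init__(self, num_blocks):
            # bitmap[i] is a 32-bit int; bit j set => subblock j of block i is in use.
            self.bitmap = [0] * num_blocks

        def alloc_full_block(self):
            for i, bits in enumerate(self.bitmap):
                if bits == 0:                              # all 32 subblocks free
                    self.bitmap[i] = (1 << SUBBLOCKS_PER_BLOCK) - 1
                    return i
            raise RuntimeError("no free block")

        def alloc_subblocks(self, count):
            # Find a block with `count` contiguous free subblocks (small-file data).
            run = (1 << count) - 1
            for i, bits in enumerate(self.bitmap):
                for shift in range(SUBBLOCKS_PER_BLOCK - count + 1):
                    if (bits & (run << shift)) == 0:
                        self.bitmap[i] |= run << shift
                        return i, shift                    # (block, first subblock)
            raise RuntimeError("no free subblocks")

        def free_block(self, i):
            self.bitmap[i] = 0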
Token Manager Scaling
 The token manager keeps track of all lock tokens granted to all nodes in the cluster.
 GPFS uses a number of optimizations in the token protocol that significantly reduce the cost of token management and improve response time.
 When it is necessary to revoke a token, it is the responsibility of the revoking node to send revoke messages to all nodes holding the token in a conflicting mode, to collect replies from these nodes, and to forward them as a single message to the token manager.
 Acquiring a token therefore never requires more than two messages to the token manager, regardless of how many nodes may be holding the token in a conflicting mode.
 The protocol also supports token prefetch and token request batching, which allow acquiring multiple tokens in a single message to the token manager.
Fault Tolerance
 Node failures
• Another node replays the failed node's log (kept on shared disk) to restore any metadata that was being updated at the time of the failure.
 Communication failures
• When the network is partitioned, file system access is granted only to the group containing a majority of the nodes.
 Disk failures
• Dual-attached RAID controllers provide redundancy.
[2] [4]
File Systems: Internet Services vs. HPC
Introduction
 Leading Internet services have designed and implemented file systems "from scratch" to provide high performance for their anticipated application workloads and usage scenarios.
 Leading examples of such Internet services file systems, as we will call them, include the Google File System (GoogleFS), Amazon Simple Storage Service (S3), and the open-source Hadoop Distributed File System (HDFS).
 Another style of computing at a comparable scale, and with a growing marketplace, is high performance computing (HPC). Like Internet applications, HPC applications are often data-intensive and run in parallel on large clusters (supercomputers). These applications use parallel file systems for highly scalable and concurrent storage I/O.
 Examples of parallel file systems include IBM's GPFS, Sun's LustreFS, and the open-source Parallel Virtual File System (PVFS).
Comparison
Experimental Evaluation
 A shim layer was implemented that uses Hadoop's extensible abstract file system API (org.apache.hadoop.fs.FileSystem) to use PVFS for all file I/O operations.
 Hadoop directs all file system operations to the shim layer, which forwards each request to the PVFS user-level library. The implementation makes no code changes to PVFS other than one configuration change during PVFS setup: the default 64 KB stripe size is increased to match the HDFS chunk size of 64 MB.
Experimental Evaluation Contd.
Experiment - Contd.
 The shim layer has three key components that are used by Hadoop applications:
• Readahead buffering - while applications can be programmed to request data in any size, the Hadoop framework uses 4 KB as the default amount of data accessed in each file system call. Instead of performing such small reads, HDFS prefetches the entire chunk (of default size 64 MB); the shim layer provides similar readahead buffering on top of PVFS (see the sketch after this list).
• Data layout module - the Hadoop/MapReduce job scheduler distributes computation tasks across many nodes in the cluster. Although not mandatory, it prefers to assign tasks to nodes that store the input data required for that task, which requires the job scheduler to be aware of the file's layout information. Fortunately, as a parallel file system, PVFS has this information at the client and exposes the file striping layout as an extended attribute of each file. The shim layer matches the HDFS data layout API by querying the appropriate extended attributes as needed.
• Replication emulator - although the public release of PVFS does not support triplication, the shim enables PVFS to emulate HDFS-style replication by writing, on behalf of the client, to three data servers with every application write. Note that it is the client that sends the three write requests to different servers, unlike HDFS, which uses pipelining among its servers. This approach was motivated by the simplicity of emulating replication at the client instead of making non-trivial changes to the PVFS server implementation. Planned work in the PVFS project includes support for replication techniques.
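A hedged sketch (Python) of the readahead idea in the first shim component: Hadoop issues many 4 KB reads, so the shim serves them from a larger buffer that it refills with big sequential requests. The pvfs_read callable and the 4 MB readahead size are assumptions for illustration, not the real PVFS API or the shim's actual parameters.

    READAHEAD = 4 * 1024 * 1024   # assumed readahead buffer size

    class ReadaheadFile:
        def __init__(self, pvfs_read, path):
            self.pvfs_read = pvfs_read     # callable(path, offset, length) -> bytes
            self.path = path
            self.buf, self.buf_off = b"", 0

        def read(self, offset, length=4096):
            # Serve small reads from the buffered window when possible.
            end = offset + length
            if not (self.buf_off <= offset and end <= self.buf_off + len(self.buf)):
                # Miss: fetch one large sequential region instead of a 4 KB request.
                self.buf = self.pvfs_read(self.path, offset, READAHEAD)
                self.buf_off = offset
            start = offset - self.buf_off
            return self.buf[start:start + length]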
Experimental Setup
 Experiments were performed on two clusters.
 A small cluster for microbenchmarks:
• The SS cluster consists of 20 nodes, each with a dual-core 3 GHz Pentium D processor, 4 GB of memory, and one 7200 rpm SATA 180 GB Seagate Barracuda disk with an 8 MB DRAM buffer. Nodes are directly connected to an HP ProCurve 2848 Gigabit Ethernet switch and have 100 microsecond node-to-node latency. All machines run the Linux 2.6.24.2 kernel (Debian release) and use the ext3 file system to manage their disks.
 A big cluster for running real applications (large-scale testing):
• The Yahoo! M45 cluster, a 4000-core cluster used to experiment with ideas in data-intensive scalable computing. It makes available about 400 nodes, of which typically 50-100 are used at a time, each with two quad-core 1.86 GHz Xeon processors, 6 GB of memory, and four 7200 rpm SATA 750 GB Seagate Barracuda ES disks with 8 MB DRAM buffers. Because of the configuration of these nodes, only one disk is used for each PVFS I/O server. Nodes are interconnected using a Gigabit Ethernet switch hierarchy. All machines run Red Hat Enterprise Linux Server (release 5.1) with the 2.6.18-53.1.13.el5 kernel and use the ext3 file system to manage their disks.
Results
Results – Micro Benchmarks
Results – Micro Benchmarks Contd.
Results – Micro Benchmarks Contd.
Performance of Real Applications
Conclusion and Future Work
 The paper explores the relationship between modern parallel file systems, represented by PVFS, and purpose-built Internet services file systems, represented by HDFS, in the context of their design and performance. It is shown that PVFS can perform comparably to HDFS in the Hadoop Internet services stack.
 The biggest difference between PVFS and HDFS is the redundancy scheme used for handling failures.
 On balance, it is believed that parallel file systems could be made available for use in Hadoop while delivering promising performance for diverse access patterns. Internet services can benefit from parallel file system specializations for concurrent writing and for faster metadata and small-file operations. With a range of parallel file systems to choose from, Internet services can select a system that better integrates with their local data management tools.
 In the future, the authors plan to investigate the "opposite" direction: how Internet services file systems could be used for HPC applications.
References
[1] Frank Schmuck and Roger Haskin. "GPFS: A Shared-Disk File System for Large Computing Clusters." Proceedings of the Conference on File and Storage Technologies (FAST '02), Monterey, CA, January 2002, pp. 231-244. USENIX, Berkeley, CA.
[2] Wittawat Tantisiriroj, Swapnil Patil, and Garth Gibson. "Data-intensive file systems for Internet services: A rose by any other name ..." Technical Report CMU-PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA, October 2008. http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/hdfspvfs-tr-08.pdf
[3] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google File System." 19th ACM Symposium on Operating Systems Principles (SOSP), Lake George, NY, October 2003.
[4] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads." Proceedings of the VLDB Endowment, Volume 2, Issue 1, August 2009.
[5] Jimmy Lin and Chris Dyer. "Data-Intensive Text Processing with MapReduce." University of Maryland, College Park. http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf
THANK YOU!!