CSE 726 Hot Topics in Cloud Computing
CLOUD COMPUTING FILE STORAGE SYSTEMS
University at Buffalo, 21 Oct 2011
Sindhuja Venkatesh (sindhuja@buffalo.edu)

Overview
Google File System (GFS)
IBM General Parallel File System (GPFS)
Comparisons

Google File System [3]

Introduction
Component failures are the norm.
Files are huge by traditional standards.
Files are modified mostly by appending.
Applications and the file system API are co-designed.

Design Overview
The system is built from inexpensive components that fail often.
It stores a modest number of large files.
Workload: two kinds of reads (large streaming and small random), large sequential writes, and efficient support for concurrent appends.
High sustained bandwidth matters more than low latency.

Architecture [3] [5]

Architecture (contd.)
The client translates a file name and byte offset into a chunk index and sends a request to the master.
The master replies with the chunk handle and the locations of the replicas; the client caches this information.
The client then sends a request to a nearby replica, specifying the chunk handle and a byte range.
Requests to the master are typically batched: the client asks for multiple chunks in a single request.

Chunk Size
The chunk size is chosen to be 64 MB.
Advantages of a large chunk size:
• Less interaction between clients and the master
• Reduced network overhead
• Smaller metadata footprint at the master
Disadvantages:
• Small files that fit in a single chunk can become hot spots; higher replication is used as a workaround.

Metadata
Three major types:
• File and chunk namespaces
• Mapping from files to chunks
• Locations of chunk replicas
All metadata is stored in the master's memory:
• In-memory data structures
• Chunk locations
• Operation log

Consistency Model – Read
Consider a set of data modifications and a set of reads, all executed by different clients, and assume the reads happen a "sufficient" time after the writes.
A file region is consistent if all clients see the same data.
It is defined if all clients see the modification in its entirety (atomically).

Lease and Mutation Order – Write
(Figure 2 of [3], Write Control and Data Flow: client, master, primary replica, secondary replicas A and B, with separate control and data paths.)
1. The client asks the master which chunkserver holds the lease (the primary) and for the locations of all replicas.
2. The master replies; the client caches this information.
3. The client pushes the data to all replicas.
4. After all replicas acknowledge the data, the client sends the write request to the primary.
5. The primary forwards the write request to all secondary replicas.
6. The secondaries signal completion to the primary.
7. The primary replies to the client; errors are handled by retrying.
Data flow is decoupled from control flow: while control flows from the client to the primary and then to the secondaries, data is pushed linearly along a carefully picked chain of chunkservers in a pipelined fashion, so that each machine's full outbound bandwidth is used and high-latency links are avoided.
If several clients write concurrently, a file region may end up containing fragments from different clients; the replicas are identical because the individual operations complete in the same order on all replicas, but the region is left in a consistent yet undefined state.

Atomic Record Appends
Similar to the lease-and-mutation write path above; a client-side sketch follows this slide.
The client pushes the data to all replicas, then sends the append request to the primary.
The primary pads the current chunk if the record does not fit, telling the client to retry on the next chunk; otherwise it writes the data and tells the secondaries to do the same.
Failures may cause a record to be duplicated; duplicates are handled by the client.
Replicas of the same chunk are not guaranteed to be byte-wise identical.
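To make the append path concrete, here is a minimal client-side sketch of the retry loop described above. GFS's client library is not public, so every type and method name below (MasterClient, ChunkserverClient, LeaseInfo, AppendResult, lookupLastChunk, pushData, append) is a hypothetical stand-in; only the control flow — push data to every replica, ask the primary to commit, retry when the primary pads out the chunk — is taken from [3].

```java
// Hypothetical sketch of the GFS record-append client path; not the real API.
import java.util.List;

public class RecordAppendSketch {

    /** Appends a record and returns the offset GFS chose for it. */
    long recordAppend(MasterClient master, String path, byte[] record) {
        while (true) {
            // Steps 1-2: learn (or reuse a cached answer for) the primary and
            // the replica locations of the file's last chunk.
            LeaseInfo lease = master.lookupLastChunk(path);

            // Step 3: push the record data to all replicas, including the primary
            // (real GFS pipelines this along a chain of chunkservers).
            for (ChunkserverClient replica : lease.replicas()) {
                replica.pushData(lease.chunkHandle(), record);
            }

            // Steps 4-7: ask the primary to commit; it forwards to the secondaries.
            AppendResult result = lease.primary().append(lease.chunkHandle(), record);
            if (result.ok()) {
                return result.offset();      // offset chosen by the primary
            }
            // The chunk was padded because the record did not fit, or some replica
            // failed: retry. A retried append may leave duplicate records in some
            // replicas; the client/application must tolerate that.
        }
    }

    // --- assumed helper interfaces, for illustration only ---
    interface MasterClient { LeaseInfo lookupLastChunk(String path); }
    interface ChunkserverClient {
        void pushData(long chunkHandle, byte[] data);
        AppendResult append(long chunkHandle, byte[] data);
    }
    interface LeaseInfo {
        long chunkHandle();
        ChunkserverClient primary();
        List<ChunkserverClient> replicas();
    }
    interface AppendResult { boolean ok(); long offset(); }
}
```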
Snapshot
A copy of a file or a directory tree at an instant in time, used for checkpointing.
Handled using copy-on-write:
• First revoke all outstanding leases.
• Then duplicate the metadata, still pointing to the same chunks.
• When a client next writes to a shared chunk, the master has the chunkservers copy it and hands out a new chunk handle.

Master Operation
Namespace management and locking
Replica placement
Creation, re-replication, rebalancing
Garbage collection
Stale replica detection

Fault Tolerance
High availability: fast recovery, chunk replication, master replication
Data integrity

General Parallel File System [1]

Introduction
GPFS was fundamentally designed for high-performance computing clusters.
Traditional supercomputing file access involves:
• Parallel access to a single file from multiple nodes
• Inter-file parallel access (different files in the same directory)
GPFS supports fully parallel access to both file data and metadata.
Even administrative actions are performed in parallel.

GPFS Architecture
GPFS achieves extreme scalability through a shared-disk architecture.
File system nodes:
• Cluster nodes on which the file system and the applications that use it run
• Every node has equal access to all disks
Switching fabric:
• The fabric connecting file system nodes to disks may be a storage area network (SAN), e.g. Fibre Channel or iSCSI; alternatively, individual disks may be attached to some number of I/O server nodes.
Shared disks:
• Files are striped across all disks in the file system.

GPFS Issues
Data striping and allocation, prefetch and write-behind:
• Large files are divided into equal-sized blocks (typically 256 KB), and consecutive blocks are placed on different disks.
• Data is prefetched into the buffer pool.
Large directory support:
• Extensible hashing is used for file-name lookup in large directories.
Logging and recovery:
• All metadata updates are logged; each node keeps a log for every file system it mounts.

Distributed Locking vs. Centralized Management
Distributed locking: every file system operation acquires an appropriate read or write lock to synchronize with conflicting operations on other nodes before reading or updating any file system data or metadata.
Centralized management: all conflicting operations are forwarded to a designated node, which performs the requested read or update.

Distributed Lock Manager
GPFS uses a centralized global lock manager in conjunction with local lock managers on each file system node.
The global lock manager coordinates locks between the local lock managers by handing out lock tokens.
Repeated accesses to the same disk object from the same node require only a single message to obtain the right to acquire a lock on the object (the lock token).
Only when an operation on another node requires a conflicting lock on the same object are additional messages needed to revoke the token from the first node so it can be granted to the other node.

Parallel Data Access
Certain classes of supercomputer applications require writing to the same file from multiple nodes.
GPFS uses byte-range locking to synchronize reads and writes to file data:
• The first node to write a file is given a byte-range token covering the whole file (zero to infinity).
• As other nodes access the file concurrently, the token is split so that each node retains only the range it is using.
A sketch of this token negotiation appears after this slide.
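The byte-range token mechanism lends itself to a short sketch. GPFS's real protocol is much richer (read tokens, required vs. desired ranges, revoke messaging, failure handling), so the code below is only a minimal single-server illustration of the "whole file first, then split on conflict" idea; the class and method names (ByteRangeTokenSketch, Token, acquire) are invented for this example.

```java
// Minimal sketch of GPFS-style byte-range write tokens (write tokens only,
// single token server, no failure handling). All names are hypothetical.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ByteRangeTokenSketch {
    // Half-open range [start, end); end == Long.MAX_VALUE stands for "infinity".
    record Token(String node, long start, long end) {}

    private final List<Token> granted = new ArrayList<>();

    /** Grant 'node' a write token for [start, end). */
    synchronized Token acquire(String node, long start, long end) {
        // Fast path (done by the node's local lock manager in real GPFS):
        // a node that already holds a covering token never contacts the server.
        for (Token t : granted) {
            if (t.node().equals(node) && t.start() <= start && t.end() >= end) return t;
        }
        if (granted.isEmpty()) {
            // First writer gets zero-to-infinity, so later access is purely local.
            Token whole = new Token(node, 0, Long.MAX_VALUE);
            granted.add(whole);
            return whole;
        }
        // Shrink any conflicting token held by another node to the parts that do
        // not overlap the requested range (real GPFS sends revoke messages here).
        List<Token> leftovers = new ArrayList<>();
        for (Iterator<Token> it = granted.iterator(); it.hasNext(); ) {
            Token t = it.next();
            boolean conflicts = !t.node().equals(node) && t.start() < end && t.end() > start;
            if (!conflicts) continue;
            it.remove();
            if (t.start() < start) leftovers.add(new Token(t.node(), t.start(), start));
            if (t.end() > end)     leftovers.add(new Token(t.node(), end, t.end()));
        }
        granted.addAll(leftovers);
        Token mine = new Token(node, start, end);
        granted.add(mine);
        return mine;
    }
}
```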
Parallel Data Access (contd.)
(Figure 2 of [1], Read/Write Scaling: aggregate throughput in MB/s versus number of nodes for four cases — each node reading a different file, all nodes reading the same file, each node writing to a different file, all nodes writing to the same file.)
The measurements demonstrate how I/O throughput in GPFS scales as more file system nodes and more disks are added to the system.
The figure compares reading and writing a single large file from multiple nodes in parallel against each node reading or writing a different file.
At 18 nodes the write throughput leveled off due to a problem in the switch adapter microcode.
The other point to note is that writing to a single file from multiple nodes was just as fast as each node writing to a different file, demonstrating the effectiveness of the byte-range token protocol described before.

Synchronizing Access to Metadata
Like other file systems, GPFS uses inodes and indirect blocks to store file attributes and data block addresses.
Write operations use a shared write lock on the inode that allows concurrent writers on multiple nodes.
One of the nodes accessing the file is designated as the metanode for the file; only the metanode reads or writes the inode from or to disk.
Each writer updates a locally cached copy of the inode and forwards its inode updates to the metanode periodically, or when the shared write token is revoked by a stat() or read() operation on another node.
The metanode for a particular file is elected dynamically with the help of the token server: when a node first accesses a file it tries to acquire the metanode token, which is granted to the first node to do so; other nodes instead learn the identity of the metanode.

Allocation Maps
The allocation map records the allocation status (free or in use) of all disk blocks in the file system.
Since each disk block can be divided into up to 32 subblocks to store data for small files, the allocation map contains 32 bits per disk block, as well as linked lists for efficiently finding a free disk block or a subblock of a particular size; a toy sketch of this bitmap appears after the next slide.
For each GPFS file system, one of the nodes in the cluster is responsible for maintaining free-space statistics about all allocation regions.
This allocation manager node initializes the free-space statistics by reading the allocation map when the file system is mounted.

Token Manager Scaling
The token manager keeps track of all lock tokens granted to all nodes in the cluster.
GPFS uses a number of optimizations in the token protocol that significantly reduce the cost of token management and improve response time.
When a token must be revoked, the node requesting the token is responsible for sending revoke messages to all nodes holding the token in a conflicting mode, collecting their replies, and forwarding them as a single message to the token manager.
Acquiring a token therefore never requires more than two messages to the token manager, regardless of how many nodes hold the token in a conflicting mode.
The protocol also supports token prefetch and token request batching, which allow acquiring multiple tokens in a single message to the token manager.
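To make the "32 bits per disk block" layout concrete, here is a toy allocation-map sketch. It leaves out the region-based partitioning, free lists, and allocation-manager statistics that real GPFS uses; the class and method names are invented for illustration.

```java
// Toy version of an allocation map with 32 subblocks per disk block.
public class AllocationMapSketch {
    static final int SUBBLOCKS_PER_BLOCK = 32;

    // One 32-bit word per disk block; bit i set => subblock i is in use.
    private final int[] map;

    AllocationMapSketch(int numBlocks) { this.map = new int[numBlocks]; }

    /** Allocate a completely free block for a large file. Returns block number or -1. */
    int allocateFullBlock() {
        for (int b = 0; b < map.length; b++) {
            if (map[b] == 0) {        // all 32 subblocks free
                map[b] = ~0;          // mark every subblock in use
                return b;
            }
        }
        return -1;
    }

    /** Allocate one subblock for small-file data. Returns block*32 + subblock, or -1. */
    int allocateSubblock() {
        for (int b = 0; b < map.length; b++) {
            int free = ~map[b];       // bits set where subblocks are still free
            if (free != 0) {
                int sub = Integer.numberOfTrailingZeros(free);
                map[b] |= (1 << sub);
                return b * SUBBLOCKS_PER_BLOCK + sub;
            }
        }
        return -1;
    }

    /** Free a previously allocated subblock id. */
    void free(int subblockId) {
        map[subblockId / SUBBLOCKS_PER_BLOCK] &= ~(1 << (subblockId % SUBBLOCKS_PER_BLOCK));
    }
}
```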
Fault Tolerance (GPFS)
Node failures: the failed node's log, which is kept on shared disk, is replayed by another node to restore metadata consistency.
Communication failures: when the network is partitioned, file system access is allowed only in the partition containing a majority of the nodes.
Disk failures: dual-attached RAID provides redundancy. [2] [4]

File Systems: Internet Services vs. HPC

Introduction
Leading Internet services have designed and implemented file systems "from scratch" to provide high performance for their anticipated application workloads and usage scenarios.
Leading examples of such Internet services file systems include the Google File System (GoogleFS), Amazon Simple Storage Service (S3), and the open-source Hadoop Distributed File System (HDFS).
Another style of computing at a comparable scale, and with a growing marketplace, is high performance computing (HPC).
Like Internet applications, HPC applications are often data-intensive and run in parallel on large clusters (supercomputers).
These applications use parallel file systems for highly scalable and concurrent storage I/O; examples include IBM's GPFS, Sun's LustreFS, and the open-source Parallel Virtual File System (PVFS).

Comparison

Experimental Evaluation
The authors implemented a shim layer that uses Hadoop's extensible abstract file system API (org.apache.hadoop.fs.FileSystem) to route all file I/O operations to PVFS.
Hadoop directs all file system operations to the shim layer, which forwards each request to the PVFS user-level library.
The implementation makes no code changes to PVFS; the only configuration change is increasing the default 64 KB stripe size to match the 64 MB HDFS chunk size during PVFS setup.

Experimental Evaluation (contd.)
The shim layer has three key components used by Hadoop applications; a skeleton of such a shim appears after this slide.
Readahead buffering – While applications can be programmed to request data in any size, the Hadoop framework uses 4 KB as the default amount of data accessed in each file system call. Instead of performing such small reads, HDFS prefetches the entire chunk (64 MB by default), and the shim provides similar readahead buffering on top of PVFS.
Data layout module – The Hadoop/MapReduce job scheduler distributes computation tasks across many nodes in the cluster. Although not mandatory, it prefers to assign tasks to nodes that store the input data required for the task, which requires the scheduler to know the file's layout. As a parallel file system, PVFS already has this information at the client and exposes the striping layout as an extended attribute of each file; the shim matches the HDFS layout API by querying these attributes as needed.
Replication emulator – Although the public release of PVFS does not support triplication, the shim lets PVFS emulate HDFS-style replication by writing, on behalf of the client, to three data servers on every application write. Note that it is the client that sends the three write requests to different servers, unlike HDFS, which pipelines writes among its servers. This approach was motivated by the simplicity of emulating replication at the client rather than making non-trivial changes to the PVFS server implementation; planned work in the PVFS project includes support for replication.
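Because the shim layer is only described in prose, here is a compact, self-contained sketch of its three ideas. It deliberately does not implement the real org.apache.hadoop.fs.FileSystem class (which has many more abstract methods) or the real PVFS client API; PvfsClient, the 4 MB readahead size, and every method name below are assumptions, and only the split into readahead buffering, layout queries, and client-side triple writes comes from [2].

```java
// Self-contained sketch of the three shim-layer components described above.
import java.util.List;

public class PvfsShimSketch {
    // Hadoop issues ~4 KB reads by default; the shim reads larger buffers instead.
    static final int READAHEAD_SIZE = 4 * 1024 * 1024;   // assumed readahead size

    private final PvfsClient pvfs;
    private byte[] buffer = new byte[0];
    private long bufferStart = -1;

    PvfsShimSketch(PvfsClient pvfs) { this.pvfs = pvfs; }

    /** Readahead buffering: serve small application reads from one big cached read. */
    int read(String path, long offset, byte[] dst) {
        if (offset < bufferStart || offset + dst.length > bufferStart + buffer.length) {
            buffer = pvfs.read(path, offset, READAHEAD_SIZE);   // one large PVFS read
            bufferStart = offset;
        }
        int n = Math.min(dst.length, buffer.length - (int) (offset - bufferStart));
        System.arraycopy(buffer, (int) (offset - bufferStart), dst, 0, n);
        return n;
    }

    /** Data layout: expose the PVFS striping info (an extended attribute) to the scheduler. */
    List<String> blockLocations(String path, long offset) {
        return pvfs.getLayoutFromXattr(path, offset);
    }

    /** Replication emulator: the client itself writes to three servers (no pipelining). */
    void write(String path, long offset, byte[] data) {
        for (String server : pvfs.pickServers(path, offset, 3)) {
            pvfs.writeTo(server, path, offset, data);
        }
    }

    // --- hypothetical PVFS client interface, for illustration only ---
    interface PvfsClient {
        byte[] read(String path, long offset, int len);
        List<String> getLayoutFromXattr(String path, long offset);
        List<String> pickServers(String path, long offset, int count);
        void writeTo(String server, String path, long offset, byte[] data);
    }
}
```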
Experimental Setup
Experiments were performed on two clusters.
A small cluster for microbenchmarks: the SS cluster consists of 20 nodes, each with a dual-core 3 GHz Pentium D processor, 4 GB of memory, and one 7200 rpm 180 GB SATA Seagate Barracuda disk with an 8 MB buffer. Nodes are directly connected to an HP ProCurve 2848 Gigabit Ethernet switch with about 100 microseconds of node-to-node latency. All machines run the Linux 2.6.24.2 kernel (Debian release) and use the ext3 file system to manage their disks.
A big cluster for running real applications: for large-scale testing, the Yahoo! M45 cluster is used, a 4000-core cluster for experimenting with ideas in data-intensive scalable computing. It makes about 400 nodes available, of which typically 50-100 are used at a time, each with two quad-core 1.86 GHz Xeon processors, 6 GB of memory, and four 7200 rpm 750 GB SATA Seagate Barracuda ES disks with 8 MB buffers. Because of the configuration of these nodes, only one disk is used for the PVFS I/O server. Nodes are interconnected by a Gigabit Ethernet switch hierarchy. All machines run Red Hat Enterprise Linux Server 5.1 with the 2.6.18-53.1.13.el5 kernel and use the ext3 file system to manage their disks.

Results
Results – Microbenchmarks
Results – Microbenchmarks (contd.)
Results – Microbenchmarks (contd.)
Performance of Real Applications

Conclusion and Future Work
The paper explores the relationship between modern parallel file systems, represented by PVFS, and purpose-built Internet services file systems, represented by HDFS, in the context of their design and performance.
It shows that PVFS can perform comparably to HDFS in the Hadoop Internet services stack.
The biggest difference between PVFS and HDFS is the redundancy scheme for handling failures.
On balance, the authors believe that parallel file systems could be made available for use in Hadoop while delivering promising performance for diverse access patterns.
Internet services can benefit from parallel file system specializations for concurrent writing and for faster metadata and small-file operations, and with a range of parallel file systems to choose from, they can select one that integrates better with their local data management tools.
Future work is to investigate the "opposite" direction: how Internet services file systems could be used for HPC applications.

References
[1] F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters," Proceedings of the Conference on File and Storage Technologies (FAST '02), Monterey, CA, January 2002, pp. 231-244. USENIX, Berkeley, CA.
[2] W. Tantisiriroj, S. Patil, and G. Gibson, "Data-intensive file systems for Internet services: A rose by any other name ...," Technical Report CMU-PDL-08-114, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA, October 2008. http://www.ece.rutgers.edu/~parashar/Classes/08-09/ece572/readings/hdfspvfs-tr-08.pdf
[3] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," 19th ACM Symposium on Operating Systems Principles (SOSP), Lake George, NY, October 2003.
[4] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads," Proceedings of the VLDB Endowment, Vol. 2, No. 1, August 2009.
[5] J. Lin and C. Dyer, "Data-Intensive Text Processing with MapReduce," University of Maryland, College Park. http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf
THANK YOU!!