Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

Google’s NoSQL Solution
Chao Wang
wang660@usc.edu
2013/4/1

Webtable Example
How many web pages are there? Recently Google reported finding 1 trillion unique URLs, which would require 80 terabytes to store.
How much storage is required to hold a single snapshot of the Web? 1 trillion web pages at 100 KB per page require 100 petabytes.
How is this data stored in Bigtable?

Introduction
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
Many projects at Google store data in Bigtable, including web indexing, Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.
Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability.

Data Model
A Bigtable is a sparse, distributed, persistent multidimensional sorted map.
The map is indexed by a row key, a column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

Rows
The row keys in a table are arbitrary strings (currently up to 64 KB in size, although 10-100 bytes is a typical size for most of our users).
Bigtable maintains data in lexicographic order by row key.
The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing.

Column Families
Column keys are grouped into sets called column families, which form the basic unit of access control.
A column key is named using the following syntax: family:qualifier.
Column family names must be printable, but qualifiers may be arbitrary strings.
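
The row and column conventions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Bigtable's implementation: `parse_column_key`, `split_into_tablets`, and the fixed `rows_per_tablet` cutoff are hypothetical names introduced here, and real tablets are partitioned dynamically, not by a fixed row count.

```python
def parse_column_key(column_key):
    """Split a 'family:qualifier' column key into (family, qualifier).

    The family must be non-empty and printable; the qualifier may be any
    string (including an empty one, or one that itself contains ':').
    """
    family, sep, qualifier = column_key.partition(":")
    if not sep or not family or not family.isprintable():
        raise ValueError(f"invalid column key: {column_key!r}")
    return family, qualifier


def split_into_tablets(row_keys, rows_per_tablet):
    """Sort row keys and cut them into contiguous row ranges ("tablets").

    Python compares str values by code point, which matches byte-wise
    lexicographic order for ASCII keys. The fixed-size cut is an assumption
    made for illustration only.
    """
    ordered = sorted(row_keys)
    return [ordered[i:i + rows_per_tablet]
            for i in range(0, len(ordered), rows_per_tablet)]


if __name__ == "__main__":
    print(parse_column_key("family:qualifier"))
    print(split_into_tablets(["row-c", "row-a", "row-d", "row-b"], 2))
```

Because each tablet holds a contiguous, sorted row range, a scan over a short key range in this sketch only ever touches one or two of the returned lists.
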
Timestamps
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp.
Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent “real time” in microseconds, or be explicitly assigned by client applications.

Webtable Example
Rows
In Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html.
Storing pages from the same domain near each other makes some host and domain analyses more efficient.

Column Families
An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page's language ID.

Timestamps
In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled.
The garbage-collection mechanism lets us keep only the most recent versions of every page, up to a number that we specify.

API
The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column family metadata, such as access control rights.
Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key.
Bigtable allows cells to be used as integer counters.
Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in Sawzall, a language developed at Google for processing data.
Bigtable can be used with MapReduce, a framework for running large-scale parallel computations developed at Google.
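
The Webtable row keys, versioned cells, and single-row read-modify-write described above can be tied together in a small in-memory sketch. It is a toy model under stated assumptions, not the Bigtable client API: `ToyTable`, `webtable_row_key`, `max_versions`, and the `stats:hits` column are hypothetical names chosen for illustration, and a per-row lock stands in for a single-row transaction.

```python
import threading
import time
from urllib.parse import urlsplit


def webtable_row_key(url):
    """Reverse the hostname components so pages of one domain sort together."""
    parts = urlsplit(url if "://" in url else "//" + url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path


class ToyTable:
    """Toy stand-in: row key -> column key -> [(timestamp, bytes)], newest first."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions    # versions kept per cell (assumed knob)
        self.rows = {}
        self.row_locks = {}
        self.table_lock = threading.Lock()  # guards the two dicts above

    def _lock_for(self, row):
        with self.table_lock:
            self.rows.setdefault(row, {})
            return self.row_locks.setdefault(row, threading.Lock())

    def _put_locked(self, row, column, value, timestamp):
        versions = self.rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        versions.sort(key=lambda v: v[0], reverse=True)
        del versions[self.max_versions:]    # garbage-collect older versions

    def put(self, row, column, value, timestamp=None):
        if timestamp is None:
            timestamp = time.time_ns() // 1000  # "real time" in microseconds
        with self._lock_for(row):
            self._put_locked(row, column, value, timestamp)

    def get(self, row, column):
        """Return the newest value of a cell, or None if it is empty."""
        with self._lock_for(row):
            versions = self.rows[row].get(column, [])
            return versions[0][1] if versions else None

    def increment(self, row, column, delta=1):
        """Atomic read-modify-write on one row, using the cell as a counter."""
        with self._lock_for(row):
            versions = self.rows[row].get(column, [])
            current = int(versions[0][1]) if versions else 0
            now = time.time_ns() // 1000
            # Keep timestamps strictly increasing so the new value is the newest.
            timestamp = max(now, versions[0][0] + 1) if versions else now
            self._put_locked(row, column, str(current + delta).encode(), timestamp)
            return current + delta


if __name__ == "__main__":
    table = ToyTable(max_versions=2)
    key = webtable_row_key("maps.google.com/index.html")
    table.put(key, "contents:", b"<html>first crawl</html>", timestamp=1)
    table.put(key, "contents:", b"<html>second crawl</html>", timestamp=2)
    print(key, table.get(key, "contents:"), table.increment(key, "stats:hits"))
```

Holding one lock per row keeps the sketch close to the stated guarantee: atomicity covers a single row key, and nothing here coordinates updates across rows.
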
Building Blocks
Bigtable is built on several other pieces of Google infrastructure.
Bigtable uses the distributed Google File System (GFS) to store log and data files.
The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered, immutable map from keys to values, where both keys and values are arbitrary byte strings.
Bigtable relies on a highly available and persistent distributed lock service called Chubby.
Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data; to discover tablet servers and finalize tablet server deaths; to store Bigtable schema information (the column family information for each table); and to store access control lists.

Implementation
The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers.
The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS.
Each tablet server manages a set of tablets. The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.
As with many single-master distributed storage systems, client data does not move through the master: clients communicate directly with tablet servers for reads and writes.

Tablet Location
Bigtable uses a three-level hierarchy analogous to that of a B+-tree to store tablet location information.

Tablet Assignment
Each tablet is assigned to one tablet server at a time. Bigtable uses Chubby to keep track of tablet servers.

Tablet Serving
Compactions
As write operations execute, the size of the memtable increases.
When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable and written to GFS. This process is a minor compaction.
A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction.

Refinements
Locality groups
Clients can group multiple column families together into a locality group. A separate SSTable is generated for each locality group in each tablet.
Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads.

Compression
Clients can control whether or not the SSTables for a locality group are compressed, and if so, which compression format is used. The user-specified compression format is applied to each SSTable block.

Caching for read performance
To improve read performance, tablet servers use two levels of caching.
The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code.
The Block Cache is a lower-level cache that caches SSTable blocks that were read from GFS.

Bloom filters
A Bloom filter allows us to ask whether an SSTable might contain any data for a specified row/column pair.
For certain applications, a small amount of tablet server memory used for storing Bloom filters drastically reduces the number of disk seeks required for read operations.

Commit-log implementation
Using one commit log per tablet server provides significant performance benefits during normal operation, but it complicates recovery.

Speeding up tablet recovery
If the master moves a tablet from one tablet server to another, the source tablet server first does a minor compaction on that tablet. After finishing this compaction, the tablet server stops serving the tablet.
Before it actually unloads the tablet, the tablet server does another (usually very fast) minor compaction to eliminate any remaining uncompacted state in the tablet server's log that arrived while the first minor compaction was being performed.
After this second minor compaction is complete, the tablet can be loaded on another tablet server without requiring any recovery of log entries.

Exploiting immutability
Besides the SSTable caches, various other parts of the Bigtable system have been simplified by the fact that all of the SSTables that we generate are immutable.

Pros
The paper introduces the structure and function of Bigtable comprehensively.
It discusses how Bigtable meets different requirements.
It shares the experience gained while designing Bigtable.

Cons
According to Professor Eric Brewer’s CAP theorem, a distributed system cannot provide consistency, availability, and partition tolerance at the same time.
Treated as a typical AP database, Bigtable has consistency as its weakness, yet the paper does not discuss it.