Bigtable: A Distributed Storage System for Structured Data
Fay Chang et al. (Google, Inc.)
Presenter: Kyungho Jeon, kyunghoj@buffalo.edu
10/22/2012
Fall 2012: CSE 704 Web-scale Data Management

Motivation and Design Goal
• Distributed Storage System for Structured Data
  – Scalability
    • Petabytes of data on thousands of (commodity) machines
  – Wide Applicability
    • Throughput-oriented and latency-sensitive
  – High Performance
  – High Availability

Data Model
• Not a full relational data model
• Provides a simple data model
  – Supports dynamic control over data layout
  – Allows clients to reason about the locality properties

Data Model – A Big Table
• A table in Bigtable is a:
  – Sparse
  – Distributed
  – Persistent
  – Multidimensional
  – Sorted map

Data Model
• Data is indexed using row and column names
• Data is treated as uninterpreted strings
  – (row:string, column:string, time:int64) → string
• Data locality can be controlled through careful choices of the schema

Data Model
• Rows
  – Data maintained in lexicographic order by row key
  – Tablet: rows with consecutive keys
    • Units of distribution and load balancing
• Columns
  – Column families
    • Family:qualifier
• Cells
• Timestamps

Data Model – WebTable Example
A large collection of web pages and related information

Data Model – WebTable Example
Row key
Tablet: a group of rows with consecutive keys.
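
The (row, column, time) → string map described above can be sketched in a few lines. This is only a toy in-memory illustration, not Bigtable's API: the class name `TinyTable`, its methods, and the sample row keys are assumptions made for the example. Timestamps are negated inside the key so that the newest version of a cell sorts first.

```python
import bisect


class TinyTable:
    """Toy model of a sorted map (row, column, time) -> value. Hypothetical, not Bigtable's API."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column, -timestamp)
        self._values = {}  # key -> uninterpreted string

    def put(self, row, column, timestamp, value):
        key = (row, column, -timestamp)  # negate so newer versions sort first
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, row, column):
        """Return (timestamp, value) for the newest version of a cell, or None."""
        i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
        if i < len(self._keys) and self._keys[i][:2] == (row, column):
            key = self._keys[i]
            return -key[2], self._values[key]
        return None

    def scan(self, start_row, end_row):
        """Yield cells with start_row <= row < end_row in lexicographic key order."""
        i = bisect.bisect_left(self._keys, (start_row,))
        while i < len(self._keys) and self._keys[i][0] < end_row:
            row, column, neg_ts = self._keys[i]
            yield row, column, -neg_ts, self._values[self._keys[i]]
            i += 1


# Illustrative data only; the row keys and values are made up for this sketch.
table = TinyTable()
table.put("com.cnn.www", "contents:", 3, "<html>v3")
table.put("com.cnn.www", "contents:", 5, "<html>v5")
table.put("com.cnn.www", "anchor:www.abc.com", 9, "CNN")
table.put("com.example", "contents:", 1, "<html>")
print(table.get("com.cnn.www", "contents:"))
print(list(table.scan("com.cnn", "com.d")))
```

Because all keys live in one sorted sequence, a range of consecutive row keys is a contiguous slice, which is what makes a tablet a natural unit of distribution.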
The tablet is the unit of distribution.
Bigtable maintains data in lexicographic order by row key.

Data Model – WebTable Example
• Column family: the unit of access control
• Column key: specified as “column family:qualifier”
• A column can be added to a column family once the family has been created
• Cell: the storage referenced by a particular row key, column key, and timestamp
• Each cell can contain multiple versions of the same data, indexed by timestamp

API
• Write or delete values in Bigtable
• Look up values from individual rows
• Iterate over a subset of the data in a table

API – Update a Row
• Open a table
• Create a mutation for the row
• Store a new item under the column key “anchor:www.cspan.org”
• Delete an item under the column key “anchor:www.abc.com”
• Apply the mutation atomically

API – Iterate over a Table
• Create a Scanner instance
• Access the “anchor” column family
• Specify “return all versions”
• Specify a row key
• Iterate over rows

API – Other Features
• Single-row transactions
• Client-supplied scripts executed in the address space of the servers
• Input source/output target for MapReduce jobs

A Typical Google Machine
[Figure]

A Google Cluster
[Figure]

Building Blocks
• Chubby
  – Highly available and persistent distributed lock service
• GFS
  – Stores logs and data files
  – SSTable
    • Google’s immutable file format
    • A persistent, ordered, immutable map from keys to values
    • http://code.google.com/p/leveldb/

Chubby
• Highly available and persistent distributed lock service
  – 5 replicas, one of which is elected as the master
  – Paxos keeps the replicas consistent
  – Provides a namespace that consists of directories and small files

Implementation
• Client library
• Master – one and only one!
• Tablet servers – many

Implementation – Master
• Responsible for:
  – Assigning tablets to tablet servers
  – Detecting the addition/removal of tablet servers
  – Balancing tablet-server load
  – Garbage-collecting files in GFS
• Handles schema changes
• Single-master system (as in GFS)

Tablet Server
• Manages a set of tablets
• Handles read and write requests to the tablets
• Splits tablets that have grown too large

How Does a Client Find a Tablet?
[Figure]

Tablet Assignment
• Each tablet is assigned to at most one tablet server at a time
• When a tablet is unassigned and a tablet server is available, the master assigns the tablet by sending a tablet load request
• Bigtable uses Chubby to keep track of tablet servers

Tablet Assignment
• Detecting a tablet server that is no longer serving its tablets
  – The master periodically asks each tablet server for the status of its lock
  – If a tablet server reports that it has lost its lock, or if the master cannot reach a tablet server, the master attempts to acquire an exclusive lock on the server’s file
  – If the master acquires the lock, Chubby is alive, so the problem lies with the tablet server
  – The master deletes the server’s file in Chubby to ensure that the tablet server can never serve again
  – The master then moves all the tablets that were previously assigned to that server into the set of unassigned tablets

Tablet Assignment
• When a master is started, the master:
  – Grabs a unique master lock in Chubby
  – Scans the servers directory in Chubby to find the live servers
  – Communicates with every live tablet server to discover the current tablet assignment
  – Scans the METADATA table and adds any tablet that is not yet assigned to the set of unassigned tablets

Tablet Serving
[Figure]

Tablet Serving
• Memtable
  – A sorted buffer
  – Maintains the updates on a row-by-row basis
  – Each row is copy-on-write to maintain row-level consistency
  – Older updates are stored in a sequence of SSTables

Tablet Serving – Write
• Write operation
  – The server checks whether the operation is valid
  – A valid mutation is written to the commit log
  – After the write has been committed, its contents are inserted into the memtable

Tablet Serving – Read
• Read operation
  – The server checks whether the operation is valid
  – A valid operation is executed on a merged view of the sequence of SSTables and the memtable
  – The merged view can be formed efficiently because the SSTables and the memtable are lexicographically sorted data structures

Tablet Serving – Recover
• Recovering a tablet
  – The tablet server reads the tablet’s metadata from the METADATA table
  – The metadata contains the list of SSTables that comprise the tablet and a set of redo points
  – The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points

Compaction
• Minor compaction
  – When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable
• Major compaction
  – Rewrites all of a tablet’s SSTables into exactly one SSTable
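
The write, read, recovery, and compaction steps above can be sketched as a toy single-tablet store. This is a minimal illustration under stated assumptions, not Bigtable's implementation: `TinyTablet`, `memtable_limit`, and the `TOMBSTONE` deletion marker are hypothetical names, Python lists stand in for the commit log and the SSTables kept in GFS, the redo point is a single list index, and the merged read simply checks the newest source first rather than merging sorted iterators.

```python
import bisect

TOMBSTONE = None  # deletion marker; an assumption of this sketch


class TinyTablet:
    def __init__(self, memtable_limit=2):
        self.commit_log = []  # stand-in for the commit log stored in GFS
        self.redo_point = 0   # log entries before this index are already in SSTables
        self.memtable = {}    # recently committed writes
        self.sstables = []    # immutable sorted tuples of (key, value), oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. commit the mutation to the log
        self.memtable[key] = value            # 2. then insert it into the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.write(key, TOMBSTONE)

    def minor_compaction(self):
        # Freeze the memtable, start a new one, and turn the frozen one into an SSTable.
        frozen, self.memtable = self.memtable, {}
        self.sstables.append(tuple(sorted(frozen.items())))
        self.redo_point = len(self.commit_log)

    def major_compaction(self):
        # Rewrite all SSTables into exactly one, dropping deleted entries.
        merged = {}
        for sstable in self.sstables:  # oldest first, so newer values overwrite
            merged.update(sstable)
        live = sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)
        self.sstables = [tuple(live)]

    def read(self, key):
        # Merged view: the memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            i = bisect.bisect_left(sstable, (key,))
            if i < len(sstable) and sstable[i][0] == key:
                return sstable[i][1]
        return None

    def recover(self):
        # Rebuild the memtable by replaying updates committed since the redo point.
        self.memtable = dict(self.commit_log[self.redo_point:])


tablet = TinyTablet(memtable_limit=2)
tablet.write("a", "1")
tablet.write("b", "2")   # threshold reached: first minor compaction
tablet.write("a", "3")
tablet.delete("b")       # threshold reached: second minor compaction
tablet.write("c", "4")
print(tablet.read("a"), tablet.read("b"), tablet.read("c"))
```

Keeping SSTables immutable is what lets the sketch, like the design it imitates, treat compaction as "write a new file, then swap the list" instead of editing data in place.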
Compaction
[Figure: write operations go to the commit log in GFS and to the memtable in memory; the SSTables are stored in GFS]
• Threshold reached: the memtable is written out as an additional SSTable
• A new memtable is created
• Major compaction: the SSTables are rewritten into a single SSTable

Schema Management
• Bigtable schemas are stored in Chubby
• The master updates the schema by rewriting the corresponding schema file in Chubby

Optimization
• Locality group
  – Client defined
  – An abstraction that enables clients to control their data’s storage layout
  – A separate SSTable is generated for each locality group in each tablet during compaction
  – A locality group can be declared to be in-memory

Optimization
• Compression
  – Clients can control whether the SSTables for a locality group are compressed

Optimization
• Two-level caching for read performance
  – Scan cache
    • Higher level
    • Caches the key-value pairs returned by the SSTable interface to the tablet server code
  – Block cache
    • Lower level
    • Caches SSTable blocks

Optimization
• Commit-log implementation
  – One log per tablet server
  – Recovery?
    • A tablet server that hosted 100 tablets fails
    • 100 other machines are each assigned a single tablet
    • 100 full reads of the commit log?
    • Sort the commit log by <table, row name, log sequence number>
  – Writing commit logs
    • Two log-writer threads

Performance Evaluation
• Sequential writes/reads
  – Row keys with names 0 to R-1, partitioned into 10N equal-sized ranges
  – Wrote a single string under each row key
  – 1 GB per tablet server
• Scan
  – Uses the Bigtable Scan API
• Random writes/reads
  – Similar to sequential writes/reads, but the row key was hashed
• Random reads (mem)
  – 100 MB per tablet server; the locality group is marked as in-memory

Single Tablet Server Performance
[Figure]

Aggregate Throughput
[Figure]

Real Applications
[Figure]

Lessons Learned
• Failures!
• Delay new features until it is clear how the new features will be used
• Monitoring
• Simple design!

Acknowledgement
• Jeff Dean, “Handling Large Datasets at Google: Current Systems and Future Directions”
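
The commit-log recovery idea from the Optimization slides, sorting the shared log by <table, row name, log sequence number>, can be illustrated with a short sketch. The function names, the tuple layout of a log entry, and the sample table names are assumptions made for this example, not Bigtable's actual log format. After sorting, the mutations for one tablet's row range form one contiguous slice, so a recovering server can fetch them with a single range lookup instead of scanning the whole log.

```python
import bisect
from operator import itemgetter


def sort_commit_log(entries):
    """Sort (table, row, seq, mutation) entries by table, then row, then sequence number."""
    return sorted(entries, key=itemgetter(0, 1, 2))


def mutations_for_tablet(sorted_log, table, start_row, end_row):
    """Return the contiguous slice of entries with start_row <= row < end_row."""
    lo = bisect.bisect_left(sorted_log, (table, start_row))
    hi = bisect.bisect_left(sorted_log, (table, end_row))
    return sorted_log[lo:hi]


# One shared log with mutations for several tablets interleaved (made-up data).
log = [
    ("webtable", "com.cnn.www", 1, "set A"),
    ("users", "alice", 2, "set B"),
    ("webtable", "com.abc.www", 3, "set C"),
    ("webtable", "com.cnn.www", 4, "delete A"),
    ("users", "bob", 5, "set D"),
]
sorted_log = sort_commit_log(log)
for entry in mutations_for_tablet(sorted_log, "webtable", "com.c", "com.d"):
    print(entry)
```

Including the sequence number as the last sort key keeps the mutations for each row in their original commit order, which matters when they are replayed.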