Bigtable: A Distributed Storage System for Structured Data

Jing Zhang
Reference: Handling Large Datasets at Google: Current System and Future Directions, Jeff Dean
EECS 584, Fall 2011

Outline

Motivation
Data Model
APIs
Building Blocks
Implementation
Refinement
Evaluation

Google's Motivation – Scale!

The scale problem:
– Lots of data
– Millions of machines
– Many different projects/applications
– Hundreds of millions of users
Storage for (semi-)structured data
No commercial system big enough
– Couldn't afford one even if there were
Low-level storage optimizations help performance significantly
– Much harder to do when running on top of a database layer

Bigtable

Distributed multi-level map
Fault-tolerant, persistent
Scalable:
– Thousands of servers
– Terabytes of in-memory data
– Petabytes of disk-based data
– Millions of reads/writes per second, efficient scans
Self-managing:
– Servers can be added/removed dynamically
– Servers adjust to load imbalance

Real Applications

[Figure: examples of Google applications built on Bigtable]

Data Model

A sparse, distributed, persistent, multidimensional sorted map:
(row, column, timestamp) -> cell contents

Rows
– Arbitrary strings
– Access to data in a row is atomic
– Ordered lexicographically

Columns
– Two-level name structure:
  • family:qualifier
– The column family is the unit of access control

Timestamps
– Store different versions of data in a cell
– Lookup options:
  • Return the most recent K values
  • Return all values

Tablets
– The row range for a table is dynamically partitioned
– Each row range is called a tablet
– The tablet is the unit of distribution and load balancing

APIs

Metadata operations
– Create/delete tables and column families, change metadata
Writes
– Set(): write cells in a row
– DeleteCells(): delete cells in a row
– DeleteRow(): delete all cells in a row
Reads
– Scanner: read arbitrary cells in a bigtable (see the sketches below)
  • Each row read is atomic
  • Can restrict returned rows to a particular range
  • Can ask for data from just one row, from all rows, etc.
  • Can ask for all columns, just certain column families, or specific columns
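To make the write API concrete, here is a sketch modeled on the C++ client example in the Bigtable paper (Chang et al., OSDI '06); names such as OpenOrDie, RowMutation, Operation, and Apply come from that example, not from a public client library:

  // Open the table.
  Table *T = OpenOrDie("/bigtable/web/webtable");

  // All mutations to a single row are applied atomically.
  RowMutation r1(T, "com.cnn.www");
  r1.Set("anchor:www.c-span.org", "CNN");  // write a cell; column name is family:qualifier
  r1.Delete("anchor:www.abc.com");         // delete a cell in the same row
  Operation op;
  Apply(&op, &r1);                         // apply the whole row mutation atomically

Note how the column names use the family:qualifier structure from the Data Model slides, and how atomicity is per row, matching "access to data in a row is atomic" above.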
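Reads go through a Scanner. This companion sketch, modeled on the same paper's read example, prints all versions of every "anchor" cell in one row (again, the names are the paper's example API, not a public library):

  Scanner scanner(T);
  ScanStream *stream;
  stream = scanner.FetchColumnFamily("anchor");  // only the "anchor" column family
  stream->SetReturnAllVersions();                // all timestamps, not just the most recent
  scanner.Lookup("com.cnn.www");                 // restrict the scan to a single row
  for (; !stream->Done(); stream->Next()) {
    printf("%s %s %lld %s\n",
           scanner.RowName(),
           stream->ColumnName(),
           stream->MicroTimestamp(),
           stream->StringValue());
  }

Dropping the Lookup() call makes the stream iterate over all rows of the table instead of one, which is how the "all rows" option on the slide is expressed.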
Typical Cluster

A shared pool of machines that also run other distributed applications

Building Blocks

Google File System (GFS)
– Stores persistent data (in the SSTable file format)
Scheduler
– Schedules jobs onto machines
Chubby
– Lock service: a distributed lock manager
– Used for master election and location bootstrapping
MapReduce (optional)
– Data processing
– Reads/writes Bigtable data

Chubby

A {lock/file/name} service
Coarse-grained locks
Each client has a session with Chubby
– The session expires if the client is unable to renew its session lease within the lease expiration time
5 replicas; a majority vote is needed to be active
Also an OSDI '06 paper

Implementation

Single-master distributed system
Three major components:
– A library linked into every client
– One master server, responsible for:
  • Assigning tablets to tablet servers
  • Detecting the addition and expiration of tablet servers
  • Balancing tablet-server load
  • Garbage collection
  • Metadata operations
– Many tablet servers
  • Each tablet server handles read and write requests to its tablets
  • Splits tablets that have grown too large

[Figure: Bigtable system architecture]

Tablets

Each tablet is assigned to one tablet server
– A tablet holds a contiguous range of rows
  • Clients can often choose row keys to achieve locality
– Aim for ~100–200 MB of data per tablet
Each tablet server is responsible for ~100 tablets
– Fast recovery:
  • 100 machines can each pick up 1 tablet from a failed machine
– Fine-grained load balancing:
  • Migrate tablets away from an overloaded machine
  • The master makes the load-balancing decisions

How to Locate a Tablet?

Given a row, how do clients find the location of the tablet whose row range covers the target row?
– METADATA table: key = table id + end row, data = location
– Aggressive caching and prefetching on the client side (see the sketch below)
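The end-row keying is what makes a location lookup a single ordered search: the covering tablet is the first METADATA entry whose end row is >= the target row. The self-contained toy below (plain C++ with made-up table names, locations, and key encoding; not Bigtable code) demonstrates the idea with std::map::lower_bound:

  #include <iostream>
  #include <map>
  #include <string>

  int main() {
    // Toy METADATA table: "<table id>,<end row>" -> tablet location.
    // "\xff" acts as a sentinel end row for the last tablet.
    std::map<std::string, std::string> metadata = {
        {"webtable,cz",   "tabletserver-17"},  // covers rows up to "cz"
        {"webtable,mz",   "tabletserver-03"},  // covers rows in ("cz", "mz"]
        {"webtable,\xff", "tabletserver-42"},  // covers the remaining rows
    };

    std::string row = "com.cnn.www";
    // First entry whose key (table id + end row) is >= the search key.
    auto it = metadata.lower_bound("webtable," + row);
    if (it == metadata.end()) {
      std::cout << "no tablet covers " << row << "\n";
      return 1;
    }
    std::cout << row << " -> " << it->second << "\n";  // prints tabletserver-17
  }

In the real system this lookup is three levels deep (a Chubby file points at the root tablet, which indexes the METADATA tablets, which index the user tablets), and the client caches and prefetches METADATA rows so that most requests skip the hierarchy entirely.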
Tablet Assignment

Each tablet is assigned to one tablet server at a time
The master keeps track of the set of live tablet servers and the current assignment of tablets to servers
When a tablet is unassigned, the master assigns it to a tablet server with sufficient room
The master uses Chubby to monitor the health of tablet servers and to restart/replace failed servers

Tablet Assignment: Chubby
– A tablet server registers itself by acquiring a lock on a file in a specific Chubby directory
  • Chubby gives a "lease" on the lock, which must be renewed periodically
  • A server loses its lock if it gets disconnected
– The master monitors this directory to find which servers exist and are alive
  • If a server is not contactable or has lost its lock, the master grabs the lock and reassigns its tablets
  • GFS replicates the data; prefer to restart the tablet server on a machine that already holds the data

Refinement – Locality Groups & Compression

Locality groups
– Multiple column families can be grouped into a locality group
  • A separate SSTable is created for each locality group in each tablet
– Segregating column families that are not typically accessed together enables more efficient reads
  • In Webtable, page metadata can go in one group and the contents of the page in another

Compression
– Many opportunities for compression:
  • Similar values in the same cell at different timestamps
  • Similar values in different columns
  • Similar values across adjacent rows

Performance – Scaling

Not linear! Why?
As the number of tablet servers increases by a factor of 500:
– Performance of random reads from memory increases by a factor of only 300
– Performance of scans increases by a factor of only 260

Why Not Linear?

Load imbalance
– Competition with other processes for:
  • Network
  • CPU
– The rebalancing algorithm does not work perfectly:
  • It deliberately limits the number of tablet movements
  • Load shifts around as the benchmark progresses
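A quick back-of-the-envelope check on what "not linear" means here, using only the factors from the slide above: going from 1 to 500 tablet servers retains

  random reads from memory: 300 / 500 = 0.60 of per-server throughput
  scans:                    260 / 500 = 0.52 of per-server throughput

i.e., each server delivers roughly 60% (random reads) or 52% (scans) of the single-server rate once the cluster is contended and imbalanced.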