Bigtable: A Distributed Storage System for Structured Data

Authors: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson Hsieh, Deborah Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert Gruber
Presented by: Arif Bin Hossain, Dept. of Computer Science, UTSA

Motivation
• Large-scale structured data
  - URLs: contents, links, anchors, page rank
  - User data: preference settings, recent queries, search results
  - Geographic locations: physical entities, roads, satellite images
• Large sets of structured MATLAB data
  - EEG, EMG, eye motion
  - Fields are not uniform among datasets
  - Data types are not uniform among datasets

Why not a Relational Database?
• Scale is too large for most commercial databases
• Even if it weren't, the cost would be very high
• Low-level storage optimizations help performance significantly
• Semi-structured data is hard to map to a relational database
• Non-uniform fields make it difficult to insert and query data

Bigtable
• Bigtable is a distributed storage system for managing structured data
• Designed to scale to a very large size
• Used for many Google projects
  - Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance
• Efficient scans over all or interesting subsets of data
• Efficient joins of large one-to-one and one-to-many datasets

Bigtable
• Used for a variety of demanding workloads
  - Throughput-oriented batch processing
  - Latency-sensitive data serving
• Data is indexed using row and column names
• Treats data as uninterpreted strings
• Clients can control the locality of their data
• Dynamic controls to serve data out of memory or from disk

Building Blocks
• Google File System (GFS)
  - Large-scale distributed file system
  - Maintains multiple replicas
  - Consists of a master and chunk servers
• Chunk server
  - Stores the data files
  - Each data file is broken into fixed-size chunks
  - Each chunk is replicated, three times by default
• Master
  - Stores the metadata associated with the chunks

Building Blocks
• Chubby lock service
  - Has five active replicas
  - Provides a namespace that consists of directories and files
  - Each file can be used as a lock
  - Each Chubby client maintains a session with the Chubby service
  - When a client's session expires, the client loses any locks and open handles

Building Blocks
• SSTable
  - Immutable file format used internally to store data files
  - Sorted key-value pairs of arbitrary byte strings
  - Contains a sequence of blocks
  - A block index is used to locate blocks
  - The index is loaded into memory when the SSTable is opened
  - A lookup can be performed with a single disk access
[Figure: an SSTable drawn as a row of 64K blocks followed by an index]

Basic Data Model
• A table is a sparse, distributed, persistent, multidimensional sorted map
• Data is organized into three dimensions
  - (row: string, column: string, time: int64) → string
• Each cell is referenced by a row key, a column key, and a timestamp

Basic Data Model
• (row, column, timestamp) → cell contents
• Example: Webtable
[Figure omitted in extraction]

Data Model: Row
• The row name is an arbitrary string
• Access to data in a row is atomic
• Row creation is implicit upon storing data
• Transactions are supported within a row
• Rows are ordered lexicographically by row key
• Rows that are close together lexicographically usually reside on one or a small number of machines
• Rows are grouped together to form the unit of load balancing

Data Model: Column
• Columns have a two-level name structure
  - family:qualifier
  - Example: "anchor:cnnsi.com"
• Column keys are grouped into sets called column families
  - Unit of access control
  - All data stored in a column family is usually of the same type
• Additional level of indexing, if desired
• Main idea: limited families, unbounded columns

Data Model: Timestamp
• Used to store different versions of data in a cell
  - New writes default to the current time
  - Can also be set explicitly by clients
• Lookup examples
  - "Return most recent K values"
  - "Return all values in timestamp range (or all values)"
• Column families can be marked with retention settings
  - "Only retain most recent K values in a cell"
  - "Keep values until they are older than K seconds"

Tablets
• Rows with consecutive keys are grouped into tablets
• Unit of load balancing
• Reads of short row ranges are efficient and require communication with only a small number of machines
• Clients can use this property to get good locality by choosing row keys carefully

Tablets (cont.)
• A tablet contains some range of rows, stored essentially as a set of SSTables
[Figure: a tablet made of two SSTables, each drawn as 64K blocks followed by an index]

Implementation
• Three major components
  - A library linked into every client
  - A single master server
    - Assigns tablets to tablet servers
    - Detects the addition and expiration of tablet servers
    - Balances tablet-server load
    - Garbage-collects files in GFS
  - Many tablet servers
    - Each manages a set of tablets
    - Each handles read and write requests to its tablets
    - Each splits tablets that have grown too large

Implementation (cont.)
• Clients communicate directly with tablet servers for reads and writes
• Each table consists of a set of tablets
  - Initially, each table has just one tablet
  - Tablets are automatically split as the table grows
• Row size can be arbitrary (hundreds of GB)

Locating Tablets
• How do clients find the right machine?
  - They need to find the tablet whose row range covers the target row
• Three-level hierarchy
  - Level 1: a Chubby file containing the location of the root tablet
  - Level 2: the root tablet contains the locations of the METADATA tablets
  - Level 3: each METADATA tablet contains the locations of user tablets
• The location of a tablet is stored under a row key that encodes the table identifier and the tablet's end row

Locating Tablets
[Figure omitted in extraction]

Assigning Tablets
• Each tablet is assigned to one tablet server at a time
• The master server keeps track of
  - The set of live tablet servers
  - The current assignments of tablets to servers
  - Unassigned tablets
• When a tablet is unassigned, the master assigns it to a tablet server with sufficient space

Assigning Tablets
• Tablet server startup
  - The server creates, and acquires an exclusive lock on, a uniquely named file in a Chubby directory
  - The master monitors this directory to discover tablet servers
• A tablet server stops serving its tablets
  - If it loses its exclusive lock
  - It tries to reacquire the lock on its file as long as the file still exists
  - If the file no longer exists, the tablet server will never be able to serve again

Assigning Tablets
• Master server startup
  - Grabs a unique master lock in Chubby
  - Scans the tablet server directory in Chubby
  - Communicates with every live tablet server
  - Scans the METADATA table to learn the set of tablets
• The master is responsible for detecting when a tablet server is no longer serving its tablets and for reassigning those tablets as soon as possible
  - It periodically asks each tablet server for the status of its lock
  - If there is no reply, the master tries to acquire the server's lock itself
  - If the master acquires the lock, the tablet server is either dead or having network trouble

Tablet Serving
• Updates are committed to a commit log that stores redo records
• Recently committed updates are stored in memory in a sorted buffer called the memtable
• The memtable maintains the updates on a row-by-row basis
• Older updates are stored in a sequence of immutable SSTables
• To recover a tablet
  - The tablet server reads the tablet's metadata from the METADATA table
  - The metadata contains the list of SSTables and a set of redo points
  - The server reads the indices of the SSTables into memory
  - It reconstructs the memtable by applying all of the updates committed since the redo points

Tablet Serving
• Write operation
  - The server checks that the request is well-formed
  - It checks that the sender is authorized
  - The mutation is written to the commit log
  - After the commit, its contents are inserted into the memtable
• Read operation
  - Similar checks for well-formedness and authorization
  - Executed on a merged view of the sequence of SSTables and the memtable

Compaction: Minor
• As write operations execute, the size of the memtable increases
• When the memtable reaches a threshold
  - The memtable is frozen and converted to an SSTable
  - The SSTable is written to the file system
• Goals
  - Reduce memory usage of the tablet server
  - Reduce the amount of data that must be read from the commit log during recovery

Compaction
• Problem: too many SSTables
  - A read operation might need to merge updates from a number of SSTables
• Merging compaction
  - Reads the contents of a few SSTables and the memtable
  - Writes out a new SSTable
• A merging compaction that rewrites all SSTables into exactly one SSTable is a major compaction

Locality Groups
• Each column family is assigned to a locality group defined by the client
• A separate SSTable is created for each locality group during compaction
• Increases read efficiency, because columns that are grouped together are usually accessed together
• Used to organize the underlying storage representation for performance
  - Scans over one locality group are O(bytes_in_locality_group), not O(bytes_in_table)
• Data in a locality group can be explicitly memory-mapped

Refinements
• Compression
  - Clients can control SSTable compression for a locality group
• Caching
  - Scan Cache: a higher-level cache that caches key-value pairs returned by the SSTable interface
  - Block Cache: a lower-level cache that caches SSTable blocks read from the file system
• Bloom filters
  - Let a reader ask whether an SSTable might contain any data for a given row/column pair
  - Reduce disk accesses while reading SSTables

Example: Cassandra
• Initially developed by Facebook for inbox search
• Built on the Bigtable data model
• Provides a structured key-value store
• Keys map to multiple values, which are grouped into column families
• Data model as used by Cassandra
  - A table in Cassandra is a distributed multidimensional map indexed by a key
  - The row key in a table is a string with no size restrictions
• Usually a four-dimensional map
  - Keyspace -> Column Family
  - Column Family -> Column Family Row
  - Column Family Row -> Columns
  - Column -> Data value

Cassandra: Column

    Column {
      name: "emailAddress",
      value: "arin@example.com",
      timestamp: 123456789
    }

Cassandra: SuperColumn

    SuperColumn {
      name: "homeAddress",
      value: {
        street: {name: "street", value: "1234 x street", timestamp: 123456789},
        city: {name: "city", value: "san francisco", timestamp: 123456789},
        zip: {name: "zip", value: "94107", timestamp: 123456789}
      }
    }

Cassandra: ColumnFamily

    Column Family UserProfile = {
      ahossain: {
        username: "ahossain",
        email: "ahossain@example.com",
        phone: "(210) 123-4567"
      },
      jdoe: {
        username: "jdoe",
        email: "jdoe@example.com",
        phone: "(210) 765-4321",
        age: "66",
        gender: "male"
      }
    }

Example: Pelops (Write)

    String pool = "pool";
    String keyspace = "mykeyspace";
    String colFamily = "users";
    String rowKey = "abc123";

    // Describe the cluster (a node on localhost, Thrift port 9160) and register a connection pool.
    Cluster cluster = new Cluster("localhost", 9160);
    Pelops.addPool(pool, cluster, keyspace);

    // Batch two column writes for one row, then send them in a single call.
    Mutator mutator = Pelops.createMutator(pool);
    mutator.writeColumns(
        colFamily, rowKey,
        mutator.newColumnList(
            mutator.newColumn("name", "Dan"),
            mutator.newColumn("age", Bytes.fromInt(33))
        )
    );
    mutator.execute(ConsistencyLevel.ONE);

Example: Pelops (Read)

    // Fetch all columns of the row written above and print two of them.
    Selector selector = Pelops.createSelector(pool);
    List<Column> columns = selector.getColumnsFromRow(
        colFamily, rowKey, false, ConsistencyLevel.ONE);
    System.out.println("Name: " + Selector.getColumnStringValue(columns, "name"));
    System.out.println("Age: " + Selector.getColumnValue(columns, "age").toInt());

Thank you
• Questions?
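
As a supplement to the Locating Tablets slides, the lookup can be sketched with sorted maps. This is only an illustrative, single-process model and not Bigtable's actual code: the class `TabletLocator`, its method names, the `"tableId,endRow"` key encoding, and the `LAST` sentinel are all assumptions made for the example. Level 1 (the Chubby file that points at the root tablet) is represented simply by the object holding a reference to its root map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the lookup path: root tablet -> METADATA tablet -> user tablet. */
final class TabletLocator {
    /** Hypothetical sentinel end row for the last tablet of a table. */
    static final String LAST = "\uffff";

    // Level 2: the root tablet maps the last key of each METADATA tablet to that tablet's name.
    private final TreeMap<String, String> rootTablet = new TreeMap<>();
    // Level 3: each METADATA tablet maps "tableId,endRow" to a tablet server address.
    private final Map<String, TreeMap<String, String>> metadataTablets = new HashMap<>();

    /** Row key that encodes the table identifier and the tablet's end row. */
    static String key(String tableId, String endRow) {
        return tableId + "," + endRow;
    }

    void registerMetadataTablet(String name, String lastKey) {
        rootTablet.put(lastKey, name);
        metadataTablets.put(name, new TreeMap<>());
    }

    void registerUserTablet(String tableId, String endRow, String server) {
        String k = key(tableId, endRow);
        // The METADATA tablet whose range covers k is the first one whose last key is >= k.
        Map.Entry<String, String> owner = rootTablet.ceilingEntry(k);
        if (owner == null) {
            throw new IllegalStateException("no METADATA tablet covers " + k);
        }
        metadataTablets.get(owner.getValue()).put(k, server);
    }

    /** Returns the server of the tablet whose row range covers the row, or null if none does. */
    String locate(String tableId, String row) {
        String k = key(tableId, row);
        // Walk METADATA tablets in key order until one holds an entry with end row >= row.
        for (String name : rootTablet.tailMap(k, true).values()) {
            Map.Entry<String, String> hit = metadataTablets.get(name).ceilingEntry(k);
            if (hit != null) {
                // The first entry at or after k only counts if it belongs to the same table.
                return hit.getKey().startsWith(tableId + ",") ? hit.getValue() : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TabletLocator locator = new TabletLocator();
        locator.registerMetadataTablet("meta-1", key("users", "m"));
        locator.registerMetadataTablet("meta-2", LAST);
        locator.registerUserTablet("users", "g", "server-a");
        locator.registerUserTablet("users", LAST, "server-b");
        System.out.println(locator.locate("users", "alice")); // server-a
        System.out.println(locator.locate("users", "kim"));   // server-b
    }
}
```

Because the key stores the end row, a single "first key at or after the target" query finds the covering tablet, which is why a sorted map (here `TreeMap.ceilingEntry`) is a natural fit for this sketch.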

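The Tablet Serving and Compaction slides can likewise be illustrated with a small in-memory model. The sketch below is an assumption-laden toy, not the real tablet server: `TabletSketch`, its methods, the NUL-separated cell key, and the size-based compaction threshold are hypothetical, and clearing the log stands in for advancing a redo point. It shows a write going to the commit log and then the memtable, a minor compaction freezing the memtable into an immutable SSTable, and a read served from a merged view of both.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of one tablet: a commit log, a memtable, and immutable SSTables. */
final class TabletSketch {
    private final List<String> commitLog = new ArrayList<>();                  // redo records
    private TreeMap<String, String> memtable = new TreeMap<>();                // recent updates, sorted
    private final List<TreeMap<String, String>> sstables = new ArrayList<>();  // newest first
    private final int memtableLimit;                                           // hypothetical threshold

    TabletSketch(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    /** Cell key; the inverted, zero-padded timestamp makes newer versions sort first. */
    static String cellKey(String row, String column, long timestamp) {
        return row + "\0" + column + "\0"
                + String.format(Locale.ROOT, "%019d", Long.MAX_VALUE - timestamp);
    }

    void write(String row, String column, long timestamp, String value) {
        String key = cellKey(row, column, timestamp);
        commitLog.add(key + "=" + value);   // commit to the log first
        memtable.put(key, value);           // then apply to the memtable
        if (memtable.size() >= memtableLimit) {
            minorCompaction();
        }
    }

    /** Freezes the memtable into a new SSTable and starts an empty memtable. */
    void minorCompaction() {
        if (memtable.isEmpty()) {
            return;
        }
        sstables.add(0, memtable);
        memtable = new TreeMap<>();
        commitLog.clear();  // sketch: these redo records are now covered by the SSTable
    }

    /** Rewrites all SSTables (and the memtable) into exactly one SSTable. */
    void majorCompaction() {
        minorCompaction();
        TreeMap<String, String> merged = new TreeMap<>();
        for (int i = sstables.size() - 1; i >= 0; i--) {
            merged.putAll(sstables.get(i));  // oldest first, so newer tables win on equal keys
        }
        sstables.clear();
        if (!merged.isEmpty()) {
            sstables.add(merged);
        }
    }

    /** Reads the most recent version of a cell from the merged view, or null if absent. */
    String readLatest(String row, String column) {
        String prefix = row + "\0" + column + "\0";
        List<TreeMap<String, String>> sources = new ArrayList<>(sstables);
        sources.add(0, memtable);
        String bestKey = null;
        String bestValue = null;
        for (TreeMap<String, String> source : sources) {
            Map.Entry<String, String> e = source.ceilingEntry(prefix);
            if (e != null && e.getKey().startsWith(prefix)
                    && (bestKey == null || e.getKey().compareTo(bestKey) < 0)) {
                bestKey = e.getKey();
                bestValue = e.getValue();
            }
        }
        return bestValue;
    }

    int sstableCount() { return sstables.size(); }

    int pendingRedoRecords() { return commitLog.size(); }

    public static void main(String[] args) {
        TabletSketch tablet = new TabletSketch(2);
        tablet.write("com.cnn.www", "contents:", 1, "<html>v1");
        tablet.write("com.cnn.www", "anchor:cnnsi.com", 1, "CNN"); // reaches the limit: minor compaction
        tablet.write("com.cnn.www", "contents:", 2, "<html>v2");
        System.out.println(tablet.readLatest("com.cnn.www", "contents:")); // <html>v2
        System.out.println(tablet.sstableCount());                          // 1
    }
}
```

Note that the read consults every source and keeps the smallest key rather than stopping at the memtable, since a client-supplied timestamp could place the newest version in an older SSTable.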