6.2-BigTable

Bigtable: A Distributed Storage
System for Structured Data
Fay Chang et al. (Google, Inc.)
Presenter: Kyungho Jeon
kyunghoj@buffalo.edu
10/22/2012
Fall 2012: CSE 704 Web-scale Data
Management
1
Motivation and Design Goal
• Distributed Storage System for Structured
Data
– Scalability
• Petabytes of data on Thousands of (commodity)
machines
– Wide Applicability
• Throughput-oriented and Latency-sensitive
– High Performance
– High Availability
Data Model
• Not a Full Relational Data Model
• Provides a simple data model
– Supports Dynamic Control over Data Layout
– Allows clients to reason about the locality
properties
Data Model – A Big Table
• A Table in Bigtable is a:
– Sparse
– Distributed
– Persistent
– Multidimensional
– Sorted map
Data Model
• Data is indexed using row and column names
• Data is treated as uninterpreted strings
– (row:string, column:string, time:int64) → string
• Data locality can be controlled through careful
choices of the schema
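As a rough illustration of the map signature above, here is a minimal in-memory sketch in Python. The class name `TinyTable`, its methods, and the example row keys are all hypothetical; real Bigtable distributes and persists this map, which the toy does not attempt.

```python
from bisect import insort


class TinyTable:
    """Toy model of (row, column, timestamp) -> value, kept in sorted order."""

    def __init__(self):
        # Sorted list of ((row, column, -timestamp), value). Negating the
        # timestamp makes newer versions of a cell sort first.
        self._cells = []

    def put(self, row, column, timestamp, value):
        insort(self._cells, ((row, column, -timestamp), value))

    def get(self, row, column):
        """Return the newest value stored in a cell, or None if absent."""
        for (r, c, _), value in self._cells:
            if (r, c) == (row, column):
                return value
        return None

    def scan_rows(self, start_row, end_row):
        """Yield (row, column, timestamp, value) for start_row <= row < end_row."""
        for (r, c, neg_ts), value in self._cells:
            if start_row <= r < end_row:
                yield r, c, -neg_ts, value
```

Because the rows are kept in lexicographic order, rows whose keys share a prefix end up next to each other, which is how the choice of row key controls locality.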
Data Model
• Rows
– Data maintained in lexicographic order by row key
– Tablet: rows with consecutive keys
• Units of distribution and load balancing
• Columns
– Column families
• Family:qualifier
• Cells
• Timestamps
Data Model – WebTable Example
A large collection of web pages and related information
• Row key
– Bigtable maintains data in lexicographic order by row key
– Tablet: a group of rows with consecutive keys; the unit of distribution
• Column family
– The unit of access control
• Column
– A column key is specified as “column family:qualifier”
– A column can be added to a column family once the family has been created
• Cell
– The storage referenced by a particular row key, column key, and timestamp
– Each cell can contain multiple versions, indexed by timestamp
API
• Write or Delete values in Bigtable
• Look up values from individual rows
• Iterate over a subset of the data in a table
API – Update a Row
• Open a table
• Prepare to mutate the row
• Store a new item under the column key “anchor:www.cspan.org”
• Delete the item under the column key “anchor:www.abc.com”
• Apply the mutation atomically
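The row-update steps can be sketched in Python against a plain dictionary. This is not the Bigtable client API: `RowMutation`, `apply`, the row key `com.example.www`, and the stored values are hypothetical stand-ins; only the two column keys come from the slides.

```python
class RowMutation:
    """Collects changes to a single row so they can be applied together."""

    def __init__(self, row_key):
        self.row_key = row_key
        self.ops = []  # ("set", column, value) or ("delete", column, None)

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column, None))


def apply(table, mutation):
    """Apply every op of a mutation to one row in a single step.

    `table` is a dict of row_key -> {column: value}. The new row is built
    on a copy and swapped in at the end, so a reader of `table` never sees
    a half-applied mutation.
    """
    new_row = dict(table.get(mutation.row_key, {}))
    for kind, column, value in mutation.ops:
        if kind == "set":
            new_row[column] = value
        else:
            new_row.pop(column, None)
    table[mutation.row_key] = new_row


# Usage (hypothetical row key and values):
table = {"com.example.www": {"anchor:www.abc.com": "ABC"}}
r1 = RowMutation("com.example.www")
r1.set("anchor:www.cspan.org", "Example")
r1.delete("anchor:www.abc.com")
apply(table, r1)
```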
API – Iterate over a Table
• Create a Scanner instance
• Access the “anchor” column family
• Specify “return all versions”
• Specify a row key
• Iterate over rows
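A scan restricted to one column family, returning all versions, might look like the following Python sketch. The function name, the dictionary layout, and the example data are assumptions for illustration, not the real Scanner interface.

```python
def scan_versions(table, row_key, family="anchor"):
    """Yield (column, timestamp, value) for every version in one column family.

    `table` maps row_key -> {column: [(timestamp, value), ...]}.
    Columns come back in sorted order, versions newest first.
    """
    row = table.get(row_key, {})
    prefix = family + ":"
    for column in sorted(row):
        if column.startswith(prefix):
            for timestamp, value in sorted(row[column], reverse=True):
                yield column, timestamp, value
```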
API – Other Features
• Single row transaction
• Client-supplied scripts in the address space of
the server
• Input source/Output target for MapReduce
jobs
A Typical Google Machine
A Google Cluster
Building Blocks
• Chubby
– Highly-available and persistent distributed lock
service
• GFS
– Store logs and data files
– SSTable
• Google’s immutable file format
• A persistent, ordered immutable map from keys to
values
• http://code.google.com/p/leveldb/
Chubby
• Highly-available and persistent distributed lock
service
– 5 replicas, one is elected as a master
– Paxos
– Provides a namespace that consists of directories
and small files
Implementation
• Client Library
• Master
– one and only one!
• Tablet Servers
– Many
Implementation - Master
• Responsible for assigning tablets to tablet
servers
– Addition/removal of tablet server
– Tablet-server load balancing
– Garbage collecting files in GFS
• Handles schema changes
• Single-master system (as in GFS)
Tablet Server
• Manages a set of tablets
• Handles read and write requests to the tablets
• Splits tablets that have grown too large
How Does a Client Find a Tablet?
Tablet Assignment
• Each tablet is assigned to at most one tablet
server at a time
• When a tablet is unassigned, and a tablet
server is available, the master assigns the
tablet by sending a tablet load request
• Bigtable uses Chubby to keep track of tablet
servers
Tablet Assignment
• Detecting a tablet server that is no longer serving its
tablets
– The master periodically asks each tablet server for the status of
its lock
– If a tablet server reports that it has lost its lock, or if the master
cannot reach a tablet server, the master attempts to acquire an
exclusive lock on the server’s file
– If the lock is acquired, Chubby is alive, so the tablet
server must have a problem
– The master deletes the server’s file in Chubby to ensure that the
tablet server can never serve again
– The master then moves all the tablets that were previously
assigned to that server into the set of unassigned tablets
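The detection steps above can be traced with a small Python sketch over in-memory state. Everything here is a simplification: Chubby is modeled as a dictionary of lock files, and the function and parameter names are hypothetical.

```python
def reclaim_dead_servers(servers, chubby_locks, assignments, unassigned):
    """Run one pass of the master's check against toy in-memory state.

    servers:      {server: True if it answered and still holds its lock}
    chubby_locks: {server: current holder of the server's lock file, or None}
    assignments:  {tablet: server}
    unassigned:   set of tablets waiting for a server
    """
    for server, healthy in servers.items():
        if healthy:
            continue
        # Try to take the exclusive lock on the server's file.
        if chubby_locks.get(server) is not None:
            continue  # someone still holds it; nothing can be concluded yet
        chubby_locks[server] = "master"
        # The lock was acquired, so Chubby is alive and the server is at fault.
        # Deleting the file ensures the server can never serve again.
        del chubby_locks[server]
        for tablet, owner in list(assignments.items()):
            if owner == server:
                del assignments[tablet]
                unassigned.add(tablet)
```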
Tablet Assignment
• When a master is started, the master…
– Grabs a unique master lock in Chubby
– Scans the servers directory in Chubby to find the
live servers
– Communicates with every live tablet server to
discover the current tablet assignment
– Scans the METADATA table and adds unassigned
tablets to the set of unassigned tablets
Tablet Serving
• Memtable
– A sorted buffer
– Maintains the updates on a row-by-row basis
– Each row is copy-on-write to maintain row-level
consistency
– Older updates are stored in a sequence of SSTables
Tablet Serving - Write
• Write operation
– The server checks if the operation is valid
– A valid mutation is written to the commit log
– After the write has been committed, its contents
are inserted into the memtable
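The ordering of the write path (validate, log, then memtable) is sketched below in Python. The `Tablet` class is hypothetical, the commit log is a plain list standing in for a file in GFS, and the validity check is a placeholder.

```python
class Tablet:
    """Toy tablet with a commit log and a memtable."""

    def __init__(self):
        self.commit_log = []  # append-only; stands in for the log in GFS
        self.memtable = {}    # (row, column) -> value

    def write(self, row, column, value):
        # 1. Check that the operation is valid (placeholder: non-empty keys).
        if not row or not column:
            raise ValueError("row and column must be non-empty")
        # 2. Record the mutation in the commit log first.
        self.commit_log.append((row, column, value))
        # 3. Only after the log append, make the write visible in the memtable.
        self.memtable[(row, column)] = value
```

Logging before touching the memtable means a crash can lose memory contents but not committed writes, since they can be replayed from the log.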
Tablet Serving - Read
• Read operation
– Check if the operation is valid
– A valid operation is executed on a merged view of
the sequence of SSTables and the memtable
– The merged view can be formed efficiently since
SSTables and the memtable are lexicographically
sorted data structures
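Because every source is already sorted, the merged view is a k-way merge. A minimal Python sketch using the standard `heapq.merge` follows; the function name and the data layout (a dict for the memtable, sorted lists for SSTables, newest first) are assumptions.

```python
import heapq


def merged_view(memtable, sstables):
    """Yield (key, value) in key order; the newest source wins on duplicates.

    memtable: dict key -> value (holds the most recent updates)
    sstables: list of sorted [(key, value), ...] lists, newest first
    """
    sources = [sorted(memtable.items())] + list(sstables)
    # Tag entries with the age of their source so that, for equal keys,
    # the entry from the newest source (age 0) sorts first.
    tagged = [
        [(key, age, value) for key, value in source]
        for age, source in enumerate(sources)
    ]
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*tagged):
        if key != last_key:
            yield key, value
            last_key = key
```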
Tablet Serving - Recover
• Recover a tablet
– A tablet server reads the tablet’s metadata from the
METADATA table
– The metadata contains the list of SSTables that
comprise a tablet and a set of redo points
– The server reads the indices of the SSTables into
memory and reconstructs the memtable by
applying all of the updates that have committed
since the redo points
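Rebuilding the memtable amounts to replaying the log from the redo point. A minimal Python sketch, assuming a single redo point expressed as a log sequence number (the tuple layout and function name are hypothetical):

```python
def recover_memtable(commit_log, redo_point):
    """Rebuild a memtable from log entries at or after the redo point.

    commit_log: list of (sequence_number, row, column, value), in log order
    redo_point: first sequence number not yet captured in an SSTable
    """
    memtable = {}
    for seq, row, column, value in commit_log:
        if seq >= redo_point:
            # Later entries overwrite earlier ones, as in the original run.
            memtable[(row, column)] = value
    return memtable
```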
Compaction
• Minor compaction
– When the memtable size reaches a threshold, the
memtable is frozen, a new memtable is created,
and the frozen memtable is converted to an
SSTable
• Major compaction
– Rewrites all SSTables into exactly one SSTable
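Both compactions can be sketched in a few lines of Python. This toy `Tablet` triggers a minor compaction by entry count, which is an assumption made to keep the example small; SSTables are modeled as immutable sorted tuples.

```python
class Tablet:
    """Toy tablet showing minor and major compaction."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # assumed: number of entries, not bytes
        self.memtable = {}
        self.sstables = []  # newest first; each a sorted tuple of (key, value)

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.minor_compaction()

    def minor_compaction(self):
        # Freeze the current memtable, start a new one, and convert the
        # frozen memtable into an SSTable.
        frozen, self.memtable = self.memtable, {}
        self.sstables.insert(0, tuple(sorted(frozen.items())))

    def major_compaction(self):
        merged = {}
        for sstable in reversed(self.sstables):  # oldest first; newer overwrite
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]
```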
Compaction
[Figure, built up over five slides: a write op is appended to the commit log in GFS and applied to the memtable in memory. When the threshold is reached, the memtable is written to GFS as an additional SSTable and a new memtable takes its place. A major compaction then leaves a single SSTable.]
Schema Management
• Bigtable schemas are stored in Chubby
• The master updates the schema by rewriting
the corresponding schema file in Chubby
Optimization
• Locality Group
– Client defined
– An abstraction that enables clients to control
their data’s storage layout
– A separate SSTable is generated for each locality
group in each tablet during compaction
– A locality group can be declared to be in-memory
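To make the layout effect concrete, the Python sketch below partitions cells into one sorted run per locality group, roughly what a compaction would write out as separate SSTables. The function name and the group and family names used with it are hypothetical.

```python
def split_by_locality_group(cells, groups):
    """Partition cells into one sorted run per locality group.

    cells:  {(row, "family:qualifier"): value}
    groups: {locality_group_name: set of column family names}
    Raises KeyError for a column family that belongs to no group.
    """
    family_to_group = {
        family: name for name, families in groups.items() for family in families
    }
    runs = {name: [] for name in groups}
    for (row, column), value in sorted(cells.items()):
        family = column.split(":", 1)[0]
        runs[family_to_group[family]].append(((row, column), value))
    return runs
```

A reader that only needs one group's column families then touches only that group's run, not the (possibly much larger) page contents.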
Optimization
• Compression
– Client can control whether the SSTables for a
locality group are compressed
Optimization
• Two-level Caching for Read Performance
– Scan cache:
• higher level.
• Caches the key-value pairs returned by the SSTable
interface to the tablet server code
– Block cache:
• lower level
• Caches SSTable blocks
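The two cache levels can be sketched as a key-value cache sitting in front of a block cache. In this Python sketch the class names, cache sizes, and the `read_block` / `block_of` callbacks are all hypothetical; LRU eviction is an assumption, since the slides do not name a policy.

```python
from collections import OrderedDict


class LRU:
    """Small least-recently-used cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)
            return self.items[key]
        return None

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used


class CachedReader:
    def __init__(self, read_block, block_of):
        self.read_block = read_block  # block_id -> {key: value}; a GFS read
        self.block_of = block_of      # key -> block_id; the SSTable index
        self.scan_cache = LRU(1024)   # higher level: key -> value
        self.block_cache = LRU(64)    # lower level: block_id -> block

    def get(self, key):
        value = self.scan_cache.get(key)
        if value is not None:
            return value
        block_id = self.block_of(key)
        block = self.block_cache.get(block_id)
        if block is None:
            block = self.read_block(block_id)
            self.block_cache.put(block_id, block)
        value = block.get(key)
        if value is not None:
            self.scan_cache.put(key, value)
        return value
```

The scan cache helps when the same keys are read repeatedly; the block cache helps when reads land near keys that were read recently.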
Optimization
• Commit-Log Implementation
– Using one log per tablet server
– Recovery?
• A tablet server hosting 100 tablets fails
• 100 other machines are each assigned a single tablet
• 100 reads of the full commit log?
• Instead, sort the commit log by <table, row name, log seq #>
– Writing commit logs
• Two log-writer threads
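Sorting the shared log by <table, row name, log seq #> makes each tablet's entries contiguous, so a recovering server can read one slice instead of the whole log. A small Python sketch with hypothetical function names and tuple layout:

```python
from bisect import bisect_left


def sort_log(entries):
    """Sort (table, row, seq, mutation) tuples by table, row name, log seq #."""
    return sorted(entries, key=lambda e: e[:3])


def entries_for_tablet(sorted_entries, table, start_row, end_row):
    """Return the contiguous slice for rows in [start_row, end_row) of a table."""
    keys = [e[:2] for e in sorted_entries]
    lo = bisect_left(keys, (table, start_row))
    hi = bisect_left(keys, (table, end_row))
    return sorted_entries[lo:hi]
```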
Performance Evaluation
• Sequential writes/reads
– Row keys with names 0 to R-1, partitioned into 10N equal-sized
ranges
– Wrote a single string under each row key
– 1GB / tablet server
• Scan
– Uses Bigtable Scan API
• Random writes/reads
– Similar to Sequential write/read, but the row key was hashed
• Random reads (Mem)
– 100MB / tablet server, the locality group is marked as in-memory
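The benchmark key schemes can be illustrated with a short Python sketch. The fixed-width key format, the use of SHA-256 as the hash, and the function names are assumptions; the slides only say that keys run from 0 to R-1, are split into 10N ranges, and are hashed for the random variants.

```python
import hashlib


def sequential_keys(r):
    """Row keys named 0 to r-1, zero-padded so they sort numerically."""
    return [f"{i:010d}" for i in range(r)]


def partition(r, n):
    """Split [0, r) into 10*n contiguous ranges of near-equal size."""
    parts = 10 * n
    bounds = [r * i // parts for i in range(parts + 1)]
    return list(zip(bounds, bounds[1:]))


def hashed_key(i, r):
    """Map index i to a pseudo-random row key in [0, r)."""
    digest = hashlib.sha256(str(i).encode()).digest()
    return f"{int.from_bytes(digest[:8], 'big') % r:010d}"
```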
Single Tablet Server Performance
Aggregate Throughput
Real Applications
Lessons Learned
• Failures!
• Delay new features until it is clear how the
new features will be used
• Monitoring
• Simple Design!
Acknowledgement
• Jeff Dean, “Handling Large Datasets at Google:
Current Systems and Future Directions”