
Bigtable: A Distributed Storage
System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh,
Deborah A. Wallach
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google’s NoSQL Solution
Chao Wang
wang660@usc.edu
2013/4/1
Webtable Example
How many web pages are there?
Recently Google reported finding 1 trillion unique URLs, which
would require 80 terabytes to store.
 How much storage is required to hold a single snapshot of
the Web?
1 trillion web pages at 100 KB per page requires 100 petabytes.
 How is the data stored in Bigtable?
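The back-of-envelope figure above is easy to verify; a minimal check, using the round numbers from this slide:

```python
# Check of the slide's arithmetic (round figures from the slide, not measurements).
pages = 10**12            # 1 trillion unique URLs
bytes_per_page = 100_000  # ~100 KB per page snapshot

total_bytes = pages * bytes_per_page
petabytes = total_bytes / 10**15
print(petabytes)  # 100.0 -> one snapshot of the Web is ~100 petabytes
```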

Introduction
Bigtable is a distributed storage system for
managing structured data that is designed to
scale to a very large size: petabytes of data
across thousands of commodity servers.
 Many projects at Google store data in
Bigtable, including web indexing, Google
Analytics, Google Finance, Orkut,
Personalized Search, Writely, and Google
Earth.
 Bigtable has achieved several goals: wide
applicability, scalability, high performance, and
high availability.

Data Model

A Bigtable is a sparse, distributed,
persistent multidimensional sorted map.
The map is indexed by a row key, column
key, and a timestamp; each value in the
map is an uninterpreted array of bytes.
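This definition can be made concrete with a toy sketch: a map keyed by (row key, column key, timestamp) whose values are uninterpreted byte strings. The function names below are illustrative, not Bigtable's actual API.

```python
# Toy sketch of Bigtable's data model: a map from
# (row key, column key, timestamp) -> uninterpreted byte string.
table = {}

def put(row, column, timestamp, value: bytes):
    table[(row, column, timestamp)] = value

def get(row, column):
    """Return the most recent version of a cell (highest timestamp)."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", 1, b"<html>v1</html>")
put("com.cnn.www", "contents:", 2, b"<html>v2</html>")
print(get("com.cnn.www", "contents:"))  # b'<html>v2</html>'
```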
Data Model
Rows
The row keys in a table are arbitrary strings (currently up to 64KB in size,
although 10-100 bytes is a typical size for most of our users).
Bigtable maintains data in lexicographic order by row key. The row range
for a table is dynamically partitioned. Each row range is called a tablet, which
is the unit of distribution and load balancing.
 Column Families
Column keys are grouped into sets called column families, which form the
basic unit of access control.
A column key is named using the following syntax: family:qualifier. Column
family names must be printable, but qualifiers may be arbitrary strings.
 Timestamps
Each cell in a Bigtable can contain multiple versions of the same data; these
versions are indexed by timestamp. Bigtable timestamps are 64-bit integers.
They can be assigned by Bigtable, in which case they represent “real time” in
microseconds, or be explicitly assigned by client applications.

Webtable Example
Rows
In Webtable, pages in the same domain are grouped together into
contiguous rows by reversing the hostname components of the URLs. For
example, we store data for maps.google.com/index.html under the key
com.google.maps/index.html. Storing pages from the same domain near
each other makes some host and domain analyses more efficient.
 Column Families
An example column family for the Webtable is language, which stores
the language in which a web page was written. We use only one column
key in the language family, and it stores each web page's language ID.
 Timestamps
In our Webtable example, we set the timestamps of the crawled pages
stored in the contents: column to the times at which these page versions
were actually crawled. The garbage-collection mechanism lets us keep
only the most recent several versions (a number we specify) of every page.
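The reversed-hostname row-key scheme described above can be sketched in a few lines (the function name is mine, not from the paper):

```python
# Sketch of the Webtable row-key scheme: reverse the hostname components
# of a URL so that pages from the same domain sort into contiguous rows.
def webtable_row_key(url: str) -> str:
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + ("/" + path if path else "")

print(webtable_row_key("maps.google.com/index.html"))
# com.google.maps/index.html
```

Because Bigtable keeps rows in lexicographic order, all `com.google.*` keys end up adjacent, which is what makes domain-level scans cheap.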

API

The Bigtable API provides functions for creating and
deleting tables and column families. It also provides
functions for changing cluster, table, and column family
metadata, such as access control rights.
API
Bigtable supports single-row transactions, which
can be used to perform atomic read-modify-write
sequences on data stored under a single row key.
 Bigtable allows cells to be used as integer
counters.
 Bigtable supports the execution of client-supplied
scripts in the address spaces of the servers. The
scripts are written in a language developed at
Google for processing data called Sawzall.
 Bigtable can be used with MapReduce, a
framework for running large-scale parallel
computations developed at Google.
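A single-row atomic read-modify-write, such as incrementing a cell used as an integer counter, can be sketched as follows. A per-row lock stands in here for Bigtable's single-row transaction mechanism; this is an illustration, not Bigtable's client API.

```python
import threading

# Sketch of a single-row atomic read-modify-write.
# A per-row lock models the guarantee that no other writer
# can interleave with the sequence on this row key.
class Row:
    def __init__(self):
        self._lock = threading.Lock()
        self.cells = {}

    def read_modify_write(self, column, fn):
        with self._lock:  # read, modify, and write as one atomic step
            self.cells[column] = fn(self.cells.get(column, 0))
            return self.cells[column]

row = Row()
# Using a cell as an integer counter: increment atomically.
for _ in range(3):
    row.read_modify_write("anchor:count", lambda v: v + 1)
print(row.cells["anchor:count"])  # 3
```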

Building Blocks
Bigtable is built on several other pieces of Google
infrastructure. Bigtable uses the distributed Google File
System (GFS) to store log and data files.
 The Google SSTable file format is used internally to store
Bigtable data. An SSTable provides a persistent, ordered
immutable map from keys to values, where both keys and
values are arbitrary byte strings.
 Bigtable relies on a highly-available and persistent distributed
lock service called Chubby. Bigtable uses Chubby for a
variety of tasks: to ensure that there is at most one active
master at any time; to store the bootstrap location of
Bigtable data; to discover tablet servers and finalize tablet
server deaths; to store Bigtable schema information (the
column family information for each table); and to store
access control lists.
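The SSTable abstraction above (a persistent, ordered, immutable map from byte-string keys to byte-string values) can be sketched as a structure that is sorted once at build time and then only read. Real SSTables also carry a block index so lookups avoid full scans; this toy version uses binary search instead.

```python
import bisect

# Toy sketch of an SSTable-like structure: an immutable, ordered map
# from byte-string keys to byte-string values.
class SSTable:
    def __init__(self, items):
        # Sort once when the table is written; never modified afterwards.
        self._items = sorted(items)
        self._keys = [k for k, _ in self._items]

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

t = SSTable([(b"b", b"2"), (b"a", b"1")])
print(t.get(b"a"))  # b'1'
```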

Implementation
The Bigtable implementation has three major components: a
library that is linked into every client, one master server, and
many tablet servers.
 The master is responsible for assigning tablets to tablet
servers, detecting the addition and expiration of tablet
servers, balancing tablet-server load, and garbage collection
of files in GFS.
 Each tablet server manages a set of tablets. The tablet server
handles read and write requests to the tablets that it has
loaded, and also splits tablets that have grown too large.
 As with many single-master distributed storage systems,
client data does not move through the master: clients
communicate directly with tablet servers for reads and
writes.

Implementation
Tablet Location
Bigtable uses a three-level hierarchy, analogous to a
B+-tree, to store tablet location information.

Tablet Assignment
Each tablet is assigned to one tablet server at a time.
Bigtable uses Chubby to keep track of tablet servers.
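The capacity of the three-level location hierarchy is easy to check with the figures given in the Bigtable paper (roughly 1 KB per METADATA row, METADATA tablets capped at 128 MB):

```python
# Rough capacity check for the three-level tablet-location hierarchy,
# using the paper's figures: ~1 KB per METADATA row, 128 MB METADATA tablets.
rows_per_metadata_tablet = (128 * 2**20) // 2**10   # 2**17 = 131,072 rows
addressable_tablets = rows_per_metadata_tablet ** 2  # two METADATA levels
print(addressable_tablets == 2**34)  # True: ~17 billion tablets addressable
```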

Implementation

Tablet Serving
Compactions
As write operations execute, the size of the memtable increases.
When the memtable size reaches a threshold, the memtable is frozen,
a new memtable is created, and the frozen memtable is converted to
an SSTable and written to GFS.
A merging compaction that rewrites all SSTables into exactly one
SSTable is called a major compaction.
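The minor-compaction flow described above (writes accumulate in a memtable; past a size threshold it is frozen, a fresh memtable starts, and the frozen one becomes an SSTable) can be sketched like this. The threshold and names are illustrative only.

```python
# Sketch of a minor compaction: flush the memtable to an "SSTable"
# (here just a frozen, sorted list) once it crosses a size threshold.
THRESHOLD = 3  # tiny threshold for demonstration

memtable = {}
sstables = []  # stands in for SSTable files written to GFS

def write(key, value):
    global memtable
    memtable[key] = value
    if len(memtable) >= THRESHOLD:
        sstables.append(sorted(memtable.items()))  # freeze and flush
        memtable = {}                              # start a new memtable

for i in range(7):
    write(f"row{i}", f"v{i}")
print(len(sstables), len(memtable))  # 2 1
```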

Refinements
Locality groups
Clients can group multiple column families
together into a locality group. A separate SSTable is
generated for each locality group in each tablet.
Segregating column families that are not typically
accessed together into separate locality groups
enables more efficient reads.
 Compression
Clients can control whether or not the SSTables
for a locality group are compressed, and if so, which
compression format is used. The user-specified
compression format is applied to each SSTable
block.

Refinements
Caching for read performance
To improve read performance, tablet servers use two levels
of caching. The Scan Cache is a higher-level cache that caches
the key-value pairs returned by the SSTable interface to the
tablet server code. The Block Cache is a lower-level cache that
caches SSTable blocks that were read from GFS.
 Bloom filters
A Bloom filter allows us to ask whether an SSTable might
contain any data for a specified row/column pair. For certain
applications, a small amount of tablet server memory used for
storing Bloom filters drastically reduces the number of disk
seeks required for read operations.
 Commit-log implementation
Using one log provides significant performance benefits
during normal operation, but it complicates recovery.
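The Bloom-filter idea above can be sketched in a few lines: the filter answers "definitely absent" or "maybe present" for a row/column key, so most negative lookups never touch an SSTable on disk. This is a generic Bloom filter, not Bigtable's implementation, and the parameters are illustrative.

```python
import hashlib

# Minimal Bloom filter sketch: k hash positions set in an m-bit vector.
class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # bit vector stored as one big integer

    def _positions(self, key: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means definitely absent; True means maybe present.
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/contents:")
print(bf.might_contain("com.cnn.www/contents:"))    # True
print(bf.might_contain("com.google.maps/anchor:"))  # almost certainly False
```

A read that gets `False` can skip the SSTable entirely, which is where the disk-seek savings come from.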

Refinements
Speeding up tablet recovery
If the master moves a tablet from one tablet server to another, the
source tablet server first does a minor compaction on that tablet.
After finishing this compaction, the tablet server stops serving the
tablet. Before it actually unloads the tablet, the tablet server does
another (usually very fast) minor compaction to eliminate any
remaining uncompacted state in the tablet server's log that arrived
while the first minor compaction was being performed.
After this second minor compaction is complete, the tablet can be
loaded on another tablet server without requiring any recovery of log
entries.
 Exploiting immutability
Besides the SSTable caches, various other parts of the Bigtable
system have been simplified by the fact that all of the SSTables that we
generate are immutable.

Pros
The paper introduces the structure and function of
Bigtable comprehensively, and discusses how
Bigtable addresses different requirements.
 It also shares the lessons learned during the
process of designing Bigtable.

Cons

According to Professor Eric Brewer's CAP
theorem, a distributed system cannot provide
consistency, availability, and partition
tolerance all at the same time. As a typical
AP database, Bigtable has consistency as its
weakness, and this is not discussed in the paper.