slides - Jiaheng Lu

Cloud Computing and Cloud Data Management
Jiaheng Lu
Renmin University of China
www.jiahenglu.net
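Outline
• Overview of cloud computing
• Google cloud computing technologies: GFS, Bigtable, and MapReduce
• Yahoo cloud computing technologies and Hadoop
• Challenges of cloud data management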
Cloud computing
Why do we use cloud computing?
Case 1:
Write a file
Save
Computer down, file is lost
Files are always stored in cloud, never lost
Why do we use cloud computing?
Case 2:
Use IE --- download, install, use
Use QQ --- download, install, use
Use C++ --- download, install, use
……
Get the service from the cloud
What is cloud and cloud
computing?
Cloud
On-demand resources or services over the Internet,
with the scale and reliability of a data center.
What is cloud and cloud
computing?
Cloud computing is a style of computing
in which dynamically scalable and often
virtualized resources are provided as a
service over the Internet.
Users need not have knowledge of,
expertise in, or control over the technology
infrastructure in the "cloud" that supports
them.
Characteristics of cloud
computing

Virtual.
software, databases, Web servers,
operating systems, storage and networking
as virtual servers.

On demand.
add and subtract processors, memory,
network bandwidth, storage.
Types of cloud service
SaaS
Software as a Service
PaaS
Platform as a Service
IaaS
Infrastructure as a Service
SaaS
Software delivery model




No hardware or software to manage
Service delivered through a browser
Customers use the service on demand
Instant Scalability
SaaS
Examples

Your current CRM package is not
managing the load or you simply don’t
want to host it in-house. Use a SaaS
provider such as Salesforce.com

Your email is hosted on an exchange
server in your office and it is very slow.
Outsource this using Hosted Exchange.
PaaS
Platform delivery
model



Platforms are built upon
Infrastructure, which is expensive
Estimating demand is not a science!
Platform management is not fun!
PaaS
Examples

You need to host a large file (5Mb) on your
website and make it available for 35,000 users for
only two months duration. Use Cloud Front from
Amazon.

You want to start storage services on your network
for a large number of files and you do not have the
storage capacity…use Amazon S3.
IaaS
Computer infrastructure
delivery model



A platform virtualization
environment
Computing resources, such as
storing and processing capacity.
Virtualization taken a step further
IaaS
Examples

You want to run a batch job but you don’t
have the infrastructure necessary to run it
in a timely manner. Use Amazon EC2.

You want to host a website, but only for a
few days. Use Flexiscale.
Cloud computing and other computing
techniques
The 21st Century Vision Of Computing
Leonard Kleinrock , one of the chief scientists
of the original Advanced Research Projects
Agency Network (ARPANET) project which
seeded the Internet, said: “
As of now, computer networks are still in their
infancy, but as they grow up and become
sophisticated, we will probably see the
spread of ‘computer utilities’ which, like
present electric and telephone utilities, will
service individual homes and offices across
the country.”
The 21st Century Vision Of Computing
Sun Microsystems
co-founder Bill Joy
The 21st Century Vision Of Computing
Definitions
utility
Cluster
Grid
Cloud
Definitions
Utility computing is the
packaging of computing
resources, such as
computation and storage, as a
metered service similar to a
traditional public utility
utility
Cluster
Grid
Cloud
Definitions
utility
A computer cluster is a group
of linked computers, working
together closely so that in
many respects they form a
single computer.
Cluster
Grid
Cloud
Definitions
utility
Grid computing is the application
of several computers to a single
problem at the same time —
usually to a scientific or technical
problem that requires a great
number of computer processing
cycles or access to large amounts
of data
Cluster
Grid
Cloud
Definitions
utility
Cloud computing is a style of
computing in which dynamically
scalable and often virtualized
resources are provided as a service
over the Internet.
Cluster
Grid
Cloud
Grid Computing & Cloud Computing


Share a lot in common:
intention, architecture, and technology
Differences:
programming model, business model,
compute model, applications, and
virtualization.
Grid Computing & Cloud Computing

the problems are mostly the same



manage large facilities;
define methods by which consumers
discover, request and use resources
provided by the central facilities;
implement the often highly parallel
computations that execute on those
resources.
Grid Computing & Cloud Computing

Virtualization
 Grid


Grids do not rely on virtualization as much
as Clouds do; each individual
organization maintains full control of
its resources
Cloud

an indispensable ingredient for
almost every Cloud
Any questions or comments?
2015/4/13
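Outline
• Overview of cloud computing
• Google cloud computing technologies: GFS, Bigtable, and MapReduce
• Yahoo cloud computing technologies and Hadoop
• Challenges of cloud data management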
Google Cloud computing techniques
Cloud Systems

MapReduce









BigTable
HBase
HyperTable
Hive
HadoopDB
GreenPlum
CouchDB
Voldemort
PNUTS
SQL Azure
OSDI’06
BigTable-like
VLDB’09
VLDB’09
DBMS-based
VLDB’08
The Google File System
The Google File System
(GFS)
A scalable distributed file system for
large distributed data intensive
applications
Multiple GFS clusters are currently
deployed.
The largest ones have:
1000+ storage nodes
300+ TeraBytes of disk storage
heavily accessed by hundreds of clients on distinct
machines
Introduction
Shares many same goals as previous
distributed file systems
performance, scalability, reliability, etc
GFS design has been driven by four
key observations of Google's application
workloads and technological
environment
Intro: Observations 1
1.
Component failures are the norm
constant monitoring, error detection, fault tolerance and
automatic recovery are integral to the system
2.
Huge files (by traditional standards)
Multi GB files are common
I/O operations and blocks sizes must be revisited
Intro: Observations 2
3.
Most files are mutated by appending
new data
This is the focus of performance optimization and atomicity
guarantees
4.
Co-designing the applications and
APIs benefits overall system by
increasing flexibility
The Design
Cluster consists of a single master and
multiple chunkservers and is accessed
by multiple clients
The Master
Maintains all file system metadata.
namespace, access control info, file to chunk
mappings, chunk (including replicas) location, etc.
Periodically communicates with
chunkservers in HeartBeat messages
to give instructions and check state
The Master
Helps make sophisticated chunk
placement and replication decision, using
global knowledge
For reading and writing, client contacts
Master to get chunk locations, then deals
directly with chunkservers
Master is not a bottleneck for reads/writes
Chunkservers
Files are broken into chunks. Each chunk has
an immutable, globally unique 64-bit chunk handle.
handle is assigned by the master at chunk creation
Chunk size is 64 MB
Each chunk is replicated on 3 (default)
servers
Clients
Linked to apps using the file system API.
Communicates with master and
chunkservers for reading and writing
Master interactions only for metadata
Chunkserver interactions for data
Only caches metadata information
Data is too large to cache.
Chunk Locations
Master does not keep a persistent
record of locations of chunks and
replicas.
Polls chunkservers at startup, and when
new chunkservers join/leave for this.
Stays up to date by controlling placement
of new chunks and through HeartBeat
messages (when monitoring
chunkservers)
Operation Log
Record of all critical metadata changes
Stored on Master and replicated on other
machines
Defines order of concurrent operations
Also used to recover the file system state
System Interactions:
Leases and Mutation Order
Leases maintain a mutation order across all
chunk replicas
Master grants a lease to a replica, called the
primary
The primary chooses the serial mutation order,
and all replicas follow this order
Minimizes management overhead for the Master
Atomic Record Append
Client specifies the data to write; GFS
chooses and returns the offset it writes to and
appends the data to each replica at least
once
Heavily used by Google’s Distributed
applications.
No need for a distributed lock manager
GFS chooses the offset, not the client
Atomic Record Append: How?
• Follows a control flow similar to that of other mutations
• Primary tells secondary replicas to append
at the same offset as the primary
• If the append fails at any replica, the client
retries it.
So replicas of the same chunk may contain different data,
including duplicates, whole or in part, of the same record
Atomic Record Append: How?
GFS does not guarantee that all replicas
are bitwise identical.
Only guarantees that data is written at
least once in an atomic unit.
Data must be written at the same offset for
all chunk replicas for success to be reported.
Detecting Stale Replicas
• Master keeps a chunk version number to distinguish
up-to-date and stale replicas
• Version is increased when granting a lease
• If a replica is not available, its version is not
increased
• Master detects stale replicas when chunkservers
report their chunks and versions
• Stale replicas are removed during garbage collection
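The version-number check above can be sketched in a few lines. This is only an illustration, not GFS code: the function names and the dictionary of per-chunkserver versions are hypothetical.

```python
def grant_lease(master_version, replica_versions, live_servers):
    # Bump the chunk version on the master and on every reachable replica.
    # Replicas on unreachable servers keep their old version.
    new_version = master_version + 1
    updated = {server: (new_version if server in live_servers else version)
               for server, version in replica_versions.items()}
    return new_version, updated


def stale_replicas(master_version, replica_versions):
    # A replica reporting an older version missed at least one lease grant.
    return sorted(s for s, v in replica_versions.items() if v < master_version)


# cs3 is down while the lease is granted, so it falls behind.
version, replicas = grant_lease(4, {"cs1": 4, "cs2": 4, "cs3": 4}, {"cs1", "cs2"})
stale = stale_replicas(version, replicas)
```

When cs3 comes back and reports version 4, the master sees that it is behind and can schedule that replica for garbage collection.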
Garbage collection
When a client deletes a file, master logs it like
other changes and changes filename to a hidden
file.
Master removes files hidden for longer than 3
days when scanning file system name space
metadata is also erased
During HeartBeat messages, each chunkserver
sends the master a subset of its chunks, and the
master tells it which of those chunks have no metadata.
The chunkserver removes these chunks on its own
Fault Tolerance:
High Availability
• Fast recovery
Master and chunkservers can restart in seconds
• Chunk replication
• Master replication
“shadow” masters provide read-only access when primary
master is down
mutations not done until recorded on all master replicas
Fault Tolerance:
Data Integrity
Chunkservers use checksums to detect
corrupt data
Since replicas are not bitwise identical, chunkservers
maintain their own checksums
For reads, chunkserver verifies checksum
before sending chunk
Update checksums during writes
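A minimal sketch of checksum verification on the read path. The 64 KB block size and the use of CRC-32 are assumptions made for this example (the slide does not state them), and the function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # assumed checksum granularity; not stated on the slide


def checksum_blocks(data):
    # One CRC-32 per fixed-size block of the chunk.
    return [zlib.crc32(data[i:i + BLOCK_SIZE])
            for i in range(0, len(data), BLOCK_SIZE)]


def verified_read(data, checksums):
    # Recompute checksums before returning data; refuse to serve a corrupt replica.
    if checksum_blocks(data) != checksums:
        raise IOError("checksum mismatch: replica is corrupt")
    return data


chunk = b"x" * 200000
sums = checksum_blocks(chunk)
```

Because each chunkserver keeps checksums for its own copy, a mismatch only says that this replica is bad; the client can then read from another replica.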
Introduction to
MapReduce
MapReduce: Insight
• “Consider the problem of counting the
number of occurrences of each word in a
large collection of documents”
• How would you do it in parallel?
MapReduce Programming Model
• Inspired by the map and reduce operations
commonly used in functional programming
languages like Lisp.
• Users implement an interface of two primary
methods:
1. Map: (key1, val1) → (key2, val2)
2. Reduce: (key2, [val2]) → [val3]
Map operation

Map, a pure function written by the user, takes
an input key/value pair and produces a set of
intermediate key/value pairs.
e.g. (doc-id, doc-content)
By analogy with SQL, map can be viewed
as the group-by clause of an aggregate query.
Reduce operation
• On completion of the map phase, all the
intermediate values for a given output key
are combined into a list and given to
a reducer.
• Can be viewed as an aggregate function
(e.g., average) that is computed over all the
rows with the same group-by attribute.
Pseudo-code
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
        result += ParseInt(v);
    Emit(AsString(result));
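For readers who want to run the word count, here is a single-process Python sketch of the same idea. It is not the Google library: `run_mapreduce` is a hypothetical helper that stands in for the framework's shuffle step.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    # Sum all partial counts collected for one word.
    return word, sum(counts)


def run_mapreduce(documents):
    # Shuffle step: group intermediate values by key, as the framework would.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())


counts = run_mapreduce({"d1": "the cat sat", "d2": "the dog sat down"})
```

In a real deployment the map calls run on many machines and the grouping happens over the network, but the user still writes only the two small functions.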
MapReduce: Execution overview
MapReduce: Example
MapReduce in Parallel: Example
MapReduce: Fault Tolerance

Handled via re-execution of tasks.

Task completion committed through master

What happens if Mapper fails ?

Re-execute completed + in-progress map tasks

What happens if Reducer fails ?

Re-execute in progress reduce tasks

What happens if Master fails ?

Potential trouble !!
MapReduce:
Walk through of One more
Application
MapReduce : PageRank

PageRank models the behavior of a “random surfer”.
$$PR(x) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}$$
Here t_1, …, t_n are the pages that link to x, C(t) is the out-degree of t,
d is the damping factor, and (1 - d) is the random-jump term.

The “random surfer” keeps clicking on successive links at random
not taking content into consideration.

Distributes its pages rank equally among all pages it links to.

The damping factor models the surfer “getting bored” and
typing an arbitrary URL.
PageRank : Key Insights

Effects at each iteration are local: iteration i+1
depends only on iteration i

At iteration i, PageRank for individual nodes can be
computed independently
PageRank using MapReduce
• Use a sparse matrix representation (M)
• Map each row of M to a list of PageRank
“credit” to assign to out-link neighbours.
• These prestige scores are reduced to a
single PageRank value for a page by
aggregating over them.
PageRank using MapReduce
Map: distribute PageRank “credit” to link targets
Reduce: gather up PageRank “credit” from multiple
sources to compute new PageRank value
Iterate until
convergence
Source of Image: Lin 2008
Phase 1: Process HTML
• Map task takes (URL, page-content) pairs
and maps them to (URL, (PRinit, list-of-urls))
  • PRinit is the “seed” PageRank for URL
  • list-of-urls contains all pages pointed to by URL
• Reduce task is just the identity function
Phase 2: PageRank Distribution
• Reduce task gets (URL, url_list) and many
(URL, val) values
  • Sum vals and fix up with d to get new PR
  • Emit (URL, (new_rank, url_list))
• Check for convergence using a non-parallel
component
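The distribute-then-sum loop can be sketched in plain Python. This is a single-machine illustration under assumptions: d = 0.85, an initial rank of 1.0 for every page, and a graph without dangling pages; the function names are hypothetical.

```python
from collections import defaultdict


def pagerank_step(graph, ranks, d=0.85):
    """One iteration: graph maps url -> list of out-links, ranks maps url -> PR."""
    credit = defaultdict(float)
    # Map: each page sends rank / out-degree to every page it links to.
    for url, out_links in graph.items():
        for target in out_links:
            credit[target] += ranks[url] / len(out_links)
    # Reduce: sum incoming credit and apply PR(x) = (1 - d) + d * sum.
    return {url: (1 - d) + d * credit[url] for url in graph}


def pagerank(graph, d=0.85, tol=1e-6, max_iter=100):
    ranks = {url: 1.0 for url in graph}
    for _ in range(max_iter):
        new_ranks = pagerank_step(graph, ranks, d)
        # Convergence check, done outside the parallel part.
        if max(abs(new_ranks[u] - ranks[u]) for u in graph) < tol:
            return new_ranks
        ranks = new_ranks
    return ranks


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Each call to `pagerank_step` corresponds to one MapReduce job; the outer loop is the driver that decides whether to launch another iteration.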
MapReduce: Some More Apps

Distributed Grep.

Count of URL Access
Frequency.

Clustering (K-means)

Graph Algorithms.

Indexing Systems
MapReduce Programs In Google
Source Tree
MapReduce: Extensions and
similar apps
• PIG (Yahoo)
• Hadoop (Apache)
• DryadLinq (Microsoft)
Large Scale Systems Architecture using
MapReduce
User App
MapReduce
Distributed File Systems (GFS)
BigTable: A Distributed
Storage System for Structured
Data
Introduction
• BigTable is a distributed storage system for
managing structured data.
• Designed to scale to a very large size
  • Petabytes of data across thousands of servers
• Used for many Google projects
  • Web indexing, Personalized Search, Google Earth,
Google Analytics, Google Finance, …
• Flexible, high-performance solution for all of
Google's products
Motivation
• Lots of (semi-)structured data at Google
  • URLs:
    • Contents, crawl metadata, links, anchors, pagerank, …
  • Per-user data:
    • User preference settings, recent queries/search results, …
  • Geographic locations:
    • Physical entities (shops, restaurants, etc.), roads, satellite
image data, user annotations, …
• Scale is large
  • Billions of URLs, many versions/page (~20K/version)
  • Hundreds of millions of users, thousands of queries/sec
  • 100TB+ of satellite image data
Why not just use commercial
DB?


Scale is too large for most commercial
databases
Even if it weren’t, cost would be very high


Building internally means system can be applied
across many projects for low incremental cost
Low-level storage optimizations help
performance significantly

Much harder to do when running on top of a database
layer
Goals
• Want asynchronous processes to be
continuously updating different pieces of data
  • Want access to most current data at any time
• Need to support:
  • Very high read/write rates (millions of ops per second)
  • Efficient scans over all or interesting subsets of data
  • Efficient joins of large one-to-one and one-to-many
datasets
• Often want to examine data changes over time
  • E.g. Contents of a web page over multiple crawls
BigTable



Distributed multi-level map
Fault-tolerant, persistent
Scalable





Thousands of servers
Terabytes of in-memory data
Petabyte of disk-based data
Millions of reads/writes per second, efficient scans
Self-managing


Servers can be added/removed dynamically
Servers adjust to load imbalance
Building Blocks

Building blocks:





Google File System (GFS): Raw storage
Scheduler: schedules jobs onto machines
Lock service: distributed lock manager
MapReduce: simplified large-scale data processing
BigTable uses of building blocks:




GFS: stores persistent data (SSTable file format for
storage of data)
Scheduler: schedules jobs involved in BigTable
serving
Lock service: master election, location bootstrapping
Map Reduce: often used to read/write BigTable data
Basic Data Model

A BigTable is a sparse, distributed persistent
multi-dimensional sorted map
(row, column, timestamp) -> cell contents

Good match for most Google applications
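To make the (row, column, timestamp) -> cell contents map concrete, here is a toy in-memory sketch. `TinyTable` and its methods are hypothetical names for illustration; real BigTable is distributed and persistent, which this is not.

```python
class TinyTable:
    """Toy in-memory map: (row, column, timestamp) -> cell contents."""

    def __init__(self):
        self.cells = {}  # sparse: only cells that were written are stored

    def put(self, row, column, timestamp, value):
        self.cells[(row, column, timestamp)] = value

    def get(self, row, column, k=1):
        # Return up to k most recent (timestamp, value) versions of one cell.
        versions = [(ts, v) for (r, c, ts), v in self.cells.items()
                    if r == row and c == column]
        return sorted(versions, reverse=True)[:k]

    def scan(self, start_row, end_row):
        # Yield entries in sorted key order for rows in [start_row, end_row).
        for key in sorted(self.cells):
            if start_row <= key[0] < end_row:
                yield key, self.cells[key]


table = TinyTable()
table.put("com.cnn.www", "contents:", 3, "<html>v3")
table.put("com.cnn.www", "contents:", 5, "<html>v5")
table.put("org.example", "contents:", 4, "<html>")
latest = table.get("com.cnn.www", "contents:")
```

The three parts of the key map directly onto the slide's terms: sparse (missing cells cost nothing), sorted (scans walk keys in order), and multi-dimensional (row, column, timestamp).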
WebTable Example




Want to keep copy of a large collection of web pages
and related information
Use URLs as row keys
Various aspects of web page as column names
Store contents of web pages in the contents: column
under the timestamps when they were fetched.
Rows

Name is an arbitrary string



Access to data in a row is atomic
Row creation is implicit upon storing data
Rows ordered lexicographically

Rows close together lexicographically usually on
one or a small number of machines
Rows (cont.)
Reads of short row ranges are efficient and
typically require communication with a small
number of machines.
 Can exploit this property by selecting row
keys so they get good locality for data
access.
 Example:
math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu
VS
edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys
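The effect of reversing host names can be checked with a short sketch; `row_key` is a hypothetical helper name.

```python
def row_key(hostname):
    # Reverse the dot-separated labels: "math.gatech.edu" -> "edu.gatech.math".
    return ".".join(reversed(hostname.split(".")))


hosts = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
plain_order = sorted(hosts)
keyed_order = sorted(row_key(h) for h in hosts)
```

With plain host names the two gatech.edu pages are separated by a uga.edu page; with reversed keys, pages from the same institution sort next to each other and so tend to land in the same tablet.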
Columns
• Columns have two-level name structure:
  • family:optional_qualifier
• Column family
  • Unit of access control
  • Has associated type information
• Qualifier gives unbounded columns
  • Additional levels of indexing, if desired
Timestamps
• Used to store different versions of data in a cell
  • New writes default to current time, but timestamps for writes can also be
set explicitly by clients
• Lookup options:
  • “Return most recent K values”
  • “Return all values in timestamp range (or all values)”
• Column families can be marked w/ attributes:
  • “Only retain most recent K values in a cell”
  • “Keep values until they are older than K seconds”
Implementation – Three Major
Components


Library linked into every client
One master server

Responsible for:





Assigning tablets to tablet servers
Detecting addition and expiration of tablet servers
Balancing tablet-server load
Garbage collection
Many tablet servers


Each tablet server handles read and write requests to its
tablets
Splits tablets that have grown too large
Implementation (cont.)


Client data doesn’t move through master
server. Clients communicate directly with
tablet servers for reads and writes.
Most clients never communicate with the
master server, leaving it lightly loaded in
practice.
Tablets

Large tables broken into tablets at row
boundaries

Tablet holds contiguous range of rows



Clients can often choose row keys to achieve locality
Aim for ~100MB to 200MB of data per tablet
Serving machine responsible for ~100 tablets

Fast recovery:


100 machines each pick up 1 tablet for failed machine
Fine-grained load balancing:


Migrate tablets away from overloaded machine
Master makes load-balancing decisions
Tablet Location

Since tablets move around from server to
server, given a row, how do clients find the
right machine?

Need to find tablet whose row range covers the
target row
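One way to picture the lookup is a binary search over sorted tablet end keys. The table contents and server names below are invented for the example, and the sketch ignores the multi-level metadata that a real system would use.

```python
import bisect

# Hypothetical location table: sorted end-row keys and the server for each tablet.
tablet_end_keys = ["edu.gatech.zzz", "edu.uga.zzz", "\uffff"]
tablet_servers = ["ts-1", "ts-2", "ts-3"]


def locate(row):
    # The first tablet whose end key is >= row covers that row.
    index = bisect.bisect_left(tablet_end_keys, row)
    return tablet_servers[index]


server = locate("edu.uga.phys")
```

Clients would normally cache the result and only repeat the lookup when a request reaches a server that no longer holds the tablet.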
Tablet Assignment



Each tablet is assigned to one tablet server at
a time.
Master server keeps track of the set of live
tablet servers and current assignments of
tablets to servers. Also keeps track of
unassigned tablets.
When a tablet is unassigned, master assigns
the tablet to a tablet server with sufficient
room.
API
• Metadata operations
  • Create/delete tables, column families, change metadata
• Writes (atomic)
  • Set(): write cells in a row
  • DeleteCells(): delete cells in a row
  • DeleteRow(): delete all cells in a row
• Reads
  • Scanner: read arbitrary cells in a bigtable
    • Each row read is atomic
    • Can restrict returned rows to a particular range
    • Can ask for just data from 1 row, all rows, etc.
    • Can ask for all columns, just certain column families, or specific
columns
Refinements: Compression
• Many opportunities for compression
  • Similar values in the same row/column at different
timestamps
  • Similar values in different columns
  • Similar values across adjacent rows
• Two-pass custom compression scheme
  • First pass: compress long common strings across a
large window
  • Second pass: look for repetitions in small window
• Speed emphasized, but good space reduction
(10-to-1)
Refinements: Bloom Filters


Read operation has to read from disk when
desired SSTable isn’t in memory
Reduce number of accesses by specifying a
Bloom filter.



Allows us to ask if an SSTable might contain data for a
specified row/column pair.
Small amount of memory for Bloom filters drastically
reduces the number of disk seeks for read operations
Use implies that most lookups for non-existent rows or
columns do not need to touch disk
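A minimal Bloom filter sketch shows why a negative answer lets a read skip the disk. The bit-array size, the number of hash functions, and the SHA-256-based hashing are choices made for this example, not details taken from BigTable.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("com.cnn.www/contents:")
present = sstable_filter.might_contain("com.cnn.www/contents:")
```

A filter never gives a false negative, so when `might_contain` returns False the SSTable can be skipped without any disk seek; a True answer still requires reading the SSTable to confirm.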
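Outline
• Overview of cloud computing
• Google cloud computing technologies: GFS, Bigtable, and MapReduce
• Yahoo cloud computing technologies and Hadoop
• Challenges of cloud data management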
Yahoo! Cloud computing
Yahoo! Cloud Stack
[Figure: layered stack, each layer backed by horizontal cloud services]
• EDGE: YCS, YCPI, Brooklyn, …
• WEB: VM/OS, yApache, PHP, App Engine
• APP: VM/OS, Serving Grid, …
• STORAGE: Sherpa, MOBStor, …
• BATCH: Hadoop, …
• Horizontal cloud services: Provisioning (Self-serve), Monitoring/Metering/Security, Data Highway
Web Data Management
• Large data analysis (Hadoop)
  • Scan-oriented workloads
  • Focus on sequential disk I/O
  • $ per CPU cycle
• Structured record storage (PNUTS/Sherpa)
  • CRUD
  • Point lookups and short scans
  • Index-organized table and random I/Os
  • $ per latency
• Blob storage (SAN/NAS)
  • Object retrieval and streaming
  • Scalable file storage
  • $ per GB
The World Has Changed

Web serving applications need:






Scalability!
 Preferably elastic
Flexible schemas
Geographic distribution
High availability
Reliable storage
Web serving applications can do without:


Complicated queries
Strong transactions
PNUTS /
SHERPA
To Help You Scale Your Mountains of Data
Yahoo! Serving Storage Problem

Small records – 100KB or less

Structured records – lots of fields, evolving

Extreme data scale - Tens of TB

Extreme request scale - Tens of thousands of requests/sec

Low latency globally - 20+ datacenters worldwide

High Availability - outages cost $millions

Variable usage patterns - as applications and users change
What is PNUTS/Sherpa?
• Parallel database
• Structured, flexible schema
CREATE TABLE Parts (
ID VARCHAR,
StockNumber INT,
Status VARCHAR
…
)
• Geographic replication
• Hosted, managed infrastructure
[Figure: the same six-record table (keys A to F) shown partitioned and replicated across sites]
What Will It Become?
• Indexes and views
[Figure: replicated copies of the six-record table (keys A to F) together with derived index and view tables]
Design Goals
Scalability



Thousands of machines
Easy to add capacity
Restrict query language to avoid costly queries
Geographic replication


Asynchronous replication around the globe
Low-latency local access
High availability and fault tolerance


Automatically recover from failures
Serve reads and writes despite failures
Consistency



Per-record guarantees
Timeline model
Option to relax if needed
Multiple access paths


Hash table, ordered table
Primary, secondary access
Hosted service


Applications plug and play
Share operational cost
Technology Elements
Applications
Tabular API
PNUTS API
YCA: Authorization
PNUTS
• Query planning and execution
• Index maintenance
Distributed infrastructure for tabular data
• Data partitioning
• Update consistency
• Replication
YDOT FS
• Ordered tables
YDHT FS
• Hash tables
Tribble
• Pub/sub messaging
Zookeeper
• Consistency service
Data Manipulation

Per-record operations




Get
Set
Delete
Multi-record operations



Multiget
Scan
Getrange
Tablets—Hash Table
Tablet boundaries (hash of the key): 0x0000, 0x2AF3, 0x911F, 0xFFFF

Name          Description                             Price
Grape         Grapes are good to eat                  $12
Lime          Limes are green                         $9
Apple         Apple is wisdom                         $1
Strawberry    Strawberry shortcake                    $900
Orange        Arrgh! Don't get scurvy!                $2
Avocado       But at what price?                      $3
Lemon         How much did you pay for this lemon?    $1
Tomato        Is this a vegetable?                    $14
Banana        The perfect fruit                       $2
Kiwi          New Zealand                             $8
Tablets—Ordered Table
Tablet boundaries (key order): A, H, Q, Z

Name          Description                             Price
--- A ---
Apple         Apple is wisdom                         $1
Avocado       But at what price?                      $3
Banana        The perfect fruit                       $2
Grape         Grapes are good to eat                  $12
--- H ---
Kiwi          New Zealand                             $8
Lemon         How much did you pay for this lemon?    $1
Lime          Limes are green                         $9
--- Q ---
Orange        Arrgh! Don't get scurvy!                $2
Strawberry    Strawberry shortcake                    $900
Tomato        Is this a vegetable?                    $14
--- Z ---
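The hash-table layout can be sketched as follows. The hash function here (the first two bytes of an MD5 digest) is an assumption chosen only so that values fall in 0x0000..0xFFFF; the slides do not say which hash PNUTS uses, and `hash_tablet` is a hypothetical name.

```python
import bisect
import hashlib

HASH_BOUNDS = [0x2AF3, 0x911F, 0xFFFF]  # upper bounds of the three hash tablets


def hash_tablet(name):
    # Assumed hash: first two bytes of MD5, giving a value in 0x0000..0xFFFF.
    h = int.from_bytes(hashlib.md5(name.encode()).digest()[:2], "big")
    # Index of the first tablet whose upper bound is >= the hash value.
    return bisect.bisect_left(HASH_BOUNDS, h)


fruits = ["Apple", "Avocado", "Banana", "Grape", "Kiwi", "Lemon"]
placement = {name: hash_tablet(name) for name in fruits}
```

Hashing spreads neighbouring names across tablets, which balances load but makes range scans touch every tablet; the ordered table keeps neighbours together instead.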
Flexible Schema
Posted date   Listing id   Item    Price
6/1/07        424252       Couch   $570
6/1/07        763245       Bike    $86
6/3/07        211242       Car     $1123
6/5/07        421133       Lamp    $15

Optional columns present on only some records: Color (Red), Condition (Good, Fair)
Detailed Architecture
Remote regions
Local region
Clients
REST API
Routers
Tribble
Tablet Controller
Storage
units
Tablet Splitting and Balancing
Each storage unit has many tablets (horizontal partitions of the table)
Storage unit may become a hotspot
Storage unit
Tablet
Overfull tablets split
Tablets may grow over time
Shed load by moving tablets to other servers
QUERY
PROCESSING
Accessing Data
1. Get key k
2. Get key k (forwarded to the storage unit, SU, that holds k)
3. Record for key k
4. Record for key k (returned to the client)
Bulk Read
1. {k1, k2, … kn} sent to the scatter/gather server
2. Get k1, Get k2, Get k3 issued to the storage units (SU)
Range Queries in YDOT

Clustered, ordered retrieval of records
• Router maps key ranges to storage units: Apple → Storage unit 1, Canteloupe → Storage unit 3, Lime → Storage unit 2, Strawberry → Storage unit 1
• Storage unit 1: Apple, Avocado, Banana, Blueberry; Strawberry, Tomato, Watermelon
• Storage unit 2: Lime, Mango, Orange
• Storage unit 3: Canteloupe, Grape, Kiwi, Lemon
• Query Grapefruit…Pear? is split by the router into Grapefruit…Lime? and Lime…Pear?
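The way a router might split a range query across tablets can be sketched with a sorted list of tablet start keys. The interval map is taken from the example above; `split_range` and the half-open [low, high) convention are assumptions of this sketch.

```python
import bisect

# Router's interval map from the example: tablet start key -> storage unit.
START_KEYS = ["Apple", "Canteloupe", "Lime", "Strawberry"]
UNITS = ["Storage unit 1", "Storage unit 3", "Storage unit 2", "Storage unit 1"]


def split_range(low, high):
    """Split the query [low, high) into per-tablet sub-ranges."""
    first = max(bisect.bisect_right(START_KEYS, low) - 1, 0)
    pieces = []
    for i in range(first, len(START_KEYS)):
        start = max(low, START_KEYS[i])
        end = START_KEYS[i + 1] if i + 1 < len(START_KEYS) else None
        if start >= high:
            break
        # Clip the sub-range at the tablet boundary or at the query's end.
        stop = high if end is None or high <= end else end
        pieces.append((UNITS[i], start, stop))
    return pieces


pieces = split_range("Grapefruit", "Pear")
```

For the query Grapefruit…Pear this yields one sub-range for storage unit 3 and one for storage unit 2, matching the split shown in the example.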
Updates
[Figure: update path in eight numbered steps. “Write key k” passes from the client through the routers to the storage units (SU) and message brokers, SUCCESS is reported, and “Sequence # for key k” is returned in steps 7 and 8]
ASYNCHRONOUS REPLICATION
AND CONSISTENCY
Asynchronous Replication
Consistency Model

Goal: Make it easier for applications to reason about updates and
cope with asynchrony

What happens to a record with primary key “Alice”?
[Figure: timeline of the record from insertion through a series of updates to deletion, labelled v. 1 to v. 8 within Generation 1]
As the record is updated, copies may get out of sync.
Example: Social Alice
[Figure: User/Status tables for Alice in the East and West regions, showing ___, Busy, Free, and ??? as the replicas diverge]
Record Timeline: ___, Busy, Free, Free
Consistency Model
[Figure, repeated for each operation below: timeline v. 1 to v. 8 of Generation 1, where v. 8 is the current version and earlier versions are stale]
Read
In general, reads are served using a local copy, which may be a stale version
Read up-to-date
But application can request and get current version
Read ≥ v.6
Or variations such as “read forward”: while copies may lag the
master record, every copy goes through the same sequence of changes
Write
Achieved via per-record primary copy protocol
(To maximize availability, record masterships are automatically
transferred if a site fails)
Can be selectively weakened to eventual consistency
(local writes that are reconciled using version vectors)
Write if = v.7
Test-and-set writes facilitate per-record transactions; in the figure the write
returns ERROR because the current version is v. 8
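A small sketch of per-record versions and a test-and-set write. This is not the PNUTS API: the class, method names, and the use of `ValueError` are hypothetical, and replication is reduced to passing a replica's version explicitly.

```python
class Record:
    """Master copy of one record with a per-record version counter."""

    def __init__(self, value):
        self.version = 1
        self.value = value

    def write(self, value):
        # Plain write: the master serializes updates and bumps the version.
        self.version += 1
        self.value = value
        return self.version

    def test_and_set_write(self, expected_version, value):
        # Apply the write only if nobody has updated the record since
        # the caller read expected_version; otherwise signal an error.
        if self.version != expected_version:
            raise ValueError(f"record is at v.{self.version}, not v.{expected_version}")
        return self.write(value)


def read_at_least(replica_version, replica_value, master, required_version):
    # Serve from the local replica if it is new enough, else go to the master.
    if replica_version >= required_version:
        return replica_version, replica_value
    return master.version, master.value


alice = Record("___")
alice.write("Busy")                       # v.2
alice.test_and_set_write(2, "Free")       # succeeds, record is now v.3
```

A second `test_and_set_write(2, ...)` would now fail, which is how an application detects that its read is out of date before overwriting someone else's update.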
Consistency Techniques

Per-record mastering
 Each record is assigned a “master region”



May differ between records
Updates to the record forwarded to the master region
Ensures consistent ordering of updates

Tablet-level mastering
 Each tablet is assigned a “master region”
 Inserts and deletes of records forwarded to the master region
 Master region decides tablet splits

These details are hidden from the application
 Except for the latency impact!
Mastering
[Figure: replicas of a six-record table (keys A to F); each record is tagged E, W, or C, and one replica is marked as the tablet master]
Bulk Insert/Update/Replace
Client
Source Data
Bulk manager
1. Client feeds records to bulk
manager
2. Bulk loader transfers records
to SU’s in batches
• Bypass routers and
message brokers
• Efficient import into
storage unit
Bulk Load in YDOT

YDOT bulk inserts can cause performance hotspots

Solution: preallocate tablets
Index Maintenance

How to have lots of interesting indexes and
views, without killing performance?

Solution: Asynchrony!

Indexes/views updated asynchronously when
base table updated
SHERPA
IN CONTEXT
Types of Record Stores
• Query expressiveness (simple → feature rich)
  • S3: object retrieval
  • PNUTS: retrieval from single table of objects/records
  • Oracle: SQL
• Consistency model (best effort → strong guarantees)
  • S3: eventual consistency
  • PNUTS: timeline consistency (object-centric consistency)
  • Oracle: ACID (program-centric consistency)
• Data model (flexibility, schema evolution → optimized for fixed schemas)
  • PNUTS, CouchDB: object-centric consistency
  • Oracle: consistency spans objects
• Elasticity, the ability to add resources on demand (inelastic → elastic)
  • Oracle: limited (via data distribution)
  • PNUTS, S3: VLSD (Very Large Scale Distribution/Replication)
Data Stores Comparison
Versus PNUTS
• User-partitioned SQL stores (Microsoft Azure SDS, Amazon SimpleDB)
  • More expressive queries
  • Users must control partitioning
  • Limited elasticity
• Multi-tenant application databases (Salesforce.com, Oracle on Demand)
  • Highly optimized for complex workloads
  • Limited flexibility to evolving applications
  • Inherit limitations of underlying data management system
• Mutable object stores (Amazon S3)
  • Object storage versus record management
Application Design Space
[Figure: systems placed on two axes, from “get a few things” to “scan everything” and from records to files. Systems shown: Sherpa, MySQL, Oracle, BigTable, Everest, MObStor, YMDB, Filer, Hadoop]
Alternatives Matrix
[Table: rows are Sherpa, Y! UDB, MySQL, Oracle, HDFS, BigTable, Dynamo, Cassandra; columns are Elastic, Operability, Availability, Global low latency, Structured access, Updates, Consistency model, SQL/ACID (cell values not shown)]
QUESTIONS?
Hadoop
Problem
• How do you scale up applications?
  • Run jobs processing 100s of terabytes of data
  • Takes 11 days to read on 1 computer
• Need lots of cheap computers
  • Fixes speed problem (15 minutes on 1000
computers), but…
  • Reliability problems
    • In large clusters, computers fail every day
    • Cluster size is not fixed
• Need common infrastructure
  • Must be efficient and reliable
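The figures above can be reproduced with rough arithmetic. Both inputs are assumptions, since the slide gives neither: 100 TB of data and about 100 MB/s of sequential read throughput per machine.

```python
DATA_BYTES = 100e12          # assumed: 100 TB of input
DISK_BYTES_PER_SEC = 100e6   # assumed: ~100 MB/s sequential read per machine


def read_time_seconds(machines):
    # Assumes the data is spread evenly and all machines read in parallel.
    return DATA_BYTES / (DISK_BYTES_PER_SEC * machines)


days_on_one = read_time_seconds(1) / 86400
minutes_on_thousand = read_time_seconds(1000) / 60
```

Under these assumptions one machine needs about 11.6 days and 1000 machines need about 17 minutes, the same order of magnitude as the 11 days and 15 minutes quoted above.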
Solution
• Open Source Apache Project
• Hadoop Core includes:
  • Distributed File System - distributes data
  • Map/Reduce - distributes application
• Written in Java
• Runs on
  • Linux, Mac OS/X, Windows, and Solaris
  • Commodity hardware
Hardware Cluster of Hadoop

Typically in 2 level architecture




Nodes are commodity PCs
40 nodes/rack
Uplink from rack is 8 gigabit
Rack-internal is 1 gigabit
Distributed File System
Single namespace for entire cluster
  Managed by a single namenode.
  Files are single-writer and append-only.
  Optimized for streaming reads of large files.
Files are broken into large blocks.
  Typically 128 MB
  Replicated to several datanodes, for reliability
Access from Java, C, or command line.
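To make the block idea concrete, the sketch below splits a file of a given size into fixed-size blocks of 128 MB. It only models the bookkeeping; the function name and the (offset, length) representation are hypothetical and are not part of the Hadoop API.

```python
# Illustrative model of splitting a file into fixed-size blocks.
# block_layout is a hypothetical helper, not a Hadoop API call.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the typical size quoted above

def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return a list of (offset, length) pairs, one per block."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

one_gb = 1024 * 1024 * 1024
print(len(block_layout(one_gb)))             # number of blocks in a 1 GB file
print(block_layout(300 * 1024 * 1024)[-1])   # offset and length of the last block
```

Large blocks keep the namenode's per-file metadata small and suit the streaming read pattern the slide describes.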
Block Placement
Default is 3 replicas, but settable
Blocks are placed (writes are pipelined):
  First replica on the same node as the writer
  Second replica on a node in a different rack
  Third replica on another node in that other rack
Clients read from closest replica
If the replication for a block drops below target, it is automatically re-replicated.
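The placement order on this slide can be sketched in a few lines. The code below is a simplified model under stated assumptions: the topology is a plain dictionary from rack name to node names, the remote rack is chosen at random, and `place_replicas` and `under_replicated` are hypothetical helpers rather than Hadoop functions.

```python
import random

def place_replicas(writer_rack, writer_node, topology, rng=random):
    """Pick three replica locations: the writer's node, then two
    different nodes that share one other rack."""
    candidates = [rack for rack, nodes in topology.items()
                  if rack != writer_rack and len(nodes) >= 2]
    if not candidates:
        raise ValueError("need another rack with at least two nodes")
    remote_rack = rng.choice(candidates)
    second, third = rng.sample(topology[remote_rack], 2)
    return [(writer_rack, writer_node),
            (remote_rack, second),
            (remote_rack, third)]

def under_replicated(replica_counts, target=3):
    """Return the blocks whose live replica count is below the target."""
    return sorted(block for block, count in replica_counts.items()
                  if count < target)

topology = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
print(place_replicas("rack1", "n1", topology))
print(under_replicated({"blk_a": 3, "blk_b": 2, "blk_c": 1}))
```

Keeping one copy local makes the first write cheap, while putting the other two on a second rack protects against the loss of a whole rack at the cost of a single cross-rack transfer.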

How is Yahoo using Hadoop?
Started with building better applications
  Scale up web scale batch applications (search, ads, …)
  Factor out common code from existing systems, so new applications will be easier to write
  Manage the many clusters
Running Production WebMap
Search needs a graph of the “known” web
  Invert edges, compute link text, whole graph heuristics
Periodic batch job using Map/Reduce
  Uses a chain of ~100 map/reduce jobs
Scale
  1 trillion edges in graph
  Largest shuffle is 450 TB
  Final output is 300 TB compressed
  Runs on 10,000 cores
  Raw disk used 5 PB
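The "invert edges" step mentioned above is a natural fit for Map/Reduce. The sketch below runs the idea in memory on a toy graph; it is not Yahoo's WebMap code, and `run_job`, `map_invert`, and `reduce_collect` are hypothetical names used only to show the shape of a single job in such a chain.

```python
from collections import defaultdict

def map_invert(src, dst):
    """Map: for each edge src -> dst, emit (dst, src)."""
    yield dst, src

def reduce_collect(page, sources):
    """Reduce: gather every page that links to the given page."""
    return page, sorted(sources)

def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one map/reduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(*record):
            groups[key].append(value)   # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

edges = [("a", "b"), ("a", "c"), ("b", "c")]
inlinks = run_job(edges, map_invert, reduce_collect)
print(inlinks)
```

In a real chain, the output of one job like this becomes the input of the next, and the grouping step is the shuffle whose size the slide reports.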
Terabyte Sort Benchmark
Started by Jim Gray at Microsoft in 1998
Sorting 10 billion 100 byte records
Hadoop won the general category in 209 seconds
  910 nodes
  2 quad-core Xeons @ 2.0 GHz / node
  4 SATA disks / node
  8 GB RAM / node
  1 Gb Ethernet / node
  40 nodes / rack
  8 Gb Ethernet uplink / rack
Previous record was 297 seconds
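The benchmark numbers above translate into throughput figures with a little arithmetic. The calculation below uses only values from this slide; it treats the sort as a single pass over the data, which simplifies what the cluster actually does (read, shuffle, and write).

```python
# Throughput implied by the Terabyte Sort result, using the slide's figures.
RECORDS = 10 * 10**9
RECORD_BYTES = 100
NODES = 910
HADOOP_SECONDS = 209
PREVIOUS_SECONDS = 297

total_bytes = RECORDS * RECORD_BYTES            # 10^12 bytes, i.e. 1 TB
aggregate_rate = total_bytes / HADOOP_SECONDS   # bytes sorted per second
per_node_rate = aggregate_rate / NODES
speedup = PREVIOUS_SECONDS / HADOOP_SECONDS

print(f"data sorted:      {total_bytes / 1e12:.0f} TB")
print(f"aggregate rate:   {aggregate_rate / 1e9:.2f} GB/s")
print(f"per-node rate:    {per_node_rate / 1e6:.2f} MB/s")
print(f"vs. previous run: {speedup:.2f}x faster")
```

The per-node rate is modest compared with what one disk can stream, which suggests the result comes from parallelism across many nodes rather than from any single fast machine.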
Hadoop clusters
We have ~20,000 machines running Hadoop
Our largest clusters are currently 2000 nodes
Several petabytes of user data (compressed, unreplicated)
We run hundreds of thousands of jobs every month
Research Cluster Usage
Who Uses Hadoop?
Amazon/A9
AOL
Facebook
Fox Interactive Media
Google / IBM
New York Times
PowerSet (now Microsoft)
Quantcast
Rackspace/Mailtrust
Veoh
Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy
Q&A
For more information:
  Website: http://hadoop.apache.org/core
  Mailing lists:
    core-dev@hadoop.apache
    core-user@hadoop.apache
Outline
Overview of cloud computing
Google cloud computing technologies: GFS, Bigtable, and MapReduce
Yahoo cloud computing technologies and Hadoop
Challenges of cloud data management
Summary of Applications
Data Analysis (Internet Service, Private Cloud)
  BigTable, HBase, HyperTable, Hive, HadoopDB, …
Web Applications (some operations that can tolerate relaxed consistency)
  PNUTS
Architecture
MapReduce-based (BigTable, HBase, Hypertable, Hive)
  scalability
  fault tolerance
  ability to run in a heterogeneous environment
  data replication in file system
  a lot of work to do to support SQL
DBMS-based (SQL Azure, PNUTS, Voldemort)
  easy to support SQL
  easy to utilize index, optimization method
  bottleneck of data storage
  data replication upon DBMS
Hybrid of MapReduce and DBMS (HadoopDB)
  sounds good
  Performance?
Consistency
Two kinds of consistency:
  strong consistency – ACID (Atomicity, Consistency, Isolation, Durability)
  weak consistency – BASE (Basically Available, Soft-state, Eventual consistency)
[Figure: two C/A/P triangles, one labeled BigTable, HBase, Hive, Hypertable, HadoopDB and the other labeled PNUTS; SQL Azure is marked with a question mark.]
A tailor
[Figure: RDBMS, labeled with LOCK, 3NF, ACID.]
Further Reading
Adam Silberstein, Brian Cooper, Utkarsh Srivastava, Erik Vee, Ramana Yerneni, and Raghu Ramakrishnan.
Efficient Bulk Insertion into a Distributed Ordered Table. In SIGMOD, 2008.
Brian Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Phil Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver, and Ramana Yerneni.
PNUTS: Yahoo!'s Hosted Data Serving Platform. In VLDB, 2008.
Parag Agrawal, Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, and Raghu Ramakrishnan.
Asynchronous View Maintenance for VLSD Databases. In SIGMOD, 2009.
Brian F. Cooper, Raghu Ramakrishnan, and Utkarsh Srivastava.
Cloud Storage Design in a PNUTShell. In Beautiful Data, O’Reilly Media, 2009.
Further Reading
F. Chang et al.
Bigtable: A distributed storage system for structured data. In OSDI, 2006.
J. Dean and S. Ghemawat.
MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
G. DeCandia et al.
Dynamo: Amazon’s highly available key-value store. In SOSP, 2007.
S. Ghemawat, H. Gobioff, and S.-T. Leung.
The Google File System. In SOSP, 2003.
D. Kossmann.
The state of the art in distributed query processing.
ACM Computing Surveys, 32(4):422–469, 2000.
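Forthcoming cloud computing textbook
Tsinghua University Press, June 2010
Distributed Systems and Cloud Computing
Three parts:
  Distributed systems
  Overview of cloud computing technologies
  Cloud computing platforms and programming guide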
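Book chapters: 《分布式系统及云计算概论》 (Introduction to Distributed Systems and Cloud Computing)
Chapter 1 Introduction
  1.1 Overview of distributed systems
  1.2 The rise of distributed cloud computing
  1.3 Main services and applications of distributed cloud computing
  1.4 Summary
Survey of distributed systems
Chapter 2 Getting started with distributed systems
  2.1 Definition of distributed systems
  2.2 Hardware and software in distributed systems
  2.3 Main characteristics of distributed systems (e.g. security, fault tolerance)
  2.4 Summary
Chapter 3 Client-server architecture
  3.1 Client-server framework and architecture
  3.2 Client-server communication protocols
  3.3 Variants of the client-server model
  3.4 Summary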
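Chapter 4 Distributed objects
  4.1 Basic model of distributed objects
  4.2 Remote procedure call
  4.3 Remote method invocation
  4.4 Summary
Chapter 5 Common Object Request Broker Architecture (CORBA)
  5.1 Overview of CORBA
  5.2 Basic CORBA services
  5.3 Fault tolerance and security
  5.4 Java IDL
  5.5 Summary
Distributed cloud computing technologies
Chapter 6 Overview of distributed cloud computing
  6.1 Introduction to cloud computing
  6.2 Cloud services
  6.3 Comparison of cloud computing with other technologies
  6.4 Summary
Chapter 7 The three core technologies of the Google cloud platform
  7.1 Google File System
  7.2 Bigtable
  7.3 MapReduce
  7.4 Summary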
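Chapter 8 Technologies of the Yahoo cloud platform
  8.1 PNUTS: a flexible, general-purpose table storage platform
  8.2 Pig: a platform for analyzing large data sets
  8.3 ZooKeeper: a centralized service platform providing group services
  8.4 Summary
Chapter 9 Technologies of the Aneka cloud platform
  9.1 The Aneka cloud platform
  9.2 Market-oriented cloud architecture
  9.3 Aneka: from enterprise grids to market-oriented cloud computing
  9.4 Summary
Chapter 10 Technologies of the Greenplum cloud platform
  10.1 Greenplum system overview
  10.2 The Greenplum analytic database
  10.3 Architecture and features of the Greenplum database
  10.4 Key features and advantages of Greenplum
  10.5 Summary
Chapter 11 Technologies of the Amazon Dynamo cloud platform
  11.1 Overview of Amazon Dynamo
  11.2 Background of Amazon Dynamo's development
  11.3 Amazon Dynamo system architecture
  11.4 Summary
Chapter 12 IBM technologies
  12.1 Overview of IBM cloud computing
  12.2 IBM Cloud Storm (IBM云风暴)
  12.3 IBM Smart Business Services
  12.4 IBM Smarter Planet initiative
  12.5 IBM System z
  12.6 IBM virtualized dynamic infrastructure technologies
  12.7 Summary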
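Program development for distributed cloud computing
Chapter 13 Development on Hadoop
  13.1 Hadoop system overview
  13.2 Map/Reduce user interface
  13.3 Task execution and execution environment
  13.4 Practical programming examples
  13.5 Summary
Chapter 14 Development on HBase
  14.1 What is HBase
  14.2 The HBase data model
  14.3 Structure and functionality of HBase
  14.4 How to use HBase
  14.5 Summary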
Chapter 15 Development on Google Apps
  15.1 Introduction to Google App Engine
  15.2 How to use Google App Engine
  15.3 Application development examples based on Google Apps
  15.4 Summary
Chapter 16 Development on MS Azure
  16.1 Introduction to MS Azure
  16.2 Using Windows Azure services
  16.3 Summary
Chapter 17 Development examples on Amazon EC2
  17.1 Introduction to Amazon Elastic Compute Cloud
  17.2 How to use Amazon EC2
  17.3 Summary