Managing Data in the Cloud
Scaling in the Cloud
[Figure: client sites connect through a load balancer (proxy) to a tier of app servers backed by a MySQL master DB, which replicates to a MySQL slave DB.]
• Replication: the database becomes the scalability bottleneck
• Cannot leverage elasticity
Scaling in the Cloud
[Figure: client sites connect through a load balancer (proxy) to Apache + app servers backed by key value stores in place of the single database.]
CAP Theorem (Eric Brewer)
• “Towards Robust
Distributed Systems”
PODC 2000.
• “CAP Twelve Years
Later: How the "Rules"
Have Changed” IEEE
Computer 2012
Key Value Stores
• Key-Valued data model
– Key is the unique identifier
– Key is the granularity for consistent access
– Value can be structured or unstructured
• Gained widespread popularity
– In house: Bigtable (Google), PNUTS (Yahoo!), Dynamo
(Amazon)
– Open source: HBase, Hypertable, Cassandra,
Voldemort
• Popular choice for the modern breed of web applications
Big Table (Google)
• Data model.
– Sparse, persistent,
multi-dimensional sorted map.
• Data is partitioned across multiple servers.
• The map is indexed by a row key, column key, and
a timestamp.
• Output value is un-interpreted array of bytes.
– (row: byte[ ], column: byte[ ], time: int64) → byte[ ]
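To make the data model concrete, below is a minimal sketch in Python (illustrative only, not Bigtable's API); the webtable-style row and column names are just example values.

```python
# A minimal sketch of the data model (illustrative, not Bigtable's actual API):
# a sparse map from (row, column, timestamp) to an uninterpreted byte string,
# with reads returning the newest version at or before the requested time.

class SparseTable:
    def __init__(self):
        self.cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self.cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp):
        """Newest value with time <= timestamp, or None if the cell is absent."""
        versions = self.cells.get((row, column), {})
        usable = [t for t in versions if t <= timestamp]
        return versions[max(usable)] if usable else None

table = SparseTable()
table.put(b"com.cnn.www", b"contents:", 3, b"<html>...</html>")   # webtable-style example
table.put(b"com.cnn.www", b"anchor:cnnsi.com", 9, b"CNN")
print(table.get(b"com.cnn.www", b"contents:", 5))                 # b'<html>...</html>'
```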
Architecture Overview
• Shared-nothing architecture consisting of
thousands of nodes (commodity PC).
[Figure: Google’s Bigtable data model layered on top of the Google File System.]
Atomicity Guarantees in Big Table
• Every read or write of data under a single row
is atomic.
• Objective: make read operations single-sited!
Big Table’s Building Blocks
• Google File System (GFS)
– Highly available distributed file system that stores log and
data files
• Chubby
– Highly available persistent distributed lock manager.
• Tablet servers
– Handles reads and writes to its tablets and splits tablets.
– Each tablet is typically 100-200 MB in size.
• Master Server
– Assigns tablets to tablet servers,
– Detects the addition and deletion of tablet servers,
– Balances tablet-server load.
Overview of Bigtable Architecture
[Figure: the master performs lease management with Chubby and issues control operations to the tablet servers; each tablet server hosts tablets (T1, T2, …, Tn) and runs master and Chubby proxies, a cache manager, and a log manager, all layered on the Google File System.]
GFS Architectural Design
• A GFS cluster
– A single master
– Multiple chunkservers per master
• Accessed by multiple clients
– Running on commodity Linux machines
• A file
– Represented as fixed-sized chunks
• Labeled with 64-bit unique global IDs
• Stored at chunkservers
• 3-way replication across chunkservers
Architectural Design
[Figure: applications use a GFS client, which asks the GFS master for chunk locations ("chunk location?") and then exchanges chunk data ("chunk data?") directly with GFS chunkservers, each storing chunks in its local Linux file system.]
Single-Master Design
• Simple
• Master answers only chunk locations
• A client typically asks for multiple chunk
locations in a single request
• The master also predictively provides chunk locations immediately following those requested
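As a rough sketch of this interaction (assumed names and structures, not the actual GFS client library), the example below maps a byte range onto 64 MB chunk indexes and asks a toy master for several chunk locations in one batched request.

```python
# Illustrative sketch of a GFS-style client/master lookup (assumed names, not the real API).

CHUNK_SIZE = 64 * 1024 * 1024  # fixed-size 64 MB chunks

class Master:
    """Holds file -> chunk-handle mappings and chunk -> replica locations in memory."""
    def __init__(self):
        self.file_chunks = {"/logs/crawl-0001": ["handle-17", "handle-18", "handle-19"]}
        self.chunk_replicas = {
            "handle-17": ["cs-a", "cs-b", "cs-c"],
            "handle-18": ["cs-b", "cs-d", "cs-e"],
            "handle-19": ["cs-a", "cs-d", "cs-f"],
        }

    def lookup(self, path, chunk_indexes):
        """Answer several chunk-location queries in a single request."""
        handles = self.file_chunks[path]
        return {i: (handles[i], self.chunk_replicas[handles[i]]) for i in chunk_indexes}

def chunk_indexes_for_range(offset, length):
    """Map a byte range onto the fixed-size chunks it touches."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))

master = Master()
indexes = chunk_indexes_for_range(offset=60 * 1024 * 1024, length=10 * 1024 * 1024)
print(master.lookup("/logs/crawl-0001", indexes))   # the read spans chunks 0 and 1
```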
Metadata
• Master stores three major types
– File and chunk namespaces, persistent in operation log
– File-to-chunk mappings, persistent in operation log
– Locations of a chunk’s replicas, not persistent.
• All kept in memory: Fast!
– Quick global scans
• For Garbage collections and Reorganizations
– 64 bytes of metadata only per 64 MB of data
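A quick back-of-the-envelope check of why this fits in memory, using the roughly 64 bytes per 64 MB chunk figure above (the petabyte-scale input is just an assumed example):

```python
# Back-of-the-envelope estimate of master memory, assuming roughly 64 bytes of
# metadata per 64 MB chunk (the figure quoted above); numbers are illustrative.
CHUNK_SIZE = 64 * 2**20        # 64 MB
METADATA_PER_CHUNK = 64        # bytes

file_data = 2**50              # 1 PiB of stored file data (before replication)
chunks = file_data // CHUNK_SIZE
metadata_bytes = chunks * METADATA_PER_CHUNK

print(f"{chunks:,} chunks -> ~{metadata_bytes / 2**30:.1f} GiB of in-memory metadata")
# 16,777,216 chunks -> ~1.0 GiB of in-memory metadata
```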
Mutation Operation in GFS
• Mutation: any write or append operation.
• The data needs to be written to all replicas.
• Mutations are applied in the same order at every replica, even when multiple clients request them concurrently.
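The sketch below illustrates only the ordering idea (a primary replica serializes mutations and every replica applies them in that order); it is not the actual GFS lease and data-flow protocol.

```python
# Illustrative only: a primary replica serializes mutations and secondaries apply
# them in the same order (the real GFS protocol uses chunk leases, a data-push
# pipeline, and failure handling not shown here).

class Replica:
    def __init__(self, name):
        self.name = name
        self.applied = []          # mutations in the order they were applied

    def apply(self, serial, mutation):
        self.applied.append((serial, mutation))

class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        serial = self.next_serial          # the primary picks the global order
        self.next_serial += 1
        self.apply(serial, mutation)
        for replica in self.secondaries:
            replica.apply(serial, mutation)
        return serial

secondaries = [Replica("r2"), Replica("r3")]
primary = Primary("r1", secondaries)
for m in ["append A", "append B", "write C"]:   # possibly from different clients
    primary.mutate(m)
assert secondaries[0].applied == secondaries[1].applied == primary.applied
```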
GFS Revisited
• “GFS: Evolution on Fast-Forward” an interview with GFS
designers in CACM 3/11.
• Single master was critical for early deployment.
• “the choice to establish 64MB …. was much larger than
the typical file-system block size, but only because the
files generated by Google's crawling and indexing
system were unusually large.”
• As the application mix changed over time, ….deal efficiently with large numbers of files requiring far less than 64MB (think in terms of Gmail, for example). The problem was not so much with the number of files itself, but rather with the memory demands all of those files made on the centralized master, thus exposing one of the bottleneck risks inherent in the original GFS design.
GFS Revisited(Cont’d)
• “the initial emphasis in designing GFS was on
batch efficiency as opposed to low latency.”
• “The original single-master design: A single
point of failure may not have been a disaster
for batch-oriented applications, but it was
certainly unacceptable for latency-sensitive
applications, such as video serving.”
• Future directions: distributed master, etc.
• Interesting and entertaining read.
PNUTS Overview
• Data Model:
– Simple relational model—really key-value store.
– Single-table scans with predicates
• Fault-tolerance:
– Redundancy at multiple levels: data, meta-data etc.
– Leverages relaxed consistency for high availability:
reads & writes despite failures
• Pub/Sub Message System:
– Yahoo! Message Broker for asynchronous updates
Asynchronous replication
Consistency Model
• Hide the complexity of data replication
• Between the two extremes:
– One-copy serializability, and
– Eventual consistency
• Key assumption:
– Applications manipulate one record at a time
• Per-record time-line consistency:
– All replicas of a record preserve the update order
Implementation
• A read returns a consistent version
• One replica designated as master (per record)
• All updates forwarded to that master
• Master designation adaptive: the replica receiving most of the writes becomes the master
Consistency model
• Goal: make it easier for applications to reason about updates and cope with asynchrony
• What happens to a record with primary key “Brian”?
[Figure: the record is inserted and then updated repeatedly, producing versions v.1 through v.8 within Generation 1, followed by a delete. The same timeline illustrates each operation below.]
• Read: may return a stale version (e.g., v.2 or v.5) rather than the current version (v.8).
• Read up-to-date: always returns the current version.
• Read ≥ v.6: returns any version at least as recent as v.6.
• Write: appends a new version after the current one.
• Write if = v.7: a test-and-set write that returns ERROR if the current version is no longer v.7.
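A minimal sketch of per-record timeline consistency with a test-and-set write, loosely following the operations above; the class and method names are illustrative, not the actual PNUTS API.

```python
# Illustrative sketch of per-record timeline consistency (not the real PNUTS API).
import random

class TimelineRecord:
    def __init__(self):
        self.versions = []            # versions[i] holds v.(i+1), in master-applied order

    def write(self, value):
        """Unconditional write, ordered by the record's master."""
        self.versions.append(value)
        return len(self.versions)     # new version number

    def test_and_set_write(self, expected_version, value):
        """Fail if the record has moved past the version the caller saw."""
        if len(self.versions) != expected_version:
            raise ValueError(f"ERROR: current version is v.{len(self.versions)}, "
                             f"not v.{expected_version}")
        return self.write(value)

    def read_any(self):
        """May return a stale version (cheap, can be served by any replica)."""
        v = random.randint(1, len(self.versions))
        return v, self.versions[v - 1]

    def read_latest(self):
        """Always returns the current version (may have to contact the master)."""
        return len(self.versions), self.versions[-1]

rec = TimelineRecord()
for i in range(1, 8):                  # v.1 .. v.7
    rec.write(f"value-{i}")
print(rec.read_latest())               # (7, 'value-7')
rec.test_and_set_write(7, "value-8")   # succeeds: current version is v.7
try:
    rec.test_and_set_write(7, "value-9")
except ValueError as e:
    print(e)                           # fails: the record is already at v.8
```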
PNUTS Architecture
[Figure: clients issue requests through a REST API; the data-path components are the routers, the message broker, the tablet controller, and the storage units.]
PNUTS architecture
[Figure: each region (the local region and remote regions) contains clients, a REST API, routers, a tablet controller, and storage units; regions are connected through YMB (Yahoo! Message Broker).]
System Architecture: Key Features
• Pub/Sub Mechanism: Yahoo! Message Broker
• Physical Storage: Storage Unit
• Mapping of records: Tablet Controller
• Record locating: Routers
Highlights of PNUTS Approach
• Shared nothing architecture
• Multiple datacenters for geographic distribution
• Time-line consistency and access to stale data.
• Use a publish-subscribe system for reliable fault-tolerant communication
• Replication with record-based master.
AMAZON’S KEY-VALUE STORE:
DYNAMO
Adapted from Amazon’s Dynamo Presentation
Highlights of Dynamo
• High write availability
• Optimistic: vector clocks for resolution
• Consistent hashing (Chord) in a controlled environment
• Quorums for relaxed consistency.
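As a sketch of the optimistic, vector-clock side (illustrative, not Dynamo's implementation), the comparison below is what lets a replica decide whether one version of an object supersedes another or whether the two are concurrent and must be reconciled.

```python
# Illustrative vector-clock bookkeeping (not Dynamo's actual code): each node
# increments its own counter on update; a version whose clock dominates another
# in every component is a descendant, otherwise the two versions conflict.

def increment(clock, node):
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a, b):
    """True if the version with clock `a` includes all updates recorded in `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "Sx")          # written at node Sx         -> {Sx: 1}
v2 = increment(v1, "Sx")          # overwritten at Sx          -> {Sx: 2}
v3 = increment(v2, "Sy")          # updated at Sy              -> {Sx: 2, Sy: 1}
v4 = increment(v2, "Sz")          # concurrent update at Sz    -> {Sx: 2, Sz: 1}

print(descends(v3, v2))           # True: v3 supersedes v2
print(concurrent(v3, v4))         # True: the client must reconcile the two
```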
TOO MANY CHOICES – WHICH
SYSTEM SHOULD I USE?
Cooper et al., SOCC 2010
Benchmarking Serving Systems
• A standard benchmarking tool for evaluating key-value stores: the Yahoo! Cloud Serving Benchmark (YCSB)
• Evaluate different systems on common
workloads
• Focus on performance and scale out
Benchmark tiers
• Tier 1 – Performance
– Latency versus throughput as throughput increases
• Tier 2 – Scalability
– Latency as database, system size increases
– “Scale-out”
– Latency as we elastically add servers
– “Elastic speedup”
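A toy version of the tier-1 experiment (this is not YCSB itself; the stand-in store and its capacity are assumed), just to show the latency-versus-throughput shape measured in the plots that follow:

```python
# Toy illustration of the tier-1 methodology (latency vs. offered throughput);
# not the YCSB tool. The stand-in store's latency grows with offered load so the
# curve has the familiar shape of the plots below.
import random

def fake_read_latency_ms(offered_ops_per_sec, capacity_ops_per_sec=10_000):
    """Stand-in store: latency rises as offered load approaches capacity."""
    utilization = min(offered_ops_per_sec / capacity_ops_per_sec, 0.99)
    return random.uniform(0.8, 1.2) / (1.0 - utilization)   # queueing-style model + jitter

def average_read_latency(offered_ops_per_sec, ops=1_000):
    samples = [fake_read_latency_ms(offered_ops_per_sec) for _ in range(ops)]
    return sum(samples) / len(samples)

for target in (1_000, 4_000, 8_000, 9_500):
    print(f"{target:>6} ops/sec -> avg read latency {average_read_latency(target):5.1f} ms")
```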
Workload A – Update heavy: 50/50 read/update
[Figure: Workload A read latency: average read latency (ms) versus throughput (ops/sec) for Cassandra, HBase, PNUTS, and MySQL.]
Cassandra (based on Dynamo) is optimized for heavy updates.
Cassandra uses hash partitioning.
Workload B – Read heavy: 95/5 read/update
[Figure: Workload B read latency: average read latency (ms) versus throughput (operations/sec) for Cassandra, HBase, PNUTS, and MySQL.]
PNUTS uses MySQL, and MySQL is optimized for read operations.
Workload E – Short scans: scans of 1-100 records of size 1KB
[Figure: Workload E scan latency: average scan latency (ms) versus throughput (operations/sec) for HBase, PNUTS, and Cassandra.]
HBase uses an append-only log, so it is optimized for scans; the same holds for MySQL and PNUTS. Cassandra uses hash partitioning, so it has poor scan performance.
Summary
• Different databases suitable for different
workloads
• Evolving systems – landscape changing
dramatically
• Active development community around open
source systems
Two approaches to scalability
• Scale-up
– Classical enterprise setting
(RDBMS)
– Flexible ACID transactions
– Transactions in a single node
• Scale-out
– Cloud friendly (Key value stores)
– Execution at a single server
• Limited functionality & guarantees
– No multi-row or multi-step
transactions
Key-Value Store Lessons
What are the design principles
learned?
Design Principles
[DNIS 2010]
• Separate System and Application State
– System metadata is critical but small
– Application data has varying needs
– Separation allows use of different class of protocols
Design Principles
• Decouple Ownership from Data Storage
– Ownership is exclusive read/write access to data
– Decoupling allows lightweight ownership migration
[Figure: a classical DBMS bundles the transaction manager, ownership (multi-step transactions or read/write access), recovery, the cache manager, and storage in a single stack; the decoupled design separates ownership from storage.]
Design Principles
• Limit most interactions to a single node
– Allows horizontal scaling
– Graceful degradation during failures
– No distributed synchronization
Thanks: Curino et al VLDB 2010
Design Principles
• Limited distributed synchronization is practical
– Maintenance of metadata
– Provide strong guarantees only for data that needs
it
Fault-tolerance in the Cloud
• Need to tolerate catastrophic failures
– Geographic Replication
• How to support ACID transactions over data replicated at
multiple datacenters
– One-copy serializability: clients can access data in any datacenter; the data appears as a single copy with atomic access
Megastore: Entity Groups (Google, CIDR 2011)
• Entity groups are sub-databases
– Static partitioning
– Cheap transactions within entity groups (common)
– Expensive cross-entity-group transactions (rare)
Megastore Entity Groups
Semantically Predefined
• Email
– Each email account forms a natural entity group
– Operations within an account are transactional: a user’s sent message is guaranteed to be observed despite a failover to another replica
• Blogs
– User’s profile is entity group
– Operations such as creating a new blog rely on asynchronous
messaging with two-phase commit
• Maps
– Dividing the globe into non-overlapping patches
– Each patch can be an entity group
Megastore
Slides adapted from authors’ presentation
Google’s Spanner: Database Tech That Can Span the Planet (OSDI 2012)
The Big Picture (OSDI 2012)
[Figure: 2PC provides atomicity and 2PL + wound-wait provides isolation, layered on Paxos for consistency; TrueTime (GPS + atomic clocks) supplies timestamps; tablets are stored as logs and SSTables on the Colossus File System; movedir performs load balancing.]
TrueTime
• TrueTime: APIs that provide real time with bounds on error.
– Powered by GPS and atomic clocks.
• Enforces external consistency:
– If the start of T2 occurs after the commit of T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1.
• Concurrency control:
– Update transactions: 2PL
– Read-only transactions: use real time to return a consistent snapshot.
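A minimal sketch of the TrueTime idea and Spanner-style commit wait, assuming a local clock with a known uncertainty bound EPSILON (illustrative, not Google's implementation): TT.now() returns an interval [earliest, latest], and a commit only becomes visible once its timestamp is definitely in the past, which yields the external-consistency rule above.

```python
# Illustrative TrueTime-style interval clock and commit wait; a sketch of the idea,
# not Google's implementation. EPSILON is an assumed clock-uncertainty bound.
import time
from dataclasses import dataclass

EPSILON = 0.007  # seconds of clock uncertainty (assumed, e.g. a few milliseconds)

@dataclass
class TTInterval:
    earliest: float
    latest: float

def tt_now():
    """Return an interval guaranteed (by assumption) to contain absolute time."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)

def commit():
    """Choose a commit timestamp, then 'commit wait' until it is surely in the past."""
    s = tt_now().latest              # timestamp no earlier than the absolute commit time
    while tt_now().earliest <= s:    # wait until s has definitely passed...
        time.sleep(0.001)
    return s                         # ...before making writes visible / releasing locks

ts1 = commit()
ts2 = commit()    # T2 starts after T1's commit returned,
assert ts2 > ts1  # so its commit timestamp is strictly greater (external consistency)
```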
Primary References
• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, Gruber: Bigtable: A Distributed Storage System for Structured Data. OSDI 2006.
• Ghemawat, Gobioff, Leung: The Google File System. SOSP 2003.
• McKusick, Quinlan: GFS: Evolution on Fast-Forward. Communications of the ACM, 2010.
• Cooper, Ramakrishnan, Srivastava, Silberstein, Bohannon, Jacobsen, Puz, Weaver, Yerneni: PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB 2008.
• DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's Highly Available Key-Value Store. SOSP 2007.
• Cooper, Silberstein, Tam, Ramakrishnan, Sears: Benchmarking Cloud Serving Systems with YCSB. SoCC 2010.