PPT

advertisement
CS 347:
Parallel and Distributed
Data Management
Notes 01: Introduction
Brian Cooper
Based on Material by Hector Garcia-Molina
CS 347
Lecture 1
1
In CS245: Centralized DB system
Software:
P
M
Application
SQL Front End
Query Processor
Transaction Proc.
File Access
...
• Simplifications:
• single front end
• one place to keep locks
• if processor fails, system fails, ...
CS 347
Lecture 1
2
In CS347
• Multiple processors ( + memories)
• Heterogeneity and autonomy of
“components”
CS 347
Lecture 1
3
Multiple processors
• Opportunity for parallelism
• Opportunity for reliability
• Synchronization issues

To illustrate synchronization problems:
Two Generals Problem
CS 347
Lecture 1
4
The one general problem (Trivial!)
G
Troops
CS 347
 Battlefield
Lecture 1
5
The two general problem:
Blue army
Blue
G
CS 347
Enemy
Red army
<------------------------------->
messengers
Lecture 1
Red
G
6
Rules:
• Blue and red army must attack
at same time
• Blue and red generals synchronize
through messengers
• Messengers can be lost
CS 347
Lecture 1
7
How Many Messages Do We Need?
assume blue starts...
BG
RG
attack at 9am
Is this enough??
CS 347
Lecture 1
8
How Many Messages Do We Need?
assume blue starts...
BG
RG
attack at 9am
ack (red goes at 9am)
Is this enough??
CS 347
Lecture 1
9
How Many Messages Do We Need?
assume blue starts...
BG
RG
attack at 9am
ack (red goes at 9am)
got ack
Is this enough??
CS 347
Lecture 1
10
Stated problem is Impossible!
• Theorem: There is no protocol that uses a
finite number of messages that solves the
two-generals problem (as stated here)
Alternatives??
CS 347
Lecture 1
11
Probabilistic Approach?
• Send as many messages as possible, hope
one gets through...
assume blue starts...
BG
RG
attack at 9am
attack at 9am
attack at 9am
attack at 9am
CS 347
Lecture 1
12
Eventual Commit
• Eventually both sides attack...
assume blue starts...
BG
RG
attack ASAP
retransmits
retransmits
on my way!
CS 347
Lecture 1
13
Eventual Commit
• One message sent every time unit
• Probability of success one message is p
• What is probability that red commits by time t?
BG
attack ASAP
retransmits
retransmits
RG
on my way!
CS 347
Lecture 1
14
Eventual Commit
BG
attack ASAP
retransmits
retransmits
RG
on my way!
• C(1) = p
CS 347
Lecture 1
15
Eventual Commit
BG
attack ASAP
retransmits
retransmits
RG
on my way!
• C(1) = p
• C(2) = p + (1-p)p
CS 347
Lecture 1
16
Eventual Commit
BG
attack ASAP
retransmits
retransmits
RG
on my way!
•
•
•
•
C(1)
C(2)
C(3)
C(4)
CS 347
=
=
=
=
p
p + (1-p)p
p + (1-p)p + (1-p)2p
p + (1-p)p + (1-p)2p + (1-p)3p
Lecture 1
17
Eventual Commit
p
C(t)
t
CS 347
Lecture 1
18
Eventual Commit
BG
attack ASAP
retransmits
retransmits
RG
on my way!
• How expensive is protocol?
• E = expected number of messages
• Homework: compute E (function of p)
CS 347
Lecture 1
19
2-Phase Eventual Commit
• Eventually both sides attack...
assume blue starts...
BG
RG
ready to attack?
retransmits
phase 1
yes, at your disposal
attack ASAP
retransmits
phase 2
ack
CS 347
Lecture 1
20
Commit Protocols
• Will study commit protocols like these...
CS 347
Lecture 1
21
Heterogeneity
Select new
investments
Application
Stock
ticker
tape
RDBMS
Portfolio
CS 347
Lecture 1
Files
History of
dividends,
ratios,...
22
Autonomy
Example: unable to get statistics
for query optimization
Example: blue general may have mind of
his (or her) own!
CS 347
Lecture 1
23
• So, in CS347 we study data management
with multiple processors and possible
autonomy, heterogeneity
– Impact on:
• Data organization
• Query processing
• Access structures
• Concurrency control
• Recovery
CS 347
Lecture 1
24
• Renewed Interest in Distributed/Parallel
Data Processing!
– Massive web data, manage with many computers
– How to crawl and search the web?
– Peer-to-peer systems manage huge amounts of
data
– Data from many sources (e.g., comparison
shopping): how to integrate?
– Sensor Networks: data generated an many
sensors/devices, need to analyze
– Multi-player games (e.g., World of Warcraft): tons
of distributed data
CS 347
Lecture 1
25
Data
It’s the Economy, Stupid!
• Example: Multi-player games
P
P
P
P
CS 347
P
state
P
P
Lecture 1
P
P
P
26
Data
It’s the Economy, Stupid!
• Example: Multi-player games
P
P
P
P
P
CS 347
P
state
state
P
Lecture 1
P
P
P
27
Logistics
• LECTURES: Mondays and Wednesdays 12:50pm to
2:05pm, Gates B01
• INSTRUCTOR: Brian Cooper; Email:
brianfrankcooper@gmail.com; Office Hours: Mondays,
Wednesdays 11:30 to 12:30.
• TEACHING ASSISTANT:
– Vasilis Verroios, Email: verroios@stanford.edu
• Piazza forum (tentative):
https://piazza.com/class#spring2015/cs347.
• SECRETARY: Marianne Siroker; Office: Gates Hall 435;
Email: siroker@cs.stanford.edu; Phone: (650) 723-0872
CS 347
Lecture 1
28
Logistics
• TEXTBOOK: No required textbook. You'll be expected to
read several research papers.
• CLASS WEB PAGE: http://www.stanford.edu/class/cs347
Will contain homework assignments, course news, etc. Be
sure to check it periodically.
• ASSIGNMENTS: about 5 homeworks
• GRADING: Homeworks: 20%, Midterm 30%, Final: 50%.
CS 347
Lecture 1
29
Tentative Syllabus 2014 (Part I)
•
•
•
•
•
•
•
•
•
•
DATE
Monday March 30
Wednesday April 1
Monday April 6
Wednesday April 8
Monday April 13
Wednesday April 15
Monday April 20
Wednesday April 22
Monday April 27
Wednesday April 29
CS 347
TOPIC
Introduction [N01]
Data Fragmentation [N02]
Query processing [N03]
Query processing & Optimization [N04]
Concurrency Control, Failures [N05]
Reliable Data Management [N06]
Reliable Data Management [N06]
Guest Lecture – Cloudera Impala
Replicated Data Management [N07]
Midterm
Lecture 1
30
Tentative Syllabus 2014 (Part II)
•
•
•
•
•
•
•
•
•
•
DATE
Monday May 4
Wednesday May 6
Monday May 11
Wednesday May 13
Monday May 18
Wednesday May 20
Wednesday May 27
Monday June 1
Wednesday June 3
Monday June 8
CS 347
TOPIC
Partitions, Entity Resolution [N11]
Peer to Peer Systems [N08]
Map-Reduce & Pig [N09]
Map-Reduce & Pig [N09]
Other Open Source Systems [N09b]
Other Open Source Systems [N09b]
Distributed IR [N10]
Time [N12]
Publish/Subscribe Systems [N13]
8:30 am!!! FINAL EXAM
Lecture 1
31
Interesting New Systems
•
•
•
•
•
•
•
•
•
•
Storm (from Twitter)
S4 (from Yahoo)
Cassandra (key-value store)
Hive (SQL over Hadoop)
Pregel (graph execution)
Kestrel (queues?)
ZooKeeper (replicated data)
Sparkl or Spark (Berkeley?)
H-Base
HyRacks (UC Irvine)
CS 347
•
•
•
•
•
•
•
•
•
•
Lecture 1
MemCache-D
Pnuts (Yahoo)
Dynamo (Amazon)
Spanner (Google)
Paxos
G-Store (UC Santa Barbara)
Elastras (UC Santa Barbara)
Tao (Facebook)
Kafka (LinkedIn)
Impala (Cloudera)
32
Concepts you should be familiar with:
• CS245: query plan, cost estimation, join
algorithms, recovery, logging,…
• Interconnection networks
(bus, mesh, hypercube,…)
• Computer networks (LAN, WAN,…)
CS 347
Lecture 1
33
Introductory topics
•
•
•
•
Database architectures
Client-server systems
Distributed vs. parallel DB systems
Cloud Computing
CS 347
Lecture 1
34
DB architectures
(1) Shared memory
P
P
...
P
M
CS 347
Lecture 1
35
DB architectures
(2) Shared disk
P
P
P
...
M
M
M
...
CS 347
Lecture 1
36
DB architectures
(2B) Shared data storage (disk or file?)
P
P
P
...
M
M
M
• storage area network (SAN)
• Hadoop/Google file system
CS 347
Lecture 1
...
37
DB architectures
(3) Shared nothing
CS 347
P
P
M
M
...
P
M
Lecture 1
38
DB architectures
(4) Hybrid example
P
P
...
P
P
...
P
M
P
M
CS 347
Lecture 1
39
DB architectures
(4) Hybrid example 2
WAN
R
R
LAN #2
LAN #1
P
M
CS 347
...
P
...
M
P
M
Lecture 1
...
P
M
40
DB architectures
(4) Hybrid Tandem-like
also in: Microsoft SQLServer Parallel Data Warehouse
CS 347
P
P
M
M
...
Lecture 1
P
P
M
M
41
DB architectures
(5) Unusual?
Datacycle (Broadcast disks)
Entire DB broadcast
CS 347
P
P
P
M
M
M
Lecture 1
42
(5) Unusual
Sorting network
P
Sort
net
...
M
P
M
CS 347
Lecture 1
43
(5) Unusual — processor per track or
processor per disk
P’
P’
P
P’
...
M
“small” processors
+ “tiny” memories
Related idea in
Oracle Exadata
"DB machine"
CS 347
Lecture 1
44
(6) Unusual — sensor networks
B
B
P
M
sensor
B
battery
B
CS 347
P
M
P
M
data collection node
P’
P
M
M
B
P
M
Lecture 1
45
Issues for selecting architecture
•
•
•
•
•
•
Reliability
Scalability
Geographic distribution of data
Data “clusters”
Performance
Cost
CS 347
Lecture 1
46
Client-Server Systems
(or how to partition software)
Application
Front End
Query Processor
Transaction Processing
File Access
CS 347
Lecture 1
client
server
47
Client-Server Systems
(or how to partition software)
Application
Front End
Query Processor
Transaction Processing
File Access
CS 347
Lecture 1
client
server
48
Client-Server Systems
(or how to partition software)
Application
Front End
Query Processor
Transaction Processing
File Access
CS 347
Lecture 1
client
server
49
Transaction Servers
• Clients ship transactions consisting of 1
or more SQL commands
E.g., Open DataBase Connectivity (ODBC)
(standard API)
CS 347
Lecture 1
50
Data Servers
• Client requests pages or records
• Popular for OODB systems
CS 347
Lecture 1
51
Issues
• Object granularity
• Where is data cached?
• Where is locking done?
CS 347
Lecture 1
52
Basic Tradeoff
• Offloading work to clients
• Data transmitted
C
C
Reserve
hotel room
Get
pages
S
CS 347
S
Lecture 1
53
Note:
Reserve
hotel room
Similar issues arise when we partition
software/functionality within server
P
P
M
M
...
P
M
•Where is data cached?
•Where is locking done?
CS 347
Lecture 1
54
Parallel or distributed DB system?
• More similarities than differences!
CS 347
Lecture 1
55
• Typically, parallel DBs:
– Fast interconnect
– Homogeneous software
– High performance is goal
– Transparency is goal
CS 347
Lecture 1
56
• Typically, distributed DBs:
– Geographically distributed
– Data sharing is goal (may run into
heterogeneity, autonomy)
– Disconnected operation possible
CS 347
Lecture 1
57
Cloud Computing
• Is CC just a marketing term??
– utility (like power)
– data or CPU cycles?
– many processors, many storage units
– business model
CS 347
Lecture 1
58
Is CC a subset, superset, disjoint
from, or overlaps with:
•
•
•
•
•
•
•
•
•
grid computing
distributed computing
Web 2.0
Cluster Computing
Peer-to-peer computing
software as a service
client-server computing
data center as a computer
massively parallel computing
CS 347
Lecture 1
CC
(A)
(B)
CC
(C)
CC
(D)
CC
59
Clash of the Clouds
CS 347
(Economist April 4, 2009)
Lecture 1
60
Silver lining (cloud price war)
(Economist, August 30, 2014)
CS 347
Lecture 1
61
CC Issues
•
•
•
•
Customer lock-in
Privacy
Standards
Software licensing
CS 347
Lecture 1
62
Next
• How to describe distributed data
• Query processing in parallel DBs
• Query processing in distributed DBs
CS 347
Lecture 1
63
Query processing in parallel DBs:
• Typically: we can distribute/ partition/
sort…. data to make certain DB
operations (e.g., Join) fast
CS 347
Lecture 1
64
Query processing in distributed DBs:
• Typically: we are given data
distribution; we need to find query
processing strategy to minimize cost
(e.g., communication cost)
CS 347
Lecture 1
65
Download