Web-Scale Data Management Systems
Prof. Oliver Kennedy
Monday, September 10, 2012
http://www.cse.buffalo.edu/~okennedy/courses/cse704fa2012.html
okennedy@buffalo.edu
About Me
• New to UB (super excited to be here)
• 4+ years working in databases
  • And 3+ years in distributed systems before that
• Worked with analytics teams at Microsoft Azure
Office Hours
• Davis Hall 338H
• Monday: 2:00-4:00 PM
• Thursday: 2:00-4:00 PM
• or by appointment
Course Goals
• Familiarize you with systems for storing and analyzing really, really big data
  • Real-world systems (if you want to go into industry)
  • Interesting problems (if you want to go into academia)
• Past, present, and future systems
• Give (some of) you experience working with these systems.
Course Requirements (Everyone)
• Submit paper overviews each week
  • 1-2 paragraphs per paper
  • What's the message of the paper?
  • What are the pros and cons of the paper's approach?
• Everyone gets 2 weeks of missed overviews
Course Requirements (2+ Credits)
• Present 2 papers
  • 30-40 minute presentation
  • 20-30 minutes of discussion
• Arrange to go over your presentation with me by the prior Thursday.
Course Requirements (3+ Credits)
• A simple implementation project
  • Hadoop-based Join Algorithms
  • A Ring DHT
  • Standalone Distributed Join Algorithms
  • Or develop your own project!
Course Requirements (3+ Credits; Continued)
• Available resources for projects
  • A 12-core development testbed
  • $100 of Amazon Cloud Credit
    • Be careful with this credit!
    • Avoid running out early!
Course Requirements (3+ Credits; Continued)
• Project requirements
  • A 1-page milestone report on Nov. 1
  • A short (~4 page) final report
  • A short (10-15 minute) presentation
• Confirm your project with me by Oct. 1
Dryad — Data-Flow Computation
Map Reduce & HDFS — Map/Reduce
Hive & HadoopDB — SQL on M/R
Pig & Dremel — Other Languages on M/R
MonetDB & DataCyclotron — Column Stores
Cassandra & BigTable/HBase — Semistructured Databases
Zookeeper & Percolator — Distributed Consistency
Chord & Dynamo — Distributed Hash Tables
PIQL — Enforced Scalability (very recent)
Lipstick — Workflow Provenance
Borealis & DBToaster — Stream Processors (presented last month)

Questions about the Course?
Dryad
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly
Microsoft Research, Silicon Valley
(presented by Oliver Kennedy)
Parallel Programming
Hard Problem!
Locks! Deadlock! IPC! Error Handling!
What Has Worked?
The programmer knows what's data-parallel.
The computer understands the infrastructure.

What Has Worked?
cat users.dat \
  | sed 's/\([^ ]*\).*/\1/' \
  | grep 'buffalo' \
  | wc -l
Unix pipes allow programmers to compose 'nuggets' of computation.
Graph Programming
Use the same metaphor!
Join nuggets of code into a graph.

Graph Programming
How about an Example?
Graph Programming
SELECT DISTINCT p.objID
FROM   photoObjAll p
JOIN   neighbors n
  ON   p.objID = n.objID
  AND  n.objID < n.neighborObjID
JOIN   photoObjAll l
  ON   l.objID = n.neighborObjID
  AND  SimilarColor(l.rgb, n.rgb)
Graph Programming
The same query decomposes into two joins:
  Join X: photoObjAll p ⋈ neighbors n
          (p.objID = n.objID AND n.objID < n.neighborObjID)
  Join Y: (p ⋈ n) ⋈ photoObjAll l
          (l.objID = n.neighborObjID AND SimilarColor(l.rgb, n.rgb))
Graph Programming
[Figure: the query as a dataflow graph]
• Inputs p (photoObjAll) and n (neighbors) feed join X.
• A projection π reads p⋈n and emits n.neighborID : p⋈n.
• A sort S reads the keyed tuples and emits p⋈n sorted by n.neighborId.
• Join Y combines the sorted stream with input l (photoObjAll) to produce the result.
• Each edge is a channel: TCP, an in-memory FIFO, or a file.
• The whole pipeline can be replicated, so multiple copies of X, π, S, and Y run side by side.
• Graph Programming
• Execution Model
• Evaluation
Building a Graph
G = < V, E, I, O >
  V: vertices
  E: edges
  I: input vertices
  O: output vertices
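To make the 4-tuple concrete, here is a minimal Python sketch; the class and field names are mine, not Dryad's (its real API is C++):

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(eq=False)
class Vertex:
    name: str          # e.g. "X", "pi", "S", "Y"
    program: Callable  # the code 'nugget' this vertex runs

@dataclass
class Graph:
    vertices: List[Vertex]               # V
    edges: List[Tuple[Vertex, Vertex]]   # E: (source, destination) pairs
    inputs: List[Vertex]                 # I: vertices fed by external input
    outputs: List[Vertex]                # O: vertices that produce the result

def singleton(v: Vertex) -> Graph:
    # A single vertex A is the graph < {A}, {}, {A}, {A} >
    return Graph([v], [], [v], [v])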
Building a Graph
A single vertex A is the graph < {A}, {}, {A}, {A} >.

Building a Graph
A^k ("k-times") clones A into k replicas:
  < {A1, A2, …}, {}, {A1, A2, …}, {A1, A2, …} >
Everything is cloned.
Building a Graph
A^k >= B^k (one-to-one)
Connect each A's output to the corresponding B's input:
  < VA ⨁ VB, EA ∪ EB ∪ Enew, IA, OB >

Building a Graph
A^k >> B^k (all-to-any)
Connect every A output to every B input:
  < VA ⨁ VB, EA ∪ EB ∪ Enew, IA, OB >

Building a Graph
(A >= C >= D >= B) || (A >= E >= B)
Merge the two graphs:
  < VA ⨁* VB, EA ∪* EB, IA ∪* IB, OA ∪* OB >
  (* = except merging duplicates)
Summary
A^k     Parallelize A, with k replicas
A >= B  Connect A's outputs to B's inputs (one-to-one)
A >> B  Connect A's outputs to B's inputs (all-to-any)
A || B  Merge graphs A and B (deduplicating nodes that appear in both A and B)
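Continuing the Python sketch above, here are illustrative stand-ins for the four operators; the names replicate, pointwise, crossproduct, and merge are mine, not Dryad's (its real graph builder uses C++ operator overloading):

import copy

def replicate(g: Graph, k: int) -> Graph:
    # A^k: k independent copies of g; everything is cloned, no new edges.
    clones = [copy.deepcopy(g) for _ in range(k)]
    return Graph(
        [v for c in clones for v in c.vertices],
        [e for c in clones for e in c.edges],
        [v for c in clones for v in c.inputs],
        [v for c in clones for v in c.outputs])

def pointwise(a: Graph, b: Graph) -> Graph:
    # A >= B: connect A's i-th output to B's i-th input (one-to-one;
    # this sketch assumes equal counts on both sides).
    new_edges = list(zip(a.outputs, b.inputs))
    return Graph(a.vertices + b.vertices, a.edges + b.edges + new_edges,
                 a.inputs, b.outputs)

def crossproduct(a: Graph, b: Graph) -> Graph:
    # A >> B: connect every A output to every B input (all-to-any).
    new_edges = [(o, i) for o in a.outputs for i in b.inputs]
    return Graph(a.vertices + b.vertices, a.edges + b.edges + new_edges,
                 a.inputs, b.outputs)

def merge(a: Graph, b: Graph) -> Graph:
    # A || B: union of the two graphs, merging vertices shared by both.
    def dedup(items, key):
        seen, out = set(), []
        for x in items:
            if key(x) not in seen:
                seen.add(key(x))
                out.append(x)
        return out
    by_id = id
    by_ends = lambda e: (id(e[0]), id(e[1]))
    return Graph(dedup(a.vertices + b.vertices, by_id),
                 dedup(a.edges + b.edges, by_ends),
                 dedup(a.inputs + b.inputs, by_id),
                 dedup(a.outputs + b.outputs, by_id))

# Usage: the (A >= C >= D >= B) || (A >= E >= B) example from the slides.
A, B, C, D, E = (singleton(Vertex(n, print)) for n in "ABCDE")
left  = pointwise(pointwise(pointwise(A, C), D), B)
right = pointwise(pointwise(A, E), B)
example = merge(left, right)    # A, C, D, E, B with A and B shared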
• Graph Programming
• Execution Model
• Evaluation
Job Execution
[Figure: the example graph with vertices A, C, D, E, B]
The Job Manager coordinates graph execution.
Job Execution
Edges can be implemented as files, TCP streams, in-memory FIFOs, ...etc.
The programmer doesn't need to worry about the channels.
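A toy Python sketch of why that works (my own classes, not Dryad's implementation): every channel type hides behind the same write/read interface, so the vertex 'nugget' is identical whether its edge is a file or an in-memory FIFO; a TCP channel would wrap a socket the same way.

import pickle, queue

class FileChannel:
    # Data is fully written and the file closed before the reader starts.
    def __init__(self, path):
        self.path = path
    def write_all(self, items):
        with open(self.path, "wb") as f:
            pickle.dump(list(items), f)
    def read_all(self):
        with open(self.path, "rb") as f:
            yield from pickle.load(f)

class FifoChannel:
    # Producer and consumer run concurrently; data never touches disk.
    _EOS = object()   # end-of-stream marker
    def __init__(self):
        self.q = queue.Queue()
    def write_all(self, items):
        for x in items:
            self.q.put(x)
        self.q.put(self._EOS)
    def read_all(self):
        while True:
            x = self.q.get()
            if x is self._EOS:
                return
            yield x

def count_buffalo(channel):
    # A vertex 'nugget': the same code runs over either channel type.
    return sum(1 for rec in channel.read_all() if "buffalo" in rec)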
Job Execution
For in-memory FIFOs, downstream nodes are spawned immediately.
Job Execution
Files are completed before spawning downstream nodes:
when the source node is finished, the file is closed...
... and the downstream node is spawned.
Job Execution
If the downstream node fails, its connection to the Job Manager dies, and the Job Manager restarts it.
The node restarts its computation from the start of the file.
Job Execution
For a FIFO or stream, the Job Manager recreates the data by also restarting the downstream node.
The downstream node is assumed to be deterministic (and side-effect free).
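A toy model of this recovery logic (my own simplification, not Dryad's Job Manager): assume restarts cascade across any transient FIFO/TCP channel, in either direction, while file channels stop the cascade because the file can simply be re-read; re-execution is only safe because vertices are deterministic and side-effect free.

def restart_set(channels, failed):
    # channels: (source, destination, kind) triples; kind is "file", "fifo", or "tcp".
    # Returns every vertex the Job Manager must re-run when `failed` dies:
    # file edges stop the cascade, transient edges pull in the vertex on
    # their other end so lost data can be regenerated and re-consumed.
    to_restart, frontier = set(), [failed]
    while frontier:
        v = frontier.pop()
        if v in to_restart:
            continue
        to_restart.add(v)
        for src, dst, kind in channels:
            if kind == "file" or v not in (src, dst):
                continue
            frontier.append(src if v == dst else dst)
    return to_restart

# Example: suppose A feeds C over a file, A feeds E over a FIFO, and E feeds B
# over a FIFO. If E fails, A and B restart with it; C is untouched.
edges = [("A", "C", "file"), ("A", "E", "fifo"), ("E", "B", "fifo")]
print(restart_set(edges, "E"))   # -> {'A', 'B', 'E'} (set order may vary)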
Runtime Graph Refinement
What about aggregation?
[Figure: many A vertices on different servers feeding a single aggregator B]
Aggregation is expensive: lots of work, and lots of network traffic.
We can use runtime information to optimize!
Once the Job Manager assigns nodes to machines...
... we can apply a user-provided function to pre-aggregate: each server combines
its local A outputs over in-memory FIFOs, and only the combined results cross
the network stream to B (less work, lower bandwidth).
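A hedged sketch of that refinement (illustrative Python with made-up names; Dryad actually exposes this through callbacks on the graph at run time): once placement is known, splice one local combiner per machine between the A replicas and the aggregator B, so only partial aggregates cross the network.

from collections import defaultdict

def add_local_preaggregation(a_vertices, placement, aggregator="B"):
    # a_vertices: names of the A replicas feeding the aggregator.
    # placement: dict mapping each A vertex to the machine it was assigned to.
    # Returns the refined edge list: A -> local combiner over an in-memory
    # FIFO, and combiner -> aggregator over the network.
    by_machine = defaultdict(list)
    for v in a_vertices:
        by_machine[placement[v]].append(v)

    edges = []
    for machine, group in sorted(by_machine.items()):
        combiner = f"preagg@{machine}"   # runs the user-provided combine function
        edges += [(v, combiner, "in-memory FIFO") for v in group]
        edges.append((combiner, aggregator, "network stream"))
    return edges

# Example: six A replicas spread across two servers.
placement = {"A1": "server1", "A2": "server1", "A3": "server1",
             "A4": "server2", "A5": "server2", "A6": "server2"}
for edge in add_local_preaggregation(sorted(placement), placement):
    print(edge)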
• Graph Programming
• Execution Model
• Evaluation
Evaluation
SELECT DISTINCT p.objID
FROM   photoObjAll p
JOIN   neighbors n
  ON   p.objID = n.objID
  AND  n.objID < n.neighborObjID
JOIN   photoObjAll l
  ON   l.objID = n.neighborObjID
  AND  SimilarColor(l.rgb, n.rgb)
Evaluation
[Figure 8 from the paper: speed-up of the SQL query versus the number of computers (up to 10), comparing Dryad In-Memory, Dryad Two-Pass, and SQLServer 2005. The baseline is Dryad on a single computer; speed-up is near-linear in the number of computers. The in-memory variant (which runs from n = 6 computers and up) is roughly twice as fast as the two-pass variant, and the specialized Dryad program is significantly, but not outrageously, faster than SQLServer's general-purpose query engine.]
Evaluation
[Figure 9 from the paper: the communication graph for computing a query histogram (Section 6.3), shown as the first-cut "naive" encapsulated version, which doesn't scale well. Vertex key: P = Read/Parse Input, D = Distribute Copies, S = In-Memory Sort, MS = Merge Sort, C = Count.]
Time: ?? (really bad)
Evaluation
Dynamically Refined Into
[Figure 10 from the paper: rearranging the vertices gives better scaling performance than Figure 9. The user supplies graph (a), specifying that 450 buckets should be used when distributing the output and that each Q' vertex may receive up to 1 GB of input while each T may receive up to 600 MB. The number of Q' and T vertices is determined at run time based on the number of partitions in the input and the network locations and output sizes of preceding vertices in the graph, and the refined graph (b) is executed by the system. Details are in Section 6.3.]
Time: 11 min 30 sec
Evaluation
[Figure 7 from the paper: sorting increasing amounts of data while keeping the volume of data per computer fixed.]
[Table from the paper: Dryad running times for the query as the number of computers is varied: 2167, 451, 242, 135, 92.]
Conclusions
• The hard part of distributed programming is infrastructure.
• Programmers like thinking in code nuggets.
  • Graphs are a natural representation of distributed computations.
• Dryad isn't used much, but production dataflow systems use virtually identical techniques.
  • E.g., Storm (Twitter's analytics infrastructure)

Questions?