Lecture note 4

TDD: Topics in Distributed Databases
Distributed Query Processing
 MapReduce
 Vertex-centric models for querying graphs
 Distributed query evaluation by partial evaluation
Coping with the sheer volume of big data
Using an SSD with a scan rate of 6GB/s, a linear scan of a data set D would take
 1.9 days when D is of 1PB (10^15 B)
 5.28 years when D is of 1EB (10^18 B)
Can we do better if we have n processors/SSDs?
Divide and conquer: split the work across n processors
[Architecture diagram: processors (P), each with its memory (M) and a local DB partition, connected by an interconnection network]
How to make use of parallelism in query answering?
MapReduce
MapReduce
 A programming model (from Google) with two primitive functions:
 Map: <k1, v1> → list (k2, v2)
 Reduce: <k2, list(v2)> → list (k3, v3)
 Input: a list <k1, v1> of key-value pairs
 Map: applied to each pair, computes key-value pairs <k2, v2>
• The intermediate key-value pairs are hash-partitioned based on k2. Each partition (k2, list(v2)) is sent to a reducer
 Reduce: takes a partition as input, and computes key-value pairs <k3, v3>
The process may reiterate – multiple map/reduce steps
How does it work?
Architecture (Hadoop)
[Dataflow diagram: the input <k1, v1> pairs are stored in the DFS and partitioned into blocks (64MB), one block per mapper (a map task); each mapper emits <k2, v2> pairs into its local store; these pairs are hash-partitioned on k2 and shipped to reducers; reducers emit <k3, v3> pairs, which are aggregated into the result; the process may run in multiple steps]
No need to worry about how the data is stored and sent
Connection with parallel database systems
[Diagram: mappers consume <k1, v1> pairs and produce <k2, v2> pairs in parallel; reducers consume the partitions and produce <k3, v3> pairs in parallel]
What parallelism? Data-partitioned parallelism
Parallel database systems
[The same mapper/reducer dataflow drawn over a shared-nothing architecture: processors (P) with memories (M) and local DB partitions, connected by an interconnection network]
Restricted query language: only two primitives
MapReduce implementation of relational operators
Projection π_A(R), where A is not necessarily a key of R
Input: for each tuple t in R, a pair (key, value), where value = t
 Map(key, t)
• emit (t[A], t[A])
Apply to each input tuple, in parallel; emit new tuples with the projected attribute
 Reduce(hkey, hvalue[ ])
• emit(hkey, hkey)
The reducer is not strictly necessary, but it eliminates duplicates. Why?
Mappers: process each tuple in parallel
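A minimal Python sketch of the projection above. The `run_mapreduce` helper is a toy, in-memory stand-in for the framework's map/shuffle/reduce cycle (not Hadoop's API), and the positional tuple layout is an assumption for illustration; the later sketches reuse this helper.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce round in memory: map, shuffle on the map output key, reduce."""
    groups = defaultdict(list)                 # the "shuffle": partition map output on its key
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    out = []
    for k2, values in groups.items():
        out.extend(reducer(k2, values))
    return out

# Projection pi_A(R): tuples are plain Python tuples and attribute A is position 0 here.
def project_map(key, t):
    yield (t[0], t[0])                         # emit (t[A], t[A])

def project_reduce(hkey, hvalues):
    yield (hkey, hkey)                         # one output per distinct A-value: duplicates eliminated

R = [(0, ("a1", "b1")), (1, ("a1", "b2")), (2, ("a2", "b3"))]
print(run_mapreduce(R, project_map, project_reduce))   # [('a1', 'a1'), ('a2', 'a2')]
```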
Selection
Selection σ_C(R)
Input: for each tuple t in R, a pair (key, value), where value = t
 Map(key, t)
• if C(t)
• then emit (t, “1”)
Apply to each input tuple, in parallel; select tuples that satisfy condition C
 Reduce(hkey, hvalue[ ])
• emit(hkey, hkey)
How does it eliminate duplicates? Reducers: eliminate duplicates
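A sketch of the selection under the same assumptions, reusing the `run_mapreduce` helper from the projection sketch; the predicate C is any boolean function over a tuple.

```python
# Selection sigma_C(R), reusing run_mapreduce from the projection sketch above.
def select_map(C):
    def mapper(key, t):
        if C(t):
            yield (t, "1")          # the selected tuple itself is the map output key
    return mapper

def select_reduce(hkey, hvalues):
    yield (hkey, hkey)              # emitted once per distinct tuple: duplicates eliminated

R = [(0, ("a", 1)), (1, ("b", 5)), (2, ("b", 5))]
print(run_mapreduce(R, select_map(lambda t: t[1] > 2), select_reduce))
# [(('b', 5), ('b', 5))] – the duplicate of ("b", 5) is collapsed by the reducer
```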
Union
Union R1 ∪ R2
A mapper is assigned chunks from either R1 or R2
Input: for each tuple t in R1 and s in R2, a pair (key, value)
 Map(key, t)
• emit (t, “1”)
A mapper just passes an input tuple to a reducer
 Reduce(hkey, hvalue[ ])
• emit(hkey, hkey)
Reducers simply eliminate duplicates
Map: process tuples of R1 and R2 uniformly
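A sketch of the union, again reusing the `run_mapreduce` helper from the projection sketch; mappers treat tuples of both relations uniformly and reducers remove duplicates.

```python
# Union R1 ∪ R2, reusing run_mapreduce from the projection sketch above.
def union_map(key, t):
    yield (t, "1")                  # pass the tuple through; it becomes the shuffle key

def union_reduce(hkey, hvalues):
    yield (hkey, hkey)              # each distinct tuple emitted once (set semantics)

R1 = [(0, ("a", 1)), (1, ("b", 2))]
R2 = [(2, ("b", 2)), (3, ("c", 3))]
print(run_mapreduce(R1 + R2, union_map, union_reduce))
# ('b', 2) appears once in the output even though it occurs in both relations
```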
Set difference
Set difference R1 − R2 (tuples from R1 and R2 must be distinguishable)
Input: for each tuple t in R1 and s in R2, a pair (key, value)
 Map(key, t)
• if t is in R1
• then emit(t, “1”)
• else emit(t, “2”)
Tag each tuple with its source
 Reduce(hkey, hvalue[ ])
• if only “1” appears in the list hvalue
• then emit(hkey, hkey)
Why does it work? Reducers do the checking
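A sketch of the set difference, reusing the `run_mapreduce` helper from the projection sketch; tagging each record with its source relation is an assumption about how the input is laid out.

```python
# Set difference R1 − R2, reusing run_mapreduce from the projection sketch above.
def diff_map(key, value):
    source, t = value                          # value = ("R1", t) or ("R2", t)
    yield (t, "1" if source == "R1" else "2")

def diff_reduce(hkey, hvalues):
    if all(tag == "1" for tag in hvalues):     # the tuple never occurs in R2
        yield (hkey, hkey)

records = [(0, ("R1", ("a", 1))), (1, ("R1", ("b", 2))), (2, ("R2", ("b", 2)))]
print(run_mapreduce(records, diff_map, diff_reduce))   # only ('a', 1) survives
```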
Reduce-side join
Natural join R1 ⋈ R2 on R1.A = R2.B, where R1[A, C], R2[B, D]
Input: for each tuple t in R1 and s in R2, a pair (key, value)
 Map(key, t)
• if t is in R1
• then emit(t[A], (“1”, t[C]))
• else emit(t[B], (“2”, t[D]))
Hashing on the join attributes
 Reduce(hkey, hvalue[ ])
• for each (“1”, t[C]) and each (“2”, s[D]) in the list hvalue
• emit((hkey, t[C], s[D]), hkey)
Nested loop; inefficient. Why?
How to implement R1 ⋈ R2 ⋈ R3?
Reduce-side join (based on hash partitioning)
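A sketch of the reduce-side join above, reusing the `run_mapreduce` helper from the projection sketch; the source tag on each record is an assumption about the input layout.

```python
# Reduce-side (repartition) join of R1[A, C] and R2[B, D] on R1.A = R2.B.
def join_map(key, value):
    source, t = value
    if source == "R1":
        yield (t[0], ("1", t[1]))              # hash on A, keep C
    else:
        yield (t[0], ("2", t[1]))              # hash on B, keep D

def join_reduce(hkey, hvalues):
    cs = [v for tag, v in hvalues if tag == "1"]
    ds = [v for tag, v in hvalues if tag == "2"]
    for c in cs:                               # the nested loop noted on the slide
        for d in ds:
            yield ((hkey, c, d), hkey)         # emit((hkey, t[C], s[D]), hkey)

records = [(0, ("R1", (1, "c1"))), (1, ("R1", (1, "c2"))),
           (2, ("R2", (1, "d1"))), (3, ("R2", (2, "d2")))]
print(run_mapreduce(records, join_map, join_reduce))
# [((1, 'c1', 'd1'), 1), ((1, 'c2', 'd1'), 1)]
```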
Map-side join
Recall the partitioned join for R1 ⋈ R2 on R1.A = R2.B:
 Partition R1 and R2 into n partitions, by the same partitioning function on R1.A and R2.B, via either range or hash partitioning
 Compute R1i ⋈ R2i on R1.A = R2.B locally at processor i
 Merge the local results
Map-side join:
 Input relations are partitioned and sorted based on the join keys
 Map over R1 and read from the corresponding partition of R2 (a merge join)
 map(key, t)
• read R2i
• for each tuple s in relation R2i
• if t[A] = s[B] then emit((t[A], t[C], s[D]), t[A])
Limitation: requires matching sort order and partitioning of both relations
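A sketch of the map-side (partitioned) join: both relations are assumed to be pre-partitioned on the join key with the same partitioning function and sorted, so each map task only merge-joins co-partitioned lists; the list-of-pairs layout is illustrative.

```python
# Map-side join: merge-join two co-partitioned, sorted lists of (join_key, payload) pairs.
def map_side_join(r1_partition, r2_partition):
    out, j = [], 0
    for a, c in r1_partition:
        # advance in R2's partition until its key catches up with a
        while j < len(r2_partition) and r2_partition[j][0] < a:
            j += 1
        k = j
        while k < len(r2_partition) and r2_partition[k][0] == a:
            out.append(((a, c, r2_partition[k][1]), a))   # emit((t[A], t[C], s[D]), t[A])
            k += 1
    return out

r1_part = [(1, "c1"), (2, "c2")]                  # partition i of R1, sorted on A
r2_part = [(1, "d1"), (1, "d2"), (3, "d3")]       # the corresponding partition of R2, sorted on B
print(map_side_join(r1_part, r2_part))            # [((1, 'c1', 'd1'), 1), ((1, 'c1', 'd2'), 1)]
```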
In-memory join
Recall the fragment-and-replicate join for R1 ⋈ R2 on R1.A < R2.B:
 Partition R1 into n partitions, by any partitioning method, and distribute it across n processors
 Replicate the other relation R2 across all processors
 Compute R1j ⋈ R2 on R1.A < R2.B locally at processor j
 Merge the local results
Broadcast join:
 A smaller relation is broadcast to each node and stored in its local memory
 The other relation is partitioned and distributed across mappers
 map(key, t)
• for each tuple s in relation R2 (local)
• if t[A] = s[B] then emit((t[A], t[C], s[D]), t[A])
Limitation: memory (the broadcast relation must fit at each node)
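A sketch of the broadcast (in-memory) join: the smaller relation is assumed to be replicated to every mapper and indexed in memory, while each mapper streams its own chunk of the larger relation past it; function and variable names are illustrative.

```python
# Broadcast join: R2 is small, replicated to every mapper; R1 is partitioned across mappers.
def broadcast_join_mapper(r1_chunk, r2_in_memory):
    """r2_in_memory: the whole (small) R2 as a list of (B, D) pairs, held locally."""
    index = {}
    for b, d in r2_in_memory:                  # index R2 on the join attribute once per mapper
        index.setdefault(b, []).append(d)
    for a, c in r1_chunk:                      # stream the local chunk of R1
        for d in index.get(a, []):
            yield ((a, c, d), a)               # emit((t[A], t[C], s[D]), t[A])

r2 = [(1, "d1"), (2, "d2")]
r1_chunk = [(1, "c1"), (3, "c3")]
print(list(broadcast_join_mapper(r1_chunk, r2)))   # [((1, 'c1', 'd1'), 1)]
```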
Aggregation
R(A, B, C), compute sum(B) group by A
 Map(key, t)
• emit (t[A], t[B])
Grouping: done by MapReduce framework
 Reduce(hkey, hvalue[ ])
• sum := 0;
• for each value s in the list hvalue
• sum := sum + s;
• emit(hkey, sum)
Compute the aggregation for each group
Leveraging the MapReduce framework
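A sketch of this aggregation, reusing the `run_mapreduce` helper from the projection sketch; attributes are positional (A = 0, B = 1), which is an assumption for illustration.

```python
# Aggregation sum(B) group by A on R(A, B, C).
def agg_map(key, t):
    yield (t[0], t[1])              # the framework's shuffle does the grouping on A

def agg_reduce(hkey, hvalues):
    yield (hkey, sum(hvalues))      # sum(B) within each group

R = [(0, ("x", 3, "c0")), (1, ("x", 4, "c1")), (2, ("y", 5, "c2"))]
print(run_mapreduce(R, agg_map, agg_reduce))   # [('x', 7), ('y', 5)]
```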
Practice: validation of functional dependencies
 A functional dependency (FD) defined on schema R: X → Y
– For any instance D of R, D satisfies the FD if for any pair of tuples t and t’, if t[X] = t’[X], then t[Y] = t’[Y]
– Violations of the FD in D:
{ t | there exists t’ in D, such that t[X] = t’[X], but t[Y] ≠ t’[Y] }
 Develop a MapReduce algorithm to find all the violations of the FD in D
– Map: for each tuple t, emit (t[X], t), i.e., t[X] is the key
– Reduce: for each group, find all tuples t such that there exists t’ in the group with t[Y] ≠ t’[Y]; emit (1, t)
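A sketch of this FD-violation check, reusing the `run_mapreduce` helper from the projection sketch; X and Y are given as tuples of attribute positions, an illustrative convention.

```python
# Violations of an FD X -> Y, following the map/reduce outline above.
def fd_map(X):
    def mapper(key, t):
        yield (tuple(t[i] for i in X), t)            # key each tuple by its X-value
    return mapper

def fd_reduce(Y):
    def reducer(hkey, tuples):
        if len({tuple(t[i] for i in Y) for t in tuples}) > 1:   # same X, differing Y
            for t in tuples:
                yield (1, t)                         # every tuple in the group witnesses a violation
    return reducer

R = [(0, (1, 2)), (1, (1, 3)), (2, (2, 4))]          # check FD: attribute 0 -> attribute 1
print(run_mapreduce(R, fd_map((0,)), fd_reduce((1,))))   # the two tuples whose first attribute is 1
```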
Does MapReduce support recursion?
Write a MapReduce algorithm to validate a set of FDs.
Transitive closures
 The transitive closure TC of a relation R[A, B]
– R is a subset of TC
– if (a, b) and (b, c) are in TC, then (a, c) is in TC
That is,
• TC(x, y) :- R(x, y);
• TC(x, z) :- TC(x, y), TC(y, z).
 Develop a MapReduce algorithm that given R, computes TC
A fixpoint computation
– How to determine when to terminate?
– How to minimize redundant computation?
Write a MapReduce algorithm
A naïve MapReduce algorithm
Given R(A, B), compute TC
Initially, the input is the relation R
 Map((a, b), value)
• emit (a, (“r”, b)); emit(b, (“l”, a));
 Reduce(b, hvalue[ ])
• for each (“l”, a) in hvalue
• for each (“r”, c) in hvalue
• emit(a, c); emit(b, c);
• emit(a, b);
Iteration: the output of the reducers becomes the input of the mappers in the next round of MapReduce computation
One round of the recursive computation:
 apply the transitivity rule
 restore (a, b) and (b, c). Why?
Termination?
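A sketch of one round of this naïve computation, reusing the `run_mapreduce` helper from the projection sketch; the restore step keeps the original pairs so they are not lost for later rounds.

```python
# One round of the naïve TC algorithm above.
def tc_map(key, pair):
    a, b = pair
    yield (a, ("r", b))                        # b lies to the right of a
    yield (b, ("l", a))                        # a lies to the left of b

def tc_reduce(b, hvalues):
    lefts  = [a for tag, a in hvalues if tag == "l"]
    rights = [c for tag, c in hvalues if tag == "r"]
    for a in lefts:
        for c in rights:
            yield ((a, c), None)               # transitivity: (a, b), (b, c) => (a, c)
        yield ((a, b), None)                   # restore (a, b)
    for c in rights:
        yield ((b, c), None)                   # restore (b, c)

R = [(0, ("1", "2")), (1, ("2", "3"))]
print(sorted({p for p, _ in run_mapreduce(R, tc_map, tc_reduce)}))
# [('1', '2'), ('1', '3'), ('2', '3')]
```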
A MapReduce algorithm
Given R(A, B), compute TC
 Map((a, b), value)
• emit (a, (“r”, b)); emit(b, (“l”, a));
 Reduce(b, hvalue[ ])
• for each (“l”, a) in hvalue
• for each (“r”, c) in hvalue
• emit(a, c); emit(b, c);
• emit(a, b);
Termination: when the intermediate result no longer changes, controlled by a non-MapReduce driver
How many rounds? Naïve: not very efficient. Why? How to improve it?
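A sketch of such a non-MapReduce driver: it repeats the round above until the result no longer changes (the fixpoint), reusing tc_map, tc_reduce and run_mapreduce from the earlier sketches.

```python
def transitive_closure(edges):
    current = set(edges)
    while True:
        records = list(enumerate(sorted(current)))
        produced = {pair for pair, _ in run_mapreduce(records, tc_map, tc_reduce)}
        if produced <= current:                # fixpoint reached: nothing new was derived
            return current
        current |= produced

print(sorted(transitive_closure({("1", "2"), ("2", "3"), ("3", "4")})))
# [('1', '2'), ('1', '3'), ('1', '4'), ('2', '3'), ('2', '4'), ('3', '4')]
```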
Smart transitive closure (recursive doubling)
• R: the edge relation
• Pi(x, y): the distance from x to y is in [0, 2^i – 1]
• Qi(x, y): the distance from x to y is exactly 2^i
1. P0(x, y) := { (x, x) | x is in R };
2. Q0 := R;
3. i := 0;
4. while Qi ≠ ∅ do
a) i := i + 1;
b) Pi(x, y) := π(x, y) ( Qi-1(x, z) ⋈ Pi-1(z, y) );
c) Pi := Pi ∪ Pi-1;
d) Qi(x, y) := π(x, y) ( Qi-1(x, z) ⋈ Qi-1(z, y) );
e) Qi := Qi − Pi;    (remove pairs that already have a shorter path)
How many rounds?
 Recursive doubling: log(|R|) rounds
 MapReduce: you know how to do join, union and set difference
A project: implement smart TC in MapReduce
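A minimal in-memory sketch of the recursive-doubling loop above; P and Q are Python sets here, but each step is a join, a union or a set difference, i.e., exactly the operators already implemented in MapReduce, so every iteration maps to a constant number of MapReduce rounds.

```python
def compose(Q, P):
    """Join-project: all (x, y) with (x, z) in Q and (z, y) in P."""
    by_src = {}
    for z, y in P:
        by_src.setdefault(z, []).append(y)
    return {(x, y) for x, z in Q for y in by_src.get(z, [])}

def smart_tc(R):
    nodes = {x for edge in R for x in edge}
    P = {(x, x) for x in nodes}      # P0: paths of length in [0, 2^0 - 1]
    Q = set(R)                       # Q0: shortest paths of length exactly 2^0 = 1
    while Q:
        P = P | compose(Q, P)        # Pi := Pi-1 ∪ π(Qi-1 ⋈ Pi-1)
        Q = compose(Q, Q) - P        # Qi := π(Qi-1 ⋈ Qi-1) − Pi: drop pairs with shorter paths
    return P                         # reflexive-transitive closure (the seed (x, x) pairs included)

print(sorted(smart_tc({("1", "2"), ("2", "3"), ("3", "4"), ("4", "5")})))
```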
Advantages of MapReduce
 Simple: one only needs to define two functions
no need to worry about how the data is stored, distributed and how
the operations are scheduled
 Scalability: a large number of low-end machines
• scale out (scale horizontally): add new computers to a distributed software application; low-cost “commodity” hardware
• scale up (scale vertically): upgrade, add (costly) resources to a single node
 independence: it can work with various storage layers
 flexibility: independent of data models or schema
Fault tolerance: why?
Fault tolerance
[Dataflow diagram: the input blocks are triplicated in the DFS; mappers produce <k2, v2> pairs and reducers produce <k3, v3> pairs as before]
 Detecting failures and reassigning the tasks of failed nodes to healthy nodes
 Redundancy checking to achieve load balancing
Able to handle an average of 1.2 failures per analysis job
Pitfalls of MapReduce
 No schema: schema-free
 No index: index-free
Why bad? Inefficient for joins
 A single dataflow: a single input and a single output (functional programming)
 No high-level language: no SQL
 Carries invariant input data in every round; no support for incremental computation: redundant computation
 No mechanism to maintain global data structures that can be accessed and updated by all mappers and reducers
 Maximal communication: all-to-all data shipment between mappers and reducers
Why low efficiency?
 Low efficiency: I/O optimization, utilization
Designed to minimize human effort, not to maximize performance
Inefficiency of MapReduce
[The same mapper/reducer dataflow diagram as before]
 Unnecessary shipment of invariant input data in each MapReduce round – HaLoop fixes this
 Blocking: Reduce does not start until all Map tasks are completed
Despite these, MapReduce is popular in industry
MapReduce platforms
 Apache Hadoop, used by Facebook, Yahoo, …
– Hive, Facebook, HiveQL (SQL)
– Pig, Yahoo, Pig Latin (SQL-like)
– SCOPE, Microsoft, SQL
 NoSQL
– Cassandra, Facebook, CQL (no join)
– HBase, Apache, modelled on Google’s BigTable
– MongoDB, document-oriented (NoSQL)
 Distributed graph query engines
– Pregel, Google
A vertex-centric model
– TAO: Facebook,
– GraphLab, machine learning and data mining
– Neo4j, Neo Technology; Trinity, Microsoft; HyperGraphDB (knowledge)
Study some of these
Vertex-centric models
The BSP model (Pregel)
Bulk Synchronous Parallel
 Vertex: the computation is defined to run on each vertex
 Supersteps (analogous to MapReduce rounds): within each, all vertices compute in parallel
– Each vertex modifies its state and that of its outgoing edges
– Sends messages to other vertices (received in the next superstep)
– Receives messages sent to it (from the previous superstep)
 Termination (message passing):
– Each vertex votes to halt
– Terminate when all vertices are inactive and no messages are in transit
 Synchronization: between supersteps
 Asynchronous: all vertices within each superstep
Vertex-centric, message passing
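A toy, single-machine sketch of this vertex-centric, superstep style, using the classic single-source shortest paths example; the Vertex class and superstep runner are simplified stand-ins, not Pregel's actual API.

```python
import math

class Vertex:
    def __init__(self, vid, out_edges):            # out_edges: list of (target_id, weight)
        self.id, self.out_edges = vid, out_edges
        self.value = math.inf                       # current best distance from the source
        self.active = True

    def compute(self, messages, outbox, source_id):
        best = min(messages, default=math.inf)
        if self.id == source_id:
            best = 0.0
        if best < self.value:                       # improved: update and notify neighbours
            self.value = best
            for target, weight in self.out_edges:
                outbox.setdefault(target, []).append(best + weight)
        self.active = False                         # vote to halt; reactivated by new messages

def run_supersteps(vertices, source_id):
    inbox = {v: [] for v in vertices}
    while True:
        outbox = {}
        for v in vertices.values():
            if v.active or inbox[v.id]:
                v.compute(inbox[v.id], outbox, source_id)
        if not outbox:                              # all inactive, no messages in transit
            return {v.id: v.value for v in vertices.values()}
        inbox = {v: outbox.get(v, []) for v in vertices}

g = {"a": Vertex("a", [("b", 1), ("c", 4)]), "b": Vertex("b", [("c", 1)]), "c": Vertex("c", [])}
print(run_supersteps(g, "a"))    # {'a': 0.0, 'b': 1.0, 'c': 2.0}
```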
The vertex-centric model of GraphLab
No supersteps
 Vertex: computation is defined to run on each vertex
 All vertices compute in parallel, asynchronously
– Each vertex reads and writes data on adjacent nodes or edges
 Consistency: serializability
– Full consistency: no overlap for concurrent updates
– Edge consistency: exclusive read-write on its vertex and adjacent edges; read-only on adjacent vertices
– Vertex consistency: all updates in parallel (sync operations)
 Asynchronous: all vertices
Machine learning, data mining
Vertex-centric models vs. MapReduce
 Vertex centric: think like a vertex; MapReduce: think like a graph
 Vertex centric: maximize parallelism – asynchronous, minimize
data shipment via message passing
MapReduce: inefficiency caused by blocking; distributing
intermediate results
 Vertex centric: limited to graphs; MapReduce: general
 Lack of global control: ordering for processing vertices in
recursive computation, incremental computation, etc
 Have to re-cast algorithms in these models, hard to reuse
existing (incremental) algorithms
Can we do better?
GRAPE: A parallel model based on partial evaluation
Querying distributed graphs
Given a big graph G, and n processors S1, …, Sn
G is partitioned into fragments (G1, …, Gn)
G is distributed to n processors: Gi is stored at Si
Parallel query answering
Input: G = (G1, …, Gn), distributed to (S1, …, Sn), and a query Q
Output: Q(G), the answer to Q in G
Each processor Si processes its local fragment Gi in parallel
Q(G) is assembled from Q(G1), Q(G2), …, Q(Gn)
How does it work?
Dividing a big G into small fragments of manageable size
GRAPE (GRAPh Engine)
Divide and conquer
 partition G into fragments (G1, …, Gn) of manageable sizes, distributed to various sites
 upon receiving a query Q,
• evaluate Q( Gi ) in parallel (on the smaller Gi)
• collect partial answers at a coordinator site, and assemble them to find the answer Q( G ) in the entire G
Each machine (site) Si processes the same query Q, using only the data stored in its local fragment Gi
Data-partitioned parallelism
Partial evaluation
Compute f( x ) = f( s, d ), where s is the known part of the input and d is the yet-unavailable part
 conduct the part of the computation that depends only on s
 generate a partial answer: a residual function of d
 Partial evaluation in distributed query processing: at each site, Gi is the known input and the other fragments Gj are the yet-unavailable input
• evaluate Q( Gi ) in parallel
• collect the partial matches (the residual functions) at a coordinator site, and assemble them to find the answer Q( G ) in the entire G
This is the connection between partial evaluation and parallel processing
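A tiny illustration of the idea: specialise a function on the known part s of its input, producing a residual function that waits for the missing part d. The function names are illustrative only.

```python
def f(s, d):
    return sum(s) + sum(d)

def partially_evaluate(s):
    known = sum(s)                      # all work that depends only on s is done now
    def residual(d):                    # the residual function: only d-dependent work remains
        return known + sum(d)
    return residual

partial_answer = partially_evaluate([1, 2, 3])     # computed locally, e.g. on a known fragment
print(partial_answer([10, 20]))                    # 36 – completed once the missing data arrives
```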
Coordinator
Each machine (site) Si is either
a coordinator
a worker: conduct local computation and produce partial answers
Coordinator: receive/post queries, control termination, and
assemble answers
 Upon receiving a query Q
• post Q to all workers
• Initialize a status flag for each worker, mutable by the
worker
 Terminate the computation when all flags are true
• Assemble partial answers from workers, and produce
the final answer Q(G)
Termination, partial answer assembling
Workers
Worker: conduct local computation and produce partial answers
 upon receiving a query Q,
• evaluate Q( Gi ) in parallel, using local data Gi only
• send messages to request data for “border nodes” (nodes with edges to other fragments)
 Incremental computation: upon receiving new messages M,
• evaluate Q( Gi + M ) in parallel, incrementally
 set its flag true if there are no more changes to the partial results, and send the partial answer to the coordinator
This step repeats until the partial answer at site Si is ready
Local computation, partial evaluation, recursion, partial answers
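A single-process sketch of the coordinator/worker loop just described: each worker partially evaluates Q on its fragment, exchanges messages about border nodes, refines its partial answer, and sets its flag once it stabilises. All class and function names are illustrative, not GRAPE's actual interface; a real deployment would run workers at separate sites with real messaging.

```python
class Worker:
    def __init__(self, fragment):
        self.fragment = fragment                   # the local fragment G_i
        self.partial = None
        self.done = False

    def evaluate(self, query, messages):
        new_partial = query(self.fragment, messages)     # partial evaluation on G_i (+ messages)
        self.done = (new_partial == self.partial)        # flag true: no more changes
        self.partial = new_partial
        return {}                                  # outgoing border-node messages; empty in this toy run

def route(outgoing):
    # deliver each message to the worker owning the target border node; trivial here
    return {w: {} for w in outgoing}

def coordinator(workers, query, assemble):
    inboxes = {w: {} for w in workers}
    while not all(w.done for w in workers):        # terminate when every flag is true
        outgoing = {w: w.evaluate(query, inboxes[w]) for w in workers}
        inboxes = route(outgoing)
    return assemble([w.partial for w in workers])  # assemble the partial answers

frags = [Worker([1, 5, 9]), Worker([2, 8]), Worker([7])]
count_big = lambda fragment, msgs: sum(1 for x in fragment if x >= 5)
print(coordinator(frags, count_big, sum))          # 4
```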
Subgraph isomorphism
 Input: A pattern graph Q and a graph G
 Output: All the matches of Q in G, i.e., all subgraphs of G that
are isomorphic to Q
i.e., there is a bijective function f on nodes such that (u, u’) is an edge in Q iff (f(u), f(u’)) is an edge in G
 partition G into fragments (G1, …, Gn), distributed to n workers
 Coordinator: upon receiving a pattern query Q
• post pattern Q to each worker
• set flag[i] false for each worker Si
How to compute partial answers?
Local computation at each worker
Worker:
 upon receiving a query Q,
• send messages to request the d-neighbourhood of each border node, where d is the diameter of Q; denote the fetched data by Gd
• compute the partial answer Q( Gi ∪ Gd ), by calling any existing subgraph isomorphism algorithm (no incremental computation is needed)
• set flag[i] true, and send Q( Gi ∪ Gd ) to the coordinator
Coordinator: the final answer Q(G) is simply the union of all Q( Gi ∪ Gd )
The algorithm is correct, why? Correctness analysis: by the data locality of subgraph isomorphism, a match of Q that contains a node of Gi can leave Gi only through a border node and stays within distance d of it, so it lies entirely inside Gi ∪ Gd
Performance guarantee
 Complexity analysis: let
• t(|G|, |Q|) be the sequential complexity
• T(|G|, |Q|, n) be the parallel complexity, where n is the number of processors
Then T(|G|, |Q|, n) = O(t(|G|, |Q|)/n) – a polynomial reduction
 Parallel scalability: the more processors, the faster the algorithm (provided that n << |G|). Why?
Proof: computational costs + communication costs
 Reuse of existing algorithms: the worker can employ any existing sequential algorithm for subgraph isomorphism (compare this with MapReduce)
Necessary parts for your project: correctness proof, complexity analysis, performance guarantees
Partial evaluation + data-partitioned parallelism
Transitive closures
 TC(x, y) :- R(x, y);
 TC(x, z) :- TC(x, y), TC(y, z).
 Represent R as a graph G: elements as nodes, and R as its edge
relation
 partition G into fragments (G1, …, Gn), distributed to n workers
Worker:
 compute TC( Gi ) in parallel
 send messages to request data for “border nodes”
 incremental step: given new messages M, incrementally compute TC( Gi + M )
 Termination: repeat until there are no more changes
Compared with MapReduce: this minimizes unnecessary recomputation
Summary and review
 What is the MapReduce framework? Pros? Pitfalls?
 How to implement joins in the MapReduce framework?
 Develop algorithms in MapReduce
 Vertex-centric models for querying big graphs
 Distributed query processing with performance guarantee by
partial evaluation?
 What performance guarantees do we want for evaluating graph queries on distributed graphs?
 Compare the four parallel models: MapReduce, BSP, vertex-centric, and GRAPE
 Correctness proof, complexity analysis and performance
guarantees for your algorithm
Project (1)
A “smart” algorithm for computing the transitive closure of R is given
in “Distributed Algorithms for the Transitive Closure”
http://asingleneuron.files.wordpress.com/2013/10/distributedalgorithmsfortransitiveclosure.pdf
1. TC(x, y) := ∅; Q := R;
2. while Q ≠ ∅ do
a) TC := Q ∪ TC;
b) TC := (Q ∘ TC) ∪ TC;
c) Q := (Q ∘ Q) − TC;
where ∘ denotes composition of binary relations (a join on the middle attribute followed by a projection)
 Implement the algorithm in the programming models of
• MapReduce (recall that you can implement union and join in MapReduce)
• GRAPE
 Give correctness proof and complexity analysis
 Experimentally evaluate and compare the implementations
Flashback: Graph simulation
 Input: a pattern graph Q and a graph G
 Output: Q(G), a binary relation S on the nodes of Q and G such that
• each node u in Q is mapped to a node v in G with (u, v) ∈ S
• for each (u, v) ∈ S, each edge (u, u’) in Q is mapped to an edge (v, v’) in G such that (u’, v’) ∈ S
Complexity: O((|V| + |VQ|) (|E| + |EQ|)) time
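A direct fixpoint sketch of graph simulation following the definition above: start with all candidate pairs allowed by a node-compatibility check, then repeatedly remove (u, v) if some pattern edge (u, u') has no matching edge (v, v') left in the relation. The `can_match` predicate stands in for whatever label check is used and is an assumption.

```python
def graph_simulation(pattern_edges, pattern_nodes, graph_edges, graph_nodes, can_match):
    succ_q = {u: {v for a, v in pattern_edges if a == u} for u in pattern_nodes}
    succ_g = {u: {v for a, v in graph_edges if a == u} for u in graph_nodes}
    sim = {u: {v for v in graph_nodes if can_match(u, v)} for u in pattern_nodes}
    changed = True
    while changed:                                   # fixpoint: stop when nothing is removed
        changed = False
        for u in pattern_nodes:
            for v in list(sim[u]):
                # every pattern edge (u, u2) needs some graph edge (v, v2) with v2 in sim[u2]
                if any(not (succ_g[v] & sim[u2]) for u2 in succ_q[u]):
                    sim[u].discard(v)
                    changed = True
    return sim

# toy run: pattern a->b over a graph 1->2, 1->3, 3->1, with node compatibility given by labels
pattern = [("a", "b")]
graph = [(1, 2), (1, 3), (3, 1)]
labels = {"a": {1, 3}, "b": {2}}                     # which graph nodes may match which pattern node
print(graph_simulation(pattern, ["a", "b"], graph, [1, 2, 3], lambda u, v: v in labels[u]))
# {'a': {1}, 'b': {2}}
```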
Project (2)
 Develop two algorithms that, given a graph G and a pattern Q, compute Q(G) via graph simulation, based on two of the following models:
• MapReduce
• BSP (Pregel)
• The vertex-centric model of GraphLab
• GRAPE
 Give correctness proof and complexity analysis of your algorithms
 Experimentally evaluate your algorithms
Project (3)
 Given a directed graph G and a pair of nodes (s, t) in G, the distance between s and t is the length of a shortest path from s to t.
 Develop two algorithms that, given a graph G, compute the shortest distance for all pairs of nodes in G, based on two of the following models:
• MapReduce
• BSP (Pregel)
• The vertex-centric model of GraphLab
• GRAPE
 Give correctness proof and complexity analysis of your algorithms
 Experimentally evaluate your algorithms
Project (4)
 Show that GRAPE can be revised to support relational queries, based on what you have learned about parallel/distributed query processing. Show
• how to support relational operators
• how to optimize your queries
 Implement a lightweight system to support relational queries in GRAPE
• basic relational operators
• SPC queries
 Demonstrate your system
Project (5)
 Write a survey on any of the following topics, by evaluating 5-6 representative papers/systems:
• Parallel models for query evaluation
• Distributed graph query engines
• Distributed database systems
• MapReduce platforms
• NoSQL systems
• …
 Develop a set of criteria, and evaluate techniques/systems based on the criteria
 Demonstrate your understanding of the topics
Reading for the next week
http://homepages.inf.ed.ac.uk/wenfei/publication.html
1. W. Fan, F. Geerts, and F. Neven. Making Queries Tractable on Big Data with Preprocessing. VLDB 2013. (BD-tractability)
2. Y. Tao, W. Lin, X. Xiao. Minimal MapReduce Algorithms. SIGMOD 2013. (MMC) http://www.cse.cuhk.edu.hk/~taoyf/paper/sigmod13-mr.pdf
3. J. Hellerstein. The Declarative Imperative: Experiences and Conjectures in Distributed Logic. SIGMOD Record 39(1), 2010. (communication complexity) http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS2010-90.pdf
4. Y. Cao, W. Fan, and W. Yu. Bounded Conjunctive Queries. VLDB 2014. (scale independence)
5. W. Fan, J. Li, X. Wang, and Y. Wu. Query Preserving Graph Compression. SIGMOD 2012. (query-preserving compression)
6. W. Fan, X. Wang, and Y. Wu. Answering Graph Pattern Queries using Views. ICDE 2014. (query answering using views)
7. F. Afrati, V. Borkar, M. Carey, N. Polyzotis, and J. D. Ullman. MapReduce Extensions and Recursive Queries. EDBT 2011. https://asterix.ics.uci.edu/pub/EDBT11-afrati.pdf