SKEW IN PARALLEL
QUERY PROCESSING
Paraschos Koutris
Paul Beame
Dan Suciu
University of Washington
PODS 2014
MOTIVATION
• Understand the complexity of parallel query processing on
big data
– on shared-nothing architectures (e.g. MapReduce)
– even in the presence of data skew
• Dominating parameters of computation:
– Communication cost
– Number of communication rounds
THE MPC MODEL
• Computation proceeds in synchronous rounds
– Local Computation
– Global Communication
[Figure: p servers, numbered 1 to p, start from the INPUT (size = M), exchange data over rounds 1 to r, and produce the OUTPUT]
• The number of bits received by each server in each round is ≤ L
THE MPC MODEL
• The maximum load L ranges between two extremes:
– L = M/p: the data is evenly distributed; maximizes parallelism
– L = M: equivalent to sequential computation; no parallelism
What is the minimum load L of an MPC algorithm that
computes a Conjunctive Query Q in one round?
[Beame, K, Suciu, PODS 2013] Tight upper and lower bounds for
relations of equal size (M bits) and no skew
RESULTS
Computing a Conjunctive Query Q in the MPC model in one
round for relations with different sizes and skew
• Matching upper and lower bounds for any skew-free input
database and different relation sizes
• Almost matching upper and lower bounds in the presence
of skew
– Matching bounds in the case of simple joins
CONJUNCTIVE QUERIES
• Full Conjunctive Queries without self-joins:
– Q(x, y, z) = R(x, y), S(y, z), T(z, x) [the triangle query]
• The hypergraph of the query Q:
– Variables as vertices
– Atoms as hyperedges
[Figure: hypergraph of the triangle query, with vertices x, y, z and hyperedges R, S, T]
EXAMPLE: CARTESIAN PRODUCT
• The cartesian product:
Q(x,y) = S1(x), S2(y) with cardinalities m1, m2
• ALGORITHM
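– Organize the p servers in a $p_1 \times p_2$ rectangle, with $p = p_1 p_2$
– Send each tuple to a whole row or column of servers:
S1(x) → (h1(x), *)
S2(y) → (*, h2(y))
– The load will be $L = m_1/p_1 + m_2/p_2$
– To minimize L choose $p_1 = \sqrt{p\, m_1/m_2}$ and $p_2 = \sqrt{p\, m_2/m_1}$, which gives $L = 2\sqrt{m_1 m_2 / p}$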
• The algorithm is optimal
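The rectangle algorithm above can be sketched in a few lines. This is a minimal illustration, assuming the load is counted in tuples as m1/p1 + m2/p2 and that the hash functions are simple modulo hashes; the function names `rectangle_shares` and `max_load` are hypothetical.

```python
import math
from collections import Counter

def rectangle_shares(p, m1, m2):
    """Real-valued shares minimizing m1/p1 + m2/p2 subject to p1 * p2 = p."""
    p1 = math.sqrt(p * m1 / m2)
    return p1, p / p1

def max_load(s1, s2, p1, p2):
    """Route S1(x) to every server in row h1(x) and S2(y) to every server in
    column h2(y); return the largest number of tuples any server receives."""
    load = Counter()
    for x in s1:
        for col in range(p2):
            load[(x % p1, col)] += 1   # h1(x) = x mod p1 (illustrative hash)
    for y in s2:
        for row in range(p1):
            load[(row, y % p2)] += 1   # h2(y) = y mod p2 (illustrative hash)
    return max(load.values())

# 16 servers, |S1| = 400, |S2| = 100: the shares come out as an 8 x 2 rectangle.
p1, p2 = rectangle_shares(p=16, m1=400, m2=100)
observed = max_load(range(400), range(100), int(p1), int(p2))
predicted = 2 * math.sqrt(400 * 100 / 16)
```

On this evenly spread input the observed maximum load coincides with the predicted value; with integer shares that do not divide the cardinalities evenly, the two would only agree approximately.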
LOWER BOUNDS (1)
• For a cartesian product Q = S1 × S2 × … × Su with relation sizes $M_1, \dots, M_u$, the lower bound for the load is $L = \Omega\big((M_1 M_2 \cdots M_u / p)^{1/u}\big)$
• For a Conjunctive Query Q(x1,…, xk) = S1(…), …, Sl(…) any
subset of relations Sj1, Sj2, …, Sju without shared variables
(an edge packing for the hypergraph of Q) gives a lower
bound for the load
• The lower bound also holds with any fractional edge
packing
LOWER BOUNDS (2)
Theorem For a Conjunctive Query Q, where relation Sj
has size Mj (in bits), any MPC algorithm that computes Q
in one round with maximum load L must satisfy, for some
constant c and for any fractional edge packing $u = (u_1, \dots, u_l)$:
$$L \ge c \cdot \left( \frac{\prod_j M_j^{u_j}}{p} \right)^{1/\sum_j u_j}$$
Proof techniques:
• Using entropy to bound knowledge
• Friedgut’s inequality to bound the maximum size of a query answer
HYPERCUBE ALGORITHM
• Q(x1,…, xk) = S1(…), …, Sl(…)
• For each variable xi define the share to be an integer pi such that:
p = p1 × … × pk
• Assign each of the p servers to a point on the k-dimensional hypercube:
[p] = [p1] × … × [pk]
• Hash each tuple to the appropriate subcube
e.g. S3(x3, x4) → (*, *, h3(x3), h4(x4), *, …)
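The routing step can be illustrated with a short sketch. It assumes 0-based variable indices and modulo hash functions, both of which are illustrative choices; `destinations` is a hypothetical helper name.

```python
from itertools import product

def destinations(shares, atom_vars, tup, hashes):
    """Servers that receive `tup`, a tuple of an atom over the variables with
    indices `atom_vars`. Coordinates of the atom's variables are fixed by
    hashing; every other coordinate ranges over its whole dimension (the '*')."""
    fixed = {i: hashes[i](value) for i, value in zip(atom_vars, tup)}
    axes = [[fixed[i]] if i in fixed else range(shares[i])
            for i in range(len(shares))]
    return list(product(*axes))

shares = (2, 3, 2, 2)                                   # p = 24 servers, k = 4
hashes = [lambda v, s=s: v % s for s in shares]         # illustrative hashes
# Atom S3(x3, x4) uses variable indices 2 and 3; route its tuple (5, 8).
targets = destinations(shares, (2, 3), (5, 8), hashes)
```

The tuple is replicated once per point of the unconstrained dimensions, here 2 × 3 = 6 servers, all sharing the last two coordinates.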
EXAMPLE: THE TRIANGLE QUERY
• Algorithm: [Ganguly ’92, Afrati ’10, Suri ’11]
– The p servers form a cube: $[p^{1/3}] \times [p^{1/3}] \times [p^{1/3}]$
– Send each tuple to servers (each tuple is replicated $p^{1/3}$ times):
• R(a, b) → (hx(a), hy(b), - )
• S(b, c) → (-, hy(b), hz(c) )
• T(c, a) → (hx(a), -, hz(c) )
– The tuples of a triangle (a, b, c) meet at server (hx(a), hy(b), hz(c))
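A small single-process simulation can check that this routing finds every triangle. It is only a sketch: the same modulo hash is assumed for hx, hy and hz, and `hypercube_triangles` is a hypothetical name.

```python
from collections import defaultdict

def hypercube_triangles(R, S, T, side):
    """Simulate the one-round triangle algorithm on a side^3 cube of servers."""
    def h(v):
        return v % side  # illustrative hash shared by all three dimensions

    servers = defaultdict(lambda: {"R": set(), "S": set(), "T": set()})
    for a, b in R:                       # R(a, b) -> (hx(a), hy(b), *)
        for k in range(side):
            servers[(h(a), h(b), k)]["R"].add((a, b))
    for b, c in S:                       # S(b, c) -> (*, hy(b), hz(c))
        for i in range(side):
            servers[(i, h(b), h(c))]["S"].add((b, c))
    for c, a in T:                       # T(c, a) -> (hx(a), *, hz(c))
        for j in range(side):
            servers[(h(a), j, h(c))]["T"].add((c, a))

    found = set()
    for local in servers.values():       # local join on each server
        for a, b in local["R"]:
            for b2, c in local["S"]:
                if b2 == b and (c, a) in local["T"]:
                    found.add((a, b, c))
    return found

R = {(1, 2), (2, 3), (4, 5)}
S = {(2, 3), (3, 1), (5, 6)}
T = {(3, 1), (1, 2), (6, 7)}
found = hypercube_triangles(R, S, T, side=2)   # p = 8 servers
```

Each triangle is produced by exactly one server, the one whose three coordinates are the hashes of a, b and c, so no cross-server deduplication is needed.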
ANALYSIS OF HYPERCUBE (1)
• For a vector of shares p = (p1, …, pk), how is relation Sj
distributed to the servers?
• Ideally, each server receives $M_j / \prod_{i:\, x_i \in \mathrm{vars}(S_j)} p_i$ tuples
• Example: relation R(x, y) of the triangle query, hashed onto a $p^{1/3} \times p^{1/3}$ grid of cells
– Ideal load $L = M / \#\text{cells} = M/p^{2/3}$
– If R has a single value in the x-column, the load will instead be $M/p^{1/3}$
– The load will be $O(M/p^{2/3})$ if each value appears in the x and y columns at most $M/p^{1/3}$ times
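The two loads in this example can be reproduced numerically. The sketch below assumes p = 64 (so each side of the grid is 4), M = 256 tuples, and a modulo hash; `max_cell_load` is a hypothetical name.

```python
from collections import Counter

def max_cell_load(R, side):
    """Largest number of R(x, y) tuples hashed to one (hx, hy) cell."""
    cells = Counter((a % side, b % side) for a, b in R)
    return max(cells.values())

side = 4                                                    # p^(1/3) for p = 64
uniform = [(a, b) for a in range(16) for b in range(16)]    # 256 tuples, no skew
skewed = [(0, b) for b in range(256)]                       # one value in the x-column

uniform_load = max_cell_load(uniform, side)   # M / p^(2/3) = 256 / 16
skewed_load = max_cell_load(skewed, side)     # M / p^(1/3) = 256 / 4
```

With a single x-value, only one row of the grid is ever used, which is why the load degrades by a factor of p^(1/3).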
ANALYSIS OF HYPERCUBE (2)
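• In general, a relation Sj is skew-free w.r.t. p if for any
subset of variables x of vars(Sj), every value appears at
most $M_j / \prod_{x_i \in \mathbf{x}} p_i$ times
• If every relation is skew-free w.r.t. p then the maximum
load of the HYPERCUBE algorithm is:
$$O\!\left( \max_j \frac{M_j}{\prod_{i:\, x_i \in \mathrm{vars}(S_j)} p_i} \right)$$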
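A skew-freeness check along these lines might look as follows. It is a sketch under an assumption: the frequency threshold for a subset of columns is taken to be the relation size divided by the product of the shares of those columns, which agrees with the M/p^(1/3) threshold of the triangle example. The name `is_skew_free` is hypothetical.

```python
from collections import Counter
from itertools import combinations

def is_skew_free(tuples, shares):
    """tuples: list of equal-length tuples; shares[c]: share of column c.
    True if, for every non-empty subset of columns, no combination of values
    occurs more than len(tuples) / prod(shares of those columns) times."""
    if not tuples:
        return True
    m = len(tuples)
    arity = len(shares)
    for size in range(1, arity + 1):
        for cols in combinations(range(arity), size):
            bound = m
            for c in cols:
                bound /= shares[c]
            freq = Counter(tuple(t[c] for c in cols) for t in tuples)
            if max(freq.values()) > bound:
                return False
    return True

uniform = [(a, b) for a in range(16) for b in range(16)]
skewed = [(0, b) for b in range(256)]
uniform_ok = is_skew_free(uniform, shares=(4, 4))
skewed_ok = is_skew_free(skewed, shares=(4, 4))
```

The check enumerates every subset of columns, so it is exponential in the arity; that is acceptable for the small arities typical of query atoms.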
ANALYSIS OF HYPERCUBE (3)
• The maximum load of the HYPERCUBE algorithm is always
bounded by
$$O\!\left( \max_j \frac{M_j}{\min_{i:\, x_i \in \mathrm{vars}(S_j)} p_i} \right)$$
• Join with shares $p_x = p_y = p_z = p^{1/3}$
– For a skew-free database, the load is $O(M/p^{2/3})$
– Otherwise, the load is always bounded by $O(M/p^{1/3})$
COMPUTING THE SHARES
• The optimal shares are computed by solving a Linear Program (LP)
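For small p, the effect of the share computation can be reproduced without an LP solver. The sketch below swaps in a brute-force search over integer share vectors, which is not the LP itself; it assumes the skew-free load of atom j is its size divided by the product of the shares of its variables. The name `best_integer_shares` is hypothetical.

```python
from itertools import product
from math import prod

def best_integer_shares(p, atoms, sizes, num_vars):
    """Exhaustive search (a stand-in for the LP): among integer share vectors
    whose product is at most p, return (load, shares) minimizing
    max_j sizes[j] / prod(shares[i] for i in atoms[j])."""
    best = None
    for shares in product(range(1, p + 1), repeat=num_vars):
        if prod(shares) > p:
            continue
        load = max(sizes[j] / prod(shares[i] for i in atoms[j])
                   for j in range(len(atoms)))
        if best is None or load < best[0]:
            best = (load, shares)
    return best

# Triangle query R(x, y), S(y, z), T(z, x) with variables x=0, y=1, z=2.
triangle = best_integer_shares(64, [(0, 1), (1, 2), (2, 0)], [6400] * 3, 3)
# Simple join S1(x, z), S2(y, z).
join = best_integer_shares(16, [(0, 2), (1, 2)], [1600, 1600], 3)
```

The search is exponential in the number of variables and only meant to make the objective concrete; for the triangle it returns equal shares, and for the simple join it puts all servers on z.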
ANALYSIS OF HYPERCUBE
By using an LP duality argument, we can prove that the load
matches the lower bound
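Theorem For a conjunctive query Q, where relation Sj
has size Mj and is skew-free, there exist shares such that
the HYPERCUBE algorithm runs with maximum load
$$O\!\left( \max_{u \in pk(Q)} \left( \frac{\prod_j M_j^{u_j}}{p} \right)^{1/\sum_j u_j} \right)$$
where pk(Q) is the set of all fractional edge packings of Q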
EDGE PACKINGS FOR THE TRIANGLE
Q(x, y, z) = R(x, y), S(y, z), T(z, x)
[Figure: hypergraph of the triangle query, with vertices x, y, z and hyperedges R, S, T]
Edge packing u       Load (asymptotic)
(1/2, 1/2, 1/2)      $(M_R M_S M_T)^{1/3} / p^{2/3}$
(1, 0, 0)            $M_R / p$
(0, 1, 0)            $M_S / p$
(0, 0, 1)            $M_T / p$
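The table entries can be evaluated for concrete sizes with a short helper. It assumes the load associated with a packing u is (∏ Mj^uj / p)^(1/Σ uj), an expression that reproduces all four rows above; `packing_load` is a hypothetical name and the sizes are made-up illustrative numbers.

```python
def packing_load(u, sizes, p):
    """Load associated with packing u: (prod_j M_j^{u_j} / p)^(1 / sum_j u_j)."""
    value = 1.0
    for uj, mj in zip(u, sizes):
        value *= mj ** uj
    return (value / p) ** (1.0 / sum(u))

packings = [(0.5, 0.5, 0.5), (1, 0, 0), (0, 1, 0), (0, 0, 1)]

# Equal sizes: the (1/2, 1/2, 1/2) packing gives the largest load, M / p^(2/3).
equal = [packing_load(u, (4096, 4096, 4096), 64) for u in packings]

# One very large relation: the packing (1, 0, 0) dominates with M_R / p.
lopsided = [packing_load(u, (2 ** 20, 64, 64), 64) for u in packings]
```

Which packing attains the maximum depends on the relative sizes, which is why the bound is stated as a maximum over all packings rather than for a single one.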
THE PRESENCE OF SKEW
• A simple join
Q(x,y,z) = S1(x, z), S2(y, z)
• Optimal shares px = py = 1, pz = p
– Standard parallel hash-join
– If the database has no skew, L = O(max{M1, M2} /p)
– If it is skewed, the load can be as bad as O(M) (all tuples
are sent to the same server)
• For any value h of z, mj(h) = frequency of h in Sj
SKEW-AWARE JOIN (1)
Q(x,y,z) = S1(x, z), S2(y, z)
• Idea: identify the heavy hitters and treat them differently
• h is a heavy hitter in Sj if mj(h) > Mj/p
• h is light otherwise
CASE 1 (LIGHT)
• For all light values h, run the HyperCube algorithm (hash-join on z) on all p servers
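Splitting the values of z into heavy and light is a simple frequency count. A minimal sketch, assuming the relation fits in memory and that sizes are counted in tuples; `heavy_hitters` is a hypothetical name.

```python
from collections import Counter

def heavy_hitters(relation, z_index, p):
    """Values of the join column whose frequency exceeds len(relation) / p,
    returned together with their frequencies."""
    freq = Counter(t[z_index] for t in relation)
    threshold = len(relation) / p
    return {h: count for h, count in freq.items() if count > threshold}

# S1(x, z): the value z = 0 appears in half of the 100 tuples.
s1 = [(i, 0) for i in range(50)] + [(i, i) for i in range(1, 51)]
heavy = heavy_hitters(s1, z_index=1, p=10)
light_part = [t for t in s1 if t[1] not in heavy]   # goes to the hash-join
```

At most p values can exceed the threshold in one relation, so the set of heavy hitters stays small enough to be broadcast to all servers.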
SKEW-AWARE JOIN (2)
CASE 2 (HEAVY)
For any heavy hitter h (either in S1 or S2)
• Compute the residual query (a cartesian product)
Q[z\h] = S1(x, h), S2(y, h)
using ph exclusive servers.
• Choose ph such that
– The sum of the ph is O(p)
– The load for every residual query Q[z\h] is the same
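One way to meet both conditions is to hand out servers in proportion to the size of each residual cartesian product. This is a simplified sketch, not necessarily the allocation used in the paper: it assumes the residual query for h on ph servers has load proportional to sqrt(m1(h)·m2(h)/ph), and it returns real-valued ph that would still need rounding up. The names `allocate_servers` and `residual_load` are hypothetical.

```python
import math

def allocate_servers(m1, m2, p):
    """m1, m2: dicts mapping each heavy hitter h to its frequency in S1, S2.
    Returns real-valued p_h proportional to m1(h) * m2(h), summing to p."""
    weights = {h: m1.get(h, 0) * m2.get(h, 0) for h in set(m1) | set(m2)}
    total = sum(weights.values())
    return {h: p * w / total for h, w in weights.items()}

def residual_load(m1_h, m2_h, p_h):
    """Load of the cartesian product S1(x, h) x S2(y, h) on p_h servers,
    up to a constant factor."""
    return math.sqrt(m1_h * m2_h / p_h)

m1 = {"a": 100, "b": 50}
m2 = {"a": 100, "b": 200}
alloc = allocate_servers(m1, m2, p=8)
loads = {h: residual_load(m1[h], m2[h], alloc[h]) for h in alloc}
```

Because ph is proportional to m1(h)·m2(h), the ratio under the square root is the same for every heavy hitter, which equalizes the residual loads; rounding each ph up adds at most one server per heavy hitter, keeping the total in O(p).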
SKEW: SIMPLE JOIN
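Theorem Any MPC algorithm that computes the join
query in one round must satisfy:
$$L = \Omega\!\left( \max\left\{ \frac{M_1}{p},\ \frac{M_2}{p},\ \sqrt{\frac{\sum_h m_1(h)\, m_2(h)}{p}} \right\} \right)$$
The skew-aware join achieves the above optimal load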
SKEW IN CONJUNCTIVE QUERIES
• For any conjunctive query Q, our algorithm computes the
light values using HYPERCUBE
– Since there is no skew, this part is optimal
• For the heavy hitters, it considers the residual queries and
assigns each of them an appropriate number of exclusive servers
– The values of the heavy hitters and their frequencies must be known
to the algorithm
CONCLUSION
Summary
• Upper and lower bounds for computing Conjunctive
Queries in the MPC model in the presence of skew
Open Problems
• What is the load L when we consider more rounds?
• How do other classes of queries behave?
Thank you!
DUALITY: EDGE PACKING
Fractional edge packing: assign a weight uj ≥ 0 to each atom Sj such that, for each
variable xi, the sum of the weights of the atoms that contain it is at most 1
q(x, y, z) = R(x, y), S(y, z), T(z, x)
[Figure: hypergraph of the triangle query with weight 1/2 on each of the hyperedges R, S, T]
By duality, the minimum value of the LP is equal to the
maximum value, over all edge packings u in pk(q), of
$$\left( \frac{\prod_j M_j^{u_j}}{p} \right)^{1/\sum_j u_j}$$
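The duality can be checked by hand for the triangle query with equal sizes M. The derivation below is a sketch under an assumption: the share LP is written in logarithmic scale with exponents $e_i = \log_p p_i$, $\mu = \log_p M$ and objective $\lambda = \log_p L$, and $M \ge p^{2/3}$ so that $\lambda \ge 0$. This formulation is not taken from the slides.

```latex
\begin{align*}
\text{minimize } \lambda \quad \text{subject to} \quad
  & e_x + e_y + e_z \le 1, \\
  & e_x + e_y + \lambda \ge \mu, \quad
    e_y + e_z + \lambda \ge \mu, \quad
    e_z + e_x + \lambda \ge \mu, \\
  & e_x, e_y, e_z, \lambda \ge 0 .
\end{align*}
% Adding the three atom constraints gives 2(e_x + e_y + e_z) + 3\lambda \ge 3\mu.
% With e_x + e_y + e_z \le 1 this yields \lambda \ge \mu - 2/3,
% and the bound is attained at e_x = e_y = e_z = 1/3.
\begin{align*}
\lambda^{*} = \mu - \tfrac{2}{3}
\quad\Longrightarrow\quad
L = p^{\lambda^{*}} = \frac{M}{p^{2/3}}
  = \left( \frac{M^{1/2}\, M^{1/2}\, M^{1/2}}{p} \right)^{1/(3/2)} .
\end{align*}
```

The last expression is the value of the packing (1/2, 1/2, 1/2), so for equal sizes the LP optimum and the best edge packing agree.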