Nonlinear Programming Models under Equality Constraints - Intelligent Data Systems Laboratory

Optimizing Joins in a Map-Reduce Environment
EDBT 2010
Presented by Foto Afrati, Jeffrey D. Ullman
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2010-11-12
Summarized by Jaeseok Myung
Outline
• Introduction
• 2-Way Join vs. Multi-Way Join
• Optimization of Multi-Way Joins
• Important Special Cases
• Experiments
• Conclusion
Center for E-Business Technology
Copyright  2010 by CEBT
IDS Lab. Seminar – 2/33
A Model for Cluster Computing
• Files: a file is a set of tuples, stored in a file system such as GFS
  – Many processes can read and write a file in parallel
• Assumption: an infinite supply of processors
  – Any process (job) can be assigned to any one processor
The Cost Measure for MR Algorithms
• The communication cost of a process is the size of the input to the process
  – This paper does not count the output size for a process
    – The output must be input to at least one other process
    – The final output is much smaller than its input
• The total communication cost is the sum of the communication costs of all processes that constitute an algorithm
• The elapsed communication cost is defined on the acyclic graph of processes
  – Consider a path through this graph, and sum the communication costs of the processes along that path
  – The maximum sum, over all paths, is the elapsed communication cost
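The two cost measures can be illustrated with a small sketch. The process names, input sizes, and edges below are hypothetical and are not taken from the paper; the code only assumes that the processes form an acyclic graph, as the slide states.

```python
from functools import lru_cache

# Hypothetical process graph: each process has an input size (its
# communication cost) and a list of downstream processes it feeds.
input_size = {"map1": 100, "map2": 80, "reduce1": 120, "reduce2": 60}
feeds = {
    "map1": ["reduce1"],
    "map2": ["reduce1", "reduce2"],
    "reduce1": [],
    "reduce2": [],
}


def total_cost(input_size):
    """Total communication cost: sum of the input sizes of all processes."""
    return sum(input_size.values())


def elapsed_cost(input_size, feeds):
    """Elapsed communication cost: the largest sum of input sizes along
    any path through the acyclic process graph."""

    @lru_cache(maxsize=None)
    def heaviest_path_from(p):
        # Cost of p itself plus the heaviest path among its successors.
        return input_size[p] + max(
            (heaviest_path_from(q) for q in feeds[p]), default=0
        )

    return max(heaviest_path_from(p) for p in input_size)


print(total_cost(input_size))           # 360
print(elapsed_cost(input_size, feeds))  # 220 (path map1 -> reduce1)
```

The total cost counts every process once, while the elapsed cost follows only the heaviest chain of dependent processes.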
In this paper,
• We begin an investigation into optimization issues for algorithms implemented in the MR environment
• In particular, we are interested in algorithms that minimize the total communication cost
  – We begin the study of 2-way and multi-way joins
  – We introduce the notion of a “share” for each attribute of the map-key. The product of the shares is a fixed constant k, which is the number of Reduce processes we shall use to implement the join
  – The heart of the paper explores how to choose the map-key and shares to minimize the communication cost
2-Way Join in MapReduce
Join of R(A,B) and S(B,C) on B

Input R (A, B):  (a0, b0), (a1, b1), (a2, b2), …
Input S (B, C):  (b0, c0), (b0, c1), (b1, c2), …

Map → Reduce input (key K = B-value, value V = other attribute tagged with its relation):
  K = b0:  (a0, R), (c0, S), (c1, S)
  K = b1:  (a1, R), (c2, S)
  …

Reduce → Final output (A, B, C):
  (a0, b0, c0), (a0, b0, c1), (a1, b1, c2), …
2-Way Join in MapReduce
• Suppose we use k Reduce processes
• The output of any Map process with key b is sent to the Reduce process for hash value h(b)
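A minimal single-machine sketch of this 2-way join follows, using the sample tuples from the slides. The function names are illustrative, and Python's built-in `hash` stands in for the hash function h; a real MapReduce job would distribute these phases across processes.

```python
from collections import defaultdict

R = [("a0", "b0"), ("a1", "b1"), ("a2", "b2")]   # R(A, B)
S = [("b0", "c0"), ("b0", "c1"), ("b1", "c2")]   # S(B, C)


def map_phase(R, S):
    """Emit (key, value) pairs keyed on the join attribute B."""
    for a, b in R:
        yield b, (a, "R")
    for b, c in S:
        yield b, (c, "S")


def shuffle(pairs, k):
    """Route each pair to one of k Reduce processes by hashing the key."""
    reducers = [defaultdict(list) for _ in range(k)]
    for key, value in pairs:
        reducers[hash(key) % k][key].append(value)
    return reducers


def reduce_phase(groups):
    """Join the R-values and S-values that share a key b."""
    for b, values in groups.items():
        r_side = [v for v, tag in values if tag == "R"]
        s_side = [v for v, tag in values if tag == "S"]
        for a in r_side:
            for c in s_side:
                yield a, b, c


def two_way_join(R, S, k=4):
    out = []
    for groups in shuffle(map_phase(R, S), k):
        out.extend(reduce_phase(groups))
    return sorted(out)


print(two_way_join(R, S))
# [('a0', 'b0', 'c0'), ('a0', 'b0', 'c1'), ('a1', 'b1', 'c2')]
```

Because every tuple with the same B-value lands on the same Reduce process, the result does not depend on the choice of k.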
Joining Several Relations at Once
[Figure: inputs R(A,B), S(B,C), T(C,D) → Map → Reduce input → Reduce → Final output]
Joining Several Relations at Once
R(A,B)   S(B,C)   T(C,D)
• Suppose we use $k = m^2$ Reduce processes for some m
  – Values of B and C will each be hashed to m buckets
• Let h be a hash function with range 1, 2, …, m
  – Each tuple S(b, c) is sent to the Reduce process (h(b), h(c))
Joining Several Relations at Once
R(A,B)   S(B,C)   T(C,D)
• Let h be a hash function with range 1, 2, …, m
  – S(b, c) -> (h(b), h(c))
  – R(a, b) -> (h(b), all)
  – T(c, d) -> (all, h(c))
• Each Reduce process computes the join of the tuples it receives
[Figure: grid of Reduce processes indexed by h(b) = 0, 1, 2, 3 and h(c) = 0, 1, 2, 3 (m = 4, # of Reduce processes: $k = 4^2 = 16$). An S tuple with h(S.b) = 2 and h(S.c) = 1 goes to a single cell; an R tuple with h(R.b) = 2 goes to every cell with h(b) = 2; a T tuple with h(T.c) = 1 goes to every cell with h(c) = 1.]
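The routing rules above can be sketched as a single-machine simulation. The sample tuples and the function name are made up for illustration, buckets are numbered 0 to m-1 as in the figure, and Python's `hash` stands in for h.

```python
from collections import defaultdict
from itertools import product

R = [("a0", "b0"), ("a1", "b1")]                 # R(A, B)
S = [("b0", "c0"), ("b0", "c1"), ("b1", "c2")]   # S(B, C)
T = [("c0", "d0"), ("c2", "d1"), ("c2", "d2")]   # T(C, D)


def three_way_join(R, S, T, m):
    """Join R, S, T in one round on an m x m grid of Reduce processes."""

    def h(value):
        return hash(value) % m          # bucket in 0 .. m-1

    cells = defaultdict(lambda: {"R": [], "S": [], "T": []})
    sent = 0                            # tuples shipped to reducers

    for b, c in S:                      # S goes to exactly one cell
        cells[h(b), h(c)]["S"].append((b, c))
        sent += 1
    for a, b in R:                      # R is replicated to (h(b), all)
        for j in range(m):
            cells[h(b), j]["R"].append((a, b))
            sent += 1
    for c, d in T:                      # T is replicated to (all, h(c))
        for i in range(m):
            cells[i, h(c)]["T"].append((c, d))
            sent += 1

    out = []
    for cell in cells.values():         # each reducer joins locally
        for (a, b), (b2, c), (c2, d) in product(cell["R"], cell["S"], cell["T"]):
            if b == b2 and c == c2:
                out.append((a, b, c, d))
    return sorted(out), sent


result, sent = three_way_join(R, S, T, m=4)
print(result)
# [('a0', 'b0', 'c0', 'd0'), ('a1', 'b1', 'c2', 'd1'), ('a1', 'b1', 'c2', 'd2')]
print(sent)   # 3 + 4 * (2 + 3) = 23
```

Each output tuple is produced exactly once, at the cell that received its S tuple, so no deduplication step is needed in this sketch.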
Joining Several Relations at Once
R(A,B)   S(B,C)   T(C,D)
• Suppose h(b) is one of { 0, 1, 2, …, 9 } and h(c) is one of { a, b, c, …, z }
• The map-key is then one of
  – { 0a, 0b, …, 0z, 1a, …, 1z, …, 9z }
• For relation S
  – Each tuple (b, c) is a value, and its key is the single map-key h(b)h(c)
• For relation R
  – Each tuple (a, b) is replicated; its keys are h(b)a, h(b)b, …, h(b)z
• For relation T
  – Each tuple (c, d) is replicated; its keys are 0h(c), 1h(c), …, 9h(c)
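As a small illustration of this digit-and-letter example, the sketch below enumerates the map-keys and counts how many keys a tuple of each relation receives. The helper names are hypothetical.

```python
from itertools import product
from string import ascii_lowercase, digits

B_BUCKETS = list(digits)            # possible values of h(b): '0'..'9'
C_BUCKETS = list(ascii_lowercase)   # possible values of h(c): 'a'..'z'

# Every map-key is one B-bucket followed by one C-bucket.
map_keys = [hb + hc for hb, hc in product(B_BUCKETS, C_BUCKETS)]


def keys_for_S(hb, hc):
    """An S tuple knows both buckets, so it gets exactly one key."""
    return [hb + hc]


def keys_for_R(hb):
    """An R tuple knows only h(b); replicate over every C-bucket."""
    return [hb + hc for hc in C_BUCKETS]


def keys_for_T(hc):
    """A T tuple knows only h(c); replicate over every B-bucket."""
    return [hb + hc for hb in B_BUCKETS]


print(len(map_keys))          # 260
print(keys_for_S("3", "q"))   # ['3q']
print(len(keys_for_R("3")))   # 26
print(len(keys_for_T("q")))   # 10
```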
Outline
• Introduction
• 2-Way Join vs. Multi-Way Join
• Optimization of Multi-Way Joins
  – Formalizing the Optimization Problem
  – General Algorithm for Optimization
• Important Special Cases
• Experiments
• Conclusion
Formalizing the Optimization Problem
R(A,B)   S(B,C)   T(A,C)
• The communication cost: rc + sa + tb, where
  – r, s, t: # of tuples in relations R, S, T
  – a, b, c: # of buckets for the attributes A, B, C (shares)
• Why?
  – Consider a tuple (x, y) in relation R
  – Its buckets h(x) and h(y) are known but its C-bucket is not, so (x, y) must be replicated and sent to c different reducers
• We must minimize the expression rc + sa + tb subject to the constraint that abc = k
  – Each of a, b, and c must be a positive integer
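For small k, the integer version of this problem can simply be enumerated. The sketch below is a brute-force check, not the method of the paper, and the function name and sample sizes are illustrative.

```python
def best_integer_shares(r, s, t, k):
    """Try every positive-integer triple (a, b, c) with a*b*c == k and
    return (cost, (a, b, c)) minimizing r*c + s*a + t*b."""
    best = None
    for a in range(1, k + 1):
        if k % a:
            continue                      # a must divide k
        for b in range(1, k // a + 1):
            if (k // a) % b:
                continue                  # b must divide k / a
            c = k // (a * b)
            cost = r * c + s * a + t * b
            if best is None or cost < best[0]:
                best = (cost, (a, b, c))
    return best


# Equal-sized relations: the shares come out equal.
print(best_integer_shares(100, 100, 100, 8))    # (600, (2, 2, 2))

# A large R makes replicating R expensive, so c shrinks to 1.
print(best_integer_shares(1000, 10, 10, 16))    # (1080, (4, 4, 1))
```

The second call shows the intuition behind the cost expression: the relation that is replicated c times is R, so a large r pushes the optimum toward a small share c.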
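Chapter 11. Nonlinear Programming
▶ Nonlinear programming models under equality constraints
A nonlinear model with n decision variables $(x_1, x_2, \ldots, x_n)$ and m equality constraints:
Max. (or Min.) $f(x_1, x_2, \ldots, x_n)$
s. t.
$g_1(x_1, x_2, \ldots, x_n) = 0$
$g_2(x_1, x_2, \ldots, x_n) = 0$
:
$g_m(x_1, x_2, \ldots, x_n) = 0$
Prof. Kang Jin-gyu (강진규), Department of Industrial and Management Engineering, Hanbat National University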
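Lagrange multiplier method
Lagrange multipliers are introduced into the original model to build a Lagrange function that links the objective function with the equality constraints. This converts the problem into an unconstrained nonlinear programming model, whose extrema are then found.
Let $\lambda_i$ be the Lagrange multiplier for the i-th constraint. The Lagrange function is
$L(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_m) = f(x_1, \ldots, x_n) + \lambda_1 g_1(x_1, \ldots, x_n) + \lambda_2 g_2(x_1, \ldots, x_n) + \cdots + \lambda_m g_m(x_1, \ldots, x_n)$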
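Necessary conditions of the Lagrange multiplier method under equality constraints
• Necessary conditions
For $(x_1, x_2, \ldots, x_n)$ to be an optimal solution of the original model, the Lagrange function L must satisfy the following conditions:
$\partial L / \partial x_j = 0, \quad j = 1, 2, \ldots, n$
$\partial L / \partial \lambda_i = 0, \quad i = 1, 2, \ldots, m$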
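Example model
• Production planning problem for special equipment at S Machinery
  – Plan to build and supply 1,000 units of special equipment over the next two years
  – The unit production cost is estimated at 100 (in units of 10,000 won) this year and 80 next year
  – If this year's and next year's production quantities differ, an additional cost proportional to the square of the difference is incurred
Let $x_1$ be this year's production and $x_2$ next year's production. The additional cost is
$C(x_1, x_2) = (x_1 - x_2)^2 / 100$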
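Since total cost TC = regular production cost + additional cost, we obtain the following nonlinear programming model:
Min. $TC(x_1, x_2) = 100x_1 + 80x_2 + (x_1 - x_2)^2 / 100$
s. t.
$x_1 + x_2 = 1{,}000$
With Lagrange multiplier $\lambda$, the Lagrange function is
$L(x_1, x_2, \lambda) = 100x_1 + 80x_2 + (x_1 - x_2)^2 / 100 + \lambda (1{,}000 - x_1 - x_2)$
Taking the partial derivative with respect to each of $x_1$, $x_2$, $\lambda$ and setting it to 0:
$\partial L / \partial x_1 = 100 + (x_1 - x_2)/50 - \lambda = 0$
$\partial L / \partial x_2 = 80 - (x_1 - x_2)/50 - \lambda = 0$
$\partial L / \partial \lambda = 1{,}000 - x_1 - x_2 = 0$
• Solving these equations gives $x_1 = 250$, $x_2 = 750$, $\lambda = 90$, TC = 87,500 (in units of 10,000 won)
• Confirming that $(x_1, x_2) = (250, 750)$ actually minimizes the total cost requires the second-order partial derivatives
• Meaning of the Lagrange multiplier $\lambda = 90$: at the optimum, producing one more unit of the equipment costs an additional 90 (analogous to the value of a dual variable in LP)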
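The three first-order conditions of this production-planning example are linear in $x_1$, $x_2$, and $\lambda$, so they can be checked numerically. The sketch below uses a small exact Gauss-Jordan solver written for this purpose; the helper name `solve_linear` is illustrative.

```python
from fractions import Fraction as F


def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[n] for row in M]


# First-order conditions, rearranged into A x = b with x = (x1, x2, lam):
#    x1/50 - x2/50 - lam = -100
#   -x1/50 + x2/50 - lam = -80
#    x1    + x2          = 1000
A = [[F(1, 50), F(-1, 50), -1],
     [F(-1, 50), F(1, 50), -1],
     [1, 1, 0]]
x1, x2, lam = solve_linear(A, [-100, -80, 1000])
total_cost = 100 * x1 + 80 * x2 + (x1 - x2) ** 2 / 100

print(x1, x2, lam, total_cost)   # 250 750 90 87500
```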
Problem Solving
• Problem solving using the method of Lagrange multipliers
  – Take derivatives with respect to the three variables a, b, c
  – Multiply the three equations
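The equations for these two steps are not present in the slide text. The following is a reconstruction under stated assumptions: the shares a, b, c are treated as real numbers, the cost is rc + sa + tb with constraint abc = k, and the sign convention for $\lambda$ is a choice made here. It reproduces the expression $\sqrt[3]{krt/s^2}$ used for a in the worked example.

```latex
\begin{align*}
L(a,b,c,\lambda) &= rc + sa + tb - \lambda\,(abc - k) \\
\frac{\partial L}{\partial a} = 0 &\;\Rightarrow\; s = \lambda bc \;\Rightarrow\; sa = \lambda k \\
\frac{\partial L}{\partial b} = 0 &\;\Rightarrow\; t = \lambda ac \;\Rightarrow\; tb = \lambda k \\
\frac{\partial L}{\partial c} = 0 &\;\Rightarrow\; r = \lambda ab \;\Rightarrow\; rc = \lambda k \\
(sa)(tb)(rc) = \lambda^3 k^3 &\;\Rightarrow\; rst = \lambda^3 k^2 \;\Rightarrow\; \lambda = \sqrt[3]{rst/k^2} \\
a = \frac{\lambda k}{s} = \sqrt[3]{\frac{krt}{s^2}}, \quad
b &= \frac{\lambda k}{t} = \sqrt[3]{\frac{krs}{t^2}}, \quad
c = \frac{\lambda k}{r} = \sqrt[3]{\frac{kst}{r^2}} \\
rc + sa + tb &= 3\lambda k = 3\sqrt[3]{krst}
\end{align*}
```

At the optimum the three terms rc, sa, and tb are equal, each contributing $\lambda k$ to the cost.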
An Example for Understanding
R(A,B)   S(B,C)   T(A,C)
• Example
  – k = 8 (# of Reduce processes)
  – r = s = t = 100M (# of tuples in relations)
  – # of buckets for the attributes (shares)
    – $a = \sqrt[3]{krt/s^2} = \sqrt[3]{8} = 2$, b = 2, c = 2
  – The minimum communication cost
    – rc + sa + tb = 600M
  – Meaning of solutions
    – We can determine a, b, c to optimize the communication cost
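The numbers in this example can be reproduced with a short sketch. Only the formula for a appears on the slide; the formulas for b and c below are assumed by symmetry (swapping the roles of the relations), and the function names are illustrative.

```python
def optimal_shares(r, s, t, k):
    """Real-valued shares minimizing r*c + s*a + t*b subject to a*b*c = k.

    The expression for a is the one on the slide; b and c are obtained
    by symmetry, which is an assumption of this sketch.
    """
    a = (k * r * t / s**2) ** (1 / 3)
    b = (k * r * s / t**2) ** (1 / 3)
    c = (k * s * t / r**2) ** (1 / 3)
    return a, b, c


def communication_cost(r, s, t, a, b, c):
    """Cost expression rc + sa + tb."""
    return r * c + s * a + t * b


M = 1_000_000
r = s = t = 100 * M
a, b, c = optimal_shares(r, s, t, k=8)
print(round(a, 6), round(b, 6), round(c, 6))              # 2.0 2.0 2.0
print(round(communication_cost(r, s, t, a, b, c) / M))    # 600
```

For unequal relation sizes the formulas generally return non-integer values, so in practice they would have to be rounded to integers whose product is k.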
General Algorithm for Optimization
• Questions
  – How can we select the map-key attributes?
    – Dominated Attributes
  – What is the best # of buckets for each attribute?
    – Lagrange Multiplier Methods
• You can read Section 3 of the paper
  – http://infolab.stanford.edu/~ullman/pub/join-mr.pdf
Outline
• Introduction
• 2-Way Join vs. Multi-Way Join
• Optimization of Multi-Way Joins
• Important Special Cases
  – Star Join
  – Chain Join
• Experiments
• Conclusion
Special Cases
• Star Joins
  – There is a fact table joined with several dimension tables
    – Fact table F: F(A1, A2, …, An)
    – Dimension tables Di: Di(Ai, Bi)
• Chain Joins
  – A chain join is a join of the form R1(A0, A1) ⋈ R2(A1, A2) ⋈ … ⋈ Rn(An-1, An)
Star Joins
• Example
  – k = abcd
Star Joins
• We can subtract each equation from each other equation
  – sbcd = tacd = uabd = vabc
  – Dividing through by abcd: s/a = t/b = u/c = v/d
• We can use these equations to solve for b, c, and d in terms of a
  – b = at/s, c = au/s, d = av/s
  – $k = a^4 tuv / s^3$, because k = abcd
  – $a = \sqrt[4]{ks^3/(tuv)}$, $b = \sqrt[4]{kt^3/(suv)}$, $c = \sqrt[4]{ku^3/(stv)}$, $d = \sqrt[4]{kv^3/(stu)}$
• Generalization
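The four-attribute formulas can be rewritten as $a = s \cdot \sqrt[4]{k/(stuv)}$, and likewise for b, c, d, which suggests a pattern for n attributes. The sketch below assumes that s, t, u, v are the sizes of the relations whose join attributes have shares a, b, c, d, and that the same argument (equal ratios, product of shares equal to k) carries over to n tables. The table sizes and function name are hypothetical.

```python
import math


def star_join_shares(sizes, k):
    """Share for each attribute, given the matching relation sizes.

    Uses a_i = d_i * (k / (d_1 * ... * d_n)) ** (1/n), which for n = 4
    reduces to a = (k * s**3 / (t*u*v)) ** (1/4) and its analogues.
    """
    n = len(sizes)
    scale = (k / math.prod(sizes)) ** (1 / n)
    return [d * scale for d in sizes]


s, t, u, v = 1000, 2000, 4000, 8000     # hypothetical relation sizes
k = 10000
a, b, c, d = star_join_shares([s, t, u, v], k)

print(math.isclose(a * b * c * d, k))                       # True
print(math.isclose(a, (k * s**3 / (t * u * v)) ** 0.25))    # True
print(math.isclose(s / a, t / b) and math.isclose(u / c, v / d))  # True
```

The three checks correspond to the constraint k = abcd, the closed form for a, and the equal-ratio condition s/a = t/b = u/c = v/d.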
Experimental Settings
• Multi-node cluster composed of 4 PCs
  – Debian GNU/Linux
  – 3.0GHz dual-core CPU, 1GB RAM, 160GB HDD
  – 1Gbps LAN
• Tuning Hadoop Parameters
  – # of Reduce processes: 100
  – HDFS block size (max. size of each input split): 128MB
Test Data Sets
Sizes of data sets, intermediate relations, and output
(unit: 1 million tuples)
Test Results
Processing times for the two methods
Conclusion
• Proposed an algorithm for multi-way join that optimizes the communication cost
  – How can we select the map-key attributes?
    – Dominated Attributes
  – What is the best # of buckets for each attribute?
    – Lagrange Multiplier Methods
• Examined the algorithm with two common kinds of joins
  – Star-join
  – Chain-join