Optimizing Joins in a Map-Reduce Environment
EDBT 2010
Presented by Foto Afrati, Jeffrey D. Ullman
Intelligent Database Systems Lab
School of Computer Science & Engineering
Seoul National University, Seoul, Korea
2010-11-12
Summarized by Jaeseok Myung
Center for E-Business Technology, Copyright 2010 by CEBT, IDS Lab. Seminar

Outline
• Introduction
• 2-Way Join vs. Multi-Way Join
• Optimization of Multi-Way Joins
• Important Special Cases
• Experiments
• Conclusion

A Model for Cluster Computing
• Files
  – A file is a set of tuples
  – It is stored in a file system such as GFS
  – Many processes can read and write a file in parallel
• Assumption: infinite supply of processors
  – Any process (job) can be assigned to any one processor

The Cost Measure for MR Algorithms
• The communication cost of a process is the size of the input to the process
• This paper does not count the output size for a process
  – The output must be input to at least one other process
  – The final output is much smaller than its input
• The total communication cost is the sum of the communication costs of all processes that constitute an algorithm
• The elapsed communication cost is defined on the acyclic graph of processes
  – Consider a path through this graph, and sum the communication costs of the processes along that path
  – The maximum sum, over all paths, is the elapsed communication cost

In this paper,
• We begin an investigation into optimization issues for algorithms implemented in the MR environment
• In particular, we are interested in algorithms that minimize the total communication cost
• We begin the study of 2-way and multi-way joins
• We introduce the notion of a “share” for each attribute of the map-key
• The product of the shares is a fixed constant k, which is the number of Reduce processes we shall use to implement the join
• The heart of the paper explores how to choose the map-key and shares to minimize the communication cost

2-Way Join in MapReduce
• R(A,B) ⋈ S(B,C)

Input R
A    B
a0   b0
a1   b1
a2   b2
…    …

Input S
B    C
b0   c0
b0   c1
b1   c2
…    …

Reduce input (output of Map), key b0
K    V
b0   (a0, R)
b0   (c0, S)
b0   (c1, S)
…    …

Reduce input (output of Map), key b1
K    V
b1   (a1, R)
b1   (c2, S)
…    …

Final output (output of Reduce)
A    B    C
a0   b0   c0
a0   b0   c1
a1   b1   c2
…    …    …

2-Way Join in MapReduce
• Suppose we use k Reduce processes
• The output of any Map process with key b is sent to the Reduce process for hash value h(b)

Joining Several Relations at Once
• R(A,B) ⋈ S(B,C) ⋈ T(C,D)
• [Figure: inputs R, S, T → Map → Reduce input → Reduce → Final output]

Joining Several Relations at Once
• R(A,B) ⋈ S(B,C) ⋈ T(C,D)
• Suppose we use k = m^2 Reduce processes for some m
• Values of B and C will each be hashed to m buckets
• Let h be a hash function with range 1, 2, …, m
• Each tuple S(b, c) is sent to the Reduce process (h(b), h(c))
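The routing rule above can be sketched in a few lines of Python. This is a minimal in-memory simulation, not Hadoop code: the function names (`route`, `reduce_join`, `three_way_join`, `communication_cost`), the bucket function `h`, the 0-based bucket numbering, and the sample tuples for T are all assumptions made for illustration. Each S tuple goes to exactly one reducer (h(b), h(c)); following the scheme the subsequent slides describe, each R tuple is copied to every reducer that shares its h(b), and each T tuple to every reducer that shares its h(c).

```python
from collections import defaultdict
from itertools import product

def h(value, m):
    """Illustrative bucket function with range 0..m-1 (an assumption; any hash works)."""
    return hash(value) % m

def route(R, S, T, m):
    """Map phase: assign tuples to reducers on an m x m grid keyed by (h(b), h(c))."""
    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for a, b in R:                      # R lacks C: replicate over every C bucket
        for j in range(m):
            reducers[(h(b, m), j)]["R"].append((a, b))
    for b, c in S:                      # S has both B and C: exactly one reducer
        reducers[(h(b, m), h(c, m))]["S"].append((b, c))
    for c, d in T:                      # T lacks B: replicate over every B bucket
        for i in range(m):
            reducers[(i, h(c, m))]["T"].append((c, d))
    return reducers

def reduce_join(bucket):
    """Reduce phase: join the R, S and T tuples that one reducer received."""
    out = []
    for (a, b), (b2, c), (c2, d) in product(bucket["R"], bucket["S"], bucket["T"]):
        if b == b2 and c == c2:
            out.append((a, b, c, d))
    return out

def three_way_join(R, S, T, m):
    """Run the map and reduce phases and collect the joined tuples."""
    result = []
    for bucket in route(R, S, T, m).values():
        result.extend(reduce_join(bucket))
    return sorted(result)

def communication_cost(R, S, T, m):
    """Number of tuples shipped to reducers: m*|R| + |S| + m*|T|."""
    return sum(len(v["R"]) + len(v["S"]) + len(v["T"])
               for v in route(R, S, T, m).values())

# R and S follow the 2-way join example; T is made-up sample data.
R = [("a0", "b0"), ("a1", "b1"), ("a2", "b2")]
S = [("b0", "c0"), ("b0", "c1"), ("b1", "c2")]
T = [("c0", "d0"), ("c2", "d1")]
joined = three_way_join(R, S, T, m=4)
print(joined, communication_cost(R, S, T, m=4))
```

Because each S tuple lands on a single reducer, every result tuple is produced exactly once. Raising m spreads the work over more reducers but multiplies the copies of R and T, which is the trade-off the share optimization addresses.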
Joining Several Relations at Once
• R(A,B) ⋈ S(B,C) ⋈ T(C,D)
• Let h be a hash function with range 1, 2, …, m
  – S(b, c) -> (h(b), h(c))
  – R(a, b) -> (h(b), all)
  – T(c, d) -> (all, h(c))
• Each Reduce process computes the join of the tuples it receives
• [Figure: grid of Reduce processes indexed by h(b) = 0, 1, 2, 3 and h(c) = 0, 1, 2, 3; m = 4, k = 4^2 = 16 Reduce processes; example tuples with h(S.b) = 2, h(S.c) = 1, h(R.b) = 2, h(T.c) = 1]

Joining Several Relations at Once
• R(A,B) ⋈ S(B,C) ⋈ T(C,D)
• h(b) = one of { 0, 1, 2, …, 9 }, h(c) = one of { a, b, c, …, z }
• The map-key is one of { 0a, 0b, …, 0z, 1a, …, 1z, …, 9z }
• For relation S
  – Each tuple (b, c) is a value, and its key is the single map-key h(b)h(c)
• For relation R
  – Each tuple (a, b) is replicated, with keys h(b)a, h(b)b, …
• For relation T
  – Each tuple (c, d) is replicated, with keys 0h(c), 1h(c), …

Outline
• Optimization of Multi-Way Joins
  – Formalizing the Optimization Problem
  – General Algorithm for Optimization

Formalizing the Optimization Problem
• R(A,B) ⋈ S(B,C) ⋈ T(A,C)
• The communication cost: rc + sa + tb, where
  – r, s, t: # of tuples in relations R, S, T
  – a, b, c: # of buckets for the attributes A, B, C (shares)
• Why?
  – Consider a tuple (x, y) in relation R
  – (x, y) must be replicated and sent to c different reducers, one for each bucket of the attribute C that R lacks
• We must minimize the expression rc + sa + tb subject to the constraint that abc = k
  – Each of a, b, and c must be a positive integer

Chapter 11: Nonlinear Programming ▶ Nonlinear Programming Models with Equality Constraints
(Prof. Kang Jin-gyu (강진규), Department of Industrial and Management Engineering, Hanbat National University)
• A nonlinear model with n decision variables $(x_1, x_2, \ldots, x_n)$ and m equality constraints
  – Max. (or Min.) $f(x_1, x_2, \ldots, x_n)$
  – s.t. $g_i(x_1, x_2, \ldots, x_n) = 0$, $i = 1, 2, \ldots, m$

Chapter 11: Nonlinear Programming ▶ Nonlinear Programming Models with Equality Constraints
• Lagrange multiplier method
  – Introduce Lagrange multipliers into the original model to build a Lagrange function that links the objective function to the equality constraints; this converts the problem into an unconstrained nonlinear programming model, whose extrema are then found
• Letting $\lambda_i$ be the Lagrange multiplier for the i-th constraint, the Lagrange function is
  – $L(x_1, \ldots, x_n, \lambda_1, \ldots, \lambda_m) = f(x_1, \ldots, x_n) + \sum_{i=1}^{m} \lambda_i \, g_i(x_1, \ldots, x_n)$

Chapter 11: Nonlinear Programming ▶ Nonlinear Programming Models with Equality Constraints
• Necessary conditions for the Lagrange multiplier method under equality constraints
  – For $(x_1, x_2, \ldots, x_n)$ to be an optimal solution of the original model, the Lagrange function L must satisfy the following conditions
  – $\partial L / \partial x_j = 0$, $j = 1, 2, \ldots, n$
  – $\partial L / \partial \lambda_i = 0$, $i = 1, 2, \ldots, m$

Chapter 11: Nonlinear Programming ▶ Nonlinear Programming Models with Equality Constraints
• Example model: special-equipment production planning at S Machinery
  – 1,000 units of special equipment are to be built and supplied over the next 2 years
  – The unit production cost is estimated at 100 (unit: 10,000 won) this year and 80 (unit: 10,000 won) next year
  – If this year's and next year's production quantities differ, an additional cost proportional to the square of the difference is incurred
• With this year's production $x_1$ and next year's production $x_2$, the additional cost is
  – $C(x_1, x_2) = (x_1 - x_2)^2 / 100$

Chapter 11: Nonlinear Programming ▶ Nonlinear Programming Models with Equality Constraints
• Total cost TC = regular production cost + additional cost, which gives the following nonlinear programming model
  – Min. $TC(x_1, x_2) = 100x_1 + 80x_2 + (x_1 - x_2)^2 / 100$
  – s.t. $x_1 + x_2 = 1{,}000$
• Letting λ be the Lagrange multiplier, the Lagrange function is as follows (the multiplier is written with a minus sign here, which only flips the sign of λ, so that it agrees with the derivatives and with λ = 90)
  – $L(x_1, x_2, \lambda) = 100x_1 + 80x_2 + (x_1 - x_2)^2 / 100 - \lambda (x_1 + x_2 - 1{,}000)$
• Take the partial derivatives with respect to $x_1$, $x_2$, and λ and set each to 0

Chapter 11: Nonlinear Programming ▶ Nonlinear Programming Models with Equality Constraints
  – $\partial L / \partial x_1 = 100 + (x_1 - x_2)/50 - \lambda = 0$
  – $\partial L / \partial x_2 = 80 - (x_1 - x_2)/50 - \lambda = 0$
  – $\partial L / \partial \lambda = -(x_1 + x_2 - 1{,}000) = 0$
• Solving these equations gives $x_1 = 250$, $x_2 = 750$, $\lambda = 90$, TC = 87,500 (unit: 10,000 won)
• Confirming that $(x_1, x_2) = (250, 750)$ actually minimizes the total cost requires the second-order partial derivatives
• Meaning of the Lagrange multiplier λ = 90: at the optimum, producing one more unit of special equipment costs an additional 90 (the counterpart of a dual variable value in LP)

Problem Solving
• Problem solving using the method of Lagrange multipliers
  – Take derivatives with respect to the three variables a, b, c
  – Multiply the three equations
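The two steps above can be worked out as follows for the cost rc + sa + tb under the constraint abc = k. This is a sketch of the standard Lagrange-multiplier calculation, not a reproduction of the slide's own equations: it treats a, b, c as positive reals (the integrality requirement is ignored), and the sign convention for λ is an assumption.

```latex
\begin{align*}
\mathcal{L}(a,b,c,\lambda) &= rc + sa + tb - \lambda\,(abc - k) \\
\frac{\partial \mathcal{L}}{\partial a} &= s - \lambda bc = 0
  \;\Rightarrow\; sa = \lambda abc = \lambda k \\
\frac{\partial \mathcal{L}}{\partial b} &= t - \lambda ac = 0
  \;\Rightarrow\; tb = \lambda k \\
\frac{\partial \mathcal{L}}{\partial c} &= r - \lambda ab = 0
  \;\Rightarrow\; rc = \lambda k \\
(sa)(tb)(rc) &= rst\,k = \lambda^{3} k^{3}
  \;\Rightarrow\; \lambda = \sqrt[3]{rst/k^{2}} \\
a &= \frac{\lambda k}{s} = \sqrt[3]{\frac{krt}{s^{2}}}, \qquad
b = \frac{\lambda k}{t} = \sqrt[3]{\frac{krs}{t^{2}}}, \qquad
c = \frac{\lambda k}{r} = \sqrt[3]{\frac{kst}{r^{2}}} \\
rc + sa + tb &= 3\lambda k = 3\sqrt[3]{krst}
\end{align*}
```

At the optimum the three terms rc, sa, and tb are equal, so each relation contributes the same amount to the communication cost.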
An Example for Understanding
• R(A,B) ⋈ S(B,C) ⋈ T(A,C)
• Example
  – k = 8 (# of Reduce processes)
  – r = s = t = 100M (# of tuples in relations)
• # of buckets for the attributes (shares)
  – $a = \sqrt[3]{krt/s^2} = \sqrt[3]{8} = 2$, b = 2, c = 2
• The minimum communication cost
  – rc + sa + tb = 600M
• Meaning of solutions
  – We can determine a, b, c to optimize the communication cost

General Algorithm for Optimization
• Questions
  – How can we select the map-key attributes?
    Dominated Attributes
  – What is the best # of buckets for each attribute?
    Lagrange Multiplier Methods
• You can read section 3 of the paper
  – http://infolab.stanford.edu/~ullman/pub/join-mr.pdf

Outline
• Important Special Cases
  – Star Join
  – Chain Join

Special Cases
• Star Joins
  – There is a fact table joined with several dimension tables
  – Fact table F: F(A1, A2, …, An)
  – Dimension tables Di: Di(Ai, Bi)
• Chain Joins
  – A chain join is a join of the form [formula]

Star Joins
• Example
  – k = abcd

Star Joins
• We can subtract each equation from each other equation
  – sbcd = tacd = uabd = vabc
  – s/a = t/b = u/c = v/d
• We can use these equations to solve for b, c, and d in terms of a
  – b = at/s, c = au/s, d = av/s
  – $k = a^4 tuv / s^3$, because k = abcd
  – $a = \sqrt[4]{ks^3/(tuv)}$, $b = \sqrt[4]{kt^3/(suv)}$, $c = \sqrt[4]{ku^3/(stv)}$, $d = \sqrt[4]{kv^3/(stu)}$
• Generalization

Experimental Settings
• Multi-node cluster composed of 4 PCs
  – Debian GNU/Linux
  – 3.0GHz dual-core CPU, 1GB RAM, 160GB HDD
  – 1Gbps LAN
• Tuning Hadoop Parameters
  – # of Reduce processes: 100
  – HDFS block size (max. size of each input split): 128MB

Test Data Sets
• [Table: sizes of data sets, intermediate relations, and output (unit: 1 million tuples)]

Test Results
• [Figure: processing times for the two methods]

Conclusion
• Proposed an algorithm for multi-way join that optimizes the communication cost
  – How can we select the map-key attributes?
    Dominated Attributes
  – What is the best # of buckets for each attribute?
    Lagrange Multiplier Methods
• Examined the algorithm with two common kinds of joins
  – Star-join
  – Chain-join
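The share formulas from the k = 8 example and the star-join slide can be checked numerically. The sketch below rests on stated assumptions: all function names are hypothetical, the closed-form shares are treated as real numbers, and the brute-force search over integer shares is a swapped-in sanity check rather than the paper's method. In `star_shares`, s, t, u, v are taken to be the sizes of the relations whose join attributes have shares a, b, c, d; the slides do not state this explicitly.

```python
import math

def triangle_shares(r, s, t, k):
    """Real-valued shares minimizing rc + sa + tb subject to abc = k."""
    a = (k * r * t / s**2) ** (1 / 3)
    b = (k * r * s / t**2) ** (1 / 3)
    c = (k * s * t / r**2) ** (1 / 3)
    return a, b, c

def best_integer_shares(r, s, t, k):
    """Brute force over positive integers with a*b*c == k (not the paper's method)."""
    best = None
    for a in range(1, k + 1):
        if k % a:
            continue
        for b in range(1, k // a + 1):
            if (k // a) % b:
                continue
            c = k // (a * b)
            cost = r * c + s * a + t * b
            if best is None or cost < best[0]:
                best = (cost, (a, b, c))
    return best

def star_shares(s, t, u, v, k):
    """Real-valued shares from the star-join slide, satisfying abcd = k."""
    a = (k * s**3 / (t * u * v)) ** 0.25
    b = (k * t**3 / (s * u * v)) ** 0.25
    c = (k * u**3 / (s * t * v)) ** 0.25
    d = (k * v**3 / (s * t * u)) ** 0.25
    return a, b, c, d

# The example from the slides: k = 8 reducers, 100M tuples per relation.
r = s = t = 100_000_000
k = 8
a, b, c = triangle_shares(r, s, t, k)
cost, shares = best_integer_shares(r, s, t, k)
print(a, b, c, cost, shares)

# Star join with made-up sizes: the product of the shares should equal k.
sa, sb, sc, sd = star_shares(10, 20, 30, 40, 16)
print(math.isclose(sa * sb * sc * sd, 16))
```

For the slide's example the closed form and the integer search agree on a = b = c = 2 with cost 600M. When the closed-form shares are not integers, some rounding step is needed, which this sketch leaves out.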