epiC: an Extensible and Scalable System for Processing Big Data

Why do we need another MapReduce-like system?

MapReduce
- The M/R framework cannot handle iterative processing efficiently
- Everything needs to be transformed into map and reduce functions

Pregel/GraphLab/Dryad
- DAG-based data flow
- The user must design how the graph is constructed and how the different operators are linked

Can we combine the advantages of both types of systems?

Overview of epiC
- Each unit works independently
- Units communicate via “email”
- The master works as a mail server that forwards the messages
- epiC is based on the actor model

[Figure: epiC architecture. Each unit (here, Unit: PageRank) consists of a message queue, epiC code, and an I/O library. Units exchange control messages with the master network, whose masters provide a message service, a schedule service, and a naming service. All units access a distributed storage system (DFS, key-value store, distributed database, ...).]

Compare epiC to MapReduce and Pregel
Using PageRank as an example

MapReduce
- Multiple iterations, one job per iteration
- The second job loads the output of the first job to continue the processing

[Figure: a chain of map/reduce jobs, one per iteration. Input: graph data and score vector. Map: divide the score among the neighbors. Reduce: compute the new score of each vertex. Output: score vector.]

Compare epiC to MapReduce and Pregel

Pregel
- In each super-step, each vertex computes its new PageRank value and broadcasts the value to its neighbors

[Figure: Pregel super-steps. Receive scores from neighbors, compute new score, broadcast scores to neighbors, ...]

Compare epiC to MapReduce and Pregel

epiC
0. Messages are sent to the unit to activate it
1. The unit loads a partition of the graph data and the score vector based on the received message
2. It computes the new score vector of the vertices
3. It generates the new score vector files
4. It sends messages to the master network

[Figure: Unit: PageRank (message queue, epiC code, I/O library) between the master network and the storage system, annotated with steps 0 to 4.]

Compare epiC to MapReduce and Pregel

Flexibility:
- MR is not designed for such jobs
- Pregel and epiC can express the algorithm more effectively
- A unit in epiC is equivalent to a worker in Pregel

Optimization:
- Both MR and epiC support customized optimization, e.g., buffering the intermediate results on local disk

Extensibility:
- MR and Pregel have their predefined programming models, while in epiC users can create their own

Using epiC to simulate MR
- Create two basic units: MapUnit and ReduceUnit
- A MapUnit loads a partition of the data and sends messages to all ReduceUnits
- A ReduceUnit gets its input from the DFS. The locations of the input are obtained from the messages of the MapUnits

Using epiC to simulate Relational DB
Three units are created:
- SingleTableUnit: handles all processing on a single table
- JoinUnit: joins two or more tables
- AggregateUnit: applies the group-by operator and computes the aggregation results

Using epiC to simulate Relational DB
Example: TPC-H Q3
5 steps are required

TPC-H Q3 (Step 1)
Master network messages: partition info of Customer, partition info of Orders, partition info of Lineitem
- SingleTableUnit (Customer):
  select c_custkey from Customer where c_mktsegment = ':1'
- SingleTableUnit (Orders):
  select o_orderdate, o_custkey, o_orderkey, o_shippriority from Orders where o_orderdate < date ':2'
- SingleTableUnit (Lineitem):
  select l_orderkey, l_extendedprice, l_discount from Lineitem where l_shipdate > date ':2'
Storage: relational data of Customer/Orders/Lineitem

TPC-H Q3 (Step 2 and 3)
Step 2
Master network message: partition info of the partial results of Lineitem and Orders
- JoinUnit:
  Create Partition JoinView1 as (Lineitem join Orders)
Storage: partial results of Customer/Lineitem

Step 3
Master network message: partition info of the partial results of Customer and JoinView1
- JoinUnit:
  Create Partition JoinView2 as (Customer join JoinView1)
Storage: partial results of Customer/JoinView1

TPC-H Q3 (Step 4 and 5)
Step 4
Master network message: partition info of the partial results of JoinView2
- SingleTableUnit:
  Select * from JoinView2 Group By o_orderdate, o_shippriority
Storage: partial results of JoinView2

Step 5
Master network message: partition info of the groups
- AggregateUnit: compute the aggregation results for each group
Storage: partial results of the Group By
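
Sketch: simulating MR with units and a master
The unit/master pattern used throughout these slides (units work independently, the master forwards messages, data goes through shared storage and messages carry only its location) can be illustrated with a small single-process Python sketch of the MapUnit/ReduceUnit simulation. This is not epiC's actual API. The Master class, the receive() method, the message fields, the paths, the in-memory dict standing in for the DFS, and the word-count workload are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict, deque


class Master:
    """Hypothetical mail server: queues messages and forwards them to units."""

    def __init__(self):
        self.units = {}         # naming service: unit name -> unit object
        self.mailbox = deque()  # pending (recipient, message) pairs

    def register(self, name, unit):
        self.units[name] = unit

    def send(self, recipient, message):
        self.mailbox.append((recipient, message))

    def run(self):
        # Forward messages until no unit has anything left to send.
        while self.mailbox:
            recipient, message = self.mailbox.popleft()
            self.units[recipient].receive(message, self)


class MapUnit:
    """Loads one partition, writes per-reducer files, and mails their locations."""

    def __init__(self, name, dfs, reducers):
        self.name, self.dfs, self.reducers = name, dfs, reducers

    def receive(self, message, master):
        lines = self.dfs[message["input"]]
        buckets = {r: defaultdict(int) for r in self.reducers}
        for line in lines:
            for word in line.split():
                # crc32 gives a partitioning that is stable across runs.
                index = zlib.crc32(word.encode()) % len(self.reducers)
                buckets[self.reducers[index]][word] += 1
        for reducer, counts in buckets.items():
            path = f"/tmp/{self.name}-to-{reducer}"
            self.dfs[path] = dict(counts)
            # The message carries only the location, not the data itself.
            master.send(reducer, {"location": path})


class ReduceUnit:
    """Waits for one location per MapUnit, then reads its input from the DFS."""

    def __init__(self, name, dfs, expected):
        self.name, self.dfs, self.expected = name, dfs, expected
        self.locations = []

    def receive(self, message, master):
        self.locations.append(message["location"])
        if len(self.locations) < self.expected:
            return  # still waiting for other MapUnits
        totals = defaultdict(int)
        for path in self.locations:
            for word, count in self.dfs[path].items():
                totals[word] += count
        self.dfs[f"/out/{self.name}"] = dict(totals)


def word_count(partitions, num_reducers=2):
    """Run a word count over a list of partitions (each a list of lines)."""
    dfs = {}  # in-memory stand-in for the distributed storage system
    master = Master()
    reducers = [f"reduce-{i}" for i in range(num_reducers)]
    for r in reducers:
        master.register(r, ReduceUnit(r, dfs, expected=len(partitions)))
    for i, lines in enumerate(partitions):
        name = f"map-{i}"
        dfs[f"/in/{i}"] = lines
        master.register(name, MapUnit(name, dfs, reducers))
        master.send(name, {"input": f"/in/{i}"})  # activate the unit
    master.run()
    result = {}
    for r in reducers:
        result.update(dfs.get(f"/out/{r}", {}))
    return result


if __name__ == "__main__":
    print(word_count([["a b a"], ["b c"]]))
```

In this sketch the master never sees the intermediate data. Each ReduceUnit learns where its input lives from the MapUnits' messages and reads it from storage itself, which mirrors the "locations of the input" point on the MR slide.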