Mizan: Optimizing Graph Mining in Large Parallel Systems
Panos Kalnis
King Abdullah University of Science and Technology (KAUST)
H. Jamjoom (IBM Watson) and Z. Khayyat, K. Awara (KAUST)

Graphs: Are they Important?
  Graphs are everywhere
    - Internet
    - Web graph
    - Social networks
    - Biological networks
  Processing graphs
    - Find patterns, rules, anomalies
    - Rank web pages
    - 'Viral' or 'word-of-mouth' marketing
    - Identify interactions among proteins
    - Computer security: anomalies in email traffic

Graph Research in InfoCloud
  FD3: RDF query engine
    - Distributed
    - On-the-fly placement and indexing
  GraMi: Graph mining
    - E.g., find frequent subgraphs
  Mizan
    - Framework for executing graph algorithms
    - Distributed, large-scale
  GOAL: Graph DBMS
  [Figure: example RDF graph with nodes Panos, Yasser, KAUST, professor, student and edge labels isA, works, studies]

Existing Graph-processing Frameworks
  Map-Reduce based
    - HADI, Pegasus
  Message passing
    - Pregel
  Specialized graph engines
    - Parallel Boost Graph Library (pBGL)

PageRank with Map-Reduce
  [Figure: PageRank over a five-vertex graph (v1 to v5), executed as rounds of Map-1..3 and Reduce-1..3 tasks, with a "Write on HDFS" step after each round]

Pregel [1]
  Bulk Synchronous Parallel model
  Stateful model: long-lived processes compute, communicate, and modify local state
    - vs. data-flow model: a process computes solely on its input data and produces output data
  [1] G. Malewicz et al., "Pregel: A System for Large-Scale Graph Processing", SIGMOD, 2010

Pregel Example: MAX
  [Figure: maximum-value propagation over successive supersteps until every vertex holds the value 6; example from [Malewicz et al., SIGMOD, 2010]]

Mizan - Overview
  Mizan-α
    - Min-cut partitioning of input graph
    - Point-to-point message passing
    - Good for power-law graphs
  Mizan-γ
    - Random partitioning of input
    - Ring overlay message passing
    - Good for non-power-law graphs

α – Minimum-Cut Partitioning
  [Figure]

METIS [2]
  [Figure]
  [2] Karypis and Kumar, "Multilevel k-way Partitioning Scheme for Irregular Graphs", JPDC, 1998

α – Percentage of Edge Cuts with Minimum-Cut Partitioning
  [Plots: Power-law; Non-Power-law]

α – Node Replication
  [Figure]

α – Percentage of Edge Cuts with Node Replication
  [Plots: Power-law; Non-Power-law]

Cost of Min-Cut Partitioning
  [Plot: Partition vs. User's code]

γ – Message-passing in a Ring
  [Figure: Point-to-Point communication vs. Ring-based communication (Mizan-γ)]

Optimizer
  α
    - Partitioning cost (min-cut)
    - Pays off for power-law graphs
  γ
    - Latency due to the ring
    - Each message must be needed by many nodes
    - Good for non-power-law graphs
  Is the input power-law?
    - Take a random sample
    - Use [3] to compare with the theoretical power-law distribution
    - Compute pValue
    - 0.1 ≤ pValue < 0.9: power-law
  [3] A. Clauset et al., "Power-Law Distributions in Empirical Data", SIAM Review, 51(4), 2009

Datasets & Optimizer's Decisions
  [Table: Real and Synthetic datasets]

Example: Diameter Estimation
  [Figure]

Non-Power-law
  [Plot: 8 EC2 instances, diameter estimation]

Power-law
  [Plot: 8 EC2 instances, diameter estimation]

Cloud Computing in KAUST
  Scientific & commercial applications

IBM BlueGene/P – 3D Torus Network
  [Figure]

IBM BlueGene/P vs. Amazon EC2
  [Plot: IBM BlueGene/P: 850 MHz; EC2: 2.4 GHz]

Points to remember
  Mizan: Framework for graph algorithms in large-scale computing infrastructures
    - α: Power-law graphs
    - γ: Non-power-law graphs
    - Runs on cloud and on supercomputers
  To do list:
    - Dynamic graph placement
    - Hybrid (alpha and gamma)
    - Better optimizer

Questions?
http://cloud.kaust.edu.sa
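
Appendix: sketch of the Pregel MAX example

The Pregel slides describe a Bulk Synchronous Parallel model in which long-lived vertices keep local state and exchange messages between supersteps. The MAX example can be sketched in a few lines of single-process Python. This is only an illustration of the vertex-centric idea, not Pregel's or Mizan's API: the function name `pregel_max`, the dictionary-based graph encoding, and the four-vertex chain used in the demo are all assumptions made for this sketch.

```python
def pregel_max(values, edges):
    """Propagate the largest vertex value through a graph, BSP style.

    values: dict mapping vertex -> initial value (hypothetical encoding)
    edges:  dict mapping vertex -> iterable of out-neighbours
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)                 # local state, one entry per vertex
    inbox = {v: [] for v in values}       # messages delivered this superstep
    active = set(values)                  # every vertex is active in superstep 0
    supersteps = 0

    while active:
        outbox = {v: [] for v in values}  # messages for the next superstep
        for v in active:
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            # Send in superstep 0, or whenever the local value grew;
            # otherwise the vertex stays silent (it "votes to halt").
            if supersteps == 0 or changed:
                for w in edges.get(v, ()):
                    outbox[w].append(new_value)
        # Synchronisation barrier: messages become visible only now.
        inbox = outbox
        # A halted vertex is reactivated only by an incoming message.
        active = {v for v, msgs in inbox.items() if msgs}
        supersteps += 1

    return values, supersteps


# Demo graph (an assumption): a bidirectional chain a - b - c - d.
demo_values = {"a": 3, "b": 6, "c": 2, "d": 1}
demo_edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = pregel_max(demo_values, demo_edges)
print(final_values, steps)
```

On the demo chain every vertex ends up holding 6. The loop stops once a superstep produces no messages, which mirrors the termination rule of the model: all vertices have voted to halt and nothing is in flight.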