Parallel Subgraph Listing in a Large-Scale Graph
Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu
School of EECS, Peking University
Hong Kong University of Science and Technology

Outline
• Subgraph listing operation
• Related work
• PSgL framework
• Evaluation
• Conclusion

Introduction: Motivation
• Motif detection in bioinformatics
• Triangle counting in SN
• Cascade counting in RN

Introduction: Problem Definition
• Subgraph listing operation
o Input: a pattern graph and a data graph (both undirected)
o Output: all occurrences of the pattern graph in the data graph
[Figure: a 4-vertex pattern graph and a 6-vertex data graph]
• Goal of our work
o Efficiently list subgraphs in a large-scale graph

Related Work
• Centralized algorithms: enumerate instances one by one [Chiba '85, Wernicke '06, Grochow '07]
• Streaming algorithms: only counting, and the results are inaccurate [Buriol '06, Bordino '08, Zhao '10]
• MapReduce-based parallel algorithms
o Decompose the pattern graph + explicit join operation [Afrati '13]
o Fixed exploration plan + implicit join operation [Plantenga '13]
• Other efficient algorithms for specific pattern graphs: triangles [Suri '11, Chu '11, Hu '13]

Related Work: Drawbacks of Existing Parallel Solutions
• MapReduce is not friendly to graph processing.
• The join operation is expensive.
• They do not balance the distribution of
o the data graph,
o the intermediate results.
The novel PSgL framework lists subgraphs via graph traversal over a natively stored in-memory graph.

Contributions
• We propose an efficient parallel subgraph listing framework, PSgL.
• We introduce a cost model for subgraph listing in PSgL.
• We propose a simple but effective workload-aware distribution strategy, which enables PSgL to achieve good workload balance.
• We design three independent mechanisms to reduce the size of the intermediate results.

Preliminaries: Partial Subgraph Instance
• A data structure that records the mapping between the pattern graph and the data graph.
• Denoted by G_psi.
• Assuming the vertices of G_p are numbered from 1 to |V_p|, we simply write G_psi as {map(1), map(2), ..., map(|V_p|)}.
[Figure: a 4-vertex pattern graph matched against a 6-vertex data graph, with partial instances {?,?,?,?}, {2,3,4,5}, and {1,5,6,?}]

Preliminaries: Independence Property
• G_psi tree
o A node is a G_psi.
o The children of a node are derived by expanding one mapped data vertex of the node.
• Characteristics
o A G_psi encodes a set of results.
o G_psi's are independent of each other, except for those on the same generation path.
[Figure: a G_psi tree rooted at {?,?,?,?}, with children such as {1,?,?,?}, {2,?,?,?}, {4,?,?,?}, {6,?,?,?} and descendants such as {2,1,?,5}, {2,1,?,3}, {2,3,?,5}, {6,1,?,5}, {6,5,?,1}]

PSgL: Parallel Subgraph Listing Framework
• PSgL follows the popular graph processing paradigm:
o vertex-centric model,
o BSP model.
• PSgL iteratively generates G_psi's in parallel; each G_psi is expanded by a data vertex.
[Figure: G_psi's spread across partitions P1-P3 on Worker-1 to Worker-3, expanded iteration by iteration]

PSgL Vertex Program: Expanding a G_psi - I
• The partial pattern graph (G_pp) encodes
o the pattern graph,
o the G_psi,
o the progress state.
• Three types of pattern vertices:
o A BLACK vertex has already been expanded.
o A GRAY vertex has a mapped data vertex, but has not been expanded yet.
o A WHITE vertex has not been mapped to any data vertex.
[Figure: G_p plus G_psi = {3, 5, ?, 2} yields G_pp with vertex states <1,3>, <2,5>, <4,2>, <3,?>]

PSgL Vertex Program: Expanding a G_psi - II
• Main logic
o Change one GRAY vertex into BLACK;
o Validate the expanding vertex's GRAY neighbors;
o Turn the expanding vertex's WHITE neighbors into GRAY.
• Two observations
o In each expansion, at least one pattern vertex is processed.
o All GRAY vertices are valid candidates for the next expansion.
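The expansion step above (map one more pattern vertex, constrained by its already-mapped neighbors) can be sketched as a recursive traversal. This is an illustrative simplification, not the authors' distributed implementation: the toy graphs, the fixed expansion order, and the function names are assumptions.

```python
# Toy undirected data graph as adjacency sets (an assumed example,
# not the graphs from the slides).
DATA = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}

# Triangle pattern: pattern vertices 1..3, pairwise adjacent.
PATTERN = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}

def expand(gpsi, order):
    """Yield fully mapped instances grown from a partial G_psi.

    gpsi maps pattern vertex -> data vertex; unmapped vertices are
    simply absent (the '?' entries on the slides). order fixes which
    pattern vertex is expanded next."""
    if len(gpsi) == len(PATTERN):            # every pattern vertex mapped
        yield dict(gpsi)
        return
    pv = order[len(gpsi)]                    # next pattern vertex to map
    mapped_nbrs = [gpsi[q] for q in PATTERN[pv] if q in gpsi]
    # A candidate must be a data-neighbor of every mapped pattern-neighbor.
    cands = set(DATA) if not mapped_nbrs else set.intersection(
        *(DATA[v] for v in mapped_nbrs))
    for dv in sorted(cands - set(gpsi.values())):   # keep the mapping injective
        yield from expand({**gpsi, pv: dv}, order)

instances = list(expand({}, order=[1, 2, 3]))
print(len(instances))   # 2 triangles x 3! orderings = 12
```

Without automorphism breaking, each triangle is listed 3! = 6 times; adding the partial order map(1) < map(2) < map(3), as in the optimization slides, would cut this to one instance per triangle.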
[Figure: example of expanding vertex <4,2> in G_pp, for G_p plus G_psi = {3, 5, ?, 2} with states <1,3>, <2,5>, <4,2>, <3,?>]

PSgL Analysis: Efficiency of PSgL
• Total cost = sum over the S iterations of max_{1≤k≤K} Σ_{i=1}^{N_k} load(G_psi_i), where K is the number of workers and N_k is the number of G_psi's processed by worker k in that iteration.
• Three metrics:
o The number of iterations: S is bounded by |MVC| ≤ S ≤ |V_p| − 1, where MVC is a minimum vertex cover of G_p.
o Workload balance: required by the max function.
o The number of G_psi's.
• Refer to the paper for the details of estimating load(G_psi).

Optimization: Workload Balance - I
• Partial subgraph instance distribution problem
o Given N G_psi's to be processed by K workers, find a distribution strategy that minimizes the maximum worker load.
o This is an NP-hard problem!
• Naive solutions
o Random distribution strategy
o Roulette-wheel distribution strategy: a G_psi has a higher probability of being expanded by a data vertex with a smaller degree.

Optimization: Workload Balance - II
• Workload-aware distribution strategy: a general greedy-based heuristic rule, parameterized by α.

α   | Description                                                                  | Drawback
1   | Select the worker j with the minimal overall workload W_j for the i-th G_psi | local optimum
0   | Select the worker j where the G_psi incurs the least increased workload w_ij | imbalance
0.5 | Trade off between the local optimum and imbalance                            | (*)

(*) All three strategies have the same worst-case bound, K·|OPT|, but in practice α = 0.5 performs best.

Optimization: Comparison Among Various Approaches
[Figure: workload distribution under the random, roulette, α = 0, and α = 1 strategies]

Optimization: Partial Subgraph Instance Reduction - I
• Pattern graph automorphism breaking
o Use DFS to find the equivalent vertex groups.
o Assign a partial order within each equivalent vertex group (e.g., 2 < 3).
• Initial pattern vertex selection, based on a cost model
o General pattern graph: enumerate all possible selections under the cost model.
o Cycle and clique: the vertex with the lowest rank is the best one.
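The α-parameterized rule from the workload-balance slides can be sketched as a single greedy score, score(j) = α·W_j + (1 − α)·w_ij, which recovers both extremes at α = 1 and α = 0. The task format, the doubled remote-expansion cost, and the numbers below are illustrative assumptions, not the paper's cost model.

```python
def assign(tasks, K, alpha=0.5):
    """Greedy workload-aware placement of expansion tasks on K workers.

    Each task is (load, home): 'home' is the worker assumed to already
    hold the data vertex, and a remote expansion is assumed to cost
    twice as much. Picks the worker minimizing
    alpha * W[j] + (1 - alpha) * w_ij."""
    def w(task, j):                          # incremental cost on worker j
        load, home = task
        return load if j == home else 2 * load
    W = [0.0] * K
    placement = []
    for task in tasks:
        j = min(range(K), key=lambda j: alpha * W[j] + (1 - alpha) * w(task, j))
        W[j] += w(task, j)
        placement.append(j)
    return placement, W

tasks = [(5, 0), (3, 0), (8, 0), (2, 0), (7, 0)]   # all data vertices on worker 0
print(assign(tasks, K=2, alpha=0.0)[1])   # [25.0, 0.0]: cheapest per task, imbalanced
print(assign(tasks, K=2, alpha=1.0)[1])   # [13.0, 24.0]: balances W only, ignores w_ij
print(assign(tasks, K=2, alpha=0.5)[1])   # [20.0, 10.0]: the trade-off
```

With this skewed placement of data vertices, α = 0 piles everything onto worker 0, α = 1 pays heavy remote costs, and α = 0.5 ends with the smallest maximum load, mirroring the table's conclusion.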
Optimization: Partial Subgraph Instance Reduction - II
• Online pruning of invalid G_psi's
o Filter by the partial order and degree restrictions.
o Prune with the help of a lightweight global edge index, which uses a Bloom filter to index the two ends of each edge.
[Figure: pattern graphs PG1 (triangle), PG4, and PG5]

Data Graph   | PG          | #G_psi w/ index | #G_psi w/o index | Pruning ratio
LiveJournal  | PG1(v1)     | 2.86 x 10^8     | 6.81 x 10^8      | 58.01%
LiveJournal  | PG4(v1)     | 9.93 x 10^9     | OOM              | unknown
LiveJournal  | PG5(v1)     | 2.26 x 10^7     | 3.17 x 10^8      | 92.87%
UsPatent     | PG5(v3; v4) | 7.38 x 10^9     | 2.04 x 10^10     | 63.89%

Evaluation: Comparing to MR Solutions
• Afrati and SGIA-MR are the state-of-the-art MapReduce solutions.
• Ratios exceeding 100x are not visualized (e.g., PSgL: 4302s vs. Afrati: 7291s).
[Figure: speedup ratios of PSgL over the MapReduce solutions for several pattern graphs]

Evaluation: Comparing to GraphLab

Data Graph   | Pattern Graph | Afrati   | PowerGraph | PSgL
Twitter      | PG1           | 432 min  | 2 min      | 12.5 min
Wikipedia    | PG1           | 871 s    | 36 s       | 125 s
WikiTalk     | PG2           | 4402 s   | 48 s       | 318 s
WikiTalk     | PG3           | 13743 s  | 100 s      | 494 s
WikiTalk     | PG3           | 13743 s  | OOM*       | 494 s
WikiTalk     | PG4           | 1785 s   | 127 s      | 38 s
LiveJournal  | PG4           | 2749 s   | OOM        | 1330 s

* using a different traversal order
[Figure: pattern graphs PG1 (triangle) and the 4-vertex patterns PG2, PG3, PG4]

Conclusion
• Subgraph listing is a fundamental operation for massive graph analysis.
• We propose an efficient parallel subgraph listing framework, PSgL, featuring
o various distribution strategies,
o a cost model,
o a lightweight global edge index.
• The workload-aware distribution strategy can be extended to other balancing problems.
• A new execution engine is required for larger pattern graphs.

Thanks!

Backup Expr.: Scalability of PSgL
[Figure: performance vs. worker number]

Backup Expr.: Initial Pattern Vertex Selection
[Figure: influence of the initial pattern vertex on various data graphs (LiveJournal and a random graph)]
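The lightweight global edge index from the reduction slides can be sketched with a small Bloom filter keyed on the two ends of an edge: a negative answer means the queried pair is certainly not an edge, so the corresponding G_psi can be pruned without consulting a remote adjacency list. The filter size, hash construction, and toy edge set are assumptions for illustration, not the paper's implementation.

```python
import hashlib

class EdgeBloom:
    """Bloom filter over undirected edges; no false negatives."""
    def __init__(self, m_bits=1 << 16, k=4):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, u, v):
        key = f"{min(u, v)}-{max(u, v)}".encode()    # order-independent key
        for i in range(self.k):
            h = hashlib.blake2b(key, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, u, v):
        for p in self._positions(u, v):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, u, v):
        # False => (u, v) is certainly not an edge, so prune the G_psi.
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(u, v))

idx = EdgeBloom()
for u, v in [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]:
    idx.add(u, v)
print(idx.might_contain(2, 1))   # True: the edge was indexed (either order)
print(idx.might_contain(1, 4))   # False unless a (rare) false positive occurs
```

A Bloom filter can only over-approximate membership, which is exactly what safe pruning needs: false positives merely let an invalid G_psi survive one more step, while true edges are never rejected.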