Distributed Graph Pattern Matching
Shuai Ma, Yang Cao, Jinpeng Huai, Tianyu Wo

Graphs are everywhere, and quite a few are huge graphs!
• File systems
• Databases
• World Wide Web
• Social networks
Graph searching is a key to social search engines!

Graph Pattern Matching
• Given two graphs G1 (pattern graph) and G2 (data graph),
  – decide whether G1 "matches" G2 (Boolean queries);
  – identify the "subgraphs" of G2 that match G1.
• Applications
  – Web mirror detection / Web site classification
  – Complex object identification
  – Software plagiarism detection
  – Social network / biology analyses
  – …
• Matching semantics
  – Traditional: subgraph isomorphism
  – Emerging applications: graph simulation and its extensions, etc.
A variety of emerging real-life applications!

Distributed Graph Pattern Matching
• Real-life graphs are typically way too large:
  – Yahoo! web graph: 14 billion nodes
  – Facebook: over 0.8 billion users
  It is NOT practical to handle large graphs on single machines.
• Real-life graphs are naturally distributed:
  – Google, Yahoo! and Facebook have large-scale data centers.
  Distributed graph processing is inevitable.
It is natural to study distributed graph pattern matching!

Distributed Graph Pattern Matching
• Given a pattern graph Q(Vq, Eq) and a fragmented data graph F = (F1, …, Fk) of G(V, E) distributed over k sites,
• the distributed graph pattern matching problem is to find the maximum match in G for Q, via graph simulation.
There exists a unique maximum match for graph simulation!

Graph Simulation
• Given a pattern graph Q(Vq, Eq) and a data graph G(V, E), a binary relation R ⊆ Vq × V is said to be a match if
  – (1) for each (u, v) ∈ R, u and v have the same label; and
  – (2) for each (u, v) ∈ R and each edge (u, u′) ∈ Eq, there exists an edge (v, v′) in E such that (u′, v′) ∈ R.
• Graph G matches pattern Q via graph simulation if there exists a total match relation M:
  – for each u ∈ Vq, there exists v ∈ V such that (u, v) ∈ M.
• Intuitively, simulation preserves the labels and the child relationship of a graph pattern in its match.
• Simulation was initially proposed for the analysis of programs; simulation and its extensions were recently introduced for social networks.
Subgraph isomorphism (NP-complete) vs. graph simulation (O(n²))!

Graph Simulation
Example: set up a team to develop a new software product.
• Graph simulation returns F3, F4 and F5; subgraph isomorphism returns an empty result!
Subgraph isomorphism is too strict for emerging applications!

Properties of Graph Simulation: Impacts of connected components (CCs)
• Let pattern Q = {Q1, …, Qh} (h CCs). For any data graph G,
  – if Mi is the maximum match in G for Qi,
  – then M1 ∪ … ∪ Mh is the maximum match in G for Q.
• Let data graph G = {G1, …, Gh} (h CCs). For any pattern graph Q,
  – if Mi is the maximum match in Gi for Q,
  – then M1 ∪ … ∪ Mh is the maximum match in G for Q.
• Even if data graph G is connected, R(G) may be highly disconnected after removing useless nodes and edges from G: consider any binary relation R ⊆ Vq × V on pattern graph Q(Vq, Eq) and data graph G(V, E) that contains the maximum match M in G for Q, where R(G) consists of h CCs R(G)1, …, R(G)h.
  – If Mi is the maximum match in R(G)i for Q,
  – then M1 ∪ … ∪ Mh is exactly the maximum match in G for Q.

Properties of Graph Simulation: What can be computed locally?
• The matched subgraph of Q1 and G1 is Gs = F3 ∪ F4 ∪ F5;
• Removing any node or edge from Gs makes Q1 NOT match Gs.
Graph simulation has poor data locality.
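For concreteness, here is a minimal single-machine sketch of the fixpoint computation behind the semantics above. The dictionary-based graph representation and the function name are illustrative assumptions; the quadratic fixpoint-refinement algorithm that the deck builds on (HHK-style) computes the same relation far more efficiently.

```python
# Minimal single-machine sketch: compute the maximum graph-simulation match
# by naive fixpoint refinement. Graphs are passed as label dicts plus child
# lists; this is illustrative only, not the optimized quadratic algorithm.

def max_simulation_match(q_label, q_children, g_label, g_children):
    # Condition (1): start from all label-preserving candidate pairs.
    sim = {u: {v for v in g_label if g_label[v] == q_label[u]} for u in q_label}

    changed = True
    while changed:
        changed = False
        for u in q_label:
            for uc in q_children.get(u, []):
                # Condition (2): v survives only if some child of v can
                # still match the pattern child uc.
                kept = {v for v in sim[u]
                        if any(w in sim[uc] for w in g_children.get(v, []))}
                if kept != sim[u]:
                    sim[u], changed = kept, True

    # G matches Q iff every pattern node keeps a candidate; the surviving
    # pairs form the unique maximum match relation.
    if any(not sim[u] for u in q_label):
        return None
    return {(u, v) for u in q_label for v in sim[u]}
```

On a toy pattern with one edge A→B and a data graph with edges a1→b1 and a2→c1, the loop drops a2 (its only child cannot match B) and returns {(A, a1), (B, b1)}.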
Properties of Graph Simulation: Data locality of single nodes
• Checking whether a data node v in G matches a pattern node u in Q can be determined locally iff the subgraph desc(Q, u) is a DAG.
• desc(Q1, SA) is the subgraph of Q1 with nodes SA, SD and ST.
What have we learned from the static analysis?
• Treat each connected component in Q and G separately;
• Use the data locality to check whether a node in G can be determined locally.

Complexity Analysis of Distributed Algorithms
Model of computation:
• A cluster of identical machines, one of which acts as the coordinator;
• Each machine can directly send an arbitrary number of messages to any other machine;
• All machines co-work with each other by local computations and message passing.
Complexity measures:
1. Visit times: the maximum number of times a machine is visited (interactions);
2. Makespan: the evaluation completion time (efficiency);
3. Data shipment: the total size of the messages shipped among distinct machines (network bandwidth consumption).

Complexity Analysis of Distributed Algorithms
Specifications for the distributed algorithms:
• For each machine Si (1 ≤ i ≤ k),
  – Local information: (1) the pattern graph Q; (2) the subgraph Gs,i of G; and (3) a marked binary relation Ri ⊆ Vq × V, where each match (u, v) ∈ Ri is marked as true, false or unknown; Ri can be updated by either messages or local computations.
  – Messages: only local information is allowed to be exchanged.
  – Local computations: update Ri by utilizing the semantics of graph simulation.
• Local algorithms execute only local computations, without message passing during the computation, and run in time polynomial in |Q| and |Gs,i|.
Complexity bounds:
1. The optimal data shipment is |G| − 1, and it is tight.
2. The optimal number of visit times is 1, and it is tight.
3. The minimum makespan problem is NP-complete.
Remarks:
1. Data shipment, visit times and makespan conflict with one another.
2. We adopt a well-balanced strategy between makespan and the other two measures.

Distributed Evaluation of Graph Simulation
• Stage 1: The coordinator SQ broadcasts Q to all k sites;
• Stage 2: All sites, in parallel, partially evaluate Q on their local fragments to obtain partial matches;
• Stage 3: Ship the CCs that span multiple machines to single machines, while minimizing data shipment and makespan;
• Stage 4: Compute, in parallel, the maximum matches in the CCs that originally spanned multiple machines;
• Stage 5: Collect and assemble the partial matches at the coordinator.
Performance guarantees:
1. The total computational complexity is the same as that of the best-known centralized algorithm, while only 4 rounds of message passing and local evaluation are invoked;
2. The total data shipment is bounded by |G| + 4|B| + |Q||G| + (k − 1)|Q|;
3. Each machine except the coordinator SQ is visited g + 2 times, where g is the maximum number of machines at which a CC resides in Stage 2; SQ is visited 2(k − 1) times.
Sacrifice data shipment and visit times for makespan!

Scheduling Data Shipment - Stage 3
The scheduling problem: Given h connected components C1, …, Ch and an integer k, find an assignment of the connected components to k identical machines such that both the makespan and the total data shipment are minimized.
Approximation hardness (data shipment, makespan): The scheduling problem is not approximable within (ε, max(k − 1, 2)) for any ε > 1.
Performance guarantees of algorithm dSchedule: dSchedule produces an assignment for the scheduling problem such that the makespan is within a factor of (2 − 1/k) of the optimal one.
Remarks:
1. A heuristic is used to minimize the data shipment;
2. A greedy approach is adopted to guarantee the performance on the makespan;
3. The algorithm runs in O(kh) time and is very efficient; hence, its evaluation does not become a bottleneck.
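The makespan guarantee has the flavor of classical greedy list scheduling: placing each component on the currently least-loaded machine already yields a (2 − 1/k)-approximation of the optimal makespan. The sketch below shows only that greedy step, with illustrative names; it omits the data-shipment heuristic of dSchedule and is not the algorithm itself.

```python
# Illustrative greedy list scheduling over k identical machines: place each
# connected component on the currently least-loaded machine. This classical
# rule alone achieves a (2 - 1/k)-approximation of the optimal makespan;
# dSchedule additionally applies a heuristic to reduce data shipment.

def greedy_assign(cc_sizes, k):
    """cc_sizes: list of component sizes; returns (assignment, makespan)."""
    loads = [0] * k                      # current load of each machine
    assignment = [[] for _ in range(k)]  # components placed on each machine
    for cc, size in enumerate(cc_sizes):
        target = min(range(k), key=lambda m: loads[m])  # least-loaded machine
        assignment[target].append(cc)
        loads[target] += size
    return assignment, max(loads)

# e.g. greedy_assign([7, 5, 4, 3, 3], 2) -> ([[0, 3], [1, 2, 4]], 12)
```

With the naive least-loaded lookup, the loop costs O(kh), consistent with the running time stated above.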
Optimization Techniques
Using data locality: determine whether (u, v) belongs to the maximum match M in G for Q:
• Case 1: there are no boundary nodes in desc(G, v) of fragment Gj;
• Case 2: there are boundary nodes in fragment Gj, but the subgraph desc(Q, u) of Q is a DAG.
• (SA, SA2): Case 1; (BA, BA2): Case 2.
Minimization: minimizing pattern graphs (Q ≡ Qm). Given a pattern graph Q, we compute a minimized equivalent pattern graph Qm such that, for any data graph G, G matches Q iff G matches Qm, via graph simulation.

Experimental Study
Real-life datasets:
• Google Web graph: 875,713 nodes and 5,105,039 edges
• Amazon product co-purchasing network: 548,552 nodes and 1,788,725 edges
Synthetic graph generator (up to 10^8 nodes and 3,981,071,706 edges), with three parameters:
1. the number n of nodes;
2. the number n^α of edges; and
3. the number l of node labels.
Algorithms:
• Algorithm disHHK and its optimized version disHHK+;
• Optimal algorithms naiveMatchds (data shipment) and naiveMatchvt (visit times).
Machines: the experiments were run on a cluster of 16 machines, each with 2 Intel Xeon E5620 CPUs and 64 GB of memory.

Experimental Study
1. All algorithms scale well except naiveMatchds and naiveMatchvt;
2. disHHK+ consistently reduces the running time of disHHK by about [1/5, 1/4].

Experimental Study
1. All algorithms ship only about 1/10,000 of the data graphs;
2. disHHK+ and disHHK even ship less data than naiveMatchds when data graphs are large and sparse.

Experimental Study
• disHHK+ and disHHK incur [30%, 53%] more visit times than naiveMatchds, as expected.

Conclusion
• We have formulated and investigated the distributed graph pattern matching problem, via graph simulation.
• We have given a static analysis of graph simulation:
  – utility of connected components;
  – study of data locality.
• We have studied the complexity of a large class of distributed algorithms for graph simulation:
  – a message-passing computation model;
  – makespan, data shipment and visit times (which conflict with one another).
• We have proposed a distributed algorithm for graph simulation:
  – the scheduling problem;
  – optimization techniques;
  – experimental verification.
A first step towards the big picture of distributed graph pattern matching.