QSX: Querying Social Graphs
Graph Pattern Matching
• Graph pattern matching via subgraph isomorphism
• Graph pattern matching via graph simulation
• Revisions of graph simulation for social network analysis

The need for studying graph pattern matching
Applications
• pattern recognition
• knowledge discovery
• intelligence analysis
• transportation network analysis
• Web site classification
• social position and community detection
• social media marketing
• knowledge fusion
• ...
Prevalent use in traditional and emerging applications

Subgraph isomorphism: complexity and algorithm

Social Graphs
Directed graph G = (V, E, fA)
• attributes fA(u): label
• Simplification: assume fA(u) has a unique attribute, the node label
[Figure: example social graph with node labels AI, Med, DB, Gen, Soc, Eco, Chem]

Subgraph isomorphism
A function f from the nodes of Q to the nodes of G such that:
• for each node u in Q, u and f(u) have the same label;
• there exists an edge (u, u') in Q if and only if there exists an edge (f(u), f(u')) in G.
[Figure: pattern Q and data graph G, with node labels A, B, D, E and data nodes v1, v2]
A bijection: identical label matching, edge-to-edge relations

Matching by subgraph isomorphism
Input: a directed graph G and a graph pattern Q
Output: all subgraphs of G that are isomorphic to Q
• NP-complete
• Exponentially many matches
Complexity
• Remains NP-hard even when
  – Q is a forest and G is a tree
  – Q is acyclic and G is a tree
• PTIME if Q is a tree and G is a forest
The lower bound is rather robust: intractable

Algorithms for computing subgraph isomorphism
Input: pattern Q and graph G
Output: all isomorphic mappings P from Q to G
P: partial mappings, initially empty
Match(P)
• if P covers all nodes in Q then output P;
• else compute the set S(P) of all candidate pairs for inclusion in P
• for each pair p = (u, v) in S(P)
  • if p passes feasibility check
  • then P' ← P ∪ {p}; call Match(P');
• restore data structures
Notes
• S(P): nodes that are directly connected to those already in P, with the same labels
• feasibility check: guarantees correctness
• for each pair p = (u, v) in S(P): enumerate all possible extensions, for refinement; if the feasibility test is not successful, drop p and try the next pair
Recursion, refinement

VF2
Match(P)
• if P covers all nodes in Q then output P;
• else compute the set S(P) of all candidate pairs for inclusion in P
• for each pair p = (u, v) in S(P)
  • if p passes feasibility check
  • then P' ← P ∪ {p}; call Match(P');
• restore data structures
Feasibility rules: five k-look-ahead rules, to make sure that P is a partial isomorphic mapping; they guarantee correctness and reduce backtracking. For each pair (u, v) in P:
• their predecessors are already mapped and included in P
• their successors can possibly be mapped
• certain conditions on the cardinalities of predecessors and successors hold, to ensure correctness and expandability
L. P. Cordella, P. Foggia, C. Sansone, M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs, IEEE Trans. Pattern Anal. Mach. Intell. 26, 2004
VF2: a popular algorithm for subgraph isomorphism

Ullmann's algorithm
Use adjacency matrices of G and Q, their transposes, and a form of permutation matrices
Backtrack(P)
• if P covers all nodes in Q then output P and return;
• for each node u in Q that is not yet in P
  • find a node v in G; p ← (u, v); P' ← P ∪ {p};
  • if P' makes a partial mapping (injective function, preserving edges)
  • then call Backtrack(P');
Notes
• expanding permutation matrices representing P
• for each candidate pair p = (u, v): enumerate all possible extensions, for refinement
• backtracking: no matter whether the test is successful or not, go back to the previous level and try another p
J. R. Ullmann. An Algorithm for Subgraph Isomorphism. JACM 1976
An algorithm that is still being used

Graph simulation: complexity and algorithm

Graph Simulation
A binary relation R on the nodes of Q and the nodes of G such that:
• for each node u in Q, there exists a node v in G such that (u, v) is in R;
• for each pair (u, v) in R, u and v have the same label;
• for each pair (u, v) in R and each edge (u, u') in Q, there exists an edge (v, v') in G such that (u', v') is in R.
[Figure: pattern Q and data graph G, with node labels A, B, D, E and data nodes v1, v2]
Relations as opposed to functions
A relation: identical label matching, edge-to-edge mapping

Matching by graph simulation
Input: a directed graph G and a graph pattern Q
Output: the maximum simulation relation R
Maximum simulation relation: always exists and is unique
• If a match relation exists, then there exists a maximum one
• Otherwise, it is the empty set, which is still maximum
Complexity: O((|V| + |VQ|)(|E| + |EQ|))
The output is a unique relation, possibly of size |Q||V|
Use relations instead of functions
Quadratic time

Data locality
Given a pattern Q, a graph G and a node v in G, can we decide whether v matches some node in Q by inspecting only nodes within d hops of v, where d is determined by Q only?
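The "nodes within d hops of v" in the question above can be collected with a breadth-first search. The following is only a minimal sketch under stated assumptions: the graph is given as a hypothetical adjacency dictionary `adj` (node to list of successors), hops are counted while ignoring edge direction, and the function name `d_hop_neighborhood` is illustrative rather than taken from the lecture.

```python
from collections import deque

def d_hop_neighborhood(adj, v, d):
    """Return the set of nodes within d hops of v, ignoring edge direction.

    adj is an assumed representation: a dict mapping each node to an
    iterable of its successors in the directed graph.
    """
    # Build an undirected view so that hops may follow edges either way.
    undirected = {}
    for src, succs in adj.items():
        undirected.setdefault(src, set())
        for dst in succs:
            undirected[src].add(dst)
            undirected.setdefault(dst, set()).add(src)

    seen = {v}
    queue = deque([(v, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == d:
            continue  # do not expand beyond d hops
        for nxt in undirected.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen

# Hypothetical toy graph: a -> b -> c -> d
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(sorted(d_hop_neighborhood(graph, "b", 1)))  # ['a', 'b', 'c']
```

Whether inspecting only this neighborhood suffices depends on the matching notion, which is exactly what the question asks.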
• Subgraph isomorphism has data locality: with d the diameter of Q, we only need to inspect the d-neighborhood of v
• Graph simulation does not have data locality
[Figure: pattern Q and data graph G]
Graph simulation: a recursive computation

Algorithm for computing graph simulation
Input: pattern Q and graph G
Output: for each u in Q, sim(u): the matches w in G
Similarity(P)
• for all nodes u in Q do
  • sim(u) ← the set of candidate matches w in G;
• while there exist an edge (u, v) in Q and w in sim(u) (in G) that violate the simulation condition, i.e., successor(w) ∩ sim(v) = ∅
  • sim(u) ← sim(u) \ {w};
• output sim(u) for all u in Q
Notes
• candidate matches: nodes with the same label; moreover, if u has an outgoing edge, so does w
• refinement: successor(w) ∩ sim(v) = ∅ means that there exists an edge from u to v in Q, but the candidate w of u has no corresponding edge to a node w' that matches v
Correct, but not in quadratic time

Speedup
For each node u in pattern Q, prevsim(u): a superset of sim(u)
• the nodes once considered as candidate matches of u
• for each edge (u, v) in Q and each w in sim(u): successor(w) ∩ prevsim(v) ≠ ∅
• terminate if prevsim(u) = sim(u) for all nodes u in Q: the relation can't be refined further
prevsim(v) \ sim(v): invalid candidates
• if successor(w) ∩ sim(v) = ∅, w should be removed from sim(u), where u is a predecessor of v
• propagate violations upward
Once w is removed, it is never put back
Each node in prevsim(u) is looked up only once

Algorithm
Similarity(P)
• for all nodes v in Q do
  • sim(v) ← the set of candidate matches in G;
  • prevsim(v) ← the set of all the nodes in G;
• while there exists a node v in Q such that sim(v) ≠ prevsim(v)
  • remove ← predecessor(prevsim(v)) \ predecessor(sim(v));
  • for all u in predecessor(v) do
    • sim(u) ← sim(u) \ remove;
  • prevsim(v) ← sim(v);
• output sim(v) for all v in Q
Notes
• candidate matches: nodes with the same label; moreover, if v has an outgoing edge, so does the candidate
• remove: a dynamically maintained set
• propagate up; refinement
For each w ∈ prevsim(v) \ sim(v), w is checked only once, hence |VQ||V| in total
Can be implemented in O((|V| + |VQ|)(|E| + |EQ|)) time

Graph simulation revised for social network analysis

Graph pattern matching: the conventional notions
Input: a query Q and a data graph G
Output: all the matches of Q in G
• subgraph isomorphism: a bijective function f on nodes: (u, u') ∈ Q iff (f(u), f(u')) ∈ G
• graph simulation: a binary relation S on nodes: for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G such that (u', v') ∈ S
Can we use the conventional notions for social network analysis?

Example query: graph pattern matching
Find all matches of a pattern in a graph
Identify suspects in a drug ring
[Figure: drug-ring pattern and data graph, with nodes B, A1, ..., Am, AM, S, W, FW and edge bounds 1 and 3]
"Understanding the structure of drug trafficking organizations"

Pattern matching in social graphs
[Figure: the drug-ring pattern and data graph, annotated "relation instead of function", "not allowed by bijection" and "edges to paths"]
For both scalability and effectiveness
Neither subgraph isomorphism nor graph simulation works

Social Graphs
Directed graph G = (V, E, fA)
• attributes fA(u): a tuple (A1 = a1, ..., An = an), e.g., label, keywords, blogs, comments, rating, ...
[Figure: example graph with node labels AI, Med, DB, Gen, Soc, Chem, Eco and attribute tuples ('dept'=CS, 'field'=AI), ('dept'=CS, 'field'=DB), ('dept'=Bio, 'field'=Gen), ('dept'=Bio, 'field'=Eco)]
Social graphs: modeling attributes

Bounded patterns
Pattern graph: Q = (VQ, EQ, fv, fe)
• fv(u): a search condition, i.e., a conjunction of A op a, with op in <, ≤, =, ≠, >, ≥ (e.g., fv(): 'dept'=CS)
• fe(u, u'): a constant k (bounded: within k hops) or a symbol ∗ (unbounded)
[Figure: pattern with nodes CS, Med, Bio, Soc and edge bounds ∗, 3, 2]
Incorporating search conditions and bounds on the number of hops

Bounded Simulation
G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via bounded simulation if there exists a binary relation S ⊆ VQ × V such that
• for each u ∈ VQ, there exists v ∈ V such that (u, v) ∈ S;
• for each (u, v) ∈ S,
  – the attributes fA(v) satisfy the predicate fv(u);
  – each (u, u') in EQ is mapped to a path from v to v' in G of length bounded by fe(u, u'), with (u', v') ∈ S.
S is a total mapping, and satisfies the search conditions and the bounds on edge-to-path mappings
[Figure: the bounded pattern (CS, Med, Bio, Soc) and a data graph (AI, Med, DB, Gen, Chem, Soc, Eco) related by S]
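The definition of bounded simulation above can be turned into a naive fixpoint computation: start from all nodes satisfying the search conditions, then repeatedly drop candidates that lack a suitably bounded path to a match of a child pattern node. The sketch below is illustrative only and is not the cubic-time algorithm of the lecture; the data layout (`adj`, `attrs`, `q_nodes`, `q_edges`), the use of `None` for the symbol ∗, and all names are assumptions made for this example.

```python
from collections import deque

def path_lengths(adj, v):
    """Shortest nonempty-path length from v to each reachable node."""
    dist = {}
    queue = deque((w, 1) for w in adj.get(v, ()))
    while queue:
        node, d = queue.popleft()
        if node in dist:
            continue  # already reached by a path that is no longer
        dist[node] = d
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                queue.append((nxt, d + 1))
    return dist

def bounded_simulation(adj, attrs, q_nodes, q_edges):
    """Return the maximum match as a set of (pattern node, data node) pairs.

    adj:     data node -> iterable of successors
    attrs:   data node -> attribute dict, playing the role of fA
    q_nodes: pattern node -> predicate on an attribute dict (fv)
    q_edges: (u, u2) -> bound k, or None for the unbounded symbol * (fe)
    """
    dist = {v: path_lengths(adj, v) for v in attrs}
    sim = {u: {v for v in attrs if pred(attrs[v])}
           for u, pred in q_nodes.items()}
    changed = True
    while changed:
        changed = False
        for (u, u2), bound in q_edges.items():
            for v in list(sim[u]):
                ok = any(v2 in dist[v] and
                         (bound is None or dist[v][v2] <= bound)
                         for v2 in sim[u2])
                if not ok:  # no bounded path to any match of u2
                    sim[u].discard(v)
                    changed = True
    if any(not matches for matches in sim.values()):
        return set()  # some pattern node has no match: S is not total
    return {(u, v) for u, matches in sim.items() for v in matches}

# Hypothetical data: ann -> cat -> bob, and an isolated node dan.
adj = {"ann": ["cat"], "cat": ["bob"], "bob": [], "dan": []}
attrs = {"ann": {"dept": "CS"}, "cat": {"dept": "Med"},
         "bob": {"dept": "Bio"}, "dan": {"dept": "CS"}}
q_nodes = {"u_cs": lambda a: a["dept"] == "CS",
           "u_bio": lambda a: a["dept"] == "Bio"}
q_edges = {("u_cs", "u_bio"): 2}
print(sorted(bounded_simulation(adj, attrs, q_nodes, q_edges)))
# [('u_bio', 'bob'), ('u_cs', 'ann')]
```

In this toy input, dan satisfies the CS condition but has no path to a Bio node, so it is pruned; tightening the bound to 1 would prune ann as well and leave the empty match.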
There exists a unique maximum match
Mapping edges to bounded paths

Bounded simulation in social graphs
[Figure: the drug-ring pattern and data graph, annotated "relation instead of function" and "edges to paths"]
The set of all suspects involved in a drug ring

Complexity
Input: pattern Q and data graph G
Output: Q(G), the unique maximum match relation (always exists)
• Bounded simulation: cubic time, O(|V||E| + |EQ||V|^2 + |VQ||V|)
• Subgraph isomorphism: intractable
• Graph simulation: O((|V| + |VQ|)(|E| + |EQ|)); comparable, since Q is small in practice
Query driven approximation: use bounded simulation instead of subgraph isomorphism.
Criteria
• Lower complexity
• Effectiveness: the query answers are sensible
Algorithm? The reading list
To identify sensible matches and be computable in low PTIME

Bounded simulation vs. graph simulation
Graph simulation: a special case of bounded simulation
• The same bound 1 on all pattern edges (edge-to-edge mapping)
• Unique attributes vs. search conditions: label equality
• O((|VG| + |VQ|)(|EG| + |EQ|)) vs. O(|VG||EG| + |EQ||VG|^2 + |VQ||VG|)
Applications: process calculus, Web site classification, social position detection, ...
Capture more sensible matches in social graphs (by 80%)

Homeomorphism and monomorphism
Graph homeomorphism: G = (V, E) matches Q = (VQ, EQ) via
• an injective function from VQ to V (function rather than relation)
• edges mapped to pairwise node-disjoint simple paths in G (constraints on paths)
Monomorphism revised: G = (V, E) matches Q = (VQ, EQ) via
• an injective function from VQ to V
• edges mapped to nonempty paths in G
Intractable, even when Q is a tree and G is a DAG
Strike a balance between expressive power and complexity

Graph pattern matching:
• Incorporating edge relationships

Edge relationships
S: supervise
C: co-author
[Figure: pattern with nodes CS, DB, Bio and edges labeled C and S+; data graph with nodes (Ann, CS), (Mat, DB), (John, DB), (Pat, DB), (Bill, Bio), (Tom, Bio), (Don, Gen) and edges labeled C and S]
What is this pattern to find?
Edge relation
[Figure: graph over the nodes Facebook, Twitter, Mikhail, Alice, Sunita, Jose]
(Alice, Facebook)
(Alice, Sunita)
(Jose, Twitter)
(Jose, Sunita)
(Mikhail, Facebook)
(Mikhail, Twitter)
(Sunita, Facebook)
(Sunita, Alice)
(Sunita, Jose)

Graph encodings: Adding edge types
Adding edge labels
[Figure: the same graph with edges labeled fan-of and friend-of]
(Alice, fan-of, Facebook)
(Alice, friend-of, Sunita)
(Jose, fan-of, Twitter)
(Jose, friend-of, Sunita)
(Mikhail, fan-of, Facebook)
(Mikhail, fan-of, Twitter)
(Sunita, fan-of, Facebook)
(Sunita, friend-of, Alice)
(Sunita, friend-of, Jose)

Graph encodings: Adding weights
Even further, you can add weights and others
[Figure: the same graph with edge labels and weights]
(Alice, fan-of, 0.5, Facebook)
(Alice, friend-of, 0.9, Sunita)
(Jose, fan-of, 0.5, Twitter)
(Jose, friend-of, 0.3, Sunita)
(Mikhail, fan-of, 0.8, Facebook)
(Mikhail, fan-of, 0.7, Twitter)
(Sunita, fan-of, 0.7, Facebook)
(Sunita, friend-of, 0.9, Alice)
(Sunita, friend-of, 0.3, Jose)

Regular patterns
Pattern: Q = (VQ, EQ, fv, fe)
• fv(u): a conjunction of A op a, with op in <, ≤, =, ≠, >, ≥
• fe(u, u'): a regular expression of the form F ::= c | c^k | c+ | FF
Simple regular expressions
• fairly common
• optimizing patterns (checking containment in linear time)
• low complexity in matching
[Figure: pattern with nodes CS, DB, Bio and edges labeled C and S+, with bounded and unbounded edges marked]
Mapping edges to paths satisfying associated regular expressions

Complexity
Input: pattern Q and data graph G
Output: Q(G)
• O(|V||E| + m|EQ||V|^2 + |VQ||V|), where m is the number of distinct colors in Q
• bounded simulation: a special case, with a single color c (hence m = 1) and fe(u, u') = c
• general regular expressions?
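To make the edge-to-path mapping concrete, the edge-typed triples listed above can be searched for the nodes reachable from a start node through a path whose colors match an expression F. This is only a sketch under stated assumptions: F is encoded as a hypothetical list of (color, bound) atoms, where bound 1 stands for c, an integer k stands for c^k read as "between 1 and k consecutive c-edges" (an assumed reading), and `None` stands for c+. The names `regular_reach` and `match_atom` are illustrative.

```python
def match_atom(index, nodes, color, bound):
    """Nodes reachable from `nodes` via 1..bound edges of the given color
    (any positive number of edges if bound is None)."""
    reached, frontier, hops = set(), set(nodes), 0
    while frontier and (bound is None or hops < bound):
        step = set()
        for n in frontier:
            step |= index.get((n, color), set())
        frontier = step - reached  # expand only newly reached nodes
        reached |= frontier
        hops += 1
    return reached

def regular_reach(triples, start, expr):
    """Nodes reachable from start via a path matching expr (a list of atoms)."""
    index = {}
    for src, color, dst in triples:
        index.setdefault((src, color), set()).add(dst)
    nodes = {start}
    for color, bound in expr:  # concatenation FF: match atoms in sequence
        nodes = match_atom(index, nodes, color, bound)
    return nodes

# The edge-typed triples from the slide.
triples = [
    ("Alice", "fan-of", "Facebook"), ("Alice", "friend-of", "Sunita"),
    ("Jose", "fan-of", "Twitter"), ("Jose", "friend-of", "Sunita"),
    ("Mikhail", "fan-of", "Facebook"), ("Mikhail", "fan-of", "Twitter"),
    ("Sunita", "fan-of", "Facebook"), ("Sunita", "friend-of", "Alice"),
    ("Sunita", "friend-of", "Jose"),
]
# friend-of^2 followed by fan-of, starting from Alice
expr = [("friend-of", 2), ("fan-of", 1)]
print(sorted(regular_reach(triples, "Alice", expr)))  # ['Facebook', 'Twitter']
```

A matching algorithm would run such a check per pattern edge inside the simulation refinement; here it is shown in isolation for a single start node.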
Adding edge colors does not incur extra complexity

Graph pattern matching:
• Capturing graph topology

Limitations of graph simulation
[Figure: pattern and data graph]
• A disconnected graph matches a connected pattern
• The yellow node in the pattern has 3 "parents", in contrast to 1 in the data graph
• An undirected cycle matches a tree
Simulation does not preserve the topology in matching

Limitations of graph simulation
[Figure: pattern and data graph]
• A cycle with two nodes matches a cycle of unbounded length
• The match relation may be excessively large
• When social distances increase, the closeness of relationships decreases
The need for revising simulation to enforce locality

Dual simulation
G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via dual simulation if there exists a binary relation S ⊆ VQ × V such that S is a total mapping, satisfies the search conditions, and preserves both "child" and "parent" relationships: for each (u, v) ∈ S,
• each (u, u') in EQ is mapped to an edge (v, v') in G, with (u', v') ∈ S;
• each (u', u) in EQ is mapped to an edge (v', v) in G, with (u', v') ∈ S.
Q(G): a unique maximum match relation
Preserve "parent" relationships and connectivity

Locality
• diameter dQ: the maximum shortest distance between nodes of Q (undirected paths)
• dQ-radius subgraph G[v, dQ]: centered at v, within dQ hops
[Figure: an excessive match, with distances 1 and 2 marked]
Locality: matches contained in G[v, dQ] for some v

Strong simulation
G matches Q via strong simulation if there exists a node v in G such that G[v, dQ] matches Q via dual simulation
– duality
– locality
Match: the subgraph GS of G[v, dQ] representing the maximum match S
• for each (u, v) in the maximum match S, v is in GS
• for each edge (u, u') in Q, (v, v') is in GS if (u, v) ∈ S and (u', v') ∈ S
Matching: given Q and G, find the set Q(G) of all matches

Preserving the topology of patterns
• child and parent relationships
• connectivity: if Q is connected (via undirected paths), so is GS
• cycles: a directed (resp. undirected) cycle in Q matches a directed (resp. undirected) cycle in GS
• bounded matches:
  – the diameter of GS is at most 2 * dQ
  – |M(Q, G)| ≤ |V|
What about graph simulation?

Strong simulation vs. graph simulation
Hierarchy: each notion implies the next
• G matches Q via subgraph isomorphism
• G matches Q via strong simulation
• G matches Q via dual simulation: preserves topology, but not bounded matches
• G matches Q via graph simulation: does not preserve parents, connectivity, undirected cycles, or bounded matches
Complexity of strong simulation
Input: pattern Q and data graph G
Output: Q(G)
• cubic time: O(|V|(|V| + (|VQ| + |EQ|)(|V| + |E|)))
A balance between the complexity and the ability to preserve topology

Making strong simulation stronger?
• Bounded cycles: if G matches Q, then the longest simple cycle in G is no longer than its counterpart in Q
• Bisimulation instead of simulation: find all subgraphs that are bisimilar to a pattern; for each (u, v) ∈ S,
  – each (u, u') in EQ is mapped to an edge (v, v') in GS, with (u', v') ∈ S
  – each edge (v, v') in GS is mapped to an edge (u, u') in EQ, with (u', v') ∈ S
Both extensions take matching from PTIME to intractable

Summing up

Various notions for graph pattern matching
matching                complexity        |M(Q, G)|
subgraph isomorphism    NP-complete       |V|^|VQ|
graph simulation        quadratic time    |V||VQ|
bounded simulation      cubic time        |V||VQ|
regular matching        cubic time        |V||VQ|
strong simulation       cubic time        |V|
Query driven approximation: from subgraph isomorphism (intractable) to strong simulation or bounded simulation (cubic time)

Summary
Graph pattern matching
– Subgraph isomorphism
– Graph simulation
– Bounded simulation
– Regular matching
– Strong simulation
– ...
A uniform framework for these
Querying both topology and data content
• What query language should we use for social data analysis?
• Strike a balance between expressivity and complexity
Reading: W. Fan. Graph Pattern Matching Revised for Social Network Analysis, ICDT 2012.
(survey of graph pattern matching)
The study has raised as many questions as it has answered

Summary and review
• What is subgraph isomorphism? Complexity? Algorithm? Name a few applications.
• What is graph simulation? Complexity? Understand its algorithm. Name a few applications.
• Why do we need to revise conventional graph pattern matching for social network analysis? How should we do it? Why?
• Understand bounded simulation. Read its algorithm. Complexity?
• What is strong simulation? Complexity? Name a few applications in which strong simulation is useful.
• Find other revisions of conventional graph pattern matching that are not covered in the lecture.

Project (1)
Recall bounded graph simulation
• Implement an algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via bounded simulation
• Develop optimization strategies
• Experimentally evaluate your algorithm, especially its scalability with the size of G
• Write a survey on revisions of conventional graph simulation, as related work
A development project

Project (2)
Recall graph simulation
• Develop a MapReduce algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via graph simulation
• Develop optimization strategies
• Experimentally evaluate your algorithm, especially its scalability with the size of G
• Write a survey on revisions of conventional graph simulation, as part of the related work
A research and development project

Project (3)
Recall subgraph isomorphism
• Develop two algorithms that, given a pattern Q and a graph G, compute all matches of Q in G via subgraph isomorphism, in
  – MapReduce (see Lecture 4)
  – BSP (see Lecture 5)
• Develop optimization strategies to reduce parallel computational cost and data shipment cost
• Experimentally evaluate your algorithms, especially their scalability with the size of G
• Write a survey on parallel algorithms for subgraph isomorphism
A development project

Papers for you to review
• M. R. Henzinger, T. Henzinger, and P. Kopke. Computing simulations on finite and infinite graphs. FOCS, 1995. http://infoscience.epfl.ch/record/99332/files/HenzingerHK95.pdf
• L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 26, 2004. (search Google Scholar)
• A. Fard, M. U. Nisar, J. A. Miller, and L. Ramaswamy. Distributed and scalable graph pattern matching: models and algorithms. Int. J. Big Data. http://cobweb.cs.uga.edu/~ar/papers/IJBD_final.pdf
• W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph pattern matching: From intractable to polynomial time. VLDB, 2010.
• W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries. ICDE, 2011.
• S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Strong simulation: Capturing topology in graph pattern matching. TODS 39(1): 4, 2014.