Querying and Mining Large Graph Databases
Jeffrey Xu Yu
Department of Systems Engineering and Engineering Management
The Chinese University of Hong Kong

Query Processing
• Parallel query processing
• Distributed query processing
• Adaptive query processing
• Continuous query processing
• Spatial database query processing
• Data warehouse, online analytical processing (OLAP), datacube, view management, Iceberg queries
• Multi-database query processing
• Data mining query processing
• Object-oriented query processing
• Extensible database query processing
• XML/RDF/Graph query processing

Books and Surveys on QP
• Readings in Database Systems (Fourth Edition), Michael Stonebraker and Joseph M. Hellerstein, Morgan Kaufmann, http://redbook.cs.berkeley.edu/bib4.html
• Principles of Database Query Processing for Advanced Applications, Clement T. Yu and Weiyi Meng, Morgan Kaufmann, 1998
• Matthias Jarke and Jürgen Koch: Query Optimization in Database Systems, ACM Computing Surveys, 16(2), 111-152, 1984.
• Michael V. Mannino, Paicheng Chu, and Thomas Sager: Statistical Profile Estimation in Database Systems, ACM Computing Surveys, 20(3), 191-221, 1988.
• Priti Mishra and Margaret H. Eich: Join Processing in Relational Databases, ACM Computing Surveys, 24(1), 63-113, 1992.
• D. J. DeWitt and J. Gray: Parallel Database Systems: The Future of High Performance Database Processing, Communications of the ACM, June 1992.
• Goetz Graefe: Query Evaluation Techniques for Large Databases, ACM Computing Surveys, 25(2), 73-170, 1993.
• Yannis E. Ioannidis: Query Optimization, ACM Computing Surveys, 28(1), 121-123, 1996.
• Surajit Chaudhuri: An Overview of Query Optimization in Relational Systems, PODS, 34-43, 1998.
• D. Kossmann: The State of the Art in Distributed Query Processing, ACM Computing Surveys, 32(4), 422-469, 2000.

The Main Theme of DBMSs
• Structural behavior and operational behavior
• More data structures and more operations?
• What should be added?
• At which level to support it?
• How to support it?
• The more the better?
• Can it be complete?

Supporting Structural Behavior
• Entity-Relationship Model, Peter P. Chen, ACM TODS, 1976
• Database Abstractions: Aggregation and Generalization, John M. Smith and Diane C.P. Smith, ACM TODS, 1977
• Molecular Objects, Abstract Data Types and Data Models: A Framework, Don S. Batory and Alejandro P. Buchmann, VLDB'84
• Survey of Graph Database Models, Renzo Angles and Claudio Gutierrez, ACM Computing Surveys, Vol. 40, No. 1, 2008

Adding New Structures into RDBMS (1)
• Complex Objects in IBM System-R (Raymond A. Lorie and Wil Plouffe, SIGMOD'83)
• System-R is a database system built at IBM San Jose Research in the 1970s.
• System-R introduced SQL.
• Lorie and Plouffe proposed adding hidden pointers to support tree-structured schemas.

Adding New Structures into RDBMS (2)
• Implementation of Data Abstraction in RDBS Ingres (James Ong, Dennis Fogg and Michael Stonebraker, SIGMOD Record, 1984)
• Quel as a Data Type (Michael Stonebraker, et al., SIGMOD'84)
• Nested Relations (Akifumi Makinouchi, VLDB'77)

OO DBMS Manifesto
• Object-Oriented Database System Manifesto (Malcolm Atkinson, François Bancilhon, David DeWitt, Klaus Dittrich, David Maier and Stanley Zdonik, DOOD, 1989)
• Third-Generation Database System (Michael Stonebraker, Lawrence A. Rowe, Bruce G. Lindsay, Jim Gray, Michael J. Carey, Michael L. Brodie, Philip A. Bernstein, and David Beech, DS-4, 1990)

Semistructured and XML
• Querying Semistructured Information (Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeffrey D. Ullman, and Jennifer Widom, DOOD'95)
• XML query processing has been a hot topic since 1998.
Early Work To Support Graphs in DBMSs

A Database as a Graph (GOOD)
• A graph-oriented object database model (Marc Gyssens, Jan Paredaens, Jan Van den Bussche and Dirk Van Gucht, PODS'90)
• Both schema and objects are graphs
• Data manipulations are graph transformations
  • Based on graph grammars and pattern matching
  • 4 operators: node/edge insertion/deletion
  • An abstraction operator that groups objects sharing the same set of properties
• Computational completeness
  • Relationally complete with the 4 operators
  • Simulates nested relational algebra with the abstraction operator and the 4 operators

Support Abstract Graph Structures
• A Graphical Query Language Supporting Recursion (Isabel F. Cruz, Alberto O. Mendelzon, Peter Wood, SIGMOD'87)
• Its expressive power is comparable to that of relational query languages.

Explicitly and Views
• Support graphs on RDBMS explicitly
  • GraphDB: Modeling and Querying Graphs in Databases (Ralf Hartmut Güting, VLDB'94)
• Graph views
  • Database Graph Views: A Practical Model to Manage Persistent Graphs (Alejandro Gutierrez, Philippe Pucheral, Hermann Steffen and Jean-Marc Thevenin, VLDB'94)
  • Build a graph view on a DBMS (RDBMS or OODBMS)
  • Graph view operations: union, intersection, node-difference, edge-difference, selection, and projection
  • Query processing on graph views: set-oriented and pipelined

Graph Storage
• Representing web graphs (Sriram Raghavan, Hector Garcia-Molina, ICDE'03)
• The Web graph is huge.
• Construct a two-level graph.
  • Partition nodes into N1, N2, ….
  • Construct a super-node graph where each node is a partition Ni.
  • Maintain page-based connection information inside Ni, and between Ni and Nj.
• How to partition?

Books on Social Networks
• Social and Economic Networks by Matthew O. Jackson
• Social Network Data Analysis by Charu C. Aggarwal
• Exploratory Social Network Analysis with Pajek by Wouter de Nooy, Andrej Mrvar, and Vladimir Batagelj
• Networks, Crowds, and Markets: Reasoning about a Highly Connected World by David Easley and Jon Kleinberg
• Networks: An Introduction by M.E.J.
Newman
• Managing and Mining Graph Data by Charu C. Aggarwal and Haixun Wang

Some Online Courses
• Mining of Massive Datasets (Anand Rajaraman and Jeff Ullman) http://infolab.stanford.edu/~ullman/mmds.html
• Networks, Crowds, and Markets: Reasoning about a Highly Connected World, by David Easley and Jon Kleinberg http://www.cs.cornell.edu/home/kleinber/networks-book
• Topics in Data Management & Mining – Social Networks, Laks V.S. Lakshmanan http://www.cs.ubc.ca/~laks/534l/cpsc534l.html

Some Recent Tutorials
• Mining Heterogeneous Information Networks by Jiawei Han, Yizhou Sun, Xifeng Yan, and Philip S. Yu (KDD'10)
• Mining Knowledge from Databases: An Information Network Analysis by Jiawei Han, Yizhou Sun, Xifeng Yan, and Philip S. Yu (SIGMOD'10)
• Querying Large Graph Databases by Yiping Ke, James Cheng, and Jeffrey Xu Yu (DASFAA'10)

Stanford Large Network Dataset Collection
http://snap.stanford.edu/data
• Social networks
• Communication networks
• Citation networks
• Collaboration networks
• Web graphs
• Amazon networks
• Internet networks
• Road networks
• Autonomous systems
• Signed networks
• Wikipedia networks and metadata
• Twitter and Memetracker

Graph Database
http://en.wikipedia.org/wiki/Graph_database
• Pregel: Google's internal graph processing platform
• Trinity: Microsoft Research Asia
• Neo4j: commercial graph database
• …

Graph Reachability Query

Two Possible Solutions
• Traverse G(V, E) to answer reachability queries
  • Low query performance: O(|E|) query time
• Precompute and store the transitive closure T
  • Fast query processing
  • Large storage requirement: O(|V|^2)

Interval-based Labeling
• For tree-structured data
  • Assign each node one interval [start, end], such as the pair [pre-order, post-order] obtained during DFS
  • A node reaches another iff its interval contains the other's
  • Space-efficient: O(|V| log |V|)
• Extend to DAG [Agrawal et al. SIGMOD'89]
  • To cover all descendants, a node may need multiple intervals
  • Space complexity: O(|V|^2) in the worst case
  • It can be used to support directed graphs.
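The tree case of interval-based labeling can be sketched in a few lines of Python. This is a minimal illustration, not the scheme of any cited paper: the function names `label_tree` and `reaches`, the single DFS counter that stamps a node on entry and on exit, and the eight-node example tree are assumptions chosen for the sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical example tree (assumed for illustration): a is the root.
EXAMPLE_TREE: Dict[str, List[str]] = {
    "a": ["b", "c", "d"],
    "b": ["e", "f", "g"],
    "c": [],
    "d": ["h"],
    "e": [], "f": [], "g": [], "h": [],
}


def label_tree(children: Dict[str, List[str]], root: str) -> Dict[str, Tuple[int, int]]:
    """Assign each node an interval (start, end) with one DFS counter.

    The counter is incremented when a node is entered and again when it is
    left, so a descendant's interval is strictly nested in its ancestor's.
    """
    intervals: Dict[str, Tuple[int, int]] = {}
    counter = 1
    start = {root: counter}
    # Iterative DFS: each stack frame holds a node and an iterator over its children.
    stack = [(root, iter(children.get(root, [])))]
    while stack:
        node, remaining = stack[-1]
        child = next(remaining, None)
        counter += 1
        if child is None:
            # All children done: close the interval of this node.
            intervals[node] = (start[node], counter)
            stack.pop()
        else:
            # Enter the child and open its interval.
            start[child] = counter
            stack.append((child, iter(children.get(child, []))))
    return intervals


def reaches(intervals: Dict[str, Tuple[int, int]], u: str, v: str) -> bool:
    """u reaches v iff the interval of u contains the interval of v."""
    su, eu = intervals[u]
    sv, ev = intervals[v]
    return su <= sv and ev <= eu


intervals = label_tree(EXAMPLE_TREE, "a")
print(intervals["a"], intervals["b"], intervals["h"])
print(reaches(intervals, "a", "h"), reaches(intervals, "b", "h"))
```

A query then costs two integer comparisons, independent of the tree size. The containment test is reflexive here, so every node is reported as reaching itself.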
[Figure: interval labeling of an example tree with nodes a to h: a [1,16], b [2,9], c [10,11], d [12,15], e [3,4], f [5,6], g [7,8], h [13,14]. In the DAG variant of the figure, one node carries two intervals, [5,8] and [12,15].]

2-hop Labeling [Cohen et al. SODA'02]
• For each node a, maintain two sets of labels (nodes): Lin(a) and Lout(a)
• For each connection (a, b),
  • choose a node c on the path from a to b (center node)
  • add c to Lout(a) and to Lin(b)
• Then (a, b) ∈ Transitive Closure T ⇔ Lout(a) ∩ Lin(b) ≠ ∅
• Minimize the sum of the label sizes (NP-complete ⇒ approximation required)

2-hop Labeling (Example)
[Figure: example graph on nodes a to i, shown step by step together with a table of the Lin and Lout labels of each node; the individual label entries are not recoverable from the extracted text.]

Approximation Algorithm
• What are good center nodes? Nodes that can cover many uncovered connections.
• Initial step: all connections are uncovered.
• Consider the center graph of each candidate, a bipartite graph with an in-side I and an out-side O; its density is edges / (|I| + |O|).
[Figure: example graph on nodes 1 to 6 and the center graphs of two candidate center nodes.]
• First candidate: 8 edges, |I| + |O| = 2 + 4 = 6, initial density 8 / 6 = 1.33. The density of the densest subgraph is here the same as the initial density, so we can cover 8 connections with 6 entries.
• Second candidate: 12 edges, |I| + |O| = 4 + 3 = 7, initial density 12 / 7 = 1.71. The density of the densest subgraph equals the initial density (the graph is complete).
• Cover the connections in the subgraph with the greatest density, using the corresponding center node.
[Figure: the example graph on nodes 1 to 6 and the reduced center graph of a candidate.]
• Next step: some connections are already covered.
• Consider the center graph of the candidates again.
• Repeat this algorithm until all connections are covered.
• Theorem: the generated cover is optimal up to a logarithmic factor.

2-hop Labeling Construction
• T = TC (the transitive closure)
• While T is not empty
  • Choose S: elements of T to be processed
  • Update Lin and Lout labels w.r.t. S
  • Remove elements from T w.r.t. S

Construction Example
[Figure: the transitive closure T of an example graph on nodes a to i.]
• Bipartite graph Bh for candidate center h: a subgraph of Bh has density 9/7.
• Bipartite graph Bf for candidate center f: the densest subgraph Sf of Bf has density 15/8, the densest among all nodes.
• Update Lin and Lout labels w.r.t. Sf. [Table: Lin and Lout after the update, with f added as a label entry; the individual rows are not recoverable from the extracted text.]
• Remove the covered connections to obtain the updated T.
• Finding the densest subgraph is not obvious
  • An NP-hard problem -> SET COVER
  • Use SET COVER heuristics to minimize the size

Optimizing Performance [Schenkel et al. EDBT'04, ICDE'05]
• The density of the densest subgraph of a node's center graph never increases when connections are covered.
• Do not rank all the densest subgraphs in every iteration.
• Precompute estimates, recompute on demand.

Is that enough?
• Transitive closure: 344,992,370 connections
• 2-hop cover: 1,289,930 entries
  • compression factor of ~267
  • queries are still fast (~7.6 entries/node)
• Computation took 45 hours and 80 GB RAM!
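The 2-hop query test and a greedy cover construction can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of Cohen et al. or of HOPI: instead of extracting the densest subgraph of each center graph, it scores a candidate center by the whole set of uncovered connections passing through it, divided by the number of new label entries. The names `transitive_closure`, `greedy_two_hop`, `reachable` and the example DAG are hypothetical. A node may appear in its own labels, so the query stays a pure intersection test as on the slides.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]  # every node must appear as a key

# Hypothetical example DAG (assumed for illustration).
EXAMPLE_DAG: Graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}


def transitive_closure(adj: Graph) -> Dict[str, Set[str]]:
    """reach[u] = nodes reachable from u by a path of length >= 1."""
    reach: Dict[str, Set[str]] = {}
    for source in adj:
        seen: Set[str] = set()
        stack = list(adj[source])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(adj[node])
        reach[source] = seen
    return reach


def greedy_two_hop(adj: Graph) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Greedy 2-hop cover: repeatedly pick the center with the best
    (uncovered connections covered) / (new label entries) ratio."""
    reach = transitive_closure(adj)
    uncovered = {(a, b) for a in adj for b in reach[a]}
    l_in: Dict[str, Set[str]] = {v: set() for v in adj}
    l_out: Dict[str, Set[str]] = {v: set() for v in adj}
    while uncovered:
        best_ratio, best = 0.0, None
        for w in adj:
            # Uncovered connections (a, b) that have w on some path from a to b.
            through = {(a, b) for (a, b) in uncovered
                       if (a == w or w in reach[a]) and (b == w or b in reach[w])}
            if not through:
                continue
            sources = {a for a, _ in through}
            targets = {b for _, b in through}
            # An uncovered pair lacks w in at least one label, so this is >= 1.
            new_entries = (sum(w not in l_out[a] for a in sources)
                           + sum(w not in l_in[b] for b in targets))
            ratio = len(through) / new_entries
            if ratio > best_ratio:
                best_ratio, best = ratio, (w, sources, targets)
        w, sources, targets = best
        for a in sources:
            l_out[a].add(w)
        for b in targets:
            l_in[b].add(w)
        # Drop every connection that now has a common center.
        uncovered = {(a, b) for (a, b) in uncovered if not (l_out[a] & l_in[b])}
    return l_in, l_out


def reachable(l_in: Dict[str, Set[str]], l_out: Dict[str, Set[str]], a: str, b: str) -> bool:
    """(a, b) is in the transitive closure iff Lout(a) and Lin(b) intersect."""
    return bool(l_out[a] & l_in[b])


l_in, l_out = greedy_two_hop(EXAMPLE_DAG)
print(reachable(l_in, l_out, "a", "e"), reachable(l_in, l_out, "b", "c"))
```

The sketch materializes the full transitive closure and rescans it for every candidate, which is exactly the cost the optimizations above try to avoid; it is only meant to make the label semantics concrete on a toy graph.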
HOPI: Divide and Conquer
Framework of an algorithm:
• (I) Partition the graph such that the transitive closures of the partitions fit into memory and the weight of crossing edges is minimized
• (II) Compute the 2-hop cover for each partition
• (III) Combine the 2-hop covers of the partitions into the final cover

Final Results for Index Creation
• Transitive closure: 344,992,370 connections
• 2-hop cover: 9,999,052 entries
  • compression factor of ~34.5
  • queries are still OK (~59.2 entries/node)
  • build time is good (~23 minutes with 1 CPU and 1 GB RAM)

Min Set-Cover: Least Overlapping vs Min Cardinality [Cheng et al. EDBT'06]
• Tested with a DAG (|V| = 2,000, |E| = 4,000)

R-tree Approach [Cheng et al. EDBT'06]
• Do not generate the transitive closure
• Map the 2-hop problem onto a two-dimensional grid, and
• Compute the 2-hop labeling using operations on rectangles with the help of an R-tree

An Example
Utilizing Interval-Based Labeling
Interval-Based Codes
Reachability Map
A Reachability Map Example
An R-Tree Based Algorithm
Selecting Bipartite Graphs
Hierarchical Partitioning [Cheng et al. EDBT'08]

Reachability Queries
• I/O Cost Minimization: Reachability Queries Processing over Massive Graphs
• Scaling Reachability Computation on Large Graphs
• The Exact Distance to Destination in Undirected World
• Label-constraint Reachability Queries
• K-Reach

Keyword Search in RDB

Find Information from Relational Databases
• RDBs are structured, with rich schema information
• Complex schema
• SQL query language
  • long learning curve

    select p.title
    from conference c, paper p, author a1, author a2, write w1, write w2
    where c.cid = p.cid AND p.pid = w1.pid AND p.pid = w2.pid
      AND w1.aid = a1.aid AND w2.aid = a2.aid
      AND a1.name = 'John' AND a2.name = 'John' AND c.name = 'SIGMOD'

Keyword Search in RDB
• Given a relational database, RDB, consider the RDB as a directed graph G(V, E).
• Given a set of keywords, find interconnected tuple structures.
  • Connected trees
  • Minimal Total Joining Networks of Tuples (MTJNT)

A Simple RDB

Author
AID   Name
a1    Jim
a2    Kate

Paper
PID   Title
p1    Database System
p2    Algorithm design
p3    Algorithm Analysis
p4    Database Schema
p5    Query Processing

Write
WID   AID   PID
w1    a1    p5
w2    a1    p2
w3    a1    p1
w4    a2    p1
w5    a2    p3

Cite
CID   PD1   PD2
c1    p1    p3
c2    p5    p4
c3    p5    p3

[Figure: the tuple graph over the cite tuples c1 to c3, the paper tuples p1 to p5, the write tuples w1 to w5, and the author tuples a1 and a2.]

Trees (Jim Database Algorithm)
[Figure: the connected trees in the tuple graph that contain the keywords Jim, Database, and Algorithm, shown next to the tables above.]

MTJNT (Jim Database Algorithm)
[Figure: the minimal total joining networks of tuples for the keywords Jim, Database, and Algorithm.]

Two Basic Approaches
• Take the RDB as it is.
  • Generate a set of Candidate Networks (CNs)
  • Evaluate each CN using SQL
• Take the RDB as a graph by materialization.
  • Design a graph algorithm to find all answers

Finding Answer Trees
• Intuition: travel backwards from the keyword nodes until you hit a common node
• Query: sudarshan roy
[Figure: the paper node "MultiQuery Optimization" connected through writes nodes to the author nodes Sudarshan and Prasan Roy.]

Backward Search: Algorithm
• Run concurrent single-source shortest-path iterators from each node matching a keyword
  • Traverse the graph edges in reverse direction
  • Output the next nearest node on each get-next() call
• Do best-first search across the iterators
• Output a node if it is in the intersection of the sets of nodes reached from each keyword

Backward Search: Limitations
• Wasteful exploration of the graph:
  • Frequently occurring keywords
  • "Hub" nodes in the graph (high in-degree)
[Figure: query "Shashank Sudarshan Database" on a graph with many Database paper nodes; schema legend: author, writes, paper.]

Bidirectional Search: Motivation

Bidir Search: Intuition
• First-cut solution:
  • Don't go backward if a keyword matches many nodes
  • Don't go backward if a node points to a hub
  • Instead explore forward from the other keywords

Bidir Search: Example
[Figure: query "Shashank Sudarshan Database" on the same graph; schema legend: author, writes, paper.]

Top-K Minimum Group Steiner Tree
A Parameterized Approach
Dynamic Programming [Ding et al. ICDE'07]
A Naïve Approach
Dynamic Programming Equation
The Order to Compute T(v,p)

Finding Maximal Cliques in Massive Networks by H*-graph
James Cheng, Yiping Ke, Ada Wai-Chee Fu, Jeffrey Xu Yu, Linhong Zhu (SIGMOD'10)

Maximal Clique Enumeration (MCE)
• A long-standing problem
  • Find all maximal cliques in a given graph
• Applications in graph theory and other areas
  • Maximal common induced subgraphs, maximal common edge subgraphs, maximal independent set, …
  • Social networks, clustering in dynamic networks, detecting patterns in terrorist networks, computational biology, financial network analysis, …

Why This Problem?
• All existing algorithms are in-memory
  • The best algorithm requires memory linear in the graph size, on the order of |V| + |E|
• Real-world networks become exceedingly large and keep growing
  • The Web: over 1 trillion webpages (Google)
  • Social networks: 400 million users (Facebook)
  • Citation graphs: 1.3+ million publications (DBLP)
• Existing algorithms fail to handle large networks

Our Contributions
• We propose the first external-memory MCE algorithm
• A recursive approach. At each recursive step:
  • Pick a subgraph g from the large given graph G
  • Find all max-cliques related to g
  • Remove g from G
• Two questions
  • Which subgraph should we choose at each step?
  • How to ensure soundness and completeness?

What is H?
• h-index
  • The maximum h for a scientist who has h publications with at least h citations each
• "h-index" for a graph
  • The maximum h for a graph G that has h vertices with degree of at least h
  • H = {the h vertices}
[Figure: example graph on the vertices a, b, c, d, e, w, x, y, s, r, z, q, t with a table of vertex degrees.]
• Degrees of the top vertices: c (7), b (6), a (5), d (5), e (5); all remaining vertices have degree at most 4.
• h = 5 (5 vertices with deg ≥ 5)
• H = {c, b, a, d, e}

H-graph and H+-graph
• H-graph: subgraph of G induced by H
• Hnb: the neighbors of the h-vertices in G
• H+-graph: subgraph of G induced by (H ∪ Hnb)
  • The + means the extension from H to its neighbors
[Figure: the H-graph nested inside the H+-graph on the example graph.]

H*-graph
• H*-graph: a graph that "lies" between the H-graph and the H+-graph
  • Excludes the edges among h-neighbors from the H+-graph
  • Contains the edges incident on at least one h-vertex
[Figure: the H-graph, H*-graph, and H+-graph on the example graph.]

H*-graph
• We use the H*-graph as the subgraph of G for computing local max-cliques in the recursive steps
• Why H*-graph?
  • The H*-graph gives the core of G as well as the connection from the core to other parts of G
    • Provides rich information for MCE
  • The H*-graph is only a small portion of G
    • Small enough to be kept in memory

Size of H*-graph
• Most real networks are scale-free
  • The degree distribution follows a power law
  • R: exponent in the power law (constant, normally between -0.8 and -0.7)
  • n: |V(G)|
• We obtain theoretical bounds on the size of the H*-graph for scale-free networks

H*-graph vs. H-graph and H+-graph
• Size of H-graph
• Size of H+-graph
• For a network G with R = -0.7 and n = 1M
  • H*-graph is within [12%, 15%] of G
  • H-graph is at most 4% of G: too small!
  • H+-graph can be as large as 65% of G: too large!

ExtMCE
• Recursive step
  • Extract the H*-graph from G (by one scan of G)
  • Compute local max-cliques (by an existing in-memory algorithm)
  • Extend local max-cliques to global ones
  • Remove the H*-graph from G
• Repeat until G becomes empty

Neighborhood-Privacy Protected Shortest Distance Computing in Cloud
This is joint work by Jun Gao, Jeffrey Xu Yu, Ruoming Jin, Jiashuai Zhou, Tengjiao Wang, Dongqing Yang (SIGMOD'11)

Graph data management in cloud
• Graph data applications
  • Social network, knowledge network, ...
• Time-consuming graph operations
  • Shortest distance computing takes O(n^2)
  • Breadth-first search requires O(n+m)
  • ......
• Cloud computing: advantages
  • High computational power
  • Easy maintenance
  • Easy re-provisioning of resources
  • ……
• Can we use the cloud server to manage graph data, for example to answer shortest distance queries?

Security issues in graph outsourcing
• Attacks on the outsourced graph
  • Structural pattern attack: use a sub-graph to re-identify the target part
  • Reconstruction attack: recover the original graph from the outsourced graph
• Security leakage
• Regulation of sensitive data violated
• Untrusted answers produced by the cloud server
• We have to strike a balance between security and the computational cost saved by using the cloud server

Framework of graph outsourcing
[Figure: the client side holds the original graph and the link graph and performs graph transformation, query rewriting, and result combination; the cloud server holds the outsourced graph and performs query evaluation, returning query results.]
• (1) A reasonable security model on the outsourced graph
• (2) An efficient method to transform the original graph into the outsourced graph
• (3) An approach to rewrite the query and combine the results

Structural Anonymization
• Structural anonymization in publishing
  • 1-neighborhood [ICDE'08], k-degree [SIGMOD'08], k-automorphism [VLDB'09], k-isomorphism [SIGMOD'10], etc.
  • Using the least amount of modification of the original graph
[Figure: an original weighted graph, its 4-isomorphism version in which the attacker's query finds 4 sub-graphs, and the attacker's query pattern on nodes w, x, y, z.]
• No shortest distance preservation
• No consideration of edge weights

1-Neighborhood-d-Radius Graph
• Intuition: protect the neighborhood information and the close relationships between nodes.
[Figure: an original weighted graph, its 2-radius graph on the nodes b, f, h, i, o, v, and the attacker's query pattern on nodes w, x, y, z.]
• (1-neighborhood): for any node pair u and v ∈ Vo, (u, v) ∉ E
• (d-radius): for any node pair u and v ∈ Vo, δG(u, v) ≥ d.
• Privacy protection: an attacker cannot find any meaningful results for any query pattern

Utilization: Shortest Distance Computation
[Figure: the outsourced graph on the nodes b, f, h, i, o, v, the link graph, and the original graph.]
• Given a node pair u and v, the shortest distance can be recovered as
  δG(u, v) = min over x, y of ( w(u, x) + δGo(x, y) + w(y, v) )

Graph Transformation Problem
Given a graph G = (V, E) and d, the graph transformation produces outsourced graphs Go = {G1, ..., Gj} and a local link graph Gl, which achieve the following objectives:
• Security: each outsourced graph is a 1-neighborhood-d-radius graph;
• Utility: the union of Go and Gl can answer the shortest distance in the original graph;
• Local computational cost: the space cost of Gl and the cost of the shortest distance computation on the client side are minimized.

Diversified Ranking
• Scalable Diversified Ranking on Large Graphs, Rong-Hua Li and Jeffrey Xu Yu, ICDM'11
• Diversifying Top-K Results, Lu Qin, Jeffrey Xu Yu, Lijun Chang, VLDB'12

Scalable Diversified Ranking (1)
• Relevance of the top-K nodes (denoted by a set S) is achieved by large Personalized PageRank scores.
• Diversity of the top-K nodes is achieved by a large expansion ratio.
  • Expansion ratio of S: σ(S) = |N(S)| / n
  • A larger expansion ratio implies more diversity
[Figure: two example graphs marking the nodes in S and the nodes expanded by S, with σ(S) = 0.6 and σ(S) = 0.9.]

Scalable Diversified Ranking (2)
• The k-step expansion ratio of S: σk(S) = |Nk(S)| / n
• A larger expansion ratio implies more diversity
[Figure: the same two example graphs marking the nodes in S and the nodes expanded by S, with σ2(S) = 0.8 and σ2(S) = 1.]

Graph Clustering Based on Structural and Attribute Similarities
A desired clustering of an attributed graph should achieve a good balance between the following two properties.
• Structural cohesiveness: vertices within one cluster are close to each other in terms of structure, while vertices in different clusters are distant from each other.
• Attribute homogeneity: vertices within one cluster have similar attribute values, while vertices in different clusters have quite different attribute values.

A Coauthor Network Example
[Figure: a coauthor graph of 11 authors labeled with research topics (r1, r2, r4, r5, r6, r7, r8: XML; r3: XML, Skyline; r9, r10, r11: Skyline), shown as the traditional coauthor graph, a structure-based clustering, an attribute-based clustering, and a structural/attribute clustering.]

Different Clustering Approaches on the Graph with Multiple Attributes
• Structure-based clustering
  • Vertices with heterogeneous values in a cluster
• Attribute-based clustering
  • Loses much structure information
• Structural/attribute clustering
  • Vertices with homogeneous values in a cluster
  • Keeps most structure information

Our Proposed Clustering Solution

Attribute Augmented Coauthor Graph
• We then use the neighborhood random walk distance on the augmented graph to combine structural and attribute similarities.

Thank You!