Discovery-Driven Graph Summarization
ICDE 2010
Ning Zhang, Yuanyuan Tian, Jignesh M. Patel
University of Wisconsin-Madison, IBM Almaden Research Center, USA

Presented by Sung Eun Park, 9/26/2010
Intelligent Database Systems Lab.
Center for E-Business Technology, Seoul National University, Seoul, Korea
School of Computer Science & Engineering, Seoul National University
Copyright 2010 by CEBT

Contents
- Introduction & Preliminaries
  - k-SNAP Summarization: Efficient Aggregation for Graph Summarization (SIGMOD'08)
- Categorization of Numerical Attributes
  - CANAL algorithm: Try to Merge Groups and Select Cutoffs
- Automatic Discovery of Interesting Summaries
  - Measuring Interestingness of Summaries
  - Automatic Discovery of Interesting Summaries
- Experimental Results
- Conclusion

Introduction
- Large graph datasets are ubiquitous.
- Graph summarization can help uncover useful insights about the patterns hidden in the underlying data.
- Proposed Graph Summarization Approach (previous work)

Preliminaries
Proposed Graph Summarization Approach (previous work)
- A summary graph is built by grouping nodes based on user-selected node attributes and relationships.

Author  Research Field  # of Publications  Coauthors
a1      DB              5 -> LP            a2, a10
a2      DB              10 -> MP           a1, a3
a3      AI              30 -> HP           a2, a6
…       …               …                  …

[Figure: example coauthorship graph of authors a1 to a6, each labeled with a research field (DB, AI, OS) and a publication count mapped to LP, MP, or HP]

Preliminaries
Proposed Graph Summarization Approach (previous work)
- SNAP: the nodes of each group are homogeneous with respect to the user-selected attributes and relationships.
  - This may result in a large number of small groups; in the worst case, each node ends up in its own group.
- k-SNAP: users control the number of groups in the summary graph through the parameter k.
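
The attribute side of this grouping can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the rows mirror the example table, the author `a7` is a hypothetical addition so that one group has two members, and the LP/MP/HP labels are taken as given rather than derived from any cutoff. SNAP additionally requires homogeneity with respect to relationships, which this sketch does not cover.

```python
from collections import defaultdict

# Rows mirror the example table: (author, research field, category).
# "a7" is a hypothetical extra author, added only for illustration.
authors = [
    ("a1", "DB", "LP"),
    ("a2", "DB", "MP"),
    ("a3", "AI", "HP"),
    ("a7", "DB", "LP"),
]


def group_by_attributes(rows):
    """Group node ids by their (field, category) attribute values."""
    groups = defaultdict(list)
    for node, field, category in rows:
        groups[(field, category)].append(node)
    return dict(groups)


groups = group_by_attributes(authors)
for key, members in groups.items():
    print(key, members)
```

Each key of `groups` corresponds to one candidate node of the summary graph; a full SNAP implementation would split these groups further until the relationships are homogeneous as well.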
[Figures: k-SNAP top-down approach; k-SNAP summaries at different resolutions]

Preliminaries
k-SNAP
[Figure: example summary with strong and weak relationships between Low/Moderately/Highly Cited groups]
- Each link represents the participation rate:
  participation rate = (number of nodes participating in the relationship) / (number of nodes in both groups)

Preliminaries
Δ-measure
- Measures how different a summary is from a hypothetical "ideal summary".
- Given a graph G, the Δ-measure of a grouping of nodes Φ = {G1, G2, ..., Gk} is defined as follows: for every pair of groups, sum… the differences to the ideal summary.
- A small Δ value indicates a good summary.

Introduction
k-SNAP
- Produces summaries which are themselves graphs.
- Two limitations:
  1. k-SNAP uses categorical node attributes, but providing clear cutoffs for numerical attributes is tricky even for domain experts, and not always possible.
     → The paper introduces the CANAL algorithm, which automatically categorizes numerical attribute values based on both the attribute values and the link structure of the nodes in the graph.
  2. k-SNAP allows summaries at different resolutions, but users may have to go through a large number of summaries before they find interesting ones.
     → The paper proposes a measure to assess the interestingness of summaries.

Automatic Discovery of Interesting Summaries
CANAL algorithm
- Input: graph G, numerical node attribute a, desired number of categories C
- Intuition: find the cutoffs that increase the Δ value the most.
[Figure: initial groups G1, G2, G3, G4, …, Gk ordered along the numerical node attribute value; each group contains the nodes that share the same attribute value; the Δ increase is calculated between adjacent groups]
- Every iteration,
until there is only one group left:
  - Pick the adjacent pair of groups that has the most similar relationship pattern to the other groups.
  - Merge the pair and calculate the resulting increase in Δ.
  - Pick the C-1 cutoffs that have the largest Δ increases.

Automatic Discovery of Interesting Summaries
[Figures not recovered]

Experimental Setting
Datasets
- DBLP DB dataset: bibliography data
  - Coauthorship graph (undirected)
  - Node: author (7,445 nodes)
  - Edge: coauthorship (19,971 edges)
  - Attribute: number of publications
- CiteSeer dataset
  - Citation graph (directed)
  - Node: article
  - Edge: citation (directed)
  - Attribute: number of citations

Experimental Result
Effectiveness of the CANAL algorithm
- Cutoffs generated by CANAL vs. cutoffs manually selected in the previous work
- The cutoffs produced by CANAL result in Δ/k values that are very close to those of the manually selected cutoffs.
[Figure: Δ measure; a good summary has a small value]

Experimental Result
Efficiency of the CANAL algorithm
- The execution time of the CANAL algorithm scales well with increasing data size.
  - C = 3. Different C values do not significantly affect the run time of the CANAL algorithm, because all cutoff candidates are considered regardless of C.
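
The Δ/k values compared above build on the participation rate and the Δ-measure from the preliminaries. The following Python sketch shows one way to compute both. It is a hedged illustration, not the authors' implementation: the 0.5 threshold and the per-pair difference follow the definition as recalled from the earlier k-SNAP paper, which these slides elide, and the function names, the adjacency-dict representation, and the toy graph are all assumptions. Each ordered pair of groups contributes one term here, so the absolute scale of Δ may differ from the paper by a constant factor.

```python
def participants(group, other, adj):
    """Nodes in `group` that have at least one edge to a node in `other`."""
    return {u for u in group if adj.get(u, set()) & other}


def participation_rate(gi, gj, adj):
    """(nodes participating in the relationship) / (nodes in both groups)."""
    involved = len(participants(gi, gj, adj)) + len(participants(gj, gi, adj))
    return involved / (len(gi) + len(gj))


def delta(groups, adj):
    """Sum, over ordered group pairs, of the distance to an ideal summary.

    Assumed rule: a weak relationship (rate <= 0.5) ideally has no
    participants, a strong one ideally has every node participating.
    """
    total = 0
    for gi in groups:
        for gj in groups:
            rate = participation_rate(gi, gj, adj)
            n = len(participants(gi, gj, adj))
            total += n if rate <= 0.5 else len(gi) - n
    return total


# Hypothetical toy graph: undirected edges a-c and b-c; d is isolated.
adj = {"a": {"c"}, "b": {"c"}, "c": {"a", "b"}, "d": set()}
g1, g2 = {"a", "b"}, {"c", "d"}

rate_12 = participation_rate(g1, g2, adj)
delta_value = delta([g1, g2], adj)
print(rate_12, delta_value)
```

In the toy graph, three of the four nodes take part in the relationship between the two groups, so it counts as strong, and the single non-participating node `d` is the only deviation from the ideal summary.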

Experimental Result
Effectiveness of the Interestingness Measure
[Figure: interestingness of summaries]

Experimental Result
Two types of interesting summaries
Overall Summary
- Conciseness: k is relatively small (k=4)
- Coverage: ↑ with the nodes participating in the strong relationship
- Diversity: ↑ by as much as the LP group of size 2680

Experimental Result
Two types of interesting summaries
Informative Summary
[Figure annotations: "Only strongly collaborate with HP"; "Only shows weak relationships"]
- Conciseness: a little bigger than in the "overall summary"
- Coverage: ↑ by as much as the new LP group of size 1261
- Diversity: ↑ by as much as the new LP group of size 1261

Conclusion
- The paper overcomes two key limitations of the previous work:
  - It introduces the CANAL algorithm, which automatically categorizes numerical attribute values based on both the attribute values and the link structure of the nodes in the graph.
  - It proposes a measure to assess the interestingness of summaries.

Q&A
Thank you