Discovery-Driven Graph Summarization

ICDE 2010
Ning Zhang , Yuanyuan Tian , Jignesh M. Patel
University of Wisconsin-Madison, IBM Almaden Research Center, USA
Presented by Sung Eun, Park
9/26/2010
Intelligent Database Systems Lab.
Center for E-Business Technology
Seoul National University
Seoul, Korea
School of Computer Science & Engineering
Seoul National University
Contents
 Introduction & Preliminaries
 k-SNAP Summarization
– Efficient Aggregation for Graph Summarization (SIGMOD'08)
 Categorization of Numerical Attributes
 CANAL algorithm
– Try to Merge Groups and Select Cutoffs
 Automatic Discovery of Interesting Summaries
 Measuring Interestingness of Summaries
 Automatic Discovery of Interesting Summaries
 Experimental Results
 Conclusion
Copyright 2010 by CEBT
Introduction
 Large graph datasets are ubiquitous!
 Graph summarization can assist in uncovering useful insights about the patterns hidden in the underlying data
Preliminaries
 Proposed Graph Summarization Approach (previous work)
 Builds a summary graph by grouping nodes based on user-selected node attributes and relationships
Author | Research Field | # of Publications | Coauthors
a1     | DB             | 5 -> LP           | a2, a10
a2     | DB             | 10 -> MP          | a1, a3
a3     | AI             | 30 -> HP          | a2, a6
…
<Example graph: authors a1 to a6, each labeled with a research field (DB, AI, OS) and a publication count mapped to a category (LP, MP, HP)>
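The grouping step in the table can be sketched in a few lines of Python. This is only an illustration: the cutoffs 9 and 20 are hypothetical values chosen so that the counts shown on the slide (5 -> LP, 10 -> MP, 30 -> HP) land in the right category, and the function names are made up for this sketch.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical cutoffs (not given on the slide): counts <= 9 are LP,
# counts <= 20 are MP, everything above is HP.
CUTOFFS = [9, 20]
LABELS = ["LP", "MP", "HP"]

def categorize(count, cutoffs=CUTOFFS, labels=LABELS):
    """Map a publication count to its category label."""
    return labels[bisect_left(cutoffs, count)]

def group_by_attributes(authors):
    """Group authors that share the same (research field, category) pair."""
    groups = defaultdict(set)
    for name, (field, count) in authors.items():
        groups[(field, categorize(count))].add(name)
    return dict(groups)

# The three rows visible in the table.
authors = {"a1": ("DB", 5), "a2": ("DB", 10), "a3": ("AI", 30)}
print(group_by_attributes(authors))
```

Each key of the result is one candidate node of the summary graph; the relationships between groups are then derived from the coauthor edges.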
Copyright 2010 by CEBT
4
Preliminaries
 Proposed Graph Summarization Approach (previous work)
 SNAP: the nodes of each group are homogeneous with respect to user-selected attributes and relationships.
– May result in a large number of small groups; in the worst case, each node ends up as its own group.
 k-SNAP: users control the number of groups in the summary graph by setting k.
<k-SNAP: top-down approach>
<k-SNAP: different resolutions>
Preliminaries
 k-SNAP
<Figure legend: strong relationship, weak relationship>
 Groups: Low/Moderately/Highly Cited
 Each link represents the participation rate:
$$\text{participation rate} = \frac{\text{number of nodes participating in the relationship}}{\text{number of nodes in both groups}}$$
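As a minimal sketch of the participation rate, assuming an undirected graph given as a list of node pairs and groups given as Python sets (the function name and data layout are illustrative choices, not from the paper):

```python
def participation_rate(group_a, group_b, edges):
    """Fraction of the nodes in both groups that take part in the a-b relationship."""
    in_a, in_b = set(), set()
    for u, v in edges:  # undirected: check both orientations
        if u in group_a and v in group_b:
            in_a.add(u)
            in_b.add(v)
        if v in group_a and u in group_b:
            in_a.add(v)
            in_b.add(u)
    total = len(group_a) + len(group_b)
    return (len(in_a) + len(in_b)) / total if total else 0.0

group_a = {"a1", "a2"}
group_b = {"a3", "a4", "a5"}
# Only the edge (a2, a3) crosses the two groups.
edges = [("a2", "a3"), ("a1", "a2"), ("a4", "a5")]
print(participation_rate(group_a, group_b, edges))  # 0.4
```

Two of the five nodes (a2 and a3) participate, so the rate is 2/5.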
Copyright 2010 by CEBT
6
Preliminaries
 Δ-measure
 Measures how different a summary is from a hypothetical “ideal summary”
 Given a graph G, the Δ-measure of a grouping of nodes Φ = {G1, G2, ..., Gk} sums, over every pair of groups, the differences to the ideal summary
 A small Δ value indicates a good summary
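The slide only says that Δ sums, for every pair of groups, the differences to the ideal summary. The sketch below assumes one concrete reading of that difference: in an ideal summary a relationship is all-or-nothing, so for a strong relationship (participation rate above 0.5, an assumed threshold) the non-participating nodes count as differences, and for a weak one the participating nodes do. Each group's relationship with itself is included. All names are illustrative.

```python
from itertools import combinations_with_replacement

def participants(src, dst, edges):
    """Nodes of src with at least one (undirected) edge to a node of dst."""
    found = set()
    for u, v in edges:
        if u in src and v in dst:
            found.add(u)
        if v in src and u in dst:
            found.add(v)
    return found

def delta_measure(groups, edges):
    """Assumed Δ: count the nodes that deviate from an all-or-nothing relationship.

    groups must be non-empty, pairwise disjoint sets of nodes.
    """
    total = 0
    for a, b in combinations_with_replacement(groups, 2):
        p_a = participants(a, b, edges)
        p_b = participants(b, a, edges)
        rate = (len(p_a) + len(p_b)) / (len(a) + len(b))
        # A group paired with itself contributes only once.
        sides = [(a, p_a)] if a is b else [(a, p_a), (b, p_b)]
        for group, part in sides:
            total += (len(group) - len(part)) if rate > 0.5 else len(part)
    return total

groups = [{"a1", "a2"}, {"a3", "a4"}]
print(delta_measure(groups, [("a1", "a3"), ("a2", "a4")]))  # 0: every node participates
print(delta_measure(groups, [("a1", "a3"), ("a2", "a3")]))  # 1: a4 does not participate
```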
Copyright 2010 by CEBT
7
Introduction
 k-SNAP
 Produces summaries that are themselves graphs
 Two limitations
1. k-SNAP works on categorical node attributes. Numerical attributes must first be cut into categories, and even for domain experts, providing clear cutoffs is tricky and not always possible.
→ The paper introduces the CANAL algorithm, which automatically categorizes numerical attribute values based on both the attribute values and the link structure of the nodes in the graph.
2. k-SNAP allows summaries at different resolutions, but users may have to go through a large number of summaries before finding interesting ones.
→ The paper proposes a measure to assess the interestingness of summaries.
Copyright 2010 by CEBT
8
Categorization of Numerical Attributes

CANAL algorithm
 Input: graph G, numerical node attribute a, desired number of categories C
 Intuition: find the cutoffs that increase the Δ value the most when the groups on either side are merged
<Figure: groups G1, G2, G3, G4, ..., Gk ordered by numerical attribute value; each initial group holds the nodes that share the same attribute value, and the Δ increase is calculated for each merge>
 In every iteration, until only one group is left:
– Pick the adjacent pair of groups with the most similar relationship patterns to the other groups
– Merge the pair and calculate the resulting increase in Δ
 Finally, pick the C-1 cutoffs with the biggest Δ increases
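The merge loop above can be sketched as follows. This is a simplified illustration, not the paper's implementation: it handles a single numerical attribute on an undirected graph, the Δ function uses an assumed all-or-nothing definition with a 0.5 threshold, and `pattern_distance` is a hypothetical similarity measure (summed gap between the two groups' participation rates to every other group).

```python
from itertools import combinations_with_replacement

def participants(src, dst, edges):
    """Nodes of src with at least one (undirected) edge into dst."""
    return ({u for u, v in edges if u in src and v in dst}
            | {v for u, v in edges if v in src and u in dst})

def rate(a, b, edges):
    """Participation rate between two non-empty groups."""
    return (len(participants(a, b, edges)) + len(participants(b, a, edges))) / (len(a) + len(b))

def delta(groups, edges):
    """Assumed Δ: nodes that deviate from an all-or-nothing relationship."""
    total = 0
    for a, b in combinations_with_replacement(groups, 2):
        strong = rate(a, b, edges) > 0.5  # assumed threshold
        for src, dst in ([(a, b)] if a is b else [(a, b), (b, a)]):
            p = len(participants(src, dst, edges))
            total += (len(src) - p) if strong else p
    return total

def pattern_distance(i, groups, edges):
    """Hypothetical dissimilarity of adjacent groups i and i+1."""
    a, b = groups[i], groups[i + 1]
    others = [g for k, g in enumerate(groups) if k not in (i, i + 1)]
    return sum(abs(rate(a, g, edges) - rate(b, g, edges)) for g in others)

def canal_cutoffs(values, edges, num_categories):
    """values maps node -> numerical attribute; edges is a list of node pairs.

    Returns num_categories - 1 cutoffs in ascending order; a value <= cutoff
    falls into the lower category.
    """
    distinct = sorted(set(values.values()))
    # One initial group per distinct attribute value, ordered by value.
    groups = [{n for n, x in values.items() if x == v} for v in distinct]
    upper = list(distinct)  # upper[i]: largest attribute value in groups[i]
    candidates = []         # (Δ increase, cutoff removed by the merge)
    while len(groups) > 1:
        i = min(range(len(groups) - 1), key=lambda k: pattern_distance(k, groups, edges))
        before = delta(groups, edges)
        cutoff = upper[i]
        groups[i:i + 2] = [groups[i] | groups[i + 1]]
        upper[i:i + 2] = [upper[i + 1]]
        candidates.append((delta(groups, edges) - before, cutoff))
    best = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_categories - 1]
    return sorted(cutoff for _, cutoff in best)

# Toy data: only the two high-valued nodes are linked to each other.
values = {"a": 1, "b": 2, "c": 10, "d": 11}
edges = [("c", "d")]
print(canal_cutoffs(values, edges, 2))  # [2]
```

In the toy data, merging {a, b} with {c, d} is the only merge that raises Δ, so the boundary between the values 2 and 10 is kept as the single cutoff.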
Copyright 2010 by CEBT
9
Automatic Discovery of Interesting Summaries
Experimental Setting
 Datasets
 DBLP DB dataset: bibliography data
– Coauthorship graph (undirected)
– Node: author (7,445 nodes)
– Edge: coauthorship (19,971 edges)
– Attribute: number of publications
 CiteSeer dataset
– Citation graph (directed)
– Node: article
– Edge: directed citation
– Attribute: number of citations
Experimental Result
 Effectiveness of the CANAL algorithm
 Compares cutoffs generated by CANAL with cutoffs manually selected in the previous work
 The cutoffs produced by CANAL result in Δ/k values that are very close to those of the manually selected cutoffs.
<Figure: Δ measure; a good summary has a small value>
Experimental Result
 Efficiency of the CANAL algorithm
 The execution time of the CANAL algorithm scales well with increasing data size
– C = 3 in this experiment. Different C values do not significantly affect the run time of the CANAL algorithm, because all cutoff candidates are considered regardless of C.
Experimental Result
 Effectiveness of the Interestingness Measure

Interestingness of summaries
Copyright 2010 by CEBT
Experimental Result
 Two types of interesting summaries
Overall Summary
– Conciseness: relatively small, k = 4
– Coverage: the nodes participating in the strong relationship ↑ & -
– Diversity: ↑ as much as the LP group of size 2680 & -
Experimental Result
 Two types of interesting summaries
Informative Summary
<Figure annotations: "Only strongly collaborate with HP"; "Only shows weak relationships">
– Conciseness: a little larger than in the “overall summary”
– Coverage: ↑ as much as the new LP group of size 1261, -
– Diversity: ↑ as much as the new LP group of size 1261, -
Conclusion
 Overcomes two key limitations of the previous work
 Introduces the CANAL algorithm, which automatically categorizes numerical attribute values based on both the attribute values and the link structure of the nodes in the graph
 Proposes a measure to assess the interestingness of summaries
Q&A
Thank you