Evolutionary Clustering and
Analysis of Bibliographic Networks
Manish Gupta (UIUC)
Charu C. Aggarwal (IBM)
Jiawei Han (UIUC)
Yizhou Sun (UIUC)
ASONAM 2011
Introduction
• Information networks are everywhere: social
networks, web, academic networks, biological
networks.
• Heterogeneous information networks
– Contain multi-typed nodes.
– Richer representation compared to homogeneous
networks.
• We study clustering and evolution diagnosis in
massive heterogeneous information networks.
Contributions
• We present an evolutionary clustering
algorithm for heterogeneous information
networks (ENetClus)
• We define metrics to characterize clustering
behavior
• We perform a study of evolution in a
bibliographic heterogeneous network: DBLP
ENetClus features
• Multi-typed
• Evolutionary
• Temporal smoothness
• Agglomerative
• Multiple granularities
• Based on NetClus
Study over DBLP
Evolution metrics
• Consistency
• Quality
• Cluster Sizes
• Evolution rate
• Cluster appearance/disappearance
• Stability of objects
• Sociability of objects
• Social influence
Problem Formulation
• Net-Cluster
• Net-Cluster tree
• Net-Cluster tree sequence
• Problem: Given a graph sequence GS, generate a net-cluster tree sequence CTS such
that the trees are consistent and represent high-quality clusters.
[Figure: a net-cluster tree with K=3 and Levels 1 to 3 of net-clusters (nc); the sequence CTS consists of trees CT1, CT2, ..., CTN.]
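The three structures in the problem formulation nest inside one another, which a small data-structure sketch can make concrete. This is only an illustration: the class name `NetCluster`, its fields `members` and `children`, and the helper `build_full_tree` are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NetCluster:
    """One net-cluster: objects of every type with membership probabilities."""
    # Hypothetical layout: object type -> {object id -> membership probability}.
    members: Dict[str, Dict[str, float]] = field(default_factory=dict)
    # Child net-clusters at the next (finer) level of the tree.
    children: List["NetCluster"] = field(default_factory=list)

    def depth(self) -> int:
        """Number of levels in the subtree rooted at this net-cluster."""
        if not self.children:
            return 1
        return 1 + max(child.depth() for child in self.children)


def build_full_tree(levels: int, k: int) -> NetCluster:
    """Build an empty net-cluster tree with `levels` levels and branching factor `k`."""
    node = NetCluster()
    if levels > 1:
        node.children = [build_full_tree(levels - 1, k) for _ in range(k)]
    return node


# A net-cluster tree sequence CTS holds one tree per snapshot of the graph sequence GS.
cts: List[NetCluster] = [build_full_tree(levels=3, k=3) for _ in range(4)]
```

Here `cts` plays the role of CTS for a four-snapshot sequence; a real implementation would fill `members` from the clustering output rather than leave the trees empty.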
Approaches
• Problem: Perform evolutionary clustering over a
sequence of heterogeneous network snapshots
• Approaches
– Use homogeneous clustering techniques
• Does not exploit rich typed information in network
• Objects related to same entity may get clustered into
different clusters.
– Use some heterogeneous network clustering
algorithm
• May provide high snapshot clustering quality
• But may not provide good consistency between clusterings
across snapshots
NetClus
• NetClus is an algorithm to perform clustering
over a heterogeneous network.
• It iteratively ranks and clusters
objects.
• A probabilistic generative model is used to model
the probability of generation of different objects
from each cluster.
• A maximum likelihood technique is used to
evaluate the posterior probability of presence of
an object in a cluster.
NetClus
• Priors: Initialize prior probabilities $\{P(o \mid c_k)\}_{k=1}^{K}$
• Initialize: Generate initial net-clusters $\{c_k^0\}_{k=1}^{K}$
• Rank: Build probabilistic generative model for each net-cluster, i.e., $\{P(o \mid c_k^t)\}_{k=1}^{K}$
• Cluster-target: Compute $p(c_k^t \mid o)$ for target objects and
adjust their cluster assignments.
• Iterate: Repeat steps 3 and 4 until the clusters don’t
change significantly.
• Cluster-attribute: Calculate $p(c_k^* \mid o)$ for each attribute
object in each net-cluster.
• Return $p(c_k^* \mid o)$
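The control flow of the steps above can be sketched in a few lines of Python. This is a toy variant, not the published NetClus: it replaces authority-based ranking with smoothed link counts, assigns each target object to its most probable cluster, and uses hypothetical names (`toy_netclus`, `links`, `init`, `alpha`) that do not appear in the slides.

```python
import math
from collections import Counter, defaultdict


def toy_netclus(links, k, init, max_iter=20, alpha=1.0):
    """Toy ranking-based clustering loop (NOT the published ranking functions).

    links: {target object: [attribute objects]}
    init:  {target object: initial cluster index}  (the Initialize step)
    alpha: Laplace smoothing constant (an assumption of this sketch)
    """
    vocab = {a for attrs in links.values() for a in attrs}
    assign = dict(init)
    post = {}
    for _ in range(max_iter):
        # Rank: estimate P(o | c_k) from smoothed link counts in each cluster.
        counts = [Counter() for _ in range(k)]
        for t, attrs in links.items():
            counts[assign[t]].update(attrs)
        totals = [sum(c.values()) + alpha * len(vocab) for c in counts]
        sizes = Counter(assign.values())

        # Cluster-target: compute p(c_k | o) for each target object in log space.
        new_assign = {}
        for t, attrs in links.items():
            logp = []
            for c in range(k):
                lp = math.log((sizes[c] + alpha) / (len(links) + alpha * k))
                for a in attrs:
                    lp += math.log((counts[c][a] + alpha) / totals[c])
                logp.append(lp)
            m = max(logp)
            w = [math.exp(x - m) for x in logp]
            z = sum(w)
            post[t] = [x / z for x in w]
            new_assign[t] = max(range(k), key=lambda c: post[t][c])

        # Iterate: stop once no target object changes cluster.
        if new_assign == assign:
            break
        assign = new_assign

    # Cluster-attribute: average the posteriors of the linked target objects.
    attr_post = defaultdict(lambda: [0.0] * k)
    degree = Counter()
    for t, attrs in links.items():
        for a in attrs:
            degree[a] += 1
            for c in range(k):
                attr_post[a][c] += post[t][c]
    attr_post = {a: [v / degree[a] for v in p] for a, p in attr_post.items()}
    return assign, attr_post


# Made-up example: four papers linked to an author, a title term and a venue.
links = {
    "p1": ["han", "mining", "KDD"],
    "p2": ["han", "clustering", "KDD"],
    "p3": ["gray", "transaction", "VLDB"],
    "p4": ["gray", "transaction", "VLDB"],
}
init = {"p1": 0, "p2": 0, "p3": 1, "p4": 0}  # p4 starts in the wrong cluster
assign, attr_post = toy_netclus(links, k=2, init=init)
```

In this example the misplaced paper `p4` moves to the cluster of `p3` after one pass, and the loop stops on the second pass because no assignment changes.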
ENetClus
• For the first time instant, initialization of priors and
net-clusters is similar to NetClus
• For other time instants
– The prior probability of an object o belonging to cluster ck is
defined as its representativeness in the corresponding cluster
within the net-cluster tree for the previous time instant.
– A target object o is assigned to cluster ck with probability pk
where pk is the normalized sum of the prior probabilities of
neighboring attribute type objects.
• Ranking is similar to NetClus except that prior probabilities
are also used along with the authority based ranking. Prior
weight controls the effect of priors and hence the temporal
smoothness.
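The assignment rule for target objects can be written out directly. The sketch below assumes priors are stored as one list of K values per attribute object. The function and parameter names are hypothetical, and the uniform fallback and the convex-combination blend in `smoothed_rank` are assumptions, since the slides do not give the exact formulas.

```python
def target_assignment_probs(neighbors, prior, k):
    """p_k for one target object: normalized sum of its neighbors' priors.

    neighbors: attribute objects linked to the target object
    prior:     {attribute object: [prior for c_1, ..., prior for c_K]}, taken
               from the net-cluster tree of the previous time instant
    """
    sums = [0.0] * k
    for o in neighbors:
        # Objects unseen at the previous instant contribute no prior mass.
        for c, p in enumerate(prior.get(o, [0.0] * k)):
            sums[c] += p
    total = sum(sums)
    if total == 0.0:
        # Assumption: fall back to a uniform assignment when no prior exists.
        return [1.0 / k] * k
    return [s / total for s in sums]


def smoothed_rank(authority, prior, prior_weight):
    """Blend authority-based rank with the prior.

    Assumption: a convex combination; prior_weight = 0 ignores the priors.
    """
    return [(1 - prior_weight) * a + prior_weight * p
            for a, p in zip(authority, prior)]


# Made-up priors for two attribute objects; "newterm" is new at this instant.
prior = {"han": [0.9, 0.1], "KDD": [0.7, 0.3]}
pk = target_assignment_probs(["han", "KDD", "newterm"], prior, k=2)
blended = smoothed_rank([0.2, 0.8], [0.6, 0.4], prior_weight=0.5)
```

With these inputs `pk` comes out close to `[0.8, 0.2]`: the paper inherits the cluster preference of its known author and venue, which is what ties consecutive snapshots together.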
How is ENetClus better than NetClus?
[Figure: two rows of clusterings over Snapshot1, Snapshot2 and Snapshot3. NetClus: inconsistent clusters. ENetClus: consistent clusters.]
Metrics
• Membership probability of object o to cluster ci is
denoted by
• Consistency:
• Chained path consistency: product of consistency over
each interval in the sequence
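Chained path consistency is a plain product, so one low-consistency interval pulls the whole value down. A minimal sketch, assuming the per-interval consistency values have already been computed (the function name is hypothetical and the numbers are made up):

```python
import math


def chained_path_consistency(interval_consistencies):
    """Product of the consistency values of consecutive snapshot intervals."""
    return math.prod(interval_consistencies)


# Three intervals between four snapshots; values are illustrative only.
chained = chained_path_consistency([0.9, 0.8, 0.5])
```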
Metrics
• Snapshot Quality
– Compactness
– Entropy
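The slides name entropy as a quality measure without giving its formula here. One plausible reading, stated as an assumption, is the Shannon entropy of each object's cluster-membership distribution, averaged over objects, so that lower values mean crisper clusters. The function names below are hypothetical.

```python
import math


def membership_entropy(probs):
    """Shannon entropy (in bits) of one object's membership distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def average_entropy(memberships):
    """Mean entropy over all objects; `memberships` maps object -> probabilities."""
    return sum(membership_entropy(p) for p in memberships.values()) / len(memberships)


# Made-up memberships over 4 clusters: one crisp object, one fully uncertain.
memberships = {
    "crisp": [1.0, 0.0, 0.0, 0.0],
    "uncertain": [0.25, 0.25, 0.25, 0.25],
}
avg = average_entropy(memberships)
```

Under this reading a crisp object contributes 0 bits and a fully uncertain object over four clusters contributes 2 bits.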
Metrics
O’: Objects at time y but not at y-1
O’’: Objects at time y
O’’’: Objects at time y but not at y+1
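The three object sets are simple set differences over consecutive snapshots. A minimal sketch, assuming each snapshot is available as a collection of object identifiers (the function name and the example data are hypothetical):

```python
def appearance_sets(prev_objs, curr_objs, next_objs):
    """Return (O', O'', O''') for the snapshot at time y.

    prev_objs, curr_objs, next_objs: objects present at y-1, y and y+1.
    """
    curr = set(curr_objs)                  # O'': objects at time y
    appeared = curr - set(prev_objs)       # O': at y but not at y-1
    disappearing = curr - set(next_objs)   # O''': at y but not at y+1
    return appeared, curr, disappearing


# Made-up author sets for three consecutive years.
o1, o2, o3 = appearance_sets(
    prev_objs={"a", "b"},
    curr_objs={"b", "c", "d"},
    next_objs={"c", "e"},
)
```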
Metrics
• Stability of objects
– Degree to which an object is stable with respect to
its cluster or network
• Sociability of objects
– Degree to which an object interacts with different
clusters
• Effect of social influence: normality
– Normality is the degree to which an object follows
the cluster trend
Experiments
• Dataset
– DBLP
• 1993 to 2008, 654K papers, 484K authors, 107K title
terms and 3900 conferences
• Number of clusters = 4
• Levels of net-cluster tree = 4
• Prior weight varied from 0 to 1
– Four_area
• DM, DB, IR, ML papers
• 1993 to 2008, 29K papers, 28K authors, 20 conferences
Related work
• Clustering graphs: Mincut, Min-max cut, Spectral,
density-based, RankClus [Sun EDBT 09], NetClus
[Sun KDD 09]
• Evolutionary clustering: k-means [Chak KDD06],
spectral [Chi KDD07], text streams [Mei KDD05],
social network structure [Kuma KDD06]
• Evolutionary graph studies: GraphScope [Sun
KDD07], density-based [Kim VLDB09], analysis
[Back KDD06, Lesk KDD05, Lesk KDD08],
communities using FacetNet [Lin WWW08],
individual objects [Asur KDD07]
Conclusion
• A clustering algorithm for evolution diagnosis
of heterogeneous information networks.
• Metrics for novel insights into the evolution
both at the object level and the clustering
level
• Analysis and evolutionary study of DBLP
Acknowledgements
Research was sponsored in part by the U.S. National
Science Foundation under grant IIS-09-05215, and by the
Army Research Laboratory under Cooperative Agreement
Number W911NF-09-2-0053 (NS-CTA). The views and
conclusions contained in this document are those of the
authors and should not be interpreted as representing
the official policies, either expressed or implied, of the
Army Research Laboratory or the U.S. Government. The
U.S. Government is authorized to reproduce and
distribute reprints for Government purposes
notwithstanding any copyright notation hereon.
References (1)
References (2)
References (3)