Tools for Large Graph Mining
- Deepayan Chakrabarti
Thesis Committee:

Christos Faloutsos

Chris Olston

Guy Blelloch

Jon Kleinberg (Cornell)
1
Introduction
Internet Map [lumeta.com]
Food Web [Martinez '91]
Protein Interactions [genomebiology.com]
Friendship Network [Moody '01]
► Graphs are ubiquitous
2
Introduction
What can we do with graphs?
How quickly will a disease spread on this graph?
"Needle exchange" networks of drug users [Weeks et al. 2002]
3
Introduction
What can we do with graphs?
• How quickly will a disease spread on this graph?
• Who are the "strange bedfellows"?
• Who are the key people?
• …
"Key" terrorist: Hijacker network [Krebs '01]
► Graph analysis can have great impact
4
Graph Mining: Two Paths
Specific applications:
• Node grouping
• Viral propagation
• Frequent pattern mining
• Fast message routing
General issues:
• Realistic graph generation
• Graph patterns and "laws"
• Graph evolution over time?
5
Our Work
Specific applications:
• Node grouping
• Viral propagation
• Frequent pattern mining
• Fast message routing
General issues:
• Realistic graph generation
• Graph patterns and "laws"
• Graph evolution over time?
6
Our Work
Specific applications:
• Node grouping: find "natural" partitions and outliers automatically.
• Viral propagation: will a virus spread and become an epidemic?
• Frequent pattern mining
• Fast message routing
General issues:
• Realistic graph generation: how can we mimic a given real-world graph?
• Graph patterns and "laws"
• Graph evolution over time?
7
Roadmap
Focus of this talk:
Specific applications
1 • Node grouping — find "natural" partitions and outliers automatically
2 • Viral propagation
General issues
3 • Realistic graph generation
• Graph patterns and "laws"
4 Conclusions
8
Node Grouping [KDD 04]
(Figure: a Customers × Products matrix reorganized into Customer Groups × Product Groups.)
Simultaneously group customers and products,
or, documents and words,
or, users and preferences …
9
Node Grouping [KDD 04]
Row and column groups
• need not be along a diagonal, and
• need not be equal in number
(Both example arrangements of Customer Groups × Product Groups are fine.)
10
Motivation
• Visualization
• Summarization
• Detection of outlier nodes and edges
• Compression, and others…
11
Node Grouping
Desiderata:
1. Simultaneously discover row and column groups
2. Fully Automatic: No “magic numbers”
3. Scalable to large matrices
4. Online: New data should not require full recomputations
12
Closely Related Work
Information Theoretic Co-clustering [Dhillon+/2003]:
the number of row and column groups must be specified.
Desiderata:
✔ Simultaneously discover row and column groups
✘ Fully Automatic: No "magic numbers"
✔ Scalable to large graphs
✔ Online
13
Other Related Work
• K-means and variants [Pelleg+/2000, Hamerly+/2003]: do not cluster rows and columns simultaneously.
• "Frequent itemsets" [Agrawal+/1994]: user must specify the "support".
• Information Retrieval [Deerwester+/1990, Hofmann/1999]: choosing the number of "concepts".
• Graph Partitioning [Karypis+/1998]: number of partitions; measure of imbalance between clusters.
14
What makes a cross-association "good"?
(Figure: two groupings of the same matrix into row and column groups — why is one better?)
1. Similar nodes are grouped together.
2. As few groups as necessary.
A few, homogeneous blocks: Good Compression implies Good Clustering.
15
Main Idea
Good Compression implies Good Clustering.
Take a binary matrix with rows and columns arranged into groups; the density pᵢ¹ = % of dots in block i. Then:

Total Encoding Cost = Σᵢ sizeᵢ · H(pᵢ¹)   [Code Cost]
                    + cost of describing nᵢ¹, nᵢ⁰ and the groups   [Description Cost]
16
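The code-cost term above charges each block its size times the binary entropy of its density. A minimal sketch (helper names are mine, not from the thesis):

```python
import math

def binary_entropy(p):
    """H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_cost(blocks):
    """Code cost of a grouping: sum over blocks of size_i * H(p_i),
    where each block is given as (number of 1s, number of 0s)."""
    total = 0.0
    for ones, zeros in blocks:
        size = ones + zeros
        if size > 0:
            total += size * binary_entropy(ones / size)
    return total
```

A perfectly homogeneous block (all dots or all blanks) costs 0 bits, while a 50/50 block costs a full bit per cell — which is why a few homogeneous blocks compress well.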
Examples
Total Encoding Cost = Σᵢ sizeᵢ · H(pᵢ¹)  [Code Cost]  +  cost of describing nᵢ¹, nᵢ⁰ and groups  [Description Cost]
• One row group, one column group: high code cost, low description cost.
• m row groups, n column groups: low code cost, high description cost.
17
What makes a cross-association "good"?
(Figure: the better grouping achieves both a low code cost and a low description cost.)
Total Encoding Cost = Σᵢ sizeᵢ · H(pᵢ¹)  [Code Cost]  +  cost of describing nᵢ¹, nᵢ⁰ and groups  [Description Cost]
18
Formal problem statement
Given a binary matrix,
Re-organize the rows and columns into groups, and
Choose the number of row and column groups, to
Minimize the total encoding cost.
19
Formal problem statement
Note: No Parameters
Given a binary matrix,
Re-organize the rows and columns into groups, and
Choose the number of row and column groups, to
Minimize the total encoding cost.
20
Algorithms
Search over group counts, e.g. up to k = 5 row groups and l = 5 col groups:
k=1, l=2 → k=2, l=2 → k=2, l=3 → k=3, l=3 → k=3, l=4 → k=4, l=4 → k=4, l=5 → …
21
Algorithms
Start with initial matrix → Find good groups for fixed k and l → Choose better values for k and l → Lower the encoding cost (repeat) → Final cross-association (k=5, l=5)
22
Fixed k and l
Start with initial matrix → [Find good groups for fixed k and l] → Choose better values for k and l → Lower the encoding cost → Final cross-association
23
Fixed k and l
Re-assign: for each row x, re-assign it to the row group which minimizes the code cost.
1. Row re-assigns
2. Column re-assigns
3. and repeat …
24
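The re-assign step above can be sketched as follows (a simplified sketch with hypothetical helper names; the thesis gives the exact formula in a backup slide):

```python
import math

def row_code_cost(row_counts, col_sizes, densities):
    """Bits to encode one row under a row group's block densities.
    row_counts[j] = number of 1s this row has in column group j,
    col_sizes[j] = size of column group j, densities[j] = block density."""
    cost = 0.0
    for d, w, p in zip(row_counts, col_sizes, densities):
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # guard against log(0)
        cost += -d * math.log2(p) - (w - d) * math.log2(1.0 - p)
    return cost

def reassign_row(row_counts, col_sizes, group_densities):
    """Assign the row to the row group that minimizes its code cost."""
    costs = [row_code_cost(row_counts, col_sizes, dens)
             for dens in group_densities]
    return costs.index(min(costs))
```

A row whose dots fall where a group's blocks are dense is cheap to encode under that group, so it gravitates there; alternating row and column passes monotonically lowers the total cost.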
Choosing k and l
Start with initial matrix → Find good groups for fixed k and l → [Choose better values for k and l] → Lower the encoding cost → Final cross-association
25
Choosing k and l
Split:
1. Find the most "inhomogeneous" group.
2. Remove the rows/columns which make it inhomogeneous.
3. Create a new group for these rows/columns.
26
Algorithms
Start with initial matrix → Find good groups for fixed k and l (Re-assigns) → Choose better values for k and l (Splits) → Lower the encoding cost → Final cross-association
27
Experiments
l = 5 col groups
k = 5 row groups
“Customer-Product” graph
with Zipfian sizes, no noise
28
Experiments
l = 8 col groups
k = 6 row groups
“Quasi block-diagonal” graph with
Zipfian sizes, noise=10%
29
Experiments
l = 3 col groups
k = 2 row groups
“White Noise” graph: we find the
existing spurious patterns
30
Experiments
"CLASSIC" dataset (Documents × Words):
• 3,893 documents
• 4,303 words
• 176,347 "dots"
Combination of 3 sources:
• MEDLINE (medical)
• CISI (info. retrieval)
• CRANFIELD (aerodynamics)
31
Experiments
"CLASSIC" graph of documents & words: k=15, l=19
32
Experiments
insipidus, alveolar, aortic,
death, …
blood, disease, clinical,
cell, …
MEDLINE
(medical)
“CLASSIC” graph of documents &
words: k=15, l=19
33
Experiments
providing, studying,
records, development, …
abstract, notation, works,
construct, …
MEDLINE
(medical)
CISI
(Information
Retrieval)
“CLASSIC” graph of documents &
words: k=15, l=19
34
Experiments
shape, nasa,
leading, assumed, …
MEDLINE
(medical)
CISI
(Information
Retrieval)
CRANFIELD
(aerodynamics)
“CLASSIC” graph of documents &
words: k=15, l=19
35
Experiments
paint, examination, fall,
raise, leave, based, …
MEDLINE
(medical)
CISI
(Information
Retrieval)
CRANFIELD
(aerodynamics)
“CLASSIC” graph of documents &
words: k=15, l=19
36
Experiments
"GRANTS" dataset (NSF Grant Proposals × Words in abstract):
• 13,297 documents
• 5,298 words
• 805,063 "dots"
37
Experiments
"GRANTS" graph of documents & words: k=41, l=28
38
Experiments
encoding, characters, bind, nucleus
The Cross-Associations refer
to topics:
• Genetics
“GRANTS” graph of documents & words:
k=41, l=28
39
Experiments
coupling, deposition, plasma, beam
The Cross-Associations refer
to topics:
• Genetics
• Physics
“GRANTS” graph of documents & words:
k=41, l=28
40
Experiments
manifolds, operators, harmonic
The Cross-Associations refer
to topics:
• Genetics
• Physics
• Mathematics
•…
“GRANTS” graph of documents & words:
k=41, l=28
41
Experiments
(Plot: Time (secs) vs number of "dots", for both splits and re-assigns.)
Linear in the number of "dots": Scalable
42
Summary of Node Grouping
Desiderata:
✔ Simultaneously discover row and column groups
✔ Fully Automatic: No "magic numbers"
✔ Scalable to large matrices
✔ Online: New data does not need full recomputation
43
Extensions
We can use the same MDL-based framework for other problems:
1. Self-graphs
2. Detection of outlier edges
44
Extension #1 [PKDD 04]
Self-graphs, such as:
• Co-authorship graphs
• Social networks
• The Internet, and the World-wide Web
(Figure: a bipartite Customers × Products graph vs an Authors × Authors self-graph.)
45
Extension #1 [PKDD 04]
Self-graphs:
• rows and columns represent the same nodes,
• so row re-assigns affect column re-assigns…
(Figure: a bipartite Customers × Products graph vs an Authors × Authors self-graph.)
46
Experiments
DBLP dataset (Authors × Authors):
• 6,090 authors in: SIGMOD, ICDE, VLDB, PODS, ICDT
• 175,494 co-citation or co-authorship links
47
Experiments
(Figure: Authors × Authors matrix reorganized into author groups.)
k=8 author groups found; one contains Stonebraker, DeWitt, Carey.
48
Extension #2 [PKDD 04]
Outlier edges:
• Which links should not exist? (illegal contact/access?)
• Which links are missing? (missing data?)
49
Extension #2 [PKDD 04]
Outlier edges are deviations from "normality": they lead to lower-quality compression.
So: find the edges whose removal maximally reduces the cost.
50
Roadmap
Specific applications
1 • Node grouping
2 • Viral propagation — will a virus spread and become an epidemic?
General issues
3 • Realistic graph generation
• Graph patterns and "laws"
4 Conclusions
51
The SIS (or "flu") model
• (Virus) birth rate β: probability that an infected neighbor attacks
• (Virus) death rate δ: probability that an infected node heals
• Cured = Susceptible
(Figure: an undirected network; an infected node attacks each healthy neighbor with prob. β, and heals with prob. δ.)
52
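One common discretization of the SIS model above can be simulated directly; this is a sketch under the synchronous-update assumption (a node cured this step is not re-attacked until the next step), with names of my choosing:

```python
import random

def sis_step(adj, infected, beta, delta, rng=random.random):
    """One synchronous SIS step: every infected node heals with
    probability delta; every infected neighbor attacks with probability beta.
    adj[i] is the list of neighbors of node i; infected is a set of nodes."""
    nxt = set()
    for node in range(len(adj)):
        if node in infected and rng() >= delta:
            nxt.add(node)          # failed to heal: stays infected
            continue
        # susceptible (or just cured): may be attacked by an infected neighbor
        for nb in adj[node]:
            if nb in infected and rng() < beta:
                nxt.add(node)
                break
    return nxt
```

Iterating `sis_step` and counting `len(infected)` over time reproduces the epidemic-vs-extinction competition between β and δ discussed on the next slide.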
The SIS (or "flu") model
• Competition between virus birth and death.
• Epidemic or extinction? Depends on the ratio β/δ, but also on the network topology.
(Figure: example of the effect of network topology on epidemic vs extinction.)
53
Epidemic threshold
The epidemic threshold τ is the value such that:
if β/δ < τ, there is no epidemic,
where β = birth rate and δ = death rate.
54
Previous models
Question: What is the epidemic threshold?
• Answer #1: 1/⟨k⟩ [Kephart and White '91, '93]
  BUT homogeneity assumption: all nodes have the same degree (but most graphs have power laws).
• Answer #2: ⟨k⟩/⟨k²⟩ [Pastor-Satorras and Vespignani '01]
  BUT mean-field assumption: all nodes of the same degree are equally affected (but susceptibility should depend on position in the network too).
55
The full solution is intractable!
• The full Markov Chain has 2^N states → intractable, so a simplification is needed.
• Independence assumption: probability that two neighbors are infected = product of individual probabilities of infection. This is a point estimate of the full Markov Chain.
56
Our model
A non-linear dynamical system (NLDS), which makes no assumptions about the topology. With p_i,t = probability that node i is infected at time t, and A = adjacency matrix:

1 − p_i,t = [1 − p_i,t−1 + δ·p_i,t−1] · ∏_{j=1..N} (1 − β·A_ji·p_j,t−1)

Healthy at time t = (healthy at time t−1, or infected but cured) AND no infection received from another node.
57
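The NLDS update is cheap to iterate; a minimal sketch of one step, assuming a symmetric 0/1 adjacency matrix and my own function name:

```python
import numpy as np

def nlds_step(p, A, beta, delta):
    """One step of the NLDS: for each node i,
    1 - p_i,t = (1 - p_i + delta*p_i) * prod_j (1 - beta * A[j, i] * p_j)."""
    A = np.asarray(A, dtype=float)
    p = np.asarray(p, dtype=float)
    # entry (j, i) of the product term is 1 - beta * A[j, i] * p[j]
    no_infection = np.prod(1.0 - beta * A * p[:, None], axis=0)
    healthy = (1.0 - p + delta * p) * no_infection
    return 1.0 - healthy
```

For example, on a single edge with node 0 surely infected, β = 0.5 and δ = 0, one step leaves node 0 infected and gives node 1 infection probability 0.5, matching the product form above.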
Epidemic threshold
[Theorem 1] We have no epidemic if:
β/δ < τ = 1/λ₁,A
where β = (virus) birth rate, δ = (virus) death rate, and λ₁,A = largest eigenvalue of the adjacency matrix A.
► λ₁,A alone decides viral epidemics!
58
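The threshold of Theorem 1 is one eigenvalue computation away; a sketch for symmetric adjacency matrices (the star-graph helper is mine, for illustration):

```python
import numpy as np

def epidemic_threshold(A):
    """tau = 1 / lambda_1(A) for a symmetric adjacency matrix A (Theorem 1)."""
    lam1 = float(np.linalg.eigvalsh(np.asarray(A, dtype=float))[-1])
    return 1.0 / lam1

def star_adjacency(n_leaves):
    """Hub node 0 connected to n_leaves leaf nodes."""
    A = np.zeros((n_leaves + 1, n_leaves + 1))
    A[0, 1:] = A[1:, 0] = 1.0
    return A
```

A star with k leaves has λ₁ = √k, so its threshold is 1/√k: hubs make a network dramatically easier to infect even though most nodes have degree 1.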
Recall the definition of eigenvalues
A·x = λ_A·x
λ₁,A = largest eigenvalue ≈ size of the largest "blob"
59
Experiments (100-node Star)
(Plot: Number of Infected Nodes vs Time for a 100-node star with β = 0.016 and δ ∈ {0.04, 0.08, 0.12, 0.16, 0.20}; curves split into β/δ > τ (above threshold), β/δ = τ (close to the threshold), and β/δ < τ (below threshold).)
60
Experiments (Oregon)
(Plot: Number of Infected Nodes vs Time for the Oregon graph, 10,900 nodes and 31,180 edges, with β = 0.001 and δ ∈ {0.05, 0.06, 0.07}; curves split into β/δ > τ (above threshold), β/δ = τ (at the threshold), and β/δ < τ (below threshold).)
61
Extensions
This dynamical-systems framework can be exploited further:
1. The rate of decay of the infection
2. Information survival thresholds in sensor/P2P networks
62
Extension #1
Below the threshold: how quickly does an infection die out?
[Theorem 2] Exponentially quickly.
63
Experiment (10K Star Graph)
(Plot: number of infected nodes (log-scale) vs time-steps (linear-scale), for several values of the "score" s = β/δ · λ₁,A, the "fraction" of the threshold; linear on log-lin scale → exponential decay.)
64
Experiment (Oregon Graph)
(Plot: number of infected nodes (log-scale) vs time-steps (linear-scale), for several values of the "score" s = β/δ · λ₁,A, the "fraction" of the threshold; linear on log-lin scale → exponential decay.)
65
Extension #2
Information survival in sensor networks [+ Leskovec, Faloutsos, Guestrin, Madden]
• Sensors gain new information,
• but they may die due to harsh environment or battery failure,
• so they occasionally try to transmit data to nearby sensors,
• and failed sensors are occasionally replaced.
• Under what conditions does the information survive?
68
Extension #2
[Theorem 1] The information dies out exponentially quickly if a condition (shown as a figure) holds on the failure rate of sensors, the resurrection rate, the retransmission rate, and the largest eigenvalue of the "link quality" matrix.
69
Roadmap
Specific applications
1 • Node grouping
2 • Viral propagation
General issues
3 • Realistic graph generation — how can we generate a "realistic" graph that mimics a given real-world graph?
• Graph patterns and "laws"
4 Conclusions
70
Experiments (Clickstream bipartite graph)
(Plot: Count vs In-degree of websites, Clickstream (+) vs R-MAT (×); the low-degree end contains some personal webpage, the high-degree end Yahoo, Google and others.)
71
Experiments (Clickstream bipartite graph)
(Plot: Count vs Out-degree of users, Clickstream (+) vs R-MAT (×); the low end contains email-checking surfers, the high end "all-night" surfers.)
72
Experiments (Clickstream bipartite graph)
(Plots: Count vs In-degree, Count vs Out-degree, Singular value vs Rank, Hop-plot, and Left/Right "Network value".)
►R-MAT can match real-world graphs
73
Roadmap
Specific applications
1 • Node grouping
2 • Viral propagation
General issues
3 • Realistic graph generation
• Graph patterns and “laws”
4 Conclusions
74
Conclusions
Two paths in graph mining:
• Specific applications:
  – Node Grouping → MDL-based approach for automatic grouping
  – Viral Propagation → non-linear dynamical system; epidemics depend on the largest eigenvalue
• General issues:
  – Graph Patterns → marks of "realism" in a graph
  – Graph Generators → R-MAT, a scalable generator matching many of the patterns
75
Software
http://www-2.cs.cmu.edu/~deepay/#Sw
Cross-Associations
• To find natural node groups.
• Used by "anonymous" large accounting firm.
• Used by Intel Research, Cambridge, UK.
• Used at UC Riverside (net intrusion detection).
• Used at the University of Porto, Portugal.
NetMine
• To extract graph patterns quickly + build realistic graphs.
• Used by Northrop Grumman Corp.
F4
• A non-linear time series forecasting package.
76
===CROSS-ASSOCIATIONS===
• Why simultaneous grouping?
• Differences from co-clustering and others?
• Other parameter-fitting criteria?
• Cost surface
• Exact cost function
• Exact complexity, wall-clock times
• Soft clustering
• Different weights for code and description costs?
• Precision-recall for CLASSIC
• Inter-group "affinities"
• Collaborative filtering and recommendation systems?
• CA versus bipartite cores
• Extras
• General comments on CA communities
77
===Viral Propagation===
• Comparison with previous methods
• Accuracy of dynamical system
• Relationship with full Markov chain
• Experiments on information survival threshold
• Comparison with Infinite Particle Systems
• Intuition behind the largest eigenvalue
• Correlated failures
78
===R-MAT===
• Graph patterns
• Generator desiderata
• Description of R-MAT
• Experiments on a directed graph
• R-MAT communities via Cross-Associations?
• R-MAT versus tree-based generators
79
===Graphs in general===
• Relational learning
• Graph Kernels
80
Simultaneous grouping is useful
(Figure: sparse blocks with little in common between rows; grouping rows first would collapse these two into one!)
Index
81
Cross-Associations ≠ Co-clustering!

Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.

Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of groups is chosen automatically using the MDL principle.

Index
82
Other parameter-fitting methods
The Gap statistic [Tibshirani+ '01]:
• Minimize the "gap" of log-likelihood of intra-cluster distances from the expected log-likelihood.
But:
• Needs a distance function between graph nodes
• Needs a "reference" distribution
• Needs multiple MCMC runs to remove "variance due to sampling" → more time.
Index
83
Other parameter-fitting methods
Stability-based method [Ben-Hur+ '02, '03]:
• Run clustering multiple times on samples of data, for several values of "k".
• For low k, clustering is stable; for high k, unstable.
• Choose this transition point.
But:
• Needs many runs of the clustering algorithm
• Arguments possible over definition of transition point
Index
84
Precision-Recall for CLASSIC
Index
85
Cost surface (total cost)
(Surface and contour plots of total cost over k and l.)
With increasing k and l: total cost decays very rapidly initially, but then starts increasing slowly.
Index
86
Cost surface (code cost only)
(Surface and contour plots of code cost over k and l.)
With increasing k and l: code cost decays very rapidly.
Index
87
Encoding Cost Function
Total encoding cost =
  log*(k) + log*(l)            (cluster numbers)
+ N·log(N) + M·log(M)          (row/col order)
+ Σᵢ log(aᵢ) + Σⱼ log(bⱼ)      (cluster sizes)
+ ΣᵢΣⱼ log(aᵢbⱼ + 1)           (block densities)
+ ΣᵢΣⱼ aᵢbⱼ · H(pᵢ,ⱼ)          (code cost)
where the first four terms form the description cost and the last is the code cost.
Index
88
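The description-cost terms above can be sketched directly from the formula (function names are mine; `log*` is the universal code length for integers):

```python
import math

def log_star(k):
    """Universal integer code length: log*(k) = log2(k) + log2(log2(k)) + ...,
    summing only the positive terms."""
    total, v = 0.0, float(k)
    while True:
        v = math.log2(v)
        if v <= 0.0:
            break
        total += v
    return total

def description_cost(k, l, row_sizes, col_sizes):
    """Description-cost terms: cluster numbers, row/column order,
    cluster sizes, and per-block density headers."""
    n, m = sum(row_sizes), sum(col_sizes)
    cost = log_star(k) + log_star(l)                   # cluster numbers
    cost += n * math.log2(n) + m * math.log2(m)        # row/col order
    cost += sum(math.log2(a) for a in row_sizes)       # cluster sizes
    cost += sum(math.log2(b) for b in col_sizes)
    cost += sum(math.log2(a * b + 1)                   # block densities
                for a in row_sizes for b in col_sizes)
    return cost
```

Adding the code-cost term ΣᵢΣⱼ aᵢbⱼ·H(pᵢ,ⱼ) to this gives the total cost that the split/re-assign loop minimizes.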
Complexity of CA
O(E·(k² + l²)), ignoring the number of re-assign iterations, which is typically low.
Index
89
Complexity of CA
(Plot: Time / Σ(k+l) vs number of edges.)
Index
90
Inter-group distances
distance(i, j) = relative increase in cost on merging groups i and j.
Two groups are "close" if merging them does not increase the cost by much.
(Figure: Nodes × Node Groups matrix with groups Grp1, Grp2, Grp3.)
Index
91
Inter-group distances
(Figure: the same node groups annotated with pairwise distances 5.5, 4.5 and 5.1 between Grp1, Grp2 and Grp3.)
distance(i, j) = relative increase in cost on merging i and j.
Index
92
Experiments
(Figure: author groups Grp1…Grp8, with the Stonebraker, DeWitt, Carey group marked.)
Inter-group distances can aid in visualization.
Index
93
Collaborative filtering and recommendation systems
• Q: If someone likes a product X, will (s)he like product Y?
• A: Check if others who liked X also liked Y.
• The focus is on distances between people (typically cosine similarity), and not on clustering.
Index
94
CA and bipartite cores: related but different
(Figure: a 3×2 bipartite core of hubs and authorities.)
Kumar et al. [1999] say that bipartite cores correspond to communities.
Index
95
CA and bipartite cores: related but different
• CA finds two communities there: one for hubs, and one for authorities.
• We gracefully handle cases where a few links are missing.
• CA considers connections between all sets of clusters, and not just two sets.
• Not every node need belong to a non-trivial bipartite core.
CA is (informally) a generalization.
Index
96
Comparison with soft clustering
• Soft clustering → each node belongs to each cluster with some probability.
• Hard clustering → one cluster per node.
Index
97
Comparison with soft clustering
1. Soft clustering has far more degrees of freedom:
   • parameter fitting is harder,
   • algorithms can be costlier.
2. Hard clustering is better for exploratory data analysis.
3. Some real-world problems require hard clustering → e.g., fraud detection for accountants.
Index
98
Weights for code cost vs description cost
• Total = 1·(code cost) + 1·(description cost)
  Physical meaning: total number of bits.
• Total = α·(code cost) + β·(description cost)
  Physical meaning: number of encoding bits under some prior.
Index
99
Formula for re-assigns
Re-assign: for each row x, move it to the row group that minimizes its code cost (formula shown as a figure).
Index
100
Choosing k and l
Split:
1. Find the row group R with the maximum entropy per row.
2. Choose the rows in R whose removal reduces the entropy per row in R.
3. Send these rows to the new row group, and set k = k+1.
Index
101
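Step 1 of the split can be sketched as follows (a simplified sketch with hypothetical names; each row group is summarized by how many 1s each of its rows has per column group):

```python
import math

def H(p):
    """Binary entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_per_row(rows, col_sizes):
    """Average bits per row of a row group: sum over column groups of
    w_j * H(block density).  rows[x][j] = # of 1s row x has in col group j."""
    n = len(rows)
    per_row = 0.0
    for j, w in enumerate(col_sizes):
        ones = sum(r[j] for r in rows)
        per_row += w * H(ones / (n * w))
    return per_row

def pick_group_to_split(groups, col_sizes):
    """Step 1: the row group with maximum entropy per row."""
    scores = [entropy_per_row(g, col_sizes) for g in groups]
    return scores.index(max(scores))
```

A group whose rows disagree (half dense, half sparse in the same column group) has high entropy per row and is the natural candidate for splitting.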
Experiments
Epinions dataset:
• 75,888 users
• 508,960 "dots", one "dot" per "trust" relationship
k=19 user groups found, with a small dense "core".
Index
102
Comparison with previous methods
• Our threshold subsumes the homogeneous model → proof.
• We are more accurate than the mean-field-assumption model.
Index
103
Comparison with previous methods
10K Star Graph
Index
104
Comparison with previous methods
Oregon Graph
Index
105
Accuracy of dynamical system
10K Star Graph
Index
106
Accuracy of dynamical system
Oregon Graph
Index
107
Accuracy of dynamical system
10K Star Graph
Index
108
Accuracy of dynamical system
Oregon Graph
Index
109
Relationship with full Markov Chain
The full Markov Chain is of the form:
Prob(infection at time t) = X_t−1 + Y_t−1 − Z_t−1, where Z_t−1 is the non-linear component.
The independence assumption leads to a point estimate for Z_t−1 → a non-linear dynamical system: still non-linear, but now tractable.
Index
110
Experiments: Information survival
• INTEL sensor map (54 nodes)
• MIT sensor map (40 nodes)
• and others…
Index
111
Experiments: Information survival
INTEL sensor map
Index
112
Survival threshold on INTEL
Index
113
Survival threshold on INTEL
Index
114
Experiments: Information survival
MIT sensor map
Index
115
Survival threshold on MIT
Index
116
Survival threshold on MIT
Index
117
Infinite Particle Systems
• "Contact Process" ≈ SIS model
• Differences:
  – Infinite graphs only → the questions asked are different.
  – Very specific topologies → lattices, trees.
  – Exact thresholds have not been found for these; proving existence of thresholds is important.
• Our results match those on the finite line graph [Durrett+ '88].
Index
118
Intuition behind the largest eigenvalue
• Approximately the size of the largest "blob".
• Consider the special case of a "caveman" graph: largest eigenvalue = 4.
Index
119
Intuition behind the largest eigenvalue
• Approximately the size of the largest "blob": largest eigenvalue = 4.016.
Index
120
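The "blob" intuition is easy to check numerically: a k-clique has largest eigenvalue exactly k − 1, so a caveman graph whose biggest cave has 5 members has λ₁ ≈ 4, and extra links nudge it only slightly (this is an illustrative sketch, not from the thesis):

```python
import numpy as np

def largest_eigenvalue(A):
    """Largest eigenvalue of a symmetric adjacency matrix."""
    return float(np.linalg.eigvalsh(np.asarray(A, dtype=float))[-1])

def clique(k):
    """Adjacency matrix of a k-clique (the 'blob' of a caveman graph)."""
    return np.ones((k, k)) - np.eye(k)
```

Adding an edge between two disjoint 5-cliques changes λ₁ only in the third decimal place, which matches the 4 → 4.016 example above in spirit.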
Graph Patterns
Power Laws: Count vs Outdegree, and Count vs Indegree.
The "epinions" graph with 75,888 nodes and 508,960 edges.
Index
121
Graph Patterns
Power Laws and deviations (DGX/Lognormals [Bi+ '01])
(Plot: Count vs In-degree.)
Index
123
Graph Patterns
• Power Laws and deviations
• Small-world
• "Community" effect
• …
(Plot: # reachable pairs vs hops, marking the effective diameter.)
Index
124
Graph Generator Desiderata
• Power Laws and deviations
• Small-world
• "Community" effect
• …
Other desiderata:
• Few parameters
• Fast parameter-fitting
• Scalable graph generation
• Simple extension to undirected, bipartite and weighted graphs
Most current graph generators fail to match some of these.
Index
125
The R-MAT generator [SIAM DM'04]
• Intuition: the "80-20 law".
• Subdivide the 2ⁿ × 2ⁿ adjacency matrix into quadrants and choose one quadrant with probability (a, b, c, d), e.g. a = 0.5, b = 0.1, c = 0.15, d = 0.25.
Index
126
The R-MAT generator [SIAM DM'04]
• Subdivide the adjacency matrix,
• and choose one quadrant with probability (a, b, c, d).
• Recurse till we reach a 1×1 cell,
• where we place an edge,
• and repeat for all edges.
• Intuition: the "80-20 law".
Index
127
The R-MAT generator [SIAM DM'04]
• Only 3 parameters: a, b and c (d = 1−a−b−c).
• We have a fast parameter-fitting algorithm.
Index
128
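The recursive quadrant-picking above fits in a few lines; this is a minimal sketch (function name mine) that drops one edge into a 2^levels × 2^levels matrix:

```python
import random

def rmat_edge(levels, a, b, c, rng=random.random):
    """Drop one edge by recursively choosing a quadrant with
    probabilities (a, b, c, d = 1 - a - b - c), down to a 1x1 cell."""
    row = col = 0
    for _ in range(levels):
        r = rng()
        row <<= 1
        col <<= 1
        if r < a:                 # top-left quadrant
            pass
        elif r < a + b:           # top-right
            col |= 1
        elif r < a + b + c:       # bottom-left
            row |= 1
        else:                     # bottom-right
            row |= 1
            col |= 1
    return row, col

# e.g. 1,000 edges in a 1024 x 1024 matrix with skewed quadrant probabilities:
edges = [rmat_edge(10, 0.5, 0.1, 0.15) for _ in range(1000)]
```

Because the same skew (a, b, c, d) is applied at every level, low-numbered nodes accumulate far more edges than high-numbered ones, which is what produces the power-law-like degree distributions shown in the experiments.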
Experiments (Epinions directed graph)
(Plots: Count vs In-degree, Count vs Out-degree, Eigenvalue vs Rank, Hop-plot with effective diameter, "Network value", and Count vs Stress.)
►R-MAT matches directed graphs
Index
129
R-MAT communities and Cross-Associations
• R-MAT builds communities in graphs, and Cross-Associations finds them. Relationship?
  – R-MAT builds a hierarchy of communities, while CA finds a flat set of communities.
  – Linkage in the sizes of communities found by CA: when the R-MAT parameters are very skewed, the community sizes for CA are skewed, and vice versa.
Index
130
R-MAT and tree-based generators
• Recursive splitting in R-MAT ≈ following a tree from root to leaf.
• Relationship with other tree-based generators [Kleinberg '01, Watts+ '02]?
  – The R-MAT tree has edges as leaves; the others have nodes.
  – Tree-distance between nodes is used to connect nodes in other generators, but what does tree-distance between edges mean?
Index
131
131
Comparison with relational learning

Relational Learning (typical):
1. Aims to find small structures/patterns at the local level.
2. Labeled nodes and edges.
3. Semantics of labels are important.
4. Algorithms are typically costlier.

Graph Mining (typical):
1. Emphasis on global aspects of large graphs.
2. Unlabeled graphs.
3. More focused on topological structure and properties.
4. Scalability is more important.

Index
132
===OTHER WORK===
133
Other Work
Time Series Prediction [CIKM 2002]:
• We use the fractal dimension of the data.
• This is related to chaos theory and Lyapunov exponents…
134
Other Work

Logistic Parabola
Time Series Prediction
[CIKM 2002]
135
Other Work

Lorenz attractor
Time Series Prediction
[CIKM 2002]
136
Other Work

Laser fluctuations
Time Series Prediction
[CIKM 2002]
137
Other Work
Adaptive histograms with error guarantees [+ Ashraf Aboulnaga, Yufei Tao, Christos Faloutsos]
• Maintain count probabilities for buckets,
• to give statistically correct query result-size estimation,
• and query feedback,
• + …
(Figure: Count vs Salary histograms maintained under insertions and deletions.)
138
Other Work
User-personalization:
• Relevance feedback in multimedia image search
  – Patent number 6,611,834 (IBM)
  – Filed for patent (IBM)
• Building 3D models using robot camera and rangefinder data [ICML 2001]
139
===EXTRAS===
140
Conclusions
Two paths in graph mining:
• Specific applications:
  – Viral Propagation → resilience testing, information dissemination, rumor spreading
  – Node Grouping → automatically grouping nodes, AND finding the correct number of groups
References:
1. Fully automatic Cross-Associations, by Chakrabarti, Papadimitriou, Modha and Faloutsos, in KDD 2004
2. AutoPart: Parameter-free graph partitioning and outlier detection, by Chakrabarti, in PKDD 2004
3. Epidemic spreading in real networks: An eigenvalue viewpoint, by Wang, Chakrabarti, Wang and Faloutsos, in SRDS 2003
141
Conclusions
Two paths in graph mining:
• Specific applications
• General issues:
  – Graph Patterns → marks of "realism" in a graph
  – Graph Generators → R-MAT, a fast, scalable generator matching many of the patterns
References:
1. R-MAT: A recursive model for graph mining, by Chakrabarti, Zhan and Faloutsos, in SIAM Data Mining 2004.
2. NetMine: New mining tools for large graphs, by Chakrabarti, Zhan, Blandford, Faloutsos and Blelloch, in the SIAM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy
142
Other References
• F4: Large Scale Automated Forecasting using Fractals, by D. Chakrabarti and C. Faloutsos, in CIKM 2002.
• Using EM to Learn 3D Models of Indoor Environments with Mobile Robots, by Y. Liu, R. Emery, D. Chakrabarti, W. Burgard and S. Thrun, in ICML 2001.
• Graph Mining: Laws, Generators and Algorithms, by D. Chakrabarti and C. Faloutsos, under submission to ACM Computing Surveys.
143
References --- graphs
1. R-MAT: A recursive model for graph mining, by D. Chakrabarti, Y. Zhan and C. Faloutsos, in SIAM Data Mining 2004.
2. Epidemic spreading in real networks: An eigenvalue viewpoint, by Y. Wang, D. Chakrabarti, C. Wang and C. Faloutsos, in SRDS 2003.
3. Fully automatic Cross-Associations, by D. Chakrabarti, S. Papadimitriou, D. Modha and C. Faloutsos, in KDD 2004.
4. AutoPart: Parameter-free graph partitioning and outlier detection, by D. Chakrabarti, in PKDD 2004.
5. NetMine: New mining tools for large graphs, by D. Chakrabarti, Y. Zhan, D. Blandford, C. Faloutsos and G. Blelloch, in the SIAM 2004 Workshop on Link Analysis, Counter-terrorism and Privacy.
144
Roadmap
Specific applications
1 • Node grouping
2 • Viral propagation
General issues
3 • Realistic graph generation
• Graph patterns and “laws”
4 Other Work
5 Conclusions
145
Experiments (Clickstream bipartite graph)
(Plot: Count vs In-degree of websites; the low end contains some personal webpage, the high end Yahoo, Google and others.)
146
Experiments (Clickstream bipartite graph)
(Plot: Count vs Out-degree of users; email-checking surfers vs "all-night" surfers.)
147
Experiments (Clickstream bipartite graph)
(Plot: # reachable pairs vs hops, Clickstream vs R-MAT, over users and websites.)
148
Graph Generation
Important for:
• Simulations of new algorithms
• Compression using a good graph generation model
• Insight into the graph formation process
Our R-MAT (Recursive MATrix) generator can match many common graph patterns.
149
Recall the definition of eigenvalues
A·x = λ_A·x, where λ_A is an eigenvalue of A, and λ₁,A is the largest eigenvalue.
No epidemic if β/δ < τ = 1/λ₁,A.
150
Tools for Large Graph Mining
Deepayan Chakrabarti
Carnegie Mellon University
151