FAST COUNTING OF TRIANGLES IN
LARGE NETWORKS:
ALGORITHMS AND LAWS
Charalampos (Babis) Tsourakakis
School of Computer Science
Carnegie Mellon University http://www.cs.cmu.edu/~ctsourak
RPI Theory Seminar, 24 November 2008
Given an undirected, simple graph G(V,E), a triangle is a set of 3 vertices such that any two of them are connected by an edge of the graph.
Related problems (b and c are our focus):
a) Decide if a graph is triangle-free.
b) Count the total number of triangles δ(G).
c) Count the number of triangles δ(v) that each vertex v participates in:
\[ \delta(v) = |\{(u, w) \in E : (v, u) \in E, (v, w) \in E\}| \]
d) List the triangles that each vertex v participates in.
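As a small worked example of problems b) and c), consider the complete graph $K_4$ (an illustrative choice, not taken from the talk). Every pair among the three neighbors of a vertex is connected, so:

```latex
\[
\delta(v) = \binom{3}{2} = 3 \ \text{for every } v \in V,
\qquad
\delta(G) = \frac{1}{3} \sum_{v \in V} \delta(v) = \frac{4 \cdot 3}{3} = 4 .
\]
```

The factor 1/3 appears because each triangle is counted once at each of its three vertices.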
Why is triangle counting important*?
Social Network Analysis:
“Friends of friends are friends” [WF94]
Web Spam Detection [BPCG08]
Hidden Thematic Structure of the Web [EM02]
Motif Detection e.g. biological networks [YPSB05]
* A few indicative reasons, from the graph mining perspective
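Furthermore, two often used metrics are:
Clustering Coefficient
\[ CC(G) = \frac{1}{|V'|} \sum_{v \in V'} cc(v) = \frac{1}{|V'|} \sum_{v \in V'} \frac{\delta(v)}{\tau(v)} \]
where $V' = \{v : d(v) \ge 2\}$ and $\tau(v) = \binom{d(v)}{2}$ is the number of triples at node v.
Transitivity Ratio
\[ TR = \frac{3\,\delta(G)}{\tau(G)} \]
where $\delta(G) = \frac{1}{3} \sum_{v \in V} \delta(v)$ and $\tau(G) = \sum_{v \in V} \tau(v)$.
[Figure: a triple at node v; a triangle]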
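The two metrics can be computed directly from adjacency sets. The following is a minimal sketch, not the talk's code: the function names and the toy graph are hypothetical, and `tau` stands for the number of triples centered at a vertex, i.e. d(v) choose 2.

```python
from itertools import combinations


def local_triangles(adj):
    """Return {v: delta(v)}, the number of triangles each vertex participates in."""
    return {
        v: sum(1 for u, w in combinations(sorted(nbrs), 2) if w in adj[u])
        for v, nbrs in adj.items()
    }


def clustering_coefficient(adj):
    """CC(G): mean of delta(v) / tau(v) over the vertices with degree >= 2."""
    delta = local_triangles(adj)
    ratios = []
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d >= 2:
            tau = d * (d - 1) // 2  # number of triples centered at v
            ratios.append(delta[v] / tau)
    return sum(ratios) / len(ratios) if ratios else 0.0


def transitivity_ratio(adj):
    """TR = 3 * delta(G) / tau(G)."""
    delta = local_triangles(adj)
    total_triangles = sum(delta.values()) // 3
    total_triples = sum(len(n) * (len(n) - 1) // 2 for n in adj.values())
    return 3 * total_triangles / total_triples if total_triples else 0.0


# Hypothetical toy graph: triangle 0-1-2 with a pendant vertex 3 attached to 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(clustering_coefficient(graph), transitivity_ratio(graph))
```

On this toy graph the two metrics differ (7/9 versus 3/5), because the clustering coefficient averages per-vertex ratios while the transitivity ratio is a single global ratio.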
• Related Work
• Proposed Method
• Experiments
• Triangle-related Laws
• Triangles in Kronecker Graphs
• Future Work & Open Problems
Counting methods
Dense graphs
             Time complexity                        Space complexity
Fast         $O(n^{2.37})$                          $O(n^2)$
Low space    $O(n^3)$                               $O(m)$

Sparse graphs
             Time complexity                        Space complexity
Fast         $O(m^{0.7} n^{1.2} + n^{2+o(1)})$      $\Theta(n^2)$ (eventually)
Low space    (bound involving $d_{max}$)            $\Theta(m)$
Proposed Method
EigenTriangle theorem
EigenTriangleLocal theorem
EigenTriangle algorithm
EigenTriangleLocal algorithm
Efficiency & Complexity
Power law degree distributions
Gershgorin discs
Real world network spectra
Theorem
The number of triangles δ(G) in an undirected, simple graph G(V,E) is given by:
\[ \delta(G) = \frac{1}{6} \sum_{i=1}^{|V|} \lambda_i^3 \]
where $\lambda_1, \lambda_2, \ldots, \lambda_{|V|}$ are the eigenvalues of the adjacency matrix of graph G.
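The theorem can be checked numerically on a small graph. This is a minimal sketch assuming NumPy is available; the function names and the example graph are hypothetical.

```python
import numpy as np


def triangles_from_spectrum(A):
    """Total triangle count as (1/6) * sum of the cubed eigenvalues of A."""
    eigenvalues = np.linalg.eigvalsh(A)  # A is symmetric, so eigvalsh applies
    return float(np.sum(eigenvalues ** 3) / 6.0)


def triangles_from_trace(A):
    """Reference count: trace(A^3) / 6."""
    return float(np.trace(A @ A @ A) / 6.0)


# Hypothetical example: the complete graph K4, which has 4 triangles.
A = np.ones((4, 4)) - np.eye(4)
print(triangles_from_spectrum(A), triangles_from_trace(A))
```

K4 has eigenvalues 3, -1, -1, -1, so the sum of cubes is 27 - 3 = 24 and the count is 24 / 6 = 4.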
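Call A the adjacency matrix of the graph. Consider the i-th diagonal element of $A^3$, $\alpha_{ii}$. This element counts the closed walks of length 3 that start and end at vertex i, so it equals twice the number of triangles vertex i participates in (each triangle is traversed both as i-j-k and as i-k-j). So the trace of $A^3$ is 6δ(G), because each triangle is counted 6 times (at each of its 3 participating vertices, in both directions). Furthermore, if $Ax = \lambda x$, then $\lambda^3$ is an eigenvalue of $A^3$ (*), and conversely, if λ is an eigenvalue of $A^3$, then $\lambda^{1/3}$ is an eigenvalue of A. Hence the trace of $A^3$ equals $\sum_i \lambda_i^3$.
(*) $A^3 x = AAAx = AA\lambda x = \lambda AAx = \lambda A \lambda x = \lambda^2 Ax = \lambda^3 x$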
Theorem
The number of triangles δ(i) that vertex i participates in is equal to:
\[ \delta(i) = \frac{\sum_{j=1}^{|V|} \lambda_j^3 u_{ij}^2}{2} \]
where $u_{ij}$ is the i-th entry of the j-th eigenvector.
Proof [Sketch]
Follows from the previous theorem and the fact that A is symmetric, therefore diagonalizable, and also
\[ A^3 = U \Lambda^3 U^T \]
so the i-th diagonal element of $A^3$ is $\sum_j \lambda_j^3 u_{ij}^2$.
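The local formula can be evaluated for all vertices at once from a full eigendecomposition. This is a minimal sketch assuming NumPy; the function name and the toy graph are hypothetical, and column j of `U` holds the j-th eigenvector.

```python
import numpy as np


def local_triangles_spectral(A):
    """delta(i) = (1/2) * sum_j lambda_j^3 * U[i, j]^2 for every vertex i."""
    eigenvalues, U = np.linalg.eigh(A)  # columns of U are orthonormal eigenvectors
    return (U ** 2) @ (eigenvalues ** 3) / 2.0


# Hypothetical toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
delta = local_triangles_spectral(A)
print(np.round(delta, 6))
```

The result agrees with half the diagonal of $A^3$: vertices 0, 1 and 2 each take part in one triangle and the pendant vertex in none.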
Why are these two algorithms efficient?
Skewed degree distributions are ubiquitous in nature!
They have been termed “the signature of human activity” [FKP02], but they also appear in all other kinds of networks, e.g. biological ones.
See [N05][M04] for generative models of power law distributions.
They are typically referred to as power laws (even if sometimes we abuse the strict definition of a power law).
Newman [N05] demonstrated how often power laws appear, using many different types of data, ranging from word frequencies to the populations of cities.
[Figure annotations: many cities have a small population; few cities have a huge population]
Theorem
Let B be an arbitrary n×n matrix. Then the eigenvalues λ of B are located in the union of the n discs
\[ |\lambda - b_{kk}| \le \sum_{j \ne k} |b_{kj}| \]
For a proof see Demmel [D97], p.82.
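To make the discs concrete, the sketch below computes them for a small adjacency matrix and checks that known eigenvalues fall inside their union. The function names and the example graph are hypothetical choices for illustration.

```python
def gershgorin_discs(B):
    """One (center, radius) pair per row k: center b_kk, radius sum_{j != k} |b_kj|."""
    discs = []
    for k, row in enumerate(B):
        radius = sum(abs(x) for j, x in enumerate(row) if j != k)
        discs.append((row[k], radius))
    return discs


def in_union_of_discs(value, discs, tol=1e-12):
    """True if `value` lies inside at least one disc."""
    return any(abs(value - center) <= radius + tol for center, radius in discs)


# Hypothetical example: adjacency matrix of the path graph 0-1-2.
# Its eigenvalues are known in closed form: -sqrt(2), 0, sqrt(2).
B = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
discs = gershgorin_discs(B)
print(discs)
```

For an adjacency matrix every disc is centered at 0 with radius equal to the vertex degree, so the theorem only says that all eigenvalues are bounded in absolute value by the maximum degree.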
Bounds on the airports network (Observe how loose)
[Figure: eigenvalue spectra of two real-world networks: Political blogs and Airports]
Zooming in on the top eigenvalues and plotting the rank vs. the eigenvalue in log-log scale reveals that the top eigenvalues follow a power law [FFF99].
Some years later, Mihail & Papadimitriou [MP02] and Chung, Lu and Vu [CLV03] proved this fact.
Simple & clear:
Use a low-rank approximation of A 3 to estimate the diagonal elements and the trace.
Suggests also a way of thinking:
Take advantage of special properties (e.g. power laws) to reduce the complexity of certain computational tasks in real-world networks.
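The low-rank idea can be sketched as follows. This is an illustration under stated assumptions, not the talk's implementation: a dense NumPy eigendecomposition stands in for Lanczos, the eigenpairs are selected by largest absolute value, and the function name, the `rank` parameter and the test graph are hypothetical.

```python
import numpy as np


def eigentriangle_lowrank(A, rank):
    """Estimate total and per-vertex triangle counts from the `rank`
    eigenpairs of largest absolute value.

    Assumption: a dense eigendecomposition replaces the Lanczos iterations
    mentioned in the talk; only the selected eigenpairs enter the estimate.
    """
    eigenvalues, U = np.linalg.eigh(A)
    top = np.argsort(-np.abs(eigenvalues))[:rank]
    lam, vecs = eigenvalues[top], U[:, top]
    total = float(np.sum(lam ** 3) / 6.0)
    local = (vecs ** 2) @ (lam ** 3) / 2.0
    return total, local


# Hypothetical test graph: two K5 cliques joined by a single edge (20 triangles).
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[4, 5] = A[5, 4] = 1.0

exact = float(np.trace(A @ A @ A) / 6.0)
approx, _ = eigentriangle_lowrank(A, rank=2)
full, local = eigentriangle_lowrank(A, rank=n)
print(exact, approx, full)
```

With `rank=n` the estimate is exact. A graph this small is not representative of the approximation quality on real networks, where the skewed spectrum is what makes a small rank sufficient.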
The first main reason is that the bulk of the spectrum, apart from the top eigenvalues, is almost symmetric around 0, so the cubes of those eigenvalues nearly cancel in the sum.
Cubing strongly amplifies this phenomenon!
The main computational bottleneck, which determines the complexity, is the Lanczos method.
Lanczos runs in time linear in the number of nonzero entries of the matrix, i.e. the edges, assuming that we compute a small constant number of eigenvalues.
Convergence of Lanczos is fast due to the eigenvalue power law (see Kaniel-Paige theory [GL89]).
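The linear cost per iteration comes from the sparse matrix-vector product. The sketch below uses plain power iteration, which is a simpler stand-in and not the Lanczos method itself, to show that each step only walks the adjacency lists. All names and the toy graph are hypothetical.

```python
import math


def matvec(adj, x):
    """y = A x from adjacency lists; every edge is touched twice, so the cost is O(m)."""
    return [sum(x[u] for u in neighbors) for neighbors in adj]


def top_eigenpair(adj, iterations=200):
    """Power iteration (a simpler stand-in for Lanczos, not Lanczos itself).

    Assumes the graph has at least one edge and is connected and not
    bipartite, so that the iteration converges to the largest eigenvalue.
    """
    n = len(adj)
    x = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        y = matvec(adj, x)
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    y = matvec(adj, x)
    return sum(a * b for a, b in zip(x, y)), x  # Rayleigh quotient and vector


# Hypothetical toy graph: triangle 0-1-2 plus a pendant vertex 3 attached to 2.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
lam, vec = top_eigenpair(adj)
print(round(lam, 4))
```

Lanczos builds on the same matrix-vector products but extracts several extreme eigenpairs from the generated Krylov subspace, which is why it is the method of choice here.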
Experiments
The Node Iterator algorithm considers one node at a time, looks at its neighbors, and checks how many pairs of them are connected to each other.
We report the results as the speedup that the EigenTriangle algorithm gives compared to the running time of the Node Iterator.
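A minimal sketch of the Node Iterator baseline as described above, using adjacency sets. The function name and the toy graph are hypothetical.

```python
from itertools import combinations


def node_iterator(adj):
    """Count triangles by checking, for every node, each pair of its neighbors."""
    per_node = {}
    for v, nbrs in adj.items():
        per_node[v] = sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
    total = sum(per_node.values()) // 3  # every triangle is seen at its 3 vertices
    return total, per_node


# Hypothetical toy graph: K4 on vertices 0..3 plus a pendant vertex 4 attached to 0.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0},
}
total, per_node = node_iterator(adj)
print(total, per_node)
```

The work at a node grows with the square of its degree, which is why high-degree nodes in skewed degree distributions dominate the running time of this baseline.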
Some interesting facts about the two scatterplots:
The mean approximation rank required for at least 95% accuracy is 6.2.
Speedups are between 33.7x and 1159x.
The mean speedup is 250x.
Notice the increasing speedup as the size of the network grows.
[Figure: zoom-in on one point of the scatterplot]
Evaluating the Local Counting Method
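Pearson’s correlation coefficient ρ
Relative Reconstruction Error
\[ RRE = \frac{1}{|V|} \sum_{i=1}^{|V|} \frac{|\delta(i) - \delta'(i)|}{\delta(i)} \]
where δ'(i) is the estimated number of triangles of vertex i.
Political Blogs:
RRE: $7 \cdot 10^{-4}$
ρ: 99.97%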
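Both evaluation measures are simple to compute from the exact and the estimated per-vertex counts. The sketch below is an illustration with made-up numbers; the function names are hypothetical, and skipping vertices with zero triangles in the RRE is an assumption, since the talk does not say how they are treated.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def relative_reconstruction_error(true_counts, estimates):
    """Mean of |delta(i) - delta'(i)| / delta(i).

    Assumption: vertices with delta(i) = 0 are skipped to avoid dividing by zero.
    """
    terms = [abs(t - e) / t for t, e in zip(true_counts, estimates) if t > 0]
    return sum(terms) / len(terms)


# Made-up exact and estimated per-vertex triangle counts, for illustration only.
exact = [3, 3, 1, 2]
estimated = [3.3, 2.7, 1.0, 2.0]
rho = pearson(exact, estimated)
rre = relative_reconstruction_error(exact, estimated)
print(round(rho, 4), round(rre, 4))
```

A high ρ says the estimates rank the vertices correctly, while a low RRE says the individual counts are close in relative terms, so the two measures complement each other.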
#Eigenvalues vs. ρ for three networks
Observe how a low rank results in almost optimal results.
This holds for surprisingly many real world networks
Triangle-related Laws
The plots show the number of triangles δ (x-axis) vs. the count of vertices that participate in δ triangles.
a) EPINIONS, a who-trusts-whom network b) ASN, a social network c) HEP_TH, a collaboration network
The plots show the degree $d_i$ (x-axis) vs. the mean number of triangles that nodes with degree $d_i$ participate in.
Panels: Epinions, ASN
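The data behind both kinds of plots in this section can be derived from per-vertex triangle counts. This is a minimal sketch with a hypothetical function name and toy graph; the talk's plots use real networks and log-log axes.

```python
from collections import Counter, defaultdict
from itertools import combinations


def triangle_law_data(adj):
    """Return the two series described in this section.

    participation[t]  = number of vertices that participate in exactly t triangles
    mean_by_degree[d] = mean number of triangles over the vertices of degree d
    """
    delta = {
        v: sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
        for v, nbrs in adj.items()
    }
    participation = Counter(delta.values())
    by_degree = defaultdict(list)
    for v, nbrs in adj.items():
        by_degree[len(nbrs)].append(delta[v])
    mean_by_degree = {d: sum(ts) / len(ts) for d, ts in by_degree.items()}
    return dict(participation), mean_by_degree


# Hypothetical toy graph: K4 on vertices 0..3 plus a pendant vertex 4 attached to 0.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2},
    4: {0},
}
participation, mean_by_degree = triangle_law_data(adj)
print(participation, mean_by_degree)
```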
Triangles in Kronecker Graphs
The Kronecker graph model was introduced in [LCKF05]. It is based on the simple operation of the Kronecker product to generate graphs that mimic real world networks.
Deterministic Kronecker Graphs: Kronecker Product of the adjacency matrix at the current step k with the initiator adjacency matrix (typically small).
Stochastic Kronecker Graphs: Kronecker Product of the matrix at the current step k with the initiator matrix. Initiator matrix contains probabilities.
For more details see [LF07].
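Both variants can be sketched with a plain Kronecker product on lists of lists. The initiator matrices below are hypothetical, and sampling every entry independently in the stochastic case is an assumption about the model's details; see the cited papers for the actual definition.

```python
import random


def kron(P, Q):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(P), len(Q)
    return [
        [P[i // m][j // m] * Q[i % m][j % m] for j in range(n * m)]
        for i in range(n * m)
    ]


def kron_power(initiator, steps):
    """Apply `steps` Kronecker multiplications by the initiator."""
    result = initiator
    for _ in range(steps):
        result = kron(result, initiator)
    return result


def sample_stochastic(prob_matrix, rng):
    """Keep entry (i, j) as an edge with probability prob_matrix[i][j]."""
    return [[1 if rng.random() < p else 0 for p in row] for row in prob_matrix]


# Hypothetical initiators, chosen only for illustration.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]  # deterministic: 0/1 adjacency matrix
probs = [[0.9, 0.5], [0.5, 0.1]]          # stochastic: edge probabilities

deterministic = kron_power(path, 1)       # 9 x 9 adjacency matrix
stochastic = sample_stochastic(kron_power(probs, 2), random.Random(0))  # 8 x 8
print(len(deterministic), len(stochastic))
```

The matrix size grows geometrically with the number of steps, which is how a small initiator produces a large graph.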
Some notation first:
A: the n×n initiator adjacency matrix of the undirected, simple graph $G_A$
$B = A^{[k]}$: the k-th Kronecker product
$\lambda = (\lambda_1, \ldots, \lambda_n)$: the eigenvalues of A
$\Delta(G_A)$, $\Delta(G_B)$: the number of triangles of $G_A$, $G_B$
Theorem [KroneckerTRC]
\[ \Delta(G_B) = 6^k \, \Delta(G_A)^{k+1}, \quad k \ge 0 \]
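The theorem can be checked by brute force on a small initiator. The sketch below assumes that $A^{[0]} = A$ and that each further step multiplies by A once, which matches the k = 0 case of the formula; the function names and the choice of K4 as initiator are hypothetical.

```python
from itertools import combinations


def kron(P, Q):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(P), len(Q)
    return [
        [P[i // m][j // m] * Q[i % m][j % m] for j in range(n * m)]
        for i in range(n * m)
    ]


def count_triangles(M):
    """Brute-force triangle count over all vertex triples of a 0/1 symmetric matrix."""
    n = len(M)
    return sum(
        1 for i, j, k in combinations(range(n), 3)
        if M[i][j] and M[j][k] and M[i][k]
    )


def kronecker_trc(triangles_in_initiator, k):
    """Predicted count: 6^k * Delta(G_A)^(k+1)."""
    return 6 ** k * triangles_in_initiator ** (k + 1)


# Hypothetical initiator: the complete graph K4, which has 4 triangles.
A = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
B = A
counts = [count_triangles(B)]
for _ in range(2):
    B = kron(B, A)
    counts.append(count_triangles(B))
print(counts, [kronecker_trc(4, k) for k in range(3)])
```

The brute-force counts for k = 0, 1, 2 are 4, 96 and 2304, matching the closed form without generating anything larger than a 64-vertex graph.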
We use induction on the number of recursion steps k.
For k = 0 the theorem trivially holds.
Assume now that KroneckerTRC holds for some $r \ge 0$. Call $C = A^{[r]}$, $D = A^{[r+1]}$, and let $[\mu_i]_{i=1..s}$ be the eigenvalues of C. By the assumption,
\[ \Delta(G_C) = 6^r \, \Delta(G_A)^{r+1} \]
The eigenvalues of D are given by the Kronecker product of the eigenvalues of C and A, i.e. the products $\mu_i \lambda_j$. By the EigenTriangle theorem, the number of triangles in D is given by:
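\[
\Delta(G_D) = \frac{1}{6} \sum_{i=1}^{s} \sum_{j=1}^{n} \mu_i^3 \lambda_j^3
= \frac{1}{6} \Big( \sum_{i=1}^{s} \mu_i^3 \Big) \Big( \sum_{j=1}^{n} \lambda_j^3 \Big)
= \frac{1}{6} \cdot 6\,\Delta(G_C) \cdot 6\,\Delta(G_A)
= 6\,\Delta(G_A)\,\Delta(G_C)
= 6^{r+1} \, \Delta(G_A)^{r+2}
\]
Q.E.D.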
Future Work & Open Problems
Theoretical Challenge I:
Spectra of real world networks
Can we prove things about the distribution of the eigenvalues, adopting a random graph model such as the expected degree model G(w) [CLV03]?
An analog to Wigner’s semicircle law for random
Erdos-Renyi graphs (see Furedi-Komlos [FK81])
[Figure: spectrum of $G(40, \tfrac{1}{2})$ over 100000 iterations [S07]]
Theoretical Challenge I:
Spectra of real world networks
Empirically, the rest of the spectrum:
Something about
Theoretical Challenge II:
Eigenvectors of real world networks
Things are even “worse” than in the case of spectra: very little is known about the eigenvectors.
Related work:
See [P08] for random graphs.
Theoretical Challenge III:
Degree Triangle Law
Prove the pattern we saw, using the expected degree random graph model G(w) (see [S04]).
Conjecture:
The relationship we observed probably appears for some values of the slope of the degree distribution. Further recent experiments showed that for some graphs this pattern does not hold.
Experimental Challenge I:
Compare with Streaming Methods
Streaming or semi-streaming methods perform one or O(1) passes over the graph.
[YKS02]
[BFLSS06]
[BPCG08]
Common Underlying Idea: Sophisticated sampling methods
Implement and compare.
Practical Challenge I:
Triangles in Large Scale Graph Mining
Many Giga-byte and Peta-byte sized graphs.
How to handle these graphs?
HADOOP
EigenTriangle algorithms are based just on simple matrix vector multiplications.
Easy to parallelize on all sorts of architectures (distributed memory, shared memory).
See [DHV93] for the details.
PEGASUS: Peta-Graph Mining from the Triangle perspective
Soon…
Stay tuned!
On-going work with U Kang and
Christos Faloutsos in collaboration with Yahoo! Research.
Among others: Implement
EigenTriangle algorithms in
HADOOP and compare to other methods.
Find outliers in graphs with many billions of edges wrt triangles.
Christos Faloutsos
Yiannis Koutis
For the helpful discussions
Maria Tsiarli
For the PEGASUS logo
[WF94] Wasserman, Faust: “Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences)”
[EM02] Eckmann, Moses: “Curvature of co-links uncovers hidden thematic layers in the World Wide Web”
[BPCG08] Becchetti, Boldi, Castillo, Gionis: “Efficient Semi-Streaming Algorithms for Local Triangle Counting in Massive Graphs”
[FKP02] Fabrikant, Koutsoupias, Papadimitriou: “Heuristically Optimized Trade-offs: A New Paradigm for Power Laws in the Internet”
[N05] Newman: “Power laws, Pareto distributions and Zipf's law”
[M04] Mitzenmacher: “A brief history of generative models for power law and lognormal distributions”
[FK81] Furedi, Komlos: “Eigenvalues of random symmetric matrices”
[S04] Sergi: “Random graph model with power-law distributed triangle subgraphs”
[D97] Demmel: “Applied Numerical Linear Algebra”
[LCKF05] Leskovec, Chakrabarti, Kleinberg, Faloutsos: “Realistic, Mathematically Tractable Graph Generation and Evolution using Kronecker Multiplication”
[LF07] Leskovec, Faloutsos: “Scalable Modeling of Real Graphs using Kronecker Multiplication”
[FFF99] Faloutsos, Faloutsos, Faloutsos: “On power-law relationships of the Internet topology”
[MP02] Mihail, Papadimitriou: “On the Eigenvalue Power Law”
[CLV03] Chung, Lu, Vu: “Spectra of Random Graphs with given expected degrees”
[YKS02] Bar-Yossef, Kumar, Sivakumar: “Reductions in streaming algorithms, with an application to counting triangles in graphs”
[GL89] Golub, Van Loan: “Matrix Computations”
[BFLSS06] Buriol, Frahling, Leonardi, Spaccamela, Sohler: “Counting triangles in data streams”
[DHV93] Demmel, Heath, Vorst: “Parallel Numerical Linear Algebra”
[YPSB05] Ye, Peyser, Spencer, Bader: “Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast”
[P08] Pradipta Mitra: “Entrywise Bounds for Eigenvectors of Random Graphs”
[FDBV01] Farkas, Derenyi, Barabasi, Vicsek: “Spectra of "real-world" graphs: Beyond the semi-circle law”
[S07] Spielman’s “Spectral Graph Theory and its Applications” class (YALE): http://www.cs.yale.edu/homes/spielman/eigs/
[F08] Faloutsos’ “Multimedia Databases and Data Mining” class (CMU): http://www.cs.cmu.edu/~christos/courses/826.S08
For more references, see also the paper: http://www.cs.cmu.edu/~ctsourak/tsourICDM08.pdf