
FAST COUNTING OF TRIANGLES IN LARGE NETWORKS: ALGORITHMS AND LAWS

Charalampos (Babis) Tsourakakis

School of Computer Science

Carnegie Mellon University http://www.cs.cmu.edu/~ctsourak

RPI Theory Seminar, 24 November 2008

Counting Triangles

Given an undirected, simple graph G(V,E), a triangle is a set of 3 vertices such that any two of them are connected by an edge of the graph.

Related problems (our focus: b and c):

a) Decide if a graph is triangle-free.

b) Count the total number of triangles δ(G).

c) Count the number of triangles δ(v) that each vertex v participates in:

$$\delta(v) = |\{(u,w) \in E : (v,u) \in E,\ (v,w) \in E\}|$$

d) List the triangles that each vertex v participates in.
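The definition of δ(v) can be evaluated directly on a small graph. The following is a minimal sketch, assuming the graph is given as an edge list; the function name `local_triangles` and the toy graph are illustrative and not taken from the talk.

```python
from itertools import combinations

def local_triangles(edges):
    """Return {v: delta(v)} for an undirected simple graph given as an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # delta(v) = number of edges (u, w) whose endpoints are both neighbors of v
    return {
        v: sum(1 for u, w in combinations(nbrs, 2) if w in adj[u])
        for v, nbrs in adj.items()
    }

# Toy graph: the 4-cycle 0-1-2-3 plus the chord 0-2, i.e. triangles 0-1-2 and 0-2-3
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
delta = local_triangles(edges)
total = sum(delta.values()) // 3  # every triangle is seen from each of its 3 vertices
```

Summing δ(v) over all vertices and dividing by 3 recovers δ(G), which links problems b) and c).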

RPI, November 2008

Why is triangle counting important*?


Social Network Analysis:

“Friends of friends are friends” [WF94]

Web Spam Detection [BPCG08]

Hidden Thematic Structure of the Web [EM02]

Motif Detection e.g. biological networks [YPSB05]

*A few indicative reasons, from the graph mining perspective


Furthermore, two frequently used metrics are:

Clustering Coefficient

$$CC(G) = \frac{1}{|V'|} \sum_{v \in V'} cc(v) = \frac{1}{|V'|} \sum_{v \in V'} \frac{\delta(v)}{\tau(v)}$$

where $V' = \{v : d(v) \ge 2\}$ and $\tau(v) = \binom{d(v)}{2}$ is the number of triples at node v.

Transitivity Ratio

$$TR = \frac{3\,\delta(G)}{\tau(G)}$$

where $\delta(G) = \frac{1}{3} \sum_{v \in V} \delta(v)$ and $\tau(G) = \sum_{v \in V} \tau(v)$.

[Figure: a triple at node v; a triangle]
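Both metrics follow mechanically from the per-vertex triangle counts and degrees. Below is a minimal sketch, assuming the graph is stored as a dictionary of neighbor sets and contains at least one vertex of degree 2 or more; the function name and the toy graph are hypothetical.

```python
from itertools import combinations

def clustering_and_transitivity(adj):
    """adj maps each vertex to the set of its neighbors (undirected, simple graph)."""
    # delta(v): pairs of neighbors of v that are themselves adjacent
    delta = {v: sum(1 for u, w in combinations(n, 2) if w in adj[u])
             for v, n in adj.items()}
    # number of triples centered at v: C(d(v), 2)
    triples = {v: len(n) * (len(n) - 1) // 2 for v, n in adj.items()}
    v_prime = [v for v in adj if len(adj[v]) >= 2]           # V' = {v : d(v) >= 2}
    cc = sum(delta[v] / triples[v] for v in v_prime) / len(v_prime)
    total_triangles = sum(delta.values()) / 3                # delta(G)
    tr = 3 * total_triangles / sum(triples.values())         # transitivity ratio
    return cc, tr

# Toy graph: triangle 0-1-2 with a pendant vertex 3 attached to vertex 0
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
cc, tr = clustering_and_transitivity(adj)
```

On this toy graph the two metrics differ (7/9 versus 3/5), which illustrates that the clustering coefficient averages per vertex while the transitivity ratio weights every triple equally.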

Outline

• Related Work

• Proposed Method

• Experiments

• Triangle-related Laws

• Triangles in Kronecker Graphs

• Future Work & Open Problems


Counting methods

Dense graphs

Fast: time $O(n^{2.37})$, space $O(n^2)$

Low space: time $O(n^3)$, space $O(m)$

Sparse graphs

Fast: time $O(m^{0.7} n^{1.2} + n^{2+o(1)})$, space $\Theta(n^2)$ (eventually)

Low space: time bounded in terms of $d_{max}$, space $\Theta(m)$


Outline of the Proposed Method


EigenTriangle theorem

EigenTriangleLocal theorem

EigenTriangle algorithm

EigenTriangleLocal algorithm

Efficiency & Complexity

Power law degree distributions

Gershgorin discs

Real world network spectra

Theorem [EigenTriangle]

Theorem

The number of triangles δ(G) in an undirected, simple graph G(V,E) is given by:

$$\delta(G) = \frac{1}{6} \sum_{i=1}^{|V|} \lambda_i^3$$

where $\lambda_1, \lambda_2, \ldots, \lambda_{|V|}$ are the eigenvalues of the adjacency matrix of graph G.
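The identity can be checked numerically on a small graph. This is a minimal sketch using NumPy's dense symmetric eigensolver; the 4-vertex graph (a 4-cycle with one chord, two triangles) is an illustrative assumption, not a dataset from the talk.

```python
import numpy as np

# Adjacency matrix of the 4-cycle 0-1-2-3 plus the chord 0-2 (two triangles)
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

eigenvalues = np.linalg.eigvalsh(A)              # A is symmetric, so eigvalsh applies
triangles_spectral = (eigenvalues ** 3).sum() / 6
triangles_trace = np.trace(A @ A @ A) / 6        # same quantity via the trace of A^3
```

Both expressions evaluate to 2 up to floating-point rounding, matching the two triangles 0-1-2 and 0-2-3.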

Proof

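Let A be the adjacency matrix of the graph and consider the i-th diagonal element $\alpha_{ii}$ of $A^3$. It counts the closed walks of length 3 that start and end at vertex i, so it equals twice the number of triangles vertex i participates in: each triangle is counted once as i-j-k and once as i-k-j. Summing over the 3 participating vertices of every triangle, the trace of $A^3$ is 6δ(G), i.e. each triangle is counted 6 times. Furthermore, if $Ax = \lambda x$, then $\lambda^3$ is an eigenvalue of $A^3$ (*). Conversely, since A is symmetric and therefore diagonalizable with real eigenvalues, every eigenvalue of $A^3$ is the cube of an eigenvalue of A. The trace of $A^3$ is therefore $\sum_i \lambda_i^3$, which gives the theorem.

(*) $A^3 x = AAAx = AA\lambda x = \lambda AAx = \lambda A \lambda x = \lambda^2 A x = \lambda^3 x$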
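As a small worked check of the counting argument, consider the single triangle $K_3$ (an illustrative example, not from the slides). Each diagonal entry of $A^3$ is 2, because the one triangle is traversed in both directions, and the cubes of the eigenvalues sum to the same trace:

```latex
A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix},
\qquad
A^3 = \begin{pmatrix} 2 & 3 & 3 \\ 3 & 2 & 3 \\ 3 & 3 & 2 \end{pmatrix},
\qquad
\operatorname{tr}(A^3) = 6 = 6\,\delta(G).

\lambda_1 = 2,\quad \lambda_2 = \lambda_3 = -1
\quad\Longrightarrow\quad
\frac{1}{6}\sum_{i=1}^{3}\lambda_i^3 = \frac{8 - 1 - 1}{6} = 1.
```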

Theorem [EigenTriangleLocal]

Theorem

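The number of triangles δ(i) that vertex i participates in is equal to:

$$\delta(i) = \frac{1}{2} \sum_{j=1}^{|V|} \lambda_j^3 u_{ij}^2$$

where $u_{ij}$ is the i-th entry of the j-th eigenvector $u_j$, with the eigenvectors normalized to unit length.

Proof [Sketch]

Follows from the previous theorem and the fact that A is symmetric, therefore diagonalizable as $A = U \Lambda U^T$ with orthogonal U, so that

$$A^3 = U \Lambda^3 U^T$$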
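The local formula can be verified against the diagonal of $A^3$ on a small example. This is a minimal sketch using `numpy.linalg.eigh`, which returns orthonormal eigenvectors as the columns of its second output; the toy graph (a 4-cycle with one chord) is an assumption for illustration.

```python
import numpy as np

# 4-cycle 0-1-2-3 plus the chord 0-2: vertices 0 and 2 lie on two triangles each
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

lam, U = np.linalg.eigh(A)            # U[:, j] is the unit eigenvector for lam[j]
local = (U ** 2) @ (lam ** 3) / 2     # local[i] = 1/2 * sum_j lam[j]^3 * U[i, j]^2
reference = np.diag(A @ A @ A) / 2    # same numbers read off the diagonal of A^3
```

Note that the sum runs over eigenvectors j while the squared entry is taken at row i of U, which is the indexing the diagonalization $A^3 = U \Lambda^3 U^T$ requires.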


EigenTriangle Algorithm


EigenTriangleLocal Algorithm

Why are these two algorithms efficient?

Skewed Degree Distributions

Skewed degree distributions are ubiquitous in nature!

They have been termed “the signature of human activity” [FKP02], but they also appear in all other kinds of networks, e.g. biological ones.

See [N05][M04] for generative models of power law distributions.

Typically referred to as power laws (even if we sometimes abuse the strict definition of a power law).

Examples of power laws

Newman [N05] demonstrated how often power laws appear, using many different types of networks, ranging from word frequencies to the populations of cities.

Many cities have a small population; few cities have a huge population.

Gershgorin’s Discs


Theorem

Let B be an arbitrary n×n matrix. Then the eigenvalues λ of B are located in the union of the n discs

$$|\lambda - b_{kk}| \le \sum_{j \ne k} |b_{kj}|$$

For a proof see Demmel [D97], p.82.
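For a 0/1 adjacency matrix every disc is centered at 0 and the radius of disc k is the degree of vertex k, so the theorem only yields |λ| ≤ d_max. The sketch below illustrates how loose that can be; the helper `gershgorin_discs` and the star graph are hypothetical choices for illustration.

```python
import numpy as np

def gershgorin_discs(B):
    """Return (centers, radii); disc k is |z - B[k, k]| <= sum over j != k of |B[k, j]|."""
    B = np.asarray(B, dtype=float)
    centers = np.diag(B).copy()
    radii = np.abs(B).sum(axis=1) - np.abs(centers)
    return centers, radii

# Star graph with one hub (node 0) and 4 leaves; its eigenvalues are 2, 0, 0, 0, -2
A = np.zeros((5, 5))
A[0, 1:] = 1
A[1:, 0] = 1

centers, radii = gershgorin_discs(A)
bound = radii.max()                                    # Gershgorin: |lambda| <= d_max = 4
spectral_radius = np.abs(np.linalg.eigvalsh(A)).max()  # actual largest |lambda| = 2
```

Here the bound is twice the true spectral radius, and the gap grows with the hub degree (d_max versus its square root for a star).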

Gershgorin Discs

Bounds on the airports network (observe how loose they are)

Typical real world spectra

Panels: Political blogs, Airports

Top Eigenvalues

Zooming in on the top eigenvalues and plotting the rank vs. the eigenvalue on a log-log scale reveals that the top eigenvalues follow a power law [FFF99].

Some years later, Mihail & Papadimitriou [MP02] and Chung, Lu and Vu [CLV03] proved this fact.

Our idea

Simple & clear:

Use a low-rank approximation of A 3 to estimate the diagonal elements and the trace.

Suggests also a way of thinking:

Take advantage of special properties (e.g. power laws) to reduce the complexity of certain computational tasks in real-world networks.
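The low-rank idea can be illustrated by keeping only the eigenvalues of largest magnitude in the sum of cubes. This is a sketch under stated assumptions: it uses a dense eigensolver purely for illustration, the function name is hypothetical, and the complete graph $K_{10}$ is chosen only because its spectrum (9 once, -1 nine times) is known in closed form, not because it resembles a real-world network.

```python
import numpy as np

def triangles_from_top_eigenvalues(A, k):
    """Estimate delta(G) from the k eigenvalues of A with the largest magnitude."""
    lam = np.linalg.eigvalsh(A)
    top = lam[np.argsort(-np.abs(lam))[:k]]
    return (top ** 3).sum() / 6

n = 10
A = np.ones((n, n)) - np.eye(n)   # complete graph K_10 with C(10, 3) = 120 triangles
rank1_estimate = triangles_from_top_eigenvalues(A, 1)   # uses only lambda_1 = 9
exact = triangles_from_top_eigenvalues(A, n)            # all eigenvalues: exact count
```

With a single eigenvalue the estimate is 729/6 = 121.5 against an exact count of 120, an error of 1.25%.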

Summing up: Why does it work?

The near-symmetry of the spectrum around 0 for the bulk of the eigenvalues, i.e. all except the top ones, is the first main reason.

Cubing strongly amplifies this phenomenon!

Complexity Analysis

The main computational bottleneck, which determines the complexity, is the Lanczos method.

Lanczos runs in time linear in the number of nonzero entries of the matrix, i.e. the edges, assuming that we compute a small, constant number of eigenvalues.

Convergence of Lanczos is fast due to the eigenvalue power law (see Kaniel-Paige theory [GL89]).
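A Lanczos-based estimate can be sketched with SciPy's `eigsh`, which wraps ARPACK's implicitly restarted Lanczos method for sparse symmetric matrices. The algorithm slides themselves did not survive as text, so the stopping rule, the parameters `tol` and `max_rank`, and the function name below are assumptions for illustration rather than the author's exact procedure.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def eigen_triangle_sketch(A, tol=0.05, max_rank=20):
    """Estimate delta(G) from the top-magnitude eigenvalues of a sparse symmetric A."""
    n = A.shape[0]
    for k in range(1, min(max_rank, n - 1) + 1):
        # k eigenvalues of largest magnitude, computed with Lanczos iterations
        lam = eigsh(A, k=k, which="LM", return_eigenvectors=False)
        cubes = lam ** 3
        # assumed stopping rule: the smallest retained cube is a small fraction of the sum
        if k > 1 and np.abs(cubes).min() <= tol * abs(cubes.sum()):
            break
    return cubes.sum() / 6, k

n = 50
A = sp.csr_matrix(np.ones((n, n)) - np.eye(n))   # complete graph K_50: C(50, 3) = 19600 triangles
estimate, rank_used = eigen_triangle_sketch(A)
```

For simplicity the sketch recomputes the eigenvalues from scratch for every k; a practical implementation would reuse the Lanczos basis between iterations.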


Datasets

Competitor: Node Iterator

The Node Iterator algorithm considers each node in turn, looks at its neighbors, and checks how many pairs of them are connected to each other.

We report the results as the speedup that the EigenTriangle algorithm gives compared to the running time of the Node Iterator.
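The baseline described above can be sketched in a few lines. This is a minimal version assuming a dictionary-of-sets adjacency structure; the function name and the extra `checks` counter, which records how many neighbor pairs are examined, are illustrative additions.

```python
from itertools import combinations

def node_iterator(adj):
    """adj maps each node to the set of its neighbors (undirected, simple graph)."""
    hits = 0      # neighbor pairs that are themselves connected
    checks = 0    # neighbor pairs examined, i.e. the sum over v of C(d(v), 2)
    for v, nbrs in adj.items():
        for u, w in combinations(nbrs, 2):
            checks += 1
            if w in adj[u]:
                hits += 1
    return hits // 3, checks   # each triangle is hit once from each of its 3 nodes

# 4-cycle 0-1-2-3 plus the chord 0-2
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
triangles, checks = node_iterator(adj)
```

The number of pair checks grows with the square of the degrees, which is why high-degree nodes in skewed networks dominate the running time of this baseline.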


Results: #Eigenvalues vs. Speedup


Results: #Edges vs. Speedup

Main points

Some interesting facts about the two scatterplots:

The mean approximation rank required for at least 95% accuracy is 6.2.

Speedups are between 33.7x and 1159x.

The mean speedup is 250x.

Notice the increasing speedup as the size of the network grows.

Zooming in

Zooming in on this point

Evaluating the Local Counting Method

Pearson’s correlation coefficient ρ

Relative Reconstruction Error

$$RRE = \frac{1}{|V|} \sum_{i=1}^{|V|} \frac{|\delta(i) - \delta'(i)|}{\delta(i)}$$

where δ'(i) denotes the estimate of δ(i).

Political Blogs: RRE $7 \times 10^{-4}$, ρ 99.97%


#Eigenvalues vs. ρ for three networks

Observe how a low rank results in almost optimal results.

This holds for surprisingly many real world networks


Triangle Participation Law

Plots the number of triangles δ (x-axis) vs. the count of vertices that participate in δ triangles.

Panels: (a) EPINIONS, who-trusts-whom network; (b) ASN, social network; (c) HEP_TH, collaboration network

Degree Triangle Law

Plots the degree $d_i$ (x-axis) vs. the mean number of triangles that nodes with degree $d_i$ participate in.

Panels: Epinions, ASN
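The quantities behind both plots (the triangle participation counts and the mean triangle count per degree) can be tabulated from per-vertex triangle counts. The sketch below is illustrative only: the function name and toy graph are assumptions, and the log-log plotting used on the slides is omitted.

```python
from collections import Counter, defaultdict
from itertools import combinations

def triangle_laws(adj):
    """adj maps each vertex to the set of its neighbors (undirected, simple graph)."""
    delta = {v: sum(1 for u, w in combinations(n, 2) if w in adj[u])
             for v, n in adj.items()}
    # Triangle participation: triangle count -> number of vertices with that count
    participation = Counter(delta.values())
    # Degree-triangle: degree -> mean triangle count of the vertices with that degree
    by_degree = defaultdict(list)
    for v, n in adj.items():
        by_degree[len(n)].append(delta[v])
    degree_triangle = {d: sum(t) / len(t) for d, t in by_degree.items()}
    return participation, degree_triangle

# 4-cycle 0-1-2-3 plus the chord 0-2
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
participation, degree_triangle = triangle_laws(adj)
```

On real networks these two dictionaries would be the data series plotted on the x and y axes of the figures described above.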


Kronecker Graphs


This model was introduced in [LCKF05]. It is based on the simple operation of the Kronecker product to generate graphs that mimic real world networks.

Deterministic Kronecker Graphs: Kronecker Product of the adjacency matrix at the current step k with the initiator adjacency matrix (typically small).

Stochastic Kronecker Graphs: Kronecker Product of the matrix at the current step k with the initiator matrix. Initiator matrix contains probabilities.

For more details see [LF07].

Triangles in Kronecker Graphs

Some notation first:

A: n×n initiator adjacency matrix of the undirected, simple graph $G_A$

$B = A^{[k]}$: k-th Kronecker product

$\lambda = (\lambda_1, \ldots, \lambda_n)$: the eigenvalues of A

$\Delta(G_A)$, $\Delta(G_B)$: number of triangles of $G_A$, $G_B$

Theorem [KroneckerTRC]

$$\Delta(G_B) = 6^k \, \Delta(G_A)^{k+1}, \quad k \ge 0$$
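The closed form can be checked numerically for small k with `numpy.kron`. This is a minimal sketch assuming the convention implied by the k = 0 base case, namely that $A^{[0]} = A$ and each step multiplies once more by the initiator; the helper names and the 4-vertex initiator are hypothetical.

```python
import numpy as np

def kronecker_power(A, k):
    """Return A^[k]: A Kronecker-multiplied by the initiator k more times (k = 0 gives A)."""
    B = A.copy()
    for _ in range(k):
        B = np.kron(B, A)
    return B

def count_triangles(M):
    """Exact triangle count via trace(M^3) / 6."""
    return int(round(np.trace(M @ M @ M) / 6))

# Initiator: 4-cycle 0-1-2-3 plus the chord 0-2 (two triangles)
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=np.int64)

counts = [count_triangles(kronecker_power(A, k)) for k in range(3)]
predicted = [6 ** k * count_triangles(A) ** (k + 1) for k in range(3)]
```

For this initiator both lists are [2, 24, 288], on graphs with 4, 16 and 64 vertices respectively.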

Proof

We use induction on the number of recursion steps k.

For k=0 the theorem trivially holds.

Assume now that KroneckerTRC holds for some $r \ge 0$. Call $C = A^{[r]}$, $D = A^{[r+1]}$ and let $[\mu_i]_{i=1..s}$ be the eigenvalues of C. By the assumption

$$\Delta(G_C) = 6^r \, \Delta(G_A)^{r+1}$$

The eigenvalues of D are given by the Kronecker product of the eigenvalues of C and A, i.e. all products $\mu_i \lambda_j$. By the EigenTriangle theorem, the number of triangles in D is given by:

$$\begin{aligned}
\Delta(G_D) &= \frac{1}{6}\sum_{i=1}^{s}\sum_{j=1}^{n}\mu_i^3\lambda_j^3
= \frac{1}{6}\sum_{i=1}^{s}\mu_i^3\sum_{j=1}^{n}\lambda_j^3 \\
&= \frac{6\,\Delta(G_A)}{6}\sum_{i=1}^{s}\mu_i^3
= 6\,\Delta(G_A)\,\Delta(G_C)
= 6^{r+1}\,\Delta(G_A)^{r+2}
\end{aligned}$$

Q.E.D.


Theoretical Challenge I:

Spectra of real world networks

Can we prove things about the distribution of the eigenvalues, adopting a random graph model such as the expected degree model G(w) [CLV03]?

An analog to Wigner’s semicircle law for random Erdos-Renyi graphs (see Furedi-Komlos [FK81]).

[Figure: spectrum of $G(40, \frac{1}{2})$ over 100000 iterations [S07]]

Theoretical Challenge I:

Spectra of real world networks

Empirically, the rest of the spectrum:

Something about


Theoretical Challenge II:

Eigenvectors of real world networks

Things are even “worse” than in the case of spectra: very little is known about the eigenvectors.

Related work:

See [P08] for random graphs.


Theoretical Challenge III:

Degree Triangle Law

Prove the pattern we saw using the expected degree random graph model G(w) (see [S04]).

Conjecture:

The relationship we observed probably appears for some values of the slope of the degree distribution. Further recent experiments showed that for some graphs this pattern does not hold.

Experimental Challenge I:

Compare with Streaming Methods

Streaming or semi-streaming methods perform one or O(1) passes over the graph.

[YKS02]

[BFLSS06]

[BPCG08]

Common Underlying Idea: Sophisticated sampling methods

Implement and compare.


Practical Challenge I:

Triangles in Large Scale Graph Mining

Many gigabyte- and petabyte-sized graphs.

How to handle these graphs?

HADOOP

The EigenTriangle algorithms are based just on simple matrix-vector multiplications.

Easy to parallelize on all sorts of architectures (distributed memory, shared memory).

See [DHV93] for the details.

PEGASUS: Peta-Graph Mining from the Triangle perspective

Soon…

Stay tuned!

Ongoing work with U Kang and Christos Faloutsos, in collaboration with Yahoo! Research.

Among others: implement the EigenTriangle algorithms in HADOOP and compare them to other methods.

Find outliers with respect to triangles in graphs with many billions of edges.

Curious about:

Acknowledgements

Christos Faloutsos

Yiannis Koutis

For the helpful discussions


Maria Tsiarli

For the PEGASUS logo


References

[WF94] Wasserman, Faust: “Social Network Analysis: Methods and Applications (Structural Analysis in the Social Sciences)”

[EM02] Eckmann, Moses: “Curvature of co-links uncovers hidden thematic layers in the World Wide Web”

[BPCG08] Becchetti, Boldi, Castillo, Gionis: “Efficient Semi-Streaming Algorithms for Local Triangle Counting in Massive Graphs”

[FKP02] Fabrikant, Koutsoupias, Papadimitriou: “Heuristically Optimized Trade-offs: A New Paradigm for Power Laws in the Internet”

[N05] Newman: “Power laws, Pareto distributions and Zipf's law”

[M04] Mitzenmacher: “A brief history of generative models for power law and lognormal distributions”

[FK81] Furedi, Komlos: “Eigenvalues of random symmetric matrices”

[S04] Sergi: “Random graph model with power-law distributed triangle subgraphs”

[D97] Demmel: “Applied Numerical Linear Algebra”

[LCKF05] Leskovec, Chakrabarti, Kleinberg, Faloutsos: “Realistic, Mathematically Tractable Graph Generation and Evolution using Kronecker Multiplication”

[LF07] Leskovec, Faloutsos: “Scalable Modeling of Real Graphs using Kronecker Multiplication”

[FFF99] Faloutsos, Faloutsos, Faloutsos: “On power-law relationships of the Internet topology”

[MP02] Mihail, Papadimitriou: “On the Eigenvalue Power Law”

[CLV03] Chung, Lu, Vu: “Spectra of Random Graphs with given expected degrees”

[YKS02] Bar-Yossef, Kumar, Sivakumar: “Reductions in streaming algorithms, with an application to counting triangles in graphs”

[GL89] Golub, Van Loan: “Matrix Computations”

[BFLSS06] Buriol, Frahling, Leonardi, Spaccamela, Sohler: “Counting triangles in data streams”

[DHV93] Demmel, Heath, Vorst: “Parallel Numerical Linear Algebra”

[YPSB05] Ye, Peyser, Spencer, Bader: “Commensurate distances and similar motifs in genetic congruence and protein interaction networks in yeast”

[P08] Mitra: “Entrywise Bounds for Eigenvectors of Random Graphs”

[FDBV01] Farkas, Derenyi, Barabasi, Vicsek: “Spectra of "real-world" graphs: Beyond the semi-circle law”

[S07] Spielman’s “Spectral Graph Theory and its Applications” class (YALE): http://www.cs.yale.edu/homes/spielman/eigs/

[F08] Faloutsos’ “Multimedia Databases and Data Mining” class (CMU): http://www.cs.cmu.edu/~christos/courses/826.S08

For more references, see also the paper: http://www.cs.cmu.edu/~ctsourak/tsourICDM08.pdf
