Topological Analysis in PPI Networks
& Network Motif Discovery
Jin Chen
MSU CSE891-001
2012 Fall
• Topological properties of real networks
– Degree distribution (power-law & exponential)
– Path distance (small-world, non-small-world)
• Network motif
– Definitions
– Algorithms
WWW has power-law degree distribution
Distribution of links on the WWW:
a) Outgoing links. The tail of the distribution follows P(k) ≈ k^(-r), with r_out = 2.45
b) Incoming links, with r_in = 2.1
c) Average of the shortest path between two documents as a function of system size
The degree distribution scales as a power-law
R. Albert, H. Jeong, A.-L. Barabási, Nature 401, 130 (1999)
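A minimal sketch of how a degree distribution P(k) could be tabulated from an edge list, with a rough estimate of the exponent r. The function names are hypothetical, and the least-squares fit on log-log axes is only a quick illustrative estimate, not the fitting procedure used in the cited paper.

```python
import math
from collections import Counter

def degree_distribution(edges):
    """Return {k: fraction of nodes with degree k} for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    return {k: c / n for k, c in sorted(Counter(degree.values()).items())}

def loglog_slope(dist):
    """Least-squares slope of log P(k) against log k.

    For P(k) ~ k^(-r) the slope is approximately -r.
    """
    xs = [math.log(k) for k in dist]
    ys = [math.log(p) for p in dist.values()]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

On a distribution that follows k^(-2) exactly, such as {1: 1.0, 2: 0.25, 4: 0.0625}, the slope comes out at about -2. An exponential distribution would instead appear as a straight line on semi-log axes (log P(k) against k).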
Power grid has exponential degree distribution
R. Albert et al., Phys. Rev. E 69, 025103(R) (2004)
Metabolic networks have a power-law degree distribution
Archaeoglobus fulgidus
Caenorhabditis elegans
E. coli
All
H. Jeong et al., Nature 407, 651 (2000)
The regulatory network of E. coli has a power-law out-degree distribution and an exponential in-degree distribution
The distribution of the number of transcription factors controlling a gene is exponential
The distribution of the number of genes regulated by a transcription factor is power-law with an average of ~5
from RegulonDB (Salgado et al. 2006)
Shen-Orr et al. Nature Genetics 31, 64 - 68 (2002)
• A small-world network is a network in which most nodes are not neighbors of one another, but most nodes can be reached from every other node in a small number of hops or steps
• A small-world network is defined by:
L ∝ log N
where L is the distance between two randomly chosen nodes and N is the number of nodes in the network
• Small-world properties are found in many real-world phenomena
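As a rough illustration, L can be estimated on a small unweighted network by running a breadth-first search from every node and averaging the hop counts. This is a sketch only: the function name and the adjacency-dictionary input format are assumptions, and all-pairs BFS is practical only for modest network sizes.

```python
from collections import deque

def average_path_length(adj):
    """Mean shortest-path length over all ordered pairs of reachable nodes.

    adj maps each node to an iterable of its neighbours (unweighted graph).
    """
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0
```

Computing this value for networks of increasing size N and plotting it against log N is one way to check whether the growth looks logarithmic.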
Six degrees of separation: everyone is, on average, approximately six steps away from any other person on Earth
But if two people are linked only when they knew each other, then the number of degrees of separation between Albert Einstein and Alexander the Great is almost certainly greater than 30
http://en.wikipedia.org/wiki/Six_degrees_of_separation
Relationship between power-law and small-world
• If a network's degree distribution can be fit with a power law, this is taken as a sign that the network is small-world
• But a small-world network need not have a power-law degree distribution (e.g. a clique)
• A.-L. Barabási hypothesized that the prevalence of small-world networks in biological systems may reflect an evolutionary advantage of such an architecture
• One possibility is that small-world networks are more robust to perturbations than other network architectures
• This would provide an advantage to biological systems that are subject to damage by mutation or viral infection
True PPIs fit small-world, false PPIs distributed randomly
• Hypothesis: true PPIs fit the pattern of a small-world network; false PPIs are distributed randomly in the network
• By studying the local cohesiveness of each PPI, true and false PPIs can be separated
– Incorporate a set of clustering coefficient measures of neighborhood cohesiveness
– Look for “network motifs” as an index of how well the PPIs are locally connected
Debra S. Goldberg, Frederick P. Roth (2003). PNAS, 100(8) 4372–4376.
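To make "neighborhood cohesiveness" concrete, here is a minimal sketch of the standard local clustering coefficient of a node. It is an assumption that this plain coefficient is a fair stand-in: the cited paper uses its own set of clustering coefficient measures, which may differ. The function name and input format are hypothetical.

```python
from itertools import combinations

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours of `node` that are themselves linked.

    adj maps each node to the set of its neighbours (undirected graph).
    """
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # fewer than two neighbours: no pairs to compare
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))
```

Under the hypothesis on this slide, an interaction whose endpoints sit in well-connected neighborhoods would be more likely to be a true PPI than one whose endpoints share no neighbors.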
• “Network Motifs: Simple Building Blocks of Complex Networks”
– Focused on directed, cyclic subgraphs of 3 or 4 nodes in yeast (no self-loops)
– Used exhaustive enumeration and random networks as a comparison
Milo et al. Science (2002) Vol. 298 no. 5594 pp. 824-827
• Of the 13 possible 3-node subgraphs, one predominates in gene expression networks (the feed-forward loop)
• Of the 199 possible 4-node subgraphs, one predominates (the bi-fan)
[Figure: feed-forward loop on nodes X, Y, Z; bi-fan on nodes X, Y, Z, W]
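A minimal sketch of exhaustive counting for the feed-forward loop, taking the usual definition: X regulates Y, Y regulates Z, and X also regulates Z directly. The function name is hypothetical. Occurrences are counted as non-induced subgraphs, and deciding whether the pattern is a motif would additionally require comparing the count against randomized networks.

```python
def count_feed_forward_loops(edges):
    """Count ordered triples (x, y, z) with edges x->y, y->z and x->z."""
    succ = {}
    for u, v in edges:
        if u != v:  # self-loops are ignored
            succ.setdefault(u, set()).add(v)
    count = 0
    for x, x_targets in succ.items():
        for y in x_targets:
            for z in succ.get(y, ()):
                if z in x_targets:  # direct edge x->z closes the loop
                    count += 1
    return count
```

A directed 3-cycle (X to Y, Y to Z, Z to X) contains no feed-forward loop, which is why the two 3-node patterns are counted separately.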
• Efficient sampling algorithm for detecting network motifs
– Focused on directed, cyclic graphs
– Used a sampling approach to estimate motif frequency
– Found motifs of size 6 & 7
Kashtan et al., Bioinformatics (2004), Vol. 20, Issue 11, pp. 1746-1758
• Given a PPI network
– Unlabelled & undirected subgraphs
– Find repeated and unique motifs of size 2 to K (5 to 25)
• Mining Maximal Frequent Subgraphs from Graph Databases (SPIN, FFSM)
– Looks for frequent labelled subgraphs from a database of graphs
– Counts whether a subgraph occurs at least once in a graph
Huan et al. SIGKDD (2004)
1. Number of motifs increases exponentially with size
2. Motifs frequency is not A priori
3. Graph isomorphism has no known polynomial-time solution
Concepts of frequency
• f1: allow arbitrary overlaps of nodes and edges (NOT downward closed!)
• f2: allow overlaps of nodes but edges disjoint
• f3: no overlap allowed (edge and node-disjoint)
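The three frequency concepts can be compared on a toy case. The sketch below counts occurrences of the 3-node path in a small graph. All names are hypothetical, occurrences are non-induced, and f2 and f3 are computed with a greedy pass, which gives only a lower bound on the true maximum number of disjoint occurrences (finding the exact maximum is hard in general).

```python
from itertools import combinations

def path3_occurrences(edges):
    """Enumerate every 3-node path a-center-b as (node set, edge set)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    occs = []
    for center in sorted(adj):
        for a, b in combinations(sorted(adj[center]), 2):
            nodes = frozenset((a, center, b))
            used = frozenset((frozenset((a, center)), frozenset((center, b))))
            occs.append((nodes, used))
    return occs

def greedy_disjoint(occs, key):
    """Greedily keep occurrences whose key sets (nodes or edges) do not overlap."""
    taken, count = set(), 0
    for occ in occs:
        items = key(occ)
        if taken.isdisjoint(items):
            taken |= items
            count += 1
    return count

def frequencies(edges):
    """Return (f1, f2, f3) for the 3-node path pattern."""
    occs = path3_occurrences(edges)
    f1 = len(occs)                                  # arbitrary overlap
    f2 = greedy_disjoint(occs, key=lambda o: o[1])  # edge-disjoint
    f3 = greedy_disjoint(occs, key=lambda o: o[0])  # node- and edge-disjoint
    return f1, f2, f3
```

For a star with center 0 and four leaves, frequencies([(0, 1), (0, 2), (0, 3), (0, 4)]) gives f1 = 6, although the graph has only 4 edges. A larger pattern is then more frequent than the edge it contains, which is the failure of downward closure noted for f1.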
• Input a Protein-Protein Interaction (PPI) network G
– K : maximal motif size
– F : frequency threshold
– S : uniqueness threshold
• Output set U of frequent and unique motifs of size 3 to K
• Since motifs are small (2 to 25 nodes), use adjacency matrices. Further, represent motifs as Canonical Adjacency Matrices (CAM)
Chen et al., SIGKDD 2006
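A minimal sketch of the idea behind a canonical adjacency matrix: two isomorphic graphs should map to the same code regardless of how their nodes are labelled. The convention used here (the lexicographically largest reading of the lower-triangular entries over all node orderings) is an assumption for illustration and may differ from the exact CAM definition in the cited paper. The brute force over permutations is feasible only for very small graphs.

```python
from itertools import permutations

def canonical_code(nodes, edges):
    """Largest lower-triangular adjacency code over all node orderings."""
    edge_set = {frozenset(e) for e in edges}
    best = None
    for order in permutations(nodes):
        # Read the lower triangle row by row: entries (i, j) with j < i.
        code = tuple(
            1 if frozenset((order[i], order[j])) in edge_set else 0
            for i in range(len(order))
            for j in range(i)
        )
        if best is None or code > best:
            best = code
    return best
```

Two subgraphs can then be tested for isomorphism by comparing their codes, so candidate motifs can be deduplicated with a plain set or dictionary lookup.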
• Given a graph G
• Let K = 5 (max motif size)
• Let F = 2 (min frequency)
• Let S = 0.95 (uniqueness threshold)
[Figure: example graph G on nodes 1 to 5]
Find all subgraphs of size 2 to 5.
[Fig 2. Size 2 to 5 trees: t_2, t_3, t_4_1, t_4_2, t_5_1, t_5_2, t_5_3]
Occurrences of t_4_1 in G.
[Figure: copies of G (nodes 1 to 5), each highlighting one occurrence of t_4_1]
Tree   t_2  t_3  t_4_1  t_4_2  t_5_1  t_5_2  t_5_3
Freq.  7    13   6      17     1      5      7
Frequency threshold: F = 2
Remaining frequent trees
T_2 = {t_2}
T_3 = {t_3}
T_4 = {t_4_1, t_4_2}
T_5 = {t_5_2, t_5_3}
Use Repeated Size-k Trees to Partition Graph
Take each tree in T_k and use it to partition G (e.g. T_4)
[Figure: GD_4, the set of subgraphs of G (nodes 1 to 5) obtained by partitioning G with the trees in T_4]
Perform graph join operation to find repeated size-k graphs
[Figure: trees t_4_1 and t_4_2]
Perform graph join operation to find repeated size-k graphs
Generate all k-node, k-1 edge graphs from each graph in T_k (i.e. 4-node, 3-edge subgraphs from T_4).
[Figure: t_4_1 and t_4_2 with the generated graphs h_1, h_2, h_3, h_4, h_5]
Perform graph join operation to find repeated size-k graphs
Join each tree with its cousins to produce frequent motif candidates C_k.
[Figure: C_4, produced by joining t_4_1 and t_4_2 with their cousins h_1 to h_5]
Perform graph join operation to find repeated size-k graphs
Count the frequency of each graph in C_k in GD_k.
[Figure: GD_4 with candidate graphs g_1_1 and g_1_2; frequency labels F = 4 and F = 2]
Perform graph join operation to find repeated size-k graphs.
Generate k-node, k+1 edge graphs from k-node, k-edge graphs.
[Figure: edge move and merge on g_1_2 and h_6, giving g_2 with F = 2 in GD_4]
Type I: Direct Cousin
h is isomorphic to a subgraph g' which has the same number of nodes and edges as g, and g != h.
[Figure: h is a Type I cousin of g because h is isomorphic to g']
[Figure: g and h matched against the subgraphs G_4_1 to G_4_5 in GD_4]
Type II: Twin Cousin
h is isomorphic to a subgraph g.
[Figure: h is isomorphic to g]
Type III: Distant Cousin
h is a disconnected subgraph of g.
[Figure: examples of h as a disconnected subgraph of g]
• Saves time when counting graph frequency
• GD k partitions the network into several subgraphs
• If the isomorphism search can be limited to a subset of those subgraphs, time is saved
Determine subgraph frequency in random networks
• A frequent subgraph may appear frequently by chance
• To determine the significance of a subgraph, generate random networks with the same number of nodes and the same number of edges
• Also impose the constraint that each node must have the same number of neighbors as its counterpart in the real network
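One common way to build such random networks is repeated edge swapping: pick two edges a-b and c-d and rewire them to a-d and c-b, which leaves every node's degree unchanged. The sketch below assumes a simple undirected graph with at least two edges. The function name, the number of swaps, and the seed handling are illustrative choices, not details taken from the slides.

```python
import random

def randomize_preserving_degrees(edges, n_swaps, seed=0):
    """Return a rewired copy of `edges` in which every node keeps its degree."""
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # consider both orientations of the second edge
        if len({a, b, c, d}) < 4:
            continue  # a shared endpoint would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # swap would create a duplicate edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list
```

Counting a subgraph in many such randomized copies and comparing against its count in the real network gives an estimate of how often it would appear by chance.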
• Uetz dataset: 957 PPIs, 1004 proteins
– In budding yeast
• MIPS CYGD dataset : 10199 PPIs, 4341 proteins
– Also in budding yeast
• Compared with
– Exhaustive enumeration
– Sampling
– FPF
[Figure: results plot annotated with ~2.8 hrs, F = 50, U = 0.95]