Optimal Network Alignment
with Graphlet Degree Vectors
Tijana Milenković
(Department of Computing, Imperial College London && Department of
Computer Science, University of California)
Weng Leong Ng
(Department of Computer Science, University of California),
Wayne Hayes
(Department of Computer Science, University of California && Department
of Mathematics, Imperial College London)
Nataša Pržulj
(Department of Computing, Imperial College London)
Cancer Informatics 2010
Presented by: Lila Shnaiderman
Motivation
• Lately, advances in experimental techniques:
– yeast two-hybrid assay,
– mass spectrometry of purified complexes,
– genome-wide chromatin immunoprecipitation,
– etc.
• So, increasing amounts of biological network data
are becoming available!
• Comparative analyses of biological networks are expected
to have as large an impact as comparative genomics on:
– understanding of biology
– evolution
– disease
• So, meaningful network comparison across species
has become one of the foremost problems in
evolutionary and systems biology!
Background
• Subgraph isomorphism problem:
– Does one graph exist as an exact subgraph of another
graph?
– NP-complete
– So, exact network comparison is computationally
infeasible for large networks…
• Network alignment:
– The most common network comparison method.
– A more general problem:
• Find the best way to “fit” a graph into another graph
(not necessarily as an exact subgraph)
• Unclear:
– how to guide the alignment process
– how to measure the “goodness” of an inexact fit
– So, heuristic strategies must be sought
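To make the infeasibility concrete, here is a minimal brute-force sketch of the subgraph test. It is an illustration only, not part of H-GRAAL: the function name `is_subgraph_isomorphic` is hypothetical, and the check is the non-induced variant (every edge of the small graph must map to an edge of the big graph). It tries every injective node mapping, so the running time grows factorially with network size.

```python
from itertools import permutations

def is_subgraph_isomorphic(small_nodes, small_edges, big_nodes, big_edges):
    """Brute force: try every injective mapping of small_nodes into big_nodes."""
    big = {frozenset(e) for e in big_edges}
    for image in permutations(big_nodes, len(small_nodes)):
        mapping = dict(zip(small_nodes, image))
        # Accept the mapping if every small edge lands on a big edge.
        if all(frozenset((mapping[a], mapping[b])) in big for a, b in small_edges):
            return True
    return False

# A 3-node path fits into a triangle, but a triangle does not fit into a path.
path_nodes, path_edges = [1, 2, 3], [(1, 2), (2, 3)]
tri_nodes, tri_edges = ["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")]
print(is_subgraph_isomorphic(path_nodes, path_edges, tri_nodes, tri_edges))
print(is_subgraph_isomorphic(tri_nodes, tri_edges, path_nodes, path_edges))
```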
Background – alignment types
• Local alignment:
– The majority of existing methods.
– Matches a small subnetwork from one network to one or more
subnetworks in another network.
– Can be ambiguous…
• Global alignment:
– Measures the overall similarity between two networks.
– Aligns every node in the smaller network to exactly one node
in the larger network.
– Most existing methods incorporate some a priori information
external to network topology
• like protein sequence similarities in PPI networks, etc.
• Best known global network alignment algorithm based
solely on network topology:
– GRAph ALigner (GRAAL): uses a heuristic search strategy to
quickly find approximate alignments
Current solution: H-GRAAL
• Hungarian-algorithm based GRAAL
• More computationally expensive than GRAAL
• Guaranteed to find optimal alignments relative to
any fixed, deterministic cost function.
• Relies solely and explicitly on a strong and
direct measure of network topological similarity.
• Applicable to any type of network
• Allows knowledge to be transferred between aligned
networks.
Graphlet degree vectors (1)
• Graphlet: a small connected induced subgraph of
a larger network.
[Figure: the 2- to 4-node graphlets G0 to G8, with their orbits numbered 0 to 14]
Graphlet degree vectors (2)
• Graphlet degree vector (GDV) of node V: for each
orbit, counts the number of graphlets that the node
touches at that orbit (for all graphlets on 2 to 5 nodes).
[Figure: example network around node v, orbit 0]
Orbit    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
GDV(V)   4  2  5  1  0  4  0  2  1  0  0  2  0  0  0
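As a rough illustration of how the first entries of such a vector arise, the sketch below counts orbits 0 to 3 only (the edge, the two positions in a 3-node path, and the triangle). The helper names `build_adj` and `orbit_counts_0_to_3` are hypothetical, and the example graph is made up so that its counts match the first four entries of the vector shown on the slide; it is not the network from the figure.

```python
from itertools import combinations

def build_adj(edges):
    """Adjacency sets for an undirected graph given as a list of edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def orbit_counts_0_to_3(adj, v):
    """Counts for orbits 0-3 of node v (graphlets on 2 and 3 nodes only)."""
    nbrs = adj[v]
    orbit0 = len(nbrs)                      # orbit 0: v is an end of an edge (degree)
    orbit2 = orbit3 = 0
    for a, b in combinations(nbrs, 2):
        if b in adj[a]:
            orbit3 += 1                     # orbit 3: v is in a triangle
        else:
            orbit2 += 1                     # orbit 2: v is the middle of a 3-node path
    # orbit 1: v is an end of an induced 3-node path v - a - b
    orbit1 = sum(1 for a in nbrs for b in adj[a] if b != v and b not in nbrs)
    return [orbit0, orbit1, orbit2, orbit3]

# Hypothetical example graph (assumption, not the slide's network).
edges = [("v", "a"), ("v", "b"), ("v", "c"), ("v", "d"),
         ("a", "b"), ("c", "e"), ("d", "f")]
print(orbit_counts_0_to_3(build_adj(edges), "v"))  # -> [4, 2, 5, 1]
```

Counting the remaining orbits (up to 72 for 5-node graphlets) follows the same idea but needs an enumeration of all 4- and 5-node induced subgraphs around the node.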
Graphlet degree vectors (3)
[Figure: example network around node v, orbits 1 and 2]
Graphlet degree vectors (4)
[Figure: example network around node v, orbits 1 and 2 (continued)]
Graphlet degree vectors (4)
[Figure: example network around node v, orbits 3, 4 and 5]
Graphlet degree vectors (5)
[Figure: example network around node v, orbits 4 and 5]
Graphlet degree vectors (6)
[Figure: example network around node v, orbits 6, 7 and 8]
Graphlet degree vectors (7)
[Figure: example network around node v, orbits 9, 10 and 11]
• What is the degree of node V (according to the vector)?
• There are 73 different orbits across all 2- to 5-node graphlets.
• The vector GDV(V) is the signature of node V.
Degree Vector - Signature
• Many real-world networks:
– Have a small-world nature
• So, the graphlet degree vector is an effective measure:
– Looks at network distances of up to 4 around a node
– Captures a large portion of network topology
• Thus, comparing two signatures:
– Is a highly constraining measure of local
topological similarity between nodes.
Signature similarity
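• For a node u ∈ G:
– u_i is the ith coordinate of its signature vector.
– Distance between the ith orbits of nodes u and v:
D_i(u,v) = w_i × |log(u_i + 1) − log(v_i + 1)| / log(max{u_i, v_i} + 2)
– w_i is the weight of orbit i.
• Accounts for dependencies between orbits
• Gives higher weights to orbits that are not affected by many
other orbits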
• Questions:
– Why log?
– Why “+1”?
Distance and Similarity
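• Total distance:
D(u,v) = Σ_i D_i(u,v) / Σ_i w_i, where D_i(u,v) is the distance for
orbit i, w_i is its weight, and the sums run over all 73 orbits
– in [0,1)
– 0 means: u and v have identical signatures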
• Similarity:
S(u,v) = 1-D(u,v)
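A minimal sketch of the signature distance and similarity, assuming the per-orbit distance D_i(u,v) = w_i × |log(u_i + 1) − log(v_i + 1)| / log(max{u_i, v_i} + 2) and the total distance D(u,v) = Σ_i D_i / Σ_i w_i. The function names are hypothetical, and the orbit weights are passed in by the caller because the slides do not list them.

```python
import math

def orbit_distance(ui, vi, wi):
    """Weighted, log-scaled distance between the counts of a single orbit."""
    return wi * abs(math.log(ui + 1) - math.log(vi + 1)) / math.log(max(ui, vi) + 2)

def signature_distance(sig_u, sig_v, weights):
    """Total distance D(u,v): weighted per-orbit distances, normalised by total weight."""
    total = sum(orbit_distance(a, b, w) for a, b, w in zip(sig_u, sig_v, weights))
    return total / sum(weights)

def signature_similarity(sig_u, sig_v, weights):
    """Similarity S(u,v) = 1 - D(u,v)."""
    return 1.0 - signature_distance(sig_u, sig_v, weights)

# Toy 4-orbit signatures with equal weights (assumed values, for illustration only).
u = [4, 2, 5, 1]
v = [3, 0, 2, 1]
w = [1.0, 1.0, 1.0, 1.0]
print(signature_distance(u, v, w), signature_similarity(u, v, w))
```

The "+1" keeps the logarithm defined when an orbit count is 0, and the logarithm damps the effect of very large counts so that a few high-count orbits do not dominate the distance.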
H-GRAAL algorithm-definitions
• G1 and G2 are networks:
– |V(G1)|<|V(G2)|
• Alignment of G1 to G2:
– set of ordered pairs (u,v), u ∈ V (G1) and
v ∈ V (G2)
– no two ordered pairs share the same G1-node
or the same G2-node.
– Each pair is called an aligned pair.
• Maximum alignment:
– Every G1-node is in some aligned pair
– From now on:
alignment=maximum alignment
H-GRAAL algorithm
• H-GRAAL:
– Hungarian-algorithm-based GRAph Aligner
• Produces an alignment:
– of minimum total cost between networks
– total cost: summed over all aligned pairs
– aligned pair cost: based on signature similarity
• The cost of aligning u and v:
C(u,v) = 2 − ((1 − α) × T(u,v) + α × S(u,v)),
where T(u,v) = (deg(u) + deg(v)) / (max_deg(G1) + max_deg(G2))
– Favors alignment of the densest parts of the networks.
– Reduced as the degrees of both nodes increase: higher
degree nodes with similar signatures provide a tighter
constraint.
– α ∈ [0, 1]: weights the cost-function contribution of the
node signature similarity S(u,v) between u and v.
– 1 − α: weights the contribution of the node degrees.
Alignment Cost
• Cost=0: a pair of topologically identical nodes u and v
• Cost close to 2: a pair of topologically very different nodes.
• Any problem with this formula?
• T(u,v) for most node pairs is very low:
– There are only a few hubs (highly linked nodes),
– so max_deg(G1) and max_deg(G2) are much larger
than deg(u) and deg(v).
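A small sketch of the pair cost, assuming the form C(u,v) = 2 − ((1 − α) × T(u,v) + α × S(u,v)) with T(u,v) = (deg(u) + deg(v)) / (max_deg(G1) + max_deg(G2)). The function and parameter names are hypothetical, and the numbers in the example are invented to show how small T becomes when hubs inflate the maximum degrees.

```python
def degree_term(deg_u, deg_v, max_deg_g1, max_deg_g2):
    """T(u,v): summed degrees of the pair relative to the two maximum degrees."""
    return (deg_u + deg_v) / (max_deg_g1 + max_deg_g2)

def alignment_cost(deg_u, deg_v, max_deg_g1, max_deg_g2, similarity, alpha):
    """Cost of aligning u and v; lower is better. alpha weights signature similarity."""
    t = degree_term(deg_u, deg_v, max_deg_g1, max_deg_g2)
    return 2 - ((1 - alpha) * t + alpha * similarity)

# Two low-degree nodes in networks whose hubs have degree 200 and 300 (assumed values):
# T is tiny, so the degree term contributes almost nothing to the cost.
print(degree_term(3, 4, 200, 300))
print(alignment_cost(3, 4, 200, 300, similarity=0.9, alpha=0.5))
```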
Hungarian Algorithm
• Solves the assignment problem in polynomial
time:
– Create a bipartite graph with parts V(G1) and V(G2).
– Edge (u,v) from V(G1) to V(G2): labeled with the
node alignment cost.
– Find a minimum-cost matching that covers every
node of V(G1).
• More than one optimal alignment is possible:
– the particular alignment found is highly dependent
on the implementation details of the underlying
Hungarian algorithm.
– For example: the order of presenting the nodes to
the algorithm
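The assignment step can be sketched as follows. This is a textbook potentials-based variant of the Hungarian algorithm written for illustration, not the authors' implementation; the function name `hungarian` and the dense cost-matrix input (rows for G1-nodes, columns for G2-nodes, with no more rows than columns) are assumptions.

```python
def hungarian(cost):
    """Minimum-cost assignment of every row to a distinct column (rows <= columns)."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)          # row potentials (1-based)
    v = [0.0] * (m + 1)          # column potentials (1-based)
    p = [0] * (m + 1)            # p[j] = row matched to column j, 0 if free
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached a free column: augment
                break
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = {p[j] - 1: j - 1 for j in range(1, m + 1) if p[j] != 0}
    total = sum(cost[i][j] for i, j in match.items())
    return match, total

# Toy 3x3 cost matrix (assumed values) with a unique optimum.
match, total = hungarian([[4, 1, 3], [2, 0, 5], [3, 2, 2]])
print(match, total)
```

When several assignments share the minimum total cost, which one is returned depends on details such as the row order, which is the implementation dependence the slide mentions.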
Finding a Few Optimal Alignments
• We can learn about all possible optimal matchings.
• Make H-GRAAL give more alignments:
– “Remove” (u,v): raise the alignment cost of a node pair (u,v)
in A0 (the first optimal alignment found) to +∞
– Run H-GRAAL again
• If the alignment found has a higher cost than A0, “remove” a
different edge instead.
• If, after trying to “remove” every edge, no other alignment
with the optimal cost is found, no more optimal alignments exist.
• This process has too high a complexity…
– O(|V(G1)|^3 × |E(G1)|)
– A faster variant exists, O(|V(G1)|^2 × |E(G1)|) (based on
the dynamic Hungarian algorithm).
– My remark: still very slow (can take months…)
Few Optimal Alignments algorithm
• Optimizing aligned pair:
– Appears in at least one optimal alignment.
• The set of optimizing pairs:
– Can be computed in at worst O(n^4) time.
– Can be easily parallelized.
• My remark: too slow…
Few Optimal Alignments - Analysis
• Significance of an aligned pair:
– Determined by the number of optimizing pairs
per G1-node u.
– If (u,v) is the only optimizing pair for u,
every optimal alignment contains (u,v),
i.e., (u,v) is highly significant.
• Core alignment:
– the set of all such special optimizing pairs.
– Large core alignment means: stable
alignment.
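The notions of optimizing pairs and core alignment can be illustrated on a tiny example. The sketch below enumerates every alignment by brute force, which is feasible only for a handful of nodes and is not the O(n^4) procedure from the paper; the function names and the toy cost matrix are assumptions.

```python
from itertools import permutations

def optimal_alignments(cost):
    """All minimum-cost assignments of rows (G1-nodes) to distinct columns (G2-nodes)."""
    n, m = len(cost), len(cost[0])
    best, found = None, []
    for cols in permutations(range(m), n):
        total = sum(cost[i][cols[i]] for i in range(n))
        if best is None or total < best:
            best, found = total, [cols]
        elif total == best:
            found.append(cols)
    return best, found

def optimizing_pairs(cost):
    """Pairs (u, v) that appear in at least one optimal alignment."""
    _, alignments = optimal_alignments(cost)
    return {(u, v) for cols in alignments for u, v in enumerate(cols)}

def core_alignment(cost):
    """Pairs (u, v) where v is the only optimizing partner of u."""
    partners = {}
    for u, v in optimizing_pairs(cost):
        partners.setdefault(u, set()).add(v)
    return {(u, next(iter(vs))) for u, vs in partners.items() if len(vs) == 1}

# Integer toy costs: row 0 has a single best partner, rows 1 and 2 are interchangeable.
cost = [[0, 1, 1],
        [1, 0, 0],
        [1, 0, 0]]
print(sorted(optimizing_pairs(cost)))
print(core_alignment(cost))  # -> {(0, 0)}
```

Integer costs are used here so that ties can be detected with exact equality; with floating-point costs a tolerance would be needed.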
Measures of alignment quality (1)
• Edge correctness (EC) –
– percentage of edges in one graph that are aligned to
edges in the other graph.
• The measures below can be computed only when the
“true alignment” is known:
• Node correctness (NC) –
– percentage of nodes in one network that are correctly
aligned to nodes in the other network
• Interaction correctness (IC) –
– percentage of interactions that are aligned correctly
• IC is stricter than EC:
– EC does not require that the alignment partners are
the correct ones
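A minimal sketch of the three measures, under stated assumptions: an alignment is a dict from G1-nodes to G2-nodes, edges are undirected pairs, and an interaction counts as correctly aligned only if both endpoints are mapped to their true partners and the image is an edge of the second network. All function names are hypothetical.

```python
def edge_correctness(edges1, edges2, alignment):
    """EC: percentage of G1 edges whose aligned endpoints form an edge in G2."""
    e2 = {frozenset(e) for e in edges2}
    hits = sum(1 for a, b in edges1
               if frozenset((alignment[a], alignment[b])) in e2)
    return 100.0 * hits / len(edges1)

def node_correctness(alignment, true_alignment):
    """NC: percentage of G1-nodes mapped to their true partner."""
    hits = sum(1 for u, v in alignment.items() if true_alignment.get(u) == v)
    return 100.0 * hits / len(alignment)

def interaction_correctness(edges1, edges2, alignment, true_alignment):
    """IC: percentage of G1 edges aligned to an edge with both endpoints correct."""
    e2 = {frozenset(e) for e in edges2}
    hits = sum(1 for a, b in edges1
               if alignment[a] == true_alignment.get(a)
               and alignment[b] == true_alignment.get(b)
               and frozenset((alignment[a], alignment[b])) in e2)
    return 100.0 * hits / len(edges1)

# Toy networks (assumed): the alignment preserves both edges, but node c is mismatched.
edges1 = [("a", "b"), ("b", "c")]
edges2 = [("x", "y"), ("y", "z")]
alignment = {"a": "x", "b": "y", "c": "z"}
truth = {"a": "x", "b": "y", "c": "w"}
print(edge_correctness(edges1, edges2, alignment))
print(node_correctness(alignment, truth))
print(interaction_correctness(edges1, edges2, alignment, truth))
```

In this toy case every edge is aligned to an edge, yet only one of the two interactions has both partners correct, which is why IC is the stricter measure.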
Measures of alignment quality (2)
• Usually the “true alignment” is not known
– So, can measure just EC…
– two alignments can have similar ECs even when
one alignment is “good” and the other is “bad”, so EC
alone is not enough…
• To uncover regions of similar topology:
– the aligned edges must cluster together and form
large and dense connected sub-graphs.
• Common connected sub-graph (CCS):
– connected sub-graph that appears in both networks
• Good alignment has:
– large and dense CCSs.
– Large EC
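One way to look for common connected subgraphs is sketched below, under assumptions: keep only the G1 edges whose aligned endpoints are also connected in G2, then take the connected components of what remains, largest first. The function name and the toy input are hypothetical, and ranking components by node count is a choice made here, not something stated on the slide.

```python
from collections import deque

def common_connected_subgraphs(edges1, edges2, alignment):
    """Node sets of connected components formed by aligned edges, largest first."""
    e2 = {frozenset(e) for e in edges2}
    adj = {}
    for a, b in edges1:
        # Keep a G1 edge only if its image under the alignment is a G2 edge.
        if frozenset((alignment[a], alignment[b])) in e2:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), {start}
        while queue:                      # breadth-first search over aligned edges
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    comp.add(y)
                    queue.append(y)
        components.append(comp)
    return sorted(components, key=len, reverse=True)

# Toy example (assumed): three of the four G1 edges are preserved by the alignment.
edges1 = [(1, 2), (2, 3), (3, 4), (5, 6)]
edges2 = [("a", "b"), ("b", "c"), ("e", "f")]
alignment = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e", 6: "f"}
print(common_connected_subgraphs(edges1, edges2, alignment))
```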
Statistical Significance
• Random alignment of real-world networks:
– the probability of obtaining a given or better EC at random.
• Null model of random alignment:
– Random mapping g: E1 → V2 × V2.
– n1 = |V1|, n2 = |V2|, m1 = |E1|, and m2 = |E2|.
– p = n2 (n2 − 1)/2: the number of node pairs in G2
– EC = x%: the edge correctness of the given alignment
– k = [m1 × x]: the number of edges from G1 aligned to edges
in G2.
• P:
– the probability of successfully aligning k or more edges by
chance (the tail of the hypergeometric distribution):
P = 1 − Σ_{i=0}^{k−1} C(m2, i) × C(p − m2, m1 − i) / C(p, m1),
where C(a, b) is the binomial coefficient “a choose b”.
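A short sketch of the p-value computation, assuming the null model reads as follows: the m1 edges of G1 are mapped at random onto the p = n2(n2 − 1)/2 node pairs of G2, of which m2 are edges, and P is the hypergeometric probability that k or more of them land on edges. The function name `ec_p_value` is hypothetical; exact fractions are used to avoid floating-point underflow for very small probabilities.

```python
from fractions import Fraction
from math import comb

def ec_p_value(n2, m1, m2, k):
    """Probability of aligning k or more of the m1 edges onto G2 edges by chance."""
    p = n2 * (n2 - 1) // 2                       # number of node pairs in G2
    upper = min(m1, m2)                          # at most this many edges can be hit
    tail = sum(comb(m2, i) * comb(p - m2, m1 - i) for i in range(k, upper + 1))
    return Fraction(tail, comb(p, m1))

# Toy numbers (assumed): G2 has 4 nodes and 3 edges, G1 has 2 edges, both get aligned.
print(ec_p_value(n2=4, m1=2, m2=3, k=2))         # -> 1/5
print(float(ec_p_value(n2=4, m1=2, m2=3, k=1)))
```

Summing the upper tail from k directly gives the same value as 1 minus the lower tail, and it avoids subtracting two nearly equal numbers when P is tiny.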
More statistical Significance Metrics
• H-GRAAL’s alignment of random model networks:
– Checks the significance of the alignment compared with
alignments of random networks:
• Align two PPI networks,
• align them with random networks,
• compare results.
• Biological Validation:
– find the number of aligned protein pairs sharing a Gene
Ontology (GO) term.
– Compute its statistical significance.
• Significance of functional enrichments:
– Align metabolic networks of different species
– generate phylogenetic trees based on H-GRAAL’s ECs.
– Compute its statistical significance.
Results (1)
• H-GRAAL always produces better alignments than GRAAL
for all values of α.
• using only degrees (α = 0) gives bad results.
– So, graphlet-based signatures are far more valuable than a
measure based on degree alone.
Results (2)
• The largest common connected sub-graph in the
alignment of the yeast and human PPI networks
– consisting of 1,290 interactions amongst 317 proteins.
– This network appears, in its entirety, in the PPI networks of
both species.
Results (3)
• Statistics of H-GRAAL’s core yeast-human
alignment for α = 0.5.
• The percentage of yeast proteins, out of 2,390 of
them, that participate in n “optimizing pairs”.
• Shows the quality of H-GRAAL!
Results (4)
• Comparison of the phylogenetic trees for protists and
fungi
• H-GRAAL’s and GRAAL’s trees are slightly different from the
sequence-based one.
• Sequence-based trees are built based on:
– multiple alignment of gene sequences
– whole genome alignments.
Results (5)
• Multiple alignments have a few problems:
– Can be misleading due to gene rearrangements, inversions,
transpositions, and translocations (at the substring level)
– Different species might have an unequal number of genes or
genomes of vastly different lengths.
• Whole genome alignments can be misleading:
– Noncontiguous copies of a gene or non-decisive gene order.
– The trees are built incrementally from smaller pieces that are
“patched” together probabilistically, so probabilistic errors
are expected.
• H-GRAAL and GRAAL have none of these problems, but:
– There are noise problems
– PPI networks are incomplete.
• No reason to believe that the sequence-based tree or
GRAAL’s one should a priori be considered the correct
one
Conclusions
• Presented H-GRAAL algorithm for global
alignment between networks
• Presented different statistics to evaluate
the quality of the alignment.
• Experimented with different PPI
networks, and with other networks as well.
• Showed that H-GRAAL produces better alignments than
GRAAL, the best previously known topology-only
global alignment algorithm.
• H-GRAAL could have a large impact on
research into biological networks!
Thank you for your
attention!