Optimal Network Alignment with Graphlet Degree Vectors

Tijana Milenković (Department of Computing, Imperial College London, and Department of Computer Science, University of California)
Weng Leong Ng (Department of Computer Science, University of California)
Wayne Hayes (Department of Computer Science, University of California, and Department of Mathematics, Imperial College London)
Nataša Pržulj (Department of Computing, Imperial College London)
Cancer Informatics, 2010
Presented by: Lila Shnaiderman

Motivation
• Recent advances in experimental techniques:
  – yeast two-hybrid assays,
  – mass spectrometry of purified complexes,
  – genome-wide chromatin immunoprecipitation, etc.
• So increasing amounts of biological network data are becoming available.
• Comparative analyses of biological networks could have as large an impact as comparative genomics on our understanding of:
  – biology
  – evolution
  – disease
• So meaningful network comparison across species has become one of the foremost problems in evolutionary and systems biology.

Background
• Subgraph isomorphism problem:
  – Does one graph exist as an exact subgraph of another graph?
  – NP-complete.
  – So exact network comparisons are computationally infeasible.
• Network alignment:
  – The most common network comparison method.
  – A more general problem:
    • Find the best way to “fit” one graph into another, even when it is not an exact subgraph.
• Unclear:
  – how to guide the alignment process,
  – how to measure the “goodness” of an inexact fit.
  – So heuristic strategies must be sought.

Background – alignment types
• Local alignment:
  – Used by the majority of existing methods.
  – Matches a small subnetwork from one network to one or more subnetworks in another network.
  – Can be ambiguous.
• Global alignment:
  – Measures the overall similarity between two networks.
  – Aligns every node in the smaller network to exactly one node in the larger network.
  – Most existing methods incorporate some a priori information external to network topology,
    • such as protein sequence similarities in PPI networks.
• Best-known global network alignment algorithm based solely on network topology:
  – GRAph ALigner (GRAAL): uses a heuristic search strategy to quickly find approximate alignments.

Current solution: H-GRAAL
• Hungarian-algorithm-based GRAAL.
• More expensive than GRAAL.
• Guaranteed to find optimal alignments relative to any fixed, deterministic cost function.
• Relies solely and explicitly on a strong and direct measure of network topological similarity.
• Applicable to any type of network.
• Allows knowledge to be transferred between aligned networks.

Graphlet degree vectors (1)
• Graphlet: a small connected induced subgraph of a larger network.
[Figure: graphlets G0 to G8, with their orbits numbered 0 to 14]

Graphlet degree vectors (2)
• Graphlet degree vector (GDV) of node V: for each orbit, counts the number of graphlets that the node touches at that orbit (over all graphlets on 2 to 5 nodes).
[Figure: node V highlighted at orbit 0]

Orbit   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
GDV(V)  4  2  5  1  0  4  0  2  1  0  0  2  0  0  0

Graphlet degree vectors (3) to (6)
[Figures: the same example, with node V highlighted at orbits 1 to 8 in turn; the GDV(V) table is repeated unchanged on each slide]

Graphlet degree vectors (7)
[Figure: node V highlighted at orbits 9, 10 and 11]
• What is the degree of node V (according to the vector)?
• There are 73 different orbits across all 2- to 5-node graphlets.
• The signature of node V:

Orbit   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
GDV(V)  4  2  5  1  0  4  0  2  1  0  0  2  0  0  0

Degree Vector - Signature
• Many real-world networks:
  – have a small-world nature.
• So the degree vector is an effective measure:
  – It looks at a network distance of up to 4 around a node.
  – It captures a large portion of the network topology.
• Thus, comparing two signatures:
  – is a highly constraining measure of local topological similarity between nodes.

Signature similarity
• For a node u ∈ G, u_i is the i-th coordinate of its signature vector.
• Distance between the i-th orbits of nodes u and v:
  $D_i(u,v) = w_i \times \frac{|\log(u_i + 1) - \log(v_i + 1)|}{\log(\max\{u_i, v_i\} + 2)}$
• w_i is the weight of orbit i:
  – It accounts for dependencies between orbits.
  – Orbits that are not affected by many other orbits get higher weights.
• Questions:
  – Why log?
  – Why “+1”?

Distance and Similarity
• Total distance:
  $D(u,v) = \frac{\sum_{i=0}^{72} D_i(u,v)}{\sum_{i=0}^{72} w_i}$
  – lies in [0, 1)
  – 0 means that u and v have identical signatures
• Similarity: S(u,v) = 1 − D(u,v)

H-GRAAL algorithm - definitions
• G1 and G2 are networks:
  – |V(G1)| < |V(G2)|
• Alignment of G1 to G2:
  – A set of ordered pairs (u,v), with u ∈ V(G1) and v ∈ V(G2).
  – No two ordered pairs share the same G1-node or the same G2-node.
  – Each pair is called an aligned pair.
• Maximum alignment:
  – Every G1-node is in some aligned pair.
  – From now on: alignment = maximum alignment.

H-GRAAL algorithm
• H-GRAAL:
  – Hungarian-algorithm-based GRAph ALigner.
• Produces an alignment:
  – of minimum total cost between the networks;
  – total cost: summed over all aligned pairs;
  – aligned pair cost: based on signature similarity.
• The cost of aligning u and v:
  – favors alignment of the densest parts of the networks;
  – is reduced as the degrees of both nodes increase: higher-degree nodes with similar signatures provide a tighter constraint;
  – α ∈ [0, 1] weights the contribution of the signature similarity between u and v;
  – 1 − α weights the contribution of the node degrees.
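The signature distance, the similarity, and the aligned-pair cost described on the slides above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation. The cost is assumed to take the form C(u,v) = 2 − ((1 − α) × T(u,v) + α × S(u,v)) with T(u,v) = (deg(u) + deg(v)) / (max_deg(G1) + max_deg(G2)), which is consistent with the bullets but is reconstructed rather than quoted. The function names, the uniform weights, the second signature `sig_u`, and all degree values are hypothetical.

```python
import math


def signature_distance(sig_u, sig_v, weights):
    """Total distance D(u, v) between two graphlet degree vectors."""
    weighted = 0.0
    for u_i, v_i, w_i in zip(sig_u, sig_v, weights):
        # log damps orbit counts that differ by orders of magnitude;
        # +1 keeps the logarithm defined when a count is 0.
        diff = abs(math.log(u_i + 1) - math.log(v_i + 1))
        weighted += w_i * diff / math.log(max(u_i, v_i) + 2)
    return weighted / sum(weights)


def signature_similarity(sig_u, sig_v, weights):
    """S(u, v) = 1 - D(u, v)."""
    return 1.0 - signature_distance(sig_u, sig_v, weights)


def alignment_cost(sig_u, sig_v, deg_u, deg_v, max_deg_g1, max_deg_g2,
                   weights, alpha=0.5):
    """Assumed cost form: 2 - ((1 - alpha) * T(u, v) + alpha * S(u, v))."""
    degree_term = (deg_u + deg_v) / (max_deg_g1 + max_deg_g2)
    similarity = signature_similarity(sig_u, sig_v, weights)
    return 2.0 - ((1.0 - alpha) * degree_term + alpha * similarity)


# Orbit counts 0..14 of node V from the slides; sig_u is a made-up second node.
sig_v = [4, 2, 5, 1, 0, 4, 0, 2, 1, 0, 0, 2, 0, 0, 0]
sig_u = [3, 2, 3, 1, 0, 2, 0, 1, 1, 0, 0, 1, 0, 0, 0]
uniform_weights = [1.0] * len(sig_v)  # placeholder: the real w_i differ per orbit

d = signature_distance(sig_u, sig_v, uniform_weights)
c = alignment_cost(sig_u, sig_v, deg_u=3, deg_v=4, max_deg_g1=10, max_deg_g2=12,
                   weights=uniform_weights, alpha=0.5)
print(f"D(u,v) = {d:.3f}, C(u,v) = {c:.3f}")
```

A full signature has 73 entries; the functions work for any length as long as the two signatures and the weight list line up. Raising `deg_u` and `deg_v` toward the maximum degrees lowers the cost, which is the "densest parts first" behavior the slide describes.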

Alignment Cost
• Cost = 0: a pair of topologically identical nodes u and v.
• Cost close to 2: a pair of topologically very different nodes.
• Any problem with this formula?
• The degree term T(u,v) is very low for most nodes:
  – There is only a small number of hubs (highly linked nodes),
  – so max_deg(G1) and max_deg(G2) are much larger than deg(u) and deg(v).

Hungarian Algorithm
• Solves the assignment problem in polynomial time:
  – Create a bipartite graph with node sets V(G1) and V(G2).
  – Each edge (u,v) from V(G1) to V(G2) is labeled with the node alignment cost.
  – Find a minimum-cost matching that covers every node of V(G1).
• More than one optimal alignment is possible:
  – The particular alignment found depends heavily on the implementation details of the underlying Hungarian algorithm,
  – for example, on the order in which the nodes are presented to the algorithm.

Finding a Few Optimal Alignments
• We can learn about all possible optimal matchings.
• Make H-GRAAL produce more alignments:
  – “Remove” (u,v): raise the alignment cost of a node pair (u,v) in the initial optimal alignment A0 to +∞.
  – Run H-GRAAL again.
• If the alignment found has a higher cost than A0, “remove” a different edge instead.
• If every edge has been tried and no other alignment of optimal cost was found, no more optimal alignments exist.
• This process is too expensive:
  – O(|V(G1)|^3 × |E(G1)|)
  – An improved version runs in O(|V(G1)|^2 × |E(G1)|), based on the dynamic Hungarian algorithm.
  – My remark: still very slow (can take months…)

Few Optimal Alignments algorithm
• Optimizing aligned pair:
  – A pair that appears in at least one optimal alignment.
• The set of optimizing pairs:
  – Can be computed in at worst O(n^4) time.
  – Can be easily parallelized.
  – My remark: too slow…

Few Optimal Alignments - Analysis
• Significance of an aligned pair:
  – Measured by the number of optimizing pairs per node u.
  – If (u,v) is the only optimizing pair for u, every optimal alignment contains (u,v), i.e., (u,v) is highly significant.
• Core alignment:
  – The set of all such special optimizing pairs.
  – A large core alignment means a stable alignment.

Measures of alignment quality (1)
• Edge correctness (EC):
  – Percentage of edges in one graph that are aligned to edges in the other graph.
• The following two measures require knowing the “true alignment”:
• Node correctness (NC):
  – Percentage of nodes in one network that are correctly aligned to nodes in the other network.
• Interaction correctness (IC):
  – Percentage of interactions that are aligned correctly.
• IC is stricter than EC:
  – EC does not require that the alignment partners are the correct ones.

Measures of alignment quality (2)
• Usually the “true alignment” is not known:
  – So only EC can be measured.
  – Two alignments can have similar ECs even though one is “good” and the other is “bad”, so EC alone is not enough.
• To uncover regions of similar topology:
  – The aligned edges must cluster together and form large and dense connected subgraphs.
• Common connected subgraph (CCS):
  – A connected subgraph that appears in both networks.
• A good alignment has:
  – large and dense CCSs,
  – a large EC.

Statistical Significance
• Random alignment of real-world networks:
  – The probability of obtaining a given or better EC at random.
• Null model of random alignment:
  – Random mapping g: E1 → V2 × V2.
  – n1 = |V1|, n2 = |V2|, m1 = |E1|, and m2 = |E2|.
  – p = n2(n2 − 1)/2: the number of node pairs in G2.
  – EC = x%: the edge correctness of the given alignment.
  – k = [m1 × x]: the number of edges from G1 aligned to edges in G2.
• P: the probability of successfully aligning k or more edges by chance (the tail of the hypergeometric distribution):
  $P = \sum_{i=k}^{\min(m_1, m_2)} \frac{\binom{m_2}{i}\binom{p - m_2}{m_1 - i}}{\binom{p}{m_1}}$

More Statistical Significance Metrics
• H-GRAAL’s alignment of random model networks:
  – Checks the significance of the alignment compared with alignments of random networks:
    • align two PPI networks,
    • align them with random networks,
    • compare the results.
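To make the null model from the Statistical Significance slide concrete, the sketch below computes EC for a toy alignment and then the hypergeometric tail P for that EC. It is only an illustration under stated assumptions: the two toy networks, the alignment, and the function names `edge_correctness` and `ec_p_value` are made up, and EC is measured relative to the edges of the smaller network G1.

```python
from math import comb


def edge_correctness(edges_g1, edges_g2, alignment):
    """Return (k, EC): aligned G1 edges and their fraction of all G1 edges."""
    g2_edges = {frozenset(e) for e in edges_g2}
    aligned = sum(
        1 for a, b in edges_g1
        if frozenset((alignment[a], alignment[b])) in g2_edges
    )
    return aligned, aligned / len(edges_g1)


def ec_p_value(n2, m1, m2, k):
    """Probability of aligning k or more of the m1 edges by chance."""
    p = n2 * (n2 - 1) // 2  # number of node pairs in G2
    tail = sum(
        comb(m2, i) * comb(p - m2, m1 - i)
        for i in range(k, min(m1, m2) + 1)
    )
    return tail / comb(p, m1)


# Toy networks (hypothetical): G1 has 4 nodes, G2 has 5 nodes.
edges_g1 = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
edges_g2 = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
alignment = {"a": 1, "b": 2, "c": 3, "d": 5}

k, ec = edge_correctness(edges_g1, edges_g2, alignment)
p_value = ec_p_value(n2=5, m1=len(edges_g1), m2=len(edges_g2), k=k)
print(f"k = {k}, EC = {ec:.0%}, P = {p_value:.4f}")
```

On networks this small a high EC is easy to reach by chance, so P stays large; on real PPI networks the same EC would correspond to a far smaller P.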
• Biological validation:
  – Find the number of aligned protein pairs sharing a Gene Ontology (GO) term.
  – Compute its statistical significance.
• Significance of functional enrichments:
  – Align metabolic networks of different species.
  – Generate phylogenetic trees based on H-GRAAL’s ECs.
  – Compute its statistical significance.

Results (1)
• H-GRAAL produces better alignments than GRAAL for all values of α.
• Using only degrees (α = 0) gives bad results.
  – So graphlet-based signatures are far more valuable than a measure based on degree alone.

Results (2)
• The largest common connected subgraph in the alignment of the yeast and human PPI networks:
  – consists of 1,290 interactions amongst 317 proteins.
  – This network appears, in its entirety, in the PPI networks of both species.

Results (3)
• Statistics of H-GRAAL’s core yeast-human alignment for α = 0.5.
• The percentage of yeast proteins, out of 2,390, that participate in n “optimizing pairs”.
• Shows the quality of H-GRAAL!

Results (4)
• Comparison of the phylogenetic trees for protists and fungi.
• H-GRAAL’s and GRAAL’s trees are slightly different from the sequence-based one.
• Sequence-based trees are built from:
  – multiple alignments of gene sequences,
  – whole-genome alignments.

Results (5)
• Multiple alignments have a few problems:
  – They can be misleading due to gene rearrangements, inversions, transpositions, and translocations (at the substring level).
  – Different species might have an unequal number of genes, or genomes of vastly different lengths.
• Whole-genome alignments can be misleading:
  – Noncontiguous copies of a gene or non-decisive gene order.
  – The trees are built incrementally from smaller pieces that are “patched” together probabilistically, so probabilistic errors are expected.
• H-GRAAL’s and GRAAL’s trees have none of these problems, but:
  – PPI networks are noisy.
  – PPI networks are incomplete.
• There is no reason to believe that either the sequence-based tree or GRAAL’s tree should a priori be considered the correct one.

Conclusions
• Presented the H-GRAAL algorithm for global alignment between networks.
• Presented different statistics to evaluate the quality of an alignment.
• Experimented with different PPI networks, and with networks other than PPI.
• Showed that H-GRAAL produces better alignments than GRAAL, the best previously known global aligner based solely on network topology.
• H-GRAAL could have a large impact on research into biological networks.

Thank you for your attention!