A phylogenetic application of the combinatorial graph Laplacian

Eric A. Stone
Department of Statistics
Bioinformatics Research Center
North Carolina State University

My motivation for this project
• Trees in statistics or biology
  – Often a latent branching structure relating some observed data
  – PROBLEM: Recover properties of latent branching structure
• Trees in mathematics
  – Always a connected graph with no cycles
  – Characterization of observed structure by spectral graph theory

Bridging the gap
• Rectifying trees and trees
• Can we use some powerful tools of spectral graph theory to recover latent structure?
  – Natural relationship between trees and complete graphs?!?

Tree and distance matrices
• The tree with vertex set {1,…,8} has distance matrix D

The phylogenetic portion D*
• The “phylogenetic tree” can only be observed at {1,…,5}
  – We can only observe (estimate) the phylogenetic portion D*
• Given D* only, recover latent branching structure
  – This is the problem of phylogenetic reconstruction (w/o error!)

NJ finds (2,n-2) splits from D*
• A split is a bipartition of the leaf set (e.g. {1,2,3,4,5}) that can be induced by cutting a branch on the tree
  – e.g. {{1,2},{3,4,5}} or {{1,2,5},{3,4}}
• Neighbor-joining criterion identifies (2,n-2) splits through the Q matrix
[Figure: tree with the splits {{1,2},{3,4,5}} and {{1,2,5},{3,4}} marked]

A recipe for tree reconstruction from D*
1. Find a split
   – NJ relies on theorem that guarantees (2,n-2) split from Q matrix
2. Use knowledge of split to reduce dimension
   – NJ prunes the cherry (neighboring taxa) to reduce leaves by one
3. Iterate until tree has been fully reconstructed
   – Tree topology specified by its split set

Our narrow goal
1. Find a split
   – NJ relies on theorem that guarantees (2,n-2) split from Q matrix
   – Hypothesize criterion that identifies deeper splits … and prove that it actually works

Our solution
• Let H be the centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{T}$
• Find eigenvector Y of HD*H with the smallest eigenvalue
  – The signs of the entries of Y identify a split of the tree

About the matrix HD*H
• Entries of HD*H are $D_{ij} - D_{i\cdot} - D_{\cdot j} + D_{\cdot\cdot}$ (a dot denotes the mean over that index)
• HD*H is negative semidefinite
  – Zero is a simple eigenvalue, with the constant (all-ones) vector as eigenvector
  – The remaining eigenvectors have both + and - entries
• HD*H appears prominently in:
  – Multidimensional scaling
  – Principal coordinate analysis

Example of our solution
• Find eigenvector Y of HD*H with the smallest eigenvalue
[Figure: tree with leaves labeled by the entries of Y, as positioned in the figure: -0.0564, +0.5793, +0.4418, -0.5011, -0.4636]
• Signs of Y identify the split {{1,2},{3,4,5}}

A real example (data from ToL)
• Two iterations

Our solution
1. Find a split
   – NJ relies on theorem that guarantees (2,n-2) split from Q matrix
   – Hypothesize criterion that identifies deep splits … and prove that it actually works

Affinity and distance
• In phylogenetics, common to consider pairwise distances
  – In graph theory, common to consider pairwise affinities
[Figure: affinity-based view (Laplacian matrix) alongside distance-based view (distance matrix)]

The genius of Miroslav Fiedler
• G connected ⇒ smallest eigenvalue of L, zero, is simple
  – Smallest positive eigenvalue, $\lambda$, called algebraic connectivity of G
• Fiedler vectors Y satisfy $LY = \lambda Y$
  – Fiedler cut is the sign-induced bipartition
[Figure: tree with vertices labeled by Fiedler vector entries, as positioned in the figure: -0.0223, -0.4277, +0.4840, -0.0158, +0.3449, +0.4038, -0.3653, -0.4047]
• Fiedler cut here is {{1,2,6},{3,4,5,7,8}}
• Note that the cut implies a leaf split: {{1,2},{3,4,5}}

Is this relevant here?
• We do not observe an 8×8 Laplacian matrix L
  – All we get is a 5×5 matrix of between-leaf pairwise distances D*
• Where is the connection to graph theory?
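
The Fiedler cut described above can be sketched in Python with NumPy. This is a minimal illustration, not the author's code. The topology is read off the splits and cutpoints quoted in the slides. The branch lengths, the affinity convention (1 / branch length), and the names `EDGES`, `laplacian`, `sign_cut` and `edge_splits` are all assumptions made for the sketch.

```python
import numpy as np

# Hypothetical branch lengths on the example topology (leaves 1-5, internal
# nodes 6-8); the slides show the tree but not its branch lengths.
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 1.5, (8, 5): 1.0,
         (8, 7): 0.5, (7, 3): 1.0, (7, 4): 2.0}
N = 8
LEAVES = [1, 2, 3, 4, 5]


def laplacian(edges, n):
    """Weighted Laplacian; affinity = 1 / branch length (assumed convention)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L


def sign_cut(y, labels):
    """Bipartition of labels by the sign of the matching entry of y."""
    labels = list(labels)
    pos = frozenset(l for l, val in zip(labels, y) if val > 0)
    return frozenset({pos, frozenset(labels) - pos})


def edge_splits(edges, leaves):
    """All leaf bipartitions obtained by cutting one edge of the tree."""
    splits = set()
    for cut in edges:
        adj = {}
        for (u, v) in edges:
            if (u, v) != cut:
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
        seen, stack = {cut[0]}, [cut[0]]
        while stack:  # depth-first search from one endpoint of the cut edge
            for nb in adj.get(stack.pop(), ()):
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        side = frozenset(seen) & frozenset(leaves)
        splits.add(frozenset({side, frozenset(leaves) - side}))
    return splits


L = laplacian(EDGES, N)
eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
lam, y = eigvals[1], eigvecs[:, 1]      # algebraic connectivity, Fiedler vector
fiedler_cut = sign_cut(y, range(1, N + 1))
leaf_split = sign_cut(y[:5], LEAVES)    # restrict the cut to leaves 1..5
print(sorted(map(sorted, fiedler_cut)), sorted(map(sorted, leaf_split)))
```

Which edge the cut crosses depends on the assumed branch lengths, so the printed split need not match the one on the slide. The sketch only illustrates that the sign pattern of a Fiedler vector on a tree restricts to a leaf split.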
Recall: Our solution
• Let H be the centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{T}$
• Find eigenvector Y of HD*H with the smallest eigenvalue
  – The signs of the entries of Y identify a split of the tree

An extremely useful relationship
• Recall the centering matrix H
  – The (Moore-Penrose) pseudoinverse of HDH is in fact -2L
• We have shown in the context of this formula
  – Principal submatrices of D relate to Schur complements of L
• In particular, $(HD^{*}H)^{+} = -2L^{*} = -2(L/Z) = -2(W - XZ^{-1}Y)$, where $L = \begin{pmatrix} W & X \\ Y & Z \end{pmatrix}$

Recall: Our solution
• Find eigenvector Y of HD*H with the smallest eigenvalue
  – The signs of the entries of Y identify a split of the tree
• The smallest eigenvalue of HD*H (negative semidefinite) corresponds to the smallest positive eigenvalue of L*
• In fact, L* can be seen as a graph Laplacian
  – And our solution, Y, is the Fiedler vector of that graph!
• But what does this graph look like?

Schur complementation of a vertex
• The vertices adjacent to 8 become adjacent to each other

Schur complementation of the interior
• The graph described by L* is fully connected
  – All cuts yield connected subgraphs
• No help from Fiedler

Recap thus far
• Given matrix D* of pairwise distances between leaves
• Find eigenvector Y of HD*H with the smallest eigenvalue
  – Claim: The signs of the entries of Y identify a split of the tree
• Y shown to be a Fiedler vector of the Laplacian L*
  – But graph of L* is fully connected, has no apparent structure
• Thus Fiedler says nothing about signs of entries of Y
  – But claim requires signs to be consistent with structure of the tree
[Figure: panels labeled NO, NO, YES]
• How does L* inherit the structure of the tree?

The quotient rule inspires a “Schur tower”
• How does this help?
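
The link between D* and the Schur complement L* = L/Z can be checked numerically. The Python sketch below is only an illustration: the branch lengths are hypothetical, and the affinity on each edge is assumed to be 1 / branch length. Under that assumed scaling the check comes out as $(HD^{*}H)^{+} = -\tfrac{1}{2}L^{*}$, equivalently $HD^{*}H = -2(L^{*})^{+}$. A different scaling of L changes the constant but not the eigenvectors or their signs.

```python
import numpy as np

# Hypothetical branch lengths on the example topology (leaves 1-5, internal 6-8).
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 1.5, (8, 5): 1.0,
         (8, 7): 0.5, (7, 3): 1.0, (7, 4): 2.0}
N, K = 8, 5  # total vertices, number of leaves


def laplacian(edges, n):
    """Weighted Laplacian; affinity = 1 / branch length (assumed convention)."""
    L = np.zeros((n, n))
    for (u, v), length in edges.items():
        w = 1.0 / length
        L[u - 1, u - 1] += w
        L[v - 1, v - 1] += w
        L[u - 1, v - 1] -= w
        L[v - 1, u - 1] -= w
    return L


def tree_distances(edges, n):
    """All-pairs path lengths via Floyd-Warshall."""
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for (u, v), length in edges.items():
        D[u - 1, v - 1] = D[v - 1, u - 1] = length
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D


L = laplacian(EDGES, N)
D = tree_distances(EDGES, N)

# Block partition of L: leaves first (W), internal nodes last (Z).
W, X, Y, Z = L[:K, :K], L[:K, K:], L[K:, :K], L[K:, K:]
L_star = W - X @ np.linalg.solve(Z, Y)   # Schur complement L/Z
L_star = (L_star + L_star.T) / 2         # remove rounding asymmetry

D_star = D[:K, :K]                       # phylogenetic portion of D
H = np.eye(K) - np.ones((K, K)) / K      # centering matrix
B = H @ D_star @ H
B_pinv = np.linalg.pinv(B, rcond=1e-10)  # Moore-Penrose pseudoinverse

y_B = np.linalg.eigh(B)[1][:, 0]         # eigenvector of smallest eigenvalue of HD*H
y_L = np.linalg.eigh(L_star)[1][:, 1]    # Fiedler vector of L*

print(np.allclose(B_pinv, -0.5 * L_star))
print(np.round(L_star, 3))
```

In this sketch every off-diagonal entry of `L_star` is negative, so the graph it describes is complete, and `y_B` agrees with `y_L` up to sign.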
Cutpoints and connected components
• A point of articulation (or cutpoint) is a point $r \in G$ whose deletion yields a subgraph with at least 2 connected components
  – Cutpoints: 6,7,8
  – Shown: {1}, {2}, {3,4,5,7,8} are connected components at 6
• The cutpoints of a tree are its internal nodes

The key observation (i.e. theorem)
• Let L be the Laplacian of a graph G with some cutpoint v
  – Let L{v} be the Laplacian of G{v} obtained by Schur complement at v
[Figure: G and G{6}; vertices of G{6} labeled by Fiedler vector entries, as positioned in the figure: +0.5828, -0.4129, +0.0380, -0.3439, +0.0570, +0.4660, -0.3870; signs carried over to G, with vertex 6 marked "?"]
• Then the Fiedler cut of G{v} identifies a split of G
  – Here the Fiedler cut of G{6} is {{1,2,5,8},{3,4,7}}
  – Including 6 in {1,2,5,8} defines two connected components in G

The quotient rule inspires a “Schur tower”
[Figure: tower of Schur complements leading from L to L*]
• How does this help? Look at Schur paths to graph with Laplacian L*

The punch line
• The graph with Laplacian L* can be obtained in three ways
• The Fiedler cut of G{6,7,8} must split G{6,7} and G{6,8} and G{7,8}

Recall: Example
• Find eigenvector Y of HD*H with the smallest eigenvalue
[Figure: tree with leaves labeled by the entries of Y, as positioned in the figure: -0.0564, +0.5793, +0.4418, -0.5011, -0.4636]
• Signs of Y identify the split {{1,2},{3,4,5}}

The punch line
• The graph with Laplacian L* can be obtained in three ways
• The Fiedler cut of G{6,7,8} must split G{6,7} and G{6,8} and G{7,8}
• This implies that the cut splits the progenitor graph G!
  – {{1,2,6},{3,4,5,7,8}}

Our solution actually works
• Let H be the centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{T}$
• Find eigenvector Y of HD*H with the smallest eigenvalue
  – The signs of the entries of Y identify a split of the tree

A recipe for tree reconstruction
1. Find a split
   – NJ relies on theorem that guarantees (2,n-2) split from Q matrix
   – We have a theorem that guarantees splits from HD*H matrix
2. Use knowledge of split to reduce dimension
   – NJ prunes the cherry (neighboring taxa) to reduce leaves by one
   – We use a divisive method that reduces to pairs of subtrees
3. Iterate until tree has been fully reconstructed
   – Tree topology specified by its split set

Reconstruction from the inside out

Connections with Classical MDS and PCoA
• Classical solution to multidimensional scaling
  – a.k.a. Principal coordinate analysis
• Recipe for dimension reduction given distance matrix D:
  1. Construct matrix A from D entrywise: $x \mapsto -x^{2}/2$
  2. Double centering: B = HAH
  3. Find k largest eigenvalues $\lambda_i$ of B with corresponding eigenvectors $X_i$
  4. Coordinates of point $P_r$ given by row r of eigenvector entries
• With k = 1 and the square root of tree distance as input, this is equivalent to our approach

Phylogenetic ordination
• PCoA on sequence data with k = 3:
  – For appropriate distance, C1 (x-axis) guaranteed to split taxa at 0
• Our results support popular use of PCoA
  – Provided that the right distance is considered…

Conclusion I
• Natural connection between matrix of pairwise distances and the Laplacian of a complete graph

Conclusion II
• Structure of tree embedded in complete graph and recoverable via spectral theory
[Figure: panels labeled NO, NO, YES]
• Notion of “Fiedler cut” extends concept to “Fiedler split”
  – Inheritance propagated through Schur tower

Conclusion III
• Results inspire fast divisive tree reconstruction method

Conclusion IV
• Provides guidance and justification for ordination approach

Acknowledgements
• Alex Griffing (NCSU Bioinformatics)
• Carl Meyer (NCSU Math)
• Amy Langville (CoC Math)
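
Supplementary sketch: the classical MDS / PCoA recipe from the "Connections with Classical MDS and PCoA" slide, written out in Python as a minimal illustration. The tree and its branch lengths are hypothetical, and `classical_mds` is a name chosen for this sketch. Coordinates are scaled by the square root of the eigenvalue, a common convention that does not change their signs.

```python
import numpy as np

# Hypothetical branch lengths on the example topology (leaves 1-5, internal 6-8).
EDGES = {(1, 6): 1.0, (2, 6): 2.0, (6, 8): 1.5, (8, 5): 1.0,
         (8, 7): 0.5, (7, 3): 1.0, (7, 4): 2.0}


def tree_distances(edges, n):
    """All-pairs path lengths via Floyd-Warshall."""
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for (u, v), length in edges.items():
        D[u - 1, v - 1] = D[v - 1, u - 1] = length
    for k in range(n):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D


def classical_mds(dist, k):
    """Steps 1-4 of the recipe: A = -dist^2/2, B = HAH, top-k eigenpairs."""
    n = dist.shape[0]
    A = -dist ** 2 / 2.0                       # step 1, entrywise
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = H @ A @ H                              # step 2, double centering
    vals, vecs = np.linalg.eigh(B)             # ascending eigenvalues
    top = np.argsort(vals)[::-1][:k]           # step 3, k largest
    # step 4: row r holds the coordinates of point r
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))


D_star = tree_distances(EDGES, 8)[:5, :5]      # leaf-to-leaf distances only
coords = classical_mds(np.sqrt(D_star), k=1)   # square root of tree distance

# Direct route: eigenvector of HD*H with the smallest eigenvalue.
H = np.eye(5) - np.ones((5, 5)) / 5
y = np.linalg.eigh(H @ D_star @ H)[1][:, 0]

pos = frozenset(i + 1 for i in range(5) if coords[i, 0] > 0)
split = frozenset({pos, frozenset(range(1, 6)) - pos})
print(sorted(map(sorted, split)))
```

With the square root of tree distance as input, A equals -D*/2, so B equals -HD*H/2. Its leading eigenvector is therefore the eigenvector of HD*H with the smallest eigenvalue, and the first coordinate changes sign across a split of the tree.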