A phylogenetic application of the
combinatorial graph Laplacian
Eric A. Stone
Department of Statistics
Bioinformatics Research Center
North Carolina State University
My motivation for this project
• Trees in statistics or biology
– Often a latent branching structure relating some observed data
• Trees in mathematics
– Always a connected graph with no cycles
My motivation for this project
• Trees in statistics or biology
– PROBLEM: Recover properties of latent branching structure
• Trees in mathematics
– Always a connected graph with no cycles
My motivation for this project
• Trees in statistics or biology
– PROBLEM: Recover properties of latent branching structure
• Trees in mathematics
– Characterization of observed structure by spectral graph theory
Bridging the gap
• Rectifying trees and trees
• Can we use some powerful tools of spectral graph theory
to recover latent structure?
– Natural relationship between trees and complete graphs?!?
Tree and distance matrices
• The tree with vertex set {1,…,8} has distance matrix D
The phylogenetic portion D*
• The “phylogenetic tree” can only be observed at {1,…,5}
– We can only observe (estimate) the phylogenetic portion D*
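A minimal sketch of how D and D* could be computed for an 8-vertex tree like the one on these slides. The topology (cherry {1,2} at node 6, cherry {3,4} at node 7, leaf 5 at node 8) is inferred from the splits and cutpoints quoted in this talk. The branch lengths and the names `EDGES` and `tree_distances` are assumptions for illustration, not the author's data.

```python
import numpy as np

# Vertices 1..8 of the slides are indices 0..7 here (leaves first).
# Branch lengths are hypothetical; the slides do not state them.
EDGES = [(0, 5, 1.0), (1, 5, 1.0), (2, 6, 1.0), (3, 6, 1.0),
         (4, 7, 1.0), (5, 7, 2.0), (6, 7, 1.0)]

def tree_distances(edges, n):
    """All-pairs path lengths via Floyd-Warshall (adequate for small trees)."""
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for i, j, length in edges:
        D[i, j] = D[j, i] = length
    for k in range(n):
        # Allow paths that pass through vertex k.
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

D = tree_distances(EDGES, 8)   # full 8x8 matrix, interior nodes included
D_star = D[:5, :5]             # phylogenetic portion: leaves only
print(D_star)
```

Only `D_star` would be available in practice; `D` is built here because the toy tree is fully known.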
More motivation for this project
• Trees in statistics or biology
– PROBLEM: Recover properties of latent branching structure
The phylogenetic portion D*
• Given D* only, recover latent branching structure
– This is the problem of phylogenetic reconstruction (w/o error!)
NJ finds (2,n-2) splits from D*
• A split is a bipartition of the leaf set (e.g. {1,2,3,4,5}) that
can be induced by cutting a branch on the tree
– e.g. {{1,2},{3,4,5}} or {{1,2,5},{3,4}}
• Neighbor-joining criterion identifies (2,n-2) splits through the Q matrix, e.g.
– {{1,2},{3,4,5}}
– {{1,2,5},{3,4}}
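The neighbor-joining criterion can be sketched with the standard Q matrix, Q(i,j) = (n-2)·d(i,j) − Σ_k d(i,k) − Σ_k d(j,k), whose minimum picks out a cherry and hence a (2,n-2) split. The matrix `D_star` below comes from hypothetical branch lengths on a 5-leaf tree, and `nj_q_matrix` is an illustrative name, not taken from the talk.

```python
import numpy as np

# Hypothetical leaf-to-leaf distances (tree with cherries {1,2} and {3,4}).
D_star = np.array([[0, 2, 5, 5, 4],
                   [2, 0, 5, 5, 4],
                   [5, 5, 0, 2, 3],
                   [5, 5, 2, 0, 3],
                   [4, 4, 3, 3, 0]], dtype=float)

def nj_q_matrix(D):
    """Q(i,j) = (n-2) d(i,j) - sum_k d(i,k) - sum_k d(j,k)."""
    n = D.shape[0]
    r = D.sum(axis=1)
    Q = (n - 2) * D - r[:, None] - r[None, :]
    np.fill_diagonal(Q, np.inf)   # exclude i == j from the search
    return Q

Q = nj_q_matrix(D_star)
i, j = (int(x) for x in np.unravel_index(np.argmin(Q), Q.shape))
cherry = {i + 1, j + 1}                          # back to 1-based taxon labels
rest = set(range(1, len(D_star) + 1)) - cherry
print(cherry, rest)
```

For these assumed distances the minimum of Q falls on the pair (1,2), giving the split {{1,2},{3,4,5}}.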
A recipe for tree reconstruction from D*
1. Find a split
– NJ relies on theorem that guarantees (2,n-2) split from Q matrix
2. Use knowledge of split to reduce dimension
– NJ prunes the cherry (neighboring taxa) to reduce leaves by one
3. Iterate until tree has been fully reconstructed
– Tree topology specified by its split set
Our narrow goal
1. Find a split
– NJ relies on theorem that guarantees (2,n-2) split from Q matrix
– Hypothesize criterion that identifies deeper splits … and prove that it actually works
Our solution
The phylogenetic portion D*
• Let H be the centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ (n leaves, $\mathbf{1}$ the all-ones vector)
• Find eigenvector Y of HD*H with the smallest eigenvalue
– The signs of the entries of Y identify a split of the tree
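The procedure on this slide can be sketched in a few lines of NumPy. The matrix `D_star` is built from hypothetical branch lengths (the slides do not give them), and `spectral_split` is an illustrative name.

```python
import numpy as np

# Hypothetical leaf-to-leaf distances (tree with cherries {1,2} and {3,4}).
D_star = np.array([[0, 2, 5, 5, 4],
                   [2, 0, 5, 5, 4],
                   [5, 5, 0, 2, 3],
                   [5, 5, 2, 0, 3],
                   [4, 4, 3, 3, 0]], dtype=float)

def spectral_split(D):
    """Sign pattern of the eigenvector of H D H with the smallest eigenvalue."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = H @ D @ H
    B = (B + B.T) / 2                            # remove round-off asymmetry
    eigenvalues, eigenvectors = np.linalg.eigh(B)   # eigenvalues ascending
    Y = eigenvectors[:, 0]                       # smallest (most negative)
    positive = frozenset(int(i) + 1 for i in np.flatnonzero(Y > 0))
    negative = frozenset(range(1, n + 1)) - positive
    return Y, {positive, negative}

Y, split = spectral_split(D_star)
print(Y, split)
```

The overall sign of `Y` is arbitrary, so the split is returned as an unordered pair of leaf sets. With these assumed distances it is {{1,2},{3,4,5}}.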
About the matrix HD*H
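• Entries of HD*H are $D_{ij} - \bar{D}_{i\cdot} - \bar{D}_{\cdot j} + \bar{D}_{\cdot\cdot}$ (bars denote row, column, and grand means)
• HD*H is negative semidefinite
– Zero is a simple eigenvalue with the constant (all-ones) eigenvector
– Remaining eigenvectors have both + and - entries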
• HD*H appears prominently in:
– Multidimensional scaling
– Principal coordinate analysis
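The properties listed above can be checked numerically. This is only a sanity check on one hypothetical distance matrix (assumed branch lengths), not a proof.

```python
import numpy as np

# Hypothetical leaf-to-leaf distances (tree with cherries {1,2} and {3,4}).
D_star = np.array([[0, 2, 5, 5, 4],
                   [2, 0, 5, 5, 4],
                   [5, 5, 0, 2, 3],
                   [5, 5, 2, 0, 3],
                   [4, 4, 3, 3, 0]], dtype=float)

n = D_star.shape[0]
H = np.eye(n) - np.ones((n, n)) / n
B = H @ D_star @ H

# Entrywise form: D_ij - (row mean) - (column mean) + (grand mean).
row_mean = D_star.mean(axis=1, keepdims=True)
col_mean = D_star.mean(axis=0, keepdims=True)
entrywise = D_star - row_mean - col_mean + D_star.mean()

eigenvalues, eigenvectors = np.linalg.eigh((B + B.T) / 2)   # ascending order
print(np.round(eigenvalues, 4))
print(np.allclose(B, entrywise))
```

The last eigenvalue printed is zero up to round-off and belongs to the all-ones vector. The others are negative, and their eigenvectors sum to zero, which forces mixed signs.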
Example of our solution
• Find eigenvector Y of HD*H with the smallest eigenvalue:
-0.0564
+0.5793
+0.4418
-0.5011
-0.4636
• Signs of Y identify the split {{1,2},{3,4,5}}
A real example (data from ToL)
• Two iterations
Our solution
1. Find a split
– NJ relies on theorem that guarantees (2,n-2) split from Q matrix
– Hypothesize criterion that identifies deep splits … and prove that it actually works
Affinity and distance
• In phylogenetics, common to consider pairwise distances
– In graph theory, common to consider pairwise affinities
Affinity-based
Distance-based
Distance matrix ↔ Laplacian matrix
The genius of Miroslav Fiedler
• G connected ⇒ smallest eigenvalue of L, zero, is simple
– Smallest positive eigenvalue, $\lambda$, called algebraic connectivity of G
• Fiedler vectors Y satisfy $LY = \lambda Y$
– Fiedler cut is the sign-induced bipartition
• Fiedler cut here is
– {{1,2,6},{3,4,5,7,8}}
• Note that the cut implies a leaf split:
– {{1,2},{3,4,5}}
[Figure: tree annotated with Fiedler vector entries -0.0223, -0.4277, +0.4840, -0.0158, +0.3449, -0.3653, +0.4038, -0.4047]
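A sketch of the Fiedler cut for a tree of the same shape. The affinity of an edge is taken to be the reciprocal of its branch length, and the branch lengths themselves are assumptions, so the vector entries will not match the figure. The names `EDGES`, `laplacian` and `fiedler_cut` are illustrative.

```python
import numpy as np

# (vertex, vertex, hypothetical branch length); labels 1..8 as on the slides.
EDGES = [(1, 6, 1.0), (2, 6, 1.0), (3, 7, 1.0), (4, 7, 1.0),
         (5, 8, 1.0), (6, 8, 2.0), (7, 8, 1.0)]

def laplacian(edges, n):
    """Weighted graph Laplacian with affinity w = 1 / branch length."""
    L = np.zeros((n, n))
    for a, b, length in edges:
        i, j, w = a - 1, b - 1, 1.0 / length
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    return L

def fiedler_cut(L, labels):
    """Sign-induced bipartition from the eigenvector of the second-smallest eigenvalue."""
    eigenvalues, eigenvectors = np.linalg.eigh(L)   # ascending; first is ~0
    y = eigenvectors[:, 1]
    positive = frozenset(v for v, s in zip(labels, y) if s > 0)
    return eigenvalues[1], {positive, frozenset(labels) - positive}

L = laplacian(EDGES, 8)
connectivity, cut = fiedler_cut(L, list(range(1, 9)))
leaf_split = {frozenset(v for v in side if v <= 5) for side in cut}
print(connectivity, cut, leaf_split)
```

With these assumed lengths the cut is {{1,2,6},{3,4,5,7,8}}, and restricting it to the leaves gives {{1,2},{3,4,5}}.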
Is this relevant here?
• We do not observe an 8x8 Laplacian matrix L
– All we get is a 5x5 matrix of between-leaf pairwise distances D*
The phylogenetic portion D*
• Where is the connection to graph theory?
Recall: Our solution
The phylogenetic portion D*
• Let H be the centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ (n leaves, $\mathbf{1}$ the all-ones vector)
• Find eigenvector Y of HD*H with the smallest eigenvalue
– The signs of the entries of Y identify a split of the tree
An extremely useful relationship
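• Recall the centering matrix H
– The (Moore-Penrose) pseudoinverse of HDH is in fact -L/2 (equivalently, $HDH = -2L^{+}$), with edge affinities in L equal to reciprocal branch lengths
• We have shown in the context of this formula
– Principal submatrices of D relate to Schur complements of L
• In particular, $(HD^{*}H)^{+} = -\tfrac{1}{2}L^{*} = -\tfrac{1}{2}(L/Z) = -\tfrac{1}{2}(W - XZ^{-1}Y)$, where $L = \begin{pmatrix} W & X \\ Y & Z \end{pmatrix}$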
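The relationship can be checked numerically on a small tree. In this sketch the branch lengths are hypothetical, affinities are taken as reciprocal lengths, and the helper names are illustrative. The blocks W, X, Z are the leaf-leaf, leaf-interior and interior-interior parts of L (with Y = Xᵀ because L is symmetric).

```python
import numpy as np

# Indices 0..4 are leaves 1..5; indices 5..7 are internal nodes 6..8.
EDGES = [(0, 5, 1.0), (1, 5, 1.0), (2, 6, 1.0), (3, 6, 1.0),
         (4, 7, 1.0), (5, 7, 2.0), (6, 7, 1.0)]   # hypothetical lengths
n = 8

# Path-length distance matrix D (Floyd-Warshall) and Laplacian L.
D = np.full((n, n), np.inf)
np.fill_diagonal(D, 0.0)
L = np.zeros((n, n))
for i, j, length in EDGES:
    D[i, j] = D[j, i] = length
    w = 1.0 / length
    L[i, i] += w
    L[j, j] += w
    L[i, j] -= w
    L[j, i] -= w
for k in range(n):
    D = np.minimum(D, D[:, [k]] + D[[k], :])

def centered(M):
    m = M.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ M @ H

# Schur complement of the interior block Z in L.
W, X, Z = L[:5, :5], L[:5, 5:], L[5:, 5:]
L_star = W - X @ np.linalg.solve(Z, X.T)
D_star = D[:5, :5]

full_ok = np.allclose(np.linalg.pinv(centered(D), rcond=1e-10), -L / 2)
leaf_ok = np.allclose(np.linalg.pinv(centered(D_star), rcond=1e-10), -L_star / 2)
print(full_ok, leaf_ok)
```

An explicit `rcond` is passed so that the numerically zero eigenvalue is reliably treated as zero when forming the pseudoinverse.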
Recall: Our solution
• Find eigenvector Y of HD*H with the smallest eigenvalue
– The signs of the entries of Y identify a split of the tree
• The smallest eigenvalue of HD*H (negative semidefinite) corresponds to the smallest positive eigenvalue of L*, with the same eigenvector
• In fact, L* can be seen as a graph Laplacian
– And our solution, Y, is the Fiedler vector of that graph!
• But what does this graph look like?
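A quick numerical illustration that Y coincides with the Fiedler vector of L*. Assumptions: hypothetical branch lengths, affinities equal to reciprocal lengths, and a `D_star` matrix that is the leaf-to-leaf path-length matrix of that same tree.

```python
import numpy as np

# Indices 0..4 are leaves 1..5; indices 5..7 are internal nodes 6..8.
EDGES = [(0, 5, 1.0), (1, 5, 1.0), (2, 6, 1.0), (3, 6, 1.0),
         (4, 7, 1.0), (5, 7, 2.0), (6, 7, 1.0)]   # hypothetical lengths
D_star = np.array([[0, 2, 5, 5, 4],
                   [2, 0, 5, 5, 4],
                   [5, 5, 0, 2, 3],
                   [5, 5, 2, 0, 3],
                   [4, 4, 3, 3, 0]], dtype=float)  # leaf distances of that tree

L = np.zeros((8, 8))
for i, j, length in EDGES:
    w = 1.0 / length
    L[i, i] += w
    L[j, j] += w
    L[i, j] -= w
    L[j, i] -= w

# L* = Schur complement of the interior block.
W, X, Z = L[:5, :5], L[:5, 5:], L[5:, 5:]
L_star = W - X @ np.linalg.solve(Z, X.T)

H = np.eye(5) - np.ones((5, 5)) / 5
B = H @ D_star @ H
b_vals, b_vecs = np.linalg.eigh((B + B.T) / 2)
l_vals, l_vecs = np.linalg.eigh((L_star + L_star.T) / 2)

Y = b_vecs[:, 0]          # smallest eigenvalue of H D* H
fiedler = l_vecs[:, 1]    # smallest positive eigenvalue of L*
print(abs(Y @ fiedler), b_vals[0], -2 / l_vals[1])
```

The two unit vectors agree up to sign, and the eigenvalues are related by λ = -2/μ, which follows from H D* H = -2 (L*)⁺ under the conventions assumed here.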
Schur complementation of a vertex
• The vertices adjacent to 8 become adjacent to each other
Schur complementation of the interior
• The graph described by L* is fully connected
– All cuts yield connected subgraphs ⇒ No help from Fiedler
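Both effects can be seen on a toy tree: eliminating vertex 8 joins its neighbors 5, 6 and 7, and eliminating the whole interior leaves a complete graph on the leaves. This is a sketch with hypothetical branch lengths (affinity = 1/length), and `schur_complement` is an illustrative helper.

```python
import numpy as np

# (vertex, vertex, hypothetical branch length); labels 1..8 as on the slides.
EDGES = [(1, 6, 1.0), (2, 6, 1.0), (3, 7, 1.0), (4, 7, 1.0),
         (5, 8, 1.0), (6, 8, 2.0), (7, 8, 1.0)]

L = np.zeros((8, 8))
for a, b, length in EDGES:
    i, j, w = a - 1, b - 1, 1.0 / length
    L[i, i] += w
    L[j, j] += w
    L[i, j] -= w
    L[j, i] -= w

def schur_complement(L, labels, drop):
    """Eliminate the vertices in `drop`; return reduced Laplacian and labels."""
    keep = [k for k, v in enumerate(labels) if v not in drop]
    gone = [k for k, v in enumerate(labels) if v in drop]
    W = L[np.ix_(keep, keep)]
    X = L[np.ix_(keep, gone)]
    Z = L[np.ix_(gone, gone)]
    return W - X @ np.linalg.solve(Z, X.T), [labels[k] for k in keep]

labels = list(range(1, 9))
L_no8, labels_no8 = schur_complement(L, labels, {8})
L_star, leaf_labels = schur_complement(L, labels, {6, 7, 8})

# Negative off-diagonal entries of a Laplacian are edges.
new_edges = [(labels_no8[i], labels_no8[j])
             for i in range(7) for j in range(i + 1, 7)
             if L[i, j] == 0 and L_no8[i, j] < -1e-12]
off_diagonal = L_star[~np.eye(5, dtype=bool)]
print(new_edges, bool(np.all(off_diagonal < 0)))
```

The new edges after removing 8 are exactly the pairs among its former neighbors, and every off-diagonal entry of `L_star` is strictly negative, i.e. every pair of leaves is adjacent.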
Recap thus far
• Given matrix D* of pairwise distances between leaves
• Find eigenvector Y of HD*H with the smallest eigenvalue
– Claim: The signs of the entries of Y identify a split of the tree
• Y shown to be a Fiedler vector of the Laplacian L*
– But graph of L* is fully connected, has no apparent structure
• Thus Fiedler says nothing about signs of entries of Y
– But claim requires signs to be consistent with structure of the tree
• How does L* inherit the structure of the tree?
[Figure labels: NO, NO, YES]
The quotient rule inspires a “Schur tower”
• How does this help?
Cutpoints and connected components
• A point of articulation (or cutpoint) is a point r ∈ G whose deletion yields a subgraph with ≥ 2 connected components
– Cutpoints: 6,7,8
– Shown: {1}, {2}, {3,4,5,7,8} are connected components at 6
• The cutpoints of a tree are its internal nodes
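The definition can be checked by brute force: delete each vertex in turn and count the connected components that remain. The edge list below is the tree topology inferred from the splits and cutpoints quoted in this talk; `components_without` is an illustrative helper.

```python
# Tree topology with labels 1..8 (leaves 1..5, internal nodes 6..8).
EDGES = [(1, 6), (2, 6), (3, 7), (4, 7), (5, 8), (6, 8), (7, 8)]
VERTICES = range(1, 9)

def components_without(vertex, edges, vertices):
    """Connected components of the graph after deleting `vertex`."""
    adjacency = {v: set() for v in vertices if v != vertex}
    for a, b in edges:
        if vertex not in (a, b):
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # depth-first search from `start`
            v = stack.pop()
            if v in component:
                continue
            component.add(v)
            stack.extend(adjacency[v] - component)
        seen |= component
        components.append(component)
    return components

cutpoints = [v for v in VERTICES
             if len(components_without(v, EDGES, VERTICES)) >= 2]
print(cutpoints)
print(components_without(6, EDGES, VERTICES))
```

For this tree the cutpoints are the internal nodes 6, 7 and 8, and deleting 6 leaves the components {1}, {2} and {3,4,5,7,8}.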
The key observation (i.e. theorem)
• Let L be the Laplacian of a graph G with some cutpoint v
– Let L{v} be the Laplacian of G{v} obtained by Schur complement at v
[Figure: G with vertex signs +, +, +, +, -, -, - and one vertex marked ?; G{6} with Fiedler vector entries +0.5828, -0.4129, +0.0380, -0.3439, +0.0570, +0.4660, -0.3870]
• Then the Fiedler cut of G{v} identifies a split of G
– Here the Fiedler cut of G{6} is {{1,2,5,8},{3,4,7}}
– Including 6 in {1,2,5,8} defines two connected components in G
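A sketch of the observation on a toy tree: take the Schur complement at cutpoint 6 and read off the Fiedler cut of G{6}. Branch lengths are hypothetical (affinity = 1/length) and differ from those behind the figure, so the cut found here need not be the {{1,2,5,8},{3,4,7}} of the slide. The helper names are illustrative.

```python
import numpy as np

# (vertex, vertex, hypothetical branch length); labels 1..8 as on the slides.
EDGES = [(1, 6, 1.0), (2, 6, 1.0), (3, 7, 1.0), (4, 7, 1.0),
         (5, 8, 1.0), (6, 8, 2.0), (7, 8, 1.0)]

L = np.zeros((8, 8))
for a, b, length in EDGES:
    i, j, w = a - 1, b - 1, 1.0 / length
    L[i, i] += w
    L[j, j] += w
    L[i, j] -= w
    L[j, i] -= w

def schur_at(L, labels, v):
    """Schur complement of L at the single vertex v."""
    k = labels.index(v)
    keep = [i for i in range(len(labels)) if i != k]
    reduced = L[np.ix_(keep, keep)] - np.outer(L[keep, k], L[k, keep]) / L[k, k]
    return reduced, [labels[i] for i in keep]

def fiedler_cut(L, labels):
    """Sign-induced bipartition from the second-smallest eigenvalue."""
    _, eigenvectors = np.linalg.eigh((L + L.T) / 2)
    y = eigenvectors[:, 1]
    positive = frozenset(v for v, s in zip(labels, y) if s > 0)
    return {positive, frozenset(labels) - positive}

L6, labels6 = schur_at(L, list(range(1, 9)), 6)
cut = fiedler_cut(L6, labels6)
print(cut)
```

Under these assumed lengths the cut of G{6} is {{1,2},{3,4,5,7,8}}. Placing 6 with {1,2} gives {1,2,6} and {3,4,5,7,8}, which are the two connected pieces of G obtained by cutting the branch between 6 and 8.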
The quotient rule inspires a “Schur tower”
[Figure labels: L, L*]
• How does this help?
⇒ Look at Schur paths to graph with Laplacian L*
The punch line
• The graph with Laplacian L* can be obtained in three ways
• The Fiedler cut of G{6,7,8} must split G{6,7} and G{6,8} and G{7,8}
Recall: Example
• Find eigenvector Y of HD*H with the smallest eigenvalue:
-0.0564
+0.5793
+0.4418
-0.5011
-0.4636
• Signs of Y identify the split {{1,2},{3,4,5}}
The punch line
• The graph with Laplacian L* can be obtained in three ways
{{1,2,6},{3,4,5,7,8}}
• The Fiedler cut of G{6,7,8} must split G{6,7} and G{6,8} and G{7,8}
• This implies that the cut splits the progenitor graph G!
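The three routes can be checked numerically: eliminating the internal nodes in any order ends at the same L*, whose Fiedler cut is a split of the leaves. This sketch uses hypothetical branch lengths (affinity = 1/length) and illustrative helper names.

```python
import numpy as np

# (vertex, vertex, hypothetical branch length); labels 1..8 as on the slides.
EDGES = [(1, 6, 1.0), (2, 6, 1.0), (3, 7, 1.0), (4, 7, 1.0),
         (5, 8, 1.0), (6, 8, 2.0), (7, 8, 1.0)]

L = np.zeros((8, 8))
for a, b, length in EDGES:
    i, j, w = a - 1, b - 1, 1.0 / length
    L[i, i] += w
    L[j, j] += w
    L[i, j] -= w
    L[j, i] -= w

def schur_complement(L, labels, drop):
    """Eliminate the vertices in `drop`; return reduced Laplacian and labels."""
    keep = [k for k, v in enumerate(labels) if v not in drop]
    gone = [k for k, v in enumerate(labels) if v in drop]
    W = L[np.ix_(keep, keep)]
    X = L[np.ix_(keep, gone)]
    Z = L[np.ix_(gone, gone)]
    return W - X @ np.linalg.solve(Z, X.T), [labels[k] for k in keep]

labels = list(range(1, 9))
L_star, leaves = schur_complement(L, labels, {6, 7, 8})

# Three routes: G{6,7} then 8, G{6,8} then 7, G{7,8} then 6.
routes = []
for first, last in [({6, 7}, {8}), ({6, 8}, {7}), ({7, 8}, {6})]:
    partial, partial_labels = schur_complement(L, labels, first)
    final, _ = schur_complement(partial, partial_labels, last)
    routes.append(final)

_, eigenvectors = np.linalg.eigh((L_star + L_star.T) / 2)
y = eigenvectors[:, 1]                      # Fiedler vector of G{6,7,8}
positive = frozenset(v for v, s in zip(leaves, y) if s > 0)
cut = {positive, frozenset(leaves) - positive}
print(all(np.allclose(r, L_star) for r in routes), cut)
```

With these assumed lengths all three routes agree, and the Fiedler cut of G{6,7,8} is the leaf split {{1,2},{3,4,5}}.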
Our solution actually works
The phylogenetic portion D*
• Let H be the centering matrix: $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ (n leaves, $\mathbf{1}$ the all-ones vector)
• Find eigenvector Y of HD*H with the smallest eigenvalue
– The signs of the entries of Y identify a split of the tree
A recipe for tree reconstruction
1. Find a split
– NJ relies on theorem that guarantees (2,n-2) split from Q matrix
– We have a theorem that guarantees splits from HD*H matrix
2. Use knowledge of split to reduce dimension
– NJ prunes the cherry (neighboring taxa) to reduce leaves by one
– We use a divisive method that reduces to pairs of subtrees
3. Iterate until tree has been fully reconstructed
– Tree topology specified by its split set
Reconstruction from the inside out
Connections with Classical MDS and PCoA
• Classical solution to multidimensional scaling
– a.k.a. Principal coordinate analysis
• Recipe for dimension reduction given distance matrix D:
1. Construct matrix A from D entrywise: $x \mapsto -x^2/2$
2. Double centering: B = HAH
3. Find k largest eigenvalues $\lambda_i$ of B with corresponding eigenvectors $X_i$
4. Coordinates of point $P_r$ given by row r of eigenvector entries
⇒ k = 1 with sqrt of tree distance equivalent to our approach
Phylogenetic ordination
• PCoA on sequence data with k = 3:
– For appropriate distance, C1 (x-axis) guaranteed to split taxa at 0
• Our results support popular use of PCoA
– Provided that the right distance is considered…
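The link between the classical MDS recipe and the spectral split can be sketched as follows: feed the square root of the tree distance into the recipe with k = 1 and look at where the first axis changes sign. The distance matrix is hypothetical, `classical_mds` is an illustrative name, and scaling each axis by the square root of its eigenvalue is the usual convention rather than something stated on the slides.

```python
import numpy as np

# Hypothetical leaf-to-leaf tree distances (cherries {1,2} and {3,4}).
D_star = np.array([[0, 2, 5, 5, 4],
                   [2, 0, 5, 5, 4],
                   [5, 5, 0, 2, 3],
                   [5, 5, 2, 0, 3],
                   [4, 4, 3, 3, 0]], dtype=float)

def classical_mds(D, k):
    """Classical MDS / PCoA coordinates in k dimensions."""
    n = D.shape[0]
    A = -(D ** 2) / 2.0                          # step 1: x -> -x^2/2
    H = np.eye(n) - np.ones((n, n)) / n
    B = H @ A @ H                                # step 2: double centering
    eigenvalues, eigenvectors = np.linalg.eigh((B + B.T) / 2)
    top = np.argsort(eigenvalues)[::-1][:k]      # step 3: k largest
    # Step 4; assumes the selected eigenvalues are positive.
    return eigenvectors[:, top] * np.sqrt(eigenvalues[top])

coords = classical_mds(np.sqrt(D_star), k=1)     # sqrt of the tree distance
axis1 = coords[:, 0]
positive = frozenset(i + 1 for i in range(len(axis1)) if axis1[i] > 0)
split = {positive, frozenset(range(1, len(axis1) + 1)) - positive}
print(axis1, split)
```

With the square-root distance, A = -D*/2, so the largest eigenvalue of B is the smallest eigenvalue of HD*H up to a factor of -1/2 and the two methods share an eigenvector. For these assumed distances the first axis separates {1,2} from {3,4,5} at 0.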
Conclusion I
• Natural connection between matrix of pairwise distances
and the Laplacian of a complete graph
Conclusion II
• Structure of tree embedded in complete graph and
recoverable via spectral theory
[Figure labels: NO, NO, YES]
• Notion of “Fiedler cut” extends concept to “Fiedler split”
– Inheritance propagated through Schur tower
Conclusion III
• Results inspire fast divisive tree reconstruction method
Conclusion IV
• Provides guidance and justification for ordination approach
Acknowledgements
• Alex Griffing (NCSU Bioinformatics)
• Carl Meyer (NCSU Math)
• Amy Langville (CoC Math)