Reconstructing regular networks from their trees Abstract

advertisement
Reconstructing regular networks from their trees
Abstract
Suppose N is a network---i.e., an acyclic directed graph with labelled leaves. By
identifying a particular parent for each hybrid vertex, various trees can be displayed
by N. Let Tr(N) denote the set of trees displayed by N. In general there can be
many networks M such that Tr(N) = Tr(M). Suppose that the collection Tr(N) is
given but N is not known. We study how to reconstruct N from Tr(N) in the case
where N is regular, showing that such a regular network is uniquely determined.
For certain smaller collections of trees displayed by a regular network N we give a
more complicated method to reconstruct N. Simpler reconstruction procedures are
given if N is normal.
Reconstructing regular networks from their trees
by Stephen J. Willson
Department of Mathematics
Iowa State University
Ames, Iowa, USA 50011
swillson@iastate.edu
Phylogenetics 2009
Rényi Institute, Budapest, Hungary
June 24, 2009
The basic problem in phylogeny is the reconstruction of evolutionary history from
data.
Most work produces rooted phylogenetic trees.
The simplest assumption is that there is one tree (the species tree). For each gene,
one uses various methods to produce a tree for that gene, called a gene tree. The
hope was that the species tree is typically the same as the gene tree for each gene.
This is often not the case. For example, Rokas, Williams, King, and Carroll 2003:
articles
that commonly used genes may provide better resolution of the
same set of taxa. We repeated ML and MP analyses of nucleotide
sequence on a sample of six commonly used genes (actin, hsp70,
b-tubulin, RNA polymerase II, elongation factor 1-a and 18S
rDNA), and also obtained alternative and often strongly supported
phylogenies (Fig. 1g–l). Specifically, actin and RNA polymerase II
(Fig. 1g, j) each support the same tree as in Fig. 1a, whereas hsp70
(Fig. 1h) supports the topology in Fig. 1b. b-Tubulin (Fig. 1i), 18S
rDNA (Fig. 1l) and elongation factor 1-a (Fig. 1k) each support,
albeit weakly at some branches, trees not represented in Fig. 1a–f.
Therefore, analyses of both commonly used genes and the set of 106
orthologous genes provide strong support for several alternative
phylogenies and fail to indicate which gene tree(s) might represent
the actual species tree.
The alternative phylogenies could have resulted from a number
of different scenarios: (1) most genes could have weakly supported
most phylogenies and strongly supported only a few alternative
trees; (2) most genes could have strongly supported one phylogeny
and a few genes strongly supported only a small number of
alternatives; (3) there could have been some combination of these
scenarios so that each branch among alternative phylogenies had
either weak or strong support depending on the gene. To distinguish
between these possibilities, we identified all of the branches recovered during the single-gene analyses, and recorded each bootstrap
value with respect to the gene and method of analysis (see Supplementary Information). Eight branches were shared by all three
analyses with multiple instances of bootstrap values .50%; hence,
we focus on these eight branches in our analyses.
For each branch, we observed a full range of bootstrap values
(Fig. 2c). Only two branches (branches 1 and 4 in Fig. 2) were
supported with high bootstrap values by a majority of genes. The
remaining six branches were supported by bootstrap values that
range evenly from 0 to 100 (Fig. 2c). Two summary statistics of these
distributions, the mean bootstrap value and the percentage of genes
supporting each branch (Fig. 2c), also indicate weak support overall
for each of these six branches (Fig. 2c). Notably, bootstrap values
for branches 3 versus 6 and among branches 5, 7 and 8 exhibit
significant and strong negative correlations so that a low bootstrap
value for one branch corresponds to a high bootstrap value for an
alternative branch (Pearson correlation coefficients and P-values
are: 20.684 and ,0.001, 20.734 and ,0.001, and 20.607 and
,0.001 for branches 3 versus 6, 5 versus 7, and 5 versus 8,
respectively; values are for ML analysis; similar results were
obtained from the remaining two MP analyses (data not shown)).
Although the values of some other pairs of branches were also
correlated, these were the only pairs of branches exhibiting a very
strong negative correlation across all three analyses. This negative
correlation is indicative of the pervasive direct conflict among genes
with respect to these branches. In summary, the support for a given
branch was strongly dependent on the gene analysed.
Although the trees recovered from single-gene analyses are
incongruent in that they exhibit topological differences, the degree
of conflict among trees could be relatively minor. We examined
the degree of incongruence between the trees recovered by determining the number of taxa that would need to be removed in order
to make two trees congruent (see Methods). These values were
computed for all possible pairwise comparisons among the 106 50%
majority-rule consensus trees generated within each of the three
analyses. The distribution of values shows that only a small
proportion of all pairwise comparisons are in total agreement
(,15%), whereas most of the trees (.50%) require that at least
two of the eight taxa are removed before they become congruent,
regardless of the method of analysis (Fig. 3). These results were
consistent with other metrics of incongruence (see Methods and
Supplementary Information) and indicate extensive incongruence
among the 106 data sets.
Figure 1 Single-gene data sets generate multiple, robustly supported alternative
topologies. Representative alternative trees recovered from analyses of nucleotide data of
106 selected single genes and six commonly used genes are shown. The trees are the
50% majority-rule consensus trees from the genes YBL091C (a), YDL031W (b),
YER005W (c), YGL001C (d), YNL155W (e) and YOL097C (f), as well as those from the
commonly used genes actin (g), hsp70 (h), b-tubulin (i), RNA polymerase II (j) elongation
factor 1-a (k) and 18S rDNA (l). Numbers above branches indicate bootstrap values
(ML on nucleotides/MP on nucleotides).
NATURE | VOL 425 | 23 OCTOBER 2003 | www.nature.com/nature
Factors potentially influencing incongruence between trees
We mitigated some potential sources of incongruence by analysing a
genome-wide sample of orthologous genes. There are a number of
799
If we assume that the underlying description is a species tree, differing gene trees can
be explained by lineage sorting or coalescence models (Degnan and Rosenberg
2006), especially when there are short edges. They can also be explained by
homoplasies.
But there are other events that might occur.
• hybridization
• lateral gene transfer
r
A
1
B
C
E
D
2
F
7
6
3
4
5
The results are networks, not trees.
Assumptions on networks
A network is a rooted directed graph which is acyclic (it has no directed cycles).
r
A
1
B
C
E
D
2
F
7
6
3
4
5
Example. The root is r.
Usually the leaves correspond in a one-to-one manner to species on which
measurements can be made. We identify the leaves with a set X of taxa.
r
A
1
B
C
E
D
2
F
7
6
3
4
5
X = {1, 2, 3, 4, 5, 6, 7}
Call X the leaf-set or base-set.
The root has indegree 0.
A tree-child or normal vertex has indegree 1.
A hybrid vertex has indegree 2 or greater.
r
A
1
B
C
E
D
2
F
7
6
3
4
5
D and 6 are hybrid. r is the root. Other vertices are normal.
So far, a phylogenetic network N = (V, A, r, X) is an acyclic directed graph
V = the set of vertices
A = the set of (directed) arcs
r = the root
X = the base-set
In order to make progress, additional hypotheses will be needed on phylogenetic
networks. There are lots of different choices of hypotheses.
• regular networks (Baroni, Semple, and Steel 2004)
• normal networks (W 2008 )
• time consistent networks (Moret et al 2004; Cardona, Rossalló, and Valiente 2008)
• tree-child networks (Cardona, Rossalló, and Valiente 2007)
• reconstructible networks (Moret et al 2004)
• galled trees (reticulation cycles are edge-disjoint) (Gusfield, Eddhu, and Langley
2004; L. Nakhleh, T. Warnow, C.R. Linder, and K.S. John, 2005)
• galled networks (reticulation cycles may share a tree) (Huson and Klöpper 2007)
• level-k networks (van Iersel et al 2008)
Trees displayed by a network
1
1
2
2
3
3
1
A network
2
3
The trees it displays
Model of evolution on a network
A given gene utilizes one arc to a hybrid and not the other. Thus the arcs occur with
a certain probability.
.7
.3
1
2
3
The probability of the left arc is 0.7.
1
2
3
3
1
2
The left tree occurs with probability 0.7 and the right tree with probability 0.3.
Details about displaying trees
Let N = (V, A, r, X) be a phylogenetic network.
A parent map for N is a map
p: V-{r} ! V
such that for each v , v"r, p(v) is a parent of v.
r
A
1
B
C
E
D
2
F
7
6
3
5
4
p(6) = D, p(D) = C, p(3) = F, p(2) = B, ...
q(6) = D, q(D) = B, q(3) = F, q(2) = B, ...
s(6) = E, s(D) = C, s(3) = F, s(2) = B, ...
t(6) = E, t(D) = B, t(3) = F, t(2) = B, ....
Given a parent map p, obtain a network Np by using only the arcs (p(v),v).
r
A
1
B
C
E
D
2
F
7
6
3
5
4
p(6) = D, p(D) = C, p(3) = F, p(2) = B, ...
r
A
1
B
C
E
D
2
F
7
6
3
5
4
Np. Note that it has vertices of outdegree 1.
Simplifications:
Type 1: Suppress a vertex with outdegree 1.
Type 2: Suppress a vertex with no directed path to a member of X.
The result is T(Np).
r
A
1
B
C
E
D
2
F
7
6
3
Np
4
5
r
A
1
C
2
D
F
7
6
3
5
4
A tree T(Np) displayed by N.
Example:
D
E
F
G
H
2
1
A network N.
3
Parent map p(G) = E, p(H) = E.
Np
D
D
E
F
E
G
H
G
3
2
2
1
1
=
Suppress F since there is no path from F to a member of X.
Suppress H since it has outdegree 1.
3
Suppress D since it has outdegree 1
D
E
E
G
G
1
2
3
T(Np) =
1
2
3
D
Similarly N
E
F
G
H
1
2
3
1
displays
2
3
Let Tr(N) be the set of trees displayed by N.
D
For N =
E
F
G
H
1
2
3
Tr(N) =
1
2
3
1
2
3
Basic question for this talk:
Suppose N is a network and Tr(N) is the set of trees it displays. If we are given
Tr(N) or a suitable subset of Tr(N), how can we find N?
A fundamental difficulty
M=
1
2
3 Tr(M) =
1
2
3
1
2
3
There can be two distinct networks N and M such that Tr(M) = Tr(N).
D
N =
E
F
G
H
1
2
3
M=
1
2
3
Suppose are given Tr(N) but not N.
In order to reconstruct N, we need to make some further assumptions about the
nature of N.
Otherwise, if Tr(M) = Tr(N) we can't reconstruct both uniquely by a determinate
method.
Reconstruction of reticulate networks from gene trees: (D.
Huson, T. Klöpper, P. Lockhart, and M. Steel, 2005).
A network N supports a set D of X-trees provided that for each tree T in D, there
is a refinement T ' of T that is displayed by D.
1. Given a set D of X-trees, find a network N that supports D and has a minimum
number of hybrid vertices.
(Computationally intractable--Bordewich and Semple 2007)
2. Given a set D of X-trees, find a network N that supports D, has only
independent reticulations (eg., is a galled tree), and has a minimal number of hybrid
vertices, if one exists.
(tractable)
3. Given a set D of X-trees, find a network N that supports D, has only
independent or overlapping reticulations, and has a minimal number of hybrid
vertices, if one exists. (tractable if we limit the maximal number of hybrid vertices in
a tangle)
Phylogenetic networks: modeling, reconstructibility, and
accuracy: (B. Moret, L. Nakhleh, T. Warnow, C. R. Linder,
A. Tholse, A. Padolina, J. Sun, and R. Timme, 2004)
If M and N are "reconstructible" networks and Tr(M) = Tr(N), then M and N have
isomorphic "reduced" versions.
Let N= (V, A, r, X) be a phylogenetic network.
For v # V, define the (full) cluster of v in N by
cl(v,N) = {x # X: there is a directed path from v to x}.
r
A
1
B
C
E
D
2
F
7
6
3
4
5
cl(C,N) = {3,4,5,6,7}. cl(B,N) = {2,3,4,5,6}. cl(6,N) = {6}.
Let N= (V, A, r, X) be a phylogenetic network. N is regular if
(1) The cluster map is one-to-one (ie, cl(u,N) = cl(v,N) iff u = v); and
(2) There is an arc (u,v) if and only if
(a) cl(v,N) $ cl(u,N), and
(b) there is no vertex w such that cl(v,N) $ cl(w,N) $ cl(u,N).
(Baroni, Semple, and Steel 2004)
r
A
1
B
C
E
D
2
F
7
6
3
5
4
There is an arc (C,D) since cl(D,N) = {3,4,5,6} $ { 3,4,5,6,7 } = cl(C,N)
and no other cluster fits in between these. This N is regular.
D
E
F
G
H
3
2
1
This N is not regular: cl(E,N) = {1,2,3} = cl(F,N).
Basic fact: To reconstruct a regular network N, we need only find {cl(v,N): v # V}.
We can then use the Hasse construction to find the arcs.
There is an arc (u,v) if and only if
(a) cl(v,N) $ cl(u,N), and
(b) there is no vertex w such that cl(v,N) $ cl(w,N) $ cl(u,N).
Given a collection of clusters
{1,2,3,4,5}, {1}, {2}, {3}, {4}, {5},
{2,3}, {1,2,3}, {2,3,4}, {1,2,3,4}
the Hasse construction yields a regular network
5
1
4
2
3
Main Theorem.
Suppose N = (V,A,r,X) is a regular phylogenetic network. Then there is a procedure
(Maximal Proper Child) which reconstructs N when given Tr(N).
A network N = (V, A, r, X) is normal if
(1) N is regular; and
(2) Every vertex not a leaf has a normal child (tree-child child).
r
A
1
B
C
E
D
2
F
7
6
3
4
N is normal.
5
From every vertex not a leaf there is a directed path to a leaf such that all vertices
except possibly the first are normal.
r
B
C
D
1
3
2
N is regular but not normal.
E
4
Theorem. Suppose N is a normal network. Suppose D % Tr(N) is a collection of
trees displayed by N such that for every arc (u,v) of N there exists a tree T in D
with an arc (u',v') such that
cl(u',T) = cl(u,N) and
cl(v', T) = cl(v,N).
There is a procedure (Maximal Child) to reconstruct N from D.
Example. A network N
r
A
1
B
C
E
D
2
F
7
6
3
5
4
To detect (C,E) we must input a tree containing both the clusters
cl(C,N) = {3,4,5,6,7} and cl(E,N) = {6,7}.
r
A
1
B
C
E
D
2
F
7
6
3
5
4
Remove the red arcs to contain both the clusters
cl(C,N) = {3,4,5,6,7} and cl(E,N) = {6,7}.
r
A
1
B
C
E
2
F
7
6
3
4
5
Corollary. If N is normal then there is a procedure to reconstruct N from Tr(N).
Corollary. If N is normal then Tr(N) uniquely determines N.
Procedure Maximal Child:
Input: A set D of displayed trees with leaf set X.
Output: A set S of subsets of X. The network N will be the Hasse graph of S.
1. Initially S = {X}.
2. Repeat until done:
a. If U in S, let Tr (U) = {T in D: T has the cluster U}.
b. Let C (U) = {clusters of children of U in T, for T in Tr (U)}.
c. Let M (U) = the maximal members of C (U).
d. Adjoin M (U) to S.
Example. A normal network N will be reconstructed from D = Tr(N).
r
B
C
E
D
2
F
1
6
3
5
4
Initially S contains X = {1,2,3,4,5,6}.
Consider U = {1,2,3,4,5,6}.
There is a tree T containing U = {1,2,3,4,5,6} and also cl(B,N) = {2,3,4,5,6} and
{1}.
r
B
C
E
D
2
F
1
6
3
5
4
Hence {2,3,4,5,6} and {1} are in C (U).
There is a tree T containing U = {1,2,3,4,5,6} and also cl(C,N) = {1,3,4,5,6} and
{2}.
r
B
C
E
D
2
F
1
6
3
5
4
Hence {1,3,4,5,6} and {2} are in C (U).
There is a tree T containing U = {1,2,3,4,5,6} and also {2,3,4,5} and {1,6}.
r
B
C
E
D
2
F
1
6
3
4
5
Hence {2,3,4,5} and {1,6} are in C (U).
In fact, C (U) = { {2,3,4,5,6}, {1,3,4,5,6}, {2}, {1}, {1,6}, {2,3,4,5} }
r
B
C
E
D
2
F
1
6
3
4
5
The maximal members are {2,3,4,5,6} and {1,3,4,5,6}. These are the children of U
= {1,2,3,4,5,6}.
Thus S contains the clusters {1,2,3,4,5,6}, {2,3,4,5,6}, {1,3,4,5,6}.
Now let U = {1,3,4,5,6}.
r
B
C
E
D
2
F
1
6
3
4
5
C (U) = {{3,4,5,6}, {1}, {3,4,5}, {1,6}}.
The maximal members are {3,4,5,6} and {1,6}. These will be the children of U.
Thus S contains the clusters {1,2,3,4,5,6}, {2,3,4,5,6}, {1,3,4,5,6}, {3,4,5,6}, {1,6}.
In this manner we find the clusters of all vertices of N, from which N is
reconstructed. This method will work, if N is normal and D satisfies the
hypotheses.
If N is normal, we do not need D = Tr(N).
Theorem. Suppose N is a normal network. Suppose D is a collection of trees
displayed by N such that for every arc (u,v) of N there exists a tree T in D with an
arc (u',v') such that
cl(u',T) = cl(u,N) and cl(v', T) = cl(v,N).
Then the procedure Maximal Child with input D outputs N.
Thus we can get by with certain D such that |D| & a.
Complexity of Procedure Maximal Child.
Input:
X -- n leaves
D -- d trees
V -- v vertices
A -- a arcs
The algorithm terminates in time O(n2 v d2).
Suppose the network is normal. One can show v =O(n2) and a = O(n2).
Hence the total time is O(n4 d2).
If each hybrid has 2 parents, then a normal network has at most n hybrid vertices,
and d & 2n.
If N is normal, then there exists D such that d & a = O(n2) and N can be
reconstructed from D in time O(n2 v a2) = O( n8 )
Normal networks can be reasonably complicated:
1
2
3
...
4
n-2
5
n-1
n
This normal network contains v = (n2 + n + 2) / 2 vertices
and a = (n2 +3 n - 6) / 2 arcs.
A normal network can be level-k for arbitrarily large k.
Why can there be only v = O(n2) vertices and a = O(n2) arcs in a normal network?
To each leaf x in X there is a maximal normal path Px such that each vertex after
the first is normal.
r
B
C
E
D
2
F
1
6
3
4
5
P3 in red, P7 in blue. P6 is trivial.
Each Px has at most n vertices. Every vertex lies on some Px. Hence v = O(n2).
There are at most n (n-1) arcs such that for some x, the arc lies on Px.
r
B
C
E
D
2
F
1
6
3
5
4
All arcs in some Px are in red.
Suppose (u,v) is an arc not in some Px.
If v were normal, then a normal path from v to Px could be extended to u.
Hence v is hybrid, hence the start of some Px.
r
B
C
E
D
2
F
1
6
3
4
5
For such (u,v), v is hybrid starting some Px and u is on some Py for y " x.
But there cannot be 2 vertices u1 and u2 on Py both with arcs to v. Otherwise
(u1,v) or (u2,v) is redundant.
u1
v
u2
x
y
Hence there are at most n (n-1) arcs not on any Px.
Thus a = O(n2) + O(n2) = O(n2).
Maximal Child may NOT work if N is regular but not normal.
r
B
C
4
D
F
5
G
1
6
2
E
3
Consider U = {1,2,3}.
The correct children should be {1} and {2,3}.
r
B
C
4
D
F
5
G
1
6
2
E
3
C (U) = { {1}, {2}, {3}, {2,3}, {1,2}, {1,3} }
The maximal children are {2,3}, {1,2}, and {1,3}, which are wrong.
Main Theorem.
Suppose N = (V,A,r,X) is a regular phylogenetic network. Then there is a procedure
(Maximal Proper Child) which reconstructs N when given Tr(N).
The crux of the revision is the use of "proper trees".
Let N be a network and v0=r, v1, ..., vn be a directed path. A displayed tree T is
proper for the path if there is a directed path w0=r, ..., wn in T such that for all i,
cl(vi,N) = cl(wi,T).
r
B
C
4
D
F
5
G
1
6
2
A directed path in N
E
3
r
B
r
F
5
C
B
C
4
D
6
F
5
G
1
E
2
3
4
D
6
2
A tree proper for the path.
G
1
E
3
r
B
r
F
5
C
B
C
4
D
6
F
5
G
1
E
4
D
6
G
1
2
2
3
This tree is not proper for the path, but it is proper for a different path.
E
3
Procedure Maximal Proper Child
Input: A set D of displayed trees with leafset X.
Output: A set S of subsets of X. The network M will be the Hasse graph of S.
1. Initially S = {X}.
2. Repeat until done:
a. If U in S, choose a path P in the Hasse graph of S from X to U. By induction
this will be a directed path in N.
b. Let PrTr(P) = {T in D: T has the cluster U and T is a proper tree for P}.
c. Let C (U) = {clusters of children of U in T, for T in PrTr(P)}.
d. Let M(U) = the maximal members of C (U).
e. Adjoin M(U) to S.
Main Theorem.
Suppose N = (V,A,r,X) is a regular phylogenetic network. Then Maximal Proper
Child reconstructs N when given D = Tr(N).
Example.
r
B
C
4
D
F
5
1
6
2
X = {1,2,3,4,5,6}.
G
E
3
Input: the trees displayed by N:
T1: p(1) = G, p(2) = E, p(6) = F. Clusters {2,3}, {1,2,3}, {1,2,3,6}, {1,2,3,4,6}.
T2: p(1) = G, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T3: p(1) = G, p(2) = F, p(6) = F. Clusters {1,3}, {2,6}, {1,2,3,6}, {1,2,3,4,6}.
T4: p(1) = G, p(2) = F, p(6) = C. Clusters {1,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T5: p(1) = F, p(2) = E, p(6) = F. Clusters {2,3}, {1,6}, {1,2,3,6}, {1,2,3,4,6}.
T6: p(1) = F, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T7: p(1) = F, p(2) = F, p(6) = F. Clusters {1,2,6}, {1,2,3,6}, {1,2,3,4,6}.
T8: p(1) = F, p(2) = F, p(6) = C. Clusters {1,2}, {1,2,3}, {1,2,3,4}, {5,6}.
All trees are proper for the trivial path at X.
C(X) = { {1,2,3,4,6}, {5}, {1,2,3,4}, {5,6} }
M(X) = { {1,2,3,4,6}, {5,6} }.
r
B
C
4
D
F
5
G
1
6
2
E
3
For the children of B = {1,2,3,4,6} we use only trees proper for the path r, B.
T1: p(1) = G, p(2) = E, p(6) = F. Clusters {2,3}, {1,2,3}, {1,2,3,6}, {1,2,3,4,6}.
T2: p(1) = G, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T3: p(1) = G, p(2) = F, p(6) = F. Clusters {1,3}, {2,6}, {1,2,3,6}, {1,2,3,4,6}.
T4: p(1) = G, p(2) = F, p(6) = C. Clusters {1,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T5: p(1) = F, p(2) = E, p(6) = F. Clusters {2,3}, {1,6}, {1,2,3,6}, {1,2,3,4,6}.
T6: p(1) = F, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T7: p(1) = F, p(2) = F, p(6) = F. Clusters {1,2,6}, {1,2,3,6}, {1,2,3,4,6}.
T8: p(1) = F, p(2) = F, p(6) = C. Clusters {1,2}, {1,2,3}, {1,2,3,4}, {5,6}.
C({1,2,3,4,6}) = { {1,2,3,6}, {4} } = M({1,2,3,4,6})
r
B
C
4
D
F
5
G
1
6
2
E
3
For D = {1,2,3,6} we use only trees proper for the path r, B, D.
T1: p(1) = G, p(2) = E, p(6) = F. Clusters {2,3}, {1,2,3}, {1,2,3,6}, {1,2,3,4,6}.
T2: p(1) = G, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T3: p(1) = G, p(2) = F, p(6) = F. Clusters {1,3}, {2,6}, {1,2,3,6}, {1,2,3,4,6}.
T4: p(1) = G, p(2) = F, p(6) = C. Clusters {1,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T5: p(1) = F, p(2) = E, p(6) = F. Clusters {2,3}, {1,6}, {1,2,3,6}, {1,2,3,4,6}.
T6: p(1) = F, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T7: p(1) = F, p(2) = F, p(6) = F. Clusters {1,2,6}, {1,2,3,6}, {1,2,3,4,6}.
T8: p(1) = F, p(2) = F, p(6) = C. Clusters {1,2}, {1,2,3}, {1,2,3,4}, {5,6}.
C({1,2,3,6}) = { {1,2,3}, {6}, {1,3}, {2,6}, {1,6}, {2,3}, {1,2,6},{3} }
M({1,2,3,6}) = { {1,2,3}, {1,2,6} }
r
B
C
4
D
F
5
G
1
6
2
E
3
For G = {1,2,3} we use only trees proper for r, B, D, G.
These contain {1,2,3,4,5,6}, {1,2,3,4,6}, {1,2,3,6}, {1,2,3}.
T1: p(1) = G, p(2) = E, p(6) = F. Clusters {2,3}, {1,2,3}, {1,2,3,6}, {1,2,3,4,6}.
T2: p(1) = G, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T3: p(1) = G, p(2) = F, p(6) = F. Clusters {1,3}, {2,6}, {1,2,3,6}, {1,2,3,4,6}.
T4: p(1) = G, p(2) = F, p(6) = C. Clusters {1,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T5: p(1) = F, p(2) = E, p(6) = F. Clusters {2,3}, {1,6}, {1,2,3,6}, {1,2,3,4,6}.
T6: p(1) = F, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}.
T7: p(1) = F, p(2) = F, p(6) = F. Clusters {1,2,6}, {1,2,3,6}, {1,2,3,4,6}.
T8: p(1) = F, p(2) = F, p(6) = C. Clusters {1,2}, {1,2,3}, {1,2,3,4}, {5,6}.
C({1,2,3}) = { {2,3}, {1}} = M({1,2,3,6})
r
B
C
4
D
F
5
G
1
6
2
In this manner we reconstruct N.
E
3
Main Theorem.
Suppose N = (V,A,r,X) is a regular phylogenetic network. Then Maximal Proper
Child reconstructs N when given D = Tr(N).
Note:
• We might need proper trees for lots of paths.
• Tr(N) could be a very large set.
Corollary. Suppose M and N are regular phylogenetic networks with base-set X. If
Tr(M) = Tr(N), then M = N.
This is related to a result in Moret et al 2004. They define "reconstructible"
networks N and a reduced form R(N). They assert that if M and N are
reconstructible and Tr(M) = Tr(N), then R(M) = R(N).
They do not give a procedure to find R(N) from Tr(N).
The time complexity of Maximal Proper Child is O(n2 v d2)
X -- n leaves
D -- d trees
V -- v vertices
For a regular network, v =O( 2n ). There can be exponentially many hybrid vertices
so d can be O(2(2n)).
This is much longer than Maximal Child because regular networks can be so much
more complicated than normal networks.
Do we really need all of Tr(N) when N is regular?
Theorem.
Let N = (V, A, r, X) be a regular network.
Let D be a collection of displayed trees such that for each arc (u,v) in A there is a
directed path P from r to v including the arc (u,v) and a tree T in D that is proper for
P.
Then there is a procedure which reconstructs N when given D.
The procedure is a modification of Maximal Proper Child that is careful about the
order in which vertices are considered.
The time complexity is O(n3 v d2) instead of O(n2 v d2).
There exists D such that d & a.
But v = O(2n).
The number of hybrid vertices can be O(2n) so
a may also be exponential in n.
Will Maximal Child work on real data?
There are some important issues:
• Lineage sorting probably occurs in addition to hybridization in networks.
• Homoplasies probably occur.
• Standard maximum likelihood models for trees might need revision.
• Hence not all gene trees may be trees displayed by the network.
• We don't know when a collection D of trees satisfies the hypotheses.
Example
Rokas et al 2003 analyzed a set of 106 yeast genes from total database with 127,026
nucleotide sites. There were 7 yeast genomes, genus Saccharomyces, and one
outgroup from genus Candida. They concatenated the aligned genes and obtained a
tree with 100% bootstrap support at each internal vertex.
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
C. albicans
The tree found by Rokas et al 2003.
Holland, Huber, Moulton, Lockhart 2004 analyzed the data set. They found
consensus networks for the 106 maximum-likelihood trees and also for the 106
maximum-parsimony trees.
Holland et al 2004 consensus network for the 106 maximum-parsimony trees
displaying edges represented in at least 10 of the 106 trees.
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
C. albicans
Analysis of the 106 maximum-likelihood trees on 127,026 sites.
There were 19 different trees.
Tree 1 has 45 gene trees and 61017 DNA sites.
Tree 2 has 19 gene trees and 24156 DNA sites.
Tree 3 has 8 gene trees and 9381 DNA sites.
Tree 4 has 5 gene trees and 6267 DNA sites.
Tree 5 has 4 gene trees and 5463 DNA sites.
Tree 6 has 5 gene trees and 4407 DNA sites.
Tree 7 has 4 gene trees and 2841 DNA sites.
Tree 8 has 3 gene trees and 2013 DNA sites.
Tree 9 has 2 gene trees and 1869 DNA sites.
Tree 10 has 2 gene trees and 1758 DNA sites.
Other trees occurred only once.
Maybe we just want to use some of these trees.
Tree 1 has 45 gene trees and 61,017 DNA sites.
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
C. albicans
The tree found by Rokas et al 2003.
It explains 45 gene trees and 61,017 sites.
Tree 2 has 19 gene trees and 24,156 DNA sites.
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. kluyveri
S. castellii
C. albicans
Tree 1 has 45 gene trees and 61017 DNA sites.
Tree 2 has 19 gene trees and 24156 DNA sites.
If we use Maximal Child with the two most common trees we obtain
S. paradoxus
S. castellii
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. kluyveri
C. albicans
This network is in fact normal and displays Tree 1 and Tree 2.
It explains 64 gene trees and 85173 sites.
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
19/64
S. castellii
S. kluyveri
45/64
C. albicans
Tree 3 has 8 gene trees and 9381 DNA sites.
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
C. albicans
Tree 1 has 45 gene trees and 61017 DNA sites.
Tree 2 has 19 gene trees and 24156 DNA sites.
Tree 3 has 8 gene trees and 9381 DNA sites.
If we use Maximal Child with these three most common trees we obtain
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
C. albicans
This network is normal and displays Trees 1, 2, and 3.
If this is correct we would also expect to see the tree
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
C. albicans
This tree occurred exactly once involving 780 sites.
The other trees displayed by the network were observed 45, 19, and 8 times.
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
.877
S. bayanus
.123
S. castellii
.274
S. kluyveri
.726
C. albicans
Least square estimates for the probabilities of arcs to hybrid vertices are in red.
These are based on the probabilities of each of the four displayed trees.
The network explains 73 gene trees and 95334 sites.
This network is a directed version of the consensus network for the maximumparsimony trees (Holland et al 2004)
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
C. albicans
Tree 1 has 45 gene trees and 61017 DNA sites.
Tree 2 has 19 gene trees and 24156 DNA sites.
Tree 3 has 8 gene trees and 9381 DNA sites.
Tree 4 has 5 gene trees and 6267 DNA sites.
Tree 4 is as follows:
S. bayanus
S. kudriavzevii
S. mikatae
S. paradoxus
S. cerevisiae
S. castellii
S. kluyveri
C. albicans
If we use Maximal Child with these four most common trees we obtain
S. paradoxus
S cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. kluyveri
S. castellii
C. albicans
This network is normal. It has 4 hybrid vertices.
It displays 16 trees. Of these, only 7 are observed.
Reject it.
Possibilities:
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
S. castellii
S. kluyveri
C. albicans
It explains 45 gene trees and 61017 sites.
or
S. paradoxus
S. castellii
S. cerevisiae
S. mikatae
S. kudriavzevii
S. bayanus
.297
S. kluyveri
.703
C. albicans
It explains 64 gene trees and 85173 sites.
or
S. paradoxus
S. cerevisiae
S. mikatae
S. kudriavzevii
.877
S. bayanus
.123
S. castellii
.274
S. kluyveri
.726
C. albicans
It explains 73 gene trees and 95334 sites.
Tree Clusters or Splits
Let N = (V, A, r, X) be a network. The set of tree clusters of N is
TrCl(N) = ' [ { cl(v,T)} : T # Tr(N)].
r
B
C
E
D
2
F
7
6
3
4
5
TrCl(N) = { {1,2,3,4,5,6,7}, {1}, {2}, {3}, {4}, {5}, {6}, {7},
{3,4,5}, {3,4,5,6}, {2,3,4,5,6}, {6,7}, {3,4,5,6,7}, {2,3,4,5} }
Beyond galled trees--decomposition and computation of
galled networks: (D. Huson and T. Klöpper, 2007)
Problem: Given a set ( of splits and a parameter K ) 0, find a minimal galled
network that displays ( and has at most K hybrid vertices in each biconnected
component. (fixed parameter tractable)
Question.
For what classes C of networks is it true that
when M and N are in C and TrCl(M) = TrCl(N), then M = N?
r
B
C
E
D
2
F
7
6
3
4
5
TrCl(N) = { {1,2,3,4,5,6,7}, {1}, {2}, {3}, {4}, {5}, {6}, {7},
{3,4,5}, {3,4,5,6}, {2,3,4,5,6}, {6,7}, {3,4,5,6,7}, {2,3,4,5} }
1
1
8
8
2
3
7
2
6
3
4
5
4
5
6
Two normal networks M and N such that TrCl(M) = TrCl(N).
W 2008 gives additional conditions on the network sufficient for unique
determination of the normal network N from TrCl(N).
7
Conclusions
In order for a network to be determined by its trees, we need to make assumptions
about the network.
Normal networks and, more generally, regular networks can be reconstructed from
appropriate subsets of their displayed trees using variants of the Maximal Child
procedure.
Open problems
Reconstruction of other classes of networks from their displayed trees
Reconstruction of other classes of networks from their tree clusters
Gene trees occur with a certain probability distribution. Give a general procedure to
use this information to reconstruct the networks.
Thank you for your attention.
References
M. Baroni, C. Semple, and M. Steel, 2004. A framework for representing reticulate
evolution. Annals of Combinatorics 8, 391-408.
G. Cardona, F. Rossalló, and G. Valiente, 2007. Comparison of tree-child
phylogenetic networks. To appear in IEEE/ACM Transactions on Computational
Biology and Bioinformatics.
J. Degnan and N. Rosenberg, 2006. Discordance of species trees with their most
likely gene trees. PLoS Genetics 2(5)e68, 762-768.
D. Gusfield, S. Eddhu, and C. Langley, 2004. Optimal, efficient reconstruction of
phylogenetic networks with constrained recombination. Journal of Bioinformatics
and Computational Biology 2, 173-213.
D. Huson, T. Klöpper, P. Lockhart, and M. Steel, 2005. Reconstruction of reticulate
networks from gene trees. In S. Miyano et al. (eds.) RECOMB 2005, LNBI 3500,
233-249.
B. Moret, L. Nakhleh, T. Warnow, C. R. Linder, A. Tholse, A. Padolina, J. Sun, and
R. Timme, 2004. Phylogenetic networks: modeling, reconstructibility, and accuracy.
IEEE Transactions on Computational Biology and Bioinformatics 1, 13-23.
A. Rokas, B. Williams, N. King, and S. Carroll, 2003. Genome-scale approaches to
resolving incongruence in molecular phylogenies. Nature 425, 798-804.
S.J. Willson, 2008. Reconstruction of certain phylogenetic networks from the
genomes at their leaves. Journal of Theoretical Biology 252, 338-349.
Download