Reconstructing regular networks from their trees Abstract Suppose N is a network---i.e., an acyclic directed graph with labelled leaves. By identifying a particular parent for each hybrid vertex, various trees can be displayed by N. Let Tr(N) denote the set of trees displayed by N. In general there can be many networks M such that Tr(N) = Tr(M). Suppose that the collection Tr(N) is given but N is not known. We study how to reconstruct N from Tr(N) in the case where N is regular, showing that such a regular network is uniquely determined. For certain smaller collections of trees displayed by a regular network N we give a more complicated method to reconstruct N. Simpler reconstruction procedures are given if N is normal. Reconstructing regular networks from their trees by Stephen J. Willson Department of Mathematics Iowa State University Ames, Iowa, USA 50011 swillson@iastate.edu Phylogenetics 2009 Rényi Institute, Budapest, Hungary June 24, 2009 The basic problem in phylogeny is the reconstruction of evolutionary history from data. Most work produces rooted phylogenetic trees. The simplest assumption is that there is one tree (the species tree). For each gene, one uses various methods to produce a tree for that gene, called a gene tree. The hope was that the species tree is typically the same as the gene tree for each gene. This is often not the case. For example, Rokas, Williams, King, and Carroll 2003: articles that commonly used genes may provide better resolution of the same set of taxa. We repeated ML and MP analyses of nucleotide sequence on a sample of six commonly used genes (actin, hsp70, b-tubulin, RNA polymerase II, elongation factor 1-a and 18S rDNA), and also obtained alternative and often strongly supported phylogenies (Fig. 1g–l). Specifically, actin and RNA polymerase II (Fig. 1g, j) each support the same tree as in Fig. 1a, whereas hsp70 (Fig. 1h) supports the topology in Fig. 1b. b-Tubulin (Fig. 1i), 18S rDNA (Fig. 1l) and elongation factor 1-a (Fig. 1k) each support, albeit weakly at some branches, trees not represented in Fig. 1a–f. Therefore, analyses of both commonly used genes and the set of 106 orthologous genes provide strong support for several alternative phylogenies and fail to indicate which gene tree(s) might represent the actual species tree. The alternative phylogenies could have resulted from a number of different scenarios: (1) most genes could have weakly supported most phylogenies and strongly supported only a few alternative trees; (2) most genes could have strongly supported one phylogeny and a few genes strongly supported only a small number of alternatives; (3) there could have been some combination of these scenarios so that each branch among alternative phylogenies had either weak or strong support depending on the gene. To distinguish between these possibilities, we identified all of the branches recovered during the single-gene analyses, and recorded each bootstrap value with respect to the gene and method of analysis (see Supplementary Information). Eight branches were shared by all three analyses with multiple instances of bootstrap values .50%; hence, we focus on these eight branches in our analyses. For each branch, we observed a full range of bootstrap values (Fig. 2c). Only two branches (branches 1 and 4 in Fig. 2) were supported with high bootstrap values by a majority of genes. The remaining six branches were supported by bootstrap values that range evenly from 0 to 100 (Fig. 2c). Two summary statistics of these distributions, the mean bootstrap value and the percentage of genes supporting each branch (Fig. 2c), also indicate weak support overall for each of these six branches (Fig. 2c). Notably, bootstrap values for branches 3 versus 6 and among branches 5, 7 and 8 exhibit significant and strong negative correlations so that a low bootstrap value for one branch corresponds to a high bootstrap value for an alternative branch (Pearson correlation coefficients and P-values are: 20.684 and ,0.001, 20.734 and ,0.001, and 20.607 and ,0.001 for branches 3 versus 6, 5 versus 7, and 5 versus 8, respectively; values are for ML analysis; similar results were obtained from the remaining two MP analyses (data not shown)). Although the values of some other pairs of branches were also correlated, these were the only pairs of branches exhibiting a very strong negative correlation across all three analyses. This negative correlation is indicative of the pervasive direct conflict among genes with respect to these branches. In summary, the support for a given branch was strongly dependent on the gene analysed. Although the trees recovered from single-gene analyses are incongruent in that they exhibit topological differences, the degree of conflict among trees could be relatively minor. We examined the degree of incongruence between the trees recovered by determining the number of taxa that would need to be removed in order to make two trees congruent (see Methods). These values were computed for all possible pairwise comparisons among the 106 50% majority-rule consensus trees generated within each of the three analyses. The distribution of values shows that only a small proportion of all pairwise comparisons are in total agreement (,15%), whereas most of the trees (.50%) require that at least two of the eight taxa are removed before they become congruent, regardless of the method of analysis (Fig. 3). These results were consistent with other metrics of incongruence (see Methods and Supplementary Information) and indicate extensive incongruence among the 106 data sets. Figure 1 Single-gene data sets generate multiple, robustly supported alternative topologies. Representative alternative trees recovered from analyses of nucleotide data of 106 selected single genes and six commonly used genes are shown. The trees are the 50% majority-rule consensus trees from the genes YBL091C (a), YDL031W (b), YER005W (c), YGL001C (d), YNL155W (e) and YOL097C (f), as well as those from the commonly used genes actin (g), hsp70 (h), b-tubulin (i), RNA polymerase II (j) elongation factor 1-a (k) and 18S rDNA (l). Numbers above branches indicate bootstrap values (ML on nucleotides/MP on nucleotides). NATURE | VOL 425 | 23 OCTOBER 2003 | www.nature.com/nature Factors potentially influencing incongruence between trees We mitigated some potential sources of incongruence by analysing a genome-wide sample of orthologous genes. There are a number of 799 If we assume that the underlying description is a species tree, differing gene trees can be explained by lineage sorting or coalescence models (Degnan and Rosenberg 2006), especially when there are short edges. They can also be explained by homoplasies. But there are other events that might occur. • hybridization • lateral gene transfer r A 1 B C E D 2 F 7 6 3 4 5 The results are networks, not trees. Assumptions on networks A network is a rooted directed graph which is acyclic (it has no directed cycles). r A 1 B C E D 2 F 7 6 3 4 5 Example. The root is r. Usually the leaves correspond in a one-to-one manner to species on which measurements can be made. We identify the leaves with a set X of taxa. r A 1 B C E D 2 F 7 6 3 4 5 X = {1, 2, 3, 4, 5, 6, 7} Call X the leaf-set or base-set. The root has indegree 0. A tree-child or normal vertex has indegree 1. A hybrid vertex has indegree 2 or greater. r A 1 B C E D 2 F 7 6 3 4 5 D and 6 are hybrid. r is the root. Other vertices are normal. So far, a phylogenetic network N = (V, A, r, X) is an acyclic directed graph V = the set of vertices A = the set of (directed) arcs r = the root X = the base-set In order to make progress, additional hypotheses will be needed on phylogenetic networks. There are lots of different choices of hypotheses. • regular networks (Baroni, Semple, and Steel 2004) • normal networks (W 2008 ) • time consistent networks (Moret et al 2004; Cardona, Rossalló, and Valiente 2008) • tree-child networks (Cardona, Rossalló, and Valiente 2007) • reconstructible networks (Moret et al 2004) • galled trees (reticulation cycles are edge-disjoint) (Gusfield, Eddhu, and Langley 2004; L. Nakhleh, T. Warnow, C.R. Linder, and K.S. John, 2005) • galled networks (reticulation cycles may share a tree) (Huson and Klöpper 2007) • level-k networks (van Iersel et al 2008) Trees displayed by a network 1 1 2 2 3 3 1 A network 2 3 The trees it displays Model of evolution on a network A given gene utilizes one arc to a hybrid and not the other. Thus the arcs occur with a certain probability. .7 .3 1 2 3 The probability of the left arc is 0.7. 1 2 3 3 1 2 The left tree occurs with probability 0.7 and the right tree with probability 0.3. Details about displaying trees Let N = (V, A, r, X) be a phylogenetic network. A parent map for N is a map p: V-{r} ! V such that for each v , v"r, p(v) is a parent of v. r A 1 B C E D 2 F 7 6 3 5 4 p(6) = D, p(D) = C, p(3) = F, p(2) = B, ... q(6) = D, q(D) = B, q(3) = F, q(2) = B, ... s(6) = E, s(D) = C, s(3) = F, s(2) = B, ... t(6) = E, t(D) = B, t(3) = F, t(2) = B, .... Given a parent map p, obtain a network Np by using only the arcs (p(v),v). r A 1 B C E D 2 F 7 6 3 5 4 p(6) = D, p(D) = C, p(3) = F, p(2) = B, ... r A 1 B C E D 2 F 7 6 3 5 4 Np. Note that it has vertices of outdegree 1. Simplifications: Type 1: Suppress a vertex with outdegree 1. Type 2: Suppress a vertex with no directed path to a member of X. The result is T(Np). r A 1 B C E D 2 F 7 6 3 Np 4 5 r A 1 C 2 D F 7 6 3 5 4 A tree T(Np) displayed by N. Example: D E F G H 2 1 A network N. 3 Parent map p(G) = E, p(H) = E. Np D D E F E G H G 3 2 2 1 1 = Suppress F since there is no path from F to a member of X. Suppress H since it has outdegree 1. 3 Suppress D since it has outdegree 1 D E E G G 1 2 3 T(Np) = 1 2 3 D Similarly N E F G H 1 2 3 1 displays 2 3 Let Tr(N) be the set of trees displayed by N. D For N = E F G H 1 2 3 Tr(N) = 1 2 3 1 2 3 Basic question for this talk: Suppose N is a network and Tr(N) is the set of trees it displays. If we are given Tr(N) or a suitable subset of Tr(N), how can we find N? A fundamental difficulty M= 1 2 3 Tr(M) = 1 2 3 1 2 3 There can be two distinct networks N and M such that Tr(M) = Tr(N). D N = E F G H 1 2 3 M= 1 2 3 Suppose are given Tr(N) but not N. In order to reconstruct N, we need to make some further assumptions about the nature of N. Otherwise, if Tr(M) = Tr(N) we can't reconstruct both uniquely by a determinate method. Reconstruction of reticulate networks from gene trees: (D. Huson, T. Klöpper, P. Lockhart, and M. Steel, 2005). A network N supports a set D of X-trees provided that for each tree T in D, there is a refinement T ' of T that is displayed by D. 1. Given a set D of X-trees, find a network N that supports D and has a minimum number of hybrid vertices. (Computationally intractable--Bordewich and Semple 2007) 2. Given a set D of X-trees, find a network N that supports D, has only independent reticulations (eg., is a galled tree), and has a minimal number of hybrid vertices, if one exists. (tractable) 3. Given a set D of X-trees, find a network N that supports D, has only independent or overlapping reticulations, and has a minimal number of hybrid vertices, if one exists. (tractable if we limit the maximal number of hybrid vertices in a tangle) Phylogenetic networks: modeling, reconstructibility, and accuracy: (B. Moret, L. Nakhleh, T. Warnow, C. R. Linder, A. Tholse, A. Padolina, J. Sun, and R. Timme, 2004) If M and N are "reconstructible" networks and Tr(M) = Tr(N), then M and N have isomorphic "reduced" versions. Let N= (V, A, r, X) be a phylogenetic network. For v # V, define the (full) cluster of v in N by cl(v,N) = {x # X: there is a directed path from v to x}. r A 1 B C E D 2 F 7 6 3 4 5 cl(C,N) = {3,4,5,6,7}. cl(B,N) = {2,3,4,5,6}. cl(6,N) = {6}. Let N= (V, A, r, X) be a phylogenetic network. N is regular if (1) The cluster map is one-to-one (ie, cl(u,N) = cl(v,N) iff u = v); and (2) There is an arc (u,v) if and only if (a) cl(v,N) $ cl(u,N), and (b) there is no vertex w such that cl(v,N) $ cl(w,N) $ cl(u,N). (Baroni, Semple, and Steel 2004) r A 1 B C E D 2 F 7 6 3 5 4 There is an arc (C,D) since cl(D,N) = {3,4,5,6} $ { 3,4,5,6,7 } = cl(C,N) and no other cluster fits in between these. This N is regular. D E F G H 3 2 1 This N is not regular: cl(E,N) = {1,2,3} = cl(F,N). Basic fact: To reconstruct a regular network N, we need only find {cl(v,N): v # V}. We can then use the Hasse construction to find the arcs. There is an arc (u,v) if and only if (a) cl(v,N) $ cl(u,N), and (b) there is no vertex w such that cl(v,N) $ cl(w,N) $ cl(u,N). Given a collection of clusters {1,2,3,4,5}, {1}, {2}, {3}, {4}, {5}, {2,3}, {1,2,3}, {2,3,4}, {1,2,3,4} the Hasse construction yields a regular network 5 1 4 2 3 Main Theorem. Suppose N = (V,A,r,X) is a regular phylogenetic network. Then there is a procedure (Maximal Proper Child) which reconstructs N when given Tr(N). A network N = (V, A, r, X) is normal if (1) N is regular; and (2) Every vertex not a leaf has a normal child (tree-child child). r A 1 B C E D 2 F 7 6 3 4 N is normal. 5 From every vertex not a leaf there is a directed path to a leaf such that all vertices except possibly the first are normal. r B C D 1 3 2 N is regular but not normal. E 4 Theorem. Suppose N is a normal network. Suppose D % Tr(N) is a collection of trees displayed by N such that for every arc (u,v) of N there exists a tree T in D with an arc (u',v') such that cl(u',T) = cl(u,N) and cl(v', T) = cl(v,N). There is a procedure (Maximal Child) to reconstruct N from D. Example. A network N r A 1 B C E D 2 F 7 6 3 5 4 To detect (C,E) we must input a tree containing both the clusters cl(C,N) = {3,4,5,6,7} and cl(E,N) = {6,7}. r A 1 B C E D 2 F 7 6 3 5 4 Remove the red arcs to contain both the clusters cl(C,N) = {3,4,5,6,7} and cl(E,N) = {6,7}. r A 1 B C E 2 F 7 6 3 4 5 Corollary. If N is normal then there is a procedure to reconstruct N from Tr(N). Corollary. If N is normal then Tr(N) uniquely determines N. Procedure Maximal Child: Input: A set D of displayed trees with leaf set X. Output: A set S of subsets of X. The network N will be the Hasse graph of S. 1. Initially S = {X}. 2. Repeat until done: a. If U in S, let Tr (U) = {T in D: T has the cluster U}. b. Let C (U) = {clusters of children of U in T, for T in Tr (U)}. c. Let M (U) = the maximal members of C (U). d. Adjoin M (U) to S. Example. A normal network N will be reconstructed from D = Tr(N). r B C E D 2 F 1 6 3 5 4 Initially S contains X = {1,2,3,4,5,6}. Consider U = {1,2,3,4,5,6}. There is a tree T containing U = {1,2,3,4,5,6} and also cl(B,N) = {2,3,4,5,6} and {1}. r B C E D 2 F 1 6 3 5 4 Hence {2,3,4,5,6} and {1} are in C (U). There is a tree T containing U = {1,2,3,4,5,6} and also cl(C,N) = {1,3,4,5,6} and {2}. r B C E D 2 F 1 6 3 5 4 Hence {1,3,4,5,6} and {2} are in C (U). There is a tree T containing U = {1,2,3,4,5,6} and also {2,3,4,5} and {1,6}. r B C E D 2 F 1 6 3 4 5 Hence {2,3,4,5} and {1,6} are in C (U). In fact, C (U) = { {2,3,4,5,6}, {1,3,4,5,6}, {2}, {1}, {1,6}, {2,3,4,5} } r B C E D 2 F 1 6 3 4 5 The maximal members are {2,3,4,5,6} and {1,3,4,5,6}. These are the children of U = {1,2,3,4,5,6}. Thus S contains the clusters {1,2,3,4,5,6}, {2,3,4,5,6}, {1,3,4,5,6}. Now let U = {1,3,4,5,6}. r B C E D 2 F 1 6 3 4 5 C (U) = {{3,4,5,6}, {1}, {3,4,5}, {1,6}}. The maximal members are {3,4,5,6} and {1,6}. These will be the children of U. Thus S contains the clusters {1,2,3,4,5,6}, {2,3,4,5,6}, {1,3,4,5,6}, {3,4,5,6}, {1,6}. In this manner we find the clusters of all vertices of N, from which N is reconstructed. This method will work, if N is normal and D satisfies the hypotheses. If N is normal, we do not need D = Tr(N). Theorem. Suppose N is a normal network. Suppose D is a collection of trees displayed by N such that for every arc (u,v) of N there exists a tree T in D with an arc (u',v') such that cl(u',T) = cl(u,N) and cl(v', T) = cl(v,N). Then the procedure Maximal Child with input D outputs N. Thus we can get by with certain D such that |D| & a. Complexity of Procedure Maximal Child. Input: X -- n leaves D -- d trees V -- v vertices A -- a arcs The algorithm terminates in time O(n2 v d2). Suppose the network is normal. One can show v =O(n2) and a = O(n2). Hence the total time is O(n4 d2). If each hybrid has 2 parents, then a normal network has at most n hybrid vertices, and d & 2n. If N is normal, then there exists D such that d & a = O(n2) and N can be reconstructed from D in time O(n2 v a2) = O( n8 ) Normal networks can be reasonably complicated: 1 2 3 ... 4 n-2 5 n-1 n This normal network contains v = (n2 + n + 2) / 2 vertices and a = (n2 +3 n - 6) / 2 arcs. A normal network can be level-k for arbitrarily large k. Why can there be only v = O(n2) vertices and a = O(n2) arcs in a normal network? To each leaf x in X there is a maximal normal path Px such that each vertex after the first is normal. r B C E D 2 F 1 6 3 4 5 P3 in red, P7 in blue. P6 is trivial. Each Px has at most n vertices. Every vertex lies on some Px. Hence v = O(n2). There are at most n (n-1) arcs such that for some x, the arc lies on Px. r B C E D 2 F 1 6 3 5 4 All arcs in some Px are in red. Suppose (u,v) is an arc not in some Px. If v were normal, then a normal path from v to Px could be extended to u. Hence v is hybrid, hence the start of some Px. r B C E D 2 F 1 6 3 4 5 For such (u,v), v is hybrid starting some Px and u is on some Py for y " x. But there cannot be 2 vertices u1 and u2 on Py both with arcs to v. Otherwise (u1,v) or (u2,v) is redundant. u1 v u2 x y Hence there are at most n (n-1) arcs not on any Px. Thus a = O(n2) + O(n2) = O(n2). Maximal Child may NOT work if N is regular but not normal. r B C 4 D F 5 G 1 6 2 E 3 Consider U = {1,2,3}. The correct children should be {1} and {2,3}. r B C 4 D F 5 G 1 6 2 E 3 C (U) = { {1}, {2}, {3}, {2,3}, {1,2}, {1,3} } The maximal children are {2,3}, {1,2}, and {1,3}, which are wrong. Main Theorem. Suppose N = (V,A,r,X) is a regular phylogenetic network. Then there is a procedure (Maximal Proper Child) which reconstructs N when given Tr(N). The crux of the revision is the use of "proper trees". Let N be a network and v0=r, v1, ..., vn be a directed path. A displayed tree T is proper for the path if there is a directed path w0=r, ..., wn in T such that for all i, cl(vi,N) = cl(wi,T). r B C 4 D F 5 G 1 6 2 A directed path in N E 3 r B r F 5 C B C 4 D 6 F 5 G 1 E 2 3 4 D 6 2 A tree proper for the path. G 1 E 3 r B r F 5 C B C 4 D 6 F 5 G 1 E 4 D 6 G 1 2 2 3 This tree is not proper for the path, but it is proper for a different path. E 3 Procedure Maximal Proper Child Input: A set D of displayed trees with leafset X. Output: A set S of subsets of X. The network M will be the Hasse graph of S. 1. Initially S = {X}. 2. Repeat until done: a. If U in S, choose a path P in the Hasse graph of S from X to U. By induction this will be a directed path in N. b. Let PrTr(P) = {T in D: T has the cluster U and T is a proper tree for P}. c. Let C (U) = {clusters of children of U in T, for T in PrTr(P)}. d. Let M(U) = the maximal members of C (U). e. Adjoin M(U) to S. Main Theorem. Suppose N = (V,A,r,X) is a regular phylogenetic network. Then Maximal Proper Child reconstructs N when given D = Tr(N). Example. r B C 4 D F 5 1 6 2 X = {1,2,3,4,5,6}. G E 3 Input: the trees displayed by N: T1: p(1) = G, p(2) = E, p(6) = F. Clusters {2,3}, {1,2,3}, {1,2,3,6}, {1,2,3,4,6}. T2: p(1) = G, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}. T3: p(1) = G, p(2) = F, p(6) = F. Clusters {1,3}, {2,6}, {1,2,3,6}, {1,2,3,4,6}. T4: p(1) = G, p(2) = F, p(6) = C. Clusters {1,3}, {1,2,3}, {1,2,3,4}, {5,6}. T5: p(1) = F, p(2) = E, p(6) = F. Clusters {2,3}, {1,6}, {1,2,3,6}, {1,2,3,4,6}. T6: p(1) = F, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}. T7: p(1) = F, p(2) = F, p(6) = F. Clusters {1,2,6}, {1,2,3,6}, {1,2,3,4,6}. T8: p(1) = F, p(2) = F, p(6) = C. Clusters {1,2}, {1,2,3}, {1,2,3,4}, {5,6}. All trees are proper for the trivial path at X. C(X) = { {1,2,3,4,6}, {5}, {1,2,3,4}, {5,6} } M(X) = { {1,2,3,4,6}, {5,6} }. r B C 4 D F 5 G 1 6 2 E 3 For the children of B = {1,2,3,4,6} we use only trees proper for the path r, B. T1: p(1) = G, p(2) = E, p(6) = F. Clusters {2,3}, {1,2,3}, {1,2,3,6}, {1,2,3,4,6}. T2: p(1) = G, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}. T3: p(1) = G, p(2) = F, p(6) = F. Clusters {1,3}, {2,6}, {1,2,3,6}, {1,2,3,4,6}. T4: p(1) = G, p(2) = F, p(6) = C. Clusters {1,3}, {1,2,3}, {1,2,3,4}, {5,6}. T5: p(1) = F, p(2) = E, p(6) = F. Clusters {2,3}, {1,6}, {1,2,3,6}, {1,2,3,4,6}. T6: p(1) = F, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}. T7: p(1) = F, p(2) = F, p(6) = F. Clusters {1,2,6}, {1,2,3,6}, {1,2,3,4,6}. T8: p(1) = F, p(2) = F, p(6) = C. Clusters {1,2}, {1,2,3}, {1,2,3,4}, {5,6}. C({1,2,3,4,6}) = { {1,2,3,6}, {4} } = M({1,2,3,4,6}) r B C 4 D F 5 G 1 6 2 E 3 For D = {1,2,3,6} we use only trees proper for the path r, B, D. T1: p(1) = G, p(2) = E, p(6) = F. Clusters {2,3}, {1,2,3}, {1,2,3,6}, {1,2,3,4,6}. T2: p(1) = G, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}. T3: p(1) = G, p(2) = F, p(6) = F. Clusters {1,3}, {2,6}, {1,2,3,6}, {1,2,3,4,6}. T4: p(1) = G, p(2) = F, p(6) = C. Clusters {1,3}, {1,2,3}, {1,2,3,4}, {5,6}. T5: p(1) = F, p(2) = E, p(6) = F. Clusters {2,3}, {1,6}, {1,2,3,6}, {1,2,3,4,6}. T6: p(1) = F, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}. T7: p(1) = F, p(2) = F, p(6) = F. Clusters {1,2,6}, {1,2,3,6}, {1,2,3,4,6}. T8: p(1) = F, p(2) = F, p(6) = C. Clusters {1,2}, {1,2,3}, {1,2,3,4}, {5,6}. C({1,2,3,6}) = { {1,2,3}, {6}, {1,3}, {2,6}, {1,6}, {2,3}, {1,2,6},{3} } M({1,2,3,6}) = { {1,2,3}, {1,2,6} } r B C 4 D F 5 G 1 6 2 E 3 For G = {1,2,3} we use only trees proper for r, B, D, G. These contain {1,2,3,4,5,6}, {1,2,3,4,6}, {1,2,3,6}, {1,2,3}. T1: p(1) = G, p(2) = E, p(6) = F. Clusters {2,3}, {1,2,3}, {1,2,3,6}, {1,2,3,4,6}. T2: p(1) = G, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}. T3: p(1) = G, p(2) = F, p(6) = F. Clusters {1,3}, {2,6}, {1,2,3,6}, {1,2,3,4,6}. T4: p(1) = G, p(2) = F, p(6) = C. Clusters {1,3}, {1,2,3}, {1,2,3,4}, {5,6}. T5: p(1) = F, p(2) = E, p(6) = F. Clusters {2,3}, {1,6}, {1,2,3,6}, {1,2,3,4,6}. T6: p(1) = F, p(2) = E, p(6) = C. Clusters {2,3}, {1,2,3}, {1,2,3,4}, {5,6}. T7: p(1) = F, p(2) = F, p(6) = F. Clusters {1,2,6}, {1,2,3,6}, {1,2,3,4,6}. T8: p(1) = F, p(2) = F, p(6) = C. Clusters {1,2}, {1,2,3}, {1,2,3,4}, {5,6}. C({1,2,3}) = { {2,3}, {1}} = M({1,2,3,6}) r B C 4 D F 5 G 1 6 2 In this manner we reconstruct N. E 3 Main Theorem. Suppose N = (V,A,r,X) is a regular phylogenetic network. Then Maximal Proper Child reconstructs N when given D = Tr(N). Note: • We might need proper trees for lots of paths. • Tr(N) could be a very large set. Corollary. Suppose M and N are regular phylogenetic networks with base-set X. If Tr(M) = Tr(N), then M = N. This is related to a result in Moret et al 2004. They define "reconstructible" networks N and a reduced form R(N). They assert that if M and N are reconstructible and Tr(M) = Tr(N), then R(M) = R(N). They do not give a procedure to find R(N) from Tr(N). The time complexity of Maximal Proper Child is O(n2 v d2) X -- n leaves D -- d trees V -- v vertices For a regular network, v =O( 2n ). There can be exponentially many hybrid vertices so d can be O(2(2n)). This is much longer than Maximal Child because regular networks can be so much more complicated than normal networks. Do we really need all of Tr(N) when N is regular? Theorem. Let N = (V, A, r, X) be a regular network. Let D be a collection of displayed trees such that for each arc (u,v) in A there is a directed path P from r to v including the arc (u,v) and a tree T in D that is proper for P. Then there is a procedure which reconstructs N when given D. The procedure is a modification of Maximal Proper Child that is careful about the order in which vertices are considered. The time complexity is O(n3 v d2) instead of O(n2 v d2). There exists D such that d & a. But v = O(2n). The number of hybrid vertices can be O(2n) so a may also be exponential in n. Will Maximal Child work on real data? There are some important issues: • Lineage sorting probably occurs in addition to hybridization in networks. • Homoplasies probably occur. • Standard maximum likelihood models for trees might need revision. • Hence not all gene trees may be trees displayed by the network. • We don't know when a collection D of trees satisfies the hypotheses. Example Rokas et al 2003 analyzed a set of 106 yeast genes from total database with 127,026 nucleotide sites. There were 7 yeast genomes, genus Saccharomyces, and one outgroup from genus Candida. They concatenated the aligned genes and obtained a tree with 100% bootstrap support at each internal vertex. S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. castellii S. kluyveri C. albicans The tree found by Rokas et al 2003. Holland, Huber, Moulton, Lockhart 2004 analyzed the data set. They found consensus networks for the 106 maximum-likelihood trees and also for the 106 maximum-parsimony trees. Holland et al 2004 consensus network for the 106 maximum-parsimony trees displaying edges represented in at least 10 of the 106 trees. S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. castellii S. kluyveri C. albicans Analysis of the 106 maximum-likelihood trees on 127,026 sites. There were 19 different trees. Tree 1 has 45 gene trees and 61017 DNA sites. Tree 2 has 19 gene trees and 24156 DNA sites. Tree 3 has 8 gene trees and 9381 DNA sites. Tree 4 has 5 gene trees and 6267 DNA sites. Tree 5 has 4 gene trees and 5463 DNA sites. Tree 6 has 5 gene trees and 4407 DNA sites. Tree 7 has 4 gene trees and 2841 DNA sites. Tree 8 has 3 gene trees and 2013 DNA sites. Tree 9 has 2 gene trees and 1869 DNA sites. Tree 10 has 2 gene trees and 1758 DNA sites. Other trees occurred only once. Maybe we just want to use some of these trees. Tree 1 has 45 gene trees and 61,017 DNA sites. S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. castellii S. kluyveri C. albicans The tree found by Rokas et al 2003. It explains 45 gene trees and 61,017 sites. Tree 2 has 19 gene trees and 24,156 DNA sites. S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. kluyveri S. castellii C. albicans Tree 1 has 45 gene trees and 61017 DNA sites. Tree 2 has 19 gene trees and 24156 DNA sites. If we use Maximal Child with the two most common trees we obtain S. paradoxus S. castellii S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. kluyveri C. albicans This network is in fact normal and displays Tree 1 and Tree 2. It explains 64 gene trees and 85173 sites. S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus 19/64 S. castellii S. kluyveri 45/64 C. albicans Tree 3 has 8 gene trees and 9381 DNA sites. S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. castellii S. kluyveri C. albicans Tree 1 has 45 gene trees and 61017 DNA sites. Tree 2 has 19 gene trees and 24156 DNA sites. Tree 3 has 8 gene trees and 9381 DNA sites. If we use Maximal Child with these three most common trees we obtain S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. castellii S. kluyveri C. albicans This network is normal and displays Trees 1, 2, and 3. If this is correct we would also expect to see the tree S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. castellii S. kluyveri C. albicans This tree occurred exactly once involving 780 sites. The other trees displayed by the network were observed 45, 19, and 8 times. S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii .877 S. bayanus .123 S. castellii .274 S. kluyveri .726 C. albicans Least square estimates for the probabilities of arcs to hybrid vertices are in red. These are based on the probabilities of each of the four displayed trees. The network explains 73 gene trees and 95334 sites. This network is a directed version of the consensus network for the maximumparsimony trees (Holland et al 2004) S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. castellii S. kluyveri C. albicans Tree 1 has 45 gene trees and 61017 DNA sites. Tree 2 has 19 gene trees and 24156 DNA sites. Tree 3 has 8 gene trees and 9381 DNA sites. Tree 4 has 5 gene trees and 6267 DNA sites. Tree 4 is as follows: S. bayanus S. kudriavzevii S. mikatae S. paradoxus S. cerevisiae S. castellii S. kluyveri C. albicans If we use Maximal Child with these four most common trees we obtain S. paradoxus S cerevisiae S. mikatae S. kudriavzevii S. bayanus S. kluyveri S. castellii C. albicans This network is normal. It has 4 hybrid vertices. It displays 16 trees. Of these, only 7 are observed. Reject it. Possibilities: S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii S. bayanus S. castellii S. kluyveri C. albicans It explains 45 gene trees and 61017 sites. or S. paradoxus S. castellii S. cerevisiae S. mikatae S. kudriavzevii S. bayanus .297 S. kluyveri .703 C. albicans It explains 64 gene trees and 85173 sites. or S. paradoxus S. cerevisiae S. mikatae S. kudriavzevii .877 S. bayanus .123 S. castellii .274 S. kluyveri .726 C. albicans It explains 73 gene trees and 95334 sites. Tree Clusters or Splits Let N = (V, A, r, X) be a network. The set of tree clusters of N is TrCl(N) = ' [ { cl(v,T)} : T # Tr(N)]. r B C E D 2 F 7 6 3 4 5 TrCl(N) = { {1,2,3,4,5,6,7}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {3,4,5}, {3,4,5,6}, {2,3,4,5,6}, {6,7}, {3,4,5,6,7}, {2,3,4,5} } Beyond galled trees--decomposition and computation of galled networks: (D. Huson and T. Klöpper, 2007) Problem: Given a set ( of splits and a parameter K ) 0, find a minimal galled network that displays ( and has at most K hybrid vertices in each biconnected component. (fixed parameter tractable) Question. For what classes C of networks is it true that when M and N are in C and TrCl(M) = TrCl(N), then M = N? r B C E D 2 F 7 6 3 4 5 TrCl(N) = { {1,2,3,4,5,6,7}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {3,4,5}, {3,4,5,6}, {2,3,4,5,6}, {6,7}, {3,4,5,6,7}, {2,3,4,5} } 1 1 8 8 2 3 7 2 6 3 4 5 4 5 6 Two normal networks M and N such that TrCl(M) = TrCl(N). W 2008 gives additional conditions on the network sufficient for unique determination of the normal network N from TrCl(N). 7 Conclusions In order for a network to be determined by its trees, we need to make assumptions about the network. Normal networks and, more generally, regular networks can be reconstructed from appropriate subsets of their displayed trees using variants of the Maximal Child procedure. Open problems Reconstruction of other classes of networks from their displayed trees Reconstruction of other classes of networks from their tree clusters Gene trees occur with a certain probability distribution. Give a general procedure to use this information to reconstruct the networks. Thank you for your attention. References M. Baroni, C. Semple, and M. Steel, 2004. A framework for representing reticulate evolution. Annals of Combinatorics 8, 391-408. G. Cardona, F. Rossalló, and G. Valiente, 2007. Comparison of tree-child phylogenetic networks. To appear in IEEE/ACM Transactions on Computational Biology and Bioinformatics. J. Degnan and N. Rosenberg, 2006. Discordance of species trees with their most likely gene trees. PLoS Genetics 2(5)e68, 762-768. D. Gusfield, S. Eddhu, and C. Langley, 2004. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. Journal of Bioinformatics and Computational Biology 2, 173-213. D. Huson, T. Klöpper, P. Lockhart, and M. Steel, 2005. Reconstruction of reticulate networks from gene trees. In S. Miyano et al. (eds.) RECOMB 2005, LNBI 3500, 233-249. B. Moret, L. Nakhleh, T. Warnow, C. R. Linder, A. Tholse, A. Padolina, J. Sun, and R. Timme, 2004. Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE Transactions on Computational Biology and Bioinformatics 1, 13-23. A. Rokas, B. Williams, N. King, and S. Carroll, 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425, 798-804. S.J. Willson, 2008. Reconstruction of certain phylogenetic networks from the genomes at their leaves. Journal of Theoretical Biology 252, 338-349.