Journal of Theoretical Biology 252 (2008) 338–349
www.elsevier.com/locate/yjtbi

Reconstruction of certain phylogenetic networks from the genomes at their leaves

Stephen J. Willson
Department of Mathematics, Iowa State University, Ames, IA 50011, USA

Received 30 October 2007; received in revised form 6 February 2008; accepted 11 February 2008
Available online 19 February 2008

Abstract

A network $N$ is a rooted acyclic digraph. A base-set $X$ for $N$ is a subset of vertices including the root (or outgroup), all leaves, and all vertices of outdegree 1. A simple model of evolution is considered in which all characters are binary and in which back-mutations occur only at hybrid vertices. It is assumed that the genome is known for each member of the base-set $X$. If the network is known and is assumed to be “normal,” then it is proved that the genome of every vertex is uniquely determined and can be explicitly reconstructed. Under additional hypotheses involving time-consistency and separation of the hybrid vertices, the network itself can also be reconstructed from the genomes of all members of $X$. An explicit polynomial-time procedure is described for performing the reconstruction.
© 2008 Elsevier Ltd. All rights reserved.
Keywords: Phylogenetic network; Hybrid; Recombination; Reticulate; Speciation

Tel.: +1 515 294 7671; fax: +1 515 294 5454. E-mail address: swillson@iastate.edu
0022-5193/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.jtbi.2008.02.015

1. Introduction

Phylogenetic relationships are most commonly represented by rooted trees. The extant taxa correspond to leaves of the trees, while internal vertices correspond to ancestral species. The arcs correspond to genetic change, typically by mutations in the DNA such as substitutions, insertions, and deletions.

There has been increased interest recently in phylogenetic networks that are not necessarily trees. Besides speciation events, these networks could include such additional reticulation events as hybridization, recombination, or lateral gene transfer. Basic models of recombination were suggested by Hein (1990, 1993). General frameworks are discussed in Bandelt and Dress (1992), Baroni et al. (2004), Moret et al. (2004), and Nakhleh et al. (2005).

Recent evidence suggests that reticulation events are not rare but are still less frequent than ordinary speciation events. A common approach has been to seek networks with as few reticulation events as possible, in part to give a lower bound. A “perfect phylogeny” is a network in which for each character the set of vertices with a particular value of the character is connected. Wang et al. (2001) considered the problem of finding a perfect phylogenetic network with recombination that has the smallest number of recombination events. They suggested that the problem is NP-hard, and a full proof was given by Bordewich and Semple (2007). Wang et al. then considered a restricted problem in which all recombination events are associated with node-disjoint recombination cycles. Gusfield et al.
(2004a) gave necessary and sufficient conditions to identify these networks, which they called “galled-trees,” and they added a much more specific and realistic model of recombination events. Gusfield et al. (2004b) gave a more detailed study of these node-disjoint cycles.

This paper concerns a direct reconstruction of a phylogenetic network, without seeking to minimize the number of reticulation events. Additional hypotheses are assumed about the network.

Papers in the literature differ in their detailed assumptions about the nature of the networks. One commonly assumed property is that they be rooted acyclic digraphs (Strimmer and Moulton, 2000; Moret et al., 2004; Nakhleh et al., 2005). Another is a kind of time consistency at hybridization events, roughly that the parents of a hybrid be possibly contemporaneous (“temporal representation” for Baroni et al., 2006, “coexistence in time” for Moret et al., 2004). Another is a condition called “regularity” in Baroni et al. (2004) in which, among other properties, no two vertices have the same set of leaves as descendants.

This paper performs the reconstructions under the assumption that the network $N$ is “normal”. An exact definition is given in Section 2, but roughly the idea is as follows: There is a “base-set” $X$ consisting of vertices with known DNA, typically containing just the root and the leaves. The network is normal if from every vertex that is not in $X$ there is a directed path to a vertex in $X$ such that no vertex after the initial vertex is a hybrid vertex. This property is similar to the defining property in “tree-child networks” (Cardona et al., 2007) or the assumption about “tree nodes” in “model phylogenetic networks” (Moret et al., 2004).

The first main theorem (Theorem 3.1) assumes that the network $N$ is normal.
Given the genomes at all members of $X$ and given $N$, the theorem asserts that genomes at all internal vertices may be uniquely reconstructed. The second main theorem (Theorem 4.1) asserts that, under some additional assumptions, given the genomes at the members of $X$, the network $N$ itself can also be uniquely reconstructed. The construction methods are explicit and indeed of polynomial-time complexity.

All characters are assumed to be binary. Baroni and Steel (2006) assumed a simple but related model of evolution called “accumulation phylogeny” in which each child inherits the entire set of modified characters in each parent. Thus there may be spontaneous mutations in the network, but all such mutations are inherited by all descendants. When each hybridization event is a polyploidy, this assumption may be appropriate.

The current paper seeks to generalize the evolutionary model and be somewhat more realistic. It assumes instead that at a hybridization event only a portion of the genome of each parent is inherited by each child. On the other hand, we retain the assumption that the mutated genomes in a parent are inherited by a “normal” child for which there is only a single parent. The resulting model of evolution is called in this paper the “Simple Homoplasy Model”. Full details are in Section 2.

For Theorem 3.1 on reconstructing genomes given the network, the assumption of normality is the principal assumption on the network. For Theorem 4.1, in which the network itself is being reconstructed, additional assumptions are made. We assume a version of time consistency at hybridization events. We assume that a hybrid vertex has exactly two parents. We assume that the hybridization events are separated in the network, so that for example a hybrid node does not have a hybrid child (this was also assumed in Moret et al., 2004), but in addition there are constraints on the immediate descendants of a parent and grandparent of a hybrid vertex.
Examples are presented that indicate that at least some of the extra conditions are required for unique reconstruction of the network.

This paper is a major extension of the author's paper (Willson, 2007). The earlier paper obtained a theorem about reconstructing genomes given the network and the genomes at the leaves. It did not, however, give a reconstruction of the network itself. The current paper extends Willson (2007) in two ways. First, it gives a reconstruction of the network itself. Second, it applies to a more general model of evolution, with less specific assumptions about the nature of homoplasies. Hence the reconstruction of the genomes given the network (Theorem 3.1) is also a major generalization of the main theorem of Willson (2007).

2. Fundamentals

A directed graph or digraph $(V, A)$ consists of a finite set $V$ of vertices and a finite set $A$ of arcs, each consisting of an ordered pair $(u, v)$ where $u \in V$, $v \in V$, $u \neq v$, interpreted as an arrow from $u$ (the parent) to $v$ (the child). There are no multiple arcs and no loops. A directed path is a sequence $u_0, u_1, \ldots, u_k$ of vertices such that for $i = 1, \ldots, k$, $(u_{i-1}, u_i) \in A$. The path is trivial if $k = 0$. Write $u \le v$ if there is a directed path starting at $u$ and ending at $v$. The graph is acyclic if there is no nontrivial directed path starting and ending at the same point. If the graph is acyclic, it is easy to see that $\le$ is a partial order on $V$.

The acyclic digraph $(V, A)$ has root $r \in V$ if for all $v \in V$, $r \le v$. The indegree of vertex $u$ is the number of $v \in V$ such that $(v, u) \in A$. The outdegree of $u$ is the number of $v \in V$ such that $(u, v) \in A$. If $(V, A)$ is rooted at $r$ then $r$ is the only vertex of indegree 0. A leaf is a vertex of outdegree 0. A normal (or tree) vertex is a vertex of indegree at most 1. A hybrid vertex (or recombination vertex or reticulation node) is a vertex of indegree at least 2.
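The definitions above translate directly into a few lines of code. The following is a minimal sketch, not part of the paper: the function names `classify_vertices` and `reaches` and the seven-vertex example are hypothetical, chosen only to illustrate indegree, outdegree, normal and hybrid vertices, and the relation $u \le v$.

```python
from collections import defaultdict


def classify_vertices(vertices, arcs):
    """Classify the vertices of a digraph given as (parent, child) arcs."""
    indeg = {v: 0 for v in vertices}
    outdeg = {v: 0 for v in vertices}
    for u, v in arcs:
        outdeg[u] += 1
        indeg[v] += 1
    return {
        "roots": sorted(v for v in vertices if indeg[v] == 0),
        "leaves": sorted(v for v in vertices if outdeg[v] == 0),
        "normal": sorted(v for v in vertices if indeg[v] <= 1),
        "hybrid": sorted(v for v in vertices if indeg[v] >= 2),
    }


def reaches(arcs, u, v):
    """Return True if u <= v, i.e. some (possibly trivial) directed path runs from u to v."""
    children = defaultdict(list)
    for a, b in arcs:
        children[a].append(b)
    stack, seen = [u], {u}
    while stack:
        w = stack.pop()
        if w == v:
            return True
        for c in children[w]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return False


# Hypothetical example: root r has children a and b; h is a hybrid child of
# a and b; x and y are leaf children of a and b; z is a leaf child of h.
V = ["r", "a", "b", "h", "x", "y", "z"]
A = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
     ("a", "x"), ("b", "y"), ("h", "z")]
info = classify_vertices(V, A)
```

In this example `info["hybrid"]` contains only `h`, the root is counted among the normal vertices because its indegree is 0, and `reaches(A, "r", "z")` holds while `reaches(A, "x", "h")` does not.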
A base-set $X$ is a subset of $V$ that contains the root, all leaves, and all vertices of outdegree 1. (It may contain other vertices as well.) The interpretation of $X$ is that its members correspond to taxa on which direct measurements may be made. The leaves correspond typically to extant taxa on which such measurements as DNA are possible. While the root is usually thought of as being a remote ancestor, it may be replaced in practice by an outgroup on which measurements can be made. In consideration of trees, it is common to suppress any vertices of outdegree 1 other than possibly the root because nothing in the tree uniquely identifies such a taxon. If such a vertex remains in the network, it is because special information is known about such a taxon, whence we assume measurements on it can be made and it is in $X$.

An arc $(u, v)$ is redundant if there exists $w \in V$ such that $u$, $v$, and $w$ are distinct and $u \le w \le v$. The inclusion of a redundant arc is problematic since it duplicates much genetic information while adding to both indegrees and outdegrees.

In this paper a (phylogenetic) network $N = (V, A, r, X)$ is a rooted acyclic digraph $(V, A)$ with root $r$ and base-set $X$ such that there are no redundant arcs. The fundamental problem is to learn about $N$ from information on $X$.

Let $N = (V, A, r, X)$ be a phylogenetic network. A directed path $u = u_0, u_1, u_2, \ldots, u_k = v$ (where $(u_{i-1}, u_i)$ is an arc for $i = 1, \ldots, k$) is a normal path from $u$ to $v$ provided that $u_i$ is normal for all $i > 0$. Note $u$ itself may or may not be hybrid. There is a normal path from $u$ to $X$ if there is a normal path from $u$ to some $x$ such that $x \in X$. If $u \in X$, then the trivial path starting at $u$ is a normal path from $u$ to $X$. The network $N$ is normal if from every vertex $u \in V$ there is a normal path to $X$.

Let $W$ be a nonempty subset of $V$.
The most recent common ancestor of $W$, denoted $\mathrm{mrca}(W)$, is the vertex $u \in V$, if it exists, such that

(1) For all $w \in W$, $u \le w$.
(2) Suppose $v \in V$ satisfies that for all $w \in W$, $v \le w$. Then $v \le u$.

If $\mathrm{mrca}(W)$ exists, it is unique. This is because, if $u_1$ and $u_2$ both satisfy the definition then by (1) for all $w \in W$, $u_2 \le w$, whence by (2) $u_2 \le u_1$. By a symmetric argument, $u_1 \le u_2$. Hence $u_1 = u_2$. It is easy, however, to construct examples of networks where $\mathrm{mrca}(W)$ need not exist.

This paper concerns evolution under the “Simple Homoplasy Model.” Let $C$ denote a (large finite) set of binary characters. Let $N = (V, A, r, X)$ be a phylogenetic network. Associated with each $v \in V$ there is a set $M(v) \subseteq C$, called the mutated genome of $v$, such that $i \in M(v)$ iff character $i$ has a different allele in taxon $v$ than the allele at the root $r$. Since each character $i$ is binary, it follows that the genome of $v$ is determined by the genome at $r$ and by $M(v)$. It is immediate that $M(r) = \emptyset$. Since measurements can be made on members of $X$ and $r \in X$, we may assume that for every $x \in X$, the genomes of $r$ and $x$ are known, whence $M(x)$ is known.

Following are the assumptions for the Simple Homoplasy Model.

(SH1) For every $v \in V$, there is a set $O(v) \subseteq M(v)$, whose members are called originating mutations at $v$.
(SH2) If $v \neq w$, then $O(v) \cap O(w) = \emptyset$.
(SH3) $M(r) = O(r) = \emptyset$.
(SH4) If $c \in V$ is normal with parent $p \in V$, then $M(c) = M(p) \cup O(c)$.
(SH5) If $c \in V$ is hybrid with parents $p_1, p_2, \ldots, p_k$ then for each $i$, $1 \le i \le k$, there exist sets $P(c, p_i) \subseteq C$ such that
(SH5a) $M(c) = O(c) \cup \left[\bigcup \{P(c, p_i) : i = 1, \ldots, k\}\right]$,
(SH5b) $P(c, p_i) = M(c) \cap M(p_i)$, and
(SH5c) $O(p_i) \subseteq P(c, p_i)$ for $i = 1, \ldots, k$.

Call $P(c, p_i)$ the parental contribution to $c$ from $p_i$.

The intuition behind the model is that the sets $O(v)$ identify the new mutations that occur at vertex $v$.
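To make the axioms concrete, the following sketch checks (SH1)–(SH5) for a given assignment of sets $M(v)$ and $O(v)$. It is an illustration under stated assumptions rather than anything from the paper: the function name `check_simple_homoplasy`, the dictionary-based encoding, and the five-vertex example network are all hypothetical, and (SH5b) is used as the definition of the parental contributions.

```python
def check_simple_homoplasy(parents, root, M, O):
    """Return the sorted labels of the axioms (SH1)-(SH5c) that fail.

    parents: dict mapping each non-root vertex to the list of its parents
    M, O:    dicts mapping each vertex to a set of character indices
    """
    failed = set()
    verts = list(M)
    # (SH1): every originating mutation belongs to the mutated genome.
    if any(not O[v] <= M[v] for v in verts):
        failed.add("SH1")
    # (SH2): originating sets at distinct vertices are disjoint.
    if any(O[v] & O[w] for i, v in enumerate(verts) for w in verts[i + 1:]):
        failed.add("SH2")
    # (SH3): the root carries no mutations.
    if M[root] or O[root]:
        failed.add("SH3")
    for c in verts:
        ps = parents.get(c, [])
        if len(ps) == 1:
            # (SH4): a normal vertex inherits everything from its one parent.
            if M[c] != M[ps[0]] | O[c]:
                failed.add("SH4")
        elif len(ps) >= 2:
            # (SH5b) is taken as the definition of the parental contributions.
            P = {p: M[c] & M[p] for p in ps}
            if M[c] != O[c].union(*P.values()):
                failed.add("SH5a")
            if any(not O[p] <= P[p] for p in ps):
                failed.add("SH5c")
    return sorted(failed)


# Hypothetical network: r -> g -> p, r -> q, and h is a hybrid child of p and q.
parents = {"g": ["r"], "p": ["g"], "q": ["r"], "h": ["p", "q"]}
O = {"r": set(), "g": {0}, "p": {1}, "q": {2}, "h": {3}}
M = {"r": set(), "g": {0}, "p": {0, 1}, "q": {2}, "h": {1, 2, 3}}
ok = check_simple_homoplasy(parents, "r", M, O)

# Dropping character 1 at h would lose a mutation that originated at p.
M_bad = dict(M, h={2, 3})
bad = check_simple_homoplasy(parents, "r", M_bad, O)
```

In the first assignment character 0, which arose at `g`, is not passed from `p` to `h`; the axioms permit this because 0 did not originate at `p`. In the second assignment character 1, which did originate at `p`, is missing from `h`, and the check reports a failure of (SH5c).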
Under an infinite site hypothesis, there are so many characters and mutation is sufficiently rare that the same mutation never occurs twice; hence the sets $O(v)$ are pairwise disjoint, as required in (SH2). Inheritance at a normal vertex $c$ (with only one parent $p$) is very simple as described in (SH4): all mutated characters from $p$ are inherited by $c$, and new mutations identified in $O(c)$ occur as well.

At a hybrid vertex $c$ with parents $p$ and $q$, the vertex $c$ inherits the mutations $P(c, p)$ from parent $p$ and mutations $P(c, q)$ from parent $q$. Hence (SH5b) $P(c, p) = M(c) \cap M(p)$, whence $P(c, p) \subseteq M(p)$. Every mutation $i \in M(c)$ either comes from $p$ ($i \in P(c, p)$), comes from $q$ ($i \in P(c, q)$), or originates anew at $c$ ($i \in O(c)$). Hence (SH5a) $M(c) = O(c) \cup P(c, p) \cup P(c, q)$. The axiom (SH5a) also allows an arbitrary set of parents. It is easy to see that if there exists $v \in V$ such that $i \in M(v)$, then there exists $u \in V$ such that $i \in O(u)$ and $u \le v$.

The important final requirement (SH5c), if the hybrid vertex $c$ has parents $p$ and $q$, is that $O(p) \subseteq P(c, p)$ and $O(q) \subseteq P(c, q)$. This asserts that every mutation in $M(p)$ that originated at $p$ is inherited by $c$. If this were not the case, then a mutation originating at $p$ could disappear immediately, a situation that will be called an immediate reversion. In the simplest case, suppose $p$ has exactly one additional child $c'$ as well as $c$, and $c'$ is normal. If there were an immediate reversion of character $i$, then $i \in O(p)$ and $i \in M(c')$ (by SH4) but $i \notin M(c)$. Since $M(p)$ is not directly measurable, this situation would be indistinguishable from that in which $i \in O(c')$. An immediate reversion would thus provide a trivial barrier to reconstruction of the genome. If we assume there are no immediate reversions, then once the network is known, this trivial ambiguity could be instantly recognized.

Let $N = (V, A)$ be an acyclic digraph. A pseudocycle in $N$ is a sequence of vertices $x_0, x_1, x_2, \ldots$
, $x_n$ from $V$ with $n > 0$ such that $x_n = x_0$ and for each $i$ (taken mod $n$) either

(1) $(x_i, x_{i+1})$ is an arc; or
(2) $x_i$ is hybrid with distinct parents $x_{i-1}$ and $x_{i+1}$ and $(x_{i+1}, x_i)$ is an arc.

A pseudocycle is not a cycle since it is not a directed path. Nevertheless it is very similar to a cycle since time is moving forward on most parts of the sequence. The existence of a pseudocycle indicates a lack of “time consistency”. For example, if there is a temporal representation on the network (Baroni et al., 2006) then each vertex $v$ has a “time” $f(v)$ such that when $v$ has parents $p$ and $q$, then $f(p) = f(q)$; and when $c$ is a child of $u$, then $f(u) < f(c)$. Following a pseudocycle we see that the successive hybrid parents must exist later in time and yet loop back to the original hybrid node, an impossibility. Hence the network can have no pseudocycle.

If $N = (V, A, r, X)$ is a phylogenetic network, suppose $x$ is a hybrid vertex with parents $p$ and $q$. Call $x$ a positive hybrid if $O(x) \neq \emptyset$. If $x$ is a positive hybrid, perform an operation to produce a new network $N^x$ as follows: Insert a new vertex $x'$ called the separated vertex at $x$. Delete the arcs $(p, x)$ and $(q, x)$. Insert new arcs $(p, x')$, $(q, x')$, and $(x', x)$, and let $O(x') = \emptyset$. This procedure will be called separating $x$, and $N^x$ is the separated network. See Fig. 1.

Fig. 1. In separation of the hybrid vertex $x$, the left graph $N$ is replaced by the right graph $N^x$ in which $O(x') = \emptyset$.

The separated vertex $x'$ has biological meaning. In the act of hybridization of taxa $p$ and $q$ to yield $x$, a new taxon $x'$ was produced such that $M(x') = P(x', p) \cup P(x', q)$. Further mutation from $x'$ led to the taxon $x$ such that $O(x) = M(x) \setminus M(x')$. Thus $x'$ denotes the presumed first hybrid offspring and $x$ denotes the descendant of $x'$, slightly mutated, which first left sufficient record to be detectable in the network.

3. Reconstruction of the genome given the network

In this section we show that, given a normal phylogenetic network $N = (V, A, r, X)$, given the mutated sets $M(x)$ for all $x \in X$, and assuming the Simple Homoplasy Model, it is possible to reconstruct all the sets $O(v)$ and $P(u, v)$. Effectively, the genomes at all vertices are uniquely determined and can be found in polynomial time.

Theorem 3.1. Assume $N = (V, A, r, X)$ is normal and the evolution satisfies the Simple Homoplasy Model. Assume the network $N$ is known and that for each $x \in X$, $M(x)$ is known. Then for all $v \in V$, $M(v)$ is determined and $O(v)$ is determined. For all hybrid vertices $v$ with parents $p_i$, $i = 1, \ldots, k$, for each $i$, $P(v, p_i)$ is determined.

Throughout Section 3, we shall make the assumptions in Theorem 3.1. The proof of Theorem 3.1 will first show that, for all $v \in V$, $M(v)$ is determined. We can then obtain all sets $O(v)$ and $P(v, p)$ from all the sets $M(v)$.

We shall repeatedly use the immediate consequence of (SH4) that, if $u = u_0, u_1, \ldots, u_k = a$ is a normal path, then $M(a) = M(u) \cup O(u_1) \cup \cdots \cup O(u_k)$, whence $M(u) \subseteq M(a)$.

Lemma 3.2. Let the vertex $x \in V$ have distinct children $c$ and $d$, let $c = c_0, c_1, \ldots, c_m = x_1$ be a normal path from $c$ to $x_1$, and let $d = d_0, d_1, \ldots, d_n = x_2$ be a normal path from $d$ to $x_2$. Then $x_1 \neq x_2$.

Proof. We prove the result by induction on $n + m$. Suppose $x_1 = x_2$.

Case (1): If $m = 0$ then $x_1 = c$. Hence $x < d \le x_2 = x_1 = c$. This contradicts nonredundancy of the arc $(x, c)$.

Case (2): If $n = 0$ then $x_2 = d$. An argument similar to case (1) applies.

Case (3): Otherwise $m > 0$ and $n > 0$. By normality $c_m$ has unique parent $c_{m-1}$ and $d_n$ has unique parent $d_{n-1}$, whence $c_{m-1} = d_{n-1}$. But $(n - 1) + (m - 1) < n + m$. By the inductive hypothesis it follows $c_{m-1} \neq d_{n-1}$, a contradiction. ∎

Lemma 3.3. Suppose $v$ has distinct children $a$ and $b$ and there is a normal path from $a$ to $x_1$ and a normal path from $b$ to $x_2$. Assume $a$ is normal. Then $v = \mathrm{mrca}(x_1, x_2)$.

Proof. Let the normal paths be $a = a_0, a_1, a_2, \ldots, a_m = x_1$ and $b = b_0, b_1, b_2, \ldots, b_n = x_2$. Suppose $u \le x_1$ and $u \le x_2$. We show $u \le v$. Since $u \le x_1 = a_m$, either $u = x_1$ or $u < x_1$, in which case by normality $u \le a_{m-1}$ since $a_{m-1}$ is the unique parent of $a_m$. By a similar argument, if $u \le a_{m-1}$ then either $u = a_{m-1}$ or $u \le a_{m-2}$. In this manner we see that either $u = a_i$ for some $i$, $1 \le i \le m$, or $u \le a_0 = a$. Similarly we see that either $u = b_i$ for some $i$, $1 \le i \le n$, or $u \le b_0 = b$. By Lemma 3.2 we cannot have simultaneously $u = a_i$ and $u = b_j$ for $i > 0$ and $j > 0$. Hence the possibilities are

(i) $u \le a$, $u \le b$,
(ii) $u \le a$, $u = b_j$ for some $j$,
(iii) $u = a_i$ for some $i$, $u \le b$.

In case (ii), $v < b \le b_j = u \le a$, contradicting the nonredundancy of the arc $(v, a)$. In case (iii), $v < a \le a_i = u \le b$, contradicting the nonredundancy of the arc $(v, b)$. Hence (i) must apply. But then since $a$ is normal and $u \le a$, we have either $u = a$ or $u \le v$. If $u = a$, then $v < a = u \le b$ contradicts the nonredundancy of arc $(v, b)$. Hence $u \le v$. This shows $v = \mathrm{mrca}(x_1, x_2)$. ∎

Corollary 3.4. Suppose $v$ has distinct children $a$ and $b$ and there is a normal path from $a$ to $x_1$ and a normal path from $b$ to $x_2$. Assume $a$ is normal. Then $M(x_1) \cap M(x_2) \subseteq M(v)$.

Proof. Let the normal paths be $a = a_0, a_1, a_2, \ldots, a_m = x_1$ and $b = b_0, b_1, b_2, \ldots, b_n = x_2$. Then by (SH4), $M(x_1) = M(v) \cup O(a) \cup O(a_1) \cup \cdots \cup O(a_m)$ and $M(x_2) = M(b) \cup O(b_1) \cup O(b_2) \cup \cdots \cup O(b_n)$. Suppose $i \in M(x_1) \cap M(x_2)$. There exists $u \in V$ such that $i \in O(u)$. Then $u \le x_1$ since $i \in M(x_1)$, and similarly $u \le x_2$. By Lemma 3.3, $u \le \mathrm{mrca}(x_1, x_2) = v$. Hence $u$ cannot equal $a$, $a_1, \ldots, a_m$, $b_1, \ldots,$ or $b_n$. Hence $M(x_1) \cap M(x_2) = M(v) \cap M(b) \subseteq M(v)$. ∎

Lemma 3.5. Suppose $v$ has two normal children $a$ and $b$. Choose a normal path from $a$ to $x \in X$, and a normal path from $b$ to $y \in X$. Then $M(v) = M(x) \cap M(y)$.

Proof. Since $a$ is normal and the path from $a$ to $x$ is normal, it follows that $M(v) \subseteq M(x)$. Similarly $M(v) \subseteq M(y)$, whence $M(v) \subseteq M(x) \cap M(y)$.
Conversely, by Corollary 3.4, $M(x) \cap M(y) \subseteq M(v)$. ∎

Lemma 3.6. Suppose $v \in V$ is normal with parent $p$. Suppose $v$ has normal child $c$ and hybrid child $z$. Choose a normal path from $c$ to $a \in X$, and a normal path from $z$ to $w \in X$. Then $M(v) = M(p) \cup (M(a) \cap M(w))$.

Proof. By (SH4), $M(p) \subseteq M(v)$. By Corollary 3.4, $M(a) \cap M(w) \subseteq M(v)$. Hence $M(p) \cup (M(a) \cap M(w)) \subseteq M(v)$.

Conversely, $O(v) \subseteq M(a)$ since there is a normal path from $v$ to $a$. $O(v) \subseteq M(z)$ by (SH5c) since there are no immediate reversions. $M(z) \subseteq M(w)$ since there is a normal path from $z$ to $w$. Hence $O(v) \subseteq M(w)$ and $O(v) \subseteq M(a) \cap M(w)$. Thus $M(v) = M(p) \cup O(v) \subseteq M(p) \cup (M(a) \cap M(w))$. ∎

Lemma 3.7. Suppose $v$ is hybrid with parents $p$ and $q$. Suppose $v$ has normal child $c$ and hybrid child $z$. Choose a normal path from $c$ to $a \in X$ and a normal path from $z$ to $w \in X$. Then

$M(v) = (M(a) \cap M(p)) \cup (M(a) \cap M(q)) \cup (M(a) \cap M(w))$.

Proof. See Fig. 2. Let the normal path from $c$ to $a$ be $c = c_0, c_1, \ldots, c_n = a$. Let the normal path from $z$ to $w$ be $z = z_0, z_1, \ldots, z_m = w$. Then $M(a) = M(v) \cup O(c_0) \cup O(c_1) \cup \cdots \cup O(c_n)$ and $M(w) = M(z) \cup O(z_1) \cup O(z_2) \cup \cdots \cup O(z_m)$. Note that $(M(a) \cap M(p)) \subseteq M(v)$. (Otherwise some $i \in M(a) \cap M(p)$ lies outside $M(v)$; since $i \in M(a)$ there exists $j$ such that $i \in O(c_j)$, and $c_j \le p$ since $i \in M(p)$, leading to a cycle.) Similarly $(M(a) \cap M(q)) \subseteq M(v)$. Note that $(M(a) \cap M(w)) \subseteq M(v)$ by Corollary 3.4. Thus $(M(a) \cap M(p)) \cup (M(a) \cap M(q)) \cup (M(a) \cap M(w)) \subseteq M(v)$.

Conversely, $M(v) = O(v) \cup P(v, p) \cup P(v, q)$ by (SH5a). If $i \in O(v)$, then $i \in M(z)$ since there are no immediate reversions, so $i \in M(a) \cap M(w)$. If $i \in P(v, p)$ then $i \in M(a) \cap M(p)$. If $i \in P(v, q)$ then $i \in M(a) \cap M(q)$. Hence $M(v) \subseteq (M(a) \cap M(p)) \cup (M(a) \cap M(q)) \cup (M(a) \cap M(w))$. ∎

Fig. 2. The situation in Lemma 3.7.

Lemma 3.8. Suppose $v$ is hybrid with parents $p_1, p_2, \ldots, p_k$. Suppose $v$ has normal child $c$ and other children $z_1, \ldots, z_m$. Choose a normal path from $c$ to $a \in X$, and normal paths from $z_i$ to $w_i \in X$. Then

$M(v) = \bigcup \{M(a) \cap M(p_i) : 1 \le i \le k\} \cup \bigcup \{M(a) \cap M(w_i) : 1 \le i \le m\}$.

Proof. The argument is an obvious generalization of the argument for Lemma 3.7, and is omitted. ∎

We can now complete the proof of Theorem 3.1. We first show that for all $v \in V$, $M(v)$ is uniquely determined. Assume by induction that $M(u)$ is known when $u < v$. The case $M(r) = \emptyset$ serves as a base for the induction. By normality, one of the following cases A, B, C, and D occurs.

Case A: Suppose $v \in X$. Then $M(v)$ is given.

Case B: Suppose $v$ has outdegree 0 or 1. Then $v \in X$ and $M(v)$ is given.

Case C: Suppose $v$ has two normal children $c_1$ and $c_2$. Then Lemma 3.5 shows $M(v) = M(x) \cap M(y)$ for some specified members $x$ and $y$ of $X$.

Case D: Suppose $v \notin X$ and $v$ does not have two normal children. Then $v$ has a normal child $c$ and at least one hybrid child $z$.

Subcase D1: If $v$ is normal, then Lemma 3.6 shows $M(v) = M(p) \cup (M(a) \cap M(w))$ for specified members $a$ and $w$ of $X$, and with $p < v$, whence $M(p)$ is known by induction.

Subcase D2: If $v$ is hybrid with exactly two parents $p$ and $q$, then Lemma 3.7 finds $a$ and $w$ in $X$ such that $M(v) = (M(a) \cap M(p)) \cup (M(a) \cap M(q)) \cup (M(a) \cap M(w))$. Since $p < v$ and $q < v$, $M(p)$ and $M(q)$ are known by induction. Hence $M(v)$ is determined.

Subcase D3: If $v$ is hybrid with an arbitrary number of parents and hybrid children, then Lemma 3.8 determines $M(v)$.

Thus for all $v \in V$, $M(v)$ is uniquely determined. There remains to show that $O(v)$ and $P(u, p)$ are determined. This follows from three cases:

Case 1: Suppose $v$ is normal with parent $p$. Then $O(v) = M(v) \setminus M(p)$ by (SH4).

Case 2: Suppose $v$ is hybrid with parents $p_1, p_2, \ldots, p_k$. Then $O(v) = M(v) \setminus (M(p_1) \cup \cdots \cup M(p_k))$ by (SH5).

Case 3: Suppose $p$ is a parent of the hybrid vertex $u$. Then $P(u, p) = M(u) \cap M(p)$ by (SH5b).

This completes the proof of Theorem 3.1.

4. Reconstruction of the network

In this section we make additional assumptions on both the network and the evolution model.
Under these additional assumptions, given $M(x)$ for all $x \in X$, we can reconstruct the phylogenetic network $N$ as well. Here is a list of the assumptions:

(A1) $N = (V, A, r, X)$ is a normal phylogenetic network.
(A2) The evolution satisfies the Simple Homoplasy Model.
(A3) Every hybrid vertex has exactly two parents.
(A4) For all $v \in V$, $v \neq r$, if $v$ is normal then $O(v)$ is nonempty.
(A5) If $p$ has a hybrid child $c$, then $p$ is normal and every child of $p$ other than $c$ is normal.
(A6) If $h$ is hybrid, $p$ is a parent of $h$ (hence normal by (A5)), and $p'$ is a parent of $p$, then either
(A6a) $p'$ has no child other than $p$, whence $p' \in X$, or else
(A6b) $p'$ has a normal child $b$ such that $b \neq p$, and either
(A6b1) $b$ is a leaf, or
(A6b2) $b$ has two normal children, or else
(A6b3) $b \in X$ and $b$ has a normal child.
(A7) Suppose $h$ is hybrid with parents $p$ and $q$. Then $P(h, p) \neq M(p)$ and $P(h, q) \neq M(q)$.
(A8) $N$ has no pseudocycles.

By (A2) and (SH4) there are no homoplasies at normal vertices, and by (SH5) there are no immediate reversions. Note (A7) says that at a hybrid vertex, reversions occur from each parent.

Assumptions (A5)–(A7) appear very technical, yet similar assumptions are needed to ensure unique reconstruction. Fig. 3 shows two normal networks A and B with the same base-set $X$ that satisfy all assumptions except (A6). For example, in A a grandparent of the hybrid vertex 6 has the hybrid child 4. It can be shown by exhaustive check that under the Simple Homoplasy Model the possible characters (observed only on members of $X$) are exactly the same in both A and B. Given all these characters on members of $X$ and given either A or B, the characters at all remaining vertices can be recovered by Theorem 3.1. Unique reconstruction of the network itself, however, will not be possible since both A and B will be solutions. Hence Theorem 4.1 cannot be true without an assumption similar to (A6).
Similarly Fig. 4 exhibits two normal networks C and D satisfying all assumptions except (A5). Any character in either network (as observed in members of $X$) under the Simple Homoplasy Model is a character in the other. Hence Theorem 4.1 cannot be true without an assumption similar to (A5).

Finally, Fig. 5 exhibits two networks E and F and for each vertex $v$ the mutated set $M(v)$. Both networks satisfy all axioms except (A7). For example, in E, $P(4, 3) = M(3) = \{a, b, d\}$. In E, $e \in O(5)$, while in F, $e \in O(4)$. The networks differ in which vertex is hybrid. It follows that Theorem 4.1 cannot be true without an assumption like (A7).

Fig. 3. Two normal networks A and B with base-set $X = \{1, 2, 3, 4, 5, 6, 7, 8\}$. They fail (A6).

Fig. 4. Two normal networks C and D with base-set $X = \{1, 2, 3, 4, 5, 6, 7\}$. They fail (A5).

Fig. 5. Two normal networks E and F with base-set $X = \{1, 2, 3, 4, 5, 6, 7\}$. Each vertex $v$ also shows $M(v)$. Both networks have the same genomes and satisfy the Simple Homoplasy Model, but they fail (A7).

Theorem 4.1. Let $N = (V, A, r, X)$ be a phylogenetic network that satisfies (A1)–(A8) above. Assume for all $x \in X$, $M(x)$ is given. Then $N$ can be reconstructed uniquely by an explicit procedure. If $|X| = n$ and $|C| = m$ then the reconstruction can be done in time $O(n^6 m)$. From Theorem 3.1 it then follows that for all $v \in V$, $M(v)$ can be reconstructed uniquely.

The proof of Theorem 4.1 requires many lemmas to handle many special cases. We assume (A1)–(A8) throughout this section.

The basic tool is the following: Assume $|X| \ge 3$. Suppose for every $x \in X$, we know $M(x)$. Define the stem function $d$ by, whenever $a$, $b$, and $x$ are distinct members of $X$,

$d(x, a, b) = M(x) \setminus [(M(a) \cap M(x)) \cup (M(b) \cap M(x))]$.

If $x \in X$, define

$d(x) = \bigcap \, [\, d(x, a, b) : x, a, b \text{ distinct members of } X \,]$.

Note $d(x, a, b) \subseteq M(x)$. Trivially $d(r) = \emptyset$.
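The stem function is defined purely in terms of the observed sets $M(x)$, so it can be computed directly. The sketch below is illustrative only: the names `stem_triple` and `stem` and the four-taxon example are hypothetical, and the brute-force loop over all pairs is simply the most direct reading of the definition, with no claim about how the paper's own procedure organizes the computation.

```python
from itertools import combinations


def stem_triple(M, x, a, b):
    """d(x, a, b): M(x) minus everything x shares with a or with b."""
    return M[x] - ((M[a] & M[x]) | (M[b] & M[x]))


def stem(M, x):
    """d(x): intersection of d(x, a, b) over all a, b with x, a, b distinct."""
    others = [y for y in M if y != x]
    if len(others) < 2:
        raise ValueError("the stem function needs at least three members of X")
    result = set(M[x])  # d(x, a, b) is always a subset of M(x)
    for a, b in combinations(others, 2):  # d(x, a, b) is symmetric in a and b
        result &= stem_triple(M, x, a, b)
    return result


# Hypothetical tree with X = {r, x, y, z}: r has children p and z, and p has
# leaf children x and y. Assumed originating mutations: O(p) = {1},
# O(x) = {2}, O(y) = {3}, O(z) = {4}; p is not in X, so M(p) is not observed.
M = {"r": set(), "x": {1, 2}, "y": {1, 3}, "z": {4}}
d_x = stem(M, "x")
d_r = stem(M, "r")
d_z = stem(M, "z")
```

In this toy tree the computed value at each leaf coincides with the mutations assumed to originate on the arc leading to it, and the value at the root is empty. Each of the $n$ members of $X$ requires on the order of $n^2$ pairs, each costing a few set operations over at most $m$ characters.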
The first use of $d$ will be to identify which members of $X$ are leaves and which are internal vertices. Lemmas 4.3–4.11 deal with different cases that show (1) if $x \in X$ is a leaf, then $d(x) = O(x)$, and (2) if $x \in X$ is not a leaf, then $d(x) = \emptyset$.

If $x$ is a normal leaf with parent $p$, then by (SH4), $M(p) = M(x) \setminus O(x)$, whence $M(p) = M(x) \setminus d(x)$ by (1). Thus the genome of the parent $p$ can be reconstructed, and $d(x) = O(x)$ identifies the mutations that occurred on the “stem” of $x$, i.e., the arc leading to $x$. (This is the reason for calling $d$ the “stem” function.) This fact will allow us recursively to remove normal leaves and replace them by their parents, simplifying the network. If $x$ is a hybrid leaf, then the separated vertex $x'$ satisfies $M(x') = M(x) \setminus d(x)$, so by a similar process a hybrid leaf may be replaced by its separated vertex.

In this manner the network will be simplified recursively, with $X$ changing in the process, until for each $x \in X$ we have $d(x) = \emptyset$. Lemma 4.13 shows that the network will now have no normal leaves. Hence every leaf is hybrid. In this situation, Lemma 4.14 gives a criterion to identify a member $x$ of $X$ which is a hybrid leaf and to identify its parents $p$ and $q$. Lemma 4.15 proves that there exists $x \in X$ that satisfies the criterion in Lemma 4.14. Hence we can identify a hybrid leaf $x$ and its parents $p$ and $q$. We then simplify the network by removing $x$ from $X$ (since its parents are known), and continue recursively.

The first step is to verify that two distinct vertices cannot have the same mutated genome. This result will be needed in order to identify together two vertices with the same genome constructed by different procedures.

Lemma 4.2. Suppose $u$ and $v$ are vertices such that $M(u) = M(v)$. Then $u = v$.

Proof. There are three cases to consider:

Case (1): Assume that both $u$ and $v$ are normal. By (A4) we may choose $a \in O(u)$ and $b \in O(v)$. Since $M(u) = M(v)$ it follows $\{a, b\} \subseteq M(u) = M(v)$. Since $a \in M(v)$ it follows $u \le v$. Since $b \in M(u)$ it follows $v \le u$.
Hence $u = v$.

Case (2): Suppose one of the vertices, say $u$, is normal and the other vertex $v$ is hybrid. Let $p$ and $q$ denote the parents of $v$. By (A5) they are normal, and by (A4) we may choose $a \in O(u)$, $b \in O(p)$, and $c \in O(q)$. Since there are no immediate reversions, it follows $\{b, c\} \subseteq M(v)$, whence $\{a, b, c\} \subseteq M(u) = M(v)$. Note $u \ne v$ since only $u$ is normal. Since $a \in M(v)$ it follows $u \le v$. Since $v$ has only parents $p$ and $q$ by (A3), it follows either $u \le p$ or $u \le q$. Without loss of generality, assume $u \le p$. Since $b \in M(u)$, it follows $p \le u$. Hence $p = u$. Since $c \in M(u)$, we see $q \le u = p$. But then $q < p < v$ shows that the arc $(q, v)$ was redundant, contradicting (A1).

Case (3): Suppose that both $u$ and $v$ are hybrid but $u \ne v$. Let $p_u$ and $q_u$ be the parents of $u$, and $p_v$ and $q_v$ be the parents of $v$. By (A5) these are normal, and by (A4) we may choose $a \in O(p_u)$, $b \in O(q_u)$, $c \in O(p_v)$, and $d \in O(q_v)$. Since there are no immediate reversions, it follows $\{a, b\} \subseteq M(u)$ and $\{c, d\} \subseteq M(v)$, whence $\{a, b, c, d\} \subseteq M(u) = M(v)$. Since $a \in M(v)$ it follows $p_u \le v$, whence either $p_u \le p_v$ or $p_u \le q_v$. In like manner we see that either $p_v \le p_u$ or $p_v \le q_u$. Similarly either $q_u \le p_v$ or $q_u \le q_v$; and either $q_v \le p_u$ or $q_v \le q_u$. Without loss of generality assume $p_u \le p_v$. If $p_v \le p_u$, then $p_u = p_v$. By (A5) every child of $p_u$ other than $u$ is normal; since $v$ is hybrid it follows $u = v$, a contradiction. Hence $p_v \le q_u$. If $q_u \le p_v$, then $p_v = q_u$, whence again by (A5) $u = v$, a contradiction. Hence $q_u \le q_v$. If $q_v \le q_u$, then $q_u = q_v$, whence $u = v$, a contradiction. Hence $q_v \le p_u$. It follows that $p_u \le p_v \le q_u \le q_v \le p_u$. Hence all four points are equal, a contradiction. $\square$

Lemma 4.3. Suppose $x \in X$ is a leaf. Then for all $a$ and $b$ such that $a$, $b$, and $x$ are distinct, $O(x) \subseteq \delta(x, a, b)$, whence $O(x) \subseteq \delta(x)$.

Proof. Since $x$ is a leaf and $a$ and $b$ are distinct from $x$, it is false that $x \le a$ and it is false that $x \le b$. It follows that $O(x) \cap M(a) = \emptyset$ and $O(x) \cap M(b) = \emptyset$. Since $O(x) \subseteq M(x)$, the result follows. $\square$

Lemma 4.4. Suppose $x \in X$ and $x$ has a normal child $c$. Then $\delta(x) = \emptyset$.

Proof. If $x = r$, the result is immediate. Assume $x \ne r$. Choose a normal path from $c$ to $y \in X$; say the path is $c = c_0, c_1, \ldots, c_k = y$. I claim $\delta(x, r, y) = \emptyset$. Since $c$ is normal, by (SH4), $M(y) = M(x) \cup O(c) \cup O(c_1) \cup \cdots \cup O(y)$. Hence $M(x) \cap M(y) = M(x)$, whence $\delta(x, r, y) = M(x) - M(x) = \emptyset$. Hence $\delta(x) = \emptyset$. $\square$

Lemma 4.5. Suppose $x \in X$ and $x$ has a hybrid child $c$. Suppose $x$ is normal with parent $p'$ and $p' \in X$. Then $\delta(x) = \emptyset$.

Proof. Suppose $c$ is hybrid with parents $x$ and $q$. Choose a normal path from $c$ to $y \in X$; say the path is $c = c_0, c_1, \ldots, c_k = y$. Then $M(y) = M(c) \cup O(c_1) \cup \cdots \cup O(y)$. Hence $M(x) \cap M(y) = M(x) \cap M(c) = P(c, x)$. Since $x$ is normal, $M(x) = M(p') \cup O(x)$, whence $M(x) \cap M(p') = M(p')$. Thus

$$\delta(x, y, p') = M(x) - [P(c, x) \cup M(p')] = M(x) - M(x) = \emptyset,$$

since $O(x) \subseteq P(c, x)$ and $M(x) = M(p') \cup O(x)$ by (SH4). It follows that $\delta(x) = \emptyset$. $\square$

Lemma 4.6. Suppose $x \in X$ and $x$ has a hybrid child $c$. Suppose $x$ is normal with parent $p'$ and $p'$ has a normal child $b$ distinct from $x$. Then $\delta(x) = \emptyset$.

Proof. Suppose $c$ is hybrid with parents $x$ and $q$. Choose a normal path from $c$ to $y \in X$; say the path is $c = c_0, c_1, \ldots, c_k = y$. Choose a normal path from $b$ to $z \in X$, say $b = b_0, b_1, \ldots, b_j = z$. The situation is then as in Fig. 6 modified so that $x$ and $p$ are merged into a single vertex $x$. I claim $\delta(x, y, z) = \emptyset$. By (SH4), since $b$ is normal, $M(z) = M(p') \cup O(b) \cup O(b_1) \cup \cdots \cup O(z)$. By (SH4), $M(y) = M(c) \cup O(c_1) \cup \cdots \cup O(y)$ and $M(x) = M(p') \cup O(x)$. If $x = b_u$ for some $u \ge 0$, then the arc $(p', x)$ would be redundant. Hence $M(z) \cap M(x) = M(p')$. If $x = c_u$ for some $u > 0$, then there would be a directed cycle at $x$. Hence $M(y) \cap M(x) = M(x) \cap M(c) = P(c, x)$. Then $\delta(x, y, z) = M(x) - (M(p') \cup P(c, x)) = \emptyset$ because $P(c, x)$ contains $O(x)$ by (SH5c). $\square$

Fig. 6. The situation in Lemma 4.10 or in Lemma 4.6 if $p$ and $x$ are merged into a single vertex $x$.

Lemma 4.7. Suppose $x$ is a normal leaf with parent $p$ and $p$ has a normal child $c$ distinct from $x$. Then $\delta(x) = O(x)$.

Proof. Since $N$ is normal we may choose a normal path from $c$ to $y \in X$ given by $c = c_0, c_1, c_2, \ldots, c_k = y$. I claim that $\delta(x, r, y) = O(x)$. To see this, observe that by (SH4), $M(x) = M(p) \cup O(x)$ and

$$M(y) = M(p) \cup O(c) \cup O(c_1) \cup \cdots \cup O(c_{k-1}) \cup O(y).$$

Note $x$ is distinct from $p$ and all $c_i$. (If $x = c_i$ for some $i > 0$ then the path $p, c, \ldots, c_i = x$ makes the arc $(p, x)$ redundant. If $x = c$ then this contradicts that $c$ is distinct from $x$.) Hence $M(x) \cap M(y) = M(p)$. It follows that $\delta(x, r, y) = M(x) - M(p) = O(x)$. Hence $\delta(x) \subseteq O(x)$. By Lemma 4.3, $\delta(x) = O(x)$. $\square$

Lemma 4.8. Assume $|X| \ge 3$. Suppose $x$ is a normal leaf with parent $p$ and $p \in X$. Then $\delta(x) = O(x)$.

Proof. If $p = r$, then since $|X| \ge 3$, $r$ must have another child $c$ which is necessarily normal. Lemma 4.7 then shows that $\delta(x) = O(x)$. If $p \ne r$, then $\delta(x, r, p) = O(x)$. To see this, observe that since $M(x) = M(p) \cup O(x)$ and $O(x)$ is disjoint from $M(p)$ we have $M(x) \cap M(p) = M(p)$. Since $M(r) = \emptyset$ we have

$$\delta(x, r, p) = M(x) - M(x) \cap M(p) = M(x) - M(p) = O(x).$$

Hence $\delta(x) \subseteq O(x)$. By Lemma 4.3, $\delta(x) = O(x)$. $\square$

Lemma 4.9. Suppose $x$ is a normal leaf with parent $p$. Suppose $p$ has a hybrid child $c$ whose other parent is $q$. Suppose $p$ is normal with parent $p'$, and $p' \in X$. Then $\delta(x) = O(x)$.

Proof. By normality we may choose a normal path from $c$ to $y \in X$, say $c = c_0, c_1, \ldots, c_k = y$. We show $\delta(x, y, p') = O(x)$. By (SH4), $M(y) = M(c) \cup O(c_1) \cup \cdots \cup O(y)$, and $M(x) = M(p') \cup O(p) \cup O(x) = M(p) \cup O(x)$. Hence $M(x) \cap M(p') = M(p')$, while $M(x) \cap M(y) = M(x) \cap M(c)$ (since for $i > 0$, $O(c_i) \cap M(x) = \emptyset$ because otherwise we have $c_i \le x$ and the directed path from $p$ to $c_i$ through $c$ and then from $c_i$ to $x$ contradicts that the arc $(p, x)$ is nonredundant). But $M(c) = O(c) \cup P(c, p) \cup P(c, q)$. Hence $M(x) \cap M(y) = [M(p) \cup O(x)] \cap M(c) = M(p) \cap M(c) = P(c, p)$. It follows that

$$\delta(x, y, p') = M(x) - (P(c, p) \cup M(p')).$$

Since there are no immediate reversions, $O(p) \subseteq P(c, p) \subseteq M(p)$, and $p$ is normal so $M(p) = M(p') \cup O(p)$, whence $P(c, p) \cup M(p') = M(p)$. Hence $\delta(x, y, p') = M(x) - M(p) = O(x)$. It follows that $\delta(x) \subseteq O(x)$. By Lemma 4.3, $\delta(x) = O(x)$. $\square$

Lemma 4.10. Suppose $x$ is a normal leaf with parent $p$. Suppose $p$ has a hybrid child $c$ whose other parent is $q$. Suppose $p$ is normal with parent $p'$, and $p'$ has a normal child $b$ distinct from $p$. Then $\delta(x) = O(x)$.

Proof. By normality we may choose a normal path from $c$ to $y \in X$, say $c = c_0, c_1, \ldots, c_k = y$. We may also choose a normal path from $b$ to $z \in X$, say $b = b_0, b_1, \ldots, b_j = z$. See Fig. 6. We show $\delta(x, y, z) = O(x)$. We have $M(x) = M(p) \cup O(x)$, $M(y) = M(c) \cup O(c_1) \cup \cdots \cup O(y)$, and $M(z) = M(p') \cup O(b) \cup O(b_1) \cup \cdots \cup O(z)$. Note $x$ is distinct from all $b_i$ for $i < j$ since $x$ is a leaf. Also, $x \ne z$ since otherwise if $j > 0$ then $p = b_{j-1}$ and the arc $(p', p)$ would be redundant. Moreover, $x \ne b$ since otherwise by normality $p = p'$; and $x \ne p'$ since $x$ is a leaf. Similarly $x$ is distinct from all $c_i$ for $i < k$ since $x$ is a leaf. Moreover, $x \ne y$ if $k > 0$ since otherwise $p = c_{k-1}$ and there is a directed cycle at $p$; and $x \ne c$ if $k = 0$ because $x$ and $c$ are distinct leaves. Because the originating sets are pairwise disjoint it follows that $M(x) \cap M(y) = M(p) \cap M(c) = P(c, p)$ and $M(x) \cap M(z) = M(p')$ by Lemma 3.5. Note $M(p) = M(p') \cup O(p) = M(p') \cup P(c, p)$ since $O(p) \subseteq P(c, p)$. Hence

$$\delta(x, y, z) = M(x) - [M(p') \cup P(c, p)] = M(x) - M(p) = O(x).$$

Hence $\delta(x) \subseteq O(x)$. By Lemma 4.3, $\delta(x) = O(x)$. $\square$

Lemma 4.11. Suppose $x \in X$ is a hybrid leaf with parents $p$ and $q$. Then $\delta(x) = O(x)$.

Proof. By normality, choose a normal path from $p$ to $y \in X$, say $p = p_0, p_1, \ldots, p_m = y$; and choose a normal path from $q$ to $z \in X$, say $q = q_0, q_1, \ldots, q_n = z$. I claim $\delta(x, y, z) = O(x)$. Note $M(x) = O(x) \cup P(x, p) \cup P(x, q)$, $M(y) = M(p) \cup O(p_1) \cup \cdots \cup O(y)$, and $M(z) = M(q) \cup O(q_1) \cup \cdots \cup O(z)$. For $i > 0$, $M(x) \cap O(p_i) = \emptyset$ since otherwise $p_i \le x$, leading to a directed path from $p$ to $x$ contradicting nonredundancy of $(p, x)$. Hence $M(x) \cap M(y) = M(x) \cap M(p) = P(x, p)$ by (SH5b). Similarly $M(x) \cap M(z) = M(x) \cap M(q) = P(x, q)$. It follows that

$$\delta(x, y, z) = M(x) - (P(x, p) \cup P(x, q)) = O(x).$$

Hence $\delta(x) \subseteq O(x)$. By Lemma 4.3, $\delta(x) = O(x)$. $\square$

Lemma 4.12. Suppose, for all $x \in X$, we have $\delta(x) = \emptyset$.
(1) Suppose $x \in X$ is a hybrid leaf with parents $p$ and $q$, such that $p$ and $q$ are in $X$. Then $\delta(x, p, q) = \emptyset$.
(2) Suppose in addition that $x$ is the only child of $p$ and the only child of $q$. Then whenever $a$ and $b$ in $X$ satisfy that $\delta(x, a, b) = \emptyset$, it follows that either ($a = p$ and $b = q$) or else ($a = q$ and $b = p$).

Proof. Since $\delta(x) = \emptyset$, it follows from Lemma 4.11 that $O(x) = \emptyset$. Hence $M(x) = O(x) \cup P(x, p) \cup P(x, q) = P(x, p) \cup P(x, q)$. From the proof of Lemma 4.11, $\delta(x) = \emptyset = M(x) - [(M(x) \cap M(p)) \cup (M(x) \cap M(q))]$. Hence $M(x) = [M(x) \cap M(p)] \cup [M(x) \cap M(q)]$. This proves (1).

For (2), suppose $a$ and $b$ in $X$ are as described, so $\delta(x, a, b) = \emptyset$. Then $M(x) = [M(x) \cap M(a)] \cup [M(x) \cap M(b)]$. Since $O(p) \ne \emptyset$ by (A4), we may choose $i \in O(p)$, whence $i \in P(x, p)$, whence $i \in M(x)$, whence either $i \in M(a)$ or $i \in M(b)$. In particular, either $p \le a$ or $p \le b$. Assume without loss of generality that $p \le a$. Since $x$ is the only child of $p$, and we cannot have $x \le a$ since $x$ is a leaf, it follows $p = a$. Similarly we see that either $q = a$ or $q = b$, but since $a$ and $b$ are distinct it follows $q = b$. $\square$

Lemma 4.13. Suppose all members $x$ of $X$ satisfy $\delta(x) = \emptyset$. Then no member of $X$ is a normal leaf.

Proof. Suppose $x \in X$ is a normal leaf. By (A4), $O(x) \ne \emptyset$, and by Lemma 4.7, 4.8, 4.9, or 4.10, $\delta(x) = O(x)$, a contradiction. $\square$

Lemma 4.14. Assume all members $y$ of $X$ satisfy $\delta(y) = \emptyset$. Suppose there exist distinct $x$, $p$, and $q$ in $X$, all three distinct from $r$, such that
(i) $\delta(x, p, q) = \emptyset$.
(ii) If $\delta(x, a, b) = \emptyset$ for distinct $x$, $a$, and $b$, then either ($p = a$ and $q = b$) or ($p = b$ and $q = a$).
(iii) $M(p) \not\subseteq M(x)$, and $M(q) \not\subseteq M(x)$.
Then $x$ is a hybrid leaf with parents $p$ and $q$, $x$ is the only child of $p$, and $x$ is the only child of $q$.

Proof. Suppose $x$ is not a hybrid leaf. Since $x$ is not a normal leaf by Lemma 4.13, we see that $x$ has a child $w$. For any normal child $u$ of $x$, by choosing a normal path from $u$ to $y \in X$ we would have $\delta(x, r, y) = \emptyset$ by Lemma 4.4. But then by (ii) either $p$ or $q$ equals $r$, contradicting that $p$ and $q$ were distinct from $r$. Hence every child of $x$ is hybrid. In particular $w$ is hybrid. Since $x$ is parent to a hybrid vertex, $x$ is normal by (A5), and we may let $s$ be its unique parent.

Case 1: Suppose $w$ has (at least) two children $u$ and $v$. Both are normal by (A5). Choose a normal path from $u$ to $y \in X$ and a normal path from $v$ to $z \in X$.

Case 1a: Suppose $s \in X$. Then $\delta(x, s, y) = \emptyset$ and $\delta(x, s, z) = \emptyset$ by the proof of Lemma 4.5. But $y \ne z$ by normality, so this contradicts (ii).

Case 1b: Suppose $s \notin X$. Then $s$ has a normal child $b$ distinct from $x$ by (A6). Choose a normal path from $b$ to $e \in X$. Then by the proof of Lemma 4.6 we have $\delta(x, e, y) = \emptyset$ and $\delta(x, e, z) = \emptyset$. This contradicts (ii).

Case 2: Suppose $w$ has exactly one child $u$. Then $u$ is normal since the parent of a hybrid child is normal and $w$ is hybrid. Since $w$ has outdegree 1, $w \in X$. Choose a normal path from $u$ to $y \in X$.

Case 2a: Suppose $s \in X$. Then $\delta(x, s, w) = \emptyset$ and $\delta(x, s, y) = \emptyset$ by the proof of Lemma 4.5. This contradicts (ii).

Case 2b: Suppose $s \notin X$. Then $s$ has a normal child $b$ distinct from $x$ by (A6). Choose a normal path from $b$ to $e \in X$. Then by the proof of Lemma 4.6 we have $\delta(x, e, w) = \emptyset$ and $\delta(x, e, y) = \emptyset$. This contradicts (ii).

Case 3: Suppose $w$ has no children, so $w$ is a hybrid leaf. Then $w \in X$.

Case 3a: Suppose $s$ has a normal child $b$ other than $x$. By Lemma 4.13, $b$ is not a normal leaf.
By (A6) either Subcase (3a1): $b$ has two normal children $a$ and $c$; or Subcase (3a2): $b \in X$ and $b$ has a normal child $a$. Choose normal paths from $a$ to $z \in X$, and if $c$ is present choose a normal path from $c$ to $u \in X$. In case (3a1) we have both $\delta(x, z, w) = \emptyset$ and $\delta(x, u, w) = \emptyset$ by the proof of Lemma 4.6, contradicting (ii). In case (3a2) we have both $\delta(x, z, w) = \emptyset$ and $\delta(x, b, w) = \emptyset$, contradicting (ii). Hence Case 3a cannot occur.

Case 3b: Suppose $s$ has no child other than $x$. Then $s \in X$ and $\delta(x, s, w) = \emptyset$ by the proof of Lemma 4.5. Hence by (ii) either ($p = s$ and $q = w$) or ($p = w$ and $q = s$). We may assume $p = s$ and $q = w$. Then $M(x) = M(p) \cup O(x)$ by (SH4), whence $M(p) \subseteq M(x)$, contradicting (iii).

Since all three cases are eliminated, it follows that $x$ is a hybrid leaf. By Lemma 4.12, $p$ and $q$ are the parents of $x$. I claim that $x$ is the only child of $p$. If not, by (A5), $p$ has a normal child $c$ other than $x$, and there is a normal path from $c$ to some $y \in X$. Then $\delta(x, y, q) = \emptyset$ by Lemma 4.11 as well as $\delta(x, p, q) = \emptyset$, contradicting (ii). Similarly $x$ is the only child of $q$. $\square$

Lemma 4.15. Assume $|X| \ge 3$ and all members $y$ of $X$ satisfy $\delta(y) = \emptyset$. Then
(i) There exists a hybrid leaf $x$ with parents $p$ and $q$ such that $x$, $p$, and $q$ are in $X$, $x$ is the only child of $p$, and $x$ is the only child of $q$.
(ii) Neither $p$ nor $q$ is equal to $r$.
(iii) $\delta(x, p, q) = \emptyset$.
(iv) If $\delta(x, a, b) = \emptyset$ then either ($p = a$ and $q = b$) or ($p = b$ and $q = a$).
(v) There is no $y \in X$ such that $\delta(x, r, y) = \emptyset$.
(vi) $M(p) \not\subseteq M(x)$, and $M(q) \not\subseteq M(x)$.

Proof. To see (i), by Lemma 4.13 note that the network has no normal leaves. Choose a directed path from $r$ with maximal length (number of arcs) ending at $x_1$ through parent $p_1$. Then $x_1$ is a leaf, whence $x_1$ is hybrid since there are no normal leaves, and one parent of $x_1$ is $p_1$. I claim that $p_1$ has no child other than $x_1$. If $p_1$ has a child $c_1$ other than $x_1$, then by (A5) every child of $p_1$ other than the hybrid $x_1$ is normal, so $c_1$ is normal.
But $c_1$ cannot be a normal leaf, so $c_1$ must have a child, in which case the path from $r$ to $p_1$ to $c_1$ could be extended and the path from $r$ to $x_1$ did not have maximal length. The claim follows, and since $p_1$ has outdegree 1, $p_1 \in X$. Let $q_1$ be the other parent of hybrid $x_1$. If $q_1$ has no child other than $x_1$, then $x_1$, $p_1$, $q_1$ satisfy (i). Otherwise $q_1$ has a child $d_1$ other than $x_1$, $d_1$ is normal and not a leaf, and we may choose a maximal directed path starting at $d_1$ ending at $x_2$ through its parent $p_2$. Then as above $x_2$ is a hybrid leaf, $p_2$ has no child other than $x_2$, and $x_2$ has other parent $q_2$. Repeat the process. It must terminate with some hybrid leaf $x_i$ with parents $p_i$ and $q_i$ such that $x_i$ is the only child of $p_i$ and $x_i$ is the only child of $q_i$; otherwise we generate a pseudocycle, contradicting (A8). Note $x_i \in X$ since it is a leaf, while $p_i$ and $q_i$ are in $X$ since they have outdegree 1. This proves (i).

If $p = r$, then $p = r < q < x$ shows that the arc $(p, x)$ is redundant, a contradiction. Hence $p \ne r$, and similarly $q \ne r$. This proves (ii).

By Lemma 4.12, (iii) and (iv) hold. Then (v) follows from (ii) and (iv).

To see (vi), suppose $M(p) \subseteq M(x)$. Then $P(x, p) = M(p) \cap M(x) = M(p)$, contradicting (A7). $\square$

We can now prove Theorem 4.1.

Call a phylogenetic network $N = (V, A, r, X)$ "smaller" than a network $N' = (V', A', r', X')$ if either $N$ has fewer vertices than $N'$ or else $N$ and $N'$ have the same number of vertices but $N$ has more members $x \in X$ such that $\delta(x) = \emptyset$ than $N'$ has members $x' \in X'$ such that $\delta(x') = \emptyset$.

The proof is by induction using the notion of smallness. If $|V| = 1$, then $V = \{r\}$ and $N$ is uniquely determined. If $|V| = 2$, let $V = \{r, v\}$; then clearly $A$ consists of the single arc $(r, v)$. Hence we may assume $|V| \ge 3$ whence $|X| \ge 3$. We assume the result when the network is "smaller" than $N = (V, A, r, X)$.
By the hypotheses (A1)–(A8) we know that every vertex $x \in X$ satisfies one of the following descriptions (i)–(ix):
(i) $x = r$.
(ii) $x$ is a normal leaf with parent $p \in X$.
(iii) $x$ is a normal leaf with parent $p$ such that $p$ has a normal child $c$ distinct from $x$.
(iv) $x$ is a normal leaf with parent $p$ such that $p$ has a hybrid child $c$, $p$ is normal with parent $p'$, and $p' \in X$.
(v) $x$ is a normal leaf with parent $p$ such that $p$ has a hybrid child $c$, $p$ is normal with parent $p'$, and $p'$ has a normal child $b$ distinct from $p$.
(vi) $x$ is a hybrid leaf with parents $p$ and $q$.
(vii) $x \ne r$, and $x$ has a normal child.
(viii) $x \ne r$, $x$ has a hybrid child, and $x$ is normal with parent $p'$ such that $p' \in X$.
(ix) $x \ne r$, $x$ has a hybrid child, $x$ is normal with parent $p'$, and $p'$ has a normal child $b$ distinct from $x$.

For each $x \in X$, compute $\delta(x)$. We have different cases depending on the result.

Case 1: Suppose there exists $x \in X$ with $\delta(x) \ne \emptyset$. If $x$ is not a leaf, then one of cases (vii), (viii), or (ix) above occurs, and $\delta(x) = \emptyset$ by Lemmas 4.4–4.6. Hence $x$ is a leaf of $N$, so one of cases (ii)–(vi) occurs. Then by Lemma 4.8, 4.7, 4.9, 4.10, or 4.11, respectively, $O(x) = \delta(x)$.

Case 1a: Suppose $x$ is a normal leaf with parent $p$. By hypothesis $M(x)$ is known. Form a new network $N' = (V', A', r, X')$ by setting $V' = V - \{x\}$, $A' = A - \{(p, x)\}$, and $X' = (X - \{x\}) \cup \{p\}$. Note $M(p) = M(x) - O(x) = M(x) - \delta(x)$, so $M(p)$ is known. If $X$ already contains a vertex $u$ such that $M(u) = M(p)$, then $u$ and $p$ may be identified together by Lemma 4.2; otherwise, $p$ is a new vertex in $X' - X$. Note that $N'$ has fewer vertices than $N$, so by induction $N'$ is uniquely determined. Hence $N$ is determined by $V = V' \cup \{x\}$, $A = A' \cup \{(p, x)\}$, $X = X' \cup \{x\}$. Note $M(x)$ was already known.

Case 1b: Suppose $x$ is a hybrid leaf with parents $p$ and $q$. Let $x'$ be the separated vertex for $x$. Then $O(x') = \emptyset$ and $M(x') = M(x) - O(x) = M(x) - \delta(x)$.
Form a new network $N' = (V', A', r, X')$ by setting $V' = (V - \{x\}) \cup \{x'\}$, $A' = (A - \{(p, x), (q, x)\}) \cup \{(p, x'), (q, x')\}$, and $X' = (X - \{x\}) \cup \{x'\}$. Note that $N'$ has the same number of vertices as $N$, but now $\delta(x') = O(x') = O(x) - O(x) = \emptyset$. Hence $N'$ is "smaller" than $N$, so by induction $N'$ is uniquely determined. Hence $N$ is determined by $V = (V' - \{x'\}) \cup \{x\}$, $A = (A' - \{(p, x'), (q, x')\}) \cup \{(p, x), (q, x)\}$, $X = (X' - \{x'\}) \cup \{x\}$. Note $M(x)$ was already known.

Observe that the formal constructions in Cases 1a and 1b are identical: the removal of $x$ and insertion of $u$ such that there is an arc $(u, x)$ and such that $M(u) = M(x) - \delta(x)$. Hence we do not need to know which of the two subcases is occurring.

Case 2: Suppose there exists no $x \in X$ with $\delta(x) \ne \emptyset$. By Lemma 4.13 there are no normal leaves. By Lemma 4.15 there exist distinct $x$, $p$, and $q$ in $X$ that satisfy the hypotheses of Lemma 4.14. By Lemma 4.14, $x$ is a hybrid leaf with parents $p$ and $q$, $x$ is the only child of $p$, and $x$ is the only child of $q$. Since $\delta(x) = \emptyset$, $x$ is already a separated hybrid vertex. By hypothesis $M(x)$ is known. Form a new network $N' = (V', A', r, X')$ by setting $V' = V - \{x\}$, $A' = A - \{(p, x), (q, x)\}$, $X' = X - \{x\}$. Then $N'$ is a normal phylogenetic network; note $M(p)$ and $M(q)$ are known since $p$ and $q$ are in $X$. Since $|V'| < |V|$, by induction it follows that $N'$ is uniquely determined. But then $N$ is uniquely determined by $V = V' \cup \{x\}$, $A = A' \cup \{(p, x), (q, x)\}$, $X = X' \cup \{x\}$. Since $M(x)$ was already known, the genome at each member of $V$ is determined. This completes the reconstruction of $N$.

Note that the procedure given above is constructive. In each stage a network $N = (V, A, r, X)$ is replaced by a network $N' = (V', A', r, X')$ such that $|X'| \le |X|$ and either $|V'| < |V|$ or else ($|V'| = |V|$ but $X'$ has more separated hybrid vertices than $X$). Suppose that $|C| = m$ and in the initial network $|V| = v$ and $|X| = n$. The number of stages is then at most $nv$.
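As a rough illustration of the per-stage computation, the following Python sketch evaluates the stem function on a toy base-set. It assumes, consistently with how the proofs above use it, that $\delta(x, a, b) = M(x) - [(M(x) \cap M(a)) \cup (M(x) \cap M(b))]$ and that $\delta(x)$ is the intersection of $\delta(x, a, b)$ over all pairs of distinct $a, b \in X$ other than $x$. The function names `delta3` and `stem` and the three-vertex example are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def delta3(M, x, a, b):
    """delta(x, a, b) = M(x) - [(M(x) & M(a)) | (M(x) & M(b))]."""
    return M[x] - ((M[x] & M[a]) | (M[x] & M[b]))


def stem(M, x):
    """Assumed definition: intersect delta(x, a, b) over all pairs of
    distinct a, b in X other than x.  With fewer than three members in X
    there are no pairs, and this sketch simply returns M(x)."""
    others = [v for v in M if v != x]
    result = set(M[x])  # every delta(x, a, b) is a subset of M(x)
    for a, b in combinations(others, 2):
        result &= delta3(M, x, a, b)
    return result


# Toy tree (an assumption for illustration): root r with M(r) empty, an
# internal vertex p with O(p) = {1}, and leaves x, y with O(x) = {2},
# O(y) = {3}.  Only the genomes of the base-set X = {r, x, y} are given.
M = {
    "r": frozenset(),
    "x": frozenset({1, 2}),
    "y": frozenset({1, 3}),
}

stems = {v: stem(M, v) for v in M}

# Case 1a of the proof: the parent genome is M(p) = M(x) - delta(x).
parent_of_x = M["x"] - stems["x"]
```

In this toy case `stems["x"]` recovers the originating set of the leaf $x$ and `parent_of_x` recovers $M(p)$, as in Lemma 4.7 and Case 1a. The double loop over pairs, each costing $O(m)$ set work for $m$ characters, is one way to see where a cubic dependence on $|X|$ per stage can come from.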
The computation of all $\delta(x, a, b)$ at each stage takes time at most $O(n^3 m)$. Hence a naïve implementation has time complexity $O(n^4 v m)$. By Lemma 3.3, every vertex $u$ not in $X$ can be written as $u = \mathrm{mrca}(x_1, x_2)$ for some $x_1$ and $x_2$ in $X$, whence $v = O(n^2)$. It follows that the reconstruction procedure has time complexity $O(n^6 m)$.

5. Discussion

This paper describes a method for reconstructing a phylogenetic network and its ancestral genomes, given the genomes of the leaves and outgroup, under certain assumptions. The assumptions are sometimes quite strong and future research is desirable to weaken these assumptions. Examples show, however, that some version of assumptions (A5)–(A7) for Theorem 4.1 will be required.

The most difficult part of Theorem 4.1 is identifying $x$, $p$, and $q$ in $X$ such that $x$ is hybrid with parents $p$ and $q$. The assumptions (A5)–(A8) serve to permit this identification. Hence a different criterion to recognize a hybrid vertex could potentially help relax these assumptions.

It would be of interest to be able to deal with non-binary characters. Another desirable extension would be a reconstruction in case the evolution of characters follows a Markov process rather than the Simple Homoplasy Model. Most desirable would be a way to include some manner of homoplasy at normal vertices.

Acknowledgments

I wish to thank the Isaac Newton Institute in Cambridge UK for its hospitality in a wonderful setting while I wrote this paper. I also thank Mike Steel and Katherina Huber for helpful conversations. Finally I thank the anonymous referees for their excellent corrections and suggestions.

References

Bandelt, H.-J., Dress, A., 1992. Split decomposition: a new and useful approach to phylogenetic analysis of distance data. Mol. Phylogenet. Evol. 1, 242–252.
Baroni, M., Steel, M., 2006. Accumulation phylogenies. Ann. Combin. 10, 19–30.
Baroni, M., Semple, C., Steel, M., 2004. A framework for representing reticulate evolution. Ann. Combin. 8, 391–408.
Baroni, M., Semple, C., Steel, M., 2006. Hybrids in real time. Syst. Biol. 55, 46–56.
Bordewich, M., Semple, C., 2007. Computing the minimum number of hybridization events for a consistent evolutionary history. Discrete Appl. Math. 155, 914–928.
Cardona, G., Rosselló, F., Valiente, G., 2007. Comparison of tree-child phylogenetic networks. arXiv:0708.3499v1 [q-bio.PE] 27 August 2007.
Gusfield, D., Eddhu, S., Langley, C., 2004a. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination. J. Bioinformatics Comput. Biol. 2, 173–213.
Gusfield, D., Eddhu, S., Langley, C., 2004b. The fine structure of galls in phylogenetic networks. INFORMS J. Comput. 16, 459–469.
Hein, J., 1990. Reconstructing evolution of sequences subject to recombination using parsimony. Math. Biosci. 98, 185–200.
Hein, J., 1993. A heuristic method to reconstruct the history of sequences subject to recombination. J. Mol. Evol. 36, 396–405.
Moret, B., Nakhleh, L., Warnow, T., Randal Linder, C.R., Tholse, A., Padolina, A., Sun, J., Timme, R., 2004. Phylogenetic networks: modeling, reconstructibility, and accuracy. IEEE Trans. Comput. Biol. Bioinformatics 1, 13–23.
Nakhleh, L., Warnow, T., Linder, C.R., St. John, K., 2005. Reconstructing reticulate evolution in species: theory and practice. J. Comput. Biol. 12, 796–811.
Strimmer, K., Moulton, V., 2000. Likelihood analysis of phylogenetic networks using directed graph models. Mol. Biol. Evol. 17, 875–881.
Wang, L., Zhang, K., Zhang, L., 2001. Perfect phylogenetic networks with recombination. J. Comput. Biol. 8, 69–78.
Willson, S.J., 2007. Unique determination of some homoplasies at hybridization events. Bull. Math. Biol. 69, 1709–1725.