This article appeared in a journal published by Elsevier. The attached
copy is furnished to the author for internal non-commercial research
and education use, including for instruction at the authors institution
and sharing with colleagues.
Other uses, including reproduction and distribution, or selling or
licensing copies, or posting to personal, institutional or third party
websites are prohibited.
In most cases authors are permitted to post their version of the
article (e.g. in Word or Tex form) to their personal website or
institutional repository. Authors requiring further information
regarding Elsevier’s archiving and manuscript policies are
encouraged to visit:
http://www.elsevier.com/copyright
Author's personal copy
ARTICLE IN PRESS
Journal of Theoretical Biology 252 (2008) 338–349
www.elsevier.com/locate/yjtbi
Reconstruction of certain phylogenetic networks
from the genomes at their leaves
Stephen J. Willson
Department of Mathematics, Iowa State University, Ames, IA 50011, USA
Received 30 October 2007; received in revised form 6 February 2008; accepted 11 February 2008
Available online 19 February 2008
Abstract
A network N is a rooted acyclic digraph. A base-set X for N is a subset of vertices including the root (or outgroup), all leaves, and all
vertices of outdegree 1. A simple model of evolution is considered in which all characters are binary and in which back-mutations occur
only at hybrid vertices. It is assumed that the genome is known for each member of the base-set X. If the network is known and is
assumed to be ‘‘normal,’’ then it is proved that the genome of every vertex is uniquely determined and can be explicitly reconstructed.
Under additional hypotheses involving time-consistency and separation of the hybrid vertices, the network itself can also
be reconstructed from the genomes of all members of X. An explicit polynomial-time procedure is described for performing the
reconstruction.
© 2008 Elsevier Ltd. All rights reserved.
Keywords: Phylogenetic network; Hybrid; Recombination; Reticulate; Speciation
1. Introduction
Phylogenetic relationships are most commonly represented by rooted trees. The extant taxa correspond to
leaves of the trees, while internal vertices correspond to
ancestral species. The arcs correspond to genetic change,
typically by mutations in the DNA such as substitutions,
insertions, and deletions.
There has been increased interest recently in phylogenetic networks that are not necessarily trees. Besides
speciation events, these networks could include such
additional reticulation events as hybridization, recombination, or lateral gene transfer. Basic models of recombination were suggested by Hein (1990, 1993). General
frameworks are discussed in Bandelt and Dress (1992),
Baroni et al. (2004), Moret et al. (2004), and Nakhleh et al.
(2005).
Tel.: +1 515 294 7671; fax: +1 515 294 5454.
E-mail address: swillson@iastate.edu
0022-5193/$ - see front matter © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.jtbi.2008.02.015
Recent evidence suggests that reticulation events are not
rare but are still less frequent than ordinary speciation
events. A common approach has been to seek networks
with as few reticulation events as possible, in part to give a
lower bound. A ‘‘perfect phylogeny’’ is a network in which
for each character the set of vertices with a particular value
of the character is connected. Wang et al. (2001) considered
the problem of finding a perfect phylogenetic network with
recombination that has the smallest number of recombination events. They suggested that the problem is NP-hard,
and a full proof was given by Bordewich and Semple
(2007). Wang et al. then considered a restricted problem in
which all recombination events are associated with node-disjoint recombination cycles. Gusfield et al. (2004a) gave
necessary and sufficient conditions to identify these networks, which they called ‘‘galled-trees,’’ and they added a
much more specific and realistic model of recombination
events. Gusfield et al. (2004b) gave a more detailed study of
these node-disjoint cycles.
This paper concerns a direct reconstruction of a
phylogenetic network, without seeking to minimize the
number of reticulation events.
Additional hypotheses are assumed about the network.
Different papers in the literature differ in the detailed
assumptions about the nature of the networks. One
commonly assumed property is that they be rooted acyclic
digraphs (Strimmer and Moulton, 2000; Moret et al., 2004;
Nakhleh et al., 2005). Another is a kind of time consistency
at hybridization events, roughly that the parents of a
hybrid be possibly contemporaneous (‘‘temporal representation’’ for Baroni et al., 2006, ‘‘coexistence in time’’ for
Moret et al., 2004). Another is a condition called
‘‘regularity’’ in Baroni et al. (2004) in which, among other
properties, no two vertices have the same set of leaves as
descendents.
This paper performs the reconstructions under the
assumption that the network N is ‘‘normal’’. An exact
definition is given in Section 2, but roughly the idea is as
follows: There is a ‘‘base-set’’ X consisting of vertices with
known DNA, typically containing just the root and the
leaves. The network is normal if from every vertex that is not
in X there is a directed path to a vertex in X such that no
vertex after the initial vertex is a hybrid vertex. This property
is similar to the defining property in ‘‘tree-child networks’’
(Cardona et al., 2007) or the assumption about ‘‘tree nodes’’
in ‘‘model phylogenetic networks’’ (Moret et al., 2004).
The first main theorem (Theorem 3.1) assumes that the
network N is normal. Given the genomes at all members of
X and given N, the theorem asserts that genomes at all
internal vertices may be uniquely reconstructed. The
second main theorem (Theorem 4.1) asserts that, under
some additional assumptions, given the genomes at the
members of X, then the network N itself can also be
uniquely reconstructed. The construction methods are
explicit and indeed of polynomial-time complexity.
All characters are assumed to be binary. Baroni and Steel
(2006) assumed a simple but related model of evolution
called ‘‘accumulation phylogeny’’ in which each child
inherits the entire set of modified characters in each parent.
Thus there may be spontaneous mutations in the network,
but all such mutations are inherited by all descendents.
When each hybridization event is a polyploidy, this
assumption may be appropriate. The current paper seeks
to generalize the evolutionary model and be somewhat more
realistic. It assumes instead that at a hybridization event
only a portion of the genome of each parent is inherited by
each child. On the other hand, we retain the assumption that
the mutated genomes in a parent are inherited by a
‘‘normal’’ child for which there is only a single parent.
The resulting model of evolution is called in this paper the
‘‘Simple Homoplasy Model’’. Full details are in Section 2.
For Theorem 3.1 on reconstructing genomes given the
network, the assumption of normality is the principal
assumption on the network. For Theorem 4.1, in which the
network itself is being reconstructed, additional assumptions are made. We assume a version of time consistency at
hybridization events. We assume that a hybrid vertex has
exactly two parents. We assume that the hybridization
events are separated in the network, so that for example a
hybrid node does not have a hybrid child (this was also
assumed in Moret et al., 2004), but in addition there are
constraints on the immediate descendents of a parent and
grandparent of a hybrid vertex. Examples are presented
that indicate that some, at least, of the extra conditions are
required for unique reconstruction of the network.
This paper is a major extension of the author’s paper
(Willson, 2007). The earlier paper obtained a theorem
about reconstructing genomes given the network and the
genomes at the leaves. It did not, however, give a
reconstruction of the network itself. The current paper
extends (Willson, 2007) in two ways. First, it gives a
reconstruction of the network itself. Secondly, it applies to
a more general model of evolution, with less specific
assumptions about the nature of homoplasies. Hence the
reconstruction of the genomes given the network (Theorem
3.1) is also a major generalization of the main theorem of
Willson (2007).
2. Fundamentals
A directed graph or digraph $(V, A)$ consists of a finite set
V of vertices and a finite set A of arcs, each consisting of an
ordered pair $(u, v)$ where $u \in V$, $v \in V$, $u \neq v$, interpreted
as an arrow from u (the parent) to v (the child). There
are no multiple arcs and no loops. A directed path is a
sequence $u_0, u_1, \ldots, u_k$ of vertices such that for $i = 1, \ldots, k$,
$(u_{i-1}, u_i) \in A$. The path is trivial if $k = 0$. Write $u \leq v$ if there
is a directed path starting at u and ending at v. The graph is
acyclic if there is no nontrivial directed path starting and
ending at the same point. If the graph is acyclic, it is easy to
see that $\leq$ is a partial order on V.
The acyclic digraph $(V, A)$ has root $r \in V$ if for all $v \in V$,
$r \leq v$.
The indegree of vertex u is the number of $v \in V$ such that
$(v, u) \in A$. The outdegree of u is the number of $v \in V$ such
that $(u, v) \in A$. If $(V, A)$ is rooted at r then r is the only
vertex of indegree 0. A leaf is a vertex of outdegree 0. A
normal (or tree) vertex is a vertex of indegree at most 1. A
hybrid vertex (or recombination vertex or reticulation node)
is a vertex of indegree at least 2.
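The degree-based definitions above can be illustrated with a minimal Python sketch. The function name `classify_vertices` and the six-vertex example are hypothetical and are not taken from the paper; arcs are assumed to be stored as (parent, child) pairs.

```python
def classify_vertices(vertices, arcs):
    """Classify vertices of a digraph by indegree and outdegree.

    `arcs` is a collection of (parent, child) pairs.
    """
    indegree = {v: 0 for v in vertices}
    outdegree = {v: 0 for v in vertices}
    for parent, child in arcs:
        outdegree[parent] += 1
        indegree[child] += 1
    return {
        "roots": {v for v in vertices if indegree[v] == 0},
        "leaves": {v for v in vertices if outdegree[v] == 0},
        "normal": {v for v in vertices if indegree[v] <= 1},
        "hybrid": {v for v in vertices if indegree[v] >= 2},
    }

# Hypothetical example: h is a hybrid with parents a and b.
V = {"r", "a", "b", "h", "x", "y"}
A = {("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"), ("a", "x"), ("b", "y")}
kinds = classify_vertices(V, A)
```

In this example the only vertex of indegree 0 is r, and the root counts as a normal vertex because its indegree is at most 1.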
A base-set X is a subset of V that contains the root, all
leaves, and all vertices of outdegree 1. (It may contain other
vertices as well.) The interpretation of X is that its members
correspond to taxa on which direct measurements may be
made. The leaves correspond typically to extant taxa on
which such measurements as DNA are possible. While the
root is usually thought of as being a remote ancestor, it
may be replaced in practice by an outgroup on which
measurements can be made. In consideration of trees, it is
common to suppress any vertices of outdegree 1 other than
possibly the root because nothing in the tree uniquely
identifies such a taxon. If such a vertex remains in the
network, it is because special information is known about
such a taxon, whence we assume measurements on it can be
made and it is in X.
An arc $(u, v)$ is redundant if there exists $w \in V$ such that
u, v, and w are distinct and $u \leq w \leq v$. The inclusion of a
redundant arc is problematic since it duplicates much
genetic information while adding to both indegrees and
outdegrees.
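As a rough illustration, redundant arcs can be detected by a reachability search. The sketch below assumes an acyclic digraph given as (parent, child) pairs; in that setting an arc $(u, v)$ is redundant exactly when some other child of u also reaches v. The helper names are hypothetical.

```python
def reachable_from(arcs, start):
    """Return every vertex w with start <= w, including start itself."""
    children = {}
    for parent, child in arcs:
        children.setdefault(parent, set()).add(child)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for c in children.get(u, ()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def redundant_arcs(arcs):
    """Return the arcs (u, v) for which another child w of u satisfies w <= v."""
    arcs = set(arcs)
    found = set()
    for u, v in arcs:
        other_children = {w for (t, w) in arcs if t == u and w != v}
        if any(v in reachable_from(arcs, w) for w in other_children):
            found.add((u, v))
    return found

# Hypothetical example: the arc (u, v) duplicates the path u, w, v.
example_arcs = {("u", "w"), ("w", "v"), ("u", "v")}
redundant = redundant_arcs(example_arcs)
```

This brute-force version recomputes reachability for every arc, which is fine for small examples but is not tuned for large networks.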
In this paper a (phylogenetic) network $N = (V, A, r, X)$ is
a rooted acyclic digraph $(V, A)$ with root r and base-set X
such that there are no redundant arcs.
The fundamental problem is to learn about N from
information on X.
Let $N = (V, A, r, X)$ be a phylogenetic network. A
directed path $u = u_0, u_1, u_2, \ldots, u_k = v$ (where $(u_{i-1}, u_i)$ is
an arc for $i = 1, \ldots, k$) is a normal path from u to v provided
for $i > 0$, $u_i$ is normal. Note u itself may or may not be
hybrid. There is a normal path from u to X if there is a
normal path from u to some x such that $x \in X$. If $u \in X$,
then the trivial path starting at u is a normal path from u to
X. The network N is normal if from every vertex $u \in V$
there is a normal path to X.
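One way to test normality is a search from each vertex that only steps onto normal vertices. The following Python sketch is an illustration under that reading of the definition; the function name and both example networks are hypothetical.

```python
def is_normal_network(vertices, arcs, base_set):
    """Check that every vertex has a normal path to the base-set."""
    indegree = {v: 0 for v in vertices}
    children = {v: [] for v in vertices}
    for parent, child in arcs:
        indegree[child] += 1
        children[parent].append(child)

    def has_normal_path(u):
        # The start vertex may be hybrid; every later vertex must be normal.
        seen, stack = {u}, [u]
        while stack:
            w = stack.pop()
            if w in base_set:
                return True
            for c in children[w]:
                if indegree[c] <= 1 and c not in seen:
                    seen.add(c)
                    stack.append(c)
        return False

    return all(has_normal_path(u) for u in vertices)

# Hypothetical normal network: a and b each keep a normal leaf child.
V1 = {"r", "a", "b", "h", "x", "y"}
A1 = {("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"), ("a", "x"), ("b", "y")}
X1 = {"r", "h", "x", "y"}

# Hypothetical non-normal network: both children of b are hybrid.
V2 = {"r", "a", "b", "c", "h1", "h2", "x", "y"}
A2 = {("r", "a"), ("r", "b"), ("r", "c"), ("a", "h1"), ("b", "h1"),
      ("b", "h2"), ("c", "h2"), ("a", "x"), ("c", "y")}
X2 = {"r", "h1", "h2", "x", "y"}
```

In the second example the vertex b is not in the base-set and has only hybrid children, so no normal path leaves it.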
Let W be a nonempty subset of V. The most recent
common ancestor of W, denoted $\mathrm{mrca}(W)$, is the vertex
$u \in V$, if it exists, such that
(1) For all $w \in W$, $u \leq w$.
(2) Suppose $v \in V$ satisfies that for all $w \in W$, $v \leq w$. Then
$v \leq u$.
If $\mathrm{mrca}(W)$ exists, it is unique. This is because, if $u_1$ and $u_2$
both satisfy the definition then by (1) for all $w \in W$, $u_2 \leq w$,
whence by (2) $u_2 \leq u_1$. By a symmetric argument, $u_1 \leq u_2$.
Hence $u_1 = u_2$. It is easy, however, to construct examples
of networks where $\mathrm{mrca}(W)$ need not exist.
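The definition of the most recent common ancestor translates directly into a small search over common ancestors. This is a sketch only; the function names and the two example digraphs are hypothetical, and the function returns `None` when no vertex satisfies both conditions.

```python
def ancestors_or_self(arcs, v):
    """Return every vertex u with u <= v, including v itself."""
    parents = {}
    for parent, child in arcs:
        parents.setdefault(child, set()).add(parent)
    seen, stack = {v}, [v]
    while stack:
        w = stack.pop()
        for p in parents.get(w, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def mrca(vertices, arcs, W):
    """Return mrca(W) if it exists, otherwise None."""
    common = set(vertices)
    for w in W:
        common &= ancestors_or_self(arcs, w)  # condition (1)
    for u in common:
        # Condition (2): every common ancestor v satisfies v <= u.
        if common <= ancestors_or_self(arcs, u):
            return u
    return None

# Hypothetical tree in which mrca always exists.
tree_V = {"r", "a", "x", "y", "z"}
tree_A = {("r", "a"), ("r", "z"), ("a", "x"), ("a", "y")}

# Hypothetical network in which h1 and h2 share the parents a and b,
# so the common ancestors r, a, b have no greatest element.
net_V = {"r", "a", "b", "h1", "h2"}
net_A = {("r", "a"), ("r", "b"), ("a", "h1"), ("b", "h1"),
         ("a", "h2"), ("b", "h2")}
```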
This paper concerns evolution under the ‘‘Simple
Homoplasy Model.’’ Let C denote a (large finite) set of
binary characters. Let $N = (V, A, r, X)$ be a phylogenetic
network. Associated with each $v \in V$ there is a set
$M(v) \subseteq C$, called the mutated genome of v, such that $i \in M(v)$
iff character i has a different allele in taxon v than the
allele at the root r. Since each character i is binary, it
follows that the genome of v is determined by the genome
at r and by $M(v)$. It is immediate that $M(r) = \emptyset$. Since
measurements can be made on members of X and $r \in X$,
we may assume that for every $x \in X$, the genomes of r and
x are known, whence $M(x)$ is known.
Following are the assumptions for the Simple Homoplasy
Model.
(SH1) For every $v \in V$, there is a set $O(v) \subseteq M(v)$, whose
members are called originating mutations at v.
(SH2) If $v \neq w$, then $O(v) \cap O(w) = \emptyset$.
(SH3) $M(r) = O(r) = \emptyset$.
(SH4) If $c \in V$ is normal with parent $p \in V$, then $M(c) = M(p) \cup O(c)$.
(SH5) If $c \in V$ is hybrid with parents $p_1, p_2, \ldots, p_k$ then for
each i, $1 \leq i \leq k$, there exist sets $P(c, p_i) \subseteq C$ such that
(SH5a) $M(c) = O(c) \cup \bigcup \{P(c, p_i) : i = 1, \ldots, k\}$,
(SH5b) $P(c, p_i) = M(c) \cap M(p_i)$, and
(SH5c) $O(p_i) \subseteq P(c, p_i)$ for $i = 1, \ldots, k$.
Call $P(c, p_i)$ the parental contribution to c from $p_i$.
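The axioms (SH1)–(SH5) can be checked mechanically when the sets $M(v)$ and $O(v)$ are known at every vertex. The sketch below is illustrative only: the function name, the data layout (dictionaries of Python sets), and the example network are assumptions, and the parental contributions are computed from (SH5b) rather than supplied.

```python
def satisfies_simple_homoplasy(vertices, arcs, root, M, O):
    """Return True if the sets M(v), O(v) satisfy (SH1)-(SH5)."""
    parents = {v: [] for v in vertices}
    for parent, child in arcs:
        parents[child].append(parent)
    # (SH1): originating mutations are part of the mutated genome.
    if any(not O[v] <= M[v] for v in vertices):
        return False
    # (SH2): originating sets are pairwise disjoint.
    seen = set()
    for v in vertices:
        if seen & O[v]:
            return False
        seen |= O[v]
    # (SH3): nothing is mutated at the root.
    if M[root] or O[root]:
        return False
    for c in vertices:
        ps = parents[c]
        if len(ps) == 1:
            # (SH4): a normal child inherits everything from its parent.
            if M[c] != M[ps[0]] | O[c]:
                return False
        elif len(ps) >= 2:
            P = {p: M[c] & M[p] for p in ps}        # (SH5b)
            if M[c] != O[c].union(*P.values()):     # (SH5a)
                return False
            if any(not O[p] <= P[p] for p in ps):   # (SH5c)
                return False
    return True

# Hypothetical network: h is hybrid with parents a and b.
V = ["r", "a", "b", "h", "x", "y"]
A = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"), ("a", "x"), ("b", "y")]
O = {"r": set(), "a": {1}, "b": {2}, "h": {5}, "x": {3}, "y": {4}}
M = {"r": set(), "a": {1}, "b": {2}, "h": {1, 2, 5}, "x": {1, 3}, "y": {2, 4}}

# Same network, but character 1 (originating at a) is lost at h:
# an immediate reversion, which violates (SH5c).
M_reverted = dict(M, h={2, 5})
```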
The intuition behind the model is that the sets $O(v)$
identify the new mutations that occur at vertex v. Under an
infinite site hypothesis, there are so many characters and
mutation is sufficiently rare that the same mutation never
occurs twice; hence the sets $O(v)$ are pairwise disjoint,
as required in (SH2). Inheritance at a normal vertex c
(with only one parent p) is very simple as described in
(SH4): all mutated characters from p are inherited by c,
and new mutations identified in $O(c)$ occur as well. At a
hybrid vertex c with parents p and q, the vertex c inherits
the mutations $P(c, p)$ from parent p and mutations $P(c, q)$
from parent q. Hence (SH5b) $P(c, p) = M(c) \cap M(p)$
whence $P(c, p) \subseteq M(p)$. Every mutation $i \in M(c)$ either
comes from p ($i \in P(c, p)$), comes from q ($i \in P(c, q)$),
or originates anew at c ($i \in O(c)$). Hence (SH5a)
$M(c) = O(c) \cup P(c, p) \cup P(c, q)$. The axiom (SH5a) also
allows an arbitrary set of parents. It is easy to see that
if there exists $v \in V$ such that $i \in M(v)$, then there exists
$u \in V$ such that $i \in O(u)$ and $u \leq v$.
The important final requirement (SH5c), if the hybrid
vertex c has parents p and q, is that $O(p) \subseteq P(c, p)$ and
$O(q) \subseteq P(c, q)$. This asserts that every mutation in $M(p)$
that originated at p is inherited by c. If this were not the
case, then a mutation originating at p could disappear
immediately, a situation that will be called an immediate
reversion. In the simplest case, suppose p has exactly one
additional child $c'$ as well as c, and $c'$ is normal. If there
were an immediate reversion of character i, then $i \in O(p)$
and $i \in M(c')$ (by (SH4)) but $i \notin M(c)$. Since $M(p)$ is not
directly measurable, this situation would be indistinguishable from that in which $i \in O(c')$. An immediate reversion
would thus provide a trivial barrier to reconstruction of the
genome. If we assume there are no immediate reversions,
then once the network is known, this trivial ambiguity
could be instantly recognized.
Let $N = (V, A)$ be an acyclic digraph. A pseudocycle in N
is a sequence of vertices $x_0, x_1, x_2, \ldots, x_n$ from V with $n > 0$
such that $x_n = x_0$ and for each i (taken mod n) either
(1) $(x_i, x_{i+1})$ is an arc; or
(2) $x_i$ is hybrid with distinct parents $x_{i-1}$ and $x_{i+1}$ and
$(x_{i+1}, x_i)$ is an arc.
A pseudocycle is not a cycle since it is not a directed
path. Nevertheless it is very similar to a cycle since time is
moving forward on most parts of the sequence. The
existence of a pseudocycle indicates a lack of ‘‘time
consistency’’. For example, if there is a temporal representation on the network (Baroni et al., 2006) then each
vertex v has a ‘‘time’’ $f(v)$ such that when v has parents p
and q, then $f(p) = f(q)$; and when c is a child of u, then
$f(u) < f(c)$. Following a pseudocycle we see that the
successive hybrid parents must exist later in time and yet
loop back to the original hybrid node, an impossibility.
Hence the network can have no pseudocycle.
If $N = (V, A, r, X)$ is a phylogenetic network, suppose x
is a hybrid vertex with parents p and q. Call x a positive
hybrid if $O(x) \neq \emptyset$. If x is a positive hybrid, perform an
operation to produce a new network $N_x$ as follows: Insert
a new vertex $x'$ called the separated vertex at x. Delete the
arcs $(p, x)$ and $(q, x)$. Insert new arcs $(p, x')$, $(q, x')$, and
$(x', x)$, and let $O(x') = \emptyset$. This procedure will be called
separating x, and $N_x$ is the separated network. See Fig. 1.
Fig. 1. In separation of the hybrid vertex x, the left graph N is replaced by
the right graph $N_x$ in which $O(x') = \emptyset$.
The separated vertex $x'$ has biological meaning. In the
act of hybridization of taxa p and q to yield x, a new taxon
$x'$ was produced such that $M(x') = P(x', p) \cup P(x', q)$.
Further mutation from $x'$ led to the taxon x such that
$O(x) = M(x) - M(x')$. Thus $x'$ denotes the presumed first
hybrid offspring and x denotes the descendent of $x'$,
slightly mutated, which first left sufficient record to be
detectable in the network.
3. Reconstruction of the genome given the network
In this section we show that, given a normal phylogenetic
network $N = (V, A, r, X)$, given the mutated sets $M(x)$
for all $x \in X$, and assuming the Simple Homoplasy Model,
it is possible to reconstruct all the sets $O(v)$ and $P(u, v)$.
Effectively, the genomes at all vertices are uniquely
determined and can be found in polynomial time.
Theorem 3.1. Assume $N = (V, A, r, X)$ is normal and the
evolution satisfies the Simple Homoplasy Model. Assume the
network N is known and that for each $x \in X$, $M(x)$ is known.
Then for all $v \in V$, $M(v)$ is determined and $O(v)$ is
determined. For all hybrid vertices v with parents $p_i$,
$i = 1, \ldots, k$, for each i, $P(v, p_i)$ is determined.
Throughout Section 3, we shall make the assumptions in
Theorem 3.1. The proof of Theorem 3.1 will first show
that, for all $v \in V$, $M(v)$ is determined. We can then obtain
all sets $O(v)$ and $P(v, p)$ from all the sets $M(v)$. We shall
repeatedly use the immediate consequence of (SH4) that, if
$u = u_0, u_1, \ldots, u_k = a$ is a normal path, then
$M(a) = M(u) \cup O(u_1) \cup \cdots \cup O(u_k)$
whence $M(u) \subseteq M(a)$.
Lemma 3.2. Let the vertex $x \in V$ have distinct children c
and d, let $c = c_0, c_1, \ldots, c_m = x_1$ be a normal path from c to
$x_1$, and let $d = d_0, d_1, \ldots, d_n = x_2$ be a normal path from d
to $x_2$. Then $x_1 \neq x_2$.
Proof. We prove the result by induction on $n + m$. Suppose
$x_1 = x_2$.
Case (1): If $m = 0$ then $x_1 = c$. Hence $x < d \leq x_2 = x_1 = c$. This contradicts nonredundancy of the arc $(x, c)$.
Case (2): If $n = 0$ then $x_2 = d$. An argument similar to
case (1) applies.
Case (3): Otherwise $m > 0$ and $n > 0$. By normality $c_m$ has
unique parent $c_{m-1}$ and $d_n$ has unique parent $d_{n-1}$, whence
$c_{m-1} = d_{n-1}$. But $(n - 1) + (m - 1) < n + m$. By the inductive hypothesis it follows $c_{m-1} \neq d_{n-1}$, a contradiction. $\square$
Lemma 3.3. Suppose v has distinct children a and b and
there is a normal path from a to $x_1$ and a normal path from b
to $x_2$. Assume a is normal. Then $v = \mathrm{mrca}(x_1, x_2)$.
Proof. Let the normal paths be $a = a_0, a_1, a_2, \ldots, a_m = x_1$
and $b = b_0, b_1, b_2, \ldots, b_n = x_2$. Suppose $u \leq x_1$ and $u \leq x_2$.
We show $u \leq v$.
Since $u \leq x_1 = a_m$, either $u = x_1$ or $u < x_1$, in which case
by normality $u \leq a_{m-1}$ since $a_{m-1}$ is the unique parent of $a_m$.
By a similar argument, if $u \leq a_{m-1}$ then either $u = a_{m-1}$ or
$u \leq a_{m-2}$. In this manner we see that either $u = a_i$ for some
i, $1 \leq i \leq m$, or $u \leq a_0 = a$. Similarly we see that either $u = b_i$
for some i, $1 \leq i \leq n$, or $u \leq b_0 = b$. By Lemma 3.2 we cannot
have simultaneously $u = a_i$ and $u = b_j$ for $i > 0$ and $j > 0$.
Hence the possibilities are
(i) $u \leq a$, $u \leq b$,
(ii) $u \leq a$, $u = b_j$ for some j,
(iii) $u = a_i$ for some i, $u \leq b$.
In case (ii), $v < b \leq b_j = u \leq a$, contradicting the nonredundancy of the arc $(v, a)$. In case (iii), $v < a \leq a_i = u \leq b$,
contradicting the nonredundancy of the arc $(v, b)$. Hence
(i) must apply. But then since a is normal and $u \leq a$, we
have either $u = a$ or $u \leq v$. If $u = a$, then $v < a = u \leq b$
contradicts the nonredundancy of arc $(v, b)$. Hence $u \leq v$.
This shows $v = \mathrm{mrca}(x_1, x_2)$. $\square$
Corollary 3.4. Suppose v has distinct children a and b and
there is a normal path from a to $x_1$ and a normal path from b
to $x_2$. Assume a is normal. Then $M(x_1) \cap M(x_2) \subseteq M(v)$.
Proof. Let the normal paths be $a = a_0, a_1, a_2, \ldots, a_m = x_1$
and $b = b_0, b_1, b_2, \ldots, b_n = x_2$. Then by (SH4),
$M(x_1) = M(v) \cup O(a) \cup O(a_1) \cup \cdots \cup O(a_m)$ and
$M(x_2) = M(b) \cup O(b_1) \cup O(b_2) \cup \cdots \cup O(b_n)$.
Suppose $i \in M(x_1) \cap M(x_2)$. There exists $u \in V$ such that
$i \in O(u)$. Then $u \leq x_1$ since $i \in M(x_1)$, and similarly $u \leq x_2$.
By Lemma 3.3, $u \leq \mathrm{mrca}(x_1, x_2) = v$. Hence u cannot equal
a, $a_1, \ldots, a_m$, $b_1, \ldots,$ or $b_n$. Hence
$M(x_1) \cap M(x_2) = M(v) \cap M(b) \subseteq M(v)$. $\square$
Lemma 3.5. Suppose v has two normal children a and b.
Choose a normal path from a to $x \in X$, and a normal path
from b to $y \in X$. Then $M(v) = M(x) \cap M(y)$.
Proof. Since a is normal and the path from a to x
is normal, it follows that $M(v) \subseteq M(x)$. Similarly
$M(v) \subseteq M(y)$, whence $M(v) \subseteq M(x) \cap M(y)$. Conversely,
by Corollary 3.4, $M(x) \cap M(y) \subseteq M(v)$. $\square$
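A small numeric illustration of Lemma 3.5 may help. The Python sketch below builds the genomes at the ends of two normal paths by repeated use of (SH4) and then recovers $M(v)$ as their intersection. The helper name and all character labels are hypothetical.

```python
def propagate_along_normal_path(M_start, originating_sets):
    """Apply (SH4) repeatedly: each normal vertex on the path adds its own
    originating mutations to the genome inherited from its parent."""
    genome = set(M_start)
    for originating in originating_sets:
        genome |= set(originating)
    return genome

# Hypothetical data: v has mutated genome {1, 2} and two normal children.
M_v = {1, 2}
# Path from child a (O(a) = {3}) to x (O(x) = {4}).
M_x = propagate_along_normal_path(M_v, [{3}, {4}])
# Child b is itself the base-set member y, with O(y) = {5}.
M_y = propagate_along_normal_path(M_v, [{5}])

recovered_M_v = M_x & M_y
```

Because the originating sets are pairwise disjoint by (SH2), the mutations added along the two paths never coincide, so the intersection leaves exactly the genome at v.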
Lemma 3.6. Suppose $v \in V$ is normal with parent p. Suppose
v has normal child c and hybrid child z. Choose a normal path
from c to $a \in X$, and a normal path from z to $w \in X$. Then
$M(v) = M(p) \cup (M(a) \cap M(w))$.
Proof. By (SH4), $M(p) \subseteq M(v)$. By Corollary 3.4,
$M(a) \cap M(w) \subseteq M(v)$. Hence $M(p) \cup (M(a) \cap M(w)) \subseteq M(v)$.
Conversely, $O(v) \subseteq M(a)$ since there is a normal path
from v to a. $O(v) \subseteq M(z)$ by (SH5c) since there are
no immediate reversions. $M(z) \subseteq M(w)$ since there is a
normal path from z to w. Hence $O(v) \subseteq M(w)$ and
$O(v) \subseteq M(a) \cap M(w)$. Thus
$M(v) = M(p) \cup O(v) \subseteq M(p) \cup (M(a) \cap M(w))$. $\square$
Lemma 3.7. Suppose v is hybrid with parents p and q.
Suppose v has normal child c and hybrid child z. Choose a
normal path from c to $a \in X$ and a normal path from z to
$w \in X$. Then
$M(v) = (M(a) \cap M(p)) \cup (M(a) \cap M(q)) \cup (M(a) \cap M(w))$.
Proof. See Fig. 2. Let the normal path from c to a be
$c = c_0, c_1, \ldots, c_n = a$. Let the normal path from z to w
be $z = z_0, z_1, \ldots, z_m = w$. Then
$M(a) = M(v) \cup O(c_0) \cup O(c_1) \cup \cdots \cup O(c_n)$ and
$M(w) = M(z) \cup O(z_1) \cup O(z_2) \cup \cdots \cup O(z_m)$.
Note that $(M(a) \cap M(p)) \subseteq M(v)$. (Otherwise if
$i \in M(a) \cap M(p)$, then $i \in M(a)$ so there exists j such that
$i \in O(c_j)$, and $c_j \leq p$ since $i \in M(p)$, leading to a cycle.)
Similarly $(M(a) \cap M(q)) \subseteq M(v)$. Note that
$(M(a) \cap M(w)) \subseteq M(v)$ by Corollary 3.4. Thus
$(M(a) \cap M(p)) \cup (M(a) \cap M(q)) \cup (M(a) \cap M(w)) \subseteq M(v)$.
Conversely, $M(v) = O(v) \cup P(v, p) \cup P(v, q)$ by (SH5a). If
$i \in O(v)$, then $i \in M(z)$ since there are no immediate
reversions, so $i \in M(a) \cap M(w)$. If $i \in P(v, p)$ then
$i \in M(a) \cap M(p)$. If $i \in P(v, q)$ then $i \in M(a) \cap M(q)$.
Hence
$M(v) \subseteq (M(a) \cap M(p)) \cup (M(a) \cap M(q)) \cup (M(a) \cap M(w))$. $\square$
Lemma 3.8. Suppose v is hybrid with parents $p_1, p_2, \ldots, p_k$.
Suppose v has normal child c and other children $z_1, \ldots, z_m$.
Choose a normal path from c to $a \in X$, and normal paths
from $z_i$ to $w_i \in X$. Then
$M(v) = \bigcup \{M(a) \cap M(p_i) : 1 \leq i \leq k\} \cup \bigcup \{M(a) \cap M(w_i) : 1 \leq i \leq m\}$.
Proof. The argument is an obvious generalization of the
argument for Lemma 3.7, and is omitted. $\square$
We can now complete the proof of Theorem 3.1. We first
show that for all $v \in V$, $M(v)$ is uniquely determined.
Assume by induction that $M(u)$ is known when $u < v$.
The case $M(r) = \emptyset$ serves as a base for the induction.
By normality, one of the following cases A, B, C, and D
occurs.
Case A: Suppose $v \in X$. Then $M(v)$ is given.
Case B: Suppose v has outdegree 0 or 1. Then $v \in X$ and
$M(v)$ is given.
Case C: Suppose v has two normal children $c_1$ and $c_2$.
Then Lemma 3.5 shows $M(v) = M(x) \cap M(y)$ for some
specified members x and y of X.
Case D: Suppose $v \notin X$ and v does not have two normal
children. Then v has a normal child c and at least one
hybrid child z.
Subcase D1: If v is normal, then Lemma 3.6 shows
$M(v) = M(p) \cup (M(a) \cap M(w))$
for specified members a and w of X, and with $p < v$
whence $M(p)$ is known by induction.
Subcase D2: If v is hybrid with exactly two parents p and
q, then Lemma 3.7 finds a and w in X such that
$M(v) = (M(a) \cap M(p)) \cup (M(a) \cap M(q)) \cup (M(a) \cap M(w))$.
Since $p < v$ and $q < v$, $M(p)$ and $M(q)$ are known by
induction. Hence $M(v)$ is determined.
Subcase D3: If v is hybrid with an arbitrary number of
parents and hybrid children, then Lemma 3.8 determines
$M(v)$.
Thus for all $v \in V$, $M(v)$ is uniquely determined. It
remains to show that $O(v)$ and $P(u, p)$ are determined. This
follows from three cases:
Case 1: Suppose v is normal with parent p. Then
$O(v) = M(v) - M(p)$ by (SH4).
Case 2: Suppose v is hybrid with parents $p_1, p_2, \ldots, p_k$.
Then
$O(v) = M(v) - (M(p_1) \cup \cdots \cup M(p_k))$
by (SH5).
Case 3: Suppose p is a parent of the hybrid vertex u.
Then $P(u, p) = M(u) \cap M(p)$ by (SH5b).
This completes the proof of Theorem 3.1.
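The last step of the proof (Cases 1 to 3) is simple set arithmetic once every $M(v)$ is known. The following Python sketch illustrates it; the function name, the dictionary layout, and the example network are assumptions made for illustration and are not part of the paper.

```python
def originating_and_parental(vertices, arcs, M):
    """Recover O(v) for every vertex and P(v, p) for every hybrid arc
    from the mutated genomes M(v), following Cases 1-3."""
    parents = {v: [] for v in vertices}
    for parent, child in arcs:
        parents[child].append(parent)
    O, P = {}, {}
    for v in vertices:
        ps = parents[v]
        # Cases 1 and 2: remove everything present in some parent.
        # For the root there are no parents and M(r) is empty, so O(r) is empty.
        inherited = set().union(*(M[p] for p in ps))
        O[v] = M[v] - inherited
        # Case 3: parental contributions exist only at hybrid vertices.
        if len(ps) >= 2:
            for p in ps:
                P[(v, p)] = M[v] & M[p]
    return O, P

# Hypothetical network: h is hybrid with parents a and b.
V = ["r", "a", "b", "h", "x", "y"]
A = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"), ("a", "x"), ("b", "y")]
M = {"r": set(), "a": {1}, "b": {2}, "h": {1, 2, 5}, "x": {1, 3}, "y": {2, 4}}
O, P = originating_and_parental(V, A, M)
```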
Fig. 2. The situation in Lemma 3.7.
4. Reconstruction of the network
In this section we make additional assumptions on both
the network and the evolution model. Under these
additional assumptions, given $M(x)$ for all $x \in X$, we can
reconstruct the phylogenetic network N as well.
Here is a list of the assumptions:
(A1) $N = (V, A, r, X)$ is a normal phylogenetic network.
(A2) The evolution satisfies the Simple Homoplasy Model.
(A3) Every hybrid vertex has exactly two parents.
(A4) For all $v \in V$, $v \neq r$, if v is normal then $O(v)$ is
nonempty.
(A5) If p has a hybrid child c, then p is normal and every
child of p other than c is normal.
(A6) If h is hybrid, p is a parent of h (hence normal by
(A5)), and $p'$ is a parent of p, then either
(A6a) $p'$ has no child other than p, whence $p' \in X$, or
else
(A6b) $p'$ has a normal child b such that $b \neq p$, and
either
(A6b1) b is a leaf, or
(A6b2) b has two normal children, or else
(A6b3) $b \in X$ and b has a normal child.
(A7) Suppose h is hybrid with parents p and q. Then
$P(h, p) \neq M(p)$ and $P(h, q) \neq M(q)$.
(A8) N has no pseudocycles.
By (A2) and (SH4) there are no homoplasies at normal
vertices, and by (SH5) there are no immediate reversions.
Note (A7) says that at a hybrid vertex, reversions occur
from each parent.
Assumptions (A5)–(A7) appear very technical, yet similar
assumptions are needed to ensure unique reconstruction.
Fig. 3 shows two normal networks A and B with the same
base-set X that satisfy all assumptions except (A6). For
example, in A a grandparent of the hybrid vertex 6 has the
hybrid child 4. It can be shown by exhaustive check that
under the Simple Homoplasy Model the possible characters
(observed only on members of X) are exactly the same in
both A and B. Given all these characters on members of X
and given either A or B, the characters at all remaining
vertices can be recovered by Theorem 3.1. Unique reconstruction of the network itself, however, will not be possible
since both A and B will be solutions. Hence Theorem 4.1
cannot be true without an assumption similar to (A6).
Similarly Fig. 4 exhibits two normal networks C and D
satisfying all assumptions except (A5). Any character in
either network (as observed in members of X) under the
Simple Homoplasy Model is a character in the other.
Hence Theorem 4.1 cannot be true without an assumption
similar to (A5).
Fig. 3. Two normal networks A and B with base-set $X = \{1, 2, 3, 4, 5, 6, 7, 8\}$. They fail (A6).
Fig. 4. Two normal networks C and D with base-set $X = \{1, 2, 3, 4, 5, 6, 7\}$. They fail (A5).
Fig. 5. Two normal networks E and F with base-set $X = \{1, 2, 3, 4, 5, 6, 7\}$. Each vertex v also shows $M(v)$. Both networks have the same genomes and
satisfy the Simple Homoplasy Model, but they fail (A7).
Finally, Fig. 5 exhibits two networks E and F and for
each vertex v the mutated set $M(v)$. Both networks satisfy
all axioms except (A7). For example, in E, $P(4, 3) = M(3) = \{a, b, d\}$.
In E, $e \in O(5)$, while in F, $e \in O(4)$. The
networks differ in which vertex is hybrid. It follows that
Theorem 4.1 cannot be true without an assumption
like (A7).
Theorem 4.1. Let $N = (V, A, r, X)$ be a phylogenetic network that satisfies (A1)–(A8) above. Assume for all $x \in X$,
$M(x)$ is given. Then N can be reconstructed uniquely by an
explicit procedure. If $|X| = n$ and $|C| = m$ then the
reconstruction can be done in time $O(n^6 m)$.
From Theorem 3.1 it then follows that for all $v \in V$,
$M(v)$ can be reconstructed uniquely.
The proof of Theorem 4.1 requires many lemmas to handle many
special cases. We assume (A1)–(A8) throughout this
section. The basic tool is the following:
Assume $|X| \geq 3$. Suppose for every $x \in X$, we know
$M(x)$. Define the stem function d by, whenever a, b, and x
are distinct members of X,
$d(x, a, b) = M(x) - [(M(a) \cap M(x)) \cup (M(b) \cap M(x))]$.
If $x \in X$, define
$d(x) = \bigcap [d(x, a, b) : x, a, b \text{ distinct members of } X]$.
Note $d(x, a, b) \subseteq M(x)$. Trivially $d(r) = \emptyset$.
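The stem function can be computed directly from the measured sets $M(x)$. The sketch below is one straightforward reading of the definition; the function names `stem3` and `stem` and the four-member example are hypothetical. Since $d(x, a, b)$ is symmetric in a and b, unordered pairs suffice.

```python
from itertools import combinations

def stem3(M, x, a, b):
    """d(x, a, b): mutations of x shared with neither a nor b."""
    return M[x] - ((M[a] & M[x]) | (M[b] & M[x]))

def stem(M, X, x):
    """d(x): intersection of d(x, a, b) over all pairs a, b in X distinct from x."""
    others = [y for y in X if y != x]
    if len(others) < 2:
        raise ValueError("the stem function needs at least three members in X")
    result = None
    for a, b in combinations(others, 2):
        value = stem3(M, x, a, b)
        result = value if result is None else result & value
    return result

# Hypothetical tree: r -> a -> {x, y}, r -> z, with
# O(a) = {1}, O(x) = {3}, O(y) = {4}, O(z) = {2}.
X = ["r", "x", "y", "z"]
M = {"r": set(), "x": {1, 3}, "y": {1, 4}, "z": {2}}
```

In this small tree every leaf x gives `stem(M, X, x)` equal to its originating set, and the root gives the empty set.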
The first use of d will be to identify which members of X
are leaves and which are internal vertices. Lemmas 4.3–4.11
deal with different cases that show
(1) if $x \in X$ is a leaf, then $d(x) = O(x)$, and
(2) if $x \in X$ is not a leaf, then $d(x) = \emptyset$.
If x is a normal leaf with parent p, then by (SH4),
$M(p) = M(x) - O(x)$ whence $M(p) = M(x) - d(x)$ by
(1). Thus the genome of the parent p can be reconstructed,
and $d(x) = O(x)$ identifies the mutations that occurred on
the ‘‘stem’’ of x, i.e., the arc leading to x. (This is the reason
for calling d the ‘‘stem’’ function.) This fact will allow us
recursively to remove normal leaves and replace them by
their parents, simplifying the network. If x is a hybrid leaf,
then the separated vertex $x'$ satisfies $M(x') = M(x) - d(x)$,
so by a similar process a hybrid leaf may be replaced by its
separated vertex.
In this manner the network will be simplified recursively,
with X changing in the process, until for each $x \in X$ we
have $d(x) = \emptyset$. Lemma 4.13 shows that the network will
now have no normal leaves. Hence every leaf is hybrid. In
this situation, Lemma 4.14 gives a criterion to identify a
member x of X which is a hybrid leaf and to identify its
parents p and q. Lemma 4.15 proves that there exists $x \in X$
that satisfies the criterion in Lemma 4.14. Hence we can
identify a hybrid leaf x and its parents p and q. We then
simplify the network by removing x from X (since its
parents are known), and continue recursively.
The first step is to verify that two distinct vertices cannot
have the same mutated genome. This result will be needed
in order to identify together two vertices with the same
genome constructed by different procedures.
Lemma 4.2. Suppose u and v are vertices such that
MðuÞ ¼ MðvÞ. Then u ¼ v.
Proof. There are three cases to consider:
Case (1): Assume that both u and v are normal.
By (A4) we may choose a 2 OðuÞ and b 2 OðvÞ. Since
MðuÞ ¼ MðvÞ it follows fa; bg MðuÞ ¼ MðvÞ. Since a 2
MðvÞ it follows upv. Since b 2 MðuÞ it follows vpu. Hence
u ¼ v.
Case (2): Suppose one of the vertices, say u, is normal
and the other vertex v is hybrid.
Let p and q denote the parents of v. By (A5) they are
normal, and by (A4) we may choose a 2 OðuÞ, b 2 OðpÞ,
and c 2 OðqÞ. Since there are no immediate reversions, it
follows fb; cg MðvÞ, whence fa; b; cg MðuÞ ¼ MðvÞ.
Note uav since only u is normal. Since a 2 MðvÞ it follows
upv. Since v has only parents p and q by (A3), it follows
either upp or upq. Without loss of generality, assume
upp. Since b 2 MðuÞ, it follows ppu. Hence p ¼ u. Since
c 2 MðuÞ, we see qpu ¼ p. But then qopov shows that
the arc ðq; vÞ was redundant, contradicting (A1).
Author's personal copy
ARTICLE IN PRESS
S.J. Willson / Journal of Theoretical Biology 252 (2008) 338–349
Case (3): Suppose that both $u$ and $v$ are hybrid but $u \ne v$.
Let $p_u$ and $q_u$ be the parents of $u$, and $p_v$ and $q_v$ be
the parents of $v$. By (A5) these are normal, and by (A4) we
may choose $a \in O(p_u)$, $b \in O(q_u)$, $c \in O(p_v)$, and $d \in O(q_v)$.
Since there are no immediate reversions, it follows $\{a, b\} \subseteq M(u)$
and $\{c, d\} \subseteq M(v)$, whence $\{a, b, c, d\} \subseteq M(u) = M(v)$.
Since $a \in M(v)$ it follows $p_u \le v$, whence either $p_u \le p_v$ or
$p_u \le q_v$. In like manner we see that either $p_v \le p_u$ or $p_v \le q_u$.
Similarly either $q_u \le p_v$ or $q_u \le q_v$; and either $q_v \le p_u$ or
$q_v \le q_u$.
Without loss of generality assume $p_u \le p_v$. If $p_v \le p_u$, then
$p_u = p_v$. By (A5) every child of $p_u$ other than $u$ is normal;
since $v$ is hybrid it follows $u = v$, a contradiction. Hence
$p_v \le q_u$. If $q_u \le p_v$, then $p_v = q_u$, whence again by (A5) $u = v$,
a contradiction. Hence $q_u \le q_v$. If $q_v \le q_u$, then $q_u = q_v$,
whence $u = v$, a contradiction. Hence $q_v \le p_u$. It follows
that $p_u \le p_v \le q_u \le q_v \le p_u$. Hence all four points are equal, a
contradiction. □
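Lemma 4.2 is what allows a genome to serve as a key for its vertex during the reconstruction. A minimal Python sketch of that bookkeeping follows. The function name `vertex_for_genome`, the label strings, and the encoding of a genome as a set of integer character indices are assumptions made for illustration; they do not come from the paper.

```python
def vertex_for_genome(by_genome, genome, fresh_label):
    """Return (label, is_new) for the vertex whose mutated genome is `genome`.

    by_genome maps frozenset genomes to vertex labels. By Lemma 4.2 two
    distinct vertices never share a genome, so an existing entry with the
    same genome must be the same vertex.
    """
    key = frozenset(genome)
    if key in by_genome:
        return by_genome[key], False
    by_genome[key] = fresh_label  # hypothetical label for a newly found vertex
    return fresh_label, True


known = {frozenset(): "r", frozenset({1}): "u"}
print(vertex_for_genome(known, {1}, "p"))     # reuses the existing vertex "u"
print(vertex_for_genome(known, {1, 2}, "p"))  # registers a new vertex "p"
```

Using a frozenset as the dictionary key makes the comparison independent of the order in which characters are listed.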
Fig. 6. The situation in Lemma 4.10 or in Lemma 4.6 if $p$ and $x$ are
merged into a single vertex $x$.
Lemma 4.3. Suppose $x \in X$ is a leaf. Then for all $a$ and $b$
such that $a$, $b$, and $x$ are distinct, $O(x) \subseteq \delta(x, a, b)$, whence
$O(x) \subseteq \delta(x)$.
Proof. Since $x$ is a leaf and $a$ and $b$ are distinct from $x$,
it is false that $x \le a$ and it is false that $x \le b$. It follows
that $O(x) \cap M(a) = \emptyset$ and $O(x) \cap M(b) = \emptyset$. Since
$O(x) \subseteq M(x)$, the result follows. □
Lemma 4.4. Suppose $x \in X$ and $x$ has a normal child $c$.
Then $\delta(x) = \emptyset$.
Proof. If $x = r$, the result is immediate. Assume $x \ne r$.
Choose a normal path from $c$ to $y \in X$; say the path is
$c = c_0, c_1, \ldots, c_k = y$. I claim $\delta(x, r, y) = \emptyset$. Since $c$ is
normal, by (SH4), $M(y) = M(x) \cup O(c) \cup O(c_1) \cup \cdots \cup O(y)$.
Hence $M(x) \cap M(y) = M(x)$, whence $\delta(x, r, y) =
M(x) - M(x) = \emptyset$. Hence $\delta(x) = \emptyset$. □
Lemma 4.5. Suppose $x \in X$ and $x$ has a hybrid child $c$.
Suppose $x$ is normal with parent $p'$ and $p' \in X$. Then
$\delta(x) = \emptyset$.
Proof. Suppose $c$ is hybrid with parents $x$ and $q$. Choose a
normal path from $c$ to $y \in X$; say the path is $c = c_0,
c_1, \ldots, c_k = y$. Then $M(y) = M(c) \cup O(c_1) \cup \cdots \cup O(y)$.
Hence $M(x) \cap M(y) = M(x) \cap M(c) = P(c, x)$. Since $x$ is
normal, $M(x) = M(p') \cup O(x)$, whence $M(x) \cap M(p') =
M(p')$. Thus
$\delta(x, y, p') = M(x) - [P(c, x) \cup M(p')]
= M(x) - M(x) = \emptyset$
since $O(x) \subseteq P(c, x)$ and $M(x) = M(p') \cup O(x)$ by (SH4). It
follows that $\delta(x) = \emptyset$. □
Lemma 4.6. Suppose $x \in X$ and $x$ has a hybrid child $c$.
Suppose $x$ is normal with parent $p'$ and $p'$ has a normal child
$b$ distinct from $x$. Then $\delta(x) = \emptyset$.
Proof. Suppose $c$ is hybrid with parents $x$ and $q$. Choose a
normal path from $c$ to $y \in X$; say the path is $c = c_0,
c_1, \ldots, c_k = y$. Choose a normal path from $b$ to $z \in X$, say
$b = b_0, b_1, \ldots, b_j = z$. The situation is then as in Fig. 6
modified so that $x$ and $p$ are merged into a single vertex $x$. I
claim $\delta(x, y, z) = \emptyset$.
By (SH4), since $b$ is normal, $M(z) = M(p') \cup O(b) \cup
O(b_1) \cup \cdots \cup O(z)$. By (SH4), $M(y) = M(c) \cup O(c_1) \cup \cdots \cup
O(y)$ and $M(x) = M(p') \cup O(x)$. If $x = b_u$ for some $u \ge 0$,
then the arc $(p', x)$ would be redundant. Hence
$M(z) \cap M(x) = M(p')$. If $x = c_u$ for some $u > 0$, then there
would be a directed cycle at $x$. Hence $M(y) \cap M(x) =
M(x) \cap M(c) = P(c, x)$. Then $\delta(x, y, z) = M(x) - (M(p') \cup
P(c, x)) = \emptyset$ because $P(c, x)$ contains $O(x)$ by (SH5c). □
Lemma 4.7. Suppose $x$ is a normal leaf with parent $p$ and $p$
has a normal child $c$ distinct from $x$. Then $\delta(x) = O(x)$.
Proof. Since $N$ is normal we may choose a normal path
from $c$ to $y \in X$ given by $c = c_0, c_1, c_2, \ldots, c_k = y$. I claim
that $\delta(x, r, y) = O(x)$. To see this, observe that by (SH4),
$M(x) = M(p) \cup O(x)$ and
$M(y) = M(p) \cup O(c) \cup O(c_1) \cup \cdots \cup O(c_{k-1}) \cup O(y)$.
Note $x$ is distinct from $p$ and all $c_i$. (If $x = c_i$ for some $i > 0$
then the path $p, c, \ldots, c_i = x$ makes the arc $(p, x)$
redundant. If $x = c$ then this contradicts that $c$ is distinct
from $x$.) Hence $M(x) \cap M(y) = M(p)$. It follows that
$\delta(x, r, y) = M(x) - M(p) = O(x)$. Hence $\delta(x) \subseteq O(x)$. By
Lemma 4.3, $\delta(x) = O(x)$. □
Lemma 4.8. Assume $|X| \ge 3$. Suppose $x$ is a normal leaf with
parent $p$ and $p \in X$. Then $\delta(x) = O(x)$.
Proof. If $p = r$, then since $|X| \ge 3$, $r$ must have another
child $c$ which is necessarily normal. Lemma 4.7 then shows
that $\delta(x) = O(x)$.
If $p \ne r$, then $\delta(x, r, p) = O(x)$. To see this, observe that
since $M(x) = M(p) \cup O(x)$ and $O(x)$ is disjoint from $M(p)$
we have $M(x) \cap M(p) = M(p)$. Since $M(r) = \emptyset$ we have
$\delta(x, r, p) = M(x) - M(x) \cap M(p) = M(x) - M(p) = O(x)$.
Hence $\delta(x) \subseteq O(x)$. By Lemma 4.3, $\delta(x) = O(x)$. □
Lemma 4.9. Suppose $x$ is a normal leaf with parent $p$.
Suppose $p$ has a hybrid child $c$ whose other parent is $q$.
Suppose $p$ is normal with parent $p'$, and $p' \in X$. Then
$\delta(x) = O(x)$.
Proof. By normality we may choose a normal path
from $c$ to $y \in X$, say $c = c_0, c_1, \ldots, c_k = y$. We show
$\delta(x, y, p') = O(x)$.
By (SH4), $M(y) = M(c) \cup O(c_1) \cup \cdots \cup O(y)$, and
$M(x) = M(p') \cup O(p) \cup O(x) = M(p) \cup O(x)$.
Hence $M(x) \cap M(p') = M(p')$, while $M(x) \cap M(y) =
M(x) \cap M(c)$ (since for $i > 0$, $O(c_i) \cap M(x) = \emptyset$ because
otherwise we have $c_i \le x$ and the directed path from $p$
to $c_i$ through $c$ and then from $c_i$ to $x$ contradicts that
the arc $(p, x)$ is nonredundant). But $M(c) = O(c) \cup
P(c, p) \cup P(c, q)$. Hence
$M(x) \cap M(y) = [M(p) \cup O(x)] \cap M(c)
= M(p) \cap M(c) = P(c, p)$.
It follows that
$\delta(x, y, p') = M(x) - (P(c, p) \cup M(p'))$.
Since there are no immediate reversions, $O(p) \subseteq P(c, p) \subseteq M(p)$,
and $p$ is normal so $M(p) = M(p') \cup O(p)$
whence $P(c, p) \cup M(p') = M(p)$. Hence $\delta(x, y, p') = M(x) -
M(p) = O(x)$. It follows that $\delta(x) \subseteq O(x)$. By Lemma 4.3,
$\delta(x) = O(x)$. □
Lemma 4.10. Suppose $x$ is a normal leaf with parent $p$.
Suppose $p$ has a hybrid child $c$ whose other parent is $q$.
Suppose $p$ is normal with parent $p'$, and $p'$ has a normal child
$b$ distinct from $p$. Then $\delta(x) = O(x)$.
Proof. By normality we may choose a normal path from $c$
to $y \in X$, say $c = c_0, c_1, \ldots, c_k = y$. We may also choose a
normal path from $b$ to $z \in X$, say $b = b_0, b_1, \ldots, b_j = z$. See
Fig. 6. We show $\delta(x, y, z) = O(x)$.
We have $M(x) = M(p) \cup O(x)$, $M(y) = M(c) \cup O(c_1) \cup
\cdots \cup O(y)$, and $M(z) = M(p') \cup O(b) \cup O(b_1) \cup \cdots \cup O(z)$.
Note $x$ is distinct from all $b_i$ for $i < j$ since $x$ is a leaf.
Also, $x \ne z$ since otherwise if $j > 0$ then $p = b_{j-1}$ and the
arc $(p', p)$ would be redundant. Moreover, $x \ne b$ since
otherwise by normality $p = p'$; and $x \ne p'$ since $x$ is a
leaf. Similarly $x$ is distinct from all $c_i$ for $i < k$ since $x$ is a
leaf. Moreover, $x \ne y$ if $k > 0$ since otherwise $p = c_{k-1}$
and there is a directed cycle at $p$; and $x \ne c$ if $k = 0$ because
$x$ and $c$ are distinct leaves. Because the originating
sets are pairwise disjoint it follows that $M(x) \cap M(y) =
M(p) \cap M(c) = P(c, p)$ and $M(x) \cap M(z) = M(p')$ by
Lemma 3.5.
Note
$M(p) = M(p') \cup O(p) = M(p') \cup P(c, p)$
since $O(p) \subseteq P(c, p)$. Hence
$\delta(x, y, z) = M(x) - [M(p') \cup P(c, p)] = M(x) - M(p) = O(x)$.
Hence $\delta(x) \subseteq O(x)$. By Lemma 4.3, $\delta(x) = O(x)$. □
Lemma 4.11. Suppose $x \in X$ is a hybrid leaf with parents $p$
and $q$. Then $\delta(x) = O(x)$.
Proof. By normality, choose a normal path from $p$ to
$y \in X$, say $p = p_0, p_1, \ldots, p_m = y$; and choose a normal
path from $q$ to $z \in X$, say $q = q_0, q_1, \ldots, q_n = z$.
I claim $\delta(x, y, z) = O(x)$.
Note $M(x) = O(x) \cup P(x, p) \cup P(x, q)$, $M(y) = M(p) \cup
O(p_1) \cup \cdots \cup O(y)$, and $M(z) = M(q) \cup O(q_1) \cup \cdots \cup O(z)$.
For $i > 0$, $M(x) \cap O(p_i) = \emptyset$ since otherwise $p_i \le x$ leading
to a directed path from $p$ to $x$ contradicting nonredundancy
of $(p, x)$. Hence $M(x) \cap M(y) = M(x) \cap M(p) =
P(x, p)$ by (SH5b).
Similarly $M(x) \cap M(z) = M(x) \cap M(q) = P(x, q)$. It follows that
$\delta(x, y, z) = M(x) - (P(x, p) \cup P(x, q)) = O(x)$.
Hence $\delta(x) \subseteq O(x)$. By Lemma 4.3, $\delta(x) = O(x)$. □
Lemma 4.12. Suppose, for all $x \in X$, we have $\delta(x) = \emptyset$.
(1) Suppose $x \in X$ is a hybrid leaf with parents $p$ and $q$, such
that $p$ and $q$ are in $X$. Then $\delta(x, p, q) = \emptyset$.
(2) Suppose in addition that $x$ is the only child of $p$ and the
only child of $q$. Then whenever $a$ and $b$ in $X$ satisfy that
$\delta(x, a, b) = \emptyset$, it follows that either ($a = p$ and $b = q$) or
else ($a = q$ and $b = p$).
Proof. Since $\delta(x) = \emptyset$, it follows from Lemma 4.11
that $O(x) = \emptyset$. Hence $M(x) = O(x) \cup P(x, p) \cup P(x, q) =
P(x, p) \cup P(x, q)$.
From the proof of Lemma 4.11,
$\delta(x) = \emptyset = M(x) - [(M(x) \cap M(p)) \cup (M(x) \cap M(q))]$.
Hence $M(x) = [M(x) \cap M(p)] \cup [M(x) \cap M(q)]$. This
proves (1).
For (2), suppose $a$ and $b$ in $X$ are as described, so
$\delta(x, a, b) = \emptyset$. Then
$M(x) = [M(x) \cap M(a)] \cup [M(x) \cap M(b)]$.
Since $O(p) \ne \emptyset$ by (A4), we may choose $i \in O(p)$, whence
$i \in P(x, p)$, whence $i \in M(x)$, whence either $i \in M(a)$ or
$i \in M(b)$. In particular, either $p \le a$ or $p \le b$. Assume
without loss of generality that $p \le a$. Since $x$ is the only
child of $p$, and we cannot have $x \le a$ since $x$ is a leaf, it
follows $p = a$. Similarly we see that either $q = a$ or $q = b$, but
since $a$ and $b$ are distinct it follows $q = b$. □
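The set arithmetic behind Lemmas 4.11 and 4.12 can be checked on a small example. The sketch below assumes the definitions suggested by the proofs above: $\delta(x, a, b) = M(x) - [(M(x) \cap M(a)) \cup (M(x) \cap M(b))]$, and $\delta(x)$ as the intersection of $\delta(x, a, b)$ over all distinct $a$, $b$ in $X$ other than $x$. The toy genomes, the vertex labels, and the function names `delta3` and `delta` are hypothetical.

```python
from itertools import combinations


def delta3(M, x, a, b):
    """delta(x, a, b): characters of M(x) found in neither M(a) nor M(b)."""
    return M[x] - ((M[x] & M[a]) | (M[x] & M[b]))


def delta(M, x):
    """Assumed definition: intersect delta(x, a, b) over all pairs a, b != x."""
    result = set(M[x])
    for a, b in combinations([v for v in M if v != x], 2):
        result &= delta3(M, x, a, b)
    return frozenset(result)


# Hypothetical base-set genomes: hybrid leaf x has parents p and q,
# with O(x) = {7}, P(x, p) = {3}, and P(x, q) = {6}.
M = {
    "r": frozenset(),
    "a": frozenset({1, 2}),
    "b": frozenset({4, 5}),
    "p": frozenset({1, 3}),
    "q": frozenset({4, 6}),
    "x": frozenset({3, 6, 7}),
}

print(delta(M, "x"))  # equals O(x), as in Lemma 4.11

# Replace x by its separated vertex, whose genome is M(x) - delta(x).
separated = dict(M, x=M["x"] - delta(M, "x"))
print(delta3(separated, "x", "p", "q"))  # empty, as in Lemma 4.12(1)
```

Only genomes of members of $X$ enter the computation, which is the point of the lemmas: the test needs no knowledge of the arcs.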
Lemma 4.13. Suppose all members $x$ of $X$ satisfy $\delta(x) = \emptyset$.
Then no member of $X$ is a normal leaf.
Proof. Suppose $x \in X$ is a normal leaf. By (A4), $O(x) \ne \emptyset$,
and by Lemma 4.7, 4.8, 4.9, or 4.10, $\delta(x) = O(x)$, a
contradiction. □
Lemma 4.14. Assume all members $y$ of $X$ satisfy $\delta(y) = \emptyset$.
Suppose there exist distinct $x$, $p$, and $q$ in $X$, all three distinct
from $r$, such that
(i) $\delta(x, p, q) = \emptyset$.
(ii) If $\delta(x, a, b) = \emptyset$ for distinct $x$, $a$, and $b$, then either
($p = a$ and $q = b$) or ($p = b$ and $q = a$).
(iii) $M(p) \not\subseteq M(x)$, and $M(q) \not\subseteq M(x)$.
Then $x$ is a hybrid leaf with parents $p$ and $q$, $x$ is the only
child of $p$, and $x$ is the only child of $q$.
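The criterion (i)–(iii) can be tested by brute force over the base-set. The following Python sketch is one way to do so under stated assumptions: genomes are sets of integer character indices, $\delta(x, a, b)$ is computed as $M(x) - [(M(x) \cap M(a)) \cup (M(x) \cap M(b))]$, and the function name `find_hybrid_triple` is hypothetical. The standing hypothesis of the lemma, that every $\delta(y)$ is empty, is left to the caller. The toy genomes correspond to an illustrative network with arcs r→s, r→t, s→p, t→q, p→x, q→x.

```python
from itertools import combinations


def delta3(M, x, a, b):
    """delta(x, a, b): characters of M(x) found in neither M(a) nor M(b)."""
    return M[x] - ((M[x] & M[a]) | (M[x] & M[b]))


def find_hybrid_triple(M, root):
    """Return (x, p, q) meeting criteria (i)-(iii) of Lemma 4.14, else None."""
    names = sorted(M)
    for x in names:
        if x == root:
            continue
        others = [v for v in names if v != x]
        # (i) and (ii): exactly one unordered pair {a, b} gives an empty set.
        empty_pairs = [(a, b) for a, b in combinations(others, 2)
                       if not delta3(M, x, a, b)]
        if len(empty_pairs) != 1:
            continue
        p, q = empty_pairs[0]
        if root in (p, q):
            continue
        # (iii): neither parent genome is contained in M(x).
        if M[p] <= M[x] or M[q] <= M[x]:
            continue
        return x, p, q
    return None


M = {
    "r": frozenset(),
    "s": frozenset({1}),
    "t": frozenset({4}),
    "p": frozenset({1, 3}),
    "q": frozenset({4, 6}),
    "x": frozenset({3, 6}),
}
print(find_hybrid_triple(M, "r"))
```

In this example the vertex p also has a unique pair, (s, x), with empty $\delta$, and it is rejected only by condition (iii), which matches the role that condition plays in Case 3b of the proof.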
Proof. Suppose $x$ is not a hybrid leaf. Since $x$ is not a
normal leaf by Lemma 4.13, we see that $x$ has a child $w$.
For any normal child $u$ of $x$, by choosing a normal path
from $u$ to $y \in X$ we would have $\delta(x, r, y) = \emptyset$ by Lemma
4.4. But then by (ii) either $p$ or $q$ equals $r$, contradicting that
$p$ and $q$ were distinct from $r$. Hence every child of $x$ is
hybrid. In particular $w$ is hybrid. Since $x$ is parent to a
hybrid vertex, $x$ is normal by (A5), and we may let $s$ be its
unique parent.
Case 1: Suppose $w$ has (at least) two children $u$ and $v$.
Both are normal by (A5). Choose a normal path from $u$ to
$y \in X$ and a normal path from $v$ to $z \in X$.
Case 1a: Suppose $s \in X$. Then $\delta(x, s, y) = \emptyset$ and
$\delta(x, s, z) = \emptyset$ by the proof of Lemma 4.5. But $y \ne z$ by
normality, so this contradicts (ii).
Case 1b: Suppose $s \notin X$. Then $s$ has a normal child $b$
distinct from $x$ by (A6). Choose a normal path from $b$ to
$e \in X$. Then by the proof of Lemma 4.6 we have $\delta(x, e, y) =
\emptyset$ and $\delta(x, e, z) = \emptyset$. This contradicts (ii).
Case 2: Suppose $w$ has exactly one child $u$. Then $u$ is
normal since the parent of a hybrid child is normal and $w$ is
hybrid. Since $w$ has outdegree 1, $w \in X$. Choose a normal
path from $u$ to $y \in X$.
Case 2a: Suppose $s \in X$. Then $\delta(x, s, w) = \emptyset$ and
$\delta(x, s, y) = \emptyset$ by the proof of Lemma 4.5. This contradicts
(ii).
Case 2b: Suppose $s \notin X$. Then $s$ has a normal child $b$
distinct from $x$ by (A6). Choose a normal path from $b$ to
$e \in X$. Then by the proof of Lemma 4.6 we have
$\delta(x, e, w) = \emptyset$ and $\delta(x, e, y) = \emptyset$. This contradicts (ii).
Case 3: Suppose $w$ has no children, so $w$ is a hybrid leaf.
Then $w \in X$.
Case 3a: Suppose $s$ has a normal child $b$ other than $x$. By
Lemma 4.13, $b$ is not a normal leaf. By (A6) either
Subcase (3a1): $b$ has two normal children $a$ and $c$; or
Subcase (3a2): $b \in X$ and $b$ has a normal child $a$.
Choose normal paths from $a$ to $z \in X$, and if $c$ is present
choose a normal path from $c$ to $u \in X$. In case (3a1) we
have both $\delta(x, z, w) = \emptyset$ and $\delta(x, u, w) = \emptyset$ by the proof of
Lemma 4.6, contradicting (ii). In case (3a2) we have both
$\delta(x, z, w) = \emptyset$ and $\delta(x, b, w) = \emptyset$, contradicting (ii). Hence
Case 3a cannot occur.
Case 3b: Suppose $s$ has no child other than $x$. Then $s \in X$
and $\delta(x, s, w) = \emptyset$ by the proof of Lemma 4.5. Hence by (ii)
either ($p = s$ and $q = w$) or ($p = w$ and $q = s$). We may
assume $p = s$ and $q = w$. Then $M(x) = M(p) \cup O(x)$ by
(SH4), whence $M(p) \subseteq M(x)$, contradicting (iii).
Since all three cases are eliminated, it follows that $x$ is a
hybrid leaf. By Lemma 4.12, $p$ and $q$ are the parents of $x$. I
claim that $x$ is the only child of $p$. If not, by (A5), $p$ has a
normal child $c$ other than $x$, and there is a normal path
from $c$ to some $y \in X$. Then $\delta(x, y, q) = \emptyset$ by Lemma 4.11
as well as $\delta(x, p, q) = \emptyset$, contradicting (ii). Similarly $x$ is the
only child of $q$. □
Lemma 4.15. Assume $|X| \ge 3$ and all members $y$ of $X$ satisfy
$\delta(y) = \emptyset$. Then
(i) There exists a hybrid leaf $x$ with parents $p$ and $q$ such
that $x$, $p$, and $q$ are in $X$, $x$ is the only child of $p$, and $x$ is
the only child of $q$.
(ii) Neither $p$ nor $q$ is equal to $r$.
(iii) $\delta(x, p, q) = \emptyset$.
(iv) If $\delta(x, a, b) = \emptyset$ then either ($p = a$ and $q = b$) or ($p = b$
and $q = a$).
(v) There is no $y \in X$ such that $\delta(x, r, y) = \emptyset$.
(vi) $M(p) \not\subseteq M(x)$, and $M(q) \not\subseteq M(x)$.
Proof. To see (i), by Lemma 4.13 note that the network has
no normal leaves. Choose a directed path from $r$ with
maximal length (number of arcs) ending at $x_1$ through
parent $p_1$. Then $x_1$ is a leaf, whence $x_1$ is hybrid since there
are no normal leaves, and one parent of $x_1$ is $p_1$. I claim
that $p_1$ has no child other than $x_1$. If $p_1$ has a child $c_1$ other
than $x_1$, then by (A5) every child of $p_1$ other than the
hybrid $x_1$ is normal, so $c_1$ is normal. But $c_1$ cannot be a
normal leaf, so $c_1$ must have a child, in which case the path
from $r$ to $p_1$ to $c_1$ could be extended and the path from $r$ to
$x_1$ did not have maximal length. The claim follows, and
since $p_1$ has outdegree 1, $p_1 \in X$.
Let $q_1$ be the other parent of hybrid $x_1$. If $q_1$ has no child
other than $x_1$, then $x_1$, $p_1$, $q_1$ satisfy (i). Otherwise $q_1$ has a
child $d_1$ other than $x_1$, $d_1$ is normal and not a leaf, and we
may choose a maximal directed path starting at $d_1$ ending
at $x_2$ through its parent $p_2$. Then as above $x_2$ is a hybrid
leaf, $p_2$ has no child other than $x_2$, and $x_2$ has other parent
$q_2$. Repeat the process. It must terminate with some hybrid
leaf $x_i$ with parents $p_i$ and $q_i$ such that $x_i$ is the only child of
$p_i$ and $x_i$ is the only child of $q_i$; otherwise we generate a
pseudocycle, contradicting (A8). Note $x_i \in X$ since it is a
leaf, while $p_i$ and $q_i$ are in $X$ since they have outdegree 1.
This proves (i).
If $p = r$, then $p = r < q < x$ shows that the arc $(p, x)$ is
redundant, a contradiction. Hence $p \ne r$, and similarly $q \ne r$.
This proves (ii).
By Lemma 4.12, (iii) and (iv) hold. Then (v) follows from
(ii) and (iv). To see (vi), suppose $M(p) \subseteq M(x)$. Then
$P(x, p) = M(p) \cap M(x) = M(p)$, contradicting (A7). □
We can now prove Theorem 4.1.
Call a phylogenetic network $N = (V, A, r, X)$ "smaller"
than a network $N' = (V', A', r', X')$ if either $N$ has
fewer vertices than $N'$ or else $N$ and $N'$ have the same
number of vertices but $N$ has more members $x \in X$ such
that $\delta(x) = \emptyset$ than $N'$ has members $x' \in X'$ such that
$\delta(x') = \emptyset$.
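This ordering is lexicographic: the vertex count is compared first, and ties are broken in favour of the network with more base-set members whose $\delta$ is empty. A small Python sketch, in which the function name `smallness_key` and the sample counts are hypothetical, expresses it as a sort key.

```python
def smallness_key(num_vertices, num_empty_delta):
    """Key such that key(N) < key(N') exactly when N is "smaller" than N'.

    Fewer vertices wins; with equal vertex counts, more members x of X
    with delta(x) empty wins, hence the negated second component.
    """
    return (num_vertices, -num_empty_delta)


# Dropping a vertex makes a network smaller regardless of the delta count.
print(smallness_key(5, 0) < smallness_key(6, 4))
# With the same vertex count, one more empty delta makes it smaller.
print(smallness_key(6, 3) < smallness_key(6, 2))
```

Because both components are bounded integers, every strictly decreasing chain of keys is finite, which is what the induction relies on.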
The proof is by induction using the notion of smallness.
If $|V| = 1$, then $V = \{r\}$ and $N$ is uniquely determined. If
$|V| = 2$, let $V = \{r, v\}$; then clearly $A$ consists of the single
arc $(r, v)$. Hence we may assume $|V| \ge 3$ whence $|X| \ge 3$. We
assume the result when the network is "smaller" than
$N = (V, A, r, X)$.
By the hypotheses (A1)–(A8) we know that every vertex
$x \in X$ satisfies one of the following descriptions (i)–(ix):
(i) $x = r$.
(ii) $x$ is a normal leaf with parent $p \in X$.
(iii) $x$ is a normal leaf with parent $p$ such that $p$ has a
normal child $c$ distinct from $x$.
(iv) $x$ is a normal leaf with parent $p$ such that $p$ has a
hybrid child $c$, $p$ is normal with parent $p'$, and $p' \in X$.
(v) $x$ is a normal leaf with parent $p$ such that $p$ has a
hybrid child $c$, $p$ is normal with parent $p'$, and $p'$ has a
normal child $b$ distinct from $p$.
(vi) $x$ is a hybrid leaf with parents $p$ and $q$.
(vii) $x \ne r$, and $x$ has a normal child.
(viii) $x \ne r$, $x$ has a hybrid child, and $x$ is normal with
parent $p'$ such that $p' \in X$.
(ix) $x \ne r$, $x$ has a hybrid child, $x$ is normal with parent $p'$,
and $p'$ has a normal child $b$ distinct from $x$.
For each $x \in X$, compute $\delta(x)$. We have different cases
depending on the result.
Case 1: Suppose there exists $x \in X$ with $\delta(x) \ne \emptyset$. If $x$ is
not a leaf, then one of cases (vii), (viii), or (ix) above
occurs, and $\delta(x) = \emptyset$ by Lemmas 4.4–4.6. Hence $x$ is a leaf
of $N$, so one of cases (ii)–(vi) occurs. Then by Lemma 4.7,
4.8, 4.9, 4.10, or 4.11, respectively, $O(x) = \delta(x)$.
Case 1a: Suppose $x$ is a normal leaf with parent $p$. By
hypothesis $M(x)$ is known. Form a new network $N' =
(V', A', r, X')$ by setting $V' = V - \{x\}$ and $A' = A - \{(p, x)\}$,
$X' = (X - \{x\}) \cup \{p\}$. Note $M(p) = M(x) - O(x) = M(x) -
\delta(x)$, so $M(p)$ is known. If $X$ already contains a vertex $u$ such
that $M(u) = M(p)$, then $u$ and $p$ may be identified together
by Lemma 4.2; otherwise, $p$ is a new vertex in $X' - X$. Note
that $N'$ has fewer vertices than $N$ so by induction $N'$
is uniquely determined. Hence $N$ is determined by
$V = V' \cup \{x\}$, $A = A' \cup \{(p, x)\}$, $X' = X \cup \{x\}$. Note $M(x)$
was already known.
Case 1b: Suppose $x$ is a hybrid leaf with parents $p$ and $q$.
Let $x'$ be the separated vertex for $x$. Then $O(x') = \emptyset$,
$M(x') = M(x) - O(x) = M(x) - \delta(x)$. Form a new network
$N' = (V', A', r, X')$ by setting $V' = (V - \{x\}) \cup \{x'\}$
and $A' = (A - \{(p, x), (q, x)\}) \cup \{(p, x'), (q, x')\}$,
$X' = (X - \{x\}) \cup \{x'\}$. Note that $N'$ has the same number of vertices
as $N$, but now $\delta(x') = O(x') = O(x) - O(x) = \emptyset$. Hence $N'$
is "smaller" than $N$, so by induction $N'$ is uniquely
determined. Hence $N$ is determined by $V = (V' - \{x'\}) \cup
\{x\}$, $A = (A' - \{(p, x'), (q, x')\}) \cup \{(p, x), (q, x)\}$,
$X' = (X - \{x'\}) \cup \{x\}$. Note $M(x)$ was already known.
Observe that the formal constructions in Cases 1a and 1b
are identical: the removal of $x$ and insertion of $u$ such that
there is an arc $(u, x)$ and such that $M(u) = M(x) - \delta(x)$.
Hence we do not need to know which of the two subcases is
occurring.
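The common construction of Cases 1a and 1b can be sketched as a single reduction step. The Python below is a minimal illustration under stated assumptions, not the paper's implementation: $\delta(x)$ is taken to be the intersection of $M(x) - (M(a) \cup M(b))$ over all pairs $a$, $b$ in $X$ other than $x$, the function name `case1_step` and the label scheme `parent_of_...` are hypothetical, and the toy genomes come from an illustrative tree with arcs r→u, r→c, u→a, u→b.

```python
from itertools import combinations


def delta(M, x):
    """Assumed: intersection over pairs a, b != x of M(x) - (M(a) | M(b))."""
    result = set(M[x])
    for a, b in combinations([v for v in M if v != x], 2):
        result &= M[x] - (M[a] | M[b])
    return frozenset(result)


def case1_step(M, arcs):
    """Apply one Case 1 reduction, or return None if every delta(x) is empty."""
    for x in sorted(M):
        d = delta(M, x)
        if not d:
            continue
        genome_u = M[x] - d
        # Lemma 4.2: a vertex already in X with this genome must be u itself.
        u = next((v for v in M if v != x and M[v] == genome_u), None)
        if u is None:
            u = f"parent_of_{x}"  # hypothetical label for a new vertex
        reduced = {v: g for v, g in M.items() if v != x}
        reduced[u] = genome_u
        return reduced, arcs + [(u, x)]
    return None


M = {
    "r": frozenset(),
    "a": frozenset({1, 2}),
    "b": frozenset({1, 3}),
    "c": frozenset({4}),
}
M1, arcs1 = case1_step(M, [])       # removes a, inserts a new vertex above it
M2, arcs2 = case1_step(M1, arcs1)   # removes b, reusing the vertex found above
print(arcs2)
```

The second step finds that the genome computed for the vertex above b already belongs to the vertex inserted in the first step, so the two leaves are attached to the same parent.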
Case 2: Suppose there exists no $x \in X$ with $\delta(x) \ne \emptyset$. By
Lemma 4.13 there are no normal leaves. By Lemma 4.15
there exist distinct $x$, $p$, and $q$ in $X$ that satisfy the
hypotheses of Lemma 4.14. By Lemma 4.14, $x$ is a hybrid
leaf with parents $p$ and $q$, $x$ is the only child of $p$, and $x$ is
the only child of $q$. Since $\delta(x) = \emptyset$, $x$ is already a separated
hybrid vertex.
By hypothesis $M(x)$ is known. Form a new network
$N' = (V', A', r, X')$ by setting $V' = V - \{x\}$, $A' = A -
\{(p, x), (q, x)\}$, $X' = X - \{x\}$. Then $N'$ is a normal phylogenetic
network; note $M(p)$ and $M(q)$ are known since $p$ and
$q$ are in $X$. Since $|V'| < |V|$, by induction it follows that $N'$
is uniquely determined. But then $N$ is uniquely determined
by $V = V' \cup \{x\}$, $A = A' \cup \{(p, x), (q, x)\}$, $X = X' \cup \{x\}$.
Since $M(x)$ was already known, the genome at each
member of $V$ is determined.
This completes the reconstruction of $N$.
Note that the procedure given above is constructive.
In each stage a network $N = (V, A, r, X)$ is replaced
by a network $N' = (V', A', r, X')$ such that $|X'| \le |X|$ and
either $|V'| < |V|$ or else ($|V'| = |V|$ but $X'$ has more
separated hybrid vertices than $X$). Suppose that $|C| = m$
and in the initial network $|V| = v$ and $|X| = n$. The number
of stages is then at most $nv$. The computation of all
$\delta(x, a, b)$ at each stage takes time at most $O(n^3 m)$. Hence a
naive implementation has time complexity $O(n^4 v m)$. By
Lemma 3.3, every vertex $u$ not in $X$ can be written as $u =
\mathrm{mrca}(x_1, x_2)$ for some $x_1$ and $x_2$ in $X$, whence $v = O(n^2)$. It
follows that the reconstruction procedure has time complexity $O(n^6 m)$.
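The bound can be written out step by step. The split of the per-stage cost into $n$ choices of $x$, $O(n^2)$ pairs $(a, b)$, and $O(m)$ work per set operation is an assumed reading of the $O(n^3 m)$ figure; the text states only the total.

```latex
\begin{align*}
\text{number of stages} &\le n v, \\
\text{cost per stage} &= O(n \cdot n^{2} \cdot m) = O(n^{3} m), \\
\text{total cost} &= O(n v \cdot n^{3} m) = O(n^{4} v m), \\
v = O(n^{2}) \;\Rightarrow\; \text{total cost} &= O(n^{4} \cdot n^{2} \cdot m) = O(n^{6} m).
\end{align*}
```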
5. Discussion
This paper describes a method for reconstructing a
phylogenetic network and its ancestral genomes, given the
genomes of the leaves and outgroup, under certain
assumptions. The assumptions are sometimes quite
strong and future research is desirable to weaken these
assumptions. Examples show, however, that some version
of assumptions (A5)–(A7) for Theorem 4.1 will be
required.
The most difficult part of Theorem 4.1 is identifying x, p,
and q in X such that x is hybrid with parents p and q. The
assumptions (A5)–(A8) serve to permit this identification.
Hence a different criterion to recognize a hybrid vertex
could potentially help relax these assumptions.
It would be of interest to be able to deal with non-binary
characters. Another desirable extension would be a
reconstruction in case the evolution of characters follows
a Markov process rather than the Simple Homoplasy
Model.
Most desirable would be a way to include some manner
of homoplasy at normal vertices.
Acknowledgments
I wish to thank the Isaac Newton Institute in Cambridge
UK for its hospitality in a wonderful setting while I wrote
this paper. I also thank Mike Steel and Katherina Huber
for helpful conversations. Finally I thank the anonymous
referees for their excellent corrections and suggestions.
References
Bandelt, H.-J., Dress, A., 1992. Split decomposition: a new and useful
approach to phylogenetic analysis of distance data. Mol. Phylogenet.
Evol. 1, 242–252.
Baroni, M., Steel, M., 2006. Accumulation phylogenies. Ann. Combin. 10,
19–30.
Baroni, M., Semple, C., Steel, M., 2004. A framework for representing
reticulate evolution. Ann. Combin. 8, 391–408.
Baroni, M., Semple, C., Steel, M., 2006. Hybrids in real time. Syst. Biol.
55, 46–56.
Bordewich, M., Semple, C., 2007. Computing the minimum number of
hybridization events for a consistent evolutionary history. Discrete
Appl. Math. 155, 914–928.
Cardona, G., Rosselló, F., Valiente, G., 2007. Comparison of tree-child
phylogenetic networks. arXiv:0708.3499v1 [q-bio.PE] 27 August
2007.
Gusfield, D., Eddhu, S., Langley, C., 2004a. Optimal, efficient reconstruction of phylogenetic networks with constrained recombination.
J. Bioinformatics Comput. Biol. 2, 173–213.
Gusfield, D., Eddhu, S., Langley, C., 2004b. The fine structure of galls in
phylogenetic networks. INFORMS J. Comput. 16, 459–469.
Hein, J., 1990. Reconstructing evolution of sequences subject to
recombination using parsimony. Math. Biosci. 98, 185–200.
Hein, J., 1993. A heuristic method to reconstruct the history of sequences
subject to recombination. J. Mol. Evol. 36, 396–405.
Moret, B., Nakhleh, L., Warnow, T., Randal Linder, C.R., Tholse, A.,
Padolina, A., Sun, J., Timme, R., 2004. Phylogenetic networks:
modeling, reconstructibility, and accuracy. IEEE Trans. Comput. Biol.
Bioinformatics 1, 13–23.
Nakhleh, L., Warnow, T., Linder, C.R., St. John, K., 2005. Reconstructing reticulate evolution in species: theory and practice. J. Comput.
Biol. 12, 796–811.
Strimmer, K., Moulton, V., 2000. Likelihood analysis of phylogenetic
networks using directed graph models. Mol. Biol. Evol. 17, 875–881.
Wang, L., Zhang, K., Zhang, L., 2001. Perfect phylogenetic networks with
recombination. J. Comput. Biol. 8, 69–78.
Willson, S.J., 2007. Unique determination of some homoplasies at
hybridization events. Bull. Math. Biol. 69, 1709–1725.