Restrictions on meaningful phylogenetic networks

advertisement
Restrictions on meaningful phylogenetic networks
Abstract: It is common to assume that a phylogenetic network is a rooted acyclic
digraph. Let X denote the set of leaves. The author argues that, if the network is to
be uniquely determined by data on X, then additional assumptions about the network
are needed. Under a simple model of genetic inheritance with no homoplasies, the
network should be assumed regular (in the sense of Baroni et al 2004; see Baroni
and Steel 2006).
A vertex is called "hybrid" if it has at least two parents and "normal" if it has exactly
one parent. If v is a vertex, a "normal path from v to X" is a directed path starting at
v and ending at some member of X, such that each vertex on the path (other than
possibly v itself) is normal. Under a simple model of inheritance that allows
homoplasies only at hybrid vertices, there is unique determination of the genomes at
vertices provided that from each vertex v there is a normal path to X. If from every
vertex there exists a normal path to X, then in fact the graph is regular. Hence the
author argues that the search for meaningful phylogenetic networks should be
restricted to networks such that from each vertex there is a normal path to X.
Restrictions on meaningful phylogenetic networks
Stephen J. Willson
Department of Mathematics
Iowa State University
Ames, Iowa 50011
USA
swillson@iastate.edu
A phylogenetic network is an acyclic rooted directed graph G in which the leaves
are identified with known taxa.
1
12
11
10
9
2
13
14
15
3
4
7
8
6
5
X = {1,2,3,4,5,6,7,8} is the base-set.
X contains the vertices corresponding to taxa about which direct information is
known. Other vertices are reconstructions.
The problem is to learn about G from information on X.
What makes phylogenetic trees meaningful?
1
8
10
9
2
11
12
3
•
6
7
5
4
Look for a most recent common ancestor:
mrca(2,6) = 8.
•
Predict where some character evolved:
If a character is observed in taxa 2, 3, 4, 5 and nowhere else, where did it most
likely arise?
Write v ! w if there is a directed path (possibly trivial) from v to w.
1
12
11
10
9
2
13
14
15
3
4
7
8
6
5
11 ! 4
It is false that 11 ! 6.
Interpret v ! w to mean that v has the possibility of directly contributing to the
genome of w.
What we can learn depends on our model of evolution.
Let C be the set of characters. Assume
(1) Each character is binary. (An allele agrees with the root or disagrees.)
(2) Each vertex v has a genome g(v) " C
where i # g(v) iff v exhibits the allele of i different from the allele at the root.
1
12
11
10
9
2
13
3
14
15
7
8
6
4
5
C = {d,e,f,h,i,j,k}
g(3) = {d,e,h} means:
(1) On characters d, e, h the genome of 3 disagrees with the genome at the root 1.
(2) On characters f, i, j, k the genome of 3 agrees with the genome at the root 1.
Evolution Model 1. (Monotonic, or gene-aggregation).
Every informative character i mutates exactly once (at ui ).
All descendents of ui (and only those descendents) exhibit the modified character i.
1
12
11
10
9
2
13
3
14
15
7
8
6
4
5
If ui = 10, then modified character i is present in precisely {4,5,6,7,8,9,10,14,15}.
Following Baroni, Semple, and Steel 2004, form the cluster map
c: V $ P(X) by
c(v) = {x # X: v!x}.
1
12
11
10
9
2
13
14
15
3
4
7
8
6
5
c(11) = {2,3,4,5}
c(6) = {6}
If a character originates at 11, we see it expressed at {2,3,4,5}.
Example.
1
6
7
8
2
9
3
10
4
5
Here X = {1,2,3,4,5}
c(6) = {2,3,4,5} = c(7)
We could not tell whether a character in {2,3,4,5} originates at 6 or at 7 by looking
at genomes of members of X.
These two networks will have the same patterns on X:
1
1
6
7
7
8
9
10
8
9
10
4
5
4
5
2
3
2
3
To be able to determine where a character i originates under this simple model we
must have that c is one-to-one.
Example.
1
12
11
10
6
7
2
3
c(1) = {1,2,3,4,5}
c(2) = {2}
c(3) = {3}
c(4) = {4}
c(5) = {5}
c(6) = {6}
c(7) = {2,3}
c(8) = {3,4}
c(9) = {4,5}
c(10) = {2,3,4,5}
c(11) = {3,4,6}
c(12) = {2,3,4,5,6}.
8
4
9
5
1
12
11
10
6
7
2
8
3
4
9
5
Hence c is one-to-one.
c(8) = {3,4}
c(10) = {2,3,4,5}
Note c(8) " c(10) but it is false that 10 ! 8.
But compare:
1
12
11
10
6
7
2
8
3
4
9
5
Note that c(10) = {2, 3, 4, 5}, exactly as before.
The genomes at all the members of X are exactly the same as before.
Assuming Model 1, we can't distinguish the two graphs on the basis of information
on X.
Moral. Assume Model 1. If G is to exhibit all the genetic influences that could
determine the genomes at members of X, then
u ! v iff c(v) " c(u).
Proof.
If u ! v then c(v) " c(u), because if x # X and v ! x, then u ! v ! x.
Suppose c(v) " c(u) but it is false that u ! v.
Then there is some possible influence of u on v that is not described by the network.
Adding the arc (u,v) would not change any genomes of X.
QED
Definition. (Baroni, Semple, Steel 2004) The network G is regular iff
(1) c is one-to-one; and
(2) u ! v iff c(v) " c(u).
Theorem. Assume Model 1. Suppose
(1) G is to have the genomes of all vertices determined by the genomes at members
of X; and
(2) G is to exhibit all the genetic influences that could determine the genomes at
members of X.
Then G must be regular.
I argue that any meaningful network must be regular.
Otherwise, it is not justified by the data on X alone in this simplest of models for
inheritance.
Model 1 is very strong, especially at hybrid vertices.
A vertex v is hybrid if it has at least two parents. A vertex v is normal if it has at
most one parent.
1
12
11
10
9
2
13
14
15
3
4
5
15 is hybrid.
All other vertices are normal.
7
6
8
Objection to Model 1:
If y has parents p and q, it is unlikely that it inherits all the modified genes in both p
and q. Such inheritance might happen in polyploidy but not in general. We should
allow homoplasies at hybrid vertices.
q
p
y
Another difficulty.
p
x
q
y
If i # g(x) but i % g(y), maybe
(1) i originates at x; or
(2) i originates at p, but doesn't get inherited by y.
Such an appearance at p but immediate disappearance in p's child y is an immediate
homoplasy.
Assume there are no immediate homoplasies.
Model (2) Monotonic with hybrid homoplasies
Assume normal vertices inherit all the modified characters in their parent.
Assume hybrid vertices may or may not inherit a character from a parent.
Assume there are no immediate homoplasies.
1
12
11
10
9
2
13
3
14
15
7
8
6
4
5
If ui = 9 then possible patterns are {7, 8}
If ui = 10 then possible patterns are {4, 5, 6, 7, 8}, {6, 7, 8}
If ui = 11 then possible patterns are {2, 3, 4, 5}, {2, 3}
If ui = 12 then possible patterns: {2, 3, 4, 5, 6, 7, 8}, {2, 3, 6, 7, 8}
If ui = 13 then possible patterns are {3, 4, 5}
If ui = 14 then possible patterns are {4, 5, 6}
If ui = 15 then possible patterns are {4, 5}
Note {6} is not a possibility if ui = 14 since it requires an immediate homoplasy.
1
12
11
10
9
2
13
3
14
15
7
6
4
5
If ui = 2 then possible patterns are
If ui = 3 then possible patterns are
If ui = 4 then possible patterns are
If ui = 5 then possible patterns are
If ui = 6 then possible patterns are
If ui = 7 then possible patterns are
If ui = 8 then possible patterns are
{2}
{3}
{4}
{5}
{6}
{7}
{8}
8
1
12
11
10
9
2
13
14
15
3
4
7
8
6
5
Each of the indicated patterns occurs exactly once in the example.
Define the generalized cluster map
gc: V-{r} $ P(P(X))
gc(v) = { U: U " X and U arises under Model (2) as the pattern that tells where
some character i originating at v is exhibited}.
gc(10) = {{4,5,6,7,8}, {6,7,8}}
The generalized cluster map gc is generalized injective if
(1) no U " X appears in both gc(u) and gc(v), and
(2) & never appears in gc(u).
Moral. If the genomes of the vertices are to be uniquely determined by information
on X under Model 2, then gc is generalized injective.
Proof. If U # gc(u) and U # gc(v), then when {x # X: i # g(x)} = U, we cannot tell
whether u = ui or v = ui. QED
I argue that if G is a meaningful network, then its generalized cluster map should be
generalized injective.
How do we recognize such networks?
A normal path from u to v is a directed path from u to v such that every vertex
after u is normal. Note u may or may not be hybrid.
1
12
11
10
9
2
13
3
14
15
7
8
6
4
5
The path 12, 11, 2 is normal.
The path 14, 15, 5 is not normal.
There is a normal path from u to X if there is a normal path from u to some x # X.
There is a normal path from 10 to X given by 10, 9, 7.
There is a normal path from 15 to X given by 15, 5.
Definition. A network is normal if from every vertex u there is a normal path to X.
Main Theorem. Suppose G is a normal network. Then
(1) G is regular.
(2) gc is generalized injective.
Example.
1
10
11
13
2
12
15
14
16
3
17
18
4
9
8
5
7
6
G is normal. Hence G is regular, and gc is generalized injective.
Example.
1
10
9
6
7
8
5
2
4
3
Here X = {1,2,3,4,5}. There is no normal path from 7 to X.
gc(9) contains {2,3} and gc(6) contains {2,3}. Hence gc is not generalized injective.
Theorem.
Let X consist only of the leaves and the root. Suppose that every vertex other than
the root has outdegree 0 or 2, while the root has outdegree 1. Assume that the
generalized cluster map gc is generalized injective. Then the network is normal.
Moral. I propose that meaningful networks are normal.
How complicated are normal networks?
Theorem.
Suppose that G = (V,A,r,X) is an acyclic rooted network and |X| = n (including the
root). Let v = |V|.
(1) If G is not regular, then v is unbounded.
(2) If G is regular, then v ! 2n-1.
(3) If G is a tree, then v ! 2n-2.
How complicated are normal networks?
Theorem.
Suppose that G = (V,A,r,X) is an acyclic rooted network and |X| = n (including the
root). Let v = |V|.
(1) If G is not regular, then v is unbounded.
(2) If G is regular, then v ! 2n-1.
(3) If G is a tree, then v ! 2n-2.
(4) If G is normal, then v ! (n2-n+2)/2
Thus the assumption of normality is a large simplification.
Let K be a nonempty set of vertices. Then the
most recent common ancestor of K
= mrca(K)
is the vertex v such that
(1) For all k # K, v ! k;
(2) Suppose that for all k # K, u ! k; then u!v
if it exists.
Example.
0
11
10
9
6
1
7
2
3
8
4
5
mrca(3,4) = 10.
mrca(2,4) does not exist since both 9 and 10 are candidates.
Some properties of normal networks:
(1) For each vertex v, v = mrca(c(v)).
(2) For each vertex v that is not in X, there exist x and y in X such that
v = mrca(x, y).
(3) There may exist x and y in X such that mrca(x, y) does not exist.
Future work:
(1) Find more properties of normal networks.
(2) Models 1 and 2 are kinds of perfect phylogeny. Find analogous restrictions for
other models of evolution and for characters that are not binary.
(3) What is the maximum number of vertices in a normal network with |X| = n?
We know v ! (n2-n+2)/2.
(4) What are properties of "regularization" and "normalization"?
Summary
I suggest that we should seek normal networks, i.e., networks such that from every
vertex u there is a normal path to X.
References
Baroni, Mihaela, Charles Semple, and Mike Steel, A framework for representing
reticulate evolution, Annals of Combinatorics 8 (2004) 391-408.
Baroni, Mihaela and Mike Steel. Accumulation phylogenies. Annals of
Combinatorics 10 (2006) 19-30.
Willson, Stephen. Unique determination of some homoplasies at hybridization
events. Bulletin of Mathematical Biology 69 (2007) 1709-1725.
Thank you for your attention.
1
12
11
10
6
7
2
8
3
4
9
5
Here c is one-to-one.
c(8) = {3,4}
c(10) = {2,3,4,5}
Note c(8) " c(10) but it is false that 10 ! 8.
But compare:
1
12
11
10
6
7
2
8
3
4
9
5
and
1
12
11
10
13
7
2
8
3
4
6
9
5
and
1
12
11
10
6
13
7
2
8
3
4
9
5
In these graphs, there is also an extra vertex, but no character originated at the new
vertex 13. One might look for a character i such that
{x # X: i # g(x)} is {3,4,5} for the third graph or {2,3,4} for the fourth graph.
The graphs differ, yet the leaves might have exactly the same genomes under Model
1.
An arc (u,v) is redundant if there is a vertex w such that u<w<v.
We assume there are no redundant arcs. The possibility of u contributing to the
genome of v is already implied by the other arcs.
1
12
11
10
9
2
13
3
14
15
7
8
6
4
5
The arc (12,15) is redundant since 12 < 13 < 15, so arc (12, 15) should be omitted.
Hence there is an arc (u,v) iff u!v, u'v, and there is no w such that
w ' u, w' v, u!w!v.
Download