An algorithm of reconciliation of gene and species trees and
inferring gene duplications, losses and horizontal transfers
K. Yu. Gorbunov, V. A. Lyubetsky
For a computer science audience, the algorithm of [1] is discussed without biological details. The
algorithm reconciles given gene and species trees, accounting for the evolutionary events of gene
duplication, loss and horizontal transfer.
Introduction. For non-biologists, let us define some terms that are explained in detail in
modern textbooks on molecular biology (some are also cited in [1]). A gene is a sequence of
characters over a four-letter alphabet, fixed at any given moment in time. A gene tree is a tree
whose vertices are assigned genes with a certain degree of affinity. In gene trees and species
trees, leaves contain extant (now existing) genes and species, while internal vertices contain
ancestral genes and species. The edges of the species tree S can be subdivided into stacks of
slices, from the root (ancestral state) to the leaves (modern state), such that a single slice
contains edges that coexist in time and can thus undergo simultaneous evolutionary events. In
doing so, some of the long edges in tree S are cut into subedges, each having only one descendant
subedge. We call both the edges and the subedges of the sliced tree the edges of tree S. The
algorithm for slicing the tree S is described in
[1] and possesses a natural property: any edge is contained within one slice and any connected
edges are contained in different slices. Finally, a time slice contains only non-connected edges
from S equidistant from its root. Trees themselves can be interpreted as discrete times. The gene
tree and the species tree differ in the sense that evolutionary times for gene and species may be
different. A gene in a species can undergo three types of events: it can copy itself and retain
the parental copy (duplication); disappear (loss); or copy itself and transfer the copy to another
species, either retaining the parental copy (transfer with retention) or losing it (transfer
without retention). A species can undergo only one event: it copies itself as a set of genes, and
the two sets proceed to evolve independently in time (speciation). Species evolution can be
interpreted as the co-evolution of its set of genes; more distantly related species evolve more
differently. The general task is to map the evolution of genes onto the evolution of species,
i.e., to reconcile the evolutionary times of genes and species. Such reconciliation is done in [1]
by a mapping β of the gene tree G into the species tree S.
Mapping β is computed by the algorithm as a minimizer of the functional H, a weighted sum of
three values: the numbers of duplications, losses and transfers (under a given β of G into S);
i.e., H is a function of β. In other words, evolution is described as the evolution of the
ancestral gene in the ancestral species with a minimal number of evolutionary events (the maximum
parsimony principle); here the ancestral gene and species are those assigned to the roots of
trees G and S. The set of all event types and their corresponding time slices is called the
scenario. Minimizing H yields an H-optimal scenario, referred to below as the optimal scenario.
The value of H for a scenario is called its total cost; to define H, each event type is assigned
its own cost per single event. Here "cost" is synonymous with "penalty".
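The weighted sum H described above can be written out as a short Python sketch. The names `Costs` and `total_cost` are illustrative assumptions and do not come from [1]; the sketch only shows how a total cost follows from the event counts of one scenario and the per-event costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Costs:
    """Per-event costs (penalties); the field names are hypothetical."""
    duplication: float
    loss: float
    transfer: float


def total_cost(n_duplications: int, n_losses: int, n_transfers: int,
               costs: Costs) -> float:
    """Weighted sum H for one scenario: each event count times its cost."""
    return (costs.duplication * n_duplications
            + costs.loss * n_losses
            + costs.transfer * n_transfers)


# Example: 2 duplications, 3 losses and 1 transfer.
example = total_cost(2, 3, 1, Costs(duplication=2.0, loss=1.0, transfer=3.0))
```

With these example costs the scenario has total cost 2·2 + 3·1 + 1·3 = 10. The algorithm itself never enumerates scenarios this way; it minimizes H by induction over pairs of gene edges and tubes.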
Let us go into technical details. We call an edge in the species tree S a tube. In the initial
trees, each leaf in tree G is a gene and each leaf in tree S is a species, with one exception: tree
S has a so-called outgroup, a leaf (edge) assigned no species and connected by a tube with the root. The
outgroup tube also contains time slices and represents a group of organisms outside tree S and
gene gain/loss due to mutagenesis. Each leaf of a gene tree (a gene) is mapped onto a leaf in the
species tree (a species). One species can contain several genes; this relation between leaves in G
and leaves in S is called the gene-species relation. The algorithm constructs an optimal scenario
of mapping a binary gene tree G into a binary species tree S. This mapping can be visualized
by introducing an auxiliary tree G', which has all the vertices of G plus additional vertices
with two descendants, one of which is always a leaf marked with a cross; the cross means
gene loss. We call the edges leading to crossed leaves blind edges; they do not include their
beginning but do include their end. Each split in S corresponds to a split in G' (a speciation
event implies copying of all genes). Vertices and edges in trees G and G' are identical, except
for blind edges. In other words, without blind edges G' is isomorphic to the initial G. Trees G,
G' and S are oriented downward from the root, which is placed at the top. A root edge, going
upward from the root, is added to each tree. The root edge in G' represents the common ancestor
of all genes in G' (and in G), while the root edge in tree S represents the common ancestor of
all species in it. Tree G' is contained within the tubes of S: the root edge of G' is contained
within the root tube of S, and each vertex of G' is contained within a tube or within the split
of a tube. The term “split” refers to the end of a tube or an edge. One tube can contain several
edges of G'. A transfer event corresponds to an edge of G' starting in one tube and ending in
another within the same time slice. Depending on the type of transfer, the parental gene copy is
either retained in the initial tube or not. In mapping β each leaf in G corresponds to a leaf in
S (the gene is known from a given species), i.e., at the leaf level β agrees with the
gene-species relation; each leaf in G' not marked with a cross lies in a leaf of
tree S.
The events whose total cost is to be minimized are: gene duplication (a split in tree G'
inside a tube, with both copies retained in the tube); loss (crossed blind edges); the two
transfer types described above; gain (transfer from the outgroup, with or without retention
of the gene copy); and transfer to the outgroup without gene retention. We also consider three
types of events that have zero cost here (the values of these costs do not affect the
algorithm): a split in S with the survival of both descendants of the corresponding split in G',
duplication in the outgroup, and duplication in the root tube (with all descendants of at least
one of the copies transferred to the outgroup already in the root).
Algorithm formulation from [1]. Trees G and S are given. Since trees G and G' (without
blind edges) are isomorphic, an edge e in G is identified with the edge e in G'. The algorithm
enumerates pairs <edge e in G, tube d in S> and builds an auxiliary function f(e,d) whose value
indicates the event type along edge e of G' in tube d, under the assumption that e is inside d
and that the mapping into S of the subtree of G with e as the root edge is optimal (everywhere in
the sense of the functional H). Once f(e,d) is known for all pairs <e,d>, it directly defines
both the sought optimal mapping β of G into S and the tree G'; for this purpose f(e,d) is used as
a collection of references, starting from f(e0,d0), where e0 is the root edge in G and d0 is the
root tube in S.
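Using f(e,d) as a collection of references can be pictured with the following minimal Python sketch. It assumes, purely for illustration, that f is stored as a dictionary mapping a pair (e, d) to an event label and a list of successor pairs; this storage format and the name `trace_scenario` are hypothetical and are not taken from [1].

```python
def trace_scenario(f, root_pair):
    """Follow the references stored in f, starting from the root pair (e0, d0).

    f maps a pair (edge, tube) to (event_label, [successor pairs]).
    Returns the visited pairs with their events, parents before descendants.
    """
    scenario = []
    stack = [root_pair]
    while stack:
        pair = stack.pop()
        event, successors = f[pair]
        scenario.append((pair, event))
        # Push in reverse so that successors are visited in their stored order.
        stack.extend(reversed(successors))
    return scenario


# Toy table: the root edge e0 duplicates inside the root tube d0.
f_table = {
    ("e0", "d0"): ("duplication", [("e1", "d0"), ("e2", "d0")]),
    ("e1", "d0"): ("inside", []),
    ("e2", "d0"): ("inside", []),
}
events = trace_scenario(f_table, ("e0", "d0"))
```

Each pair is visited once, so the traversal is linear in the number of references followed.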
Enumeration proceeds from the leaves to the root. For each edge e, all tubes d are
enumerated starting from the later time slices, and within a slice starting from the outgroup
tube (one of the tubes that appeared after slicing the outgroup). Induction starts when e and d
end in leaves: then f(e,d) = «e is inside d» if e is a gene from species d; for any other leaf
tube d', f(e,d') = «e is transferred without retention to the tube d whose species contains
gene e». At an induction step, for each pair <e,d> several states exist that describe the fate
of e in d, and the optimal one is identified:
1) edge e splits inside tube d into edges e1 and e2 without losses;
2) edge e reaches the split of tube d and splits into descendants e1 and e2 inside d1
and d2, respectively, without losses;
3) edge e reaches tube d1, the single descendant of tube d;
4) edge e reaches the split of d with a loss of exactly one descendant e1 or e2 in d1 or
in d2;
5) edge e splits within d into e1 and e2 with the transfer of either e1 or e2 into
another tube that is within the same time slice as d but does not belong to the outgroup, with
the other descendant surviving within d;
6) edge e splits within d into e1 and e2 with the transfer of either e1 or e2 into
another tube that is within the same time slice as d but does not belong to the outgroup, with
the loss of the other descendant in d; if a transfer without retention is considered an
independent event, then the loss does not contribute to the total cost, but a transfer regarded
as a transfer with retention followed by a loss contributes the sum of the costs of these
individual events;
7) edge e splits in d into e1 and e2, with the transfer of one descendant into another
tube that is both within the same time slice as d and belongs to the outgroup, and the loss of
the other in d.
Cases 1-5 and 7 do not interfere in induction and easily define f(e,d) with inductive
references; e.g., for 1 we define f(e,d) = «duplication, <e1,d>, <e2,d>».
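To make the induction over pairs <e,d> concrete, here is a deliberately simplified Python sketch that covers only cases 1 to 4 (no transfers, no outgroup). It is an illustration under stated assumptions, not the algorithm of [1]: the input dictionaries, the function name `reconcile_dl`, the event labels, the default costs, and the reading of case 4 as "e continues into one descendant tube and is lost in the other" are all hypothetical choices made for this example.

```python
import math
from functools import lru_cache


def reconcile_dl(gene_children, leaf_species, tube_children,
                 root_edge, root_tube, dup_cost=2.0, loss_cost=1.0):
    """Return (minimal cost, event label) for the pair <root_edge, root_tube>.

    gene_children: gene edge -> (e1, e2); leaf edges are absent.
    leaf_species:  leaf gene edge -> leaf tube holding that gene.
    tube_children: tube -> tuple of one or two descendant tubes; leaves absent.
    """
    @lru_cache(maxsize=None)
    def best(e, d):
        options = []
        kids_e = gene_children.get(e, ())
        kids_d = tube_children.get(d, ())
        if not kids_e and not kids_d:
            # Induction base: a leaf gene in a leaf tube (transfers are omitted).
            fits = leaf_species[e] == d
            options.append((0.0 if fits else math.inf, "inside"))
        if kids_e:
            e1, e2 = kids_e
            # Case 1: e splits inside d, both copies stay in d.
            options.append((dup_cost + best(e1, d)[0] + best(e2, d)[0],
                            "duplication"))
        if len(kids_d) == 1:
            # Case 3: e passes into the single descendant tube.
            options.append((best(e, kids_d[0])[0], "pass"))
        if len(kids_d) == 2:
            d1, d2 = kids_d
            if kids_e:
                # Case 2: e splits at the split of d, one copy per descendant tube.
                options.append((min(best(e1, d1)[0] + best(e2, d2)[0],
                                    best(e1, d2)[0] + best(e2, d1)[0]),
                                "speciation"))
            # Case 4 (as read here): e survives in one descendant tube only.
            options.append((loss_cost + min(best(e, d1)[0], best(e, d2)[0]),
                            "speciation with loss"))
        return min(options, key=lambda option: option[0])

    return best(root_edge, root_tube)


genes = {"g": ("a", "b")}
tubes = {"r": ("A", "B")}
congruent = reconcile_dl(genes, {"a": "A", "b": "B"}, tubes, "g", "r")
paralogs = reconcile_dl(genes, {"a": "A", "b": "A"}, tubes, "g", "r")
# congruent == (0.0, "speciation"); paralogs == (3.0, "speciation with loss")
```

Memoization plays the role of the table of f(e,d) values: every pair is evaluated once, and the recursion only descends to child edges or descendant tubes, which mirrors the leaves-to-root enumeration order. In the second example one copy is lost at the split (cost 1) and the gene duplicates inside tube A (cost 2).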
Case 6 is slightly more complex. Applying induction requires considering the next step,
which again allows for the seven cases described above. However, two consecutive transfers
without retention (two cases 6) are not optimal because they can be described by a single transfer.
Other cases do not interfere in induction. Note: if tube d does not belong to the outgroup, the
case «5 after 6» can be omitted as having the same H value as the case “transfer of one copy
from d with retention and transfer of the other copy from d without retention”. If d belongs to the
outgroup but the cost of transfer with retention is not less than the cost of gain, the case «5 after
6» can again be omitted: a scenario with two gains has no higher H value. The case «5 after 6» is
then considered only if d belongs to the outgroup and the cost of transfer with retention is less
than the gain cost.
Denote by m.n case m followed by case n. To decrease computing time, a standard trick
(well known from computer science textbooks) should be used to eliminate the
enumeration of potential transfer acceptor tubes in cases 5, 6.1, 6.2, 6.3, 6.4, 6.5: during the
enumeration of pairs <e,d>, for each e and each time slice the relevant information is stored on
which tube d' is optimal to accept the transfer of e (d' is outside the outgroup). The relevant
information is:
case 5: minimal cost of the scenario starting with pair <e,d'>;
case 6.1: minimal sum of the costs of the scenarios starting with <e1,d'> and <e2,d'>,
where e1 and e2 descend from e;
case 6.2: minimal sum of the costs of the scenarios starting with <e1,d1'> and <e2,d2'>,
where e1 and e2 descend from e, d1' and d2' descend from d';
case 6.3: minimal cost of the scenario starting with <e,d1'>, where d1' is the only
descendant of d';
case 6.4: minimal cost of the scenario starting with <e,d1'>, where d1' is one of the two
descendants of d';
case 6.5: analogously to 5.
Therefore, at the step corresponding to pair <e,d>, the optimal transfer acceptor tubes are
known for all cases: they were precomputed for pairs of the form <e1,*> and <e2,*>, where e1 and
e2 descend from edge e (cases 5, 6.1, 6.2, 6.5), or for pairs <e,d'>, where d' belongs to a later
time slice than d (cases 6.3, 6.4). If in any of cases 5, 6.1, 6.2, 6.3 or 6.4 the best acceptor
tube is the tube d that contains edge e, the transfer is not parsimonious and the case is
rejected (in case 5 the transfer is replaced by a duplication, based only on the assumption that
its cost is less than the transfer cost). Case 6.5 is rejected when the tubes d' and d'', optimal
for descendants e1 and e2 of e respectively, coincide; it is replaced by the gain of e in d' and
its subsequent duplication (using the assumption stated above). If d' and d'' do not coincide,
case 6.5 corresponds to the scenario where the gain of e (e.g., in d') is followed by its split
and the transfer of e2 from d' into d'', i.e., a transfer with retention. Note that the
assumption on the ratio of duplication and transfer costs is not necessary if in all cases the
algorithm keeps both the optimal and the best suboptimal tube.
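The bookkeeping of an optimal and a best suboptimal acceptor tube can be sketched in Python as follows. The sketch assumes that, for a fixed gene edge and a fixed time slice, the costs of the scenarios starting in each non-outgroup tube are already available in a dictionary; the names `two_best` and `best_acceptor` are hypothetical.

```python
import math


def two_best(costs_in_slice):
    """Return the cheapest and second cheapest (cost, tube) in one time slice."""
    best = second = (math.inf, None)
    for tube, cost in costs_in_slice.items():
        if cost < best[0]:
            best, second = (cost, tube), best
        elif cost < second[0]:
            second = (cost, tube)
    return best, second


def best_acceptor(best, second, donor):
    """Cheapest acceptor tube other than the donor tube, found in O(1)."""
    return second if best[1] == donor else best


slice_costs = {"d1": 3.0, "d2": 1.0, "d3": 2.0}
best, second = two_best(slice_costs)
from_d2 = best_acceptor(best, second, "d2")   # (2.0, "d3")
from_d1 = best_acceptor(best, second, "d1")   # (1.0, "d2")
```

Storing two candidates per edge and slice means that a transfer out of tube d never has to rescan the slice: if the optimum is d itself, the runner-up is used instead.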
Algorithm complexity. Computing time is proportional to the product of the number of
gene tree edges and the number of tubes in the species tree with time slices already imposed.
When the tubes are sliced by the algorithm from [1], the number of new vertices does not exceed
the square of the number m of vertices in the initial species tree. The slicing algorithm
enumerates vertices from the leaves to the root, and each visited vertex is assigned a value:
the time separating it from the extant leaves. The values are sorted in increasing order, and
each tube acquires a number of new vertices equal to the number of integer values falling
between the times at the ends of the initial tube. Thus, a tube cannot contain more than m new
tubes, which proves the quadratic estimate. This gives a computing time that is cubic in the
larger of the numbers of leaves in the gene tree G and the initial species tree S. As a formula,
the time is O(n·m^2), where n is the number of leaves in G and m is the number of leaves in S
(in a binary tree the number of vertices is less than twice the number of leaves, so either count
can be used in the estimate). This estimate is straightforward.
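The slicing step described above can be illustrated with a small Python sketch. It assumes, as a stand-in for the times used in [1], that the value of a vertex is its height (0 for a leaf, otherwise one more than the largest height among its descendants); this choice and the function names are assumptions made only for the example.

```python
def vertex_heights(children, root):
    """Assign each vertex a time value: 0 for leaves, 1 + max over descendants."""
    heights = {}

    def visit(vertex):
        kids = children.get(vertex, ())
        heights[vertex] = 0 if not kids else 1 + max(visit(k) for k in kids)
        return heights[vertex]

    visit(root)
    return heights


def slice_tubes(children, root):
    """Cut every tube at each time value lying strictly between its two ends."""
    heights = vertex_heights(children, root)
    levels = set(heights.values())
    sliced = {}
    for parent, kids in children.items():
        for kid in kids:
            top, bottom = heights[parent], heights[kid]
            cuts = sorted((t for t in levels if bottom < t < top), reverse=True)
            bounds = [top, *cuts, bottom]
            # Consecutive boundaries give the subtubes, one per time slice.
            sliced[(parent, kid)] = list(zip(bounds, bounds[1:]))
    return sliced


species = {"r": ("x", "C"), "x": ("A", "B")}
subtubes = slice_tubes(species, "r")
# The long tube r -> C is cut once: [(2, 1), (1, 0)]; the others stay whole.
```

Each tube receives at most one cut per distinct time value, which is the counting argument behind the quadratic bound on the number of new vertices.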
Algorithm accuracy. The accuracy also follows from two simple observations. First,
the list of the seven described events is exhaustive, i.e., nothing else can happen to edge e
in a tube d. Second, for pairs <e1,d1> and <e2,d2>, where e1 and e2 are descendants of e, and
d1 and d2 are possibly coinciding tubes, the optimal mappings are independent and can be merged
into a single mapping for the initial pair <e,d>. This property is important for applications of
the algorithm to phylogenetic networks (directed acyclic graphs in which a vertex can have two
descendants as well as two ancestors). Our current algorithm does not apply to networks because
two subnetworks with root edges e1 and e2 can have common vertices, which imposes a
consistency condition on pairs of corresponding mappings: these vertices must be mapped into the
same point. However, when only the species tree S is represented as a network, the algorithm
is applicable and accurate.
The authors are grateful to Leonid Rusin for valuable discussion and help in preparation.
Reference
[1] K. Yu. Gorbunov and V. A. Lyubetsky. Reconstructing the Evolution of Genes along the
Species Tree. Molecular Biology, 2009, Vol. 43, No. 5, pp. 881–893. ISSN 0026-8933.
© Pleiades Publishing, Inc., 2009. Original Russian text © K. Yu. Gorbunov, V. A. Lyubetsky,
2009, published in Molekulyarnaya Biologiya, 2009, Vol. 43, No. 5, pp. 946–958.