
An algorithm of reconciliation of gene and species trees and inferring gene duplications, losses and horizontal transfers

K. Yu. Gorbunov, V. A. Lyubetsky

The algorithm [1] is discussed here for a computer science audience, without biological details. The algorithm reconciles given gene and species trees, accounting for the evolutionary events of gene duplication, loss and horizontal transfer.

Introduction. For non-biologists, let us define some terms that are explained at length in modern textbooks on molecular biology (some of which are cited in [1]). A gene is a sequence of characters over a four-letter alphabet, fixed at any given moment in time. A gene tree is a tree whose vertices are assigned genes with a certain degree of affinity. In gene trees and species trees, leaves contain extant (currently existing) genes and species, while internal vertices contain ancestral genes and species.

The edges of the species tree S can be subdivided into a stack of slices, from the root (ancestral state) to the leaves (modern state), such that a single slice contains edges that coexist in time and can therefore undergo simultaneous evolutionary events. In the process, some long edges of S are cut into subedges, each having only one descendant subedge. From here on, the edges and subedges of the sliced tree are called the edges of S. The slicing algorithm is described in [1] and has a natural property: every edge lies within one slice, and any two connected edges lie in different slices. Finally, a time slice contains only non-connected edges of S equidistant from its root. The trees themselves can be interpreted as discrete time scales. The gene tree and the species tree differ in that evolutionary times for genes and species may be different.
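The slicing described above can be sketched as follows. This is a minimal illustration, not the procedure from [1]: it assumes that every vertex of S already carries a time (its distance from the extant leaves, so leaves have time 0 and a parent is always older than its child), and it omits the root edge and the outgroup. The function name `slice_tree`, the input dictionaries and the tuple used to name a newly inserted vertex are all hypothetical.

```python
from bisect import bisect_left


def slice_tree(parent, time):
    """Cut every edge of S at each vertex time lying strictly between its ends.

    parent: dict child -> parent (the root has no entry).
    time:   dict vertex -> time before present (0 for extant leaves).
    Returns a list of tubes (upper_vertex, lower_vertex, slice_index);
    slice 0 is the slice adjacent to the leaves (an assumed numbering).
    """
    levels = sorted(set(time.values()))  # distinct vertex times, increasing
    tubes = []
    for child, par in parent.items():
        lo = bisect_left(levels, time[child])
        hi = bisect_left(levels, time[par])
        lower = child
        for i in range(lo, hi):
            # The last piece ends at the real parent; earlier pieces end at a
            # new vertex, named here by the pair (child, time of the cut).
            upper = par if i + 1 == hi else (child, levels[i + 1])
            tubes.append((upper, lower, i))
            lower = upper
    return tubes


# Hypothetical example: ((A,B)X,C)R with X at time 1 and R at time 2.
parent = {"A": "X", "B": "X", "X": "R", "C": "R"}
time = {"A": 0, "B": 0, "C": 0, "X": 1, "R": 2}
for tube in slice_tree(parent, time):
    print(tube)
```

In this example only the long edge from R to C is cut, so each resulting tube spans exactly one slice and connected tubes fall into different slices, which is the property stated above.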
A gene in a species can undergo three types of events: it can copy itself and retain the parental copy (duplication); disappear (loss); or copy itself and transfer the copy to another species, either retaining the parental copy (transfer with retention) or losing it (transfer without retention). A species can undergo only one event: it copies itself as a set of genes, and the two sets then evolve independently in time (speciation). Species evolution can be interpreted as the co-evolution of the species' set of genes; more distantly related species evolve more differently.

The general task is to map the evolution of genes onto the evolution of species, i.e., to reconcile the evolutionary times of genes and species. In [1] this reconciliation is done by a mapping β of the gene tree G into the species tree S. The algorithm computes β as a minimum of the functional H, a weighted sum of three values: the numbers of duplications, losses and transfers (under the given mapping β of G into S); that is, H is a function of β. In other words, evolution is described as the evolution of the ancestral gene in the ancestral species with the minimal number of evolutionary events (the maximum parsimony principle); here the ancestral gene and species are those assigned to the roots of G and S. The set of all events, with their types and corresponding time slices, is called the scenario. Minimizing H yields an H-optimal scenario, referred to below simply as an optimal scenario. The value of H for a scenario is called its total cost; to define H, each event type is assigned its own cost (penalty) per single event.

Let us go into technical details. Call an edge of the species tree S a tube. In the initial trees each leaf of G is a gene and each leaf of S is a species, with one exception: S has a so-called outgroup, a leaf (edge) that is assigned no species and is connected by a tube to the root. The outgroup tube is also divided into time slices; it represents a group of organisms outside S, as well as gene gain and loss due to mutagenesis.
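The functional H introduced above is a weighted count of events, which a few lines of code can make concrete. The sketch below is only an illustration: the event labels, the function name `total_cost` and the numeric costs are assumptions chosen for the example, not values taken from [1].

```python
from collections import Counter


def total_cost(events, cost):
    """Return H for a scenario: the sum over event types of cost * count.

    events: iterable of event-type labels occurring in the scenario.
    cost:   dict event type -> cost (penalty) of a single event.
    """
    counts = Counter(events)
    return sum(cost[kind] * n for kind, n in counts.items())


# Hypothetical per-event costs.
cost = {"duplication": 2.0, "loss": 1.0, "transfer": 3.0}

scenario_a = ["duplication", "loss", "loss", "transfer"]
scenario_b = ["transfer", "transfer"]

# The optimal scenario is the one with the smaller total cost.
best = min((scenario_a, scenario_b), key=lambda s: total_cost(s, cost))
```

With these costs the first scenario has total cost 7.0 and the second 6.0, so the second one would be preferred under the parsimony principle.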
Each leaf of the gene tree (a gene) is mapped onto a leaf of the species tree (a species). One species can contain several genes; this relation between the leaves of G and the leaves of S is called the gene-species relation. The algorithm constructs an optimal scenario of mapping one binary gene tree G into the binary species tree S.

This mapping can be visualized by introducing an auxiliary tree G', which has all the vertices of G plus additional vertices, each with two descendants, one of which is always a leaf marked with a cross; the cross denotes a gene loss. Edges leading to crossed leaves are called blind edges; a blind edge does not include its beginning but does include its end. Each split in S corresponds to a split in G' (a speciation event implies copying of all genes). Vertices and edges of G and G' are identical, except for the blind edges. In other words, without blind edges G' is isomorphic to the initial G.

The trees G, G' and S are oriented downward from the root, which is placed at the top. Each tree is supplemented with a root edge that goes upward from the root. The root edge of G' represents the common ancestor of all genes in G' (and in G), while the root edge of S represents the common ancestor of all species in S.

The tree G' is contained within the tubes of S: the root edge of G' lies within the root tube of S, and each vertex of G' lies within a tube or at the split of a tube. The term "split" refers to the end of a tube or an edge. One tube can contain several edges of G'. A transfer event corresponds to an edge of G' that starts in one tube and ends in another within the same time slice. Depending on the type of transfer, the parental gene copy either remains in the initial tube or does not. In the mapping β each leaf of G corresponds to a leaf of S (the gene is known from a given species), i.e. at the level of leaves β agrees with the gene-species relation; each leaf of G' not marked with a cross lies in a leaf of S.
The events whose total cost is to be minimized are: gene duplication (a split in G' inside a tube with both copies retained in the tube); loss (crossed blind edges); the two transfer types described above; gain (a transfer from the outgroup, with or without retention of the gene copy); and transfer to the outgroup without gene retention. Three further types of events are assigned zero cost here (the values of these costs do not affect the algorithm): a split in S with survival of both descendants of the corresponding split in G'; duplication in the outgroup; and duplication in the root tube (with all descendants of at least one of the copies transferred to the outgroup already at the root).

Algorithm formulation from [1]. Given are trees G and S. Since G and G' (without blind edges) are isomorphic, an edge e of G is identified with the edge e of G'. The algorithm enumerates pairs <edge e of G, tube d of S> and builds an auxiliary function f(e,d). Its value indicates the event type along edge e of G' in tube d under the assumption that e is inside d and that the mapping into S of the subtree of G with root edge e is optimal (here and below, in the sense of the functional H). Once f(e,d) is known for all pairs <e,d>, it directly defines both the sought optimal mapping β of G into S and the tree G': f(e,d) is used as a collection of references, followed starting from f(e0,d0), where e0 is the root edge of G and d0 is the root tube of S.

Enumeration proceeds from the leaves to the root. For each edge e, all tubes d are enumerated starting from the later time slices and, within a slice, starting from the outgroup tube (one of the tubes produced by slicing the outgroup). The induction starts when both e and d end in leaves: f(e,d) = "e is inside d" if e is a gene from species d; otherwise, for a leaf tube d' whose species does not contain e, f(e,d') = "e is transferred without retention to the tube d whose species contains gene e".
At the induction step, several cases describing the fate of e in d are considered for each pair <e,d>, and the optimal one is selected:
1) edge e splits inside tube d into edges e1 and e2 without losses;
2) edge e reaches the split of tube d and splits into descendants e1 and e2 inside d1 and d2, respectively, without losses;
3) edge e reaches tube d1, the single descendant of tube d;
4) edge e reaches the split of d with a loss of exactly one descendant, e1 or e2, in d1 or d2;
5) edge e splits within d into e1 and e2, with the transfer of either e1 or e2 into another tube that lies within the same time slice as d but does not belong to the outgroup, and with the other descendant surviving within d;
6) edge e splits within d into e1 and e2, with the transfer of either e1 or e2 into another tube that lies within the same time slice as d but does not belong to the outgroup, and with the loss of the other descendant in d; if a transfer without retention is treated as an independent event, the loss does not contribute to the total cost, whereas if it is treated as a transfer with retention followed by a loss, it contributes the sum of the costs of these two events;
7) edge e splits in d into e1 and e2, with the transfer of one descendant into another tube that lies within the same time slice as d and belongs to the outgroup, and with the loss of the other descendant in d.

Cases 1-5 and 7 do not interfere with the induction and readily define f(e,d) through inductive references; e.g., for case 1 we define f(e,d) = "duplication, <e1,d>, <e2,d>". Case 6 is slightly more complex. Applying induction requires considering the next step, which again allows for the seven cases described above. However, two consecutive transfers without retention (case 6 twice in a row) are not optimal, because they can be replaced by a single transfer. The other cases do not interfere with the induction.
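The induction can be illustrated on a deliberately reduced version of the problem. The sketch below covers only cases 1-4 (no transfers, no outgroup, no gain) and is not the algorithm of [1]. Case 4 is read here as "the lineage e continues into one descendant tube and is lost in the other", which is an assumption about the intended meaning. The function name `reconcile`, the input dictionaries, the event labels and the default costs are all hypothetical.

```python
from functools import lru_cache
import math


def reconcile(gene_children, tube_children, gene_species,
              root_edge, root_tube, dup_cost=2.0, loss_cost=1.0):
    """Return (minimal cost, f) where f[(e, d)] = (event, referenced pairs).

    gene_children: dict gene edge -> (e1, e2); leaves have no entry.
    tube_children: dict tube -> tuple of 1 or 2 child tubes; leaves have no entry.
    gene_species:  dict leaf gene edge -> leaf tube containing it.
    """
    f = {}

    @lru_cache(maxsize=None)
    def cost(e, d):
        kids_e = gene_children.get(e, ())
        kids_d = tube_children.get(d, ())
        if not kids_e and not kids_d:
            # Base of induction: a gene leaf inside a species leaf.
            ok = gene_species[e] == d
            f[e, d] = ("e is inside d" if ok else "infeasible", ())
            return 0.0 if ok else math.inf
        options = []
        if kids_e:
            e1, e2 = kids_e
            # Case 1: duplication inside d.
            options.append((dup_cost + cost(e1, d) + cost(e2, d),
                            "duplication", ((e1, d), (e2, d))))
        if len(kids_d) == 1:
            # Case 3: e passes into the single descendant tube.
            (d1,) = kids_d
            options.append((cost(e, d1), "pass", ((e, d1),)))
        if len(kids_d) == 2:
            d1, d2 = kids_d
            if kids_e:
                # Case 2: e splits together with d, no losses.
                for a, b in ((d1, d2), (d2, d1)):
                    options.append((cost(e1, a) + cost(e2, b),
                                    "speciation", ((e1, a), (e2, b))))
            # Case 4 (assumed reading): e survives in one child tube only.
            for a in (d1, d2):
                options.append((loss_cost + cost(e, a),
                                "speciation+loss", ((e, a),)))
        best = min(options, key=lambda o: o[0])
        f[e, d] = best[1:]
        return best[0]

    return cost(root_edge, root_tube), f
```

For a species tree with root tube r splitting into leaf tubes A and B, and a gene tree g with leaves a1 and a2 both belonging to A, the sketch chooses "pass e into A with one loss, then duplicate", with total cost 3.0 under the default costs. Memoization on pairs <e,d> plays the role of the leaves-to-root enumeration order described in the text, and following the references stored in f from the root pair recovers the scenario.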
Note: if tube d does not belong to the outgroup, the case "5 after 6" can be omitted, as it has the same H value as the case "transfer of one copy from d with retention and transfer of the other copy from d without retention". If d belongs to the outgroup but the cost of a transfer with retention is not less than the cost of a gain, the case "5 after 6" can again be omitted: a scenario with two gains has no higher H value. The case "5 after 6" is therefore considered only if d belongs to the outgroup and the cost of a transfer with retention is less than the gain cost. Let m.n denote case m followed by case n.

To decrease computing time, a standard technique (well known from computer science textbooks) is used to avoid enumerating potential transfer acceptor tubes in cases 5, 6.1, 6.2, 6.3, 6.4 and 6.5: during the enumeration of pairs <e,d>, for each e and each time slice, information is stored on which tube d' is optimal for accepting the transfer of e (d' lies outside the outgroup). The stored information is:
case 5: the minimal cost of the scenario starting with the pair <e,d'>;
case 6.1: the minimal sum of the costs of the scenarios starting with <e1,d'> and <e2,d'>, where e1 and e2 descend from e;
case 6.2: the minimal sum of the costs of the scenarios starting with <e1,d1'> and <e2,d2'>, where e1 and e2 descend from e, and d1' and d2' descend from d';
case 6.3: the minimal cost of the scenario starting with <e,d1'>, where d1' is the only descendant of d';
case 6.4: the minimal cost of the scenario starting with <e,d1'>, where d1' is one of the two descendants of d';
case 6.5: analogous to case 5.

Therefore, at the step corresponding to the pair <e,d>, the optimal transfer acceptor tubes are known for all cases: they were precomputed for pairs of the form <e1,*> and <e2,*>, where e1 and e2 descend from edge e (cases 5, 6.1, 6.2, 6.5), or for pairs <e,d'>, where d' belongs to a later time slice than d (cases 6.3, 6.4).
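One common way to realize such a precomputation is to keep, for each gene edge and each time slice, the cheapest and the second cheapest candidate tube, so that the best acceptor different from the donor is found in constant time. The sketch below illustrates this idea only; the function names `two_best` and `best_acceptor` and the example costs are hypothetical, and the input is assumed to contain only tubes outside the outgroup.

```python
import math


def two_best(costs):
    """Summarize one time slice for a fixed gene edge e.

    costs: dict tube -> minimal cost of the scenario starting with <e, tube>.
    Returns ((c1, t1), (c2, t2)) with c1 <= c2; a missing entry is (inf, None).
    """
    best = second = (math.inf, None)
    for tube, c in costs.items():
        if c < best[0]:
            best, second = (c, tube), best
        elif c < second[0]:
            second = (c, tube)
    return best, second


def best_acceptor(summary, donor):
    """Cheapest acceptor tube other than the donor tube, in O(1)."""
    best, second = summary
    return second if best[1] == donor else best


# Hypothetical costs for one slice.
summary = two_best({"d1": 5.0, "d2": 3.0, "d3": 4.0})
```

Here a transfer out of d2 would be directed to d3 at cost 4.0, while a transfer out of d1 or d3 would be directed to d2 at cost 3.0, without scanning the slice again for each donor.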
If in any of the cases 5, 6.1, 6.2, 6.3 or 6.4 the best acceptor tube turns out to be the tube d that contains edge e, the transfer is not parsimonious and the case is rejected (in case 5 the transfer is replaced by a duplication, based solely on the assumption that the duplication cost is less than the transfer cost). In case 6.5, if the tubes d' and d'' that are optimal for the descendants e1 and e2 of e, respectively, coincide, the case is rejected in favor of a gain of e in d' followed by its duplication (using the assumption stated above). If d' and d'' do not coincide, case 6.5 corresponds to the scenario in which the gain of e (e.g. in d') is followed by its split and the transfer of e2 from d' into d'', i.e. a transfer with retention. Note that the assumption about the ratio of duplication and transfer costs is not needed if in all cases the algorithm keeps both the optimal and the best suboptimal tube.

Algorithm complexity. Computing time is proportional to the product of the number of gene tree edges and the number of tubes in the species tree after time slicing. When the tubes are sliced by the algorithm from [1], the number of new vertices does not exceed the square of the number m of vertices in the initial species tree. The slicing algorithm enumerates vertices from the leaves to the root and assigns each visited vertex a value: the time separating it from the extant leaves. The values are sorted in increasing order, and each tube acquires as many new vertices as there are values falling strictly between the times at the two ends of the initial tube. Thus a tube cannot be cut into more than m new tubes, which proves the quadratic bound. This gives a computing time that is cubic in the larger of the numbers of leaves of the gene tree G and the initial species tree S. As a formula, the time is O(n·m²), where n is the number of leaves in G and m is the number of leaves in S (which, for a binary tree, differs from the number of vertices only by a constant factor). This estimate is straightforward.

Algorithm accuracy. The accuracy of the algorithm likewise follows from two simple observations.
First, the list of the seven cases described above is exhaustive, i.e. nothing else can happen to edge e inside tube d. Second, for pairs <e1,d1> and <e2,d2>, where e1 and e2 are descendants of e and d1 and d2 are tubes that may coincide, the optimal mappings are independent and can be merged into a single mapping for the initial pair <e,d>.

This property is important for applications of the algorithm to phylogenetic networks (acyclic directed graphs in which a vertex can have two descendants as well as two ancestors). Our current algorithm does not apply to networks, because two subnetworks with root edges e1 and e2 can have common vertices, which imposes a consistency condition on pairs of corresponding mappings: these vertices must be mapped to the same point. However, when only the species tree S is represented as a network, the algorithm is applicable and accurate.

The authors are grateful to Leonid Rusin for valuable discussion and help in preparation.

Reference

[1] K. Yu. Gorbunov and V. A. Lyubetsky. Reconstructing the Evolution of Genes along the Species Tree. Molecular Biology, 2009, Vol. 43, No. 5, pp. 881–893. ISSN 0026-8933. © Pleiades Publishing, Inc., 2009. Original Russian text © K. Yu. Gorbunov, V. A. Lyubetsky, 2009, published in Molekulyarnaya Biologiya, 2009, Vol. 43, No. 5, pp. 946–958.
