Directed scenarios of gene histories over an evolutionary tree as virtual reticulation events Boris Mirkin School of Computer Science and Information Systems, Birkbeck University of London, UK and: E. Koonin, M. Galperin, Y. Wolf, T. Fenner Outline: - Three aspects: Classification Visualisation Evolution - What is a gene history - Two approaches to building gene histories: - Mapping gene trees to species tree: beautiful theory, poor results - Mapping data straightforwardly: Building static scenarios Building directed scenarios Classification: Finding cluster structures Interpreting cluster structures Visualisation: A cognitive activity of building a mental image of a phenomenon in question for gaining insight and understanding (R. Spence, Information Addison-Wesley, 2001) Visualization, S. Dali: Dream Caused by the Flight of a Bee Around a Pomegranate One Second before Waking Up (1944) Our framework for visualisation: Mapping data onto a known ground image such as coordinate plane, geography map, or a genealogy tree for gaining insight and understanding Ground image: An evolutionary (binary rooted) tree, an outdated and simplified schema, to be annotated with mapped data Task: Mapping data of gene families onto an evolutionary tree to gain insight of gene histories involving events of: Emergence inheritance duplication | two mechanisms of horizontal transfer | evolutionary change loss Hybridization is not considered here Data: 66 species 4873 gene families (COGs) Ground image: Evolutionary tree over 66 genomes Example of a gene family: COG1035 Coenzyme F420-reducing hydrogenase, beta subunit 22 genes that occur in 8 species of 66 Distances between 8 genomes according to COG1035 proteins Afu Mac Mja Mka Mth Nos Sme Syn 2.00 2.09 1.93 2.17 3.28 3.53 3.33 1.04 1.09 0.96 2.71 3.21 2.76 0.63 0.45 2.62 3.20 2.66 0.86 2.70 3.04 2.75 2.70 2.73 2.84 4.00 0.22 4.00 A history of COG1035 with mapping: COG1035 66 129 128 65 64 63 130 62 61 125 127 60 124 122 121 123 126 120 118 131 119 59 58 57 56 55 54 53 117 116 52 51 50 49 48 111 47 110 115 46 109 112 108 107 45 44 43 42 41 114 40 103 39 102 100 99 101 38 37 36 104 98 35 97 96 34 33 32 113 95 105 31 94 93 91 92 90 30 29 28 27 26 25 24 23 86 22 85 21 82 84 106 20 81 19 80 87 83 18 79 78 77 17 16 15 14 13 76 88 75 12 11 10 73 89 71 72 74 70 69 9 8 7 6 5 4 3 68 67 2 1 Ecu Sce Spo Hbs Mac Afu Mka Mth Mja Pho Pab Tac Tvo Sso Ape Pya Tma Aae Dra Cgl Mle Mtu MtC Syn Nos Fnu Cac Lla Spy Spn Sau Lin Bsu Bha Mpu Uur Mpn Mge Ctr Cpn Tpa Bbu Xfa Pae Vch Buc Ype Sty Eco EcZ Ecs Hin Pmu Rso Nme NmA Ccr Atu Sme Bme Mlo Rpr Rco Cje Hpy jHp Two ways: (i) Building a within COG species tree with the distances and then MAPPING the gene tree to the species tree; models: emergence inheritance duplication loss NO HORIZONTAL TRANSFER (ii) Mapping COG data onto the species tree directly, never done before – work in progress; models: emergence inheritance horizontal transfer loss NO DUPLICATION so far (i) Mapping individual gene trees to a species tree Reconciled tree . a b c (i) a1 b1 c1 a (ii) a2 b c b2 c2 (iii) Figure 1. A gene tree (i) versus species tree (ii): the reconciled tree (iii) explains the difference by the duplication of the species tree with losses of lineages leading to c1, a2 and b2. (1) intuitively appealing pictures (2) accommodates the presence of the so-called paralogs, similar proteins in a species (3) difficult to handle when there are many different gene trees Annotating duplication +– + – a b (i) c a + (ii) b + c – Figure 2. Species in the left and right subtrees of gene tree (i) annotated by symbols + and – in the species tree (ii), with the follow-up raising the symbols along the tree. (1) less intuitively appealing (2) maps different gene trees onto the same species tree without changing the latter (3) can be analysed mathematically: Theorem: The total number of losses at mapping a gene tree to the species tree is equal to the number of incompatible pairs plus twice the number of duplications [MIR 95], [EUL 98]. Lca mapping a b (i) c a + b + (ii) c – Figure 3. Lca mapping: internal nodes of gene tree (i) onto species tree (ii): 1 contraction, 3 retractions. (i) can be done in linear time over number of species (ii) Theorem: The inconsistencies, contraction (1) and retraction (2) in mapping parent-child pairs, correspond to duplications and to those duplications for which the collateral children are losses [EUL 98]. (iii) Theorem: The total number of losses is the number of all the intermediate nodes in retractions plus the number one-sided contractions under the lca mapping M [EUL 98]. Distance matrices to model horizontal transfer events: 2 stages 1 stage. Parsimoniously mapping an individual gene set to the tree: algorithm PARS PARENT Inherited Not inherited CHILD1 Gain Loss Inherited Gi1 Li1 Not inher Gn1 Ln1 Gain Loss Gi Li Gn Ln CHILD2 Gain Loss Inherited Gi2 Li2 Not inher Gn2 Ln2 Figure 4. Events in a parent-children triple according to a parsimonious scenario. Initial setting: At a leaf node the four sets Gi, Li, Gn and Ln are empty, except that Gn ={a} if gene a is present in the given leaf or Li = {a} if a is not present. Criterion: #LOSS + gp#HT MIN Presence-absence pattern of COG0572 Uridine Kinase y o z m k a p b l w d c r n s f g h e x j u t i q v Reconstructed MP history of COG 0572 Uridine Kinase (A static scenario) presence gain 6 events: 4 losses 2 gains inheritance loss y o z m k a p b l w d c r n s f g h e x j u t i q v Reconstructing LUCA at 26 cellular species - 26 species (bacteria, archaea and yeast) - 3166 gene families (COGs) - gain penalty from 0.1 to 10 Reconstructed contents of the root (LUCA) an external criterion for selecting gain penalty: At gp=1, 572 genes in LUCA best approximated an organism capable to survive [Mir03] This leads to equal empirical frequencies of losses and gains 2 stage. Principle of maximum correlation (i) evolutionary tree must be clocked (ii) directed scenario: Each gain, g, has its source, s L L S S T R T R Time 100 70 20 a b c d e a b c d e Figure 5. Horizontal transfer event; (C1) directly from node S to node a and (C2) from the middle of T's life-span to the middle of a's life span. Distances between a, b and c: No transfer C1 c d D=|200 200| | 140| C2 c d c d D1=|70 70| D2=|100 50| a | 140| | 140| c Principle: find such a directed scenario that is best matching the observed distance matrix between proteins Virtually reticulated trees? b a c d e b cad e Figure 6. Changed tree topologies corresponding to directed scenarios C1 and C2 on Figure 5. Synchronization of source-target timing D B D D C B C A B C A A Figure 6. Generic patterns of interrelation between life spans of a horizontal transfer target, AB, (vertical line on the left) and its source, CD (vertical line on the right), to generate the synchronising distance between them, AB at (i) or AD at (ii) and (iii). Deriving weight: horizontal transfer penalty At 66 genomes, with principle of maximum correlation applied to build correlated distances over maximum parsimony scenarios, of 4872 COGs: 3975 correspond to gp=1 602 to gp=2 295 to gp=3 thus concurring with the results of our previous analysis of 26 genomes Future work: - Complete a mathematical computational framework for principle of maximum correlation and the - Include paralogous (representing multiple copies of a gene within a genome) proteins to model duplication - Extend model to virus and other species types (to include host effects) - Include more data on gene families