Trying to reconstruct the history of gene families

Roberto Marangoni*^, Nadia Pisanti*, Paolo Ferragina*^, Antonio Frangioni*, Fabrizio Luccio*^
*Dept. of Informatics, University of Pisa, Italy
^C.I.S.S.C. (Interdisciplinary Center for Complex Systems Study), University of Pisa, Italy
E-mail: marangon@di.unipi.it

1. Evolution, Information and Complexity

These three concepts are hard to define when referred to a biological organism. We can give only “working” definitions:
1) Evolution: recalling the classic Darwinian definition, it is “descent with modifications” (i.e., sons are not equal to fathers); up to now, however, there is no generally accepted definition of biological evolution.
2) Information: when referred to a biological organism, we can define it as the information stored in the genome, even if this is not completely true, since the development of an organism is specified not only by the DNA but also by the concurrence of maternal RNA, proteins and other factors.
3) Complexity: a tentative definition can be found by following something like an “algorithmic” approach. One can ask: how many words are enough to describe a bacterium? And how many to describe a human? Of course, in the latter case many more words are needed, and in this sense one can say that a human is more complex than a bacterium.

2. Duplications, gene families and paralogs

Two kinds of mechanisms able to create new information in genomes are described in the literature:
1) Exogenous mechanisms, like horizontal transfers and transfections, whose final result is the insertion into a genome of a DNA segment coming from another species. Even if this kind of process is important, its quantitative contribution to a genome is not so relevant.
2) Endogenous mechanisms, mostly represented by duplications. Whole-genome, large-segment, tandem and single-gene duplications have been described. Duplications make genomes clusterizable into gene families.
Usually, members of a gene family share a high sequence homology and, when they are functionally active, they perform very similar biological functions: they are called paralog genes, or simply paralogs. Quantitatively, endogenous mechanisms are the most important process leading to an increase of the genomic information.

Biological evolution has generated more and more complex organisms; a high complexity corresponds to a high information content, and therefore the general problem for biological evolution becomes:

How does information increase during evolution?

In order to better understand the mechanism by which genomes have increased their size and multiplied their functional capabilities, it is necessary to study the behavior of duplication events; the first step is to investigate the history of a gene family: how many duplication events have occurred, in which order, etc.

How to reconstruct the history of gene families?

3. Building a paralogy tree

Reconstructing the history of a gene family, under the hypothesis that every family member derives from the duplication of another member, means arranging the set of members into a tree, which we call a paralogy tree, in which the root represents the most ancient gene of the family and each directed arc represents the matrix-copy relationship in a duplication process.

This is not a phylogenesis study and this is not a phylogenetic tree! Unlike phylogenetic studies, in which one measures the similarity between two or more sequences in order to infer their possible common ancestor, in this kind of study we need an asymmetric function to compare two sequences, one able to express which sequence could have been the matrix and which the copy in a hypothetical duplication process.

This kind of function has to address two basic biological requirements, which derive from the presently known duplications:
1) Copies are usually shorter than matrices, since the insertion of a segment after a duplication is a rare event.
2) Inserting segments has metabolic costs, while deleting segments has no cost.

We have developed a method for paralogy tree construction based on the Transformation Distance (TD) [1] as the basic function to compare two sequences, since it addresses the biological requirements stated above.

How to obtain the paralogy tree?

4. PaTre

The method we present, called PaTre, is made up of the following steps:
1) Input: all the paralog sequences of a family.
2) Computation of the TD values for each possible pair of paralogs in the input set, and construction of the directed graph (see fig. 1) that expresses, for each pair, the probability of the relationship matrix-copy/copy-matrix.
3) Extraction of the Lightest Spanning Arborescence (LSA) by means of Edmonds' algorithm [2,3]. We take the extracted LSA as the paralogy tree (fig. 2).

We ask PaTre to output not only the optimal solution but also the sub-optimal ones, which are useful in the following.

Fig. 1: example of a directed graph
Fig. 2: example of PaTre output

5. Testing PaTre

Unfortunately, there are no documented histories of gene families in nature, so we used a simulation procedure to test PaTre. We have developed a simulator that receives a gene as input and generates (by iterating the duplication-with-modification mechanism) a family of simulated paralogs, the history of which is, of course, known (fig. 3).

Fig. 3: example of simulator output

5b. Testing the simulator

To test how similar the simulated data are to real ones, we ran the simulator on different gene families, starting from different sequences, and then applied a standard clustering algorithm to an input containing both simulated and real sequences. The results show that, for several of the most widespread gene families, the generated clusters contain both simulated and real sequences, thus demonstrating a good degree of “mimetic” capability of the simulator, as in the example shown in fig. 4, where the simulated sequences are named “str##” and the real sequences are named “AF…”.

Fig. 4: similarity tree computed on simulated and real sequences

6. Applications on simulated data

Applying PaTre to simulated families, we always get the correct tree. Fig. 5 compares the simulated tree with the one reconstructed by PaTre: they overlap completely. We have tested PaTre on more than 60 families in 20 different organisms.

Fig. 5, panel: output from PaTre for the simulated Ribosomal Protein family of M. pneumoniae (MFINFRP). Cost: 7840 - Distance: 0%

PaTre passes the test on simulated data

7. Using similarity-based algorithms

If we use a simulated family as input for a similarity-based algorithm like ClustalW, and try to generate something like a phylogenetic tree from those data, we get a tree that is completely different from the true one. Fig. 6 shows the output of ClustalW obtained on the same input set as the example in Fig. 5: it is not the expected tree.

Similarity-based algorithms are not suitable to reconstruct the history of gene families.
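As a rough illustration of the PaTre steps listed in section 4 (all-against-all asymmetric comparison, directed graph, lightest spanning arborescence), a minimal Python sketch is given here. It is not the authors' implementation. The Transformation Distance is replaced by a much simpler stand-in, `asymmetric_cost`, in which inserting or substituting a character costs 1 and deleting costs 0; these values, all function names and the toy sequences are hypothetical. The arborescence routine follows the cycle-contraction scheme of Edmonds' algorithm [2,3]. Since the text does not say how the root is selected, the sketch simply tries every gene as root and keeps the cheapest tree, which is an assumption.

```python
from itertools import permutations


def asymmetric_cost(matrix, copy, ins=1, dele=0, sub=1):
    """Cost of turning `matrix` into `copy` (a stand-in for TD, not TD itself)."""
    prev = [j * ins for j in range(len(copy) + 1)]
    for i in range(1, len(matrix) + 1):
        cur = [i * dele]
        for j in range(1, len(copy) + 1):
            same = matrix[i - 1] == copy[j - 1]
            cur.append(min(prev[j] + dele,        # drop a matrix character
                           cur[j - 1] + ins,      # insert a copy character
                           prev[j - 1] + (0 if same else sub)))
        prev = cur
    return prev[-1]


def find_cycle(parent):
    """Return the set of nodes on a cycle of the child -> parent map, or None."""
    for start in parent:
        path, v = [], start
        while v in parent and v not in path:
            path.append(v)
            v = parent[v]
        if v in path:
            return set(path[path.index(v):])
    return None


def lightest_arborescence(edges, root):
    """Edmonds-style contraction; `edges` maps (matrix, copy) -> weight.
    Assumes every non-root node has at least one incoming edge."""
    best_in = {}
    for (u, v), w in edges.items():
        if v != root and (v not in best_in or w < edges[(best_in[v], v)]):
            best_in[v] = u
    cycle = find_cycle(best_in)
    if cycle is None:
        return best_in
    node = ("cycle",) + tuple(sorted(cycle, key=str))  # contracted super-node
    new_edges, origin = {}, {}
    for (u, v), w in edges.items():
        if u in cycle and v in cycle:
            continue
        if v in cycle:      # entering edge: pay only the extra over the cycle edge
            key, w = (u, node), w - edges[(best_in[v], v)]
        elif u in cycle:
            key = (node, v)
        else:
            key = (u, v)
        if key not in new_edges or w < new_edges[key]:
            new_edges[key], origin[key] = w, (u, v)
    contracted = lightest_arborescence(new_edges, root)
    parent = {}
    for v, u in contracted.items():
        orig_u, orig_v = origin[(u, v)]
        parent[orig_v] = orig_u
    for v in cycle:         # keep cycle edges except the one displaced by the entering edge
        parent.setdefault(v, best_in[v])
    return parent


def paralogy_tree(seqs):
    """Try every gene as root (an assumption) and keep the cheapest tree."""
    edges = {(u, v): asymmetric_cost(seqs[u], seqs[v])
             for u, v in permutations(seqs, 2)}
    best = None
    for root in seqs:
        parent = lightest_arborescence(edges, root)
        cost = sum(edges[(p, c)] for c, p in parent.items())
        if best is None or cost < best[0]:
            best = (cost, root, parent)
    return best


# Toy family (hypothetical): g1 derives from g0, g2 derives from g1.
family = {"g0": "ACGTACGTAC", "g1": "ACGTTCGT", "g2": "AGTTCCT"}
cost, root, parent = paralogy_tree(family)
print(cost, root, parent)
```

On this toy family the sketch selects g0 as root, with g1 copied from g0 and g2 copied from g1, at a total cost of 2. Because deletions are free in the stand-in cost, the longer sequence tends to be chosen as matrix, in line with the two biological requirements of section 3.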
Fig. 5 (see above), panel: the simulated paralogy tree for the Ribosomal Proteins family of M. pneumoniae
Fig. 6: the tree reconstructed by ClustalW on the same data of Fig. 5

8. Applications to real cases

We have applied PaTre to some real cases in which experimental evidence suggests the possible history of gene families. In particular, we have tested:
1) Bacterial duplications, in which PaTre has always identified a duplication process linking two genes known to be duplicates.
2) The Shaggy/GSK3 family in Arabidopsis thaliana, where the evidence of some duplication events [4] has been confirmed by the paralogy tree reconstructed by PaTre (Fig. 7a and 7b).

Fig. 7: a) the Shaggy/GSK3 family of Arabidopsis thaliana in a similarity tree: the clouds identify duplication events; b) the paralogy tree reconstructed by PaTre (Cost: 4430 - Distance: 0%)

The degree of reliability of PaTre is also supported by experimental evidence

9. Open problems

There are still several open problems concerning, in particular:
1. a detailed study of the robustness of PaTre;
2. a method to take into account Steiner points;
3. the design of an optimal distance to use in the all-against-all comparison.

Further work is required, of course

10. Future developments

Tree comparisons; the probability of choosing a gene as matrix for a new duplication: does it depend on the gene “age” or not? (different answers lead to different trees…); using paralogy trees built for the same family in different organisms to extract phylogenetic information.

If we have grants, we will do everything!

References
1) J.S. Varré, J.P. Delahaye, E. Rivals, The Transformation Distance: a dissimilarity measure based on movements of segments, German Conference on Bioinformatics, Köln, 1998.
2) J. Edmonds, Optimum branchings, J. Res. Nat. Bur. Standards, 71B, 233-240, 1967.
3) R.E. Tarjan, Finding optimum branchings, Networks, 7, 25-35, 1977.
4) R. Tavares, Contribution a la caracterisation de la sous-famille des proteines serine/threonine kinases du type SHAGGY/GSK-3 chez Arabidopsis thaliana, University of Paris Sud, 2000.