Trying to reconstruct the history of gene families
Roberto Marangoni*^, Nadia Pisanti*, Paolo Ferragina*^, Antonio Frangioni*, Fabrizio Luccio*^
*Dept. of Informatics, University of Pisa, Italy
^C.I.S.S.C. (Interdisciplinary Center for Complex Systems Study), University of Pisa, Italy.
E-mail: marangon@di.unipi.it
1. Evolution, Information and Complexity
These three concepts are hard to define when referred to a biological organism. We can give only “working”
definitions:
1) Evolution: recalling the classic Darwinian definition, it is “descent with modification” (i.e., offspring are not identical to
their parents); to date, however, there is no generally accepted definition of biological evolution.
2) Information: for a biological organism, we can define it as the information stored in the genome,
although this is not completely accurate, since the development of an organism is specified not only by the DNA but also
by the concurrence of maternal RNA, proteins and other factors.
3) Complexity: a tentative definition can be found by following something like an “algorithmic” approach. One can
ask: how many words are enough to describe a bacterium? And how many to describe a human? Of course,
many more words are needed in the latter case, and in this sense one can say that a human is more complex than a
bacterium.
Biological evolution has generated more and more complex organisms, and high complexity corresponds to
high information content; the general problem for biological evolution therefore becomes:
How does information increase during evolution?
2. Duplications, gene families and paralogs
Two kinds of mechanisms able to create new information in genomes are described in the literature:
1) Exogenous mechanisms, like horizontal transfers and transfections, whose final result is the insertion into a
genome of a DNA segment coming from another species. Even if this kind of process is important, its quantitative
contribution to a genome is comparatively small.
2) Endogenous mechanisms, mostly represented by duplications. Whole-genome, large-segment, tandem and single-gene duplications have been described. Duplications make it possible to cluster genomes into gene families.
Usually, members of a gene family share a high degree of sequence similarity and, when they are
functionally active, perform very similar biological functions: they are called paralog genes or simply
paralogs. Endogenous mechanisms are quantitatively the most important process leading to an increase of
genomic information.
To better understand the mechanism by which genomes have increased their size and multiplied their
functional capabilities, it is necessary to study the behavior of duplication events; the first step is to investigate
the history of a gene family: how many duplication events have occurred, in which order, etc.
How to reconstruct the history of gene families?
3. Building a paralogy tree
To reconstruct the history of a gene family, under the hypothesis that every family member derives from the
duplication of another member, means to arrange the set of members into a tree, which we call the paralogy tree, in
which the root represents the most ancient gene of the family and each directed arc represents the matrix-copy
relationship in a duplication event.
This is not a phylogenetic study, and this is not a phylogenetic tree!
Unlike phylogenetic studies, in which one measures the similarity between two or more sequences in
order to infer a possible common ancestor, this kind of study needs an
asymmetric function to compare two sequences, one able to express which sequence could have been the
matrix and which the copy in a hypothetical duplication event.
Such a function has to address two basic biological requirements, which derive from the presently known
duplications:
1) Copies are usually shorter than matrices, since the insertion of a segment after a duplication is a rare event.
2) Inserting segments has a metabolic cost, while deleting segments has none.
How to obtain the paralogy tree?
4. PaTre
We have developed a method for paralogy tree construction based on the Transformation Distance (TD) [1] as the
basic function to compare two sequences, since it addresses the biological requirements stated above.
The method we present, called PaTre, consists of the following steps:
1) Input: all the paralog sequences of a family.
2) Computation of the TD values for each possible pair of paralogs in the
input set, and construction of the directed graph (see Fig. 1) that expresses, for
each pair, the probability of the matrix-copy/copy-matrix relationship.
3) Extraction of the Lightest Spanning Arborescence (LSA) by means of
Edmonds' algorithm [2,3].
We take the extracted LSA as the paralogy tree (Fig. 2).
We ask PaTre to output not only the optimal solution but also the
sub-optimal ones, which are useful in what follows.
Fig. 1: example of a directed graph
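The PaTre steps (pairwise TD values, directed graph, lightest spanning arborescence) can be sketched in Python as follows. This is a minimal illustration, not PaTre's implementation: the TD values are made-up numbers, lower weights are assumed to mean a more plausible matrix-to-copy relationship, and the unknown root is handled by adding a virtual root node, which is an assumption of this sketch. The names `min_arborescence` and `paralogy_tree` are hypothetical; the arborescence is found with a recursive cycle-contraction version of Edmonds' algorithm [2].

```python
def _find_cycle(parent):
    """Return the set of nodes on a cycle of the child -> parent map, or None."""
    for start in parent:
        path, seen, v = [], set(), start
        while v in parent and v not in seen:
            seen.add(v)
            path.append(v)
            v = parent[v]
        if v in seen:
            return set(path[path.index(v):])
    return None


def min_arborescence(edges, root):
    """Lightest spanning arborescence rooted at `root`.

    `edges` maps (matrix, copy) -> weight; the result maps copy -> matrix.
    """
    # Step 1: keep the cheapest incoming arc of every node except the root.
    best_in = {}
    for (u, v), w in edges.items():
        if u != v and v != root:
            if v not in best_in or w < edges[(best_in[v], v)]:
                best_in[v] = u
    cycle = _find_cycle(best_in)
    if cycle is None:
        return best_in
    # Step 2: contract the cycle into one fresh node `c`.
    c = object()
    new_edges, origin = {}, {}
    for (u, v), w in edges.items():
        if u in cycle and v in cycle:
            continue
        if v in cycle:    # arc entering the cycle: pay only the extra cost
            key, w = (u, c), w - edges[(best_in[v], v)]
        elif u in cycle:  # arc leaving the cycle
            key = (c, v)
        else:
            key = (u, v)
        if key not in new_edges or w < new_edges[key]:
            new_edges[key], origin[key] = w, (u, v)
    # Step 3: solve the smaller problem, then expand the contracted node.
    tree = {}
    for v, u in min_arborescence(new_edges, root).items():
        orig_u, orig_v = origin[(u, v)]
        tree[orig_v] = orig_u
    for v in cycle:
        tree.setdefault(v, best_in[v])  # keep cycle arcs except the broken one
    return tree


def paralogy_tree(td):
    """Return (root gene, {copy: matrix}) for non-negative pairwise weights."""
    genes = {g for pair in td for g in pair}
    virtual_root = object()
    big = sum(td.values()) + 1  # heavy enough that only one gene hangs from it
    edges = dict(td)
    for g in genes:
        edges[(virtual_root, g)] = big
    parent = min_arborescence(edges, virtual_root)
    root = next(g for g, p in parent.items() if p is virtual_root)
    return root, {g: p for g, p in parent.items() if p is not virtual_root}


# Hypothetical TD values for three paralogs: td[(matrix, copy)].
td = {("A", "B"): 1, ("B", "A"): 2, ("A", "C"): 2,
      ("C", "A"): 5, ("B", "C"): 3, ("C", "B"): 5}
root, tree = paralogy_tree(td)
print(root, tree)  # A is chosen as root, with B and C as its copies
```

The virtual root is one way to let the algorithm pick the most ancient gene itself; if the root were known in advance, `min_arborescence` could be called on the TD graph directly.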
5. Testing PaTre
Unfortunately, there are no documented histories of gene families in
nature, so we used a simulation procedure to test PaTre.
We developed a simulator that receives a gene as input
and generates, by iterating the duplication-with-modification
mechanism, a family of simulated paralogs whose history is, of
course, known (Fig. 3).
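The duplication-with-modification loop can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' simulator: the matrix of each duplication is chosen uniformly at random, and each copy is obtained by deleting one random segment and applying random point substitutions. The names `simulate_family`, `max_del` and `sub_rate` are hypothetical.

```python
import random


def simulate_family(seed_gene, n_copies, max_del=5, sub_rate=0.02, rng=None):
    """Grow a family by repeated duplication-with-modification.

    Returns (sequences, history): `sequences` maps name -> sequence and
    `history` maps each simulated copy to the name of its matrix.
    """
    rng = rng or random.Random(0)
    sequences = {"root": seed_gene}
    history = {}
    for i in range(1, n_copies + 1):
        matrix = rng.choice(list(sequences))  # assumption: uniform choice
        seq = sequences[matrix]
        # Delete one random segment (copies are usually shorter than matrices).
        length = rng.randint(0, min(max_del, len(seq) - 1))
        start = rng.randint(0, len(seq) - length)
        seq = seq[:start] + seq[start + length:]
        # Replace each base by a random one with probability sub_rate.
        seq = "".join(rng.choice("ACGT") if rng.random() < sub_rate else base
                      for base in seq)
        name = f"str{i:02d}"
        sequences[name] = seq
        history[name] = matrix
    return sequences, history


family, true_tree = simulate_family("ATGGCTAAGCGTACCGGTTTAGCTAGCTAA", n_copies=5)
for name, matrix in true_tree.items():
    print(f"{matrix} -> {name}: {family[name]}")
```

Because `history` records the matrix of every copy, it plays the role of the known paralogy tree against which a reconstruction can be checked.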
5b. Testing the simulator
To test how similar the simulated data are to real ones, we
ran the simulation on different gene families, starting from different
sequences, and then applied a standard clustering algorithm to an
input of both simulated and real sequences. The results show that, for
several of the most widespread gene families, the generated clusters
contain both simulated and real sequences, demonstrating a good
degree of “mimetic” capability of the simulator. An example is
shown in Fig. 4, where the simulated sequences are named “str##”
and the real sequences are named “AF…”.
Fig. 2: example of PaTre output
Fig. 3: example of simulator output
Fig. 4: similarity tree computed on simulated and real sequences
[Tree figure residue. Header: “Cost: 7840 - Distance: 0%”; node labels: MFINFRP, str01 to str08.]
6. Applications on simulated data
Applying PaTre to simulated families, we always obtain the
correct tree. Fig. 5 compares the simulated tree with the one
reconstructed by PaTre: the two coincide completely.
We have tested PaTre on more than 60 families in 20 different
organisms.
PaTre passes the test on simulated data.
Fig. 5: the simulated paralogy tree for the Ribosomal Proteins family of M. pneumoniae, and the output from PaTre for the same simulated family
7. Using similarity-based algorithms
If we use a simulated family as input for a similarity-based
algorithm like ClustalW and try to generate something like a
phylogenetic tree from those data, we get a tree that is
completely different from the true one.
Fig. 6 shows the output of ClustalW on the same input set
as the example in Fig. 5: it is not the expected tree.
Similarity-based algorithms are not suitable for reconstructing
the history of gene families.
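One simple way to quantify how far a reconstructed paralogy tree is from the simulated one (section 6) is to count the genes whose matrix differs between the two trees. This is only an illustrative measure: the poster does not define the “Distance” percentage reported in the PaTre output, and `tree_distance` is a hypothetical helper.

```python
def tree_distance(true_tree, rebuilt_tree):
    """Percentage of genes whose matrix differs between two {copy: matrix} maps."""
    genes = set(true_tree) | set(rebuilt_tree)
    if not genes:
        return 0.0
    wrong = sum(true_tree.get(g) != rebuilt_tree.get(g) for g in genes)
    return 100.0 * wrong / len(genes)


simulated = {"str01": "root", "str02": "str01", "str03": "str01"}
identical = dict(simulated)
different = {"str01": "root", "str02": "root", "str03": "str02"}
print(tree_distance(simulated, identical))  # 0.0: the trees coincide
print(tree_distance(simulated, different))  # two of the three arcs differ
```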
Fig. 6: the tree reconstructed by ClustalW on the same
data as Fig. 5
8. Applications to real cases
We have applied PaTre to some real cases in which experimental
evidence suggests the possible history of
gene families. In particular, we have tested:
1) Bacterial duplications, in which PaTre has always identified a
duplication event linking two genes known to be duplicates.
2) The Shaggy/GSK3 family in Arabidopsis thaliana, where the
evidence for some duplication events [4] has been confirmed
by the paralogy tree reconstructed by PaTre (Fig. 7a and 7b).
The reliability of PaTre is also supported by
experimental evidence.
[Tree figure residue. Header: “Cost: 4430 - Distance: 0%”; node labels: catalpha, Catbeta, Catgamma, Catdelta, Catepsilon, Catdzeta, Catetha, Catteta, Catiota, Catkappa.]
9. Open problems
There are still several open problems, concerning in particular:
1. a detailed study of the robustness of PaTre;
2. a method to take Steiner points into account;
3. the design of an optimal distance to use in the all-against-all
comparison.
Further work is required, of course.
10. Future developments
Tree comparisons:
does the probability of choosing a gene as the matrix for a new duplication
depend on the gene “age” or not? (Different answers lead to
different trees…);
use of paralogy trees built for the same family in different organisms
to extract phylogenetic information.
If we have grants, we will do everything!
Fig. 7: a) the Shaggy/GSK3 family of Arabidopsis thaliana in a similarity tree: the clouds
identify duplication events; b) the paralogy tree reconstructed by PaTre
References
1) J.S. Varré, J.P. Delahaye, E. Rivals, The Transformation Distance: a dissimilarity measure based on movements of segments, German Conference on Bioinformatics, Köln, 1998.
2) J. Edmonds, Optimum branchings, J. Res. Nat. Bur. Standards, 71B, 233-240, 1967.
3) R.E. Tarjan, Finding optimum branchings, Networks, 7, 25-35, 1977.
4) R. Tavares, Contribution à la caractérisation de la sous-famille des protéines sérine/thréonine kinases du type SHAGGY/GSK-3 chez Arabidopsis thaliana, University of Paris-Sud, 2000.