homologous genomics

ISMB 2004 tutorial: Orthology Analysis
Prof. Erik Sonnhammer
Center for Genomics and Bioinformatics
Karolinska Institutet
S-17177 Stockholm
SWEDEN
Email: Erik.Sonnhammer@cgb.ki.se
Abstract
The increasing availability of complete genome sequences is revolutionizing the
study of molecular evolution and comparative genomics. By comparing the gene
contents of different genomes, we gain direct insight into which protein functions are
shared and which have evolved in independent ways. This is important for
understanding how novel protein functions arise, how phenotype is encoded in
genes, and for predicting which genes have the same function in different organisms.
When comparing genomes, the terms orthologs and paralogs are commonly used.
Unfortunately there is widespread confusion about their exact meanings. Many
researchers believe that orthologs are simply genes (proteins) with the same function
in different organisms, whereas paralogs are simply homologs within one organism.
This, however, does not agree with the original definitions of orthology and paralogy
given by Fitch in 1970. This tutorial will explain the original, evolutionary definition of
these and related terms, and survey the recent developments in the field. The theory
of a number of approaches that can be used to detect orthologs and paralogs from
gene sequences will be described. Some of these approaches will be studied in
more detail, particularly those that have been used to generate databases of
orthologs. Concrete example applications will be given, both from direct analysis
and from mining available ortholog databases.
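The evolutionary definition mentioned above can be illustrated with a small sketch. It assumes a gene tree whose internal nodes have already been labeled as speciation or duplication events; two genes are then called orthologs if their last common ancestor is a speciation node and paralogs if it is a duplication node. The toy tree, the gene names, and the function names are hypothetical and are not taken from any of the tools covered in the tutorial.

```python
# Hypothetical toy gene tree: each internal node is (event, left, right),
# where event is "speciation" or "duplication"; each leaf is a gene name.
# An ancient duplication produced subfamilies A and B before human and fly split.
TREE = ("duplication",
        ("speciation", "human_A", "fly_A"),
        ("speciation", "human_B", "fly_B"))


def leaves(node):
    """Return the list of gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label every gene pair by the event at its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, left, right = node
    relation = "orthologs" if event == "speciation" else "paralogs"
    # Genes on opposite sides of this node have it as their last common ancestor.
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = relation
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


relations = classify_pairs(TREE)
for pair, relation in sorted((sorted(p), r) for p, r in relations.items()):
    print(pair, relation)
```

In this toy tree, human_A and fly_B come out as paralogs even though they belong to different organisms, which is exactly the case that the "same function in different organisms" reading gets wrong.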
Audience
Bioinformaticians, sequence analysts, computational biologists, and other
researchers in the fields of genome annotation and comparative genomics. Basic
knowledge of biology, homology analysis, phylogeny, and molecular evolution is a
good background.
Length: half day (4 hours)
Detailed outline of the presentation:
1.Concepts
1.1.Homologous genes – what does it mean?
1.2.Fitch's 1970 definitions of orthology, paralogy, xenology, etc.
1.3.The need for complete genome sequences in orthology analysis
1.4.Overview of available genomes; Model organisms.
1.5.What Fitch forgot: co-orthologs and different types of paralogs.
1.6.Evolution through duplication vs. gene loss. Difference between eukaryotic
and prokaryotic rates of duplication.
1.7.The relation between orthology and function conservation.
1.8.Duplication and function divergence
2.Methods for ortholog identification
2.1.Tree-based approaches: basic concepts, full or partial tree reconciliation
2.2.Blast-based approaches: basic concepts, simple pairwise or multiple genes
2.3.Including more than two species
2.4.COGs
2.5.OrthoMCL
2.6.DARWIN
2.7.Inparanoid
2.8.RIO
2.9.Orthostrapper
3.Databases of orthologs
3.1.COGs/KOGs
3.2.OrthoMCL
3.3.TOGA/EGO
3.4.Inparanoid
3.5.HOPS
3.6.Orthologs of disease genes: EGO or Orthodisease (based on Inparanoid)
4.Application
4.1.Example: finding all human-fly orthologs
4.2.Example: Use COGs or RIO to detect orthologs among bacteria.
4.3.Example: Use RIO, TOGA or Orthostrapper to find orthologs among
eukaryotes.
4.4.Summary of different approaches: strengths and weaknesses of the
phylogenetic vs. the blast-based approaches
4.5.How to find non-orthologs (lineage-specific genes)?
4.6.Lineage-specific expansions (clusters of co-orthologs).
5.Discussion
5.1.Limitations of current approaches
5.2.Expected impact of future genome sequencing on ortholog analysis
5.3.Conclusions
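As a rough illustration of the simple pairwise Blast-based idea in item 2.2, the sketch below pairs genes from two species when each is the other's best-scoring hit (reciprocal best hits). This is one common simple scheme and is an assumption here, not necessarily the procedure presented in the tutorial or used by any listed tool. The gene names and scores are made up, and no real Blast output is parsed.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring hit.

    `scores` maps (query, hit) pairs to a similarity score.
    On ties, the first pair encountered is kept.
    """
    best = {}
    for (query, hit), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Made-up similarity scores between genes of species A (hA*) and species B (fB*).
A_VS_B = {("hA1", "fB1"): 250.0, ("hA1", "fB2"): 90.0,
          ("hA2", "fB2"): 180.0, ("hA3", "fB1"): 120.0}
B_VS_A = {("fB1", "hA1"): 245.0, ("fB1", "hA3"): 118.0,
          ("fB2", "hA2"): 175.0, ("fB2", "hA1"): 88.0}

pairs = reciprocal_best_hits(A_VS_B, B_VS_A)
print(pairs)
```

Here hA3 is left unpaired because its best hit, fB1, prefers hA1. A strict one-to-one scheme like this cannot represent co-orthologs arising from lineage-specific duplications, which is part of what items 1.5 and 4.6 address.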
References
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S,
Howe KL, Marshall M, Sonnhammer EL.
"The Pfam protein families database"
Nucleic Acids Res. 30:276-280 (2002)
Fitch, W.M. (1970) Distinguishing homologous from analogous proteins. Syst Zool
19: 99-113
Fitch, W.M. (2000) Homology: a personal view on some of the problems. Trends
Genet. 16: 227-231
Hollich V, Storm CEV and Sonnhammer ELL
"OrthoGUI: graphical presentation of Orthostrapper results"
Bioinformatics, 18:1272-1273 (2002)
Hirsh and Fraser 2001. Nature 411: 1046-1049
Jensen RA.
Orthologs and paralogs - we need to get it right.
Genome Biol. 2001, 2:INTERACTIONS1002
Jordan, I.K. et al. (2001)
Lineage-specific gene expansions in bacterial and archaeal genomes.
Genome Res. 11, 555-565
Jordan et al.,
Bacteria: clear difference in rates between essential and nonessential.
Genome Res. 2002 12:962-968.
Kondrashov et al., Genome Biol 2002 3:RESEARCH0008
Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F,
Antonescu V, White J, Holt I, Liang F, Quackenbush J.
Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments
(TOGA).
Genome Res. 2002, 12:493-502.
Li L, Stoeckert CJ Jr, Roos DS.
OrthoMCL: identification of ortholog groups for eukaryotic genomes.
Genome Res. 2003 13:2178-2189.
Pal et al., Nature 421:496
Remm M, Storm CEV, and Sonnhammer ELL
"Automatic clustering of orthologs and in-paralogs from pairwise species
comparisons"
J. Mol. Biol. 314:1041-1052 (2001)
Sonnhammer ELL and Koonin EV
"Orthology, paralogy and proposed classification for paralog subtypes"
Trends Genet. 18:619-620 (2002)
Storm C and Sonnhammer ELL
"Automated ortholog inference from phylogenetic trees and calculation of orthology
reliability"
Bioinformatics 18:92-99 (2002)
Storm C and Sonnhammer ELL
"NIFAS: Visual analysis of domain structure evolution"
Bioinformatics 17:343-348 (2001)
Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS,
Kiryutin B, Galperin MY, Fedorova ND, Koonin EV.
The COG database: new developments in phylogenetic classification of proteins from
complete genomes.
Nucleic Acids Res. 2001, 29:22-8.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov
DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV,
Vasudevan S, Wolf YI, Yin JJ, Natale DA.
The COG database: an updated version includes eukaryotes.
BMC Bioinformatics. 2003 Sep 11;4(1):41.
Zmasek CM, Eddy SR.
RIO: Analyzing proteomes by automated phylogenomics using resampled inference
of orthologs.
BMC Bioinformatics. 2002, 3:14
Zmasek CM, Eddy SR.
A simple algorithm to infer gene duplication and speciation events on a gene tree.
Bioinformatics. 2001, 17:821-8.
About the presenter
Professor Erik Sonnhammer has been at the Karolinska Institutet, Stockholm,
Sweden, since 1998, heading the bioinformatics unit at the Center for Genomics and
Bioinformatics. Previously he was at the National Center for Biotechnology
Information (NCBI) in Bethesda, MD, USA, and at the Sanger Centre in Cambridge,
UK, where he developed tools for genomic sequence analysis, including the Pfam
database of protein domain families. He has a number of publications on orthology
analysis and protein family and protein domain analysis. Professor Sonnhammer has
taught bioinformatics courses for many years, mainly in the field of comparative
genomics and protein domain family analysis.