ISMB 2004 tutorial: Orthology Analysis

Prof. Erik Sonnhammer
Center for Genomics and Bioinformatics
Karolinska Institutet
S-17177 Stockholm
SWEDEN
Email: Erik.Sonnhammer@cgb.ki.se

Abstract

The increasing availability of complete genome sequences is revolutionizing the study of molecular evolution and comparative genomics. By comparing the gene contents of different genomes, we gain direct insight into which protein functions are shared and which have evolved independently. This is important for understanding how novel protein functions arise and how phenotype is encoded in genes, and for predicting which genes have the same function in different organisms.

When comparing genomes, the terms orthologs and paralogs are commonly used. Unfortunately, there is widespread confusion about their exact meanings. Many researchers believe that orthologs are simply genes (proteins) with the same function in different organisms, whereas paralogs are simply homologs within one organism. This does not, however, agree with the original definitions of orthology and paralogy given by Fitch in 1970.

This tutorial will explain the original, evolutionary definition of these and related terms, and survey recent developments in the field. The theory behind a number of approaches that can be used to detect orthologs and paralogs from gene sequences will be described. Some of these approaches will be studied in more detail, particularly those that have been used to generate databases of orthologs. Concrete example applications will be given, both from direct analysis and from mining available ortholog databases.

Audience

Bioinformaticians, sequence analysts, computational biologists, and other researchers in the fields of genome annotation and comparative genomics. A basic knowledge of biology, homology analysis, phylogeny, and molecular evolution provides a good background.

Length: half day (4 hours)

Detailed outline of the presentation:

1. Concepts
1.1. Homologous genes – what does it mean?
1.2. Fitch's definitions of orthology, paralogy, xenology etc. in 1970
1.3. The need for complete genome sequences in orthology analysis
1.4. Overview of available genomes; model organisms
1.5. What Fitch forgot: co-orthologs and different types of paralogs
1.6. Evolution through duplication vs. gene loss; difference between eukaryotic and prokaryotic rates of duplication
1.7. The relation between orthology and function conservation
1.8. Duplication and function divergence

2. Methods for ortholog identification
2.1. Tree-based approaches: basic concepts, full or partial tree reconciliation
2.2. Blast-based approaches: basic concepts, simple pairwise or multiple genes
2.3. Including more than two species
2.4. COGs
2.5. OrthoMCL
2.6. DARWIN
2.7. Inparanoid
2.8. RIO
2.9. Orthostrapper

3. Databases of orthologs
3.1. COGs/KOGs
3.2. OrthoMCL
3.3. TOGA/EGO
3.4. Inparanoid
3.5. HOPS
3.6. Orthologs of disease genes: EGO or Orthodisease (based on Inparanoid)

4. Application
4.1. Example: finding all human-fly orthologs
4.2. Example: using COGs or RIO to detect orthologs among bacteria
4.3. Example: using RIO, TOGA or Orthostrapper to find orthologs among eukaryotes
4.4. Summary of different approaches: strengths and weaknesses of the phylogenetic vs. the Blast-based approaches
4.5. How to find non-orthologs (lineage-specific genes)?
4.6. Lineage-specific expansions (clusters of co-orthologs)

5. Discussion
5.1. Limitations of current approaches
5.2. Expected impact of future genome sequencing on ortholog analysis
5.3. Conclusions

References

Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL. The Pfam protein families database. Nucleic Acids Res. 2002, 30:276-280.
Fitch WM. Distinguishing homologous from analogous proteins. Syst Zool. 1970, 19:99-113.
Fitch WM. Homology: a personal view on some of the problems. Trends Genet. 2000,
16:227-231.
Hollich V, Storm CEV, Sonnhammer ELL. OrthoGUI: graphical presentation of Orthostrapper results. Bioinformatics. 2002, 18:1272-1273.
Hirsh and Fraser. Nature. 2001, 411:1046-1049.
Jensen RA. Orthologs and paralogs - we need to get it right. Genome Biol. 2001, 2:INTERACTIONS1002.
Jordan IK et al. Lineage-specific gene expansions in bacterial and archaeal genomes. Genome Res. 2001, 11:555-565.
Jordan et al. Bacteria: clear difference in rates between essential and nonessential. Genome Res. 2002, 12:962-968.
Kondrashov et al. Genome Biol. 2002, 3:RESEARCH0008.
Lee Y, Sultana R, Pertea G, Cho J, Karamycheva S, Tsai J, Parvizi B, Cheung F, Antonescu V, White J, Holt I, Liang F, Quackenbush J. Cross-referencing eukaryotic genomes: TIGR Orthologous Gene Alignments (TOGA). Genome Res. 2002, 12:493-502.
Li L, Stoeckert CJ Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003, 13:2178-2189.
Pal et al. Nature 421:496.
Remm M, Storm CEV, Sonnhammer ELL. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol. 2001, 314:1041-1052.
Sonnhammer ELL, Koonin EV. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 2002, 18:619-620.
Storm CEV, Sonnhammer ELL. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics. 2002, 18:92-99.
Storm CEV, Sonnhammer ELL. NIFAS: visual analysis of domain structure evolution. Bioinformatics. 2001, 17:343-348.
Tatusov RL, Natale DA, Garkavtsev IV, Tatusova TA, Shankavaram UT, Rao BS, Kiryutin B, Galperin MY, Fedorova ND, Koonin EV. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 2001, 29:22-28.
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA.
The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4:41.
Zmasek CM, Eddy SR. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics. 2002, 3:14.
Zmasek CM, Eddy SR. A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics. 2001, 17:821-828.

About the presenter

Professor Erik Sonnhammer has been at the Karolinska Institutet, Stockholm, Sweden, since 1998, heading the bioinformatics unit at the Center for Genomics and Bioinformatics. He was previously at the National Center for Biotechnology Information (NCBI) in Bethesda, MD, USA, and at the Sanger Centre in Cambridge, UK, where he developed tools for genomic sequence analysis, including the Pfam database of protein domain families. He has published a number of papers on orthology analysis and on protein family and protein domain analysis. Professor Sonnhammer has taught bioinformatics courses for many years, mainly in the fields of comparative genomics and protein domain family analysis.