Course: Comparative Genomics (KB8007)
Assignment: Practical 6: Orthology Prediction
Group: Group1
Student Name: Ino Jessada

Short summary of what you have done.

This assignment asked us to find the orthologs of our three predicted genes from eukaryotes. The problem is that all we had was a set of genes predicted by GenScan, whereas the two ortholog databases are keyed by either protein ID or transcript ID. So first we needed to find the protein IDs of our genes. To do this, we ran a local BLAST search of our genes against a reference set of sequences labelled with protein IDs. In this case we used H.sapiens.fa from the InParanoid database, http://inparanoid.sbc.su.se/download/current/sequences/processed/. We ran formatdb to create the database and then ran blastall with the blastp program to get the results. For each gene, we then picked the matching protein ID with the lowest e-value. Here are our protein IDs:
1. ENSP00000297267
2. ENSP00000356033
3. ENSP00000384136
Then, following the instructions, we checked whether these three proteins are present in both the InParanoid and TreeFam databases. All three are. After that, for each of the three genes, we picked ten species in which both InParanoid and TreeFam have found orthologs.

Describe the TreeFam and InParanoid algorithms and their differences.

TreeFam algorithm

TreeFam is a database of phylogenetic trees of gene families found in animals. It uses completely sequenced genomes to build the phylogenetic trees. In TreeFam, orthologs and paralogs are inferred from the phylogenetic tree of a gene family. A tree-based approach is more robust than pairwise comparison because evolutionary rates, and therefore pairwise BLAST scores, can vary greatly between members of the same gene family. TreeFam can be divided into two parts: the first part consists of automatically generated trees (TreeFam-B), and the second part consists of manually curated trees (TreeFam-A).
In the first part, TreeFam uses PhIGs as seeds for building the gene family trees, then expands each family by searching for similar sequences using HMMs. In the second part, because the TreeFam-B trees can contain gene duplications and losses, TreeFam uses the following tools and algorithms for tree curation:
1. Duplication/Loss Inference (DLI) algorithm. This algorithm is based on Zmasek and Eddy's Speciation versus Duplication Inference (SDI) algorithm for inferring gene duplications in a phylogenetic tree. In contrast to SDI, DLI also infers gene losses and allows for multifurcations in the species tree.
2. Tree curation tool (tctool); Lachlan Coin; http://www.sanger.ac.uk/Software/analysis/tctool. This program allows the curator to visually adjust the gene tree topology and recalculate a score that reflects both how well the topology explains the sequence alignment and (optionally) how closely the topology agrees with the species tree.

InParanoid algorithm

The InParanoid project gathers proteomes of completely sequenced eukaryotic species plus Escherichia coli and calculates pairwise ortholog relationships among them. The InParanoid algorithm is a BLAST-based method that forms orthologous groups using similarity search. InParanoid performs an all-against-all BLAST comparison with the following criteria:
1. The alignment of the conserved region must score high enough. To ensure this, a BLAST homology inference is only accepted if the region aligned by BLAST corresponds to a large enough fraction of the lengths of the proteins:
   For both the query and the match sequence, the distance between the first and the last aligned residue must equal or exceed 50% of the length of the sequence.
   For both the query and the match sequence, the sum of the lengths of the aligned regions on that sequence must equal or exceed 25% of the length of the sequence.
2. Low-complexity regions are filtered out.
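The two coverage criteria above can be sketched in Python as follows. This is only a minimal illustration, not InParanoid's actual code: the function names, the (start, end) representation of aligned regions, and the choice to count the span inclusively are assumptions made for this example, and the aligned regions on one sequence are assumed not to overlap each other.

```python
from typing import List, Tuple

# One aligned region on a sequence, as 1-based inclusive (start, end).
# This representation is an assumption made for the sketch.
Segment = Tuple[int, int]


def passes_overlap_criteria(seq_len: int, segments: List[Segment],
                            min_span: float = 0.50,
                            min_aligned: float = 0.25) -> bool:
    """Check the two coverage rules for ONE sequence (query or match)."""
    if seq_len <= 0 or not segments:
        return False
    first = min(start for start, _ in segments)
    last = max(end for _, end in segments)
    # Distance from the first to the last aligned residue (inclusive).
    span = last - first + 1
    # Total number of aligned residues; assumes the regions do not overlap.
    aligned = sum(end - start + 1 for start, end in segments)
    return span >= min_span * seq_len and aligned >= min_aligned * seq_len


def accept_hit(query_len: int, query_segments: List[Segment],
               match_len: int, match_segments: List[Segment]) -> bool:
    """A hit is kept only if BOTH sequences satisfy both rules."""
    return (passes_overlap_criteria(query_len, query_segments)
            and passes_overlap_criteria(match_len, match_segments))
```

For example, a hit whose aligned regions cover residues 10-30 and 60-70 of a 100-residue query satisfies both rules on the query side, but the hit is still rejected if the match sequence fails either rule. This is one way a pair with a very low e-value can still be discarded.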
The difference between the two algorithms

They differ mainly in the underlying method: InParanoid uses pairwise BLAST similarity, whereas TreeFam uses gene families (phylogenetic trees) and HMMs. Their similarity is that both work from completely sequenced eukaryotic genomes (InParanoid additionally includes Escherichia coli).

Detailed discussion of the results achieved with the two methods and the differences between their predictions.

Result

Matched species for Protein ID ENSP00000297267, Transcript ID ENST00000297267

Species                   InParanoid protein ID  TreeFam transcript ID  Analysis
Gallus gallus             ENSGALP00000019042     ENSGALT00000019065     Completely Homologous
Gasterosteus aculeatus    ENSGACP00000011876     ENSGACT00000011900     Completely Homologous
Macaca mulatta            ENSMMUP00000000681     ENSMMUT00000000734     Completely Homologous
Monodelphis domestica     ENSMODP00000009358     ENSMODT00000009542     Completely Homologous
Mus musculus              ENSMUSP00000095036     ENSMUST00000097425     Completely Homologous
Ornithorhynchus anatinus  ENSOANP00000023007     ENSOANT00000023011     Completely Homologous
Oryzias latipes           ENSORLP00000016640     ENSORLT00000016641     Completely Homologous
Pan troglodytes           ENSPTRP00000032009     ENSPTRT00000034628     Completely Homologous
Rattus norvegicus         ENSRNOP00000029878     ENSRNOT00000034842     Completely Homologous

Bos taurus rows (listed for ENSP00000297267 and ENSP00000356033; Completely Homologous in both)

Species                   InParanoid protein ID  TreeFam transcript ID
Bos taurus                ENSBTAP00000039682     ENSBTAT00000039897
Bos taurus                ENSBTAP00000039680     ENSBTAT00000039895
Bos taurus                ENSBTAP00000039684     ENSBTAT00000039899
Bos taurus                ENSBTAP00000025952     ENSBTAT00000025952

Matched species for Protein ID ENSP00000356033, Transcript ID ENST00000367066

Species                   InParanoid protein ID  TreeFam transcript ID  Analysis
Danio rerio               ENSDARP00000063825     ENSDART00000063826     Partially Overlapped
                          None                   ENSDART00000025689
Gallus gallus             ENSGALP00000022298     ENSGALT00000022338     Completely Homologous
Gasterosteus aculeatus    ENSGACP00000013206     ENSGACT00000013231     Completely Homologous
Macaca mulatta            ENSMMUP00000015729     ENSMMUT00000016796     Completely Homologous
Monodelphis domestica     ENSMODP00000009421     ENSMODT00000009606     Completely Homologous
Mus musculus              ENSMUSP00000047431     ENSMUST00000036370     Partially Overlapped
                          ENSMUSP00000110228     ENSMUST00000114581
                          None                   ENSMUST00000063683
Ornithorhynchus anatinus  ENSOANP00000018935     ENSOANT00000018938     Completely Homologous
Oryzias latipes           ENSORLP00000015190     ENSORLT00000015191     Completely Homologous
Xenopus tropicalis        ENSXETP00000021546     ENSXETT00000021546     Completely Homologous

Matched species for Protein ID ENSP00000384136, Transcript ID ENST00000405097

Species                   InParanoid protein ID  TreeFam transcript ID  Analysis
Caenorhabditis briggsae   CBP36600               CBG12867               Completely Homologous
Aedes aegypti             AAEL007915-PA          AAEL007915-RA          Completely Homologous
Anopheles gambiae         AGAP000562-PA          AGAP000562-RA          Completely Homologous
Canis familiaris          ENSCAFP00000021092     ENSCAFT00000022709     Completely Homologous
Gallus gallus             ENSGALP00000021028     ENSGALT00000021058     Completely Homologous
Gasterosteus aculeatus    ENSGACP00000019661     ENSGACT00000019699     Completely Homologous
Monodelphis domestica     ENSMODP00000018175     ENSMODT00000018507     Completely Homologous
Mus musculus              ENSMUSP00000000590     ENSMUST00000000590     Completely Homologous
Oryzias latipes           ENSORLP00000011511     ENSORLT00000011512     Partially Overlapped
                          None                   ENSORLT00000006905
Pan troglodytes           ENSPTRP00000007326     ENSPTRT00000007939     Completely Homologous

Analysis

According to our results, three sequences appear among the TreeFam orthologs but not in InParanoid (the rows marked None above). Because the InParanoid algorithm relies on BLAST as its underlying homology detection tool, I ran blastall on those three sequences against H.sapiens.fa to see whether there is any similarity. Here are the results and the analysis for each of them.

ENSDART00000025689
Its best match is ENSP00000356033, the protein it is expected to be orthologous to, with e-value 8.93154e-40. Our interpretation is that the match score is not high enough for InParanoid to consider the pair orthologous, whereas TreeFam recognized it using HMMs.

ENSMUST00000063683
Its best match is ENSP00000356033, which should be its ortholog, with e-value 3.34284e-122. In this case, the e-value is very low.
If we consider only the match score, this pair should be orthologous, but InParanoid does not recognize it as such. This might be because of its other criteria, such as filtering out short transcripts, the distance between the first and the last aligned residue, and the sum of the lengths of the aligned regions.

ENSORLT00000006905
Its best match is ENSP00000384136, which should also be its ortholog, with e-value 0. As with the previous sequence, InParanoid might not recognize it because of its other criteria.

Reference

TreeFam: a curated database of phylogenetic trees of animal gene families. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1347480/
InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808972