Group1_Practical6_report

Course : Comparative Genomics (KB8007)
Assignment : Practical 6 : Orthology Prediction
Group : Group1
Student Names: Ino, Jessada
Short summary of what you have done.
This assignment asked us to find the orthologs, among eukaryotes, of our three predicted genes.
The problem is that all we had was a set of genes predicted by GenScan, whereas the two ortholog
databases (InParanoid and TreeFam) are keyed by either Protein ID or Transcript ID.
So first of all, we needed to find the Protein IDs of our genes. To do this, we ran a local
BLAST search of our genes against a reference set of proteins with known Protein IDs. We used
H.sapiens.fa from the InParanoid database,
http://inparanoid.sbc.su.se/download/current/sequences/processed/. We ran formatdb to create
the BLAST database and then ran blastall with the blastp program to get the results. For each
gene, we picked the matching Protein ID with the lowest e-value. Here are our Protein IDs:
1. ENSP00000297267
2. ENSP00000356033
3. ENSP00000384136
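The best-hit selection described above can be sketched in Python as follows. This is only an illustration under stated assumptions: it assumes the blastall results were written in the 12-column tabular format (blastall's -m 8 option), which the report does not state, and the query names (gene1, gene2), the second subject ID and all numeric values in the example are made up.

```python
def best_hits(tabular_lines):
    """Return {query_id: (subject_id, evalue)}, keeping the lowest e-value per query.

    Assumes 12 tab-separated columns per line, with the query ID in column 1,
    the subject ID in column 2 and the e-value in column 11.
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Hypothetical example lines (values invented for illustration only).
example = [
    "gene1\tENSP00000297267\t98.5\t400\t6\t0\t1\t400\t1\t400\t1e-180\t650",
    "gene1\tENSP00000111111\t45.0\t380\t200\t5\t10\t390\t5\t380\t2e-40\t160",
    "gene2\tENSP00000356033\t99.0\t300\t3\t0\t1\t300\t1\t300\t0.0\t700",
]

for query, (subject, evalue) in best_hits(example).items():
    print(query, subject, evalue)
```

Ties are resolved in favour of the first hit seen, which matches the order in which BLAST reports hits for a query.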
Then, following the instructions, we checked whether these three proteins are present in both
the InParanoid and TreeFam databases. All three proteins are present in both.
After that, for each of the three genes, we picked ten species in which both InParanoid and
TreeFam have found orthologs.
Describe the TreeFam and InParanoid algorithms and their differences.
TreeFam algorithm
TreeFam is a database of phylogenetic trees of gene families found in animals. It uses
completely sequenced genomes to build the phylogenetic trees. In TreeFam, orthologs and
paralogs are inferred from the phylogenetic tree of a gene family. A tree-based approach is
more robust than a pairwise similarity search because evolutionary rates, and therefore
pairwise BLAST scores, can vary greatly between members of the same gene family.
TreeFam can be divided into two parts: the first part, TreeFam-B, consists of automatically
generated trees, and the second part, TreeFam-A, consists of manually curated trees.
For the first part, it uses PhIGs clusters as seeds for building the gene family trees and
then expands each family by looking for similar sequences using HMMs.
For the second part, because of the gene duplications and losses that can be found in
TreeFam-B trees, TreeFam uses the following tools/algorithms for tree curation:
1. Duplication/Loss Inference (DLI) algorithm: this algorithm is based on Zmasek and Eddy's
   Speciation versus Duplication Inference (SDI) algorithm for inferring gene duplications in
   a phylogenetic tree. In contrast to SDI, DLI also infers gene losses and allows for
   multifurcations in the species tree.
2. Tree curation tool (tctool), by Lachlan Coin,
   http://www.sanger.ac.uk/Software/analysis/tctool: this program allows the curator to
   visually adjust the gene tree topology and recalculate a score which reflects both how
   well the topology explains the sequence alignment and (optionally) how closely the
   topology agrees with the species tree.
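To make the idea behind speciation versus duplication inference concrete, here is a minimal Python sketch of SDI-style labelling on a binary gene tree. It is not TreeFam's DLI implementation: it does not infer gene losses and does not handle multifurcations. The tree encoding, the species tree, the internal node names (primates_rodents, amniotes) and the gene names are all hypothetical. Each gene tree node is mapped to the lowest common ancestor of its children's species, and a node is called a duplication when it maps to the same species tree node as one of its children.

```python
def ancestors(node, parent):
    """Path from a species-tree node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same species tree")


def label_events(gene_tree, parent, events=None):
    """Map a binary gene tree onto the species tree.

    A leaf is (gene_name, species); an internal node is (left, right).
    Returns (mapped_species_node, events), where events lists
    (mapped_species_node, "speciation" | "duplication") in post-order.
    """
    if events is None:
        events = []
    if isinstance(gene_tree[0], str):  # leaf: (gene_name, species)
        return gene_tree[1], events
    left, right = gene_tree
    left_map, _ = label_events(left, parent, events)
    right_map, _ = label_events(right, parent, events)
    here = lca(left_map, right_map, parent)
    # Duplication if this node maps to the same place as one of its children.
    event = "duplication" if here in (left_map, right_map) else "speciation"
    events.append((here, event))
    return here, events


# Hypothetical species tree, given as child -> parent.
parent = {
    "human": "primates_rodents",
    "mouse": "primates_rodents",
    "primates_rodents": "amniotes",
    "chicken": "amniotes",
}

# Hypothetical gene tree with two human paralogs (h1, h2).
gene_tree = (
    (("h1", "human"), ("m1", "mouse")),
    (("h2", "human"), ("c1", "chicken")),
)

mapped, events = label_events(gene_tree, parent)
for node, event in events:
    print(node, event)
```

In this toy tree the two inner nodes are speciations, while the root maps to the same species node as its right child and is therefore labelled a duplication, which makes h1 and h2 paralogs.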
InParanoid algorithm
The InParanoid project gathers proteomes of completely sequenced eukaryotic species
plus Escherichia coli and calculates pairwise ortholog relationships among them.
The InParanoid algorithm is a BLAST-based method which forms orthologous groups using
similarity search.
InParanoid performs an all-against-all BLAST comparison and accepts a match only if it meets
the following criteria:
1. The alignment score of the conserved region must be high enough. To ensure this, a BLAST
   homology inference is only accepted if the region aligned by BLAST corresponds to a
   large enough fraction of the lengths of the proteins.
2. For both the query and the match sequence, the distance between the first and the last
   aligned residue must equal or exceed 50% of the length of the sequence.
3. For both the query and the match sequence, the sum of the lengths of the aligned
   regions on that sequence must equal or exceed 25% of the length of the sequence.
4. Low-complexity regions are filtered out.
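The two length criteria above (50% span and 25% aligned coverage) can be sketched as a small Python check. This is an illustration of the rules as stated in this report, not InParanoid's own code. The function names and the coordinate convention (1-based, inclusive, non-overlapping aligned segments) are assumptions made for the example.

```python
def passes_overlap_criteria(segments, seq_length,
                            span_cutoff=0.5, coverage_cutoff=0.25):
    """Check the span and coverage criteria for one sequence.

    segments: list of (start, end) aligned regions on the sequence,
    1-based and inclusive, assumed not to overlap each other.
    """
    if not segments or seq_length <= 0:
        return False
    first = min(start for start, _ in segments)
    last = max(end for _, end in segments)
    span = last - first + 1                       # first to last aligned residue
    aligned = sum(end - start + 1 for start, end in segments)
    return (span >= span_cutoff * seq_length
            and aligned >= coverage_cutoff * seq_length)


def accept_hit(query_segments, query_length, match_segments, match_length):
    """A hit is accepted only if both sequences pass both criteria."""
    return (passes_overlap_criteria(query_segments, query_length)
            and passes_overlap_criteria(match_segments, match_length))


# Two aligned regions on a 500-residue query, one short region on the match.
print(accept_hit([(1, 100), (300, 400)], 500, [(1, 120)], 500))
```

In the example the query passes (its aligned regions span 400 of 500 residues and cover 201 of them), but the match fails the 50% span criterion, so the hit as a whole is rejected even though the aligned region itself could score well.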
The differences between the two algorithms
They differ mainly in the underlying method: InParanoid uses pairwise BLAST similarity, while
TreeFam uses phylogenetic trees of gene families together with HMMs.
Their similarity is that both use completely sequenced eukaryotic genomes.
Detailed discussion of the results achieved with the two methods and the differences
between their predictions.
Result

Query 1: Protein ID ENSP00000297267, Transcript ID ENST00000297267
Species                   InParanoid Protein ID  TreeFam Transcript ID  Analysis
Bos taurus                ENSBTAP00000039682     ENSBTAT00000039897     Completely Homologous
                          ENSBTAP00000039680     ENSBTAT00000039895
                          ENSBTAP00000039684     ENSBTAT00000039899
Gallus gallus             ENSGALP00000019042     ENSGALT00000019065     Completely Homologous
Gasterosteus aculeatus    ENSGACP00000011876     ENSGACT00000011900     Completely Homologous
Macaca mulatta            ENSMMUP00000000681     ENSMMUT00000000734     Completely Homologous
Monodelphis domestica     ENSMODP00000009358     ENSMODT00000009542     Completely Homologous
Mus musculus              ENSMUSP00000095036     ENSMUST00000097425     Completely Homologous
Ornithorhynchus anatinus  ENSOANP00000023007     ENSOANT00000023011     Completely Homologous
Oryzias latipes           ENSORLP00000016640     ENSORLT00000016641     Completely Homologous
Pan troglodytes           ENSPTRP00000032009     ENSPTRT00000034628     Completely Homologous
Rattus norvegicus         ENSRNOP00000029878     ENSRNOT00000034842     Completely Homologous

Query 2: Protein ID ENSP00000356033, Transcript ID ENST00000367066
Species                   InParanoid Protein ID  TreeFam Transcript ID  Analysis
Bos taurus                ENSBTAP00000025952     ENSBTAT00000025952     Completely Homologous
Danio rerio               ENSDARP00000063825     ENSDART00000063826     Partially Overlapped
                          None                   ENSDART00000025689
Gallus gallus             ENSGALP00000022298     ENSGALT00000022338     Completely Homologous
Gasterosteus aculeatus    ENSGACP00000013206     ENSGACT00000013231     Completely Homologous
Macaca mulatta            ENSMMUP00000015729     ENSMMUT00000016796     Completely Homologous
Monodelphis domestica     ENSMODP00000009421     ENSMODT00000009606     Completely Homologous
Mus musculus              ENSMUSP00000047431     ENSMUST00000036370     Partially Overlapped
                          ENSMUSP00000110228     ENSMUST00000114581
                          None                   ENSMUST00000063683
Ornithorhynchus anatinus  ENSOANP00000018935     ENSOANT00000018938     Completely Homologous
Oryzias latipes           ENSORLP00000015190     ENSORLT00000015191     Completely Homologous
Xenopus tropicalis        ENSXETP00000021546     ENSXETT00000021546     Completely Homologous

Query 3: Protein ID ENSP00000384136, Transcript ID ENST00000405097
Species                   InParanoid Protein ID  TreeFam Transcript ID  Analysis
Caenorhabditis briggsae   CBP36600               CBG12867               Completely Homologous
Aedes aegypti             AAEL007915-PA          AAEL007915-RA          Completely Homologous
Anopheles gambiae         AGAP000562-PA          AGAP000562-RA          Completely Homologous
Canis familiaris          ENSCAFP00000021092     ENSCAFT00000022709     Completely Homologous
Gallus gallus             ENSGALP00000021028     ENSGALT00000021058     Completely Homologous
Gasterosteus aculeatus    ENSGACP00000019661     ENSGACT00000019699     Completely Homologous
Monodelphis domestica     ENSMODP00000018175     ENSMODT00000018507     Completely Homologous
Mus musculus              ENSMUSP00000000590     ENSMUST00000000590     Completely Homologous
Oryzias latipes           ENSORLP00000011511     ENSORLT00000011512     Partially Overlapped
                          None                   ENSORLT00000006905
Pan troglodytes           ENSPTRP00000007326     ENSPTRT00000007939     Completely Homologous
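The per-species labels in the result table can be derived mechanically from the paired predictions. The following Python sketch shows one way to do it. It assumes our reading of the labels, namely that a species is "Completely Homologous" when every TreeFam transcript has an InParanoid counterpart and "Partially Overlapped" when one database reports an ortholog the other does not. The function name and the row layout are hypothetical, and the example rows are taken from the table for ENSP00000356033.

```python
def classify_species(rows):
    """Label each species by how well the two databases agree.

    rows: iterable of (species, inparanoid_id, treefam_id); None marks an
    ortholog that one of the two databases did not report.
    Returns {species: label}.
    """
    labels = {}
    for species, inparanoid_id, treefam_id in rows:
        complete = inparanoid_id is not None and treefam_id is not None
        previous = labels.get(species, "Completely Homologous")
        # One unmatched ortholog is enough to mark the whole species.
        labels[species] = previous if complete else "Partially Overlapped"
    return labels


rows = [
    ("Danio rerio", "ENSDARP00000063825", "ENSDART00000063826"),
    ("Danio rerio", None, "ENSDART00000025689"),
    ("Gallus gallus", "ENSGALP00000022298", "ENSGALT00000022338"),
]

for species, label in classify_species(rows).items():
    print(species, label)
```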
Analysis
According to our results, three TreeFam orthologs have no counterpart in InParanoid.
Because the InParanoid algorithm relies on BLAST as its underlying homology detection tool,
we ran blastall with each of these three sequences against H.sapiens.fa to see whether there
is any similarity. Here are the results and our analysis for each of them:

ENSDART00000025689
Its best match is ENSP00000356033, the protein it is expected to be orthologous to, with an
e-value of 8.93154e-40. Our interpretation is that the match score is not high enough for
InParanoid to consider the pair orthologous, whereas TreeFam recognized it using HMMs.

ENSMUST00000063683
Its best match is ENSP00000356033, which should be its ortholog, with an e-value of
3.34284e-122.
In this case the e-value is very low. Judging by the match score alone, this pair should be
orthologous. InParanoid may fail to recognize it because of its other criteria, such as
filtering out short transcripts, the distance between the first and the last aligned residue,
and the sum of the lengths of the aligned regions.

ENSORLT00000006905
Its best match is ENSP00000384136, which should also be its ortholog, with an e-value of 0.
As with the previous protein, InParanoid may fail to recognize this pair because of its other
criteria.
References
1. TreeFam: a curated database of phylogenetic trees of animal gene families
   http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1347480/
2. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis
   http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2808972