COMPARATIVE GENOMICS WORKSHOP - Day 3
PHYLOGENETIC TREES FOR GENOME ANNOTATION, DATA ANALYSIS AND
PRESENTATION
1. Using precomputed phylogenetic trees in SEED for annotation and
functional predictions
The OVERALL homology scores (expressed as % identity or % similarity) are no longer an
adequate basis for functional projections. To assist SEED curators and users in making annotation
decisions phylogenetic trees have been precomputed for the majority of PEGs in the SEED
database. Now a SEED user can view his/her favorite protein in the context of the protein family it is
a member of, see in significant detail its phylogenetic relationship to other members of the family,
and follow sub-branching within the tree, which often leads to conclusions about biological functions
that would not be apparent otherwise. With over 8,000 prokaryotic genomes in the database (and
growing) building a comprehensive and accurate tree for any microbial protein family is a
computationally intensive and lengthy process. Building them on demand every time would have
been impractical and very inconvenient for a user. Elsewhere the SEED does allow building much
smaller local trees on the fly (as you saw earlier). But these GLOBAL trees are different, both in scope
and in the functionality they provide. The precomputed global trees in SEED are rich annotation tools
with many added features and functions. To familiarize ourselves with them, let us follow one
example together with the instructor on the screen using the guidelines below. Then you can
explore any protein family of your choice.
Let’s look at a phylogenetic tree for the Dihydroneopterin aldolase (EC 4.1.2.25) protein family,
starting with the E. coli protein: fig|83333.1.peg.3006
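The FIG identifier above packs several pieces of information into one string. The sketch below takes it apart and rebuilds the PEG page address used in the first step of this exercise. It assumes the pattern fig|<TaxID>.<version>.peg.<number>, as seen in the IDs in this handout; the function names are made up for illustration, and the URL pattern is copied from this handout and may not match other SEED servers.

```python
import re
from typing import NamedTuple


class FigId(NamedTuple):
    taxon_id: str        # NCBI TaxID part of the genome ID, e.g. "83333"
    genome_version: str  # digits after the period in the genome ID, e.g. "1"
    feature_type: str    # "peg" for protein-encoding genes
    feature_number: int  # running number of the feature in that genome


# Assumed pattern, inferred from the IDs shown in this handout.
FIG_ID = re.compile(r"^fig\|(\d+)\.(\d+)\.([A-Za-z]+)\.(\d+)$")


def parse_fig_id(fig_id: str) -> FigId:
    """Split a FIG feature ID into its parts, or raise ValueError."""
    match = FIG_ID.match(fig_id.strip())
    if match is None:
        raise ValueError(f"not a FIG feature ID: {fig_id!r}")
    taxon, version, feature_type, number = match.groups()
    return FigId(taxon, version, feature_type, int(number))


def annotation_url(fig_id: str,
                   base: str = "http://theseed.uchicago.edu/FIG/seedviewer.cgi") -> str:
    """Build the Annotation page address in the form used in this handout."""
    parse_fig_id(fig_id)  # validate before building the address
    return f"{base}?page=Annotation&feature={fig_id}"


print(parse_fig_id("fig|83333.1.peg.3006"))
print(annotation_url("fig|83333.1.peg.3006"))
```

The genome ID (83333.1) is what links a PEG back to its organism, which is handy later when TaxIDs are needed.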
* Open the PEG page for fig|83333.1.peg.3006:
http://theseed.uchicago.edu/FIG/seedviewer.cgi?page=Annotation&feature=fig|83333.1.peg.3006&user=SvetaG
* Locate the ‘alignment and trees’ feature within the gray box area. Four of them are currently
available for this E. coli gene.
* Follow one of the links: ‘alignments’ or ‘trees’. Both point to the same display page where one
can ‘Select Alignment and/or Tree’ and customize the view:
- Note the ‘Alignment and trees Help’ button in the top left corner – feel free to come here
anytime you feel lost
- Let us leave all settings at default for a start, changing only the coloring scheme from ‘by
residue’ to ‘by consensus’
- Out of the 4 trees listed for this PEG, we’ll select the top one, Tree #00002005 (activating
radio button by it) and click ‘update’
- Look at the alignment, discuss how these specific trees are constructed
    note where the alignment starts and ends, determine the area included in the alignment:
    locate the query sequence fig|83333.1.peg.3006 in the alignment (use the
    browser’s ‘find on page’ feature)
    open the corresponding PEG page and compare to the full-length sequence
    the color coding legend is available under the alignment, just above the tree
- Look at the tree, discuss
Two coloring systems, highlighting either (i) annotation consistency or (ii) the
area of a protein included in alignment
Domain structure and protein fusions are immediately obvious:
RED: N-terminus aligned - fig|195103.9.peg.1230
PURPLE: C-terminus aligned - fig|391165.8.peg.767 (or fig|349741.3.peg.1549, fig|481448.4.peg.1879)
GREEN: middle aligned - fig|218496.1.peg.162
any color in between is also possible
Color intensity is a warning against frivolous functional projections: the shorter
the region involved in alignment, the more saturated the color.
- compare fig|367928.5.peg.432 and fig|446465.4.peg.2235
- Annotation tools: radio buttons (square and circle), group selection at branching points,
command buttons, ‘Update’ versus ‘Assign’ buttons
- Labels: pluses, crosses, bold type, vertical bars (change focus PEG to fig|869305.3.peg.980
to see automatic function projection scoring)
- Other menu options
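The tree coloring discussed above boils down to two questions: which part of a protein is covered by the alignment, and how much of it. The sketch below illustrates that reasoning only. It is not SEED's actual coloring code; the function names, the 1-based coordinates and the 10% tolerance are assumptions made for this example.

```python
def alignment_coverage(protein_length: int, aln_start: int, aln_end: int) -> float:
    """Fraction of the protein covered by the alignment (1-based, inclusive)."""
    if not 1 <= aln_start <= aln_end <= protein_length:
        raise ValueError("alignment coordinates fall outside the protein")
    return (aln_end - aln_start + 1) / protein_length


def aligned_region(protein_length: int, aln_start: int, aln_end: int,
                   tolerance: float = 0.1) -> str:
    """Say which end of the protein the aligned region touches."""
    left_gap = (aln_start - 1) / protein_length            # unaligned N-terminal part
    right_gap = (protein_length - aln_end) / protein_length  # unaligned C-terminal part
    if left_gap <= tolerance and right_gap <= tolerance:
        return "full length"
    if left_gap <= tolerance:
        return "N-terminus"
    if right_gap <= tolerance:
        return "C-terminus"
    return "middle"


# A made-up 300-residue fusion protein whose last 120 residues are aligned.
print(aligned_region(300, 181, 300), alignment_coverage(300, 181, 300))
```

A low coverage value corresponds to the saturated colors in the tree: the smaller the aligned fraction, the less weight a projected annotation deserves.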
* Copy and paste ‘fig|83333.1.peg.3006’ into the ‘Focus protein ID’ window. Press ‘All
alignments and trees with this protein’ button. Let us explore other trees associated with it:
- Activate radio buttons for different Tree IDs 00002656, 00004091, or 00005198 and click
‘Update’
- Note statistics for the number of various annotations associated with each; these trees are
almost redundant
* Copy and paste a fusion protein ‘fig|391165.8.peg.767’ into the ‘Focus protein ID’ window,
explore the trees listed for it:
- Note that in addition to the 4 trees associated with Dihydroneopterin aldolase
(fig|83333.1.peg.3006), an additional tree is available, which is dominated by the duf556 family.
It covers all other genomes in the SEED database where this family occurs.
Exploring this small tree can provide functional leads for this hypothetical family:
First, it points us to the fact that duf556 is occasionally fused with one of the folate biosynthetic
enzymes (Dihydroneopterin aldolase) and hence might be functionally associated with the same
pathway.
Second, it shows the occurrence profile of the duf556 family at a glance: it is heavily dominated by
methanogens and methylotrophs. Notably, it occurs in both Eubacteria and Archaea.
Third, zooming into representatives of the duf556 family in different organisms can give an idea of
their chromosomal neighbors, which in turn might provide further functional clues.
* Let us use this tree (since it is conveniently small) to explore the menu options that we didn’t
address:
- Color the alignment ‘by residue’ or ‘by consensus’
- Sapling versus SEED
Global phylogenetic trees precomputed in SEED are designed for a very specific purpose:
accurate PROJECTION of ANNOTATIONS based on phylogeny, which has clear advantages over
homology-based projection. In addition, they can provide leads for functional predictions for
hypothetical protein families.
2. Creating phylogenetic distribution figures by combining SEED subsystems
and iToL
* Creating the files to import into iTol from a SEED subsystem
Go to the FolateExerciseMaizeClassVDC subsystem in Public SEED
(http://pubseed.theseed.org/SubsysEditor.cgi?page=SubsystemOverview)
Go to the spreadsheet view, then go to Limit display, choose the user set
vcrecy_MaizeworkshopItol, and hit Limit display.
Click on Export data to Excel.
Save your Excel file and give it a name.
Open your file in Excel, you may need to select “All files”, and open as tab delimited file. To do this,
follow the prompts in Excel: First select “DELIMITED” file, and then click “NEXT”. Only check the
box next to “TAB”, and then click finish.
You will need to manipulate it to create your input data. The first column has a genome ID in this
format: Actinomyces coleocanis DSM 15436 (525245.3). You need to extract 525245 automatically;
this is the NCBI taxonomy ID (TaxID) that iTol will recognize. To do that we will take advantage of
the Text to Columns feature in Excel.
- 1) Clear the Domain, Taxonomy, variant, pattern and # clustered columns, keeping the
columns themselves in place but empty.
- 2) Select the organism column, go to the Data tab and click Text to Columns.
- 3) Make sure Delimited is selected, then click Next.
- 4) Choose Other and enter ‘(’; uncheck the “TAB” box and click Finish.
- 5) Check for bugs: strain names that contain one of the characters you split on usually pose
problems. Fix them (or just delete that strain for this exercise).
- 6) Go to the column you created and repeat the operation with ‘.’ as the delimiter, then click Finish.
- 7) For the three columns FolK, FolE1 and FolE2, put a 0 in cells that are empty and a
1 in cells that are not empty (use the Data/Sort function of Excel to make this easier).
- 8) Delete all columns except the genome name, the genome NCBI TaxID, the FolK,
FolE1 and FolE2 columns, and save the file under another name.
- 9) From this file you will create four different files saved in tab-delimited format:
    - The first with only the TaxID column,
    - The second with the TaxID and the FolK columns, the third with the TaxID and
      FolE1 columns, and the fourth with the TaxID and FolE2 columns.
    - Do not forget to eliminate the header row in all files before saving.
    - Use the “SAVE AS” function: go to File/Save-As and change the format to
      (.TXT). Excel will ask if you really want to use this file type; click
      “Continue”.
You now have your files to import into iTol
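For larger subsystems the Excel steps above can also be scripted. The following is a minimal sketch, not part of the workshop materials: it assumes the export is tab-delimited with a header row, and that the columns are named Organism, FolK, FolE1 and FolE2 (the Organism column name in particular is a guess). The cell contents in the toy input are invented.

```python
import csv
import io
import re

# The genome ID sits in the last pair of parentheses, e.g. "(525245.3)";
# the digits before the period are the NCBI TaxID.
GENOME_ID = re.compile(r"\((\d+)\.\d+\)\s*$")


def taxid_from_organism(cell: str) -> str:
    """Extract the TaxID from a cell like 'Genus species strain (525245.3)'."""
    match = GENOME_ID.search(cell)
    if match is None:
        raise ValueError(f"no genome ID found in {cell!r}")
    return match.group(1)


def itol_tables(rows, genes=("FolK", "FolE1", "FolE2"), organism_col="Organism"):
    """Return header-free lines for the TaxID file and one 0/1 file per gene."""
    tables = {"taxids": []}
    for gene in genes:
        tables[gene] = []
    for row in rows:
        taxid = taxid_from_organism(row[organism_col])
        tables["taxids"].append(taxid)
        for gene in genes:
            # Any non-empty cell counts as "gene present".
            present = "1" if (row.get(gene) or "").strip() else "0"
            tables[gene].append(f"{taxid}\t{present}")
    return tables


# Tiny stand-in for the exported spreadsheet (cell contents are invented).
export = (
    "Organism\tFolK\tFolE1\tFolE2\n"
    "Actinomyces coleocanis DSM 15436 (525245.3)\tpeg.10\t\tpeg.77\n"
)
tables = itol_tables(csv.DictReader(io.StringIO(export), delimiter="\t"))
for name, lines in tables.items():
    print(name, lines)
```

Because the pattern is anchored to the final parentheses, strain names that themselves contain parentheses or periods do not break the extraction, which is the problem step 5 warns about. Each list can then be written to its own .txt file, one line per entry.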
*Creating a phylogenetic distribution Figure in iTol (itol.embl.de)
Go to iTol and set up a user account, this allows for more options and saving of trees.
TO ADD YOUR TAXONOMIC TREE:
Use the 'Other Trees' tab and scroll all the way to the bottom of the page. Here you can paste
into the box or use the upload tool.
ALL FILES MUST FOLLOW THE iToL FORMAT.
1) A text (.txt) file with one column. This column MUST use the genus and species names used by
NCBI (format genus_species_subname), or the NCBI TaxID number. NOTE: It is
much easier to just use the TaxID. These numbers can be obtained from SEED (the first part of the
number behind the name, before the period).
You can alter the names later in 2 different ways.
2) After selecting file, click 'Generate Tree'. This will generate 4 versions of your tree. In
general use the TaxID/Expanded tree. You can always add names or collapse later.
3) Click 'Show Tree'. Click 'Select tree for clipboard copy' to highlight all. Copy to your
clipboard with "CTRL C" (or "CMD C" on a Mac).
4) If you set up an account, you will have a tab labeled with "your name" Trees and Data, click
it.
5) Go to the project you are working on and click 'Add a tree to this project'
6) Paste your tree into the first box and name your tree. If you had saved a Newick format of
your tree previously, you can add it now.
7) Now is your first chance to add data. You can also add and manipulate your data later. Data
must be in text files and in columns.
DO NOT USE HEADERS in your file.
Example using TaxID and Presence or Absence of 1 gene.
272557 1
399549 1
368408 1
436308 0
** You can change how the data will be displayed; for this exercise use "BINARY".
Click upload and the tree and data field will be built.
8) From here, you have many options (your tree has been saved to your workspace). You can
view your tree, define ranges (i.e., color domains of life), and automatically assign taxonomy
(this is why I like using the TaxID as input).
9) To assign taxonomy: click 'Automatically assign taxonomy'. This may take a few minutes if
you have a large file.
10) If you wish, you can define color ranges (color the tree) by clicking 'define color ranges'.
Ranges color the leaves, whereas Clades color the tree. Be sure to save.
VIEW YOUR TREE!
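Most upload problems with the presence/absence data from step 7 come from a leftover header row or a stray value. The sketch below checks a data file before upload. It is an illustrative helper, not an iTol tool; it assumes whitespace- or tab-separated columns and that the first column holds numeric TaxIDs.

```python
def check_binary_dataset(text: str) -> list:
    """Return a list of problems found in a TaxID / 0-1 data file."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 2:
            problems.append(f"line {number}: expected 2 columns, found {len(fields)}")
        elif not fields[0].isdigit():
            problems.append(f"line {number}: first column is not a numeric TaxID (header row?)")
        elif fields[1] not in ("0", "1"):
            problems.append(f"line {number}: value {fields[1]!r} is not 0 or 1")
    return problems


# The example from step 7 passes; the same data with a header row does not.
example = "272557 1\n399549 1\n368408 1\n436308 0\n"
print(check_binary_dataset(example))
print(check_binary_dataset("TaxID FolK\n" + example))
```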
EXPORTING YOUR TREE:
On the tree view page, click 'Export'
Use the EPS format (small file size that can be rasterized by Photoshop and can be opened by
Preview on Mac)
There are several options to choose from, which are self-explanatory. If you added data, be sure that the
data box is checked.
Click 'Export Tree'. This takes you to another page.
Download and Save.
RELABELING LEAVES
You may want each leaf of the tree named with something other than NCBI names or maybe
you have a tree that does not use TaxIDs.
On the "Your name" Trees and Data tab, use the pull-down arrow under the 'Features' header
that corresponds to the tree you wish to edit.
Select "Re-label Leaves". Here you can upload a file which contains the current ID numbers and
the names you wish to use. You can also type the names into the fields.
Example from above converting TaxID into Order names.
272557 Desulfurococcales
399549 Sulfolobales
368408 Thermoproteales
436308 marine archaeal group 1
Click 'Upload' and then you can view your tree.
NOTE: This change is not permanent. You can display the original names or your new names.
Also, when exporting the tree, you can choose the new or old names.
PLANTSEED: AN INTERKINGDOM COMPARATIVE GENOMICS RESOURCE
PlantSEED is a platform where information about plant metabolic pathways and their components
has been retrieved from several resources (like KEGG and AraCyc among others), and has been
integrated and organized within the context of the SEED subsystems structure. Each gene’s
Annotation Overview Page includes a Compare Regions Tool which allows two types of analysis:
a) the automatic identification of the gene’s closest homologs in other plant species, and b) the
study of the bacterial gene homologs of the plant query and their respective clustering. It provides
the user with valuable information about the putative function of plant proteins in an
easy-to-use layout.
PlantSEED is not a static database. It keeps on growing, being corrected and adjusted, as more
plant genomes are incorporated into its database and its subsystems are updated. Corrections are
based on evidence provided by literature mining as well as pathway gapfilling methodology.
Exercise: Finding the pyrimidine reductase involved in riboflavin biosynthesis
in plants
We are going to use our platform to identify proteins which could play this functional role in
Arabidopsis and other model plants present in PlantSEED’s database.
The first steps in riboflavin biosynthesis are illustrated below. The enzymes that catalyze the first
two steps in plants were already known by 2004, but until last year, the enzyme that catalyzes
pyrimidine reduction in plants (Q) was unknown.
[Figure: first steps of riboflavin biosynthesis; Q marks the pyrimidine reduction step]
Our exercise will be to use the PlantSEED platform to identify a putative pyrimidine reductase
using as a starting point the sequence of the Escherichia coli diaminohydroxyphosphoribosylaminopyrimidine
deaminase / 5-amino-6-(5-phosphoribosylamino)uracil reductase.
Log in: PlantSEED http://plantseed.theseed.org
*At top right, enter username in the left box (and if needed a password in right box) and click
‘login’ (use your RAST user name).
*Point to the Navigate tab, a menu pops up. Right click on the BLAST search to open its link in a
new tab.
*Copy the sequence given below and paste it into the BLAST window in PlantSEED.
>gi|16128399|ref|NP_414948.1| fused diaminohydroxyphosphoribosylaminopyrimidine deaminase and 5-amino-6-(5-phosphoribosylamino) uracil reductase [Escherichia coli str. K-12 substr. MG1655]
MQDEYYMARALKLAQRGRFTTHPNPNVGCVIVKDGEIVGEGYHQRAGEPHAEVHALRMAGEKAKGATAYV
TLEPCSHHGRTPPCCDALIAAGVARVVASMQDPNPQVAGRGLYRLQQAGIDVSHGLMMSEAEQLNKGFLK
RMRTGFPYIQLKLGASLDGRTAMASGESQWITSPQARRDVQLLRAQSHAILTSSATVLADDPALTVRWSE
LDEQTQALYPQQNLRQPIRIVIDSQNRVTPVHRIVQQPGETWFARTQEDSREWPETVRTLLIPEHKGHLD
LVVLMMQLGKQQINSIWVEAGPTLAGALLQAGLVDELIVYIAPKLLGSDARGLCTLPGLEKLADAPQFKF
KEIRHVGPDVCLHLVGA
*Select the first instance of Arabidopsis thaliana and click the BLAST button. You get two
significant hits: fig|3702.11.peg.8572 with E = 1e-46 and fig|3702.11.peg.14378 with E = 1e-30.
*Click on the link for fig|3702.11.peg.8572. This is AT4G20960. Explore the information given at
the top of the page.
*Click on the cdd link. How do the Arabidopsis protein domains compare with the Escherichia
coli homolog domains? Which functional roles can be predicted for this protein?
*Return to the Annotation Overview page and go down to the Compare Regions tool:
*Click on the advanced display option. Relax both E-value cutoffs to 1e-15 and extend
the search to 200 regions.
*Click on the “draw” button. Explore the information offered by this tool.
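Relaxing an E-value cutoff means raising the threshold so that weaker hits are kept. The sketch below illustrates this with the two Arabidopsis BLAST hits from this exercise, taking the second E-value as 1e-30 since a significant hit must have a small E-value. The function name and the stricter comparison cutoff are invented for illustration; this is not how PlantSEED itself is implemented.

```python
def hits_passing(hits: dict, cutoff: float) -> list:
    """Return hit IDs with E-value <= cutoff, best (smallest E-value) first."""
    kept = [feature for feature, evalue in hits.items() if evalue <= cutoff]
    return sorted(kept, key=hits.get)


# E-values of the two Arabidopsis hits reported by the BLAST step.
blast_hits = {
    "fig|3702.11.peg.14378": 1e-30,
    "fig|3702.11.peg.8572": 1e-46,
}

print(hits_passing(blast_hits, 1e-15))  # relaxed cutoff keeps both hits
print(hits_passing(blast_hits, 1e-40))  # a stricter cutoff keeps only the best hit
```

With the relaxed cutoff, the second homolog stays in view, which is why it appears at the bottom of the Compare Regions figure.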
Main points to notice:
- The identification of the Arabidopsis homolog and the corresponding plant homologs.
- Seamless transition to bacterial genomes.
- Closest homologs to plant genes are cyanobacterial ones. What does this suggest? What
problem does this pose for clustering analysis?
- There are several bacterial genomes with homologs clustering with other riboflavin synthesis
pathway genes, in particular riboflavin synthase,
3,4-dihydroxy-2-butanone 4-phosphate synthase (EC 4.1.99.12) / GTP cyclohydrolase II
(EC 3.5.4.25),
6,7-dimethyl-8-ribityllumazine synthase (EC 2.5.1.78).
- What does this clustering suggest?
- At the bottom of the figure is another Arabidopsis gene homolog which corresponds to
fig|3702.11.peg.14378. This is AT3G47390.
*Click on this red arrow. Explore the information provided by the annotation overview page of this
second homolog.
*Click on the cdd link. How does this homolog structure compare with the other Arabidopsis and
bacterial homologs?
*Go back to the Annotation overview page of this second Arabidopsis homolog. Check the
information provided by its Compare Regions Tool. How do the other plant homologs
characteristics compare to each other?
*Go back to the PlantSEED portal.
*Click on the Gene Trees tab.
*Type riboflavin in the window corresponding to the subsystem column.
*Press Enter. A table is displayed listing the functional roles and the corresponding plant genes,
with links to alignments and phylogenetic trees for these genes. Notice there are two rows for
Diaminohydroxyphosphoribosylaminopyrimidine deaminase (EC 3.5.4.26), one corresponding to
AT3G47390, and the other to AT4G20960. There are also two rows for
5-amino-6-(5-phosphoribosylamino)uracil reductase (EC 1.1.1.193), one for AT3G47390, and the other for
AT4G20960. What does this mean?
*Click on the links to the trees and alignments of each and compare them. What conclusions
can you draw?
-Summarize the observations gathered from the PlantSEED platform. We will discuss them
together with the results of Fischer M et al., 2004 [PMID:15208317] and Ouyang M et al., 2010
[PMID:20580123] and design a hypothesis to test in the lab.
-To learn about the experimental work done recently to identify this enzyme, go to Hasnain G et al.,
2013 [PMID:23150645]. We will discuss their results and compare them with our hypothesis.
A GLIMPSE of the PATRIC DATABASE – PATHOSYSTEMS RESOURCE
INTEGRATION CENTER
The PATRIC database is an independent microbial genomic resource based on the underlying SEED
data. It does not use any of the SEED tools or displays, but develops its own, largely complementary
to what SEED has to offer. Hence, it pays to get to know it at least briefly, as some of
PATRIC's original tools are extremely powerful and its data collection is vast and constantly growing.
This presentation can only give you a glimpse of this great resource. To make the most of this
brief introduction, we strongly encourage you to read the following publications in advance and to
register as a user.
* To register go to http://patricbrc.org/portal/portal/patric/Home and create an account. In a few
minutes you should get a confirmation e-mail. Follow the link provided there to activate your
account
* About PATRIC: PMID: 24225323
* Intro for the exercise: PMID: 24381303
1. Pathway heatmaps
* Go to http://patricbrc.org/portal/portal/patric/Home and login
- Under the Organisms gray Tab on the very top of page choose ‘Clostridium’
- Choose the Pathways Tab
- In the top left corner use Pathway Class search field to choose ‘Metabolism of cofactors and
vitamins’ from a dropdown menu
- Next to it use Pathway Name search field to select Folate biosynthesis from a dropdown
menu
- Further to the right use the Annotation search field to choose PATRIC
- Now click Filter Table button on the far right. The metabolic pathways Table on this page will
reload to display a single pathway as you specified
- Click on Folate Biosynthesis and then to the Heatmap Tab
* How do you interpret this in terms of the variation of the Folate pathway in Clostridia? We will discuss
this in class.
* More specifically, look at the distribution of these protein families in different organisms:
EC 2.7.6.3 - 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase (FolK)
EC 4.1.2.25 - Dihydroneopterin aldolase (FolB)
Tip: Note a small square diagram at the top left corner of the Heatmap display. Use the small
gliding handles there to expand and view your selection
Tip: Go back and forth between the KEGG Map Tab and the Heatmap Tab to help rationalize the
data
Tip: Try using Flip Axis button above the Heatmap display. You can also change the order of
columns or rows (just drag and drop)
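A pathway heatmap is essentially a presence/absence table of protein families (rows) against genomes (columns). The sketch below mimics the Flip Axis button and the question of which genomes lack a given family. The genome names and the 0/1 values are invented placeholders, not PATRIC data, and the function names are hypothetical.

```python
def flip_axes(matrix: dict) -> dict:
    """Transpose {row: {column: value}} into {column: {row: value}}."""
    flipped = {}
    for row, columns in matrix.items():
        for column, value in columns.items():
            flipped.setdefault(column, {})[row] = value
    return flipped


def genomes_lacking(matrix: dict, family: str) -> list:
    """List the genomes in which the given protein family is absent."""
    return sorted(genome for genome, present in matrix[family].items() if not present)


# Invented example: 1 = family present in the genome, 0 = absent.
heatmap = {
    "2.7.6.3 (FolK)":  {"genome_A": 1, "genome_B": 1, "genome_C": 0},
    "4.1.2.25 (FolB)": {"genome_A": 1, "genome_B": 0, "genome_C": 0},
}

print(flip_axes(heatmap)["genome_B"])
print(genomes_lacking(heatmap, "4.1.2.25 (FolB)"))
```

Reading the table by genome rather than by family often makes it easier to spot organisms that have one enzyme of a pathway but not its neighbor.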
2. Integrative data mining with PATRIC
* Go to http://patricbrc.org/portal/portal/patric/Home
- Under the Tools gray Tab on the very top of page choose ‘ID Mapping’
- Paste the list of UniProt IDs given below into the IDs box on the left
- You must select the exact types of IDs here for ID mapping to work. If you don't know your
ID types, this might take several tries to figure out. For the purpose of this exercise choose
‘UniProtKB-ID’ in the ‘FROM ID Type’ window, and choose ‘PATRIC Locus Tag’ in the ‘TO ID
Type’ window
- Now hit Search button
Tip: By default PATRIC displays only top 20 rows in a table. Change this setting to 50 at the
bottom of the table. You might have to do this often, as most of the time you want to see the whole
table
Tip: You can select all the features and save them in your workspace as an eDUF-Vibrio group if
you want to keep this list for future use. Use the ‘Workspace – Add Feature(s)’ tool right above the
table on the left to do this (this option is available only if you registered earlier and are now
browsing as a signed-in user).
- Find the PATRIC locus Tag for Q9KTK5 and click on it. Tip: use CTRL-F or Command-F of
your Browser
- You are now on the main page for any gene in PATRIC. It is a hub for multiple features
(similar to a PEG page in SEED)
- Alternatively you could access the same page by typing “Q9KTK5” into the main search
window located at the top left corner of every page in PATRIC. Please try it now. Make sure to
keep quotation marks around the query
- From this gene page you can explore different links. Some will bring you back to pages and
resources you know, like the String, CDD or SEED. The Tabs bring you to Transcriptomics data
- With the goal of predicting a function for this unknown protein (Q9KTK5), please explore two
strategies:
1) Through the SEED Viewer link go to the SEED database and find the genes that physically
cluster with this one (you should know how to do this after the two days of workshop! And
of course, never trust any annotations). Tip: look not only in V. cholerae genomes, but
also in other Vibrio species, e.g. V. furnissii, V. coralliilyticus, V. parahaemolyticus, etc.
2) Through the Transcriptomics Tab find conditions under which the expression of this gene
is induced. You can also find the top correlated genes (but this might take time)
>> Using the leads from the gene clustering and transcriptomics data, try to make a functional
prediction for this eDUF. We will discuss this in class.
UniProt protein IDs
Q9KM02
Q9KKS9
Q9KRP9
Q9KTK5
Q9KTL5
Q9KTF7
Q9KMK7
Q9KPS4
H9L4N8
Q9KMA8
Q9KKY0
Q9KRZ3
Q9KQF8
Q9K2J6
Q9KRB8
Q9KQZ4
Q9KQ55
Q9KLF5
Q9KS70
Q9KQY9
Q9K315
Q9KKX9
Q9KSK9
Q9KSE7
Q9KQ57
Q9KRV3
Q9KMC9
Q9K2R8
Q9KKS0
Q9KN40
Q9KLT9
Q9KRJ6
Q9KQ50
Q9KTL7
Q9KRW6
Q9KN27
Q9KM18
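Before pasting a long ID list into a mapping tool it can help to check it for typos and duplicates. The sketch below does this for the list above, using the accession-number pattern published in the UniProt documentation. The helper name is made up, and nothing here queries PATRIC or UniProt.

```python
import re

# Accession-number pattern as given in the UniProt help pages.
UNIPROT_AC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

ID_LIST = """
Q9KM02 Q9KKS9 Q9KRP9 Q9KTK5 Q9KTL5 Q9KTF7 Q9KMK7 Q9KPS4 H9L4N8 Q9KMA8
Q9KKY0 Q9KRZ3 Q9KQF8 Q9K2J6 Q9KRB8 Q9KQZ4 Q9KQ55 Q9KLF5 Q9KS70 Q9KQY9
Q9K315 Q9KKX9 Q9KSK9 Q9KSE7 Q9KQ57 Q9KRV3 Q9KMC9 Q9K2R8 Q9KKS0 Q9KN40
Q9KLT9 Q9KRJ6 Q9KQ50 Q9KTL7 Q9KRW6 Q9KN27 Q9KM18
"""


def clean_ids(text: str):
    """Split text into unique, well-formed accessions and rejected tokens."""
    seen, good, bad = set(), [], []
    for token in text.split():
        token = token.upper()
        if token in seen:
            continue  # drop duplicates, keep first occurrence
        seen.add(token)
        (good if UNIPROT_AC.match(token) else bad).append(token)
    return good, bad


good, bad = clean_ids(ID_LIST)
print(len(good), "valid IDs;", len(bad), "rejected")
print("Q9KTK5" in good)
```

The cleaned list can be pasted into the IDs box one accession per line (for example via "\n".join(good)).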
Answer Key:
Please DO NOT open until instructed to do so Top Secret