COMPARATIVE GENOMICS WORKSHOP - Day 3

PHYLOGENETIC TREES FOR GENOME ANNOTATION, DATA ANALYSIS AND PRESENTATION

1. Using precomputed phylogenetic trees in SEED for annotation and functional predictions

OVERALL homology scores (expressed as % identity or % similarity) are no longer an adequate basis for functional projections. To assist SEED curators and users in making annotation decisions, phylogenetic trees have been precomputed for the majority of PEGs in the SEED database. A SEED user can now view a favorite protein in the context of the protein family it belongs to, see in significant detail its phylogenetic relationship to other members of the family, and follow sub-branching within the tree, which often leads to conclusions about biological function that would not be apparent otherwise.

With over 8,000 prokaryotic genomes in the database (and growing), building a comprehensive and accurate tree for any microbial protein family is a computationally intensive and lengthy process. Building such trees on demand every time would be impractical and very inconvenient for the user. Elsewhere the SEED does allow building much smaller local trees on the fly (as you saw earlier). The GLOBAL trees are different, both in scope and in the functionality they provide: the precomputed global trees in SEED are rich annotation tools with many added features and functions.

To familiarize ourselves with them, let us follow one example together with the instructor on the screen, using the guidelines below. Afterwards you can explore any protein family of your choice.

Let’s look at a phylogenetic tree for the Dihydroneopterin aldolase (EC 4.1.2.25) protein family, starting with the E. coli protein fig|83333.1.peg.3006.

* Open the PEG page for fig|83333.1.peg.3006:
  http://theseed.uchicago.edu/FIG/seedviewer.cgi?page=Annotation&feature=fig|83333.1.peg.3006&user=SvetaG
* Locate the ‘alignment and trees’ feature within the gray box area. Four of them are currently available for this E. coli gene.
* Follow one of the links: ‘alignments’ or ‘trees’. Both point to the same display page, where one can ‘Select Alignment and/or Tree’ and customize the view:
  - Note the ‘Alignment and trees Help’ button in the top left corner; feel free to come here anytime you feel lost
  - Let us leave all settings at default to start with, changing only the coloring scheme from ‘by residue’ to ‘by consensus’
  - Out of the 4 trees listed for this PEG, we’ll select the top one, Tree #00002005 (activate the radio button next to it), and click ‘update’
  - Look at the alignment, discuss:
      how these specific trees are constructed
      starts and ends; determine the area included in the alignment:
        locate the query sequence fig|83333.1.peg.3006 in the alignment (use the browser’s ‘find on page’ feature)
        open the corresponding PEG page
        compare to the full-length sequence
      the color coding legend is available under the alignment, just above the tree
  - Look at the tree, discuss:
      Two coloring systems, highlighting either (i) annotation consistency or (ii) the area of a protein included in the alignment
      Domain structure and protein fusions are immediately obvious:
        RED: N-terminus aligned
        PURPLE: C-terminus aligned
        GREEN: middle aligned
        any color in between is also possible
        Examples to look for in the tree: fig|349741.3.peg.1549, fig|481448.4.peg.1879, fig|195103.9.peg.1230, fig|391165.8.peg.767, fig|218496.1.peg.162
      Color intensity is a warning against frivolous functional projections: the shorter the region involved in the alignment, the more saturated the color
        - compare fig|367928.5.peg.432 and fig|446465.4.peg.2235
  - Annotation tools: radio buttons (square and circle), group selection at branching points, command buttons, ‘Update’ versus ‘Assign’ buttons
  - Labels: pluses, crosses, bold type, vertical bars (change the focus PEG to fig|869305.3.peg.980 to see automatic function projection scoring)
  - Other menu options
* Copy and paste ‘fig|83333.1.peg.3006’ into the ‘Focus protein ID’ window.
  Press the ‘All alignments and trees with this protein’ button. Let us explore the other trees associated with it:
  - Activate the radio buttons for the different Tree IDs 00002656, 00004091, or 00005198 and click ‘Update’
  - Note the statistics for the number of various annotations associated with each; these trees are almost redundant
* Copy and paste the fusion protein ‘fig|391165.8.peg.767’ into the ‘Focus protein ID’ window and explore the trees listed for it:
  - Note that in addition to the 4 trees associated with Dihydroneopterin aldolase (fig|83333.1.peg.3006), an additional tree is available, which is dominated by the duf556 family. This tree spans all other genomes in the SEED database where duf556 occurs.

  Exploring this small tree can provide functional leads for this hypothetical family:
  First, it points us to the fact that duf556 is occasionally fused with one of the folate biosynthetic enzymes (Dihydroneopterin aldolase) and hence might be functionally associated with the same pathway.
  Second, it shows the occurrence profile of the duf556 family at a glance: it is heavily dominated by methanogens and methylotrophs. Notably, it occurs in both Eubacteria and Archaea.
  Third, zooming into representatives of the duf556 family in different organisms can give an idea of their chromosomal neighbors, which in turn might provide further functional clues.
* Let us use this tree (since it is conveniently small) to explore the menu options that we have not addressed yet:
  - Color the alignment ‘by residue’ or ‘by consensus’
  - Sapling versus SEED

Global phylogenetic trees precomputed in SEED are designed for a very specific purpose: accurate PROJECTION of ANNOTATIONS based on phylogeny, which has clear advantages over homology-based projection. In addition, they can provide leads for functional predictions for hypothetical protein families.

2.
Creating phylogenetic distribution figures by combining SEED subsystems and iTOL

* Creating the files to import into iTOL from a SEED subsystem

Go to the FolateExerciseMaizeClassVDC subsystem in the Public SEED (http://pubseed.theseed.org/SubsysEditor.cgi?page=SubsystemOverview).
Go to the spreadsheet view, then go to ‘Limit display’, choose the user set vcrecy_MaizeworkshopItol and click ‘Limit display’.
Click on ‘Export data to Excel’.
Save your Excel file and give it a name.
Open your file in Excel. You may need to select “All files” and open it as a tab-delimited file. To do this, follow the prompts in Excel: first select “DELIMITED” and click “NEXT”, then check only the box next to “TAB” and click “FINISH”.

You will need to manipulate the file to create your input data. The first column holds the genome name and ID in this format: Actinomyces coleocanis DSM 15436 (525245.3). You need to extract 525245, the NCBI taxonomy ID that iTOL will recognize. To do that we will take advantage of Excel’s built-in ‘Text to Columns’ tool.

- 1) Delete the contents of the Domain, Taxonomy, variant, pattern and # clustered columns, keeping the columns themselves empty (the split in the next steps writes into them).
- 2) Select the organism column, go to the Data tab and click ‘Text to Columns’.
- 3) Make sure “DELIMITED” is selected and click “NEXT”.
- 4) Choose “OTHER” and enter ‘(’ as the delimiter; uncheck the “TAB” box and click “FINISH”.
- 5) Check for errors: strain names that contain the character you split on usually cause problems. Fix them (or just delete those strains for this exercise).
- 6) Go to the column you created and repeat the operation with ‘.’ as the delimiter, then click “FINISH”.
- 7) For the three columns FolK, FolE1 and FolE2, put a 0 in cells that are empty and a 1 in cells that are not empty (the Data/Sort function of Excel makes this easier).
- 8) Delete all columns except the genome name, the genome NCBI TaxID, and the FolK, FolE1 and FolE2 columns, and save the file under another name.
- 9) From this file create four different files saved in tab-delimited format:
     the first with only the TaxID column,
     the second with the TaxID and FolK columns,
     the third with the TaxID and FolE1 columns,
     the fourth with the TaxID and FolE2 columns.
     Do not forget to remove the header row in all files before saving. Use the “SAVE AS” function: go to File/Save As and change the format to text (.txt). Excel will ask if you really want to use this file type; click “Continue”.

You now have your files to import into iTOL.

* Creating a phylogenetic distribution figure in iTOL (itol.embl.de)

Go to iTOL and set up a user account; this allows for more options and for saving trees.

TO ADD YOUR TAXONOMIC TREE:
Use the 'Other Trees' tab and scroll all the way to the bottom of the page. Here you can paste into the box or use the upload tool. ALL FILES MUST FOLLOW THE iTOL FORMAT.
1) Prepare a text (.txt) file with one column. This column MUST use the genus and species names used by NCBI (format genus_species_subname) or the NCBI TaxID number. NOTE: It is much easier to just use the TaxID. These numbers can be obtained from SEED (the first part of the number behind the name, before the period). You can alter the names later in 2 different ways.
2) After selecting the file, click 'Generate Tree'. This will generate 4 versions of your tree. In general use the TaxID/Expanded tree. You can always add names or collapse later.
3) Click 'Show Tree'. Click 'Select tree for clipboard copy' to highlight everything. Copy to your clipboard with "CTRL C" (or “CMD C” on a Mac).
4) If you set up an account, you will have a tab labeled "your name" Trees and Data; click it.
5) Go to the project you are working on and click 'Add a tree to this project'.
6) Paste your tree into the first box and name your tree. If you saved a Newick format of your tree previously, you can add it now.
7) Now is your first chance to add data. You can also add and manipulate your data later. Data must be in text files and in columns. DO NOT USE HEADERS in your file.
   Example using TaxID and presence or absence of 1 gene:
     272557	1
     399549	1
     368408	1
     436308	0
   ** You can change how the data will be displayed; for this exercise use "BINARY".
   Click 'Upload' and the tree and data field will be built.
8) From here, you have many options (your tree has been saved to your workspace). You can view your tree, define ranges (i.e., color the domains of life), and automatically assign taxonomy (this is why I like using the TaxID as input).
9) To assign taxonomy, click 'Automatically assign taxonomy'. This may take a few minutes if you have a large file.
10) If you wish, you can define color ranges (color the tree) by clicking 'define color ranges'. 'Ranges' colors the leaves, whereas 'Clades' colors the tree. Be sure to save.

VIEW YOUR TREE!

EXPORTING YOUR TREE:
On the tree view page, click 'Export'.
Use the EPS format (small file size; it can be rasterized by Photoshop and opened by Preview on a Mac).
There are several options to choose from, which are self-explanatory. If you added data, be sure that the data box is checked.
Click 'Export Tree'. This takes you to another page. Download and save.

RELABELING LEAVES
You may want each leaf of the tree named with something other than the NCBI names, or maybe you have a tree that does not use TaxIDs.
On the "Your name" Trees and Data tab, use the pull-down arrow under the 'Features' header that corresponds to the tree you wish to edit. Select "Re-label Leaves". Here you can upload a file that contains the current ID numbers and the names you wish to use. You can also type the names into the fields.
Example from above, converting TaxIDs into Order names:
     272557	Desulfurococcales
     399549	Sulfolobales
     368408	Thermoproteales
     436308	marine archaeal group 1
Click 'Upload' and then you can view your tree.
NOTE: This change is not permanent. You can display the original names or your new names. Also, when exporting the tree, you can choose the new or the old names.

PLANTSEED: AN INTERKINGDOM COMPARATIVE GENOMICS RESOURCE

PlantSEED is a platform where information about plant metabolic pathways and their components has been retrieved from several resources (KEGG and AraCyc, among others) and has been integrated and organized within the context of the SEED subsystems structure. Each gene’s Annotation Overview page includes a Compare Regions tool, which allows two types of analysis: a) the automatic identification of the gene’s closest homologs in other plant species, and b) the study of the bacterial homologs of the plant query gene and their respective clustering. It provides the user with valuable information about the putative function of plant proteins in a very convenient layout.

PlantSEED is not a static database. It keeps growing and being corrected and adjusted as more plant genomes are incorporated into its database and its subsystems are updated. Corrections are based on evidence provided by literature mining as well as by pathway gapfilling methodology.

Exercise: Finding the pyrimidine reductase involved in riboflavin biosynthesis in plants

We are going to use the platform to identify proteins that could play this functional role in Arabidopsis and in the other model plants present in PlantSEED’s database.

The first steps in riboflavin biosynthesis are illustrated below. The enzymes that catalyze the first two steps in plants were already known by 2004, but until last year the enzyme that catalyzes pyrimidine reduction in plants (Q) was unknown.
[Figure: first steps of riboflavin biosynthesis; the pyrimidine reduction step is marked Q]

Our exercise will be to use the PlantSEED platform to identify a putative pyrimidine reductase, using as a starting point the sequence of Escherichia coli’s diaminohydroxyphosphoribosylaminopyrimidine deaminase / 5-amino-6-(5-phosphoribosylamino) uracil reductase.

Log in: PlantSEED http://plantseed.theseed.org

* At the top right, enter your username in the left box (and, if needed, a password in the right box) and click ‘login’ (use your RAST user name).
* Point to the Navigate tab; a menu pops up. Right click on ‘BLAST search’ to open its link in a new tab.
* Copy the sequence given below and paste it into the BLAST window in PlantSEED.

>gi|16128399|ref|NP_414948.1| fused diaminohydroxyphosphoribosylaminopyrimidine deaminase and 5-amino-6-(5-phosphoribosylamino) uracil reductase [Escherichia coli str. K-12 substr. MG1655]
MQDEYYMARALKLAQRGRFTTHPNPNVGCVIVKDGEIVGEGYHQRAGEPHAEVHALRMAGEKAKGATAYV
TLEPCSHHGRTPPCCDALIAAGVARVVASMQDPNPQVAGRGLYRLQQAGIDVSHGLMMSEAEQLNKGFLK
RMRTGFPYIQLKLGASLDGRTAMASGESQWITSPQARRDVQLLRAQSHAILTSSATVLADDPALTVRWSE
LDEQTQALYPQQNLRQPIRIVIDSQNRVTPVHRIVQQPGETWFARTQEDSREWPETVRTLLIPEHKGHLD
LVVLMMQLGKQQINSIWVEAGPTLAGALLQAGLVDELIVYIAPKLLGSDARGLCTLPGLEKLADAPQFKF
KEIRHVGPDVCLHLVGA

* Select the first instance of Arabidopsis thaliana and click the BLAST button. You get two significant hits: fig|3702.11.peg.8572 with E = 1e-46 and fig|3702.11.peg.14378 with E = 1e-30.
* Click on the link for fig|3702.11.peg.8572. This is AT4G20960. Explore the information given at the top of the page.
* Click on the cdd link. How do the domains of the Arabidopsis protein compare with those of the Escherichia coli homolog? Which functional roles can be predicted for this protein?
* Return to the Annotation Overview page and go down to the Compare Regions tool.
* Click on the advanced display option. Relax both E-value cutoffs to 1e-15 and extend the search to 200 regions.
* Click on the “draw” button. Explore the information offered by this tool.
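The effect of an E-value cutoff on which hits are kept can be sketched in a few lines of Python. This is an illustration only, not how PlantSEED or BLAST is implemented. It uses just the two Arabidopsis hits quoted in this exercise and reads the second E-value as 1e-30; the stricter 1e-40 cutoff and the helper name hits_passing are hypothetical, chosen for the example.

```python
# Illustrative only: the two Arabidopsis BLAST hits quoted in this exercise.
# Assumption: the second E-value is read as 1e-30.
HITS = {
    "fig|3702.11.peg.8572": 1e-46,
    "fig|3702.11.peg.14378": 1e-30,
}


def hits_passing(hits, cutoff):
    """Return the IDs whose E-value is at or below the cutoff, best hit first."""
    passing = [(evalue, peg) for peg, evalue in hits.items() if evalue <= cutoff]
    return [peg for evalue, peg in sorted(passing)]


if __name__ == "__main__":
    # 1e-40 is a hypothetical strict setting; 1e-15 is the relaxed value
    # used in the Compare Regions step of this exercise.
    for cutoff in (1e-40, 1e-15):
        print(cutoff, hits_passing(HITS, cutoff))
```

With the stricter cutoff only fig|3702.11.peg.8572 is kept; relaxing the cutoff to 1e-15 keeps both hits, which is why a relaxed setting brings more distant homologs into view.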
Main points to notice:
- The identification of the Arabidopsis homolog and the corresponding plant homologs.
- Seamless transition to bacterial genomes.
- The closest homologs to the plant genes are cyanobacterial ones. What does this suggest? What problem does this pose for clustering analysis?
- There are several bacterial genomes with homologs clustering with other riboflavin synthesis pathway genes, in particular riboflavin synthase, 3,4-dihydroxy-2-butanone 4-phosphate synthase (EC 4.1.99.12) / GTP cyclohydrolase II (EC 3.5.4.25), and 6,7-dimethyl-8-ribityllumazine synthase (EC 2.5.1.78).
- What does this clustering suggest?
- At the bottom of the figure is another Arabidopsis homolog, which corresponds to fig|3702.11.peg.14378. This is AT3G47390.

* Click on this red arrow. Explore the information provided by the Annotation Overview page of this second homolog.
* Click on the cdd link. How does the structure of this homolog compare with the other Arabidopsis and bacterial homologs?
* Go back to the Annotation Overview page of this second Arabidopsis homolog. Check the information provided by its Compare Regions tool. How do the characteristics of the other plant homologs compare to each other?
* Go back to the PlantSEED portal.
* Click on the Gene Trees tab.
* Type riboflavin into the field of the subsystem column.
* Press Enter. A table is displayed listing the functional roles and the corresponding plant genes, with links to alignments and phylogenetic trees for these genes. Notice there are two rows for Diaminohydroxyphosphoribosylaminopyrimidine deaminase (EC 3.5.4.26), one corresponding to AT3G47390 and the other to AT4G20960. There are also two rows for 5-amino-6-(5-phosphoribosylamino) uracil reductase (EC 1.1.1.193), one for AT3G47390 and the other for AT4G20960. What does this mean?
* Click on the links to the trees and alignments of each and compare them. What conclusions can you draw?
- Summarize the observations gathered from the PlantSEED platform.
  We will discuss them together with the results of Fischer M et al, 2004 [PMID: 15208317] and Ouyang M et al, 2010 [PMID: 20580123] and design a hypothesis to test in the lab.
- To learn about the experimental work done recently to identify this enzyme, go to Hasnain G et al, 2013 [PMID: 23150645]. We will discuss their results and compare them with our hypothesis.

A GLIMPSE of the PATRIC DATABASE - PATHOSYSTEMS RESOURCE INTEGRATION CENTER

The PATRIC database is an independent microbial genomic resource based on the underlying SEED data. It does not use any of the SEED tools or displays, but develops its own, which are largely complementary to what SEED has to offer. Hence, it pays to get to know it at least briefly, as some of PATRIC’s original tools are extremely powerful and its data collection is vast and constantly growing. This presentation can only give you a glimpse of this great resource. To make the most of this brief introduction, we strongly encourage you to read the following publications in advance and to register as a user.

* To register, go to http://patricbrc.org/portal/portal/patric/Home and create an account. In a few minutes you should get a confirmation e-mail. Follow the link provided there to activate your account
* About PATRIC: PMID: 24225323
* Intro for the exercise: PMID: 24381303

1. Pathway heatmaps

* Go to http://patricbrc.org/portal/portal/patric/Home and log in
  - Under the gray Organisms Tab at the very top of the page choose ‘Clostridium’
  - Choose the Pathways Tab
  - In the top left corner use the Pathway Class search field to choose ‘Metabolism of cofactors and vitamins’ from the dropdown menu
  - Next to it use the Pathway Name search field to select ‘Folate biosynthesis’ from the dropdown menu
  - Further to the right use the Annotation search field to choose PATRIC
  - Now click the Filter Table button on the far right.
    The metabolic pathways Table on this page will reload to display the single pathway you specified
  - Click on Folate Biosynthesis and then on the Heatmap Tab
* How do you interpret this in terms of the variation of the Folate pathway in Clostridia? We will discuss this in class
* More specifically, look at the distribution of these protein families in different organisms:
    EC 2.7.6.3 - 2-amino-4-hydroxy-6-hydroxymethyldihydropteridine pyrophosphokinase (FolK)
    EC 4.1.2.25 - Dihydroneopterin aldolase (FolB)
  Tip: Note the small square diagram at the top left corner of the Heatmap display. Use the small gliding handles there to expand and view your selection
  Tip: Go back and forth between the KEGG Map Tab and the Heatmap Tab to help rationalize the data
  Tip: Try using the Flip Axis button above the Heatmap display. You can also change the order of columns or rows (just drag and drop)

2. Integrative data mining with PATRIC

* Go to http://patricbrc.org/portal/portal/patric/Home
  - Under the gray Tools Tab at the very top of the page choose ‘ID Mapping’
  - Paste the list of UniProt IDs given below into the IDs box on the left
  - You must select the exact types of IDs here for ID mapping to work. If you don't know your ID types, this might take several tries to figure out. For the purpose of this exercise choose ‘UniProtKB-ID’ in the ‘FROM ID Type’ window and ‘PATRIC Locus Tag’ in the ‘TO ID Type’ window
  - Now hit the Search button
  Tip: By default PATRIC displays only the top 20 rows of a table. Change this setting to 50 at the bottom of the table. You might have to do this often, as most of the time you want to see the whole table
  Tip: You can select all the features and save them in your workspace as an eDUF-Vibrio group if you want to keep this list for future use. Use the ‘Workspace’ - ‘Add Feature(s)’ tool right above the table on the left to do this (this option is available only if you have registered earlier and are now browsing as a signed-in user).
  - Find the PATRIC Locus Tag for Q9KTK5 and click on it. Tip: use CTRL-F or Command-F in your browser
  - You are now on the main page for a gene in PATRIC. It is a hub for multiple features (similar to a PEG page in SEED)
  - Alternatively, you could access the same page by typing “Q9KTK5” into the main search window located at the top left corner of every page in PATRIC. Please try it now. Make sure to keep the quotation marks around the query
  - From this gene page you can explore different links. Some will bring you back to pages and resources you know, like STRING, CDD or SEED. The Tabs bring you to Transcriptomics data
  - With the goal of predicting a function for this unknown protein (Q9KTK5), please explore two strategies:
    1) Through the SEED Viewer link go to the SEED database and find the genes that physically cluster with this one (you should know how to do this after the two days of workshop! And of course, never trust any annotations). Tip: look not only in V. cholerae genomes, but also in other Vibrio species, e.g. V. furnissii, V. coralliilyticus, V. parahaemolyticus, etc.
    2) Through the Transcriptomics Tab find conditions under which the expression of this gene is induced. You can also find the top correlated genes (but this might take time)

>> Using the leads from the gene clustering and transcriptomics data, try to make a functional prediction for this eDUF. We will discuss this in class

UniProt protein IDs
Q9KM02
Q9KKS9
Q9KRP9
Q9KTK5
Q9KTL5
Q9KTF7
Q9KMK7
Q9KPS4
H9L4N8
Q9KMA8
Q9KKY0
Q9KRZ3
Q9KQF8
Q9K2J6
Q9KRB8
Q9KQZ4
Q9KQ55
Q9KLF5
Q9KS70
Q9KQY9
Q9K315
Q9KKX9
Q9KSK9
Q9KSE7
Q9KQ57
Q9KRV3
Q9KMC9
Q9K2R8
Q9KKS0
Q9KN40
Q9KLT9
Q9KRJ6
Q9KQ50
Q9KTL7
Q9KRW6
Q9KN27
Q9KM18

Answer Key: Please DO NOT open until instructed to do so
Top Secret