COMPARATIVE GENOMICS WORKSHOP – Days 1 & 2 USING COMPARATIVE GENOMICS RESOURCES - SEED SEED is a comparative genomics database and workbench. Unlike STRING it is not rigidly pre-computed; the user has a lot of control. 1. Logging on to SEED as a user UChicago SEED http://theseed.uchicago.edu/FIG/index.cgi (PubSEED http://pubseed.theseed.org/seedviewer.cgi ) * On Entry Page, open ‘Use the new Subsystem Editor’ http://theseed.uchicago.edu/FIG/SubsysEditor.cgi in a new window or new tab * At top right, enter username in the left box (and if needed a password in right box) and click ‘login’ (use your RAST user name) * Typical starting points for a comparative genomics project are a protein sequence, a gene name, or an existing subsystem. The following exercises walk you through all 3 routes. 2. Enter SEED starting from a protein sequence, e.g. the plant FolE protein >gi|15231435|ref|NP_187383.1| GTP cyclohydrolase I [Arabidopsis thaliana] MGALDEGCLNLELDIGMKNGCIELAFEHQPETLAIQDAVKLLLQGLHEDVNREGIKKTPFRVAKALREGT RGYKQKVKDYVQSALFPEAGLDEGVGQAGGVGGLVVVRDLDHYSYCESCLLPFHVKCHIGYVPSGQRVLG LSKFSRVTDVFAKRLQDPQRLADDICSALQHWVKPAGVAVVLECSHIHFPSLDLDSLNLSSHRGFVKLLV SSGSGVFEDESSNLWGEFQSFLMFKGVKTQALCRNGSSVKEWCPSVKSSSKLSPEVDPEMVSAVVSILKS LGEDPLRKELIATPTRFLKWMLNFQRTNLEMKLNSFNPAKVNGEVKEKRLHCELNMPFWSMCEHHLLPFY GVVHIGYFCAEGSNPNPVGSSLMKAIVHFYGFKLQVQERMTRQIAETLSPLVGGDVIVVAEAGHTCMISR GIEKFGSSTATIAVLGRFSSDNSARAMFLDKIHTTNALKTESSSPF * Since SEED can only (conveniently) search one genome at a time, if you do not already know which prokaryotes have homologs you first need to do BlastP searches of Bacteria and Archaea at NCBI to find a genome that has a homolog of Arabidopsis FolE and is also in SEED. * For this exercise, suppose that you have done this and find that Propionibacterium acnes KPA171202 meets these criteria * Go to the SEED Entry page * Paste Arabidopsis FolE sequence into the box for Searching DNA or Protein Sequences (in a selected organism). * Select the Propionibacterium acnes KPA171202 genome to search for homologs from the list near the top of the page → Click ‘Search for matches’ * Click on best hit, peg.255 (PEG= Protein Encoding Gene) → takes you to ‘Annotation overview’ page for this gene. Features: - Current assignment (top of page) = current annotation - EC number = link to KEGG - Sequence link → Fasta files of DNA, protein - Feature Evidence tools Visual protein evidence diagrams similarity to other proteins (number can be specified); it is usually preferable to rank by score (= BlastP e-value) rather than identity Multiple homologs in the same organism (if any) are shown next to each other [out of the general sorting order (% identity or E-val)] and are framed in a blue box (if ‘Group By Genome’ box is activated) Tabular protein evidence is the same data in tabular format - Run tool → runs various search tools on the protein, e.g. Psi-Blast (default) - List of Subsystems that include the protein (in this case about 12) - Compare Regions tool → Displays chromosomal region around your gene plus 4 closest genomes, color coded for frequency of occurrence, mouse-over to see annotation Gene arrows denote direction of transcription/translation Offset genes denote overlapping ORFs (translational coupling) - To see more genomes, Click ‘Advanced’ → Type in ‘50’, click ‘Draw’ (takes some time to load) → see many genomes, and clustering patterns with other folate-related genes e.g. DHPS, DHNA, HPPK (fused in some cases) - To see more very similar genomes, unclick ‘Collapse close genomes’ (by clicking ‘Show all’) , click ‘Draw’ - [To see more distant homologs, lower both cutoffs to 1e-05, click ‘Draw’] - To ‘rank’ homologs by similarity to the P. acnes protein, raise the ‘coloring’ cutoff to 1e-75, click ‘Draw’ → Note that more distant homologs now have different colors - To pick and choose genomes for display – check or uncheck radio buttons on the left, and Click ‘update with selected’ (not ‘Draw’ in this case). If you need to return to original display – reload the page - The ‘Compare Regions’ tool is the heart of the SEED - Trees tool → For some PEG pages (not all) a phylogenetic tree tool is available, click ‘Trees’ - From the Tree page, keep all the default parameters and click on a radio button for a specific tree (they are often very similar), then click ‘Update’. This is a valuable tool to identify fusions, subfamilies and annotation problems. As you see for this tree, bright colors around radio buttons guide you to fusions. 3. Enter SEED starting from a gene name, e.g. ylgG, encoding DHNTP pyrophosphohydrolase, an early (and until recently missing) enzyme of folate biosynthesis * Switch to the SEED Entry page * Enter ylgG in the Searching for Genes or Functional Roles Using Text Search Pattern box, select Lactococcus lactis subsp. lactis, click ‘Search genome selected below’ * Click on peg.1190 → takes you to ‘Annotation Overview page’ → proceed as above * Further exploration of Compare Regions tool: - Click on ‘Tabular Region Information’ → Dihydroneopterin triphosphate pyrophosphohydolase highlighted in red, plus flanking genes. - Click on Dihydroneopterin triphosphate pyrophosphohydolase ‘cluster’ button → will show precomputed gene clusters (if any exist). In this case, none do. - Click instead on ‘cluster’ button for dihydrofolate reductase. Note "display ___ items per page' control on the results page. Enter '50' to see all the potential gene clusters found → one cluster includes thymidylate synthase (the main source of dihydrofolate) - Click on ‘Sequences’ tab → Select ‘Protein’ → Click ‘Align Sequences’ → Displays Clustal protein alignment → Use radio button to select ‘NJTree’ (Neighbor-joining tree), reload - Displays protein (or DNA) alignment and a tree – note the two clades (currently annotated differently) 4. Enter SEED via Subsystems. To view subsystems in SEED: * Go to Subsystem Overview page: http://theseed.uchicago.edu/FIG/SubsysEditor.cgi Displays a table of subsystems, generally with self-explanatory names * Use browser search tool to find ‘Folate Biosynthesis’. This is a typical ‘populated’ subsystem for a metabolic pathway and related enzymes. Click on link → Opens to ‘Subsystem Info’ tab - ‘Subsystem Info’ contains notes on the pathway - ‘Functional Roles’ lists the enzymes (= functional roles) and their abbreviations - Open ‘Spreadsheet’ in a new tab → default is to color genes by clustering - Click on ‘Color Spreadsheet’ tab → to color essential genes, click ‘Color by attribute’ button - Select ‘Essential_Gene_Sets_Bacterial’ → click ‘Update’ → Colors essential/non-essential genes in organisms for which comprehensive knockout studies are available (takes time to load) - Click on ‘Limit Display’ tab → Can show various subsets of data, e.g. just Firmicute genomes, or genomes with essentiality data available (SvetaG_Ess-ty_data_available in ‘User sets’) * Find Lactococcus lactis subsp. lactis Il1403, Click on FolE → takes you to ‘Annotation Overview’ page for fig|272623.1.peg.1188 → proceed as above * Further exploration of Feature Evidence tools: - Click on ‘Feature Evidence’ (blue link), then activate ‘Visual Protein Evidence’ tab. Recall that L. lactis FolE is fused to another folate synthesis enzyme, FolK - ‘Visual Protein Evidence’ shows diagrammatically how similar proteins align with the L. lactis fusion protein: Color is overall similarity, white bars are the matching regions - Note that there are two sets of hits, to C-terminal (FolE) and N-terminal (FolK) domains. To see this clearly – activate ‘Group by Genome’ radio button, Click ‘Resubmit’. Multiple homologs within the same organism (if any) are now shown next to each other [out of the general sorting order (% identity or E-val)] and are framed in a blue box - Note that domain structure of the query protein is shown near the top of the page. When you hover over each domain, additional information and a link to the corresponding page in Pfam database is displayed. - To see additional homologs, enter ‘200’ in ‘Max Sims’, click ‘Resubmit’. You can also change the sorting order (% identity, Score, etc) If you hover over any colored bar representing a query protein or one of its homologs, a floating window will be displayed, describing this PEG and the level of homology in this pair (E-value, % identity, length of alignment, etc). Clicking (once) on any colored homolog bar activates a different floating window, which provides links to (i) the corresponding PEG page, (ii) all Subsystems associated with this protein, and (iii) the blast alignment (the very last link). Use them to navigate. - Switch to ‘Tabular Protein Evidence’ → Explore the table. Query sequence appears on top and is highlighted greenish-brown: > Association with Subsystem(s) is shown for the query and each homolog in the ‘Associated Subsystem’ column on the right > Note also ‘Evidence Codes’ column on the right – hover over its heading and click on [?] sign for explanations. Evidence Codes in SEED differ from that in other databases: They are based on gene clustering (= ‘functional coupling’) on the chromosome. Literature references (‘direct’ or ‘indirect’) are shown here also > User can select sequences here (as targets for annotation or for downloading in Fasta format) by activating radio buttons in the left-most column of the table. Please select several homologs of the query protein. Note that if you switch to ‘Visual Protein Evidence’ view now, this selection will be carried over (the opposite is also true). Switching to the graphic view can also be achieved by clicking on a small icon of an alignment near each PEG ID. - Explore the links at the top left corner of the page, above the Similarity Table: > ‘Functionally coupled’ link takes you to the list of gene families that tend to cluster with the query family on the chromosome, the ‘strength of functional coupling’ is shown (‘Score’ column) - Can click on links to each gene in the table → Takes you to the ‘Annotation Overview’ page 5. Annotating genes. Many genes in SEED have already been annotated by experts, and are included in subsystems. However, many have not yet been annotated. As a user, you can annotate genes yourself. However, you may only do this for genes: (a) That have not already been annotated by someone else and entered into an existing subsystem, or (b) That are not closely related to genes that are in an existing subsystem. It is easy to tell whether a gene is in a subsystem, e.g. from the Annotation Overview page Basic Rule: If a gene is in a subsystem, leave it alone. If and only if the gene is not annotated or not similar to genes that are annotated, you can make your own annotation as follows: * In L. lactis FolE Annotation Overview, click to move left, then click on peg.1177, ‘integral membrane protein’. Click on this gene to re-center the viewer. Alternatively, to quickly jump to a different protein page with the same genome – copy the full ID of the query gene (fig|272623.1.peg.1188), paste it into the small window ‘find’ at the very top right corner of the page, change the gene number to ‘1177’ and Click ‘find’ * In ‘Compare Regions’ request 40 similar sequences – check that none has a definite annotation or is in any subsystems * Run Psi-Blast → Conserved Domain search shows that peg.1177 is COG4478, predicted membrane protein * To replace current annotation with ‘COG4478, predicted membrane protein’ (Instructor alone does the annotation): - Copy ‘COG4478, predicted membrane protein’ from Conserved Domain search, paste into ‘New Assignment’ box (avoid white spaces), click ‘Change’ → Changes assignment - To propagate this annotation to similar genes → Click ‘Annotate Clusters’, activate round radio button on desired name (at top of table), check square boxes of sequences to be renamed - Click on ‘annotate’ → All checked sequences are now annotated ‘COG4478, predicted membrane protein’ * There are many other ways to annotate genes in SEED, but this is the basic operation 6. Building a subsystem. As an example, we will each build a small subsystem using the folE gene and genes associated with it. * In Subsystem Editor, click ‘Home’ (OR: from any PEG page click ‘Curate Subsystems’ under ‘Navigate’) → Click ‘Create new subsystem → Opens a new window * Name your subsystem (your initials)_Folate (e.g. ADH_Folate), write ‘Test’ in Description and in Notes → Click ‘Save Changes’ → Click on link at top of page to enter new subsystem * By default your subsystem will be classified as ‘Experimental’ – Leave it that way * Can change Description and Notes, press red ‘Save Changes’ button. Note: Save every page before you leave it, otherwise it is lost * First we will add a gene to the subsystem, then some genomes. To add a gene: * Open ‘Functional Roles’ in a new tab. To add an existing functional role, copy-paste from elsewhere in SEED. To copy-paste FolE’s functional role: - Go to Subsystem Overview, find Folate Biosynthesis subsystem, Click on ‘Functional Roles’ tab → copy ‘GTP cyclohydrolase I (EC 3.5.4.16) type 1’ → paste into your subsystem * Add an abbreviation (this is the gene name that will appear in the spreadsheet), e.g. FolE, Press ‘Add Role’ → new box to add another role appears. Press ‘Save Changes’ * To add genomes: Open Spreadsheet in a new tab → Select E. coli K12 MG1655, press ‘add selected genome(s) to spreadsheet’ → Have mini-spreadsheet with one gene and one genome * To find other genomes that have FolE and to add them to the spreadsheet: - Click on E. coli gene (peg.2221) → On Annotation Overview page, use Compare Regions tool with ‘collapse close genomes’ to get 100 related genomes - Note that FolE occurs in many diverse bacteria, therefore add all bacterial genomes to spreadsheet - Return to Subsystem Spreadsheet → Check only Bacteria in Domains box → Highlight all genomes in list using Shift key → press ‘add selected genome(s) to spreadsheet’ * To add relevant genes, search first for genes that may be clustered with FolE. Return to FolE Annotation Overview, search for genes often found near FolE - Note 6-pyruvoyl tetrahydrobiopterin synthase (EC 4.2.3.12) quite often translationally coupled to FolE - Therefore copy-paste this gene into the Functional Roles page, add an abbreviation (e.g. PTPS), press ‘Add Role’, ‘Save Changes’ - Return to Spreadsheet, reload page → Shows FolE and PTPS columns → Press ‘Fill All Genomes’ → Populates PTPS column (Note: ‘Refill All Genomes’ also works but is more radical) - To see clustering between FolE and PTPS, click on ‘Color Spreadsheet’ tab, activate color by cluster radio button * Instructor alone does the annotation. We will use a different annotation method to the one above. * To add an as-yet unannotated gene, return to FolE annotation overview → Gene adjacent to folE in E. coli K-12 MG1655 is annotated ‘putative inner membrane protein’, not in subsystems * To annotate this gene: Click on the gene (fig|511145.6.peg.2220) E. coli to recenter display - In Compare Genomes tool, ask for 100 genomes with both cutoffs at 1e-05 → Run Psi-Blast → Conserved domain search gives ‘COG2311, Predicted membrane protein’ - Copy this annotation, paste it into new assignment box in Annotation Overview, press ‘Change’ → Press ‘Feature Evidence’, ‘Tabular Protein Evidence’ - Activate radio button ‘Assign from’ for E. coli, check boxes for 10 very similar proteins → Press ‘Assign Function’ → Boxes for these genes change to white * To add to spreadsheet: copy-paste annotation ‘COG2311, Predicted membrane protein’ into Functional Roles page, add an abbreviation (e.g. COG2311), press ‘Add Role’, ‘Save Changes’ * Go to Spreadsheet, click ‘Fill All Genomes’ → adds new gene and new genomes * Saving the subsystem: Close all subsystem windows. On Subsystems Overview page, click ‘Manage my subsystems’, Check the name of your subsystem, Click ‘Publish Subsystems to Clearinghouse’ * Deleting a subsystem: As above but select ‘Delete selected subsystems’ 7. Annotation syntax, copying a subsystem, subsystem spreadsheet tools * Annotation syntax. Genes can if necessary be given more than one annotation. The syntax and conventions for this are: - Role1 @ Role2 – A monofunctional [‘single domain’] enzyme that can catalyze two different reactions, both role 1 and role 2 - Role1 / Role2 – A bifunctional [often ‘two domain’] enzyme that does two unrelated roles, role 1 and role 2 - Role1 ; Role2 – Does either role 1 or role 2 but it is not clear which (can be used to double-annotate) * When building subsystems, enter Role 1 and Role 2 separately in the Functional Roles table # after annotation is a comment (does not affect annotation) * To copy a subsystem – Go to Subsystems Overview - Locate subsystem of interest, in far-right column click ‘copy’ - In ‘Copy To’ box enter name of new subsystem, click ‘Copy Now’ → Click on link at top of page to access new subsystem * To survey clustering patterns, at top of Spreadsheet click on triangles in ‘# clustered’ box → arranges genes in ascending or descending order of clustering frequency * Displaying all genomes in which a gene is present, or in which it is absent - In dark blue header bar of table, for gene of interest change option in window from ‘All’ to ‘non-empty’ or ‘empty’ (table takes a few seconds to update) * To jump to any given organism, type all or part of its name in the ‘Organism’ box in the header, hit return * To delete a genome, check the box next to genome(s) name(s) and hit ‘Delete selected genomes’ 8. ‘Edit empty cells’ & ‘Show missing with matches’ tools Note: These tools are not supported in UChicago SEED, only in PubSEED * An early step in constructing a subsystem is to homogeneously annotate all the genes encoding each function so that they can all be imported into the spreadsheet. Annotation text string is the only mechanism in SEED that associates genes with Subsystems. Annotations need to match the name of the corresponding Functional Role in a Subsystem perfectly for a gene to be included in Subsystem spreadsheet * When several genes for a given function have been carefully annotated (preferably in diverse organisms), the ‘Show missing with matches’ and ‘Edit empty cells’ tools can be used to propagate the annotation * Edit empty cells tool A strategy to further populate Subsystem spreadsheet – one cell at a time: - In the Spreadsheet page Actions box click ‘Edit empty cells’ button. Question marks [?] are displayed in every unpopulated cell – each of them is an individual entryway into a specialized Search tool for potentially missing genes: - Locate Bacillus anthracis str. 'Ames Ancestor' genome, note empty cell in the FolE column - Click on [?] within it once – a floating window will appear. Click on ‘Find candidates’ menu option - Program will perform a 4-step search for this missing gene, and a report page will be generated. It explains why that particular cell has not been populated: > A clear homolog is present in the genome, but not yet annotated with the required string of text > A homologous ORF, although potentially present, has not been called in this genome (ORF-caller mistake) > No homologs can be identified – neither on protein nor on DNA level - If the search (i) for a matching annotation and (ii) for homologous protein sequences within this genome failed, the program will attempt to run a tblastn of a template gene (the default template can be changed manually) against this genome. If successful, an arrow labeled ‘Q’ will be displayed on the report page: - Click on this arrow → DNA sequence around the hit is displayed, possible start codons, stop codons, Shine-Dalgarno sequences are color coded. Inspect possible ORF: - Inspect suggested locus and help the machine to choose start codon: clicking a round start radio button activates it (note that checking a square stop box inactivates it) - Once a start is selected, press ‘BlastP’ (in the ‘Action’ box) → program will search complete genomes in SEED using the proposed ORF as a query. This takes time, since a live BLAST search is performed against all genomes in SEED (~1000) - as opposed to Similarity Tables, for example, where precomputed homologs are instantly displayed - Sequence is highlighted in yellow. Number immediately above each residue is the number of hits with a residue (any residue) at this position - Number above this is the number of hits where the residue is the same as your query sequence * This is a fast way to see the ‘consensus’ length and sequence of the protein, and hence to judge whether the start you have called is canonical. If necessary – select a different start codon and rerun BlastP * When satisfied, activate a radio button in ‘Select function’ column within ‘BLAST Results’ table – in order to associate annotation with the newly created ORF – and click ‘Create’. A new gene page is created in SEED database and the link to it is provided on the report page. Note, that Similarities for the newly created feature might not be available for several days * You can delete the newly created ORF (or any ORF in SEED) from the corresponding Annotation overview page by using ‘delete feature’ button. Use responsibly! * Press ‘Do not Edit empty cells’ when finished with this tool. No need to press ‘Save edit for empty cells’ (except in special cases, not covered by this introductory tutorial) 9. ‘Show genes in column’ tool * ‘Show genes in column’ tool allows download of all the protein sequences in any column (for further analysis outside SEED, e.g. phylogeny) - In the ‘Columns’ box, highlight the protein family of interest, click ‘Show genes in column’ - Displays a table of all proteins in the column → Check All → Press ‘Show Selected Sequences → Outputs FASTA sequences - Can also make TCoffee or Clustal alignments 10. Searching for genes whose distributions are correlated or anti-correlated with that of a gene of interest Example 1: A gene that replaces the folate metabolism enzyme YgfA Gene tested experimentally hu tH hu , h tG utU ,h fo rI , ut I fo nfo rC D CO G y g 364 fA 3 Gene present Gene absent Firmicutes Streptococcus pneumoniae Clostridium tetani Thermoanaerobacter tengcongensis. Streptococcus pyogenes M5 Streptococcus pyogenes SS1-1 Bacillus halodurans Lactococcus lactis Actinobacteria Corynebacterium glutamicum Streptomyces coelicolor Cyanobacteria Synechocystis sp. Acidobacterium sp. Acidobacteria Solibacter usitatus /δ-Proteobacteria Bacteriovorax marinus Helicobacter pylori Syntrophus aciditrophicus Desulfovibrio vulgaris Sorangium cellulosum Bartonella henselae α-Proteobacteria Nitrobacter winogradskyi Sulfitobacter sp. EE-36 Roseobacter sp. Rhizobium leguminosarum Oceanicaulis alexandrii β-Proteobacteria Neisseria meningitidis Burkholderia mallei Chromobacterium violaceum γ-Proteobacteria Escherichia coli Haemophilus influenzae Yersinia pestis Legionella pneumophila Leptospira interrogans Spirochaetes Treponema pallidum Planctomycetes Blastopirellula marina Chlamydophila pneumoniae Chlamydiales Chlorobi Chlorobium phaebacteroides Croceibacter atlanticus Bacteroidetes Bacteroides fragilis Deinococcus/Thermus Thermus thermophilus Deinococcus geothermalis Chloroflexi Herpetosiphon aurantiacus Thermotogae Thermotoga maritima Thermosipho melanesiensis Aquificae Aquifex aeolicus Elusimicrobium minutum Unclassified Bacteria Fungi Saccharomyces cerevisiae Schizosaccharomyces pombe Plants Arabidopsis thaliana Oryza sativa Animals Homo sapiens Mus musculus Comparative genomics of the histidine degradation pathway. The distribution of histidine degradation genes among representative bacterial and eukaryal genomes in relation to that of the ygfA gene for 5-formyl-THF disposal. Gene colors correspond to the different parts of the pathway. Note that COG3643 and forC genes are fused in mammals and some bacteria, shown by . YgfA (5-formyltetrahydrofolate cycloligase) carries out an essential function in folate metabolism, yet is missing from diverse bacterial genomes (we now know that it is replaced by glutamate formiminotransferase). To find candidates that may replace ygfA: Using the Phylogenetic Profiler tool at JGI http://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=PhylogenProfiler&page=phyloProfileForm * To search for genes that show an opposite distribution to ygfA (and so may substitute for it): - In ‘Find Genes In’ column select Thermotoga maritima (or any other genome in which YgfA is absent) - In ‘With Homologs In’ column select the 4 other genomes above that do not have YgfA, i.e. Streptococcus pyogenes M1 GAS Streptococcus pyogenes SSI-1 Syntrophus aciditrophicus Elusimicrobium minutum - In ‘Without Homologs In’ column select the 5 genomes above that have YgfA, i.e. Escherichia coli K12 Streptococcus pneumoniae D39 Streptomyces coelicolor Aquifex aeolicus Treponema pallidum SS14 - Press ‘Go’ button * Program searches for genes in Thermotoga maritima that are present in the ‘with’ genomes (which do not have YgfA) and absent in the ‘without’ genomes (which have YgfA) → Glutamate formiminotransferase (EC 2.1.2.5) is the only gene found Example 2: Finding the missing folE2 gene in folate synthesis Generating Phylogenetic distribution profiles for specific gene families in IMG *Go to the IMG database http://img.jgi.doe.gov/cgi-bin/w/main.cgi *The first step is to identify the input gene and add it to the cart. From the homepage, hover over the Find Genes and select ‘Gene Search’ from the pull down menu. *On the Gene Search page, a Keyword search can be performed using the gene family name → type in folE. The Filters pull down menu allows narrowing the search. Also, the genome of interest can be chosen from the pull down menu at the bottom of this page → choose Escherichia coli K12 . Then → click GO redirects to your gene page if a genome was selected. If not, you will have to select a row(s) (displaying the Gene Object ID, the Locus Tag, the Gene Product Name, the Gene symbol, and the organism). In all cases, add the gene(s) to your cart by clicking Add Selected to Gene Cart (top left). *Redo the same operations for folK and folP * Then go to the Analysis Cart Tab and select the three genes → Click on the Profile & Alignment tab (upper right), and near the bottom of the page click on Phylogenetic Occurrence Profile to be redirected to the result page. There, A (for Archaea), B (for Bacteria) and E (for Eukarya) indicate that the gene is present, while the absence of letter and presence of a dot (.) instead shows its absence in a genome. The identity of the organisms is acquired by hovering over the letter or dot. Note that they are grouped by phylogeny. This information is summarized below for a subset of genomes. We will see later how building Subsystems in SEED allows phylogenetic distribution information to be extracted quite easily. This is the table from the El Yacoubi paper (PMID: 17032654) We are now going to use this information to find specific genes families that follow a profile using IMG phylogenetic profiler tool. *From the IMG home page hover over the Find Genes tab and select “Phylogenetic Profiler” and “Single Gene” from the drop down menus to direct to the page to build a phylogenetic profile. The genome for which the result will be given is selected in the Find Genes In column. The sets of genomes to be included or exluded are chosen according to the phylogenetic profile generated above. - In ‘Find Genes In’ column select Thermotoga maritima MSB8 - In ‘With Homologs In’ column select the genomes below Neisseria meningitidis FAM18 Bordetella bronchiseptica RB50 Geobacter sulfurreducens PCA Staphylococcus aureus aureus - In ‘Without Homologs In’ column select the 2 genomes above below Escherichia coli str. K-12 substr. MG1655 S. cerevisiae - Press ‘Go’ button The search parameters can be modified, particularly the Min-Taxon Percent with Homologs or Min-Taxon Percent without homologs (bottom of the page), to relax the stringency of your phylogenetic distribution criteria. *After selecting GO, the analysis produces a summary table that can be downloaded. The genes that the requested phylogetic patterns be captured by selecting the row and clicking on Add Selected to the Gene Cart for further analysis. *You should find the COG1469 family or TM0039 family in the list, → click on it and→ Add to cart, *Go the cart and select the folE, TM0039 and folK genes *Then → Click on “Profile & Alignment” Tab and then the “occurrence profile” button . Analyze the distribution. It is important to keep track of the genomes used for the analysis in case the search needs to be repeated at a later time. As a positive control for your genome choices, your gene of interest should come up in the result table as it was initially used to generate your profile. 11. Creating and using subsets There are two types of subsets: (i) Sets of genes in a subsystem that you may wish to display separately from the rest, e.g. just one branch of a metabolic pathway (ii) Sets of genes in a subsystem that you may wish not to distinguish from each other, e.g. nonorthologous protein families that carry out the same function. These subset names start with * * To create the first type of subset, in any subsystem, e.g. ‘Folate Biosynthesis’, click on ‘Subsets’ tab in green bar * Name the subset (short name, e.g. FolEQP for the first 3 genes of folate synthesis), check the genes to include in the subset → Click ‘Create Subset’ → Click ‘Save Editing to Subsets’ button under table * Open spreadsheet in another tab, click on ‘Limit Display’ tab → Highlight your subset in ‘Show Subsets’ window → Click ‘Limit Display’ button → Displays just the genes in the subset * To create the second type of subset, proceed as above but start the subsystem name with * e.g. *FolEQP * Spreadsheet will collapse all the subset genes in a single column → to uncollapse and show the genes separately → Highlight the subset in the ‘Uncollapse’ column, press ‘Limit Display’ 12. Creating and using User Sets of genomes If you are interested in a particular subset of genomes, grouped on parameter(s) other than taxonomy (e.g. Extremophiles), it might be worth the effort to create your own set. * Under ‘Limit Display’ Tab check that the set you need is not yet available under ‘Specific Sets’ or ‘User Sets’. If not – proceed to ‘Add Genomes’ Tab - Under the window with all available genomes locate a smaller ‘User Sets’ window. Highlight any existing genome set and Click ‘Show selected genome list’ - A special WEB page will open that allows creation/editing of subsets of genomes – follow the cues. Note that there is a bug on this page unfortunately: small arrows between two windows ‘List to choose from’ and ‘Created genome list’ are REVERSED. Hence, to add a genome to one's list, one needs to press the [ < ] arrow (pointing leftwards) 13. Variant codes Variants are typically groups of genomes that share a feature, e.g. they lack a pathway (-1 variants) or have a particular version of a pathway. * To create variants, in any subsystem spreadsheet → Click on ‘Show Variants’ tab * In ‘Actions’ box, click on ‘Edit Variants in this table’ - Can assign variants manually in the variant column of the spreadsheet, e.g. -1 for genomes where a pathway is absent - Click on red ‘Save Variants’ button - Default is not to display -1 variants in spreadsheet. To display → Click ‘Show -1 variants’ * Can also edit variants in ‘Variant Overview’ → Click ‘Variant Overview’ button in ‘Actions’ box - Displays a table of all combinations of genes that occur in the subsystem - Assign a variant number in the ‘Set To’ column of the table, click ‘Set Variants’ button below the table - Add a description of the variant to the box above the table, press ‘Add/Save Variants * For more on variants click ‘Check Variants’ button in ‘Actions’ box 14. Go to a previous version of a subsystem * From the SEED subsystem editor go to a given subsystem * Go to Subsystem info Tab * Click on ‘Reset to Previous Timestamp’ → Choose the version you want to reset to 15. Change name for all the proteins for a given Functional role * Go to the Functional role tab of your subsystem * Next to the Functional role click on the ‘ShowFR’ button * On this summary page there is a ‘Change Functional Role’ box → Use with caution 16. Automatically adding new genomes to a subsystem, adding publications to a subsystem * To add new genomes to a subsystem (‘extend’) automatically as they become available (instead of adding them manually from the ‘Add Genomes’ tab in the Subsystem Editor): - Click on ‘Home’ in green bar at top of Subsystem Editor pages - Click ‘Manage my subsystems’ → Check subsystems that you wish to make automatically extendable → Click ‘Make Autom. Extendable’ button * To add publications on a gene to a subsystem: - Click ‘Functional roles’ tab → Click ‘Literature’ column next to gene of interest - May contain ‘proposed’ publications (found by literature mining) that can be accepted or rejected - Can add single or multiple PubMed IDs 17. Changing a gene’s start or stop site Note: These tools are not supported in UChicago SEED, only in PubSEED * From any Annotation Overview page, copy gene ID, e.g. fig|64091.1.peg.1577 * From ‘Comparative Tools’ in toolbar select ‘Find a gene in this organism’ → Paste ID into box, Click ‘Submit Query’ → Click on Q arrow * Displays DNA sequence around the hit, possible start codons, stop codons, Shine-Dalgarno sequences are color coded * Clicking a round start radio button activates it, checking a square stop box inactivates it * When sites are selected, press ‘BlastP’ → searches completed genomes with selected ORF. 160 hits obtained. - Sequence is highlighted in yellow. Number immediately above each residue is the number of hits with a residue (any residue) at this position - Number above this is the number of hits where the residue is the same as your query sequence * This is a fast way to see the ‘consensus’ length and sequence of the protein, and hence to judge whether the start or stop you have called is canonical * When satisfied, return to editing page, click ‘Create’ and follow logic 18. Additional tools to check for presence or absence of a role in multiple genomes (may be omitted) * Show missing with matches tool (very slow) In the Spreadsheet page Actions box: - In the ‘Columns’ box, highlight the protein family (Function) of interest, click ‘Show missing with matches for columns’ (columns in the spreadsheet correspond to functions) - Program will search by BlastP all genomes where that protein is absent from the spreadsheet column for homologs of any of the proteins that are present in the column (‘all by any’) - Output is a table for each organism in which candidates are found; on the left is the hit and its current annotation and on the right is the query protein and its annotation - After inspecting the hits, check the ‘Assign’ box for those you deem correct → Press ‘Process assignments’ to change current annotation to the annotation of the query - A confirmation page appears → Close window - Update spreadsheet by clicking ‘Refill All Genomes’ (or check desired genomes and press ‘Refill Selected Genomes’) 19. Exercises around the “Folate biosynthesis” Subsystem Note: These exercises are not as detailed as above so the students can find how to redo operations we have already done together with other examples. -Adding the FolE2 family to the folate subsystem In UChicago SEED * Open (your initials)_Folate subsystem in UChicago SEED in one tab * Find the annotation page for TM0039 in another TAB * Add the corresponding functional role to your Folate Subsystem, put FolE2 in the abbreviation box (remember this is done in two steps, adding the functional role to the Functional Role table, then updating from the Spreadsheet page). * In the Spreadsheet column headings, choose ‘empty’ for FolE and ‘non-empty’ for FolK → will give the list of genomes that have a missing FolE that we generated (painfully) with IMG and that can be a start for a phylogenetic query. As you see many (but not all) of them have a folE2 We now want to merge the FolE and FolE2 in one column * Create a *FolE subset that includes FolE and FolE2 (see section 12 for instructions) * In the limit your SubSystem Spreadsheet/Limit display TAB by default you are uncollapsed, choose *FolE subset → limit display. One column will now cover the two roles. -Finding a candidate for the missing phosphatase involved in the hydrolysis of DHN-P * Add the Dihydroneopterin aldolase (EC 4.1.2.25) reaction role (abbreviation FolB) to your Folate Subsystem * Make sure you have added the Akkermansia muciniphila ATCC BAA-835 genome to your Subsystem → Go the annotation page of FolB in this specific genome * Using the different tools we have already covered (feature evidence, psi Blast, Compare genomes) what is your interpretation of the function of this protein? Hint: To see more genomes in the Compare Region tool, lower the ‘Evalue cutoff for selection of pinned CDSs’ to 0-5 and also try with and without collapsing close genomes. *Find the exact same peg in PubSEED (from the PubSEED entry page: paste the full Peg that you found in UChicago SEED) and → Search *Explore the Compare genomes tools in PubSEED. You can compare with the view you had in UChicago SEED and see that as more genomes are available the tool is much more informative and is more powerful to make functional predictions. Discussion Pause: How do you think this peg should be annotated: We will discuss this together in class and the Instructor will change the annotation in UChicago SEED. Updated 4/16/13