METABOLIC PATHWAYS & GENOMICS File: genome&p

advertisement
COMPARATIVE GENOMICS WORKSHOP – Days 1 & 2
USING COMPARATIVE GENOMICS RESOURCES - SEED
SEED is a comparative genomics database and workbench. Unlike STRING it is not rigidly pre-computed; the user has a lot of
control.
1. Logging on to SEED as a user
UChicago SEED http://theseed.uchicago.edu/FIG/index.cgi
(PubSEED http://pubseed.theseed.org/seedviewer.cgi )
* On Entry Page, open ‘Use the new Subsystem Editor’ http://theseed.uchicago.edu/FIG/SubsysEditor.cgi in a new window or new
tab
* At top right, enter username in the left box (and if needed a password in right box) and click ‘login’ (use your RAST user name)
* Typical starting points for a comparative genomics project are a protein sequence, a gene name, or an existing subsystem. The
following exercises walk you through all 3 routes.
2. Enter SEED starting from a protein sequence, e.g. the plant FolE protein
>gi|15231435|ref|NP_187383.1| GTP cyclohydrolase I [Arabidopsis thaliana]
MGALDEGCLNLELDIGMKNGCIELAFEHQPETLAIQDAVKLLLQGLHEDVNREGIKKTPFRVAKALREGT
RGYKQKVKDYVQSALFPEAGLDEGVGQAGGVGGLVVVRDLDHYSYCESCLLPFHVKCHIGYVPSGQRVLG
LSKFSRVTDVFAKRLQDPQRLADDICSALQHWVKPAGVAVVLECSHIHFPSLDLDSLNLSSHRGFVKLLV
SSGSGVFEDESSNLWGEFQSFLMFKGVKTQALCRNGSSVKEWCPSVKSSSKLSPEVDPEMVSAVVSILKS
LGEDPLRKELIATPTRFLKWMLNFQRTNLEMKLNSFNPAKVNGEVKEKRLHCELNMPFWSMCEHHLLPFY
GVVHIGYFCAEGSNPNPVGSSLMKAIVHFYGFKLQVQERMTRQIAETLSPLVGGDVIVVAEAGHTCMISR
GIEKFGSSTATIAVLGRFSSDNSARAMFLDKIHTTNALKTESSSPF
* Since SEED can only (conveniently) search one genome at a time, if you do not already know which prokaryotes have homologs
you first need to do BlastP searches of Bacteria and Archaea at NCBI to find a genome that has a homolog of Arabidopsis FolE
and is also in SEED.
* For this exercise, suppose that you have done this and find that Propionibacterium acnes KPA171202 meets these criteria
* Go to the SEED Entry page
* Paste Arabidopsis FolE sequence into the box for Searching DNA or Protein Sequences (in a selected organism).
* Select the Propionibacterium acnes KPA171202 genome to search for homologs from the list near the top of the page → Click
‘Search for matches’
* Click on best hit, peg.255 (PEG= Protein Encoding Gene) → takes you to ‘Annotation overview’ page for this gene. Features:
- Current assignment (top of page) = current annotation
- EC number = link to KEGG
- Sequence link → Fasta files of DNA, protein
- Feature Evidence tools

Visual protein evidence diagrams similarity to other proteins (number can be specified); it is usually preferable to
rank by score (= BlastP e-value) rather than identity

Multiple homologs in the same organism (if any) are shown next to each other [out of the general sorting order (%
identity or E-val)] and are framed in a blue box (if ‘Group By Genome’ box is activated)

Tabular protein evidence is the same data in tabular format
- Run tool → runs various search tools on the protein, e.g. Psi-Blast (default)
- List of Subsystems that include the protein (in this case about 12)
- Compare Regions tool → Displays chromosomal region around your gene plus 4 closest genomes, color coded for
frequency of occurrence, mouse-over to see annotation

Gene arrows denote direction of transcription/translation

Offset genes denote overlapping ORFs (translational coupling)
- To see more genomes, Click ‘Advanced’ → Type in ‘50’, click ‘Draw’ (takes some time to load) → see many genomes, and
clustering patterns with other folate-related genes e.g. DHPS, DHNA, HPPK (fused in some cases)
- To see more very similar genomes, unclick ‘Collapse close genomes’ (by clicking ‘Show all’) , click ‘Draw’
- [To see more distant homologs, lower both cutoffs to 1e-05, click ‘Draw’]
- To ‘rank’ homologs by similarity to the P. acnes protein, raise the ‘coloring’ cutoff to 1e-75, click ‘Draw’ → Note that more
distant homologs now have different colors
- To pick and choose genomes for display – check or uncheck radio buttons on the left, and Click ‘update with selected’ (not
‘Draw’ in this case). If you need to return to original display – reload the page
- The ‘Compare Regions’ tool is the heart of the SEED
- Trees tool → For some PEG pages (not all) a phylogenetic tree tool is available, click ‘Trees’
- From the Tree page, keep all the default parameters and click on a radio button for a specific tree (they are often very similar),
then click ‘Update’. This is a valuable tool to identify fusions, subfamilies and annotation problems. As you see for this tree,
bright colors around radio buttons guide you to fusions.
3. Enter SEED starting from a gene name, e.g. ylgG, encoding DHNTP pyrophosphohydrolase, an early (and until
recently missing) enzyme of folate biosynthesis
* Switch to the SEED Entry page
* Enter ylgG in the Searching for Genes or Functional Roles Using Text Search Pattern box, select Lactococcus lactis subsp.
lactis, click ‘Search genome selected below’
* Click on peg.1190 → takes you to ‘Annotation Overview page’ → proceed as above
* Further exploration of Compare Regions tool:
- Click on ‘Tabular Region Information’ → Dihydroneopterin triphosphate pyrophosphohydolase highlighted in red, plus flanking
genes.
- Click on Dihydroneopterin triphosphate pyrophosphohydolase ‘cluster’ button → will show precomputed gene clusters (if any
exist). In this case, none do.
- Click instead on ‘cluster’ button for dihydrofolate reductase. Note "display ___ items per page' control on the results page.
Enter '50' to see all the potential gene clusters found → one cluster includes thymidylate synthase (the main source of
dihydrofolate)
- Click on ‘Sequences’ tab → Select ‘Protein’ → Click ‘Align Sequences’ → Displays Clustal protein alignment → Use radio
button to select ‘NJTree’ (Neighbor-joining tree), reload
- Displays protein (or DNA) alignment and a tree – note the two clades (currently annotated differently)
4. Enter SEED via Subsystems. To view subsystems in SEED:
* Go to Subsystem Overview page: http://theseed.uchicago.edu/FIG/SubsysEditor.cgi Displays a table of subsystems, generally
with self-explanatory names
* Use browser search tool to find ‘Folate Biosynthesis’. This is a typical ‘populated’ subsystem for a metabolic pathway and
related enzymes. Click on link → Opens to ‘Subsystem Info’ tab
- ‘Subsystem Info’ contains notes on the pathway
- ‘Functional Roles’ lists the enzymes (= functional roles) and their abbreviations
- Open ‘Spreadsheet’ in a new tab → default is to color genes by clustering
- Click on ‘Color Spreadsheet’ tab → to color essential genes, click ‘Color by attribute’ button
- Select ‘Essential_Gene_Sets_Bacterial’ → click ‘Update’ → Colors essential/non-essential genes in organisms for which
comprehensive knockout studies are available (takes time to load)
- Click on ‘Limit Display’ tab → Can show various subsets of data, e.g. just Firmicute genomes, or genomes with essentiality
data available (SvetaG_Ess-ty_data_available in ‘User sets’)
* Find Lactococcus lactis subsp. lactis Il1403, Click on FolE → takes you to ‘Annotation Overview’ page for fig|272623.1.peg.1188
→ proceed as above
* Further exploration of Feature Evidence tools:
- Click on ‘Feature Evidence’ (blue link), then activate ‘Visual Protein Evidence’ tab. Recall that L. lactis FolE is fused to another
folate synthesis enzyme, FolK
- ‘Visual Protein Evidence’ shows diagrammatically how similar proteins align with the L. lactis fusion protein: Color is overall
similarity, white bars are the matching regions
- Note that there are two sets of hits, to C-terminal (FolE) and N-terminal (FolK) domains. To see this clearly – activate ‘Group
by Genome’ radio button, Click ‘Resubmit’. Multiple homologs within the same organism (if any) are now shown next to each
other [out of the general sorting order (% identity or E-val)] and are framed in a blue box
- Note that domain structure of the query protein is shown near the top of the page. When you hover over each domain,
additional information and a link to the corresponding page in Pfam database is displayed.
- To see additional homologs, enter ‘200’ in ‘Max Sims’, click ‘Resubmit’. You can also change the sorting order (% identity,
Score, etc)
If you hover over any colored bar representing a query protein or one of its homologs, a floating window will be displayed,
describing this PEG and the level of homology in this pair (E-value, % identity, length of alignment, etc). Clicking (once) on
any colored homolog bar activates a different floating window, which provides links to (i) the corresponding PEG page, (ii) all
Subsystems associated with this protein, and (iii) the blast alignment (the very last link). Use them to navigate.
- Switch to ‘Tabular Protein Evidence’ → Explore the table. Query sequence appears on top and is highlighted greenish-brown:
> Association with Subsystem(s) is shown for the query and each homolog in the ‘Associated Subsystem’ column on the
right
> Note also ‘Evidence Codes’ column on the right – hover over its heading and click on [?] sign for explanations. Evidence
Codes in SEED differ from that in other databases: They are based on gene clustering (= ‘functional coupling’) on the
chromosome. Literature references (‘direct’ or ‘indirect’) are shown here also
> User can select sequences here (as targets for annotation or for downloading in Fasta format) by activating radio buttons
in the left-most column of the table. Please select several homologs of the query protein. Note that if you switch to ‘Visual
Protein Evidence’ view now, this selection will be carried over (the opposite is also true). Switching to the graphic view can
also be achieved by clicking on a small icon of an alignment near each PEG ID.
- Explore the links at the top left corner of the page, above the Similarity Table:
> ‘Functionally coupled’ link takes you to the list of gene families that tend to cluster with the query family on the
chromosome, the ‘strength of functional coupling’ is shown (‘Score’ column)
- Can click on links to each gene in the table → Takes you to the ‘Annotation Overview’ page
5. Annotating genes.
Many genes in SEED have already been annotated by experts, and are included in subsystems. However, many have not yet
been annotated. As a user, you can annotate genes yourself.
However, you may only do this for genes:
(a) That have not already been annotated by someone else and entered into an existing subsystem, or
(b) That are not closely related to genes that are in an existing subsystem. It is easy to tell whether a gene is in a subsystem, e.g.
from the Annotation Overview page
Basic Rule: If a gene is in a subsystem, leave it alone.
If and only if the gene is not annotated or not similar to genes that are annotated, you can make your own annotation as follows:
* In L. lactis FolE Annotation Overview, click to move left, then click on peg.1177, ‘integral membrane protein’. Click on this gene
to re-center the viewer.
Alternatively, to quickly jump to a different protein page with the same genome – copy the full ID of the query gene
(fig|272623.1.peg.1188), paste it into the small window ‘find’ at the very top right corner of the page, change the gene number
to ‘1177’ and Click ‘find’
* In ‘Compare Regions’ request 40 similar sequences – check that none has a definite annotation or is in any subsystems
* Run Psi-Blast → Conserved Domain search shows that peg.1177 is COG4478, predicted membrane protein
* To replace current annotation with ‘COG4478, predicted membrane protein’ (Instructor alone does the annotation):
- Copy ‘COG4478, predicted membrane protein’ from Conserved Domain search, paste into ‘New Assignment’ box (avoid white
spaces), click ‘Change’ → Changes assignment
- To propagate this annotation to similar genes → Click ‘Annotate Clusters’, activate round radio button on desired name (at top
of table), check square boxes of sequences to be renamed
- Click on ‘annotate’ → All checked sequences are now annotated ‘COG4478, predicted membrane protein’
* There are many other ways to annotate genes in SEED, but this is the basic operation
6. Building a subsystem.
As an example, we will each build a small subsystem using the folE gene and genes associated with it.
* In Subsystem Editor, click ‘Home’ (OR: from any PEG page click ‘Curate Subsystems’ under ‘Navigate’) → Click ‘Create new
subsystem → Opens a new window
* Name your subsystem (your initials)_Folate (e.g. ADH_Folate), write ‘Test’ in Description and in Notes → Click ‘Save Changes’
→ Click on link at top of page to enter new subsystem
* By default your subsystem will be classified as ‘Experimental’ – Leave it that way
* Can change Description and Notes, press red ‘Save Changes’ button. Note: Save every page before you leave it, otherwise it
is lost
* First we will add a gene to the subsystem, then some genomes. To add a gene:
* Open ‘Functional Roles’ in a new tab. To add an existing functional role, copy-paste from elsewhere in SEED. To copy-paste
FolE’s functional role:
- Go to Subsystem Overview, find Folate Biosynthesis subsystem, Click on ‘Functional Roles’ tab → copy ‘GTP cyclohydrolase
I (EC 3.5.4.16) type 1’ → paste into your subsystem
* Add an abbreviation (this is the gene name that will appear in the spreadsheet), e.g. FolE, Press ‘Add Role’ → new box to add
another role appears. Press ‘Save Changes’
* To add genomes: Open Spreadsheet in a new tab → Select E. coli K12 MG1655, press ‘add selected genome(s) to
spreadsheet’ → Have mini-spreadsheet with one gene and one genome
* To find other genomes that have FolE and to add them to the spreadsheet:
- Click on E. coli gene (peg.2221) → On Annotation Overview page, use Compare Regions tool with ‘collapse close genomes’
to get 100 related genomes
- Note that FolE occurs in many diverse bacteria, therefore add all bacterial genomes to spreadsheet
- Return to Subsystem Spreadsheet → Check only Bacteria in Domains box → Highlight all genomes in list using Shift key →
press ‘add selected genome(s) to spreadsheet’
* To add relevant genes, search first for genes that may be clustered with FolE. Return to FolE Annotation Overview, search for
genes often found near FolE
- Note 6-pyruvoyl tetrahydrobiopterin synthase (EC 4.2.3.12) quite often translationally coupled to FolE
- Therefore copy-paste this gene into the Functional Roles page, add an abbreviation (e.g. PTPS), press ‘Add Role’, ‘Save
Changes’
- Return to Spreadsheet, reload page → Shows FolE and PTPS columns → Press ‘Fill All Genomes’ → Populates PTPS
column (Note: ‘Refill All Genomes’ also works but is more radical)
- To see clustering between FolE and PTPS, click on ‘Color Spreadsheet’ tab, activate color by cluster radio button
* Instructor alone does the annotation. We will use a different annotation method to the one above.
* To add an as-yet unannotated gene, return to FolE annotation overview → Gene adjacent to folE in E. coli
annotated ‘putative inner membrane protein’, not in subsystems
K-12 MG1655 is
* To annotate this gene: Click on the gene (fig|511145.6.peg.2220) E. coli to recenter display
- In Compare Genomes tool, ask for 100 genomes with both cutoffs at 1e-05 → Run Psi-Blast → Conserved domain search
gives ‘COG2311, Predicted membrane protein’
- Copy this annotation, paste it into new assignment box in Annotation Overview, press ‘Change’ → Press ‘Feature Evidence’,
‘Tabular Protein Evidence’
- Activate radio button ‘Assign from’ for E. coli, check boxes for 10 very similar proteins → Press ‘Assign Function’ → Boxes for
these genes change to white
* To add to spreadsheet: copy-paste annotation ‘COG2311, Predicted membrane protein’ into Functional Roles page, add an
abbreviation (e.g. COG2311), press ‘Add Role’, ‘Save Changes’
* Go to Spreadsheet, click ‘Fill All Genomes’ → adds new gene and new genomes
* Saving the subsystem: Close all subsystem windows. On Subsystems Overview page, click ‘Manage my subsystems’, Check
the name of your subsystem, Click ‘Publish Subsystems to Clearinghouse’
* Deleting a subsystem: As above but select ‘Delete selected subsystems’
7. Annotation syntax, copying a subsystem, subsystem spreadsheet tools
* Annotation syntax. Genes can if necessary be given more than one annotation. The syntax and conventions for this are:
- Role1 @ Role2 – A monofunctional [‘single domain’] enzyme that can catalyze two different reactions, both role 1 and role 2
- Role1 / Role2 – A bifunctional [often ‘two domain’] enzyme that does two unrelated roles, role 1 and role 2
- Role1 ; Role2 – Does either role 1 or role 2 but it is not clear which (can be used to double-annotate)
* When building subsystems, enter Role 1 and Role 2 separately in the Functional Roles table
# after annotation is a comment (does not affect annotation)
* To copy a subsystem – Go to Subsystems Overview
- Locate subsystem of interest, in far-right column click ‘copy’
- In ‘Copy To’ box enter name of new subsystem, click ‘Copy Now’ → Click on link at top of page to access new subsystem
* To survey clustering patterns, at top of Spreadsheet click on triangles in ‘# clustered’ box → arranges genes in ascending or
descending order of clustering frequency
* Displaying all genomes in which a gene is present, or in which it is absent
- In dark blue header bar of table, for gene of interest change option in window from ‘All’ to ‘non-empty’ or ‘empty’ (table takes
a few seconds to update)
* To jump to any given organism, type all or part of its name in the ‘Organism’ box in the header, hit return
* To delete a genome, check the box next to genome(s) name(s) and hit ‘Delete selected genomes’
8. ‘Edit empty cells’ & ‘Show missing with matches’ tools
Note: These tools are not supported in UChicago SEED, only in PubSEED
* An early step in constructing a subsystem is to homogeneously annotate all the genes encoding each function so that they can
all be imported into the spreadsheet. Annotation text string is the only mechanism in SEED that associates genes with
Subsystems. Annotations need to match the name of the corresponding Functional Role in a Subsystem perfectly for a gene to
be included in Subsystem spreadsheet
* When several genes for a given function have been carefully annotated (preferably in diverse organisms), the ‘Show missing
with matches’ and ‘Edit empty cells’ tools can be used to propagate the annotation
* Edit empty cells tool A strategy to further populate Subsystem spreadsheet – one cell at a time:
- In the Spreadsheet page Actions box click ‘Edit empty cells’ button. Question marks [?] are displayed in every unpopulated
cell – each of them is an individual entryway into a specialized Search tool for potentially missing genes:
- Locate Bacillus anthracis str. 'Ames Ancestor' genome, note empty cell in the FolE column
- Click on [?] within it once – a floating window will appear. Click on ‘Find candidates’ menu option
- Program will perform a 4-step search for this missing gene, and a report page will be generated. It explains why that particular
cell has not been populated:
> A clear homolog is present in the genome, but not yet annotated with the required string of text
> A homologous ORF, although potentially present, has not been called in this genome (ORF-caller mistake)
> No homologs can be identified – neither on protein nor on DNA level
- If the search (i) for a matching annotation and (ii) for homologous protein sequences within this genome failed, the program
will attempt to run a tblastn of a template gene (the default template can be changed manually) against this genome. If
successful, an arrow labeled ‘Q’ will be displayed on the report page:
- Click on this arrow → DNA sequence around the hit is displayed, possible start codons, stop codons, Shine-Dalgarno
sequences are color coded. Inspect possible ORF:
- Inspect suggested locus and help the machine to choose start codon: clicking a round start radio button activates it (note
that checking a square stop box inactivates it)
- Once a start is selected, press ‘BlastP’ (in the ‘Action’ box) → program will search complete genomes in SEED using the
proposed ORF as a query. This takes time, since a live BLAST search is performed against all genomes in SEED (~1000) - as
opposed to Similarity Tables, for example, where precomputed homologs are instantly displayed
- Sequence is highlighted in yellow. Number immediately above each residue is the number of hits with a residue (any residue)
at this position
- Number above this is the number of hits where the residue is the same as your query sequence
* This is a fast way to see the ‘consensus’ length and sequence of the protein, and hence to judge whether the start you have
called is canonical. If necessary – select a different start codon and rerun BlastP
* When satisfied, activate a radio button in ‘Select function’ column within ‘BLAST Results’ table – in order to associate annotation
with the newly created ORF – and click ‘Create’. A new gene page is created in SEED database and the link to it is provided on
the report page. Note, that Similarities for the newly created feature might not be available for several days
* You can delete the newly created ORF (or any ORF in SEED) from the corresponding Annotation overview page by using ‘delete
feature’ button. Use responsibly!
* Press ‘Do not Edit empty cells’ when finished with this tool. No need to press ‘Save edit for empty cells’ (except in special cases,
not covered by this introductory tutorial)
9. ‘Show genes in column’ tool
* ‘Show genes in column’ tool allows download of all the protein sequences in any column (for further analysis outside SEED,
e.g. phylogeny)
- In the ‘Columns’ box, highlight the protein family of interest, click ‘Show genes in column’
- Displays a table of all proteins in the column → Check All → Press ‘Show Selected Sequences → Outputs FASTA sequences
- Can also make TCoffee or Clustal alignments
10. Searching for genes whose distributions are correlated or anti-correlated with that of a gene of
interest
Example 1: A gene that replaces the folate metabolism enzyme YgfA
Gene tested experimentally
hu
tH
hu , h
tG utU
,h
fo
rI ,
ut
I
n
fo
rC foD
CO
G
y g 364
fA 3
Gene present
Gene absent
Firmicutes
Streptococcus pneumoniae
Clostridium tetani
Thermoanaerobacter tengcongensis.
Streptococcus pyogenes M5
Streptococcus pyogenes SS1-1
Bacillus halodurans
Lactococcus lactis
Actinobacteria
Corynebacterium glutamicum
Streptomyces coelicolor
Cyanobacteria
Synechocystis sp.
Acidobacterium sp.
Acidobacteria
Solibacter usitatus
/δ-Proteobacteria
Bacteriovorax marinus
Helicobacter pylori
Syntrophus aciditrophicus
Desulfovibrio vulgaris
Sorangium cellulosum
Bartonella henselae
α-Proteobacteria
Nitrobacter winogradskyi
Sulfitobacter sp. EE-36
Roseobacter sp.
Rhizobium leguminosarum
Oceanicaulis alexandrii
β-Proteobacteria
Neisseria meningitidis
Burkholderia mallei
Chromobacterium violaceum
γ-Proteobacteria
Escherichia coli
Haemophilus influenzae
Yersinia pestis
Legionella pneumophila
Leptospira interrogans
Spirochaetes
Treponema pallidum
Planctomycetes
Blastopirellula marina
Chlamydophila pneumoniae
Chlamydiales
Chlorobi
Chlorobium phaebacteroides
Croceibacter atlanticus
Bacteroidetes
Bacteroides fragilis
Deinococcus/Thermus Thermus thermophilus
Deinococcus geothermalis
Chloroflexi
Herpetosiphon aurantiacus
Thermotogae
Thermotoga maritima
Thermosipho melanesiensis
Aquificae
Aquifex aeolicus
Elusimicrobium minutum
Unclassified Bacteria
Fungi
Saccharomyces cerevisiae
Schizosaccharomyces pombe
Plants
Arabidopsis thaliana
Oryza sativa
Animals
Homo sapiens
Mus musculus
Comparative genomics of the histidine degradation pathway.
The distribution of histidine degradation
genes among representative bacterial and
eukaryal genomes in relation to that of the
ygfA gene for 5-formyl-THF disposal. Gene
colors
correspond to the different
parts of the pathway. Note that COG3643
and forC genes are fused in mammals and
some bacteria, shown by
.
YgfA (5-formyltetrahydrofolate cycloligase) carries out an essential function in folate metabolism, yet is missing from diverse
bacterial genomes (we now know that it is replaced by glutamate formiminotransferase). To find candidates that may replace ygfA:
Using the Phylogenetic Profiler tool at JGI
https://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=PhylogenProfiler&page=phyloProfileForm
* To search for genes that show an opposite distribution to ygfA (and so may substitute for it):
- In ‘Find Genes In’ column select Thermotoga maritima (or any other genome in which YgfA is absent)
- In ‘With Homologs In’ column select the 4 other genomes above that do not have YgfA, i.e.
Streptococcus pyogenes M1 GAS
Streptococcus pyogenes SSI-1
Syntrophus aciditrophicus
Elusimicrobium minutum
- In ‘Without Homologs In’ column select the 5 genomes above that have YgfA, i.e.
Escherichia coli K12
Streptococcus pneumoniae D39
Streptomyces coelicolor
Aquifex aeolicus
Treponema pallidum SS14
- Press ‘Go’ button
* Program searches for genes in Thermotoga maritima that are present in the ‘with’ genomes (which do not have YgfA) and
absent in the ‘without’ genomes (which have YgfA) → Glutamate formiminotransferase (EC 2.1.2.5) is the only gene found
Example 2: Finding the missing folE2 gene in folate synthesis
Generating Phylogenetic distribution profiles for specific gene families in IMG
*Go to the IMG database http://img.jgi.doe.gov/cgi-bin/edu/main.cgi
*The first step is to input your selected genomes. From the homepage, hover over the FIND GENOMES and select ‘Genome
Search from the pull down menu.
Find the
Thermotoga maritima MSB8 genome and add it to the cart ( if you have several sequences just add one)
Add all the other genomes one by one:
Bordetella bronchiseptica RB50
Neisseria meningitidis FAM18
Geobacter sulfurreducens PCA
Staphylococcus aureus aureus RF122
Escherichia coli str. K-12 substr. MG1655
Saccharomyces cerevisiae S288C
Bacillus subtilis subtilis 168
*The second step is to identify the input functions and add them to the cart. From the homepage, hover over the FIND FUNCTION
and select ‘Search Pathways’ from the pull down menu.
*On the ‘Search Pathways’ page, select only the KEGG box in the pathway type and type in ‘Folate’ in the Keyword search and
then hit the “Search all pathways” button. Click on the OID number for the “folate synthesis pathway. This will bring you to all the
enzymes of the folate pathway. Select the following EC numbers: EC:3.5.4.16, EC:2.7.6.3, EC:4.1.2.25 and “ADD these function
to the analysis cart”. Go the “Analysis Cart/Function” Tab. You should see the three EC numbers, select them all if they are not
already selected.
* Go the “Profile and Alignment” Tab. In the Domain pull-down menu, select Genome Cart and hit the Show button.
* Go and view “Functions versus Genomes” or the “Phylogenetic Occurrence Profile”. These are two ways to see the same
information. What does it mean?
This information is summarized below for a larger subset of genomes. We will see later how building Subsystems in SEED allows
phylogenetic distribution information to be extracted quite easily.
This is the table from the El Yacoubi paper (PMID: 17032654)
We are now going to use this information to find specific genes families that follow a profile using IMG phylogenetic profiler tool.
*From the IMG home page hover over the Find Genes tab and select “Phylogenetic Profiler” and “Single Gene” from the drop
down menus to direct to the page to build a phylogenetic profile.
- ADD Thermotoga maritima MSB8 in the ‘Find Genes In’ box
- ADD to the ‘With Homologs In’ box the genomes below
Neisseria meningitidis FAM18
Bordetella bronchiseptica RB50
Geobacter sulfurreducens PCA
Staphylococcus aureus aureus RF122
- Add to the ‘Without Homologs In’ box the 2 genomes below
Escherichia coli str. K-12 substr. MG1655
Saccharomyces cerevisiae S288C
The search parameters can be modified, particularly the Min-Taxon Percent with Homologs or Min-Taxon Percent without
homologs (bottom of the page), to relax the stringency of your phylogenetic distribution criteria. Select the COG number in the
“Function display options”
- Press ‘SUBMIT’ button
*After selecting GO, the analysis produces a summary table that can be downloaded.
*You should find the COG1469 family or TM0039 family in the list, → click on the gene object ID. It will bring you to the gene
page. Scroll down to the “conserved neighborhood” box. Click on the link to “top neighborhood regions” (the first one), what
does it tell you?
11. Creating and using subsets
There are two types of subsets:
(i) Sets of genes in a subsystem that you may wish to display separately from the rest, e.g. just one branch of a metabolic
pathway
(ii) Sets of genes in a subsystem that you may wish not to distinguish from each other, e.g. non-orthologous protein families that
carry out the same function. These subset names start with *
* To create the first type of subset, in any subsystem, e.g. ‘Folate Biosynthesis’, click on ‘Subsets’ tab in green bar
* Name the subset (short name, e.g. FolEQP for the first 3 genes of folate synthesis), check the genes to include in the subset →
Click ‘Create Subset’ → Click ‘Save Editing to Subsets’ button under table
* Open spreadsheet in another tab, click on ‘Limit Display’ tab → Highlight your subset in ‘Show Subsets’ window → Click ‘Limit
Display’ button → Displays just the genes in the subset
* To create the second type of subset, proceed as above but start the subsystem name with * e.g. *FolEQP
* Spreadsheet will collapse all the subset genes in a single column → to uncollapse and show the genes separately → Highlight
the subset in the ‘Uncollapse’ column, press ‘Limit Display’
12. Creating and using User Sets of genomes
If you are interested in a particular subset of genomes, grouped on parameter(s) other than taxonomy (e.g. Extremophiles), it
might be worth the effort to create your own set.
* Under ‘Limit Display’ Tab check that the set you need is not yet available under ‘Specific Sets’ or ‘User Sets’. If not – proceed to
‘Add Genomes’ Tab
- Under the window with all available genomes locate a smaller ‘User Sets’ window. Highlight any existing genome set and
Click ‘Show selected genome list’
- A special WEB page will open that allows creation/editing of subsets of genomes – follow the cues. Note that there is a bug
on this page unfortunately: small arrows between two windows ‘List to choose from’ and ‘Created genome list’ are REVERSED.
Hence, to add a genome to one's list, one needs to press the [ < ] arrow (pointing leftwards)
13. Variant codes
Variants are typically groups of genomes that share a feature, e.g. they lack a pathway (-1 variants) or have a particular version of
a pathway.
* To create variants, in any subsystem spreadsheet → Click on ‘Show Variants’ tab
* In ‘Actions’ box, click on ‘Edit Variants in this table’
- Can assign variants manually in the variant column of the spreadsheet, e.g. -1 for genomes where a pathway is absent
- Click on red ‘Save Variants’ button
- Default is not to display -1 variants in spreadsheet. To display → Click ‘Show -1 variants’
* Can also edit variants in ‘Variant Overview’ → Click ‘Variant Overview’ button in ‘Actions’ box
- Displays a table of all combinations of genes that occur in the subsystem
- Assign a variant number in the ‘Set To’ column of the table, click ‘Set Variants’ button below the table
- Add a description of the variant to the box above the table, press ‘Add/Save Variants
* For more on variants click ‘Check Variants’ button in ‘Actions’ box
14. Go to a previous version of a subsystem
* From the SEED subsystem editor go to a given subsystem
* Go to Subsystem info Tab
* Click on ‘Reset to Previous Timestamp’ → Choose the version you want to reset to
15. Change name for all the proteins for a given Functional role
* Go to the Functional role tab of your subsystem
* Next to the Functional role click on the ‘ShowFR’ button
* On this summary page there is a ‘Change Functional Role’ box → Use with caution
16. Automatically adding new genomes to a subsystem, adding publications to a subsystem
* To add new genomes to a subsystem (‘extend’) automatically as they become available (instead of adding them manually from
the ‘Add Genomes’ tab in the Subsystem Editor):
- Click on ‘Home’ in green bar at top of Subsystem Editor pages
- Click ‘Manage my subsystems’ → Check subsystems that you wish to make automatically extendable → Click ‘Make Autom.
Extendable’ button
* To add publications on a gene to a subsystem:
- Click ‘Functional roles’ tab → Click ‘Literature’ column next to gene of interest
- May contain ‘proposed’ publications (found by literature mining) that can be accepted or rejected
- Can add single or multiple PubMed IDs
17. Changing a gene’s start or stop site
Note: These tools are not supported in UChicago SEED, only in PubSEED
* From any Annotation Overview page, copy gene ID, e.g. fig|64091.1.peg.1577
* From ‘Comparative Tools’ in toolbar select ‘Find a gene in this organism’ → Paste ID into box, Click ‘Submit Query’ → Click on Q
arrow
* Displays DNA sequence around the hit, possible start codons, stop codons, Shine-Dalgarno sequences are color coded
* Clicking a round start radio button activates it, checking a square stop box inactivates it
* When sites are selected, press ‘BlastP’ → searches completed genomes with selected ORF. 160 hits obtained.
- Sequence is highlighted in yellow. Number immediately above each residue is the number of hits with a residue (any residue)
at this position
- Number above this is the number of hits where the residue is the same as your query sequence
* This is a fast way to see the ‘consensus’ length and sequence of the protein, and hence to judge whether the start or stop you
have called is canonical
* When satisfied, return to editing page, click ‘Create’ and follow logic
18. Additional tools to check for presence or absence of a role in multiple genomes
(may be omitted)
* Show missing with matches tool (very slow)
In the Spreadsheet page Actions box:
- In the ‘Columns’ box, highlight the protein family (Function) of interest, click ‘Show missing with matches for columns’
(columns in the spreadsheet correspond to functions)
- Program will search by BlastP all genomes where that protein is absent from the spreadsheet column for homologs of any of
the proteins that are present in the column (‘all by any’)
- Output is a table for each organism in which candidates are found; on the left is the hit and its current annotation and on the
right is the query protein and its annotation
- After inspecting the hits, check the ‘Assign’ box for those you deem correct → Press ‘Process assignments’ to change current
annotation to the annotation of the query
- A confirmation page appears → Close window
- Update spreadsheet by clicking ‘Refill All Genomes’ (or check desired genomes and press ‘Refill Selected Genomes’)
19. Exercises around the “Folate biosynthesis” Subsystem
Note: These exercises are not as detailed as above so the students can find how to redo operations we have already done
together with other examples.
-Adding the FolE2 family to the folate subsystem
In UChicago SEED
* Open (your initials)_Folate subsystem in UChicago SEED in one tab
* Find the annotation page for TM0039 in another TAB
* Add the corresponding functional role to your Folate Subsystem, put FolE2 in the abbreviation box (remember this is done in two
steps, adding the functional role to the Functional Role table, then updating from the Spreadsheet page).
* In the Spreadsheet column headings, choose ‘empty’ for FolE and ‘non-empty’ for FolK → will give the list of genomes that have
a missing FolE that we generated (painfully) with IMG and that can be a start for a phylogenetic query. As you see many (but not
all) of them have a folE2
We now want to merge the FolE and FolE2 in one column
* Create a *FolE subset that includes FolE and FolE2 (see section 12 for instructions)
* In the limit your SubSystem Spreadsheet/Limit display TAB by default you are uncollapsed, choose *FolE subset → limit
display. One column will now cover the two roles.
-Finding a candidate for the missing phosphatase involved in the hydrolysis of DHN-P
* Add the Dihydroneopterin aldolase (EC 4.1.2.25) reaction role (abbreviation FolB) to your Folate Subsystem
* Make sure you have added the Akkermansia muciniphila ATCC BAA-835 genome to your Subsystem
→ Go the annotation page of FolB in this specific genome
* Using the different tools we have already covered (feature evidence, psi Blast, Compare genomes) what is your interpretation of
the function of this protein?
Hint: To see more genomes in the Compare Region tool, lower the ‘Evalue cutoff for selection of pinned CDSs’ to 0-5 and also try
with and without collapsing close genomes.
*Find the exact same peg in PubSEED (from the PubSEED entry page: paste the full Peg that you found in UChicago SEED) and
→ Search
*Explore the Compare genomes tools in PubSEED. You can compare with the view you had in UChicago SEED and see that as
more genomes are available the tool is much more informative and is more powerful to make functional predictions.
Discussion Pause: How do you think this peg should be annotated: We will discuss this together in class and the Instructor will
change the annotation in UChicago SEED.
Updated 6/1/14
Download