COMPARATIVE GENOMICS WORKSHOP – Days 1 & 2

USING COMPARATIVE GENOMICS RESOURCES - SEED
SEED is a comparative genomics database and workbench. Unlike STRING it is not rigidly pre-computed;
the user has a lot of control.
1. Logging on to SEED as a user
UChicago SEED http://theseed.uchicago.edu/FIG/index.cgi
(PubSEED http://pubseed.theseed.org/seedviewer.cgi )
* On Entry Page, open ‘Use the new Subsystem Editor’ http://theseed.uchicago.edu/FIG/SubsysEditor.cgi in
a new window or new tab
* At top right, enter username in the left box (and if needed a password in right box) and click ‘login’ (use
your RAST user name)
* Typical starting points for a comparative genomics project are a protein sequence, a gene name, or an
existing subsystem. The following exercises walk you through all 3 routes.
2. Enter SEED starting from a protein sequence, e.g. the plant FolE protein
>gi|15231435|ref|NP_187383.1| GTP cyclohydrolase I [Arabidopsis thaliana]
MGALDEGCLNLELDIGMKNGCIELAFEHQPETLAIQDAVKLLLQGLHEDVNREGIKKTPFRVAKALREGT
RGYKQKVKDYVQSALFPEAGLDEGVGQAGGVGGLVVVRDLDHYSYCESCLLPFHVKCHIGYVPSGQRVLG
LSKFSRVTDVFAKRLQDPQRLADDICSALQHWVKPAGVAVVLECSHIHFPSLDLDSLNLSSHRGFVKLLV
SSGSGVFEDESSNLWGEFQSFLMFKGVKTQALCRNGSSVKEWCPSVKSSSKLSPEVDPEMVSAVVSILKS
LGEDPLRKELIATPTRFLKWMLNFQRTNLEMKLNSFNPAKVNGEVKEKRLHCELNMPFWSMCEHHLLPFY
GVVHIGYFCAEGSNPNPVGSSLMKAIVHFYGFKLQVQERMTRQIAETLSPLVGGDVIVVAEAGHTCMISR
GIEKFGSSTATIAVLGRFSSDNSARAMFLDKIHTTNALKTESSSPF
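Before pasting a sequence into a search box it can help to check that the FASTA record is well formed. The following is a minimal Python sketch, not part of SEED, that reads a single record and splits a 'gi|...|ref|...|' style header like the one above into its fields. The function names read_fasta and parse_gi_header are illustrative, and only the first sequence line is included for brevity.

```python
def read_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    # Sequence lines are simply concatenated; line breaks carry no meaning.
    return lines[0][1:], "".join(lines[1:])


def parse_gi_header(header):
    """Split 'gi|<gi>|<db>|<accession>| description [organism]' into parts."""
    fields = header.split("|")
    if len(fields) < 5 or fields[0] != "gi":
        raise ValueError("not a gi|...|ref|...| style header")
    description = fields[4].strip()
    organism = ""
    # The organism name, when present, is the trailing [bracketed] part.
    if description.endswith("]") and "[" in description:
        description, _, organism = description[:-1].rpartition("[")
        description = description.strip()
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": fields[3],
        "description": description,
        "organism": organism,
    }


# Header taken from the handout; sequence truncated to its first line.
record = """>gi|15231435|ref|NP_187383.1| GTP cyclohydrolase I [Arabidopsis thaliana]
MGALDEGCLNLELDIGMKNGCIELAFEHQPETLAIQDAVKLLLQGLHEDVNREGIKKTPFRVAKALREGT
"""
header, sequence = read_fasta(record)
info = parse_gi_header(header)
print(info["accession"], "|", info["description"], "|", info["organism"])
print("residues read:", len(sequence))
```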
* Since SEED can only (conveniently) search one genome at a time, if you do not already know which
prokaryotes have homologs you first need to do BlastP searches of Bacteria and Archaea at NCBI to find a
genome that has a homolog of Arabidopsis FolE and is also in SEED.
* For this exercise, suppose that you have done this and find that Propionibacterium acnes KPA171202
meets these criteria
* Go to the SEED Entry page
* Paste Arabidopsis FolE sequence into the box for Searching DNA or Protein Sequences (in a selected
organism).
* Select the Propionibacterium acnes KPA171202 genome to search for homologs from the list near the top
of the page → Click ‘Search for matches’
* Click on best hit, peg.255 (PEG= Protein Encoding Gene) → takes you to ‘Annotation overview’ page for
this gene. Features:
- Current assignment (top of page) = current annotation
- EC number = link to KEGG
- Sequence link → Fasta files of DNA, protein
- Feature Evidence tools

> Visual protein evidence diagrams the similarity to other proteins (number can be specified); it is
usually preferable to rank by score (= BlastP e-value) rather than identity
> Multiple homologs in the same organism (if any) are shown next to each other [out of the
general sorting order (% identity or E-val)] and are framed in a blue box (if ‘Group By Genome’
box is activated)
> Tabular protein evidence is the same data in tabular format
- Run tool → runs various search tools on the protein, e.g. Psi-Blast (default)
- List of Subsystems that include the protein (in this case about 12)
- Compare Regions tool → Displays chromosomal region around your gene plus 4 closest genomes,
color coded for frequency of occurrence, mouse-over to see annotation

> Gene arrows denote direction of transcription/translation
> Offset genes denote overlapping ORFs (translational coupling)
- To see more genomes, Click ‘Advanced’ → Type in ‘50’, click ‘Draw’ (takes some time to load) → see
many genomes, and clustering patterns with other folate-related genes e.g. DHPS, DHNA, HPPK (fused
in some cases)
- To see more very similar genomes, unclick ‘Collapse close genomes’ (by clicking ‘Show all’), then click
‘Draw’
- [To see more distant homologs, lower both cutoffs to 1e-05, click ‘Draw’]
- To ‘rank’ homologs by similarity to the P. acnes protein, raise the ‘coloring’ cutoff to 1e-75, click
‘Draw’ → Note that more distant homologs now have different colors
- To pick and choose genomes for display – check or uncheck radio buttons on the left, and Click ‘update
with selected’ (not ‘Draw’ in this case). If you need to return to original display – reload the page
- The ‘Compare Regions’ tool is the heart of the SEED
- Trees tool → For some PEG pages (not all) a phylogenetic tree tool is available, click ‘Trees’
- From the Tree page, keep all the default parameters and click on a radio button for a specific tree (they
are often very similar), then click ‘Update’. This is a valuable tool to identify fusions, subfamilies and
annotation problems. As you see for this tree, bright colors around radio buttons guide you to fusions.
3. Enter SEED starting from a gene name, e.g. ylgG, encoding DHNTP pyrophosphohydrolase, an
early (and until recently missing) enzyme of folate biosynthesis
* Switch to the SEED Entry page
* Enter ylgG in the Searching for Genes or Functional Roles Using Text Search Pattern box, select
Lactococcus lactis subsp. lactis, click ‘Search genome selected below’
* Click on peg.1190 → takes you to ‘Annotation Overview page’ → proceed as above
* Further exploration of Compare Regions tool:
- Click on ‘Tabular Region Information’ → Dihydroneopterin triphosphate pyrophosphohydrolase
highlighted in red, plus flanking genes.
- Click on Dihydroneopterin triphosphate pyrophosphohydrolase ‘cluster’ button → will show
precomputed gene clusters (if any exist). In this case, none do.
- Click instead on ‘cluster’ button for dihydrofolate reductase. Note the ‘display ___ items per page’ control
on the results page. Enter ‘50’ to see all the potential gene clusters found → one cluster includes
thymidylate synthase (the main source of dihydrofolate)
- Click on ‘Sequences’ tab → Select ‘Protein’ → Click ‘Align Sequences’ → Displays Clustal protein
alignment → Use radio button to select ‘NJTree’ (Neighbor-joining tree), reload
- Displays protein (or DNA) alignment and a tree – note the two clades (currently annotated differently)
4. Enter SEED via Subsystems. To view subsystems in SEED:
* Go to Subsystem Overview page: http://theseed.uchicago.edu/FIG/SubsysEditor.cgi Displays a table of
subsystems, generally with self-explanatory names
* Use browser search tool to find ‘Folate Biosynthesis’. This is a typical ‘populated’ subsystem for a
metabolic pathway and related enzymes. Click on link → Opens to ‘Subsystem Info’ tab
- ‘Subsystem Info’ contains notes on the pathway
- ‘Functional Roles’ lists the enzymes (= functional roles) and their abbreviations
- Open ‘Spreadsheet’ in a new tab → default is to color genes by clustering
- Click on ‘Color Spreadsheet’ tab → to color essential genes, click ‘Color by attribute’ button
- Select ‘Essential_Gene_Sets_Bacterial’ → click ‘Update’ → Colors essential/non-essential genes in
organisms for which comprehensive knockout studies are available (takes time to load)
- Click on ‘Limit Display’ tab → Can show various subsets of data, e.g. just Firmicute genomes, or
genomes with essentiality data available (SvetaG_Ess-ty_data_available in ‘User sets’)
* Find Lactococcus lactis subsp. lactis Il1403, Click on FolE → takes you to ‘Annotation Overview’ page
for fig|272623.1.peg.1188 → proceed as above
* Further exploration of Feature Evidence tools:
- Click on ‘Feature Evidence’ (blue link), then activate ‘Visual Protein Evidence’ tab. Recall that L. lactis
FolE is fused to another folate synthesis enzyme, FolK
- ‘Visual Protein Evidence’ shows diagrammatically how similar proteins align with the L. lactis fusion
protein: Color is overall similarity, white bars are the matching regions
- Note that there are two sets of hits, to C-terminal (FolE) and N-terminal (FolK) domains. To see this
clearly – activate ‘Group by Genome’ radio button, Click ‘Resubmit’. Multiple homologs within the
same organism (if any) are now shown next to each other [out of the general sorting order (% identity or
E-val)] and are framed in a blue box
- Note that domain structure of the query protein is shown near the top of the page. When you hover over
each domain, additional information and a link to the corresponding page in Pfam database is displayed.
- To see additional homologs, enter ‘200’ in ‘Max Sims’, click ‘Resubmit’. You can also change the
sorting order (% identity, Score, etc)
If you hover over any colored bar representing a query protein or one of its homologs, a floating window
will be displayed, describing this PEG and the level of homology in this pair (E-value, % identity, length
of alignment, etc). Clicking (once) on any colored homolog bar activates a different floating window,
which provides links to (i) the corresponding PEG page, (ii) all Subsystems associated with this protein,
and (iii) the blast alignment (the very last link). Use them to navigate.
- Switch to ‘Tabular Protein Evidence’ → Explore the table. Query sequence appears on top and is
highlighted greenish-brown:
> Association with Subsystem(s) is shown for the query and each homolog in the ‘Associated
Subsystem’ column on the right
> Note also ‘Evidence Codes’ column on the right – hover over its heading and click on [?] sign for
explanations. Evidence Codes in SEED differ from those in other databases: They are based on gene
clustering (= ‘functional coupling’) on the chromosome. Literature references (‘direct’ or ‘indirect’)
are shown here also
> User can select sequences here (as targets for annotation or for downloading in Fasta format) by
activating radio buttons in the left-most column of the table. Please select several homologs of the
query protein. Note that if you switch to ‘Visual Protein Evidence’ view now, this selection will be
carried over (the opposite is also true). Switching to the graphic view can also be achieved by
clicking on a small icon of an alignment near each PEG ID.
- Explore the links at the top left corner of the page, above the Similarity Table:
> ‘Functionally coupled’ link takes you to the list of gene families that tend to cluster with the query
family on the chromosome, the ‘strength of functional coupling’ is shown (‘Score’ column)
- Can click on links to each gene in the table → Takes you to the ‘Annotation Overview’ page
5. Annotating genes.
Many genes in SEED have already been annotated by experts, and are included in subsystems. However,
many have not yet been annotated. As a user, you can annotate genes yourself.
However, you may only do this for genes:
(a) That have not already been annotated by someone else and entered into an existing subsystem, or
(b) That are not closely related to genes that are in an existing subsystem. It is easy to tell whether a gene is
in a subsystem, e.g. from the Annotation Overview page
Basic Rule: If a gene is in a subsystem, leave it alone.
If and only if the gene is not annotated or not similar to genes that are annotated, you can make your own
annotation as follows:
* In L. lactis FolE Annotation Overview, click to move left, then click on peg.1177, ‘integral membrane
protein’. Click on this gene to re-center the viewer.
Alternatively, to quickly jump to a different protein page within the same genome – copy the full ID of the
query gene (fig|272623.1.peg.1188), paste it into the small ‘find’ window at the very top right corner of
the page, change the gene number to ‘1177’ and Click ‘find’
* In ‘Compare Regions’ request 40 similar sequences – check that none has a definite annotation or is in any
subsystems
* Run Psi-Blast → Conserved Domain search shows that peg.1177 is COG4478, predicted membrane
protein
* To replace current annotation with ‘COG4478, predicted membrane protein’ (Instructor alone does the
annotation):
- Copy ‘COG4478, predicted membrane protein’ from Conserved Domain search, paste into ‘New
Assignment’ box (avoid white spaces), click ‘Change’ → Changes assignment
- To propagate this annotation to similar genes → Click ‘Annotate Clusters’, activate round radio button
on desired name (at top of table), check square boxes of sequences to be renamed
- Click on ‘annotate’ → All checked sequences are now annotated ‘COG4478, predicted membrane
protein’
* There are many other ways to annotate genes in SEED, but this is the basic operation
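The ‘find’ shortcut above relies on the regular structure of SEED gene IDs. As a rough sketch, assuming (from the IDs shown in this handout) that an ID has the form fig|<genome ID>.<feature type>.<number>, the same ID edit can be done in Python. The helper names parse_fig_id and sibling_id are illustrative and are not part of SEED.

```python
import re

# Assumed layout, inferred from IDs in this handout:
#   fig|<genome id>.<feature type>.<feature number>
FIG_ID = re.compile(r"^fig\|(\d+\.\d+)\.([A-Za-z]+)\.(\d+)$")


def parse_fig_id(fig_id):
    """Return (genome_id, feature_type, feature_number) for a fig ID."""
    match = FIG_ID.match(fig_id)
    if match is None:
        raise ValueError(f"not a recognised fig ID: {fig_id!r}")
    genome_id, feature_type, number = match.groups()
    return genome_id, feature_type, int(number)


def sibling_id(fig_id, new_number):
    """Build the ID of another feature of the same type in the same genome."""
    genome_id, feature_type, _ = parse_fig_id(fig_id)
    return f"fig|{genome_id}.{feature_type}.{new_number}"


# Jump from the L. lactis FolE gene to peg.1177, as in the exercise.
print(sibling_id("fig|272623.1.peg.1188", 1177))
```

The resulting string is what would be pasted into the ‘find’ window.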
6. Building a subsystem.
As an example, we will each build a small subsystem using the folE gene and genes associated with it.
* In Subsystem Editor, click ‘Home’ (OR: from any PEG page click ‘Curate Subsystems’ under ‘Navigate’)
→ Click ‘Create new subsystem’ → Opens a new window
* Name your subsystem (your initials)_Folate (e.g. ADH_Folate), write ‘Test’ in Description and in Notes
→ Click ‘Save Changes’ → Click on link at top of page to enter new subsystem
* By default your subsystem will be classified as ‘Experimental’ – Leave it that way
* Can change Description and Notes, press red ‘Save Changes’ button. Note: Save every page before you
leave it, otherwise it is lost
* First we will add a gene to the subsystem, then some genomes. To add a gene:
* Open ‘Functional Roles’ in a new tab. To add an existing functional role, copy-paste from elsewhere in
SEED. To copy-paste FolE’s functional role:
- Go to Subsystem Overview, find Folate Biosynthesis subsystem, Click on ‘Functional Roles’ tab →
copy ‘GTP cyclohydrolase I (EC 3.5.4.16) type 1’ → paste into your subsystem
* Add an abbreviation (this is the gene name that will appear in the spreadsheet), e.g. FolE, Press ‘Add Role’
→ new box to add another role appears. Press ‘Save Changes’
* To add genomes: Open Spreadsheet in a new tab → Select E. coli K12 MG1655, press ‘add selected
genome(s) to spreadsheet’ → Have mini-spreadsheet with one gene and one genome
* To find other genomes that have FolE and to add them to the spreadsheet:
- Click on E. coli gene (peg.2221) → On Annotation Overview page, use Compare Regions tool with
‘collapse close genomes’ to get 100 related genomes
- Note that FolE occurs in many diverse bacteria, therefore add all bacterial genomes to spreadsheet
- Return to Subsystem Spreadsheet → Check only Bacteria in Domains box → Highlight all genomes in
list using Shift key → press ‘add selected genome(s) to spreadsheet’
* To add relevant genes, search first for genes that may be clustered with FolE. Return to FolE Annotation
Overview, search for genes often found near FolE
- Note 6-pyruvoyl tetrahydrobiopterin synthase (EC 4.2.3.12) quite often translationally coupled to FolE
- Therefore copy-paste this gene into the Functional Roles page, add an abbreviation (e.g. PTPS), press
‘Add Role’, ‘Save Changes’
- Return to Spreadsheet, reload page → Shows FolE and PTPS columns → Press ‘Fill All Genomes’ →
Populates PTPS column (Note: ‘Refill All Genomes’ also works but is more radical)
- To see clustering between FolE and PTPS, click on ‘Color Spreadsheet’ tab, activate color by cluster
radio button
* Instructor alone does the annotation. We will use a different annotation method from the one above.
* To add an as-yet unannotated gene, return to FolE annotation overview → Gene adjacent to folE in E. coli
K-12 MG1655 is annotated ‘putative inner membrane protein’, not in subsystems
* To annotate this gene: Click on the E. coli gene (fig|511145.6.peg.2220) to recenter the display
- In Compare Regions tool, ask for 100 genomes with both cutoffs at 1e-05 → Run Psi-Blast →
Conserved domain search gives ‘COG2311, Predicted membrane protein’
- Copy this annotation, paste it into new assignment box in Annotation Overview, press ‘Change’ → Press
‘Feature Evidence’, ‘Tabular Protein Evidence’
- Activate radio button ‘Assign from’ for E. coli, check boxes for 10 very similar proteins → Press ‘Assign
Function’ → Boxes for these genes change to white
* To add to spreadsheet: copy-paste annotation ‘COG2311, Predicted membrane protein’ into Functional
Roles page, add an abbreviation (e.g. COG2311), press ‘Add Role’, ‘Save Changes’
* Go to Spreadsheet, click ‘Fill All Genomes’ → adds new gene and new genomes
* Saving the subsystem: Close all subsystem windows. On Subsystems Overview page, click ‘Manage my
subsystems’, Check the name of your subsystem, Click ‘Publish Subsystems to Clearinghouse’
* Deleting a subsystem: As above but select ‘Delete selected subsystems’
7. Annotation syntax, copying a subsystem, subsystem spreadsheet tools
* Annotation syntax. Genes can if necessary be given more than one annotation. The syntax and
conventions for this are:
- Role1 @ Role2 – A monofunctional [‘single domain’] enzyme that can catalyze two different reactions,
both role 1 and role 2
- Role1 / Role2 – A bifunctional [often ‘two domain’] enzyme that does two unrelated roles, role 1 and
role 2
- Role1 ; Role2 – Does either role 1 or role 2 but it is not clear which (can be used to double-annotate)
* When building subsystems, enter Role 1 and Role 2 separately in the Functional Roles table
* A # after the annotation starts a comment (does not affect the annotation)
* To copy a subsystem – Go to Subsystems Overview
- Locate subsystem of interest, in far-right column click ‘copy’
- In ‘Copy To’ box enter name of new subsystem, click ‘Copy Now’ → Click on link at top of page to
access new subsystem
* To survey clustering patterns, at top of Spreadsheet click on triangles in ‘# clustered’ box → arranges
genes in ascending or descending order of clustering frequency
* Displaying all genomes in which a gene is present, or in which it is absent
- In dark blue header bar of table, for gene of interest change option in window from ‘All’ to ‘non-empty’
or ‘empty’ (table takes a few seconds to update)
* To jump to any given organism, type all or part of its name in the ‘Organism’ box in the header, hit
return
* To delete a genome, check the box next to genome(s) name(s) and hit ‘Delete selected genomes’
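The annotation syntax listed at the start of this section can be illustrated with a small parser. This is a sketch under stated assumptions, not SEED's own code: it assumes the separators @, / and ; are always surrounded by white space as written above, and that everything after # is a comment. The names split_roles and has_role are illustrative.

```python
import re

# Separators as written in this handout: ' @ ', ' / ' and ' ; ' (or '; ').
# Requiring white space avoids splitting names that merely contain '/'.
SEPARATORS = re.compile(r"\s*;\s+|\s+[@/]\s+")


def split_roles(annotation):
    """Return (list of roles, comment) for an annotation string."""
    text, _, comment = annotation.partition("#")
    roles = [part.strip() for part in SEPARATORS.split(text) if part.strip()]
    return roles, comment.strip()


def has_role(annotation, functional_role):
    """True if one role of the annotation equals the functional role exactly."""
    roles, _ = split_roles(annotation)
    return functional_role in roles


fused = "GTP cyclohydrolase I (EC 3.5.4.16) type 1 / Role2 # example comment"
print(split_roles(fused))
print(has_role(fused, "GTP cyclohydrolase I (EC 3.5.4.16) type 1"))
print(has_role(fused, "GTP cyclohydrolase I"))  # partial names do not match
```

Comparing each role as a whole string mirrors the advice to enter Role 1 and Role 2 separately in the Functional Roles table.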
8. ‘Edit empty cells’ & ‘Show missing with matches’ tools
Note: These tools are not supported in UChicago SEED, only in PubSEED
* An early step in constructing a subsystem is to homogeneously annotate all the genes encoding each
function so that they can all be imported into the spreadsheet. Annotation text string is the only mechanism
in SEED that associates genes with Subsystems. Annotations need to match the name of the corresponding
Functional Role in a Subsystem perfectly for a gene to be included in Subsystem spreadsheet
* When several genes for a given function have been carefully annotated (preferably in diverse organisms),
the ‘Show missing with matches’ and ‘Edit empty cells’ tools can be used to propagate the annotation
* ‘Edit empty cells’ tool: a strategy to further populate the Subsystem spreadsheet, one cell at a time:
- In the Spreadsheet page Actions box click ‘Edit empty cells’ button. Question marks [?] are displayed in
every unpopulated cell – each of them is an individual entryway into a specialized Search tool for
potentially missing genes:
- Locate Bacillus anthracis str. 'Ames Ancestor' genome, note empty cell in the FolE column
- Click on [?] within it once – a floating window will appear. Click on ‘Find candidates’ menu option
- Program will perform a 4-step search for this missing gene, and a report page will be generated. It
explains why that particular cell has not been populated:
> A clear homolog is present in the genome, but not yet annotated with the required string of text
> A homologous ORF, although potentially present, has not been called in this genome (ORF-caller
mistake)
> No homologs can be identified – neither on protein nor on DNA level
- If the search (i) for a matching annotation and (ii) for homologous protein sequences within this genome
failed, the program will attempt to run a tblastn of a template gene (the default template can be changed
manually) against this genome. If successful, an arrow labeled ‘Q’ will be displayed on the report page:
- Click on this arrow → DNA sequence around the hit is displayed, possible start codons, stop codons,
Shine-Dalgarno sequences are color coded. Inspect possible ORF:
- Inspect suggested locus and help the machine to choose start codon: clicking a round start radio button
activates it (note that checking a square stop box inactivates it)
- Once a start is selected, press ‘BlastP’ (in the ‘Action’ box) → program will search complete genomes
in SEED using the proposed ORF as a query. This takes time, since a live BLAST search is performed
against all genomes in SEED (~1000) - as opposed to Similarity Tables, for example, where precomputed
homologs are instantly displayed
- Sequence is highlighted in yellow. Number immediately above each residue is the number of hits with a
residue (any residue) at this position
- Number above this is the number of hits where the residue is the same as your query sequence
* This is a fast way to see the ‘consensus’ length and sequence of the protein, and hence to judge whether the
start you have called is canonical. If necessary – select a different start codon and rerun BlastP
* When satisfied, activate a radio button in ‘Select function’ column within ‘BLAST Results’ table – in order
to associate annotation with the newly created ORF – and click ‘Create’. A new gene page is created in
SEED database and the link to it is provided on the report page. Note that Similarities for the newly created
feature might not be available for several days
* You can delete the newly created ORF (or any ORF in SEED) from the corresponding Annotation
overview page by using ‘delete feature’ button. Use responsibly!
* Press ‘Do not Edit empty cells’ when finished with this tool. No need to press ‘Save edit for empty cells’
(except in special cases, not covered by this introductory tutorial)
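The two rows of numbers shown above the highlighted sequence in the BlastP step can be reproduced from any set of aligned hits. The sketch below assumes each hit has already been aligned to the query and padded to the query length, with '-' where the hit has no residue. The toy sequences are invented for illustration and the function name position_counts is not a SEED name.

```python
def position_counts(query, aligned_hits, gap="-"):
    """For each query position return two counts:
    hits with any residue there, and hits with the same residue as the query.
    """
    for hit in aligned_hits:
        if len(hit) != len(query):
            raise ValueError("every aligned hit must have the query length")
    any_residue, same_residue = [], []
    for i, query_residue in enumerate(query):
        column = [hit[i] for hit in aligned_hits]
        any_residue.append(sum(1 for c in column if c != gap))
        same_residue.append(sum(1 for c in column if c == query_residue))
    return any_residue, same_residue


# Toy data: three hits aligned to a four-residue query.
query = "MKLV"
hits = ["MKLV", "-KIV", "--LV"]
any_residue, same_residue = position_counts(query, hits)
print("same as query:", same_residue)
print("any residue:  ", any_residue)
print("query:        ", list(query))
```

If the first few positions are covered by far fewer hits than the rest, the chosen start codon may lie upstream of the consensus start, which is the situation the handout suggests checking for.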
9. ‘Show genes in column’ tool
* ‘Show genes in column’ tool allows download of all the protein sequences in any column (for further
analysis outside SEED, e.g. phylogeny)
- In the ‘Columns’ box, highlight the protein family of interest, click ‘Show genes in column’
- Displays a table of all proteins in the column → Check All → Press ‘Show Selected Sequences’ →
Outputs FASTA sequences
- Can also make TCoffee or Clustal alignments
10. Searching for genes whose distributions are correlated or anti-correlated with that
of a gene of interest
Example 1: A gene that replaces the folate metabolism enzyme YgfA
[Figure: Comparative genomics of the histidine degradation pathway. The distribution of histidine
degradation genes among representative bacterial and eukaryal genomes in relation to that of the
ygfA gene for 5-formyl-THF disposal. Gene colors correspond to the different parts of the pathway.
Note that the COG3643 and forC genes are fused in mammals and some bacteria (marked by a symbol
in the figure). Columns are genes, including hutH, hutU, hutG, hutI, forC, COG3643 and ygfA. Rows are
genomes grouped by taxon: Firmicutes, Actinobacteria, Cyanobacteria, Acidobacteria, Proteobacteria
(α, β, γ, δ and related), Spirochaetes, Planctomycetes, Chlamydiales, Chlorobi, Bacteroidetes,
Deinococcus/Thermus, Chloroflexi, Thermotogae, Aquificae, Unclassified Bacteria, Fungi, Plants and
Animals. Legend: gene tested experimentally; gene present; gene absent.]
YgfA (5-formyltetrahydrofolate cycloligase) carries out an essential function in folate metabolism, yet is
missing from diverse bacterial genomes (we now know that it is replaced by glutamate
formiminotransferase). To find candidates that may replace ygfA:
Using the Phylogenetic Profiler tool at JGI
http://img.jgi.doe.gov/cgi-bin/w/main.cgi?section=PhylogenProfiler&page=phyloProfileForm
* To search for genes that show an opposite distribution to ygfA (and so may substitute for it):
- In ‘Find Genes In’ column select Thermotoga maritima (or any other genome in which YgfA is absent)
- In ‘With Homologs In’ column select the 4 other genomes above that do not have YgfA, i.e.
Streptococcus pyogenes M1 GAS
Streptococcus pyogenes SSI-1
Syntrophus aciditrophicus
Elusimicrobium minutum
- In ‘Without Homologs In’ column select the 5 genomes above that have YgfA, i.e.
Escherichia coli K12
Streptococcus pneumoniae D39
Streptomyces coelicolor
Aquifex aeolicus
Treponema pallidum SS14
- Press ‘Go’ button
* Program searches for genes in Thermotoga maritima that are present in the ‘with’ genomes (which do not
have YgfA) and absent in the ‘without’ genomes (which have YgfA) → Glutamate formiminotransferase
(EC 2.1.2.5) is the only gene found
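The logic of this search can be sketched in a few lines of Python. This is a simplified illustration of the idea and not the IMG implementation. The genome lists are those given above, but the occurrence table is a toy constructed to mirror the result described in the handout, and the ‘toy housekeeping family’ is invented.

```python
def profile_candidates(occurrence, query_genome, with_genomes, without_genomes):
    """Return families found in the query genome and in every 'with' genome,
    and in none of the 'without' genomes.

    occurrence maps a gene family name to the set of genomes containing it.
    """
    candidates = []
    for family, genomes in occurrence.items():
        if query_genome not in genomes:
            continue
        in_all_with = all(g in genomes for g in with_genomes)
        in_no_without = not any(g in genomes for g in without_genomes)
        if in_all_with and in_no_without:
            candidates.append(family)
    return sorted(candidates)


query = "Thermotoga maritima"
with_genomes = [
    "Streptococcus pyogenes M1 GAS",
    "Streptococcus pyogenes SSI-1",
    "Syntrophus aciditrophicus",
    "Elusimicrobium minutum",
]
without_genomes = [
    "Escherichia coli K12",
    "Streptococcus pneumoniae D39",
    "Streptomyces coelicolor",
    "Aquifex aeolicus",
    "Treponema pallidum SS14",
]

# Toy occurrence table (illustrative, not real IMG output).
occurrence = {
    "YgfA": set(without_genomes),
    "Glutamate formiminotransferase": {query, *with_genomes},
    "toy housekeeping family": {query, *with_genomes, *without_genomes},
}

print(profile_candidates(occurrence, query, with_genomes, without_genomes))
```

A family present everywhere is rejected by the ‘without’ condition, which is why choosing informative ‘without’ genomes matters as much as choosing the ‘with’ genomes.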
Example 2: Finding the missing folE2 gene in folate synthesis
Generating Phylogenetic distribution profiles for specific gene families in IMG
* Go to the IMG database http://img.jgi.doe.gov/cgi-bin/w/main.cgi
* The first step is to identify the input gene and add it to the cart. From the homepage, hover over ‘Find
Genes’ and select ‘Gene Search’ from the pull-down menu.
* On the Gene Search page, a Keyword search can be performed using the gene family name → type in folE.
The Filters pull-down menu allows narrowing the search. The genome of interest can also be chosen from
the pull-down menu at the bottom of this page → choose Escherichia coli K12. Then click GO → redirects
to your gene page if a genome was selected. If not, you will have to select one or more rows (each displaying
the Gene Object ID, the Locus Tag, the Gene Product Name, the Gene Symbol, and the organism). In all cases,
add the gene(s) to your cart by clicking Add Selected to Gene Cart (top left).
* Redo the same operations for folK and folP
* Then go to the Analysis Cart tab and select the three genes → Click on the Profile & Alignment tab
(upper right), and near the bottom of the page click on Phylogenetic Occurrence Profile to be redirected to
the result page. There, A (for Archaea), B (for Bacteria) and E (for Eukarya) indicate that the gene is present,
while a dot (.) in place of a letter shows its absence in a genome. The identity of
the organisms is shown by hovering over the letter or dot. Note that they are grouped by phylogeny.
This information is summarized below for a subset of genomes. We will see later how building Subsystems in
SEED allows phylogenetic distribution information to be extracted quite easily.
This is the table from the El Yacoubi paper (PMID: 17032654)
We are now going to use this information to find specific gene families that follow a profile, using the IMG
Phylogenetic Profiler tool.
* From the IMG home page hover over the Find Genes tab and select “Phylogenetic Profiler” and “Single
Gene” from the drop-down menus to reach the page for building a phylogenetic profile. The genome for which
the result will be given is selected in the Find Genes In column. The sets of genomes to be included or
excluded are chosen according to the phylogenetic profile generated above.
- In ‘Find Genes In’ column select Thermotoga maritima MSB8
- In ‘With Homologs In’ column select the genomes below
Neisseria meningitidis FAM18
Bordetella bronchiseptica RB50
Geobacter sulfurreducens PCA
Staphylococcus aureus aureus
- In ‘Without Homologs In’ column select the 2 genomes below
Escherichia coli str. K-12 substr. MG1655
S. cerevisiae
- Press ‘Go’ button
The search parameters can be modified, particularly the Min-Taxon Percent with Homologs or Min-Taxon
Percent without homologs (bottom of the page), to relax the stringency of your phylogenetic distribution
criteria.
* After selecting GO, the analysis produces a summary table that can be downloaded. The genes that match the
requested phylogenetic pattern can be captured by selecting the row and clicking on Add Selected to Gene
Cart for further analysis.
* You should find the COG1469 family or TM0039 family in the list → click on it → Add to cart
* Go to the cart and select the folE, TM0039 and folK genes
* Then → Click on the “Profile & Alignment” tab and then the “occurrence profile” button. Analyze the
distribution.
It is important to keep track of the genomes used for the analysis in case the search needs to be repeated at a
later time. As a positive control for your genome choices, your gene of interest should come up in the
result table as it was initially used to generate your profile.
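Relaxing the Min-Taxon Percent settings mentioned above can be pictured as replacing an all-or-nothing test with a percentage threshold. The sketch below is one plausible reading of those settings and is an assumption, not the documented IMG behaviour: a family passes if it has homologs in at least min_with_pct percent of the ‘with’ genomes and lacks homologs in at least min_without_pct percent of the ‘without’ genomes. The toy family is invented.

```python
def percent(part, whole):
    """Percentage of part in whole; an empty list counts as fully satisfied."""
    return 100.0 * part / whole if whole else 100.0


def matches_profile(genomes_with_homolog, with_genomes, without_genomes,
                    min_with_pct=100.0, min_without_pct=100.0):
    """Test one gene family against a profile with adjustable stringency."""
    n_with = sum(1 for g in with_genomes if g in genomes_with_homolog)
    n_without = sum(1 for g in without_genomes if g not in genomes_with_homolog)
    return (percent(n_with, len(with_genomes)) >= min_with_pct
            and percent(n_without, len(without_genomes)) >= min_without_pct)


with_genomes = [
    "Neisseria meningitidis FAM18",
    "Bordetella bronchiseptica RB50",
    "Geobacter sulfurreducens PCA",
    "Staphylococcus aureus aureus",
]
without_genomes = ["Escherichia coli str. K-12 substr. MG1655", "S. cerevisiae"]

# Toy family: homologs in three of the four 'with' genomes only.
toy_family = set(with_genomes[:3])

print(matches_profile(toy_family, with_genomes, without_genomes))
print(matches_profile(toy_family, with_genomes, without_genomes, min_with_pct=75))
```

With the default strict setting the toy family is rejected; lowering the ‘with’ threshold to 75 percent lets it through, at the cost of more false positives to inspect.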
11. Creating and using subsets
There are two types of subsets:
(i) Sets of genes in a subsystem that you may wish to display separately from the rest, e.g. just one branch
of a metabolic pathway
(ii) Sets of genes in a subsystem that you may wish not to distinguish from each other, e.g. nonorthologous protein families that carry out the same function. These subset names start with *
* To create the first type of subset, in any subsystem, e.g. ‘Folate Biosynthesis’, click on ‘Subsets’ tab in
green bar
* Name the subset (short name, e.g. FolEQP for the first 3 genes of folate synthesis), check the genes to
include in the subset → Click ‘Create Subset’ → Click ‘Save Editing to Subsets’ button under table
* Open spreadsheet in another tab, click on ‘Limit Display’ tab → Highlight your subset in ‘Show Subsets’
window → Click ‘Limit Display’ button → Displays just the genes in the subset
* To create the second type of subset, proceed as above but start the subset name with * e.g. *FolEQP
* Spreadsheet will collapse all the subset genes in a single column → to uncollapse and show the genes
separately → Highlight the subset in the ‘Uncollapse’ column, press ‘Limit Display’
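What collapsing a * subset does to one spreadsheet row can be sketched as follows. This is an illustration of the idea only. The row layout (role abbreviation mapped to a list of gene IDs), the subset name *FolE and the gene IDs are all hypothetical.

```python
def collapse_subset(row, subset_name, subset_roles):
    """Merge the cells of all roles in a subset into a single column.

    row maps a role abbreviation to the list of gene IDs in that cell.
    The merged column appears where the first subset member was.
    """
    collapsed = {}
    for role, genes in row.items():
        if role in subset_roles:
            collapsed.setdefault(subset_name, []).extend(genes)
        else:
            collapsed[role] = list(genes)
    return collapsed


# Hypothetical row: the genome has the FolE2 type of enzyme but not FolE.
row = {"FolE": [], "FolE2": ["gene_17"], "FolK": ["gene_3"]}
print(collapse_subset(row, "*FolE", {"FolE", "FolE2"}))
```

In the collapsed view the function appears as present regardless of which protein family carries it out, which is the purpose of the second type of subset.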
12. Creating and using User Sets of genomes
If you are interested in a particular subset of genomes, grouped on parameter(s) other than taxonomy (e.g.
Extremophiles), it might be worth the effort to create your own set.
* Under ‘Limit Display’ Tab check that the set you need is not yet available under ‘Specific Sets’ or ‘User
Sets’. If not – proceed to ‘Add Genomes’ Tab
- Under the window with all available genomes locate a smaller ‘User Sets’ window. Highlight any
existing genome set and Click ‘Show selected genome list’
- A special web page will open that allows creation/editing of subsets of genomes – follow the cues.
Unfortunately, there is a bug on this page: the small arrows between the two windows ‘List to choose from’
and ‘Created genome list’ are REVERSED. Hence, to add a genome to one's list, one needs to press the [ < ]
arrow (pointing leftwards)
13. Variant codes
Variants are typically groups of genomes that share a feature, e.g. they lack a pathway (-1 variants) or have a
particular version of a pathway.
* To create variants, in any subsystem spreadsheet → Click on ‘Show Variants’ tab
* In ‘Actions’ box, click on ‘Edit Variants in this table’
- Can assign variants manually in the variant column of the spreadsheet, e.g. -1 for genomes where a
pathway is absent
- Click on red ‘Save Variants’ button
- Default is not to display -1 variants in spreadsheet. To display → Click ‘Show -1 variants’
* Can also edit variants in ‘Variant Overview’ → Click ‘Variant Overview’ button in ‘Actions’ box
- Displays a table of all combinations of genes that occur in the subsystem
- Assign a variant number in the ‘Set To’ column of the table, click ‘Set Variants’ button below the table
- Add a description of the variant to the box above the table, press ‘Add/Save Variants’
* For more on variants click ‘Check Variants’ button in ‘Actions’ box
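The ‘Variant Overview’ table, which lists every combination of genes that occurs in the subsystem, amounts to grouping genomes by which columns are populated. The sketch below illustrates this with a toy spreadsheet. The genome names and gene IDs are invented, and only the -1 code for an absent pathway is taken from the handout; other variant numbers are left for manual assignment.

```python
from collections import defaultdict


def variant_overview(spreadsheet, roles):
    """Group genomes by the combination of roles that have at least one gene."""
    groups = defaultdict(list)
    for genome, cells in spreadsheet.items():
        present = tuple(role for role in roles if cells.get(role))
        groups[present].append(genome)
    return dict(groups)


def absent_pathway_variants(spreadsheet, roles):
    """Propose the -1 variant for genomes with no gene in any role."""
    return {
        genome: "-1"
        for genome, cells in spreadsheet.items()
        if not any(cells.get(role) for role in roles)
    }


roles = ["FolE", "PTPS"]
spreadsheet = {
    "genome A": {"FolE": ["gene_1"], "PTPS": ["gene_2"]},
    "genome B": {"FolE": ["gene_7"], "PTPS": []},
    "genome C": {"FolE": [], "PTPS": []},
    "genome D": {"FolE": ["gene_4"], "PTPS": ["gene_5"]},
}

for combination, genomes in variant_overview(spreadsheet, roles).items():
    print(combination or "(no genes)", "->", genomes)
print(absent_pathway_variants(spreadsheet, roles))
```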
14. Go to a previous version of a subsystem
* From the SEED Subsystem Editor, go to the subsystem of interest
* Go to the Subsystem info tab
* Click on ‘Reset to Previous Timestamp’ → Choose the version you want to reset to
15. Change the name of all proteins assigned to a given Functional role
* Go to the Functional role tab of your subsystem
* Next to the Functional role click on the ‘ShowFR’ button
* On this summary page there is a ‘Change Functional Role’ box → Use with caution
16. Automatically adding new genomes to a subsystem, adding publications to a subsystem
* To add new genomes to a subsystem (‘extend’) automatically as they become available (instead of adding
them manually from the ‘Add Genomes’ tab in the Subsystem Editor):
- Click on ‘Home’ in green bar at top of Subsystem Editor pages
- Click ‘Manage my subsystems’ → Check subsystems that you wish to make automatically extendable
→ Click ‘Make Autom. Extendable’ button
* To add publications on a gene to a subsystem:
- Click ‘Functional roles’ tab → Click ‘Literature’ column next to gene of interest
- May contain ‘proposed’ publications (found by literature mining) that can be accepted or rejected
- Can add single or multiple PubMed IDs
17. Changing a gene’s start or stop site
Note: These tools are not supported in UChicago SEED, only in PubSEED
* From any Annotation Overview page, copy gene ID, e.g. fig|64091.1.peg.1577
* From ‘Comparative Tools’ in toolbar select ‘Find a gene in this organism’ → Paste ID into box, Click
‘Submit Query’ → Click on Q arrow
* Displays the DNA sequence around the hit; possible start codons, stop codons and Shine-Dalgarno sequences
are color coded
* Clicking a round start radio button activates that start; checking a square stop box inactivates that stop
* When sites are selected, press ‘BlastP’ → searches completed genomes with the selected ORF (160 hits were
obtained in this example).
- Sequence is highlighted in yellow. Number immediately above each residue is the number of hits with a
residue (any residue) at this position
- Number above this is the number of hits where the residue is the same as your query sequence
* This is a fast way to see the ‘consensus’ length and sequence of the protein, and hence to judge whether the
start or stop you have called is canonical
* When satisfied, return to the editing page, click ‘Create’ and follow the prompts
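The two rows of numbers shown above the query sequence can be reproduced with a short Python sketch. It is an illustration only and not the SEED implementation. It assumes the hits have already been aligned to the query and padded with ‘-’ so that every hit has the same length as the query. The sequences are invented.

```python
def column_support(query, aligned_hits):
    """For each query position, count hits that have any residue there
    (coverage) and hits whose residue matches the query (identity)."""
    coverage, identity = [], []
    for i, query_residue in enumerate(query):
        column = [hit[i] for hit in aligned_hits]
        coverage.append(sum(1 for residue in column if residue != "-"))
        identity.append(sum(1 for residue in column if residue == query_residue))
    return coverage, identity


# Invented example: three hits aligned to a five-residue query.
query = "MKTAY"
hits = ["--TAY", "MKTAF", "-RTAY"]

coverage, identity = column_support(query, hits)
print("identical:", identity)
print("covered:  ", coverage)
print("query:    ", list(query))
```

Low coverage at the first positions, as in this toy case, is the kind of pattern that suggests the called start lies upstream of the start used by most homologs.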
18. Additional tools to check for presence or absence of a role in multiple genomes (may be omitted)
* Show missing with matches tool (very slow)
In the Spreadsheet page Actions box:
- In the ‘Columns’ box, highlight the protein family (Function) of interest, click ‘Show missing with
matches for columns’ (columns in the spreadsheet correspond to functions)
- Program will search by BlastP all genomes where that protein is absent from the spreadsheet column for
homologs of any of the proteins that are present in the column (‘all by any’)
- Output is a table for each organism in which candidates are found; on the left is the hit and its current
annotation and on the right is the query protein and its annotation
- After inspecting the hits, check the ‘Assign’ box for those you deem correct → Press ‘Process
assignments’ to change current annotation to the annotation of the query
- A confirmation page appears → Close window
- Update spreadsheet by clicking ‘Refill All Genomes’ (or check desired genomes and press ‘Refill
Selected Genomes’)
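The ‘all by any’ logic can be sketched as follows. This is a minimal illustration, not the SEED program: the genome names and peg IDs are hypothetical, and the BlastP search itself is left out. Only the list of searches to run is built.

```python
# Hypothetical spreadsheet column: genome -> pegs assigned to the chosen role.
column = {
    "Genome A": ["fig|1.1.peg.10"],
    "Genome B": [],
    "Genome C": ["fig|3.1.peg.7", "fig|3.1.peg.8"],
    "Genome D": [],
}


def all_by_any_jobs(role_column):
    """Pair every genome lacking the role with every protein present in the column."""
    queries = [peg for pegs in role_column.values() for peg in pegs]
    targets = [genome for genome, pegs in role_column.items() if not pegs]
    return [(target, query) for target in targets for query in queries]


jobs = all_by_any_jobs(column)
for target, query in jobs:
    print(f"search {target} with {query}")
print(len(jobs), "searches in total")
```

The number of searches is the number of empty genomes multiplied by the number of proteins in the column, which may be part of why the tool is slow on large subsystems.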
19. Exercises around the “Folate biosynthesis” Subsystem
Note: These exercises are less detailed than the sections above, so that students can work out how to repeat,
with other examples, the operations we have already done together.
-Adding the FolE2 family to the folate subsystem
In UChicago SEED
* Open (your initials)_Folate subsystem in UChicago SEED in one tab
* Find the annotation page for TM0039 in another tab
* Add the corresponding functional role to your Folate Subsystem, put FolE2 in the abbreviation box
(remember this is done in two steps, adding the functional role to the Functional Role table, then updating
from the Spreadsheet page).
* In the Spreadsheet column headings, choose ‘empty’ for FolE and ‘non-empty’ for FolK → this gives the
list of genomes that are missing FolE, the same list that we generated (painfully) with IMG, which can serve as
the starting point for a phylogenetic query. As you will see, many (but not all) of them have a FolE2.
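The ‘empty’ / ‘non-empty’ column filter amounts to a simple selection, sketched here in Python. The genome names and peg labels are invented for illustration and the function name is hypothetical. SEED performs this filtering for you in the spreadsheet.

```python
# Invented spreadsheet rows: genome -> {role abbreviation: list of pegs}.
spreadsheet = {
    "Genome A": {"FolE": ["peg.12"], "FolE2": [], "FolK": ["peg.40"]},
    "Genome B": {"FolE": [], "FolE2": ["peg.7"], "FolK": ["peg.33"]},
    "Genome C": {"FolE": [], "FolE2": [], "FolK": ["peg.5"]},
    "Genome D": {"FolE": [], "FolE2": [], "FolK": []},
}


def select_genomes(sheet, empty, non_empty):
    """Keep genomes where the `empty` column has no peg and `non_empty` has at least one."""
    return [
        genome
        for genome, row in sheet.items()
        if not row.get(empty) and row.get(non_empty)
    ]


missing_fole = select_genomes(spreadsheet, empty="FolE", non_empty="FolK")
with_fole2 = [genome for genome in missing_fole if spreadsheet[genome]["FolE2"]]

print("FolK present, FolE missing:", missing_fole)
print("...of which have FolE2:", with_fole2)
```

Requiring FolK to be non-empty excludes genomes such as Genome D that lack the pathway altogether, so only genomes with a genuine FolE gap remain.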
We now want to merge FolE and FolE2 into one column.
* Create a *FolE subset that includes FolE and FolE2 (see section 12 for instructions)
* In the ‘Limit Display’ tab of your Subsystem Spreadsheet, the display is uncollapsed by default; choose the
*FolE subset → press ‘Limit Display’. One column will now cover the two roles.
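Collapsing a subset can be pictured as merging several role columns into one. The Python sketch below is an assumption-laden illustration, not SEED code: the rows and peg labels are invented and the `collapse` function is hypothetical.

```python
# Invented spreadsheet rows: genome -> {role abbreviation: list of pegs}.
spreadsheet = {
    "Genome A": {"FolE": ["peg.12"], "FolE2": [], "FolK": ["peg.40"]},
    "Genome B": {"FolE": [], "FolE2": ["peg.7"], "FolK": ["peg.33"]},
    "Genome C": {"FolE": [], "FolE2": [], "FolK": ["peg.5"]},
}


def collapse(sheet, subset_name, roles):
    """Replace the columns listed in `roles` by one column named `subset_name`."""
    collapsed = {}
    for genome, row in sheet.items():
        merged = [peg for role in roles for peg in row.get(role, [])]
        new_row = {role: pegs for role, pegs in row.items() if role not in roles}
        new_row[subset_name] = merged
        collapsed[genome] = new_row
    return collapsed


collapsed = collapse(spreadsheet, "*FolE", ["FolE", "FolE2"])
for genome, row in collapsed.items():
    print(genome, row)
```

After collapsing, a genome shows an empty *FolE column only if it has neither FolE nor FolE2 (Genome C here), which makes the remaining true gaps easier to spot.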
-Finding a candidate for the missing phosphatase involved in the hydrolysis of DHN-P
* Add the Dihydroneopterin aldolase (EC 4.1.2.25) reaction role (abbreviation FolB) to your Folate
Subsystem
* Make sure you have added the Akkermansia muciniphila ATCC BAA-835 genome to your Subsystem
→ Go to the annotation page of FolB in this specific genome
* Using the different tools we have already covered (feature evidence, PSI-BLAST, Compare genomes), what is
your interpretation of the function of this protein?
Hint: To see more genomes in the Compare Region tool, relax the ‘Evalue cutoff for selection of pinned
CDSs’ to 1e-5 and also try with and without collapsing close genomes.
* Find the exact same peg in PubSEED (from the PubSEED entry page, paste the full peg ID that you found in
UChicago SEED → Search)
* Explore the Compare genomes tools in PubSEED. Compare with the view you had in UChicago SEED: because
more genomes are available in PubSEED, the tool is much more informative and more powerful for making
functional predictions.
Discussion Pause: How do you think this peg should be annotated? We will discuss this together in class and
the Instructor will change the annotation in UChicago SEED.
Updated 4/16/13