Subsystem Approach to Genome Annotation

National Microbial Pathogen Data Resource
www.nmpdr.org
Claudia Reich
NCSA, University of Illinois, Urbana
Complete Microbial Genomes
• 464 complete microbial genomes in NCBI as of 3-1-07
• 691 microbial genomes in progress as of 3-1-07
Making Sense of Genome Data
• Locate Genes: identify ORFs automatically
  - GeneMark
  - NCBI’s ORF Finder
  - Glimmer
  - Critica
• Assign Function: by sequence similarity to experimentally characterized proteins
  - BLAST family of sequence comparison tools
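
As a rough illustration of what "identify ORFs automatically" means, the sketch below scans the three forward reading frames for an ATG followed in frame by a stop codon. It is a deliberately naive scan and not the method used by GeneMark, Glimmer, or Critica; the function name `find_orfs` and the `min_codons` parameter are assumptions made for this example.

```python
# Naive forward-strand ORF scan (illustrative only; real gene finders
# use statistical models and also search the reverse strand).
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) pairs; end is exclusive and includes the stop codon.

    min_codons counts every codon in the ORF, including start and stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)

example = "CCATGAAATTTTAGGG"
print(find_orfs(example))
```

Each reported interval can then be translated and compared against characterized proteins with a BLAST-family tool.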
Problems with Assignments by Similarity
• When the ORF is a member of a protein family
• Paralogous genes
• ORFs encoding similar proteins acting on different substrates
• Assignments can be transitive, and many times removed from experimental data
Other Factors Can Aid in Function Assignments
• Molecular phylogeny
• Paralogous and orthologous families
• Conserved gene neighborhood
• Metabolic context
• Bidirectional best hit matches across multiple genomes
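
The bidirectional best hit idea can be sketched as follows: gene a in genome A and gene b in genome B form a pair when b is the top-scoring hit for a and a is the top-scoring hit for b. This is a minimal sketch, assuming similarity hits are already available as (query, subject, score) tuples; the function names and gene identifiers are hypothetical.

```python
def best_hits(hits):
    """Map each query to its single highest-scoring subject."""
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)

# Hypothetical hit tables for two genomes.
a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90)]
b_vs_a = [("b1", "a1", 198), ("b2", "a1", 140), ("b2", "a2", 95)]
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here a2 finds b2, but b2 prefers a1, so only the a1/b1 pair is kept. Repeating the test across multiple genomes gives the cross-genome evidence the slide refers to.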
Incorporating Information Other Than Similarity
• KEGG: manually curated pathway and metabolic maps
• GO: vocabularies that describe ORFs as associated with
  - biological processes
  - cellular components
  - molecular function
• MetaCyc: experimentally elucidated metabolic pathways
What is Needed:
• A system that:
  - integrates all the above concepts
  - organizes genomic data in structured idioms
  - allows high-throughput annotation of newly sequenced genomes
  - resolves discrepancies in different annotation tools
  - informs experimental research
Enter the SEED*
• Database and annotation environment
• Underlies, and is accessible through, NMPDR (www.nmpdr.org)
• Expert annotation via subsystem building
• Provides the most accurate genome annotations available
*Argonne National Lab, University of Chicago, UIUC, FIG
What is a Subsystem?
• Any organizing biological principle:
  - metabolic pathway
    • amino acid biosynthesis, nitrogen fixation, glycolysis
  - complex structure
    • ribosome, flagellum
  - set of defining features
    • virulome, pathogenicity islands
  - functional concept
    • bacterial sigma factors, DNA binding proteins
Subsystems are:
• Sets of functional roles, which are functions, or abstractions of functions (such as an EC number), that together implement a specific biological process or concept
• Created manually by expert curators
• Experts annotate single subsystems over the complete collection of genomes, thus contributing and sharing their expertise with the scientific community
How Subsystems are Built
• Create a subsystem for the biological concept, and define the functional roles
• In one (or a few) key organisms that include the subsystem, find the genes and assign meaningful functional names
• Project the annotations to orthologous genes
• Expand to more genomes, creating a Populated Subsystem
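
The projection step above can be sketched as a simple lookup: once genes in a key organism carry curated functional roles, each orthologous gene in another genome inherits the role of its reference partner. This is a minimal sketch and not the SEED's actual procedure; the function name, the gene identifiers, and the assumption that orthologs arrive as (reference gene, new gene) pairs are all hypothetical.

```python
def project_roles(reference_roles, ortholog_pairs):
    """Copy curated roles from reference genes to their orthologs.

    reference_roles: {reference_gene: functional_role}
    ortholog_pairs:  iterable of (reference_gene, new_gene)
    Genes whose reference partner has no curated role are skipped.
    """
    projected = {}
    for ref_gene, new_gene in ortholog_pairs:
        role = reference_roles.get(ref_gene)
        if role is not None:
            # Keep the first projection if a gene appears in several pairs.
            projected.setdefault(new_gene, role)
    return projected

# Hypothetical curated roles in a key organism.
reference_roles = {
    "ref_001": "RNA polymerase alpha subunit",
    "ref_002": "RNA polymerase beta subunit",
}
# Hypothetical ortholog pairs to a newly sequenced genome.
ortholog_pairs = [("ref_001", "new_117"), ("ref_003", "new_200")]
print(project_roles(reference_roles, ortholog_pairs))
```

In this example new_200 stays unannotated because its reference partner has no curated role, which is the kind of gap a curator would then review by hand.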
Populated Subsystems
• Are Spreadsheets where:
  - Columns: functional roles
  - Rows: specific genomes
  - Cells: genes in the organism that implement the functional role
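
One way to picture this layout in code is a nested mapping from genome to role to a list of genes. The sketch below is only an illustration of the row/column/cell structure, not the SEED's storage format; the genome labels and gene identifiers are hypothetical.

```python
# Columns: functional roles of the subsystem.
ROLES = [
    "RNA polymerase alpha subunit",
    "RNA polymerase beta subunit",
    "RNA polymerase delta subunit",
]

# Rows: genomes. Cells: genes implementing each role (hypothetical IDs).
SPREADSHEET = {
    "genome_1": {
        "RNA polymerase alpha subunit": ["gene_0012"],
        "RNA polymerase beta subunit": ["gene_0340"],
        "RNA polymerase delta subunit": ["gene_0781"],
    },
    "genome_2": {
        "RNA polymerase alpha subunit": ["gene_1105", "gene_2290"],
        "RNA polymerase beta subunit": ["gene_1410"],
    },
}

def to_rows(roles, spreadsheet):
    """Flatten the nested mapping into a header row plus one row per genome."""
    rows = [["Genome"] + list(roles)]
    for genome in sorted(spreadsheet):
        cells = spreadsheet[genome]
        rows.append([genome] + [", ".join(cells.get(role, [])) for role in roles])
    return rows

for row in to_rows(ROLES, SPREADSHEET):
    print("\t".join(row))
```

An empty cell means no gene has been assigned to that role in that genome, and a cell with several genes means the role has more than one candidate.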
How to Access Subsystems
• From Search menu
• From Organism pages
• From search results, when the protein found is included in a subsystem
• From Annotation Overview pages
Subsystem Pages in NMPDR
• Table of Functional Roles
• Subsystem diagram (if appropriate)
• Populated subsystem spreadsheet
• Customizable spreadsheet viewing options
• Functional variants and subsets of roles
• Curator’s notes
Benefits of Subsystems
• More accurate annotations
• Annotation of protein families
• Analysis of sets of functionally related proteins
• Less error-prone automatic projections to novel genomes
Subsystems Reveal Interesting
• Pathway variants:
  - Are they clustered by phylogeny?
    • Delta subunit of RNA polymerase only in Bacillales
  - Are they clustered by functional niche?
  - Horizontal gene transfer?
• Fused genes:
  - β and β’ subunits of RNA polymerase fused in Helicobacter
• Fissioned genes:
  - β’ subunit of RNA polymerase is fissioned in Cyanobacteria
Subsystems Reveal Interesting
• Duplicate assignments
  - More than one gene for one functional role?
    • Alpha subunit of RNA polymerase in Magnetococcus and Francisella
  - Same sequenced region in more than one contig in partially assembled genomes?
  - Frameshifts or other sequencing errors?
  - Annotation errors?
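
Duplicate assignments are easy to flag once a populated subsystem is held as a genome-to-role-to-genes mapping: any cell with more than one gene is a candidate for the questions above. This is a minimal sketch under that assumed data layout; the genome labels and gene identifiers are hypothetical and do not reproduce the Magnetococcus or Francisella data.

```python
def duplicate_assignments(spreadsheet):
    """Return (genome, role, genes) for every cell holding more than one gene."""
    duplicates = []
    for genome, cells in sorted(spreadsheet.items()):
        for role, genes in sorted(cells.items()):
            if len(genes) > 1:
                duplicates.append((genome, role, list(genes)))
    return duplicates

# Hypothetical populated subsystem: genome -> role -> genes.
subsystem = {
    "genome_1": {"RNA polymerase alpha subunit": ["gene_0012"]},
    "genome_2": {"RNA polymerase alpha subunit": ["gene_1105", "gene_2290"]},
}
for genome, role, genes in duplicate_assignments(subsystem):
    print(genome, role, genes)
```

The scan only locates the cells; deciding whether a hit is a true paralog, a repeated contig, a frameshift, or an annotation error still requires inspecting the sequences.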
Subsystems Reveal Interesting
• Missing genes:
  - Is the function essential?
  - Is the function conserved?
  - Does the missing gene cluster with homologs in other organisms?
  - Is the function performed by a newly recruited gene?
  - Has a gene been acquired by horizontal gene transfer and now performs that function?
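
Missing genes show up as empty cells in a populated subsystem. The sketch below lists, for each genome, the functional roles with no assigned gene; it assumes the same genome-to-role-to-genes mapping used for illustration elsewhere, and the role names, genome labels, and gene identifiers are hypothetical.

```python
def missing_roles(roles, spreadsheet):
    """Return {genome: [roles with no assigned gene]}, omitting complete genomes."""
    missing = {}
    for genome, cells in spreadsheet.items():
        gaps = [role for role in roles if not cells.get(role)]
        if gaps:
            missing[genome] = gaps
    return missing

# Hypothetical subsystem with three functional roles.
roles = ["role_A", "role_B", "role_C"]
populated = {
    "genome_1": {"role_A": ["gene_01"], "role_B": ["gene_02"], "role_C": ["gene_03"]},
    "genome_2": {"role_A": ["gene_11"], "role_C": []},
}
print(missing_roles(roles, populated))
```

Each reported gap is a starting point for the questions on this slide, for example checking the conserved gene neighborhood of the other roles for an unannotated candidate.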
Synthesis of Selenocysteinyl-tRNA
• Two known pathway variants
  - One step in Bacteria
    • SelA is annotated
  - Two steps in Archaea and Eucarya
    • PSTK was missing until very recently
Explore Selenocysteine Usage
• Start by searching for the gene name, selA, in an organism known to use Sec, E. coli K12
• Start from the subsystem tree; expand the category "Protein metabolism," then the subcategory "Selenoproteins"
• Open the "Selenocysteine metabolism" subsystem from the protein page or SS tree
  - Genomes arranged phylogenetically
  - Roles defined on mouse-over
  - What genes are missing in which organisms?
  - Are there Sec metabolism genes present in any organisms that do not have proteins that need Sec?
  - Are there organisms known to need Sec for certain proteins, but that do not have a complete Sec biosynthesis pathway?
  - Why is there a hypothetical protein included in this subsystem?