ORNL Genome Annotations

Young, Ned 1-3
After talking with Frank Larimer and reading the paper he suggested, I composed the
following description:
ORNL Analysis and Annotation Process:
1. The gene finders Glimmer, Generation and Critica are applied independently
and are set to allow overlapping gene models (ORFs).
a. Glimmer uses interpolated Markov models (IMMs) to identify the
coding regions and uses ATG, GTG and TTG as potential starts.
i. Glimmer is the most sensitive.
b. Generation uses predominantly 6-mer statistics to recognize coding
regions and a proximity rule-based start call with ATG and GTG as
potential starts.
c. Critica uses BLASTN to produce alignments from the entire dataset
and derives dicodon statistics to recognize coding sequences.
i. It recognizes ATG, GTG and TTG as potential starts.
ii. Critica is the most accurate.
2. The training set selected for Generation and Glimmer consists of
non-overlapping ORFs >900 bp long.
3. Each ORF is scored from 1 (only Glimmer found it) to 5.5 (all 3 methods
agree).
a. Scores of 3.5 or better are considered accurate.
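The scoring in step 3 can be illustrated with a small sketch. The description gives only the two endpoints (1 for a Glimmer-only call, 5.5 when all three methods agree) and the 3.5 threshold. The per-tool weights below are hypothetical values chosen to reproduce those endpoints; they are not ORNL's actual weights.

```python
# Hypothetical per-tool weights: only the endpoints (Glimmer alone = 1,
# all three tools = 5.5) and the 3.5 threshold come from the description.
ASSUMED_WEIGHTS = {"glimmer": 1.0, "generation": 2.0, "critica": 2.5}
ACCURATE_THRESHOLD = 3.5


def orf_score(found_by):
    """Sum the assumed weights of the gene finders that called this ORF."""
    return sum(ASSUMED_WEIGHTS[tool] for tool in set(found_by))


def is_accurate(found_by):
    """An ORF scoring 3.5 or better is considered accurate."""
    return orf_score(found_by) >= ACCURATE_THRESHOLD
```

With these assumed weights, an ORF called only by Glimmer falls below the threshold, while any ORF that Critica and at least one other tool agree on passes it.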
4. If two gene models overlap over a significant part of their length, only one is
kept.
a. In the initial, automated curation phase, any place where there is >10%
overlap is resolved.
i. If one has a blast hit and one does not, the one without the hit is
deleted.
ii. If more than one has blast hits, the ORFs and their blast hits
are aligned together.
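The automated overlap resolution in step 4 might be sketched as follows. The description does not say which length the 10% is measured against, so this sketch assumes the shorter of the two ORFs; the `Orf` class and the function names are illustrative, not part of the ORNL pipeline.

```python
from dataclasses import dataclass


@dataclass
class Orf:
    name: str
    start: int            # 1-based, inclusive
    end: int              # 1-based, inclusive
    has_blast_hit: bool

    @property
    def length(self):
        return self.end - self.start + 1


def overlap_fraction(a, b):
    """Shared length divided by the shorter ORF's length (assumed denominator)."""
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    return max(shared, 0) / min(a.length, b.length)


def resolve(a, b, max_overlap=0.10):
    """Return the ORFs to keep, or None when this rule cannot decide."""
    if overlap_fraction(a, b) <= max_overlap:
        return [a, b]                          # overlap is tolerated
    if a.has_blast_hit != b.has_blast_hit:
        return [a if a.has_blast_hit else b]   # drop the ORF without a hit
    # Both have hits (aligned together in the described process) or neither
    # does (not covered by the description): leave the pair for review.
    return None
```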
5. The first valid start upstream of the aligned region is chosen as the start.
a. Note this may be revised later.
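One possible reading of the start choice in step 5 is sketched below. It assumes a forward-strand sequence, 0-based coordinates, and that "first" means the nearest in-frame start codon at or upstream of the aligned region, with the search abandoned at an in-frame stop codon. The start codon set is the one listed for Glimmer and Critica; the function name is hypothetical.

```python
STARTS = {"ATG", "GTG", "TTG"}
STOPS = {"TAA", "TAG", "TGA"}


def first_start_upstream(seq, aligned_start):
    """Return the 0-based index of the nearest in-frame start codon at or
    upstream of aligned_start, or None if a stop codon or the beginning
    of the sequence is reached first."""
    pos = aligned_start
    while pos >= 0:
        codon = seq[pos:pos + 3].upper()
        if codon in STARTS:
            return pos
        if codon in STOPS:
            return None
        pos -= 3          # step one codon upstream, staying in frame
    return None
```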
6. If it is a draft sequence, a search is made for parts of genes in low quality
sequence or in contigs that are very short.
7. RNA genes and repeats are identified.
a. They assign 16S, 23S, 5S, tRNA, ffs, rnpB, and tmRNA, and others if
possible (using Rfam).
8. ORF calls that conflict with RNA genes are generally deleted as bogus.
9. Product description:
a. The surviving list of ORFs is submitted to several tools for
determining the gene product.
b. They are listed roughly according to the tool pecking order.
c. Description is acquired in descending order, using an internal set of
curated descriptions that reflects a defined vocabulary accepted by
NCBI:
i. TIGRFams - trusted cutoff
ii. PRIAM - 1e-30 cutoff
iii. Pfam (trusted cutoff) to identify protein families and domains
within a protein
iv. Interpro (internal rules) to recognize functional motifs and
domains
v. Uniprot (blast; 1e-5)
vi. KEGG (blast; 1e-5) for EC# classification scheme
vii. COGs (rpsblast; 1e-10) to obtain family-based classification
viii. blastp against nr
d. In automated annotation, all of the tools noted are run for each gene
e. The product description is generated by rule from those results.
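The descending-order lookup in step 9 can be pictured with a short sketch. The tool order and E-value cutoffs are those listed above; the actual rules used to generate descriptions are not given, so the first-passing-hit logic, the `hits` data layout, and the "hypothetical protein" fallback are all assumptions made for illustration.

```python
# (tool, E-value cutoff). None means the tool applies its own cutoff
# (trusted cutoff or internal rules), so any reported hit is taken as
# passing; no cutoff is stated for nr, so it is also left as None.
PECKING_ORDER = [
    ("TIGRFams", None),
    ("PRIAM", 1e-30),
    ("Pfam", None),
    ("Interpro", None),
    ("Uniprot", 1e-5),
    ("KEGG", 1e-5),
    ("COGs", 1e-10),
    ("nr", None),
]


def pick_description(hits):
    """hits maps tool name -> (description, evalue).
    Return (tool, description) for the highest-ranked passing hit."""
    for tool, cutoff in PECKING_ORDER:
        if tool not in hits:
            continue
        description, evalue = hits[tool]
        if cutoff is None or evalue <= cutoff:
            return tool, description
    # Assumed fallback; the description does not say what is assigned
    # when no tool produces a passing hit.
    return None, "hypothetical protein"
```

For example, a PRIAM hit at 1e-10 fails the 1e-30 cutoff, so a Uniprot hit at 1e-20 for the same gene would supply the description instead.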
10. MANUAL CURATION:
a. Only finished genomes get manual curation.
i. This has not been done for G.ura, G.bem, G.FRC-32, or P.pro
b. The following questions are dealt with manually, and in the process,
cases where ORFs overlap (by more than the max. allowable) are
resolved, choosing just one ORF.
i. Whether the predicted start codon is correct, and whether the
gene has the correct gene product name assigned to it.
ii. Start sites are changed, but only if a ribosomal binding site
supports the change.
iii. Some rules for start placement include:
1. When the modeling algorithms disagree, the Critica
ORFs are the most reliable.
2. An overlap of 1 bp or 4 bp with the upstream gene (in
the same operon) is preferred.
3. More than 4 bp of overlap is an indication of incorrect
start placement.
4. If, relative to blast hits, there are 20 or more extra
amino acids at the amino terminal, the start placement
is probably wrong.
5. Genes transcribed in opposite directions should have no
5’ overlap. Moreover, there needs to be room for the
promoters.
6. If the 5’ end of the ORF reaches the end of the contig,
the start site may be wrong.
c. The gene product descriptions are also reviewed and edited.
d. Product conflicts:
i. Generally Tfam or PRIAM wins, but not always.
ii. Sometimes they get a high Pfam score (to a family) and a
marginal Tfam score (to a specific function).
iii. If the other tools don't back up the Tfam (and do back the
Pfam), they go with Pfam. They are looking for consistency.
e. Generally, the blast db's only come into play when there are no
Tfam|PRIAM|Pfam|Interpro results.
i. An assignment is made only if the alignment shows >30%
identity over 80% of the match.
ii. Because COGs is based on orthology and not function, they
tend to use it only if it's the only result.
iii. SignalP is used to recognize signal peptides.
iv. TMHMM is used to identify the transmembrane structure of
transporters.
f. They no longer assign gene names (without a guest expert) - it's too
contentious.
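Several of the manual curation rules in step 10 are numeric and can be expressed as simple checks. The sketch below covers the start-placement rules on overlap and extra N-terminal residues, and the identity threshold used when falling back to the blast databases. The function and parameter names are hypothetical, and "over 80% of the match" is read here as a coverage fraction of at least 0.80, which is an assumption.

```python
def start_placement_flags(overlap_bp, extra_n_terminal_aa, divergent=False):
    """Return reasons to suspect the start site, based on rules 3 to 5.

    overlap_bp: overlap with the neighbouring upstream gene, in bp.
    extra_n_terminal_aa: residues extending beyond the blast hits.
    divergent: True if the two genes are transcribed in opposite directions.
    """
    flags = []
    if divergent and overlap_bp > 0:
        flags.append("5' overlap between oppositely transcribed genes")
    if not divergent and overlap_bp > 4:
        flags.append("more than 4 bp overlap with upstream gene")
    if extra_n_terminal_aa >= 20:
        flags.append("20 or more extra N-terminal residues")
    return flags


def blast_supports_assignment(percent_identity, coverage):
    """>30% identity over 80% of the match (coverage given as a fraction)."""
    return percent_identity > 30.0 and coverage >= 0.80
```

An empty list from `start_placement_flags` only means none of these numeric rules fired; the ribosomal binding site check and the room-for-promoters rule still need a curator's judgment.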