Young, Ned 1-3

ORNL Genome Annotations

After talking with Frank Larimer and reading the paper he suggested, I composed the following description:

ORNL Analysis and Annotation Process:

1. The gene finders Glimmer, Generation and Critica are applied independently and are set to allow overlapping gene models (ORFs).
   a. Glimmer uses interpolated Markov models (IMMs) to identify coding regions and uses ATG, GTG and TTG as potential starts.
      i. Glimmer is the most sensitive.
   b. Generation uses predominantly 6-mer statistics to recognize coding regions and a proximity rule-based start call with ATG and GTG as potential starts.
   c. Critica uses BLASTN to produce alignments from the entire dataset and derives dicodon statistics to recognize coding sequences.
      i. It recognizes ATG, GTG and TTG as potential starts.
      ii. Critica is the most accurate.
2. The training set selected for Generation and Glimmer consists of non-overlapping ORFs >900 bp long.
3. Each ORF is scored from 1 (only Glimmer found it) to 5.5 (all 3 methods agree).
   a. Scores of 3.5 or better are considered accurate.
4. If two gene models overlap over a significant part of their length, only one is kept.
   a. In the initial, automated curation phase, any place where there is >10% overlap is resolved.
      i. If one has a BLAST hit and one does not, the one without the hit is deleted.
      ii. If more than one has BLAST hits, the ORFs and their BLAST hits are aligned together.
5. The first valid start upstream of the aligned region is chosen as the start.
   a. Note that this may be revised later.
6. If it is a draft sequence, a search is made for parts of genes in low-quality sequence or in very short contigs.
7. RNA genes and repeats are identified.
   a. They assign 16S, 23S, 5S, tRNA, ffs, rnpB, and tmRNA, and others if possible (using Rfam).
8. ORF calls that conflict with RNA genes are generally deleted as bogus.
9. Product description:
   a. The surviving list of ORFs is submitted to several tools for determining the gene product.
   b. The tools are listed below in their approximate order of precedence.
   c. The description is taken from the tools in descending order of precedence, using an internal set of curated descriptions that reflects a defined vocabulary accepted by NCBI:
      i. TIGRFams (trusted cutoff)
      ii. PRIAM (1e-30 cutoff)
      iii. Pfam (trusted cutoff) to identify protein families and domains within a protein
      iv. InterPro (internal rules) to recognize functional motifs and domains
      v. UniProt (blast; 1e-5)
      vi. KEGG (blast; 1e-5) for the EC# classification scheme
      vii. COGs (rpsblast; 1e-10) to obtain family-based classification
      viii. blastp against nr
   d. In automated annotation, all of the tools noted are run for each gene.
   e. The product description is generated by rule from those results.
10. MANUAL CURATION:
   a. Only finished genomes get manual curation.
      i. This has not been done for G.ura, G.bem, G.FRC-32, or P.pro.
   b. The following questions are dealt with manually; in the process, cases where ORFs overlap by more than the maximum allowable amount are resolved by keeping just one ORF.
      i. Whether the predicted start codon is correct, and whether the gene has the correct gene product name assigned to it.
      ii. Start sites are changed, but only if a ribosomal binding site supports the change.
      iii. Some rules for start placement include:
         1. When the modeling algorithms disagree, the Critica ORFs are the most reliable.
         2. An overlap of 1 bp or 4 bp with the upstream gene (in the same operon) is preferred.
         3. More than 4 bp of overlap is an indication of incorrect start placement.
         4. If, relative to BLAST hits, there are 20 or more extra amino acids at the amino terminus, the start placement is probably wrong.
         5. Genes transcribed in opposite directions should have no 5’ overlap. Moreover, there needs to be room for the promoters.
         6. If the 5’ end of the ORF runs into the end of a contig, the start site may be wrong.
   c. The gene product descriptions are also reviewed and edited.
   d. Product conflicts:
      i. Generally TIGRFams or PRIAM wins, but not always.
      ii. Sometimes they get a high Pfam score (to a family) and a marginal TIGRFams score (to a specific function).
      iii. If the other tools don't back up the TIGRFams hit (and do back the Pfam hit), they go with Pfam. They are looking for consistency.
   e. Generally, the BLAST databases only come into play when there are no TIGRFams, PRIAM, Pfam or InterPro results.
      i. An assignment requires an alignment with >30% identity over 80% of the match.
      ii. Because COGs is based on orthology and not function, they tend to use it only if it's the only result.
      iii. SignalP is used to recognize signal peptides.
      iv. TMHMM is used to identify the transmembrane structure of transporters.
   f. They no longer assign gene names (without a guest expert); it's too contentious.
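
To make a few of the numeric rules above concrete, here is a minimal Python sketch of the start-placement checks (10.b.iii, rules 2 to 4) and the BLAST assignment threshold (10.e.i). The function names and flag wording are my own, invented for illustration. I am also assuming that "over 80% of the match" means an aligned fraction of at least 0.80. None of this is ORNL's actual code.

```python
from typing import List

# Thresholds taken from the notes above; all names are hypothetical.
PREFERRED_OVERLAPS_BP = {1, 4}   # preferred overlap with the upstream gene
MAX_OVERLAP_BP = 4               # more than this suggests a misplaced start
MAX_EXTRA_NTERM_AA = 20          # this many extra residues or more is suspect


def start_placement_flags(upstream_overlap_bp: int, extra_nterm_aa: int) -> List[str]:
    """Return reasons to suspect the start codon; an empty list means no flags."""
    flags = []
    if upstream_overlap_bp > MAX_OVERLAP_BP:
        flags.append("overlap with upstream gene exceeds 4 bp")
    if extra_nterm_aa >= MAX_EXTRA_NTERM_AA:
        flags.append("20 or more extra N-terminal residues relative to BLAST hits")
    return flags


def has_preferred_overlap(upstream_overlap_bp: int) -> bool:
    """True when the overlap with the upstream gene is exactly 1 bp or 4 bp."""
    return upstream_overlap_bp in PREFERRED_OVERLAPS_BP


def blast_hit_supports_assignment(percent_identity: float, fraction_aligned: float) -> bool:
    """Assumed reading of the rule: >30% identity with at least 80% aligned."""
    return percent_identity > 30.0 and fraction_aligned >= 0.80


if __name__ == "__main__":
    # A gene overlapping its upstream neighbor by 11 bp with 25 extra residues.
    for reason in start_placement_flags(upstream_overlap_bp=11, extra_nterm_aa=25):
        print("check start:", reason)
    print("usable BLAST hit:", blast_hit_supports_assignment(45.0, 0.92))
```

These checks only flag candidates for review. Per the notes, a start site is actually changed only when a ribosomal binding site supports the change, and that is not modeled here.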