Stage I: Trace File Processing - the Genome Database for Rosaceae

EST data at CUGI are processed using publicly available software incorporated into a fully
automated script developed in-house (ProEST). The processing occurs in three stages:
Stage I: Trace File Processing
Sequence trace files are converted into a fasta file and a quality score file using the phred
(Ewing et al., 1998) base-calling program. Vector and host contamination are identified
and masked using the sequence comparison program cross_match (Gordon et al., 1998).
Following vector trimming, which identifies the longest non-masked sequence, a further
round of trimming removes low quality bases at both ends of a read, where low quality is
defined as a phred score less than 20. Sequences are discarded if they have fewer than
100 bases with a phred score of at least 20, have more than 5% ambiguous bases, or have
more than 40 poly A or T bases in the sequence. At this point the script uploads all the
data to the CUGI Oracle database (CUGIdb) and generates an overall summary report file,
clone report tables, a GenBank submission file, and library files in fasta format of the
successfully trimmed sequences and associated quality values. The library file of
successfully trimmed sequences is further filtered to remove reads having significant
similarity to any species-specific mitochondrial, rRNA, tRNA or snoRNA sequences
downloaded from the GenBank nucleotide database.
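The trimming and discard rules above can be sketched in Python. This is a minimal illustration and not the ProEST code: the function names are hypothetical, the mask character is assumed to be X, and "more than 40 poly A or T bases" is read here as a single uninterrupted run of A or of T longer than 40 bases.

```python
import re

MIN_PHRED = 20                 # quality threshold stated in the text
MIN_GOOD_BASES = 100           # minimum number of bases with phred >= 20
MAX_AMBIGUOUS_FRACTION = 0.05  # more than 5% ambiguous bases is rejected
MAX_POLY_RUN = 40              # assumption: limit applies to one A or T run


def longest_unmasked_span(seq):
    """Return (start, end) of the longest run free of the mask character X."""
    best = (0, 0)
    for match in re.finditer(r"[^Xx]+", seq):
        if match.end() - match.start() > best[1] - best[0]:
            best = match.span()
    return best


def trim_low_quality_ends(seq, quals, min_phred=MIN_PHRED):
    """Remove bases below min_phred from both ends of the read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_phred:
        start += 1
    while end > start and quals[end - 1] < min_phred:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filters(seq, quals):
    """Apply the three discard rules to a trimmed read."""
    if not seq:
        return False
    if sum(1 for q in quals if q >= MIN_PHRED) < MIN_GOOD_BASES:
        return False
    upper = seq.upper()
    ambiguous = sum(1 for base in upper if base not in "ACGT")
    if ambiguous / len(upper) > MAX_AMBIGUOUS_FRACTION:
        return False
    too_long = MAX_POLY_RUN + 1
    if "A" * too_long in upper or "T" * too_long in upper:
        return False
    return True


def process_read(seq, quals):
    """Vector-trim, quality-trim, then filter; return None if discarded."""
    start, end = longest_unmasked_span(seq)
    seq, quals = trim_low_quality_ends(seq[start:end], quals[start:end])
    return (seq, quals) if passes_filters(seq, quals) else None
```

Calling `process_read(masked_sequence, phred_scores)` with a cross_match-masked read and its per-base scores returns the trimmed pair, or `None` when the read would be discarded.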
Stage II: Assembly of High Quality Sequences
In stage II the filtered library file is assembled using the contig assembly program CAP3
(Huang and Madan, 1999). More stringent parameters (-p 95 -d 60) are typically used to
prevent over-assembly and help identify potential paralogs.
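As an illustration of how these parameters might be passed to the assembler, the sketch below builds a CAP3 command line in Python. The helper name, the default executable name `cap3`, and the file name in the usage note are assumptions; only the `-p 95 -d 60` settings come from the text. In CAP3, `-p` sets the overlap percent identity cutoff and `-d` sets the maximum sum of quality scores at differences.

```python
def cap3_command(fasta_path, identity_cutoff=95, max_qscore_diff=60,
                 executable="cap3"):
    """Build the argument list for a CAP3 run with the stated parameters.

    The executable name is an assumption; adjust it to the local install.
    """
    return [
        executable,
        fasta_path,
        "-p", str(identity_cutoff),   # overlap percent identity cutoff
        "-d", str(max_qscore_diff),   # max quality score sum at differences
    ]
```

The resulting list could be handed to `subprocess.run(cap3_command("library.fasta"), check=True)`, where `library.fasta` is a placeholder for the filtered library file.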
Stage III: Annotation
Annotation consists of pairwise comparison of the filtered library and the contig
consensus library file against the GenBank nr protein database using the fastx3.4
algorithm (Pearson and Lipman, 1988). The 10 most significant matches (EXP < 1e-9)
for each query sequence are recorded. The script generates a web page which displays the
best protein match for each contig as well as the clones that comprise the contig. Users
can view the data at varying levels of detail, including alignments of each clone, down to
the level of the ten most significant matches to individual clones. The best match for
each contig and singleton is displayed on a web site.
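The hit-selection rule (keep up to 10 matches with an expectation value below 1e-9) can be sketched as follows. This is an illustrative sketch and not the ProEST code: the function name and the `(subject_id, evalue)` tuple layout are hypothetical, and parsing of the fastx output is left out.

```python
def top_hits(hits, max_evalue=1e-9, limit=10):
    """Return up to `limit` hits with E-value below `max_evalue`, best first.

    `hits` is an iterable of (subject_id, evalue) tuples; this layout is an
    assumption made for the example.
    """
    significant = [hit for hit in hits if hit[1] < max_evalue]
    return sorted(significant, key=lambda hit: hit[1])[:limit]
```

The first element of the returned list, when present, corresponds to the best match shown for a contig or singleton.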
Ewing, B., Hillier, L., Wendl, M. and Green, P. (1998). Base-calling of automated
sequencer traces using phred. I. Accuracy assessment. Genome Research 8, 175-185.
Gordon, D., Abajian, C. and Green, P. (1998). Consed: A graphical tool for sequence
finishing. Genome Research 8, 195-202.
Huang, X. and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome
Research 9, 868-877.
Pearson, W.R. and Lipman, D.J. (1988). Improved tools for biological sequence
comparison. Proceedings of the National Academy of Sciences, USA 85, 2444-2448.