EST data at CUGI are processed with publicly available software incorporated into a fully automated script developed in-house (ProEST). The processing occurs in three stages:

Stage I: Trace File Processing
Sequence trace files are converted into a fasta file and a quality score file using the base-calling program phred (Ewing et al., 1998). Vector and host contamination is identified and masked using the sequence comparison program cross_match (Gordon et al., 1998). Vector trimming identifies the longest non-masked sequence; a further round of trimming then removes low quality bases at both ends of a read, where low quality is defined as a phred score of less than 20. Sequences are discarded if they have fewer than 100 bases with a phred score of at least 20, have greater than 5% ambiguous bases, or have more than 40 poly-A or poly-T bases in the sequence. At this point the script uploads all the data to the CUGI Oracle database (CUGIdb). At this stage of the processing the script also generates an overall summary report file, clone report tables, a GenBank submission file, and library files in fasta format of the successfully trimmed sequences and their associated quality values. The library file of successfully trimmed sequences is further filtered to remove reads having significant similarity to any species-specific mitochondrial, rRNA, tRNA or snoRNA sequences downloaded from the GenBank nucleotide database.

Stage II: Assembly of High Quality Sequences
In stage II the filtered library file is assembled using the contig assembly program CAP3 (Huang and Madan, 1999). More stringent parameters (-p 95 -d 60) are typically used to prevent over-assembly and to help identify potential paralogs.

Stage III: Annotation
Annotation consists of pairwise comparison of the filtered library and the contig consensus library file against the GenBank nr protein database using the fastx3.4 algorithm (Pearson and Lipman, 1988).
The 10 most significant matches (EXP < 1e-9) are recorded. The script generates a web page which displays the best protein match for each contig as well as the clones that comprise the contig. Users can view the data at varying levels of detail, down to the alignments of the ten most significant matches for individual clones. The best match for each contig and singleton is displayed on a web site.

Ewing, B., Hillier, L., Wendl, M. and Green, P. (1998). Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Research 8, 175-185.
Gordon, D., Abajian, C. and Green, P. (1998). Consed: A graphical tool for sequence finishing. Genome Research 8, 195-202.
Huang, X. and Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Research 9, 868-877.
Pearson, W.R. and Lipman, D.J. (1988). Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, USA 85, 2444-2448.
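
The end trimming and discard criteria described under Stage I can be sketched as follows. This is a minimal illustration under stated assumptions, not the ProEST script itself: the function and constant names are hypothetical, "ambiguous bases" is taken to mean any character other than A, C, G or T, and "more than 40 poly-A or poly-T bases" is interpreted as a single uninterrupted run of A or of T longer than 40 bases. Vector masking with cross_match is assumed to have been done already.

```python
import re

# Thresholds taken from the Stage I description; names are hypothetical.
MIN_PHRED = 20                 # bases below this score count as low quality
MIN_GOOD_BASES = 100           # minimum number of bases with phred >= 20
MAX_AMBIGUOUS_FRACTION = 0.05  # more than 5% ambiguous bases -> discard
MAX_POLY_RUN = 40              # assumed: longest allowed run of A or of T


def trim_low_quality_ends(seq, quals, min_phred=MIN_PHRED):
    """Remove bases with phred < min_phred from both ends of a read."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists differ in length")
    start, end = 0, len(seq)
    while start < end and quals[start] < min_phred:
        start += 1
    while end > start and quals[end - 1] < min_phred:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filters(seq, quals):
    """Return True if a trimmed read survives the three discard criteria."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists differ in length")
    # Criterion 1: at least 100 bases with phred >= 20.
    if sum(1 for q in quals if q >= MIN_PHRED) < MIN_GOOD_BASES:
        return False
    bases = seq.upper()
    # Criterion 2: no more than 5% ambiguous (non-ACGT) bases.
    ambiguous = sum(1 for b in bases if b not in "ACGT")
    if ambiguous / len(bases) > MAX_AMBIGUOUS_FRACTION:
        return False
    # Criterion 3 (assumed interpretation): no A or T run longer than 40.
    longest_run = max(
        (len(m.group()) for m in re.finditer(r"A+|T+", bases)), default=0
    )
    return longest_run <= MAX_POLY_RUN
```

A read would first be passed through trim_low_quality_ends and the result then checked with passes_filters; reads that return False correspond to the discarded sequences.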