Using secondary database sequences to select cDNA clones for full-length sequencing: Schistosoma mansoni as a model

Faria-Campos, A. C.1; Mudado, M. A.1; Franco, G. R.1; Campos, S. V. A.2 and Ortega, J. M.1*

1 Dep. Bioquímica e Imunologia, ICB-UFMG and 2 Dep. Ciência da Computação, ICEX-UFMG, Universidade Federal de Minas Gerais, Belo Horizonte, MG, 31270-901, Brazil

Running head: Selection of cDNA clones for full-length sequencing

*To whom correspondence should be addressed: Lab. Biodados, ICB - UFMG, Av. Antonio Carlos 6627, Belo Horizonte, MG 31270-901, Brazil

ABSTRACT

Motivation: Using high-quality, classified sequences to guide mining, clone selection and dedicated sequencing can be an innovative step in transcriptome projects. In this context, selecting clones for full-length sequencing is an important procedure, since the biological characterization of a complete protein yields more reliable information than that obtained from assembled ESTs.

Results: We selected Schistosoma mansoni clones for full-length sequencing by predicting the presence of the initial methionine in the query sequence, based on codon balance and on the position of the alignment start. BLAST searches were performed against KOG sequences, a database of classified proteins from model organisms. Using 28,541 ESTs from S. mansoni available in dbEST (ORESTES excluded), we selected 1861 putative full-length S. mansoni clones. Reliability tests were performed using sequences from S. mansoni, Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens, and they demonstrated the trustworthiness of the method.

Conclusion: The results indicate that this procedure can be used efficiently to select clones for full-length sequencing and can contribute significantly to complete CDS projects.

Contact: miguel@icb.ufmg.br
Supplementary information: http://biodados.icb.ufmg.br

INTRODUCTION

Over the last decade, the scientific community has witnessed the sequencing of whole genomes, from microorganisms to eukaryotic species, including our own. Most of the resulting flood of data has been deposited in public databases (The C. elegans consortium, 1998; Celniker, 2000; Collins et al., 2001; Venter et al., 2001). In addition to genome projects, a number of complementary approaches are being undertaken to characterize the encoded genes, e.g., the production of Expressed Sequence Tags (ESTs), single-pass, low-fidelity sequences from the expressed portion of the genome which, in turn, add more sequences to the databases (Adams et al., 1991).

The current challenge, therefore, lies in transforming the great amount of sequencing data generated into biologically meaningful information, i.e., finding the genes and the proteins they encode, placing the encoded data into a cellular context, and defining the biochemical pathways and/or functional categories to which they belong, among other features. Gene annotation is the main process used to accomplish this task, since the aim of high-quality annotation is to identify the key features of genomes: genes and their products (Stein, 2001).

The quality of the annotation of new sequences depends critically on the reliability and completeness of the database holding the sequences used as annotation reference. Using inadequately annotated sequences as the basis for annotation will inevitably produce inaccurate and/or incomplete new information which will, in turn, generate more low-quality annotation sources (Natale et al., 2000). Using secondary databases as the source of reference sequences improves annotation quality, since they carry valuable biological information and usually present curated data covering protein structure, function and evolution.
The use of sequences from secondary databases as reference for the investigation of unknown sequences has been reported with success by several authors (Natale et al., 2000; Camon et al., 2003; Nishikawa et al., 2003; Faria-Campos et al., 2003). Using high-quality sequences to aid annotation and to direct sequencing can be an innovative step in transcriptome projects, since their main goal is to identify new genes in a given organism. In such projects, the selection of clones to be sequenced is essentially random, and the sequences generated (ESTs) represent snapshots of the transcribed genes at a specific point in the organism's lifetime. However, a clearer picture of the organism's proteome is obtained when the full-length sequence of specific genes is known. For that, it is essential to select clones that potentially contain the complete sequence of given genes and then to sequence such clones.

The selection of ESTs for further characterization can be made with software that searches for a complete Open Reading Frame (ORF), such as ESTscan (Iseli et al., 1999), or by similarity searches with the well-known BLAST algorithm (Altschul et al., 1990). The initial step in the characterization is to determine the completeness of the cDNA coding region (CDS) from which the ESTs were generated and to annotate it. Several authors have developed systems to attain this goal which use, among other tools, similarity searches, statistical information and genome mapping (Salamov et al., 1998; Nishikawa et al., 2000; Del Val et al., 2003; Furuno et al., 2003; Hotz-Wagenblatt et al., 2003). We reasoned that using, as queries, the biological descriptions of sequence clusters contained in function-oriented databases might result in a quick mining and selection procedure.

One of the organisms that has been the subject of EST projects is the trematode Schistosoma mansoni, an important human parasite whose genome is estimated to be 270 megabases long.
A great number of ESTs have already been sequenced from this parasite, and many more are underway, as an outcome of several worldwide initiatives, including the S. mansoni projects in Brazil developed by two main groups: the Minas Gerais Genome Network and the São Paulo Genome Network. However, the number of characterized proteins available for this organism is still small, and studies aiming to change this picture and enrich the S. mansoni protein databases are currently necessary. Because it is evolutionarily distant from the organisms represented in secondary databases, S. mansoni is a particularly attractive model for this sort of study.

In this work we selected S. mansoni clones for full-length sequencing by using similarity searches against sequences from the secondary database KOG. We were able to select 1861 S. mansoni clones for full-length sequencing, corresponding to 1171 different proteins. All selected clones were searched for a Kozak consensus in the vicinity of the predicted initial amino acid, and its presence was confirmed in 66% of the clones. Reliability tests of the procedure were also performed and showed that the method is reliable and that the error observed in the prediction was small. Since the selected clones represent enzymes of well-characterized metabolic pathways in eukaryotes, we expect that the knowledge retrieved from their full-length sequences will contribute to the biological characterization of the organism.
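
The selection idea outlined above can be illustrated with a minimal Python sketch. It is only one possible reading of the procedure, not the implementation used in this work. It assumes that "codon balance" means comparing the number of subject residues that precede the alignment start with the number of complete codons available upstream of the alignment in the EST, and it assumes a simplified Kozak check that inspects only the two best-known positions (a purine at -3 and a G at +4 relative to the A of ATG). The function names, the "strong/adequate/weak" labels and the example sequence are hypothetical.

```python
PURINES = {"A", "G"}


def predicted_start_codon(q_start, s_start):
    """Return the 0-based index in the EST where the initiation codon
    is expected, or None if the EST does not reach that far 5'.

    q_start: 1-based nucleotide position in the EST (query) where the
             translated alignment begins (assumed to be codon-aligned).
    s_start: 1-based residue position in the reference protein (subject)
             where the alignment begins.
    """
    missing_residues = s_start - 1          # subject residues before the hit
    index = (q_start - 1) - 3 * missing_residues
    return index if index >= 0 else None


def kozak_context(seq, atg_index):
    """Classify the context of a putative start codon (simplified check).

    Only positions -3 (purine) and +4 (G) are examined; the labels are
    illustrative and not taken from the original procedure.
    """
    seq = seq.upper()
    if seq[atg_index:atg_index + 3] != "ATG":
        return "no ATG"
    minus3 = seq[atg_index - 3] if atg_index >= 3 else None
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else None
    has_purine = minus3 in PURINES
    has_g = plus4 == "G"
    if has_purine and has_g:
        return "strong"
    if has_purine or has_g:
        return "adequate"
    return "weak"


# Hypothetical EST: the alignment starts at query nucleotide 13 and at
# residue 2 of the reference protein, so one residue (the Met) is missing.
est = "GCCGCCACCATGGCGTCT"
start = predicted_start_codon(q_start=13, s_start=2)
if start is not None:
    print(start, kozak_context(est, start))
```

In this sketch a clone would be kept as a putative full-length candidate only when `predicted_start_codon` returns a position and an ATG is actually found there; the Kozak classification serves as an additional, independent line of evidence.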