Using secondary database sequences to select cDNA clones for full-length sequencing: Schistosoma mansoni
as a model
Faria-Campos, A. C.1; Mudado, M. A.1; Franco, G. R.1; Campos, S. V. A.2 and Ortega, J. M.1*
1Dep. Bioquímica e Imunologia, ICB-UFMG and 2Dep. Ciência da Computação, ICEX-UFMG
Universidade Federal de Minas Gerais, Belo Horizonte, MG, 31270-901, Brazil
Running head
Selection of cDNA clones for full-length sequencing
*To whom correspondence should be addressed
Lab. Biodados, ICB - UFMG
Av. Antonio Carlos 6627, Belo Horizonte, MG
31270-901, Brazil
ABSTRACT
Motivation: Using high-quality, classified sequences to guide the mining and selection of clones for dedicated
sequencing can be an innovative step in transcriptome projects. In this context, the selection of clones for
full-length sequencing is an important procedure, since the biological characterization of a complete protein
produces more reliable information than that obtained from assembled ESTs. Results: We selected Schistosoma
mansoni clones for full-length sequencing by predicting the presence of the initial methionine in the query
sequence, based on codon balance and the position of the alignment start. BLAST searches were performed
against KOG sequences, a database of classified proteins from model organisms. Using 28,541 ESTs from
S. mansoni available in dbEST (ORESTES excluded), we were able to select 1861 S. mansoni putative full-length
clones. Reliability tests were performed using sequences from S. mansoni, Arabidopsis thaliana, Caenorhabditis
elegans, Drosophila melanogaster and Homo sapiens, and these tests demonstrated the trustworthiness of the
method. Conclusion: Results indicate that such a procedure can be efficiently used to select clones for
full-length sequencing and can contribute significantly to complete CDS projects.
Contact
miguel@icb.ufmg.br.
Supplementary information
http://biodados.icb.ufmg.br
INTRODUCTION
Over the last decade, the scientific community has witnessed the sequencing of whole genomes, from
microorganisms to eukaryotic species, including our own. The resulting boom of data has mostly been
deposited into public databases (The C. elegans consortium, 1998; Celniker, 2000; Collins et al., 2001;
Venter et al., 2001). In addition to genome projects, a number of complementary approaches are being
undertaken to characterize the encoded genes, e.g., the production of Expressed Sequence Tags (ESTs),
single-pass, low-fidelity sequences from the expressed portion of the genome, which in turn add more
sequences to the databases (Adams et al., 1991). The current challenge therefore lies in transforming the
large amount of sequencing data generated into biologically meaningful information, i.e., finding the genes
and the proteins they encode, placing the encoded data into cellular context, and defining the biochemical
pathways and/or functional categories to which they belong, among other features. Gene annotation is the
main process used to accomplish this task, since the aim of high-quality annotation is to identify the key
features of genomes: genes and their products (Stein, 2001).
The quality of the annotation of new sequences depends critically on the reliability and completeness of
the database harboring the sequences used as annotation reference. The use of inadequately annotated
sequences as the basis for annotation will inevitably produce inaccurate and/or incomplete new information,
which will in turn generate more low-quality annotation sources (Natale et al., 2000). The use of secondary
databases as a source of reference sequences improves annotation quality, since they bear valuable
biological information and usually present curated data covering protein structure, function and evolution.
The use of sequences from secondary databases as reference for the investigation of unknown sequences has
been reported with success by several authors (Natale et al., 2000; Camon et al., 2003; Nishikawa et al.,
2003; Faria-Campos et al., 2003).
The use of high-quality sequences to aid annotation and direct sequencing can be an innovative step in
transcriptome projects, whose main goal is to identify new genes in a given organism. In such projects, the
selection of clones to be sequenced follows an essentially random approach, and the sequences generated
(ESTs) represent snapshots of the transcribed genes at a specific point in the organism's lifetime. However,
a clearer picture of the organism's proteome is provided when the full-length sequence of specific genes is
known. For that, it is essential to select clones that potentially contain the complete sequence of given
genes and then sequence such clones.
The selection of ESTs for further characterization can be made with software that seeks a complete Open
Reading Frame (ORF), such as ESTscan (Iseli et al., 1999), or by similarity searches with the well-known
algorithm BLAST (Altschul et al., 1990). The initial step in the characterization is to determine the
completeness of the cDNA coding region (CDS) from which the ESTs were generated and annotated. Several
authors have developed systems to attain this goal which use, among other tools, similarity searches,
statistical information and genome mapping (Salamov et al., 1998; Nishikawa et al., 2000; Del Val et al.,
2003; Furuno et al., 2003; Hotz-Wagenblatt et al., 2003). We reasoned that using biological descriptions
from sequence clusters contained in function-oriented databases as queries might result in a quick mining
and selection procedure.
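
As an illustration of how a similarity hit can inform a judgment about CDS completeness, the following
minimal sketch checks whether an EST could still contain the codon for the first residue of the matched
protein. It is not the procedure used in this work, which also relies on codon balance. The record type
Hit, its field names, the function may_contain_start and the threshold max_missing_residues are all
hypothetical, and the sketch assumes a translated search reported on the forward strand with 1-based
coordinates.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One translated-search hit (hypothetical record, forward strand, 1-based)."""
    query_id: str
    query_start: int    # nucleotide position on the EST where the alignment starts
    subject_start: int  # residue position on the protein where the alignment starts


def may_contain_start(hit: Hit, max_missing_residues: int = 10) -> bool:
    """Return True if the EST could include the codon for the protein's first residue.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    # Protein residues that lie upstream of the aligned region.
    missing = hit.subject_start - 1
    if missing > max_missing_residues:
        return False
    # Nucleotides available on the EST upstream of the aligned region.
    upstream_nt = hit.query_start - 1
    # Each missing residue needs one codon (3 nt) of unaligned upstream sequence.
    return upstream_nt >= 3 * missing
```

For example, a hit whose alignment starts at residue 5 of the protein needs at least 12 unaligned
nucleotides upstream on the EST to be kept as a candidate under these assumptions.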
One of the organisms that has been the subject of EST projects is the trematode Schistosoma mansoni, an
important human parasite whose genome is estimated to be 270 megabases long. A great number of ESTs have
already been sequenced from this parasite, and many more are underway as an outcome of several worldwide
initiatives, including the S. mansoni projects in Brazil, developed by two main groups: the Minas Gerais
Genome Network and the São Paulo Genome Network. However, the number of characterized proteins available
for this organism is still small, and studies aiming to change this picture and enrich S. mansoni protein
databases are currently necessary. Since it is evolutionarily distant from the organisms present in
secondary databases, S. mansoni is strikingly attractive as a model for this sort of study.
In this work we selected S. mansoni clones for full-length sequencing by using similarity searches against
sequences from the secondary database KOG. We were able to select 1861 S. mansoni clones for full-length
sequencing, corresponding to 1171 different proteins. All selected clones were searched for the Kozak
consensus in the vicinity of the predicted initial amino acid, and its presence was confirmed in 66% of all
clones. Reliability tests of the procedure were also performed and showed that the method is reliable and
that the error observed in the prediction was small. Since the selected clones represent enzymes of
well-characterized metabolic pathways in eukaryotes, we expect that the knowledge retrieved from their
full-length sequences will contribute to the biological characterization of the organism.
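
A search for a Kozak-like context around a predicted start codon can be sketched as follows. The sketch
inspects only the two positions generally regarded as the most informative in the consensus, a purine at
-3 and a G at +4 relative to the A of the ATG. The function name kozak_like and its return format are
hypothetical, and the exact criteria applied to the S. mansoni clones are not given in this section, so
this should be read as an assumption-laden illustration rather than the test actually used.

```python
def kozak_like(seq: str, atg_index: int) -> dict:
    """Inspect the -3 and +4 positions around an ATG located at 0-based atg_index.

    Positions are numbered with the A of ATG as +1, so -3 is three bases
    upstream of the A and +4 is the base immediately after the G of ATG.
    """
    seq = seq.upper()
    if seq[atg_index:atg_index + 3] != "ATG":
        raise ValueError("no ATG at the given index")
    # A position that falls outside the sequence is reported as not matching.
    minus3 = seq[atg_index - 3] if atg_index >= 3 else None
    plus4 = seq[atg_index + 3] if atg_index + 3 < len(seq) else None
    return {
        "purine_at_minus3": minus3 in ("A", "G"),
        "g_at_plus4": plus4 == "G",
    }
```

Called on the sequence GCCACCATGGCG with atg_index 6, both checks are satisfied; a caller could require
one or both features depending on how strict the screen should be.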