Contents
1 Introduction and goals
1.1 Introduction
1.2 Goals
2 Background
2.1 ESTs, contigs, and SNPs
2.2 Ontologies
3 Database and interface design
3.1 Overview
3.2 Data collection
3.2.1 EMBL EST data
3.2.2 UniLib EST library data
3.2.3 Gramene tissue ontology data
3.3 Data storage
3.3.1 EMBL EST data
3.3.2 UniLib EST library data
3.3.3 Gramene tissue ontology data
3.3.4 Sequence mask data
3.3.5 Contig and SNP data
3.3.6 Indices
3.3.7 Redundancy versus speed
3.4 Data import
3.4.1 EMBL EST data
3.4.2 UniLib EST library data
3.4.3 Gramene tissue ontology data
3.5 Data processing
3.5.1 Overview
3.5.2 Vector contamination screening
3.5.3 Contig building
3.5.4 SNP detection
3.5.5 Results
3.6 Data presentation
4 Conclusion and future work
4.1 Conclusion
4.2 Future work
Literature References
Acknowledgements
Appendices
Appendix 1: Overview of plant EST databases on the internet
Appendix 2: Entities in the plantEST database
Appendix 3: updateSpecies.sql – updates to tables SPECIES and EST_LIB
Summary
Expressed Sequence Tags (ESTs) are single-read sequence fragments produced from reverse-transcribed
mRNA molecules. They are especially important in plant genetics, as the draft genome sequences of only
two plant species are available. This low number is attributable to the fact that plant genomes are generally
large and contain many repeats. During the last few years an increasing number of EST sequencing projects
have been undertaken to cover the expressed part of the genome for a large number of plant species. The
data that is generated in these projects is usually stored in large public sequence databases such as NCBI
dbEST and EMBL.
The significant sequence redundancy in the EST datasets, in combination with the presence of sequences
from many different varieties of a broad range of plant species, makes it attractive to screen plant ESTs for
sequence features such as Single Nucleotide Polymorphisms (SNPs). SNPs are the most abundant sequence
variations in most genomes, and can be used in association studies, in linkage analyses, and for generating
high-density genetic maps that can be used in the study of disease-susceptibility genes. In most plant
genomes SNPs are found at a frequency of roughly 1 in every 60 to 600 nucleotides.
The EST data from the public databases are available as flat text files, and although this is a suitable format
for data distribution, it does not allow for convenient processing. The plantEST relational database was
designed to store these EST data from the EMBL database in a data structure that allows easy retrieval,
modification, and processing of the data therein. Data selection was improved through the addition of
annotation data from the UniLib database, and ontology terms from the Gramene Cereal Plant Anatomy
Ontology. To evaluate the functionality of the plantEST database in the processing of EST sequences, a
sample EST data set was extracted from the database and used to detect SNPs. To obtain these SNPs the
EST sequences were screened for vector contamination sequences and clustered into contigs. The plantEST
relational database structure was designed to allow storage of the data that was generated in the contig
building process, in addition to the EST data. Additionally, interfaces were designed to view and extract data
from the database in a user-friendly manner.
1 Introduction and goals
1.1 Introduction
This report describes the construction and evaluation of plantEST, a relational database system for plant
EST, contig, and polymorphism data. The database was developed during a six-month thesis for a
Bioinformatics Master degree at the Centre for Molecular and Biomolecular Informatics
(http://www.cmbi.kun.nl/) in Nijmegen. The plantEST database design was based on a request from
KeyGene (http://www.keygene.com/) to develop a plant EST database for use in their EST processing
pipelines. The database definitions of the plantEST database, as well as the source code for all the plantEST
Explorer program modules described in this report, are available from the author on request.
1.2 Goals
The research and database development that is described in this report was performed during a six-month
thesis for a Bioinformatics Master degree. The goals of this research are (1) to create a data structure that
allows fast and flexible access to plant EST data, (2) to refine the annotation of this EST data, and (3) to prove the
functionality of this data structure in EST processing. The limited time span of the project narrows these
global goals to the concrete goals outlined below.
Realization of the first goal must result in a data structure, called the plantEST database, that allows the data
to be searched on many criteria. This structure must be easily adaptable to store additional data types such as
the contig and polymorphism data that will be produced during the realization of the third goal. The data
must be accessible through a user-friendly interface that can be easily adapted to display the data in a
different fashion, and to display other data types. The EST data that will be stored in this data structure
consists of all plant EST sequences from the EMBL Nucleotide Sequence Database.
To achieve the second goal, the EMBL data must be complemented with additional annotation information
from the UniLib database. Furthermore, the tissue type annotation from the EMBL data must be refined
through the use of the Gramene Cereal Plant Anatomy Ontology.
The third goal must be realized through the creation of contigs from a part of the EST dataset and the
detection of SNPs from these contig data. The data structure from the first goal must be adapted towards
storing these data, and user interfaces must be developed to display the contig and polymorphism data.
2 Background
2.1 ESTs, contigs, and SNPs
Because the genome sequences of many plant species are not available, genetic analysis in these organisms
relies on genetic markers. The most abundant markers that can be used for genetic mapping are single
nucleotide polymorphisms (SNPs). A SNP is the substitution of a single nucleotide for another between two
individuals of the same species. In important crop species such as maize and soybean, SNPs occur at
frequencies on the order of 1 SNP per 60 to 600 nucleotides (Kota et al, 2003). In addition to their
high abundance, advantages of SNPs over other molecular markers are their phenotypic neutrality and the
fact that they exist predominantly in only two variants, a property that is highly valuable in linkage analysis
and genotyping (Torjek et al, 2003). Moreover, compared to other genetic markers such as tandem repeats,
SNPs have a higher stability (Picoult-Newberg et al, 1999). The development of high-throughput detection
methods for SNPs has led to their use as one of the most important genetic markers in agricultural breeding
(Batley et al, 2003).
Traditional identification of SNPs is predominantly based on targeted DNA amplification and sequencing of
single genes. These methods require an ab initio knowledge of the sequence that contains the polymorphism
(Torjek et al, 2003). Recently developed approaches employ the scanning of complete genomes for this kind
of polymorphisms to generate high-density genetic maps. However, this requires the availability of a species'
genome sequence. As these are not available for important crop species such as soybean and maize,
expressed sequence tags (ESTs) have recently gained popularity as a potential source for polymorphism data
such as SNPs.
ESTs are single-read sequences produced from reverse-transcribed mRNA molecules. The typical
preparation of an EST library results in a highly redundant set of nucleotide sequences. These ESTs
represent a sampling of the transcriptome of the organism at a specific developmental stage under specific
conditions (Rudd, 2003). The relatively low cost and easy automation of EST sequencing have resulted in the
availability of more than four million plant EST sequences in the three major public databases (NCBI,
EMBL and DDBJ). Because the ESTs from the public databases have been obtained from many different
individuals, comparisons of these sequences can lead to the discovery of polymorphisms such as SNPs.
There are two major limitations to EST sequences when compared to genome sequences for use in genetic
analysis. The first involves the distribution of sampling of host genes within an EST library, which, under
normal library construction conditions, represents the ratio of mRNAs present within a specific tissue under
specific environmental conditions. Common housekeeping genes are ubiquitously expressed within almost any
cell type and are overrepresented in most unnormalized EST collections, whereas specialized gene products
may not be present at all because they were not expressed in detectable amounts. The second limitation to
ESTs is the overall sequence quality and length. The quality of individual nucleotides within an EST
sequence is partly determined by the biochemistry of the reverse transcriptase polymerase chain reaction
(RT-PCR) and sequencing reaction that are applied to generate these sequences. Other quality-determining factors are the interpretation of the electropherogram trace data by the automatic sequencer, and the
quality and quantity of the sample from which the nucleotide sequence is determined. Because the sequence
is read single-pass it can contain a number of incorrectly sequenced residues. Additionally, EST sequences
may contain stretches of vector or polylinker contamination (Rudd, 2003).
The maximum length of reliable sequence of an EST is usually limited to several hundreds of nucleotides,
which is significantly shorter than most full-length plant gene transcripts. Larger continuous stretches of
sequence data can be generated by making use of the sequence redundancy that is present in large EST
collections such as the EMBL database. This is achieved by building contigs from the EST sequences.
The use of the word contig in this context requires a small explanation, as this word is usually used in the
assembly of sequence fragments from genome sequencing projects. In the glossary of the NCBI Handbook
(http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=handbook) a contig is
defined as 'a contiguous segment of the genome made by joining overlapping clones or sequences'. In
plantEST database terms, a contig consists of a collection of overlapping EST sequences which ideally
represents a single gene transcript. The term cluster is not used as this term refers to a more general grouping
of sequences based on certain criteria such as sequence homology or expression data.
A large set of methods to build contigs from EST collections is currently available, and most of these
methods follow the same general outline. To create contigs, the available EST sequences from a species are
first clustered on the basis of sequence similarity. In the next step a multiple alignment of each cluster of
sequences is made, and a consensus sequence is deduced. The consensus sequence is of higher overall
quality and length than the individual ESTs. The consensus sequence is determined either by
electropherogram trace data, by internal program statistics, or by a combination of both. Nucleotide
mismatches between the consensus sequence and the individual EST sequences represent polymorphisms,
sequencing errors or differences between highly homologous gene transcripts.
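The consensus step described above can be illustrated with a minimal Python sketch. It assumes that the clustering and multiple alignment have already been done and that the aligned ESTs are given as equal-length strings with '-' marking positions that a read does not cover. A simple majority vote per column stands in for the trace-based or statistical consensus calling of real contig building tools; the function names and the three example reads are made up for illustration.

```python
from collections import Counter

def consensus(aligned):
    """Majority-vote consensus of equal-length aligned sequences ('-' = no base)."""
    result = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != "-")
        # Take the most frequent base in this column, or '-' if no read covers it.
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)

def mismatches(aligned, cons):
    """List (sequence index, position, base) for every base that differs from the consensus."""
    return [(i, pos, base)
            for i, seq in enumerate(aligned)
            for pos, base in enumerate(seq)
            if base != "-" and base != cons[pos]]

# Three hypothetical, already aligned EST reads.
aligned_ests = [
    "ACGTACGT--",
    "--GTACGTAC",
    "ACGTTCGTAC",
]
cons = consensus(aligned_ests)
candidates = mismatches(aligned_ests, cons)
```

Each entry in `candidates` is a mismatch between an EST and the consensus. Whether it is a polymorphism, a sequencing error, or a difference between homologous transcripts cannot be decided from this sketch alone.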
In most EST contig building approaches, the EST sequences are trimmed of vector contamination prior to
clustering. Contamination of the EST sequence by vector sequence fragments can arise during the molecular
cloning of the EST. During the excision of an EST from its cloning vector, small fragments of the vector
DNA can remain attached to the EST. These short stretches of vector sequence can result in biologically
false overlaps between ESTs during contig building, because two sequences that share a common vector
contamination can be joined together.
Next to the removal of vector contaminations, or vector masking, other strategies can be employed to
facilitate the contig building procedure. In addition to vector sequences, EST libraries can be contaminated
with sequences from the cloning host. Furthermore, the presence of repeats in the EST sequences can
hamper the contig building during the phase when overlaps between sequences are determined. Both these
problems can be solved similarly to the vector contamination problem, namely by masking the undesired
nucleotides in the EST sequences.
The large number of nucleotide mismatches that is usually present within a contig of EST sequences is
partially attributable to the low quality of the EST sequences. A number of different software applications
have been described that focus on separating meaningful sequence polymorphisms from sequencing errors
(Picoult-Newberg et al, 1999; Batley et al, 2003; Kota et al, 2003). Many of these applications make use of
the electropherogram trace data to distinguish between 'true' and 'false' polymorphisms. However, recently
developed algorithms such as SNiPper (Kota et al, 2003) can significantly enrich the pool of nucleotide
mismatches for true SNPs without these trace data.
2.2 Ontologies
Clustering and comparing biologically related data can be performed at other levels than grouping
nucleotide sequences. The annotation data that describes the biological origin of the EST sequences and
libraries can be used to compare gene expression in various tissues under various environmental conditions.
However, such comparisons are often hindered by the variability of terms used to describe comparable
objects. Ontologies such as the Zea mays Plant Structure Ontology (Vincent et al, 2003) and the Gramene
Cereal Plant Anatomy Ontology (Ware et al, 2002) have been designed to overcome this problem.
Vincent et al (2003) describe an ontology as 'a classification methodology for formalizing a subject's
knowledge in a structured way'. In general, an ontology can be represented as a tree-like structure, or more
specifically as a rooted and connected Directed Acyclic Graph (DAG) of terms with a defined meaning from
a controlled vocabulary. Each element in the graph can be connected to other elements via hierarchical and
associative relationships such as IS-A and PART-OF. Therefore it is important that the terms in an ontology
are well defined and that the relationship between the terms is accurately represented. Correct and concise
annotation of the controlled vocabulary terms by means of definitions and the inclusion of synonyms
facilitate the consistent use of ontologies.
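As an illustration of the DAG representation, the following Python sketch stores a few terms together with their IS-A and PART-OF relationships and collects all ancestors of a term. The terms and relationships are hypothetical and are not taken from any of the ontologies mentioned in this report.

```python
# Hypothetical mini-ontology: term -> list of (relationship, parent term).
ONTOLOGY = {
    "plant structure": [],
    "organ": [("IS-A", "plant structure")],
    "leaf": [("IS-A", "organ")],
    "tissue": [("IS-A", "plant structure")],
    "epidermis": [("IS-A", "tissue")],
    # A term may have more than one parent, which makes the structure a DAG
    # rather than a strict tree.
    "leaf epidermis": [("IS-A", "epidermis"), ("PART-OF", "leaf")],
}

def ancestors(term, ontology=ONTOLOGY):
    """Return all terms reachable from `term` by following any relationship upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in ontology[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

With such a structure, data annotated with 'leaf epidermis' can also be retrieved by a search on the more general terms 'leaf' or 'epidermis'.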
The Plant Ontology Consortium (http://www.plantontology.org/) curates the development of a number of
controlled vocabularies representing various plant-based knowledge domains. This consortium attempts to
represent the current biological understanding of plant tissues and developmental stages as DAGs of
controlled vocabularies. Each element in these biological ontology graphs represents a concrete biological
term such as 'leaf' or 'epidermis'. The consortium currently focuses on the integration of the taxon- and
species-specific ontologies for grasses, Zea mays, and Arabidopsis thaliana (http://www.gramene.org/,
http://www.maizegdb.org/, http://www.arabidopsis.org/) into common plant tissue and developmental stage
ontologies.
3 Database and interface design
3.1 Overview
To realize the goals outlined in section 1.2 the approach represented in figure 1 was followed. This approach
employs four basic steps that are discussed in more detail in the following sections. The first step is covered
in section 3.2 and describes the collection of the EST data from the EMBL, UniLib and Gramene databases.
The second step involves the development of the database structure (section 3.3) and the import of the EST
data into this database (section 3.4). Section 3.5 covers the third step of the approach, which is the
processing of the EST sequences into contig and SNP data. Finally, the fourth step involves the presentation
of the EST, contig, and SNP data from the database to the user. This step is described in section 3.6.
Figure 1: Overview of the methods used to create and test the plantEST database and its interface. The first step involves
the collection of EST sequence and annotation data from the EMBL, UniLib and Gramene databases. In the second step
Python and PostgreSQL were used to develop a data structure to store these data. Contig and polymorphism data were
derived from the EST data in the third step using the TGICL and VecScreen tools. The Python programming language was
used to create parsers that import these data into the plantEST database. Interfaces to access the data were developed in
the fourth step.
Figure 2: EMBL record for the EST with accession number BG443252. Each line in the record starts with a two-letter
code that indicates the data attribute, followed by that attribute’s value.
3.2 Data collection
The first step in the design of the plantEST database consists of the retrieval of EST sequence and
annotation data from the EMBL, UniLib, and Gramene databases. The data sets that were
used to create the plantEST database are described below.
3.2.1 EMBL EST data
A large number of plant EST sequencing projects have been undertaken recently. However, most of these
projects produce data that is either not publicly available or deposited in public sequence databases
such as the EMBL Nucleotide Sequence Database (Kulikova et al, 2004). Appendix 1 shows an overview of
public EST databases on the Internet. The EMBL database can be considered the most extensive resource for
plant EST data. The data from the EMBL database, the contents of which are mirrored at the NCBI and
DDBJ, are accessible as downloadable flat files from the EBI FTP server
(ftp://ftp.ebi.ac.uk/pub/databases/embl/release/).
The data from the plant EST division of the EMBL database release 76 (September 2003) were used as the
EST dataset for the plantEST database. These data consist of sequence and annotation data for 3,842,311
ESTs. The annotation data describes the biological source of the EST, such as the tissue type and
developmental stage of the plant from which it was obtained, the environmental stresses that were applied to
the plant, and the experimental conditions under which the EST was produced. An example of an EST
record from the EMBL database is shown in figure 2.
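The line-oriented layout of an EMBL record, in which a two-letter code is followed by the attribute value, lends itself to a simple parser. The Python sketch below is an illustration only and is not the parser used for the plantEST database. The sample record is made up and simplified, and the function name is hypothetical.

```python
from collections import defaultdict

def parse_embl_record(text):
    """Group the lines of one EMBL-style record by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        # Skip spacer lines (XX), the record terminator (//), and lines without
        # a code, such as the indented sequence data lines.
        if code in ("XX", "//") or not code.strip():
            continue
        fields[code].append(value)
    return dict(fields)

# Made-up, simplified record for illustration (not a real database entry).
SAMPLE = """\
ID   ZZ000001   standard; mRNA; EST; 12 BP.
XX
AC   ZZ000001;
XX
DE   hypothetical EST used for illustration only
XX
OS   Arabidopsis thaliana
FT   source          1..12
FT                   /tissue_type="leaf"
SQ   Sequence 12 BP;
     acgtacgtac gt                                                        12
//
"""

record = parse_embl_record(SAMPLE)
```

Attributes that span several lines, such as the feature table (FT), end up as a list of values under one code and would need further parsing before they can be stored in separate database columns.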
The EST data was copied directly from the plant EST division of the EMBL database, which on closer
inspection contains several non-plant species. Examples of these are the oomycetes Hyaloperonospora
parasitica, which causes the downy mildew disease in Arabidopsis thaliana, and Phytophthora infestans,
which causes the late blight disease in Solanum tuberosum (potato). Because these pathogen species can be
relevant in research on the affected plant species, the corresponding ESTs were also copied.
3.2.2 UniLib EST library data
The records from the EMBL database often contain several cross-references to other databases such as
GOA, MaizeDB, and SWISS-PROT, and almost every record contains a reference to the UniLib database.
UniLib, the Unified Library Database, is maintained at the NCBI and contains biological and experimental
annotation information of EST and SAGE libraries present in the dbEST, UniGene, and SAGEmap
databases (ftp://ftp.ncbi.nih.gov/repository/UniLib/library.report).
Figure 3: UniLib record for the EST library with UniLib number 1285. Each line in the record contains an
attribute:value pair that describes a property of the EST library.
The data from the 03-03-2004 release of the UniLib database were used to expand the annotation of the
ESTs from the EMBL database. The UniLib dataset from this date was used instead of the one available at
the time of the EMBL Release 76 dataset (September 2003), because the older UniLib data lacked entries for
approximately 100 EST libraries that were present in the EMBL dataset. Data for these EST libraries is
present in the 03-03-2004 release of the UniLib data.
The annotation of attributes such as tissue type, developmental stage and library treatment that are present
within UniLib can also be found in the EMBL database. However, the information from the UniLib database
is more detailed than the information in the EMBL database for a large number of EST libraries. In addition
to the attributes present in the EMBL database the UniLib database contains distinct attributes for cloning
vector and restriction sites. In the EMBL database these data are either absent or available only as part of another
attribute value. These data can be used, for example, to remove cloning vector sequence contaminations from
the ESTs. Figure 3 displays an example of an EST library record from the UniLib database.
3.2.3 Gramene tissue ontology data
The annotation of most biological and experimental attributes of the EMBL data such as tissue,
developmental stage, and treatment is written in human readable form and often consists of multiple words
or whole sentences. Because the data are submitted by many different researchers, different words and
descriptions are used to refer to a single concept. For example, ESTs from Arabidopsis thaliana that are
expressed in leaf tissue have tissue annotations such as 'leaf', 'leaves', 'sheath', or 'seedling green leaf'. This
variability of terms used to describe a single concept hampers the clustering and comparison of data on the
basis of biologically meaningful attributes.
The terms from the Gramene Cereal Plant Anatomy Ontology (ontology revision 2.0, definitions revision
2.1) were used in the plantEST database to describe the tissue type annotations from the EMBL data in a
more consistent way. Gramene (http://www.gramene.org) is a database for comparative genomics of grasses,
and uses a set of ontologies to describe the biological and biochemical properties of sequence data records
such as ESTs. The Cereal Plant Anatomy Ontology is used as a controlled vocabulary of terms that are
linked to tissue and cell type annotation data. This ontology covers a broad range of general plant tissues
such as leaf and root, as well as several taxon-specific terms for the species in the Gramene database.
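A minimal Python sketch of how free-text tissue annotations could be mapped to controlled vocabulary terms is given below. It assumes a simple keyword lookup. The synonym table is hypothetical, reuses the annotation examples given above, and is not taken from the Gramene ontology files.

```python
import re

# Hypothetical synonym table: controlled term -> keywords that indicate it.
SYNONYMS = {
    "leaf": {"leaf", "leaves", "sheath"},
    "root": {"root", "roots"},
}

def ontology_terms(annotation, synonyms=SYNONYMS):
    """Return the sorted controlled terms whose keywords occur in a free-text annotation."""
    words = set(re.findall(r"[a-z]+", annotation.lower()))
    return sorted(term for term, keywords in synonyms.items() if words & keywords)
```

For example, `ontology_terms("seedling green leaf")` and `ontology_terms("leaves")` both yield the single term 'leaf', so that ESTs with either annotation can be grouped together.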
3.3 Data storage
The relational database structure of the plantEST database was designed to store EST, contig, and
polymorphism data. The PostgreSQL relational database management system (http://www.postgresql.org/) was used to
create this database structure. This section discusses the database structure that was designed to store the
EST data from the EMBL, UniLib and Gramene databases.
3.3.1 EMBL EST data
The EST records from the EMBL Nucleotide Sequence Database contain sequence and annotation
information. Because this information is stored as attribute:value pairs, a relational database is a logical data
structure to store these data. As a first consideration, each attribute from the EMBL record can be used to
form a column in a single database table. However, several attributes can have more than one value in a
single EST record. Furthermore, multiple EST records can have an identical value for an attribute common
to each of those records. These types of attributes require the introduction of new tables in the relational
database. The relationships that exist between these attributes define the relationships between the tables.
The relationships that were found between the attributes of the EMBL EST records are listed below. The
black part of the database model from figure 4 was created from these relationships.
- one plant species can have many ESTs
- many ESTs together form one EST library from one plant species
- one EST can have many database cross-references
- one EST can have many literature references
- one literature reference can refer to many ESTs
- one literature reference can have many authors
- one author can contribute to many literature references
Closer inspection of the EMBL data revealed that the tissue_type attribute from the feature table of the
EMBL records can have identical or similar values for several EST records. This is also true for the
dev_stage attribute from the feature table. From the relational database perspective this would mean that
these attributes should be modeled into tables having a one-to-many relationship to the EST_LIB table.
Furthermore, EST libraries can be generated from more than one tissue, and during more than one
developmental stage. Based on these many-to-many relationships the tissue_type and dev_stage attributes
were modeled into separate tables, as displayed in red in figure 4.
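The one-to-many and many-to-many relationships described above can be sketched as table definitions. The Python example below uses the standard-library sqlite3 module as a stand-in for PostgreSQL so that it runs without a database server. The table names follow the database model, but all column names are assumptions made for this illustration and may differ from the actual plantEST definitions.

```python
import sqlite3

# Column names are hypothetical; only the table names come from the database model.
SCHEMA = """
CREATE TABLE species (species_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE est_lib (lib_id INTEGER PRIMARY KEY,
                      species_id INTEGER NOT NULL REFERENCES species(species_id),
                      name TEXT NOT NULL);
CREATE TABLE est (est_id INTEGER PRIMARY KEY,
                  lib_id INTEGER NOT NULL REFERENCES est_lib(lib_id),
                  accession TEXT UNIQUE NOT NULL,
                  sequence TEXT NOT NULL);
CREATE TABLE tissue (tissue_id INTEGER PRIMARY KEY, description TEXT UNIQUE NOT NULL);
-- Link table for the many-to-many relationship between EST libraries and tissues.
CREATE TABLE elb_tis (lib_id INTEGER NOT NULL REFERENCES est_lib(lib_id),
                      tissue_id INTEGER NOT NULL REFERENCES tissue(tissue_id),
                      PRIMARY KEY (lib_id, tissue_id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO species VALUES (1, 'Zea mays')")
conn.execute("INSERT INTO est_lib VALUES (1, 1, 'example library')")
conn.executemany("INSERT INTO tissue VALUES (?, ?)", [(1, "leaf"), (2, "root")])
conn.executemany("INSERT INTO elb_tis VALUES (?, ?)", [(1, 1), (1, 2)])

def tissues_of_library(conn, lib_id):
    """Return the tissue descriptions linked to one EST library."""
    rows = conn.execute(
        "SELECT t.description FROM tissue t "
        "JOIN elb_tis lt ON lt.tissue_id = t.tissue_id "
        "WHERE lt.lib_id = ? ORDER BY t.description", (lib_id,))
    return [row[0] for row in rows]
```

Because each tissue description is stored only once in the TISSUE table, an EST library generated from several tissues simply receives several rows in the ELB_TIS link table.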
3.3.2 UniLib EST library data
The data from the UniLib database consists of the annotation of biological attributes from the EST Libraries
present in the EST division of the EMBL database. The UniLib database contains attributes similar to those
found in the EMBL database. However, the vectors and restriction sites that are used in the molecular cloning of the
ESTs are stored in separate attributes in the UniLib database, whereas these data are part of the note attribute
in the EMBL database, when present. Because these data may be valuable for removing vector
contamination from the EST sequences, these attributes were introduced in the EST table of the plantEST
database.
3.3.3 Gramene tissue ontology data
The TIS_ONT table was introduced into the relational database structure as depicted in the green part of
figure 4 to store the tissue ontology terms that will be associated with the tissue type descriptions from the
EMBL files. The table is designed to store all ontology keywords from the Gramene Cereal Plant Anatomy
Ontology, as well as any available definitions and comments.
Each tissue type annotation from an EST can contain many tissue ontology keywords, and each ontology
term can be linked to many tissue descriptions. This would suggest a many-to-many relationship between
the TISSUE and the TIS_ONT tables. However, the TIS_ONT table is linked to the EST_LIB table rather
than to the TISSUE table because other attributes from an EST library can also contain tissue annotation
data. In this way, these annotations can also be used to link ontology keywords to EST libraries.
Figure 4: Database model for the plant EST database, designed to store EST information from the EMBL, UniLib, and
Gramene databases. The EST table in this model stores all data that is unique for an EST. Data that is common to all
ESTs in an EST library is stored in the EST_LIB table. The TISSUE and DEVSTAGE tables (marked in red) store the
tissue type and the developmental stage, respectively, from which the EST library was generated. The TIS_ONT table
(marked in green) stores tissue ontology keywords from the Gramene database. The SPECIES table contains the name
of the species to which the ESTs in an EST library belong. The DB_XREF table contains database cross-references, and
the LIT_REF and AUTHOR tables hold the literature references for each EST record. Tables ELB_TIS, ELB_DEV,
ELB_TIO, and LIT_ATR store the many-to-many relationships that exist between the two tables that are linked to each
of these tables. The legend in the lower right corner of this figure applies to figures 4 - 6.
3.3.4 Sequence mask data
As outlined in section 1.2, the functionality of the database will be tested by building contigs from part of
the EST dataset and detecting polymorphisms within these contigs. The following two sections describe how
the plantEST database structure was extended to allow storage of the data that is generated during contig
building and SNP detection.
Sequence masks, such as vector and repeat masks, are important when building contigs from EST data, as
they prevent false overlaps between EST sequences. Many sequence masks can be applied to one EST
sequence, as it can contain vector sequences on either end of the sequence, as well as one or more repeat
sequences. Sequence mask data is stored in the SQ_MASK table, as displayed in red in the database model
in figure 5.
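How several stored masks could be applied to one EST sequence is sketched below in Python. The coordinate convention (0-based start, exclusive end) and the mask character 'X' are assumptions made for this illustration. The actual layout of the SQ_MASK table is not reproduced here.

```python
def apply_masks(sequence, masks, mask_char="X"):
    """Replace every masked position in `sequence` by `mask_char`.

    `masks` is an iterable of (start, end) pairs, 0-based with exclusive end.
    Overlapping masks and masks that extend past the sequence are tolerated.
    """
    chars = list(sequence)
    for start, end in masks:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)

# Hypothetical EST with vector remnants on both ends.
masked = apply_masks("ACGTACGTACGT", [(0, 3), (9, 12)])
```

Keeping the masks separate from the sequence, rather than overwriting the stored sequence, leaves the original EST data intact and allows different combinations of masks to be applied later.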
3.3.5 Contig and SNP data
To incorporate the contig and SNP data into the plantEST database, the relationships between the EST data and the contig
and SNP data were examined. The initial relationships that were found are listed below, and the resulting
database structure is displayed in the green area of figure 5.
- one contig consists of many ESTs
- one contig can contain many SNPs
Figure 5: Extension of the plantEST database model to allow storage of sequence mask, contig and polymorphism data.
The SQ_MASK table (marked in red) stores the sequence mask data. Contig and polymorphism data is stored in tables
CONTIG and P_MORPH (both marked in green) respectively.
There are two drawbacks to the database structure from figure 5. The first drawback is that this structure
allows an EST to belong to only one contig. Using this database structure it is not possible to assign an EST
to multiple contigs, for example when a set of EST sequences has been processed multiple times or by
different contig building tools. The second drawback arises from the fact that SNPs are stored with only a
reference to the contig to which they belong. With this kind of assignment it is unclear on which sequence
within that contig the SNP is located. To resolve these problems, the final database structure from figure 6
was created based on the following relationships:
- one contig build consists of many contigs
- one contig consists of many ESTs
- one EST can belong to many contigs
- one EST-contig pair can have many polymorphisms
Figure 6: Final database model for the plantEST database. The relationships between the EST, CONTIG, and
P_MORPH tables have been redefined to represent the biological relationships more accurately, as highlighted in the
figure. The CNT_BLD table holds general information on a set of contigs that was produced during a single contig
building process and can be considered a 'project description' table. The CONTIG table contains the data that is specific
for each contig in a contig build, and table EST_CNT stores the many-to-many relationship between EST and
CONTIG. Polymorphism data such as SNPs are stored in the P_MORPH table.
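The final relationships can be illustrated with the following Python sketch, which again uses sqlite3 as a stand-in for PostgreSQL. The table names follow figure 6, whereas the column names and the example rows are assumptions made for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE est (est_id INTEGER PRIMARY KEY, accession TEXT UNIQUE NOT NULL);
CREATE TABLE cnt_bld (build_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE contig (contig_id INTEGER PRIMARY KEY,
                     build_id INTEGER NOT NULL REFERENCES cnt_bld(build_id),
                     consensus TEXT);
-- One row per EST-contig pair, so an EST can belong to many contigs.
CREATE TABLE est_cnt (est_cnt_id INTEGER PRIMARY KEY,
                      est_id INTEGER NOT NULL REFERENCES est(est_id),
                      contig_id INTEGER NOT NULL REFERENCES contig(contig_id),
                      UNIQUE (est_id, contig_id));
-- A polymorphism refers to an EST-contig pair, not just to a contig.
CREATE TABLE p_morph (p_morph_id INTEGER PRIMARY KEY,
                      est_cnt_id INTEGER NOT NULL REFERENCES est_cnt(est_cnt_id),
                      pos INTEGER NOT NULL,
                      base TEXT NOT NULL);
""")
conn.executemany("INSERT INTO est VALUES (?, ?)", [(1, "EST_A"), (2, "EST_B")])
conn.executemany("INSERT INTO cnt_bld VALUES (?, ?)",
                 [(1, "first build"), (2, "second build with another tool")])
conn.executemany("INSERT INTO contig VALUES (?, ?, ?)",
                 [(10, 1, "ACGT"), (20, 2, "ACGT")])
conn.executemany("INSERT INTO est_cnt VALUES (?, ?, ?)",
                 [(100, 1, 10), (101, 2, 10), (102, 1, 20)])
conn.execute("INSERT INTO p_morph VALUES (1, 100, 4, 'T')")

def contigs_of_est(conn, est_id):
    """Return the identifiers of all contigs that contain the given EST."""
    rows = conn.execute(
        "SELECT contig_id FROM est_cnt WHERE est_id = ? ORDER BY contig_id", (est_id,))
    return [row[0] for row in rows]

def polymorphisms_in_contig(conn, contig_id):
    """Return (accession, position, base) for each polymorphism in a contig."""
    rows = conn.execute(
        "SELECT e.accession, p.pos, p.base FROM p_morph p "
        "JOIN est_cnt ec ON ec.est_cnt_id = p.est_cnt_id "
        "JOIN est e ON e.est_id = ec.est_id "
        "WHERE ec.contig_id = ? ORDER BY p.pos", (contig_id,))
    return rows.fetchall()
```

In this sketch the same EST is a member of one contig in each of two contig builds, and the polymorphism can be traced back to the individual EST on which it was observed. These are the two limitations of the structure from figure 5 that the final model was designed to resolve.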
3.3.6 Indices
The plantEST database was designed to store sequence and annotation data on several million plant ESTs.
Although the data can be searched on every attribute, it is likely that some attributes are used more often as
search criteria than others. To the users of the database it is important that the most common types of
requests to the database are processed as fast as possible. To accelerate the most common types of queries,
indices were created on the attributes that are used most often as search criteria. The attributes that are
indexed, with the exception of the primary key attributes, are marked in green in the database overview in
appendix 2. The primary key and unique attributes of each table are automatically indexed by the
PostgreSQL database management system. These indices are used when searching a table for one specific
record on the sequence number or unique attribute value, respectively. Additional indices have been created
on all foreign key attributes to increase the search speed when two tables are joined. Finally, some specific
data attributes have been indexed to increase the performance of frequently used queries.
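The effect of an index on a foreign key attribute can be inspected through the query planner. The Python sketch below uses sqlite3 and its EXPLAIN QUERY PLAN statement as a stand-in; in PostgreSQL the EXPLAIN command serves the same purpose. The table layout and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE est (est_id INTEGER PRIMARY KEY, lib_id INTEGER, accession TEXT)")

def query_plan(conn):
    """Return the planner's description of a lookup of all ESTs in one library."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT accession FROM est WHERE lib_id = ?", (1,))
    # The human-readable description is the last column of each plan row.
    return " ".join(str(row[-1]) for row in rows)

plan_before = query_plan(conn)  # without an index the whole table is scanned
conn.execute("CREATE INDEX idx_est_lib_id ON est (lib_id)")
plan_after = query_plan(conn)   # the planner can now use idx_est_lib_id
```

Comparing `plan_before` and `plan_after` shows whether a query benefits from an index. Since every index must also be maintained when data is imported, indices are best limited to the attributes that are actually used as search criteria.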
3.3.7 Redundancy versus speed
Data retrieval from the database has been accelerated by storing some redundant data alongside
the indices discussed above. Although the focus of a relational database system is to reduce
redundancy, the increase of search speed outweighs the increased redundancy. The redundant data that has
been added to the database can be calculated or derived from the original data, as is explained by the three
examples outlined below. Attributes that store redundant data are marked in red in the database overview in
appendix 2.
The first example of redundant data in the database is the seq_count attribute in the EST_LIB table. This
attribute stores the number of ESTs that belong to each EST library, which is often retrieved from the
database. The PostgreSQL implementation of the SQL function that calculates the number of ESTs per
library is rather slow, and calculating these numbers on demand therefore requires a substantial amount of
time. This is significantly reduced by calculating the number of ESTs per library whenever the database is
updated and storing these numbers as an attribute value of the EST_LIB table.
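A minimal sketch of this approach is given below. It again uses sqlite3 as a stand-in for PostgreSQL; the function name refresh_seq_count and the reduced table layout are assumptions made for illustration and do not reproduce the actual import programs.

```python
import sqlite3

def refresh_seq_count(con):
    """Recompute the cached number of ESTs for every EST library."""
    con.execute("""
        UPDATE est_lib
        SET seq_count = (SELECT COUNT(*) FROM est
                         WHERE est.est_lib_no = est_lib.est_lib_no)
    """)
    con.commit()

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE est_lib (est_lib_no INTEGER PRIMARY KEY, seq_count INTEGER DEFAULT 0);
    CREATE TABLE est (est_no INTEGER PRIMARY KEY, est_lib_no INTEGER);
    INSERT INTO est_lib (est_lib_no) VALUES (1), (2);
    INSERT INTO est (est_lib_no) VALUES (1), (1), (2);
""")

# Run once after each import; later reads use the stored value directly.
refresh_seq_count(con)
counts = dict(con.execute("SELECT est_lib_no, seq_count FROM est_lib"))
print(counts)
```

Reading seq_count afterwards is a single-row lookup instead of a count over the EST table, at the price of having to call the refresh step after every update.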
Another example of a redundant attribute that is used to increase database speed is the checksum attribute of
the EST table. For each EST sequence that is stored in the plantEST database an MD5 checksum is
calculated, which is used to detect the presence of duplicate sequences in the database. The MD5 algorithm
is a cryptographic hash function that creates a 128-bit message digest from a message of arbitrary length.
The RFC 1321 document on the MD5 algorithm (http://www.freesoft.org/CIE/RFC/1321/) conjectures that 'it is
computationally infeasible to produce two messages having the same message digest, or to produce any
message having a given prespecified target message digest'. For practical purposes this means that any
nucleotide sequence can be reduced to an effectively unique 128-bit digest or checksum, written as 32
hexadecimal characters, which is significantly shorter than the EST nucleotide sequences. The checksums are
used in rapid detection of duplicate sequences in the database, as comparisons between 32-character strings
are faster than comparisons between strings of hundreds of characters in length. Duplicate sequences are not
filtered out during the parsing of the EMBL data, for two reasons. First, the EST records of duplicate
sequences may contain non-overlapping annotation data. Second, when the same sequence has been obtained
twice, it is less likely that it contains sequencing errors.
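The checksum-based duplicate detection can be sketched as follows, using the hashlib module from the Python standard library. The function names and the example accession numbers are hypothetical, and converting the sequence to upper case before hashing is an assumption; this report does not state whether the sequences were normalised.

```python
import hashlib
from collections import defaultdict

def sequence_checksum(sequence):
    """Return the 32-character hexadecimal MD5 digest of a nucleotide sequence."""
    # Assumption: case is normalised so that 'acgt' and 'ACGT' hash identically.
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()

def find_duplicates(ests):
    """Group accession numbers by checksum; return groups with more than one member."""
    by_checksum = defaultdict(list)
    for accession, sequence in ests:
        by_checksum[sequence_checksum(sequence)].append(accession)
    return [group for group in by_checksum.values() if len(group) > 1]

ests = [("EST1", "ACGTACGT"), ("EST2", "acgtacgt"), ("EST3", "ACGTTCGT")]
print(find_duplicates(ests))
```

Only the short digests are compared, so the full sequences never have to be compared with each other.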
The third example is the inclusion of the padded_sequence attribute in table EST_CNT. This attribute stores
a possibly modified copy of the original EST sequence. These modifications are created during contig
building to generate sequence alignments, and can consist of gap insertions, deletions of masked residues,
and reverse complementation of the original EST sequence. A more flexible design would store each
modification on the sequence separately, so that modifications can be altered, removed, or added to an EST
sequence. However, this would impose two limitations on the data retrieval speed. The first limitation would
be caused by the number of records that would have to be retrieved from the database. In the current
database model, one record is retrieved for each modified EST sequence. If each modification were
stored separately, one record would have to be retrieved from the database for each modification. The
second limitation involves the sequence orientation. For many ESTs, the reverse complement of the original
sequence is used in contig building. To store this property as an attribute, the concept of reverse
complementation should be present in the database structure. Furthermore, each of these sequences would
have to be reverse complemented whenever the corresponding contig information is requested.
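The work that is avoided by storing the padded sequence can be illustrated with a short sketch. The function names, the use of '*' as the gap character, and the representation of gaps as a list of positions are assumptions made for illustration only.

```python
# Translation table for complementing nucleotides; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(COMPLEMENT)[::-1]

def padded_copy(sequence, reverse=False, gaps=(), gap_char="*"):
    """Rebuild a padded sequence from the original EST sequence.

    gaps holds the 0-based positions of the gap characters in the final
    padded string; they are inserted in ascending order.
    """
    bases = list(reverse_complement(sequence) if reverse else sequence)
    for position in sorted(gaps):
        bases.insert(position, gap_char)
    return "".join(bases)

print(padded_copy("AACG", reverse=True, gaps=[2]))
```

With the stored padded_sequence attribute, this reconstruction does not have to be repeated each time contig information is requested.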
The drawback of adding these redundant data becomes apparent when the updating of the database is
considered. For example, when new sequences are added to an existing EST library, the seq_count attribute
value should be updated to reflect the new number of ESTs in that library. The programs (discussed in
section 3.4) that import the data into the database also update these redundant attributes.
3.4
Data import
After the database structure described in section 3.3 was created, the EST and ontology data were inserted
into the plantEST database. The parser programs that are discussed below were written to automate the data
import. The programs were written in the Python programming language (http://www.python.org/), and the
interfaces that accompany the parser programs were created using the Tkinter library, and are discussed in
section 3.6.
A general outline of the steps performed to enter the data into the database is shown in figure 7. The ‘manual
processing’ label in this figure refers to the changes introduced in the species table. The species_name
values from the SPECIES table were updated to a more uniform format that consists of the full scientific
species name without variety or subspecies annotations. Annotations of species subdivisions such as cultivar
or variety were stored in the appropriate attribute of the EST_LIB table. The SQL commands that introduced
these changes in the SPECIES and EST_LIB tables are available in appendix 3.
Figure 7: Overview of data import into the plantEST database. First, the EmblParser program parsed the
contents of the EMBL flat files to the plantEST database. Subsequently the UniLib data was processed by the
UniLibParser script. After this step some manual processing of species data was performed. Finally, the
ontology data were imported using the TissueOntologyParser script. The data flow is represented by the
horizontally oriented arrows going from the EMBL, UniLib and Gramene databases to the plantEST database.
The time flow is represented by the gray, vertically oriented arrows.
3.4.1
EMBL EST data
The EmblParser module parses data from EMBL flat files to the plantEST database. Each flat file contains a
maximum of 100,000 EST records ordered by accession number, and each record in the file consists of a
number of formatted lines. A detailed description of the line types in an EMBL flat file can be found in the
EMBL Nucleotide Sequence Database User Manual
(http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html).
The EmblParser module reads the header of each line in the EMBL flat file and parses that line accordingly.
All information that is parsed from one EST record is temporarily stored, and at the end of an entry that
entry is written to the database tables. During the writing a redundancy check is performed to determine
whether any data from the currently parsed record is already present in the database. These checks reduce
the time required for the parsing by reducing the number of write transactions to the database. The idea
behind this strategy is as follows: the EST records in the flat file are ordered by accession number, and most
EST libraries consist of one or more series of incrementally assigned accession numbers. This means that
two consecutive accession numbers usually belong to the same EST library, so the library data of a record
has usually been written already with a preceding record and does not have to be written again.
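The parsing strategy can be sketched as follows. This is not the EmblParser code: the records are strongly simplified, the function names are hypothetical, and remembering only the library of the previous record is an assumed, minimal form of the redundancy check. The sketch relies on two properties of the EMBL flat file format: every line starts with a two-letter line code, and every record ends with a '//' line.

```python
def parse_embl_records(lines):
    """Yield one dict per record, mapping each two-letter line code to its data lines."""
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):            # end-of-record marker
            if record:
                yield record
            record = {}
            continue
        code, data = line[:2], line[5:]      # the data starts at column 6
        record.setdefault(code, []).append(data)

def library_of(record):
    """Extract the clone_lib qualifier from the feature table lines, if present."""
    prefix = '/clone_lib="'
    for data in record.get("FT", []):
        data = data.strip()
        if data.startswith(prefix):
            return data[len(prefix):].rstrip('"')
    return None

def count_library_writes(records, get_library):
    """Count the library writes needed when the previous library is remembered."""
    writes, last_library = 0, None
    for record in records:
        library = get_library(record)
        if library != last_library:          # check against the previous record
            writes += 1
            last_library = library
    return writes

sample = [
    "ID   EST1",
    'FT                   /clone_lib="libA"',
    "//",
    "ID   EST2",
    'FT                   /clone_lib="libA"',
    "//",
    "ID   EST3",
    'FT                   /clone_lib="libB"',
    "//",
]
records = list(parse_embl_records(sample))
print(len(records), count_library_writes(records, library_of))
```

In this example three records lead to only two library writes, because the first two records belong to the same library.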
3.4.2
UniLib EST library data
The UniLib flat file is parsed by the UniLibParser, a 'quick-and-dirty' parser script. The UniLibParser
retrieves a list of UniLib numbers from the EST_LIB table of the plantEST database. For each UniLib
number in the list, the parser searches the UniLib file for information on that EST library. The data from the
UniLib record is compared to the data from the plantEST database, and the database is updated where
appropriate.
3.4.3
Gramene tissue ontology data
The ontology data consists of two flat files, an ontology file and a definitions file. The ontology file contains
all tissue ontology terms and the relationships between these terms. The definitions file contains definitions
and comments for part of the ontology terms present in the ontology file. Data from both files were imported
into the TIS_ONT table of the plantEST database through the TissueOntologyParser program.
After the ontology terms were imported into the database, the terms were manually assigned to EST libraries
by examination of the tissue_type values that were associated with each EST library. A file was created that
contains the tissue_no values of each record from the TISSUE table and the matching tis_ont_no values
from the TIS_ONT table. In this way, each tissue_type value from the TISSUE table (referenced through the
corresponding tissue_no) was matched to any number of tissue_ontology values from the TIS_ONT table
(referenced through the corresponding tis_ont_no). The data from this file were parsed to the database by the
TissueOntologyConverter program.
The 690 distinct tissue_type annotations from the TISSUE table were linked to 91 distinct tissue ontology
terms by searching each tissue_type value for words that appeared in the set of tissue ontology keywords.
Tissue descriptions in the tissue_type attribute that had no apparent match to ontology keywords were
searched for synonyms of these keywords using Google (http://google.com).
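The keyword search was performed by hand in this project. A minimal sketch of how such a search could be automated is given below; the function name and the rule that every word of an ontology term must occur in the tissue description are assumptions, and synonyms are not handled.

```python
import re

def match_ontology_terms(tissue_type, ontology_terms):
    """Return the ontology terms whose words all occur in the tissue_type text."""
    words = set(re.findall(r"[a-z]+", tissue_type.lower()))
    return sorted(
        term for term in ontology_terms
        if set(term.lower().split()) <= words
    )

terms = ["leaf", "root", "leaf sheath", "flower"]
print(match_ontology_terms("Young leaf and root tissue", terms))
```

Descriptions without any match would still require manual inspection, as was done for the synonyms described above.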
3.5
Data processing
3.5.1
Overview
To determine the functionality of the plantEST database as a resource for EST data, and to extend the
functionality of the database by integrating contig and polymorphism data, all 188,788 A. thaliana EST
sequences from the database were assembled into contigs. The sequences were processed as outlined in
figure 8, and parser modules were written to import the results from these steps into the plantEST database.
The sequences from A. thaliana were chosen as a test dataset for two reasons. First, this plant species is in
use as a model organism for plant molecular biology, and its complete genome sequence is publicly
available. The large amount of available data on this species allows for validation of the contigs and
polymorphisms. Second, the size of the dataset compared to the genome size of this organism suggests the
presence of a sufficiently large amount of redundancy in the dataset, which is required to find overlaps
between sequences during contig building.
Figure 8: Overview of contig and SNP generation from the A. thaliana EST sequences. First, the EST sequences were
extracted from the database and screened for vector contamination. Next, contigs were built from the vector-masked A.
thaliana EST sequences. Finally, putative SNPs were detected from these contigs. The data that was generated in each
step was stored in the plantEST database. The black arrows from and to the plantEST database indicate the data flow,
whereas the vertically oriented gray arrows represent the time flow.
3.5.2
Vector contamination screening
Prior to contig building the A. thaliana EST dataset was screened for vector contamination using a BLASTN
sequence identity comparison against the NCBI’s UniVec vector database
(http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html). The UniVec database is a non-redundant set of
vector sequences that are commonly used in molecular cloning experiments. For part of the ESTs from this
dataset a description of the cloning vector was available in the plantEST database, which could be used to
divide the EST dataset in smaller subsets and compare each of those subsets only to the corresponding
vector sequences. However, due to the number of different cloning vectors used in the A. thaliana dataset,
each EST sequence in the A. thaliana dataset was compared to all the vector sequences in the UniVec
database.
The vector masking was performed on a Paracel computer using the BLASTN 1.4.9-Paracel algorithm
(Paracel BLAST User Manual, version 1.3). The parameters (-q -5 -G 3 -E 3 -F "m D" -e 700 -Y 1.75e12)
that were used for the BLASTN algorithm were copied from the NCBI's VecScreen application
(http://www.ncbi.nlm.nih.gov/VecScreen/). These parameters are optimized to find short nearly-exact
matches while tolerating single-base deletion and insertion errors that are often generated in EST
sequencing. An example of a vector contamination BLASTN result is shown in figure 9.
The Paracel BLASTN software writes the output to a flat file. The ParacelParser module was written to
parse these data to the database. When an alignment between an EST sequence and a vector sequence was
encountered by the parser, a sequence mask was created if the alignment was located within a user-specified
distance from the 5' or 3' terminus of the EST sequence, as explained in figure 10. Subsequently, the
plantEST database was queried for existing sequence masks for that EST sequence. If the new mask
overlapped with an existing mask, that mask was extended; otherwise the new mask was stored in the database.
3.5.3
Contig building
The vector-masked A. thaliana EST sequences were assembled into contigs with the TGICL software
(Pertea et al, 2003) using the default parameters. This software from The Institute for Genomic Research
(TIGR) contains a modified version of the megablast clustering algorithm (Zhang et al, 2000), three custom
clustering algorithms, and the CAP3 cluster assembly software (Huang and Madan, 1999). A command-line
Perl script is wrapped around these programs to automate the contig building procedure.
The contigs were built in two successive steps using the TGICL software. The clustering step consisted of
all-versus-all pairwise similarity searches using the modified megablast program, which divided the large
EST dataset into many smaller sets of homologous sequences, or clusters. In the assembly step, each of these
clusters was submitted to the CAP3 assembly software that built one or more contigs from such a cluster.
The CAP3 software does not require electropherogram trace data for the assembly of the contigs. This is
important, as the trace data for the EST sequences are not available in the plantEST database and are
difficult to retrieve from other resources.
Clustering the sequences prior to the contig assembly significantly reduces the amount of time required for
this process when compared to direct contig assembly from the complete EST dataset. The sequence
clustering method that is employed in the TGICL software detects similarity between sequences on the
nucleotide level and creates clusters accordingly. With this approach it is possible that two EST sequences
that are derived from distinct, but highly homologous, genes or gene fragments are clustered together.
However, this small loss of sensitivity is outweighed by the increase in speed that is gained from this
clustering step. Furthermore, the exact clustering strategy is not of critical importance to this research as the
goal of generating these contigs lies in testing the plantEST database.
The CAP3 software writes the alignment information for the sequences in each contig to an ACE flat file.
The Cap3AceParser module was written to parse the contents of the output file to the plantEST database.
For each contig in the output file, the consensus sequence, the possibly modified instances of the EST
sequences, and the alignment of the EST sequences to the consensus sequence were stored in the database.
Modifications to the EST sequence consist of introduction of gaps, deletion of masked residues, or reverse
complementation of the original sequence.
3.5.4
SNP detection
After contig building was completed, putative SNPs were detected within the contigs. A putative SNP is
defined as a difference between a nucleotide on an EST sequence and the corresponding nucleotide on the
consensus sequence of the contig to which that EST belongs.
Because the overall quality of EST sequences is low, these sequences can contain errors such as incorrectly
assigned residues and insertions or deletions of one or more bases. Several statistical rules can be applied to
estimate the chance that a difference in nucleotide sequence between an EST and the corresponding
consensus sequence is a true SNP or a sequencing error. A small number of software applications has been
developed to use a set of statistical rules to enrich the pool of putative SNPs with true SNPs. However, the
issue of separating true SNPs from sequencing errors was not considered in this project due to time
limitations.
Putative SNP discovery was performed by the SNPSearch module, and the results were stored in the
plantEST database. The SNPSearch module compares the consensus sequence of each contig to all the EST
sequences in that contig. Whenever a difference is found between the consensus sequence and the EST
sequence, the location and type of the aberrant nucleotide on the EST sequence is stored in the database.
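The comparison can be sketched as follows. This is not the SNPSearch code: the function name is hypothetical, the padded EST sequence is assumed to start at a known, non-negative offset on the consensus, and positions at which either sequence contains a gap character (assumed to be '*') are skipped.

```python
def find_putative_snps(consensus, padded_est, offset=0, pad="*"):
    """List the mismatches between a padded EST and the contig consensus.

    offset is the 0-based consensus position at which the padded EST starts.
    Returns (est_position, consensus_base, est_base) tuples, where
    est_position is the 0-based position on the padded EST sequence.
    """
    snps = []
    for est_position, est_base in enumerate(padded_est):
        cons_position = offset + est_position
        if cons_position >= len(consensus):
            break                                   # EST runs past the consensus
        cons_base = consensus[cons_position]
        if pad in (est_base, cons_base):
            continue                                # gaps are skipped in this sketch
        if est_base.upper() != cons_base.upper():
            snps.append((est_position, cons_base, est_base))
    return snps

print(find_putative_snps("ACGTACGTAC", "GTTC*T", offset=2))
```

In this example the EST differs from the consensus at its third base, where the consensus has an A and the EST a T.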
Figure 9: Paracel BLASTN report for EST with accession number AA585898 against the UniVec database. The EST
sequence shows homology to the pNS1 vector at one end of the sequence (bases 219 – 234).
Figure 10: Graphical overview of vector mask validation and extension. The green bar represents an EST sequence, and
the red vertical lines indicate the position of the left and right limit attributes on the EST sequence. The blue bars
(indicated A, B and C) directly below the EST sequence represent regions of the EST that show homology to a vector
sequence. Homologous regions A and C are considered vector masks as they are located between the end of the
sequence and the left and right limit, respectively. Homologous region B is not considered a vector mask because
it is not located within one of these regions. Regions A and C are extended towards the nearest end of the EST
sequence (shaded blue bars on the lower row of the figure) as they fall within the limit regions.
3.5.5
Results
The focus of generating the contig and SNP data discussed above was on testing the plantEST database for
its use in EST processing, rather than generating a set of highly accurate contigs and SNPs. Therefore, no
effort was placed in optimization of the parameters of the contig building programs. The contigs and SNPs
that were generated in this experiment were not investigated in detail; however some of the results from the
contig building are discussed below.
A total of 2,830 vector masks were generated with the left and right limit parameters (as explained in figure
10) both set to 20. 16 EST sequences were masked on both ends of the sequence, and 28 sequences were
completely masked. This last group consists of pure vector sequences that do not contain any plant DNA.
Rather than only comparing each EST sequence to the sequence of the vector in which it was cloned, the
complete A. thaliana EST dataset was compared to the UniVec database. This introduces a possibility of
masking part of an EST that shows homology to the sequence of a different vector than the one it was cloned
in.
The 188,772 remaining A. thaliana EST sequences were assembled into 19,138 contigs. 199,783 putative
SNPs were detected from these contigs. Due to the high error rate in EST sequencing this number does not
necessarily reflect the number of true sequence polymorphisms. The overall fraction of true polymorphisms
in this set could be increased by two methods. The first method involves the use of electropherogram trace data
during contig building. These data assign a quality score to each base in an EST sequence, and can be used
to discard putative SNPs that have a low quality score. The second method is based on the frequency of a
polymorphism’s occurrence in a contig. This method is employed by the statistical SNP detection software
mentioned in section 2.1.
3.6
Data presentation
The data that is present in the plantEST database can be retrieved in two ways. The first way involves the
use of SQL commands to access the data via the psql interface of the PostgreSQL relational database
management system. This text-only interface is included in the PostgreSQL software and displays database
information as tables. An advantage of this interface is the high flexibility, as the user can retrieve any kind
of information from the database as long as the SQL query used to retrieve the data is syntactically correct.
However, knowledge of SQL is required to use this interface to access data from the plantEST database. A
second disadvantage is that the table format is not convenient to display complex information such as
multiple sequence alignments. As an alternative, the plantEST Explorer modules were developed to access
part of the data in the plantEST database via graphical menus.
The modules that were designed to visualize data from the plantEST database can be accessed through the
buttons from the plantEST Explorer main menu shown in figure 11. The three buttons on the left side of the
plantEST Explorer provide access to data visualization modules. The buttons in the middle allow the user to
import EST, vector mask, and contig data, and search the contigs in the database for SNPs. The right part of
the interface contains buttons to access the help module, read the manual (this report) and exit the program.
Figure 11: Main interface of the plantEST Explorer program. Each button opens an interface to the corresponding
module. The menu bar (top) provides shortcuts to several extra functions.
The Quick Help interface displayed in figure 12 can be accessed through the Help button of any plantEST
module and informs the user about the usage of the plantEST Explorer interfaces. Each of the other
interfaces is briefly discussed below. For a description of the usage of each interface, start the plantEST
Explorer program and consult the Quick Help module.
The EstSearch module searches the database for detailed information on an EST, based on a user-supplied
accession number and detail options. The current implementation distinguishes between four levels of detail:
EST, EST library, literature references, and contig information. The module retrieves data on the submitted
accession number from the database based on the selected detail options, formats the data into human
readable text and displays the result on the screen. The output can be saved to a file. Figure 13 displays the
interface to the EstSearch module.
The LibrarySearch module was designed to display information on the EST libraries in the database. The
interface shown in figure 14 allows the user to select a set of EST libraries on the basis of identical
characteristics such as tissue type and developmental stage, and display detailed information on the
properties of these libraries. Additionally, the nucleotide sequences from the ESTs in the selected libraries
can be filtered on properties such as sequence length and quality, and the final selection can be written to a
FASTA file.
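The filtering and export step can be sketched as follows. This is not the LibrarySearch code; the function name, the minimum length filter and the line width of 60 characters are assumptions made for illustration.

```python
def to_fasta(ests, min_length=0, line_width=60):
    """Format (accession, sequence) pairs as FASTA text, skipping short sequences."""
    lines = []
    for accession, sequence in ests:
        if len(sequence) < min_length:
            continue                                # filtered out on sequence length
        lines.append(">" + accession)
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    return "".join(line + "\n" for line in lines)

selection = [("EST1", "ACGTAC"), ("EST2", "AC")]
print(to_fasta(selection, min_length=3, line_width=4), end="")
```

The returned text can be written to a file with the standard file functions of Python.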
Figure 12: Quick Help interface of the plantEST Explorer.
Figure 13: EstSearch interface of the plantEST Explorer.
The interface to the ContigSearch module (figure 15) presents contig and polymorphism data to the user. It
allows the user to search the database for contigs matching criteria such as consensus sequence length,
number of sequences in the contig, or presence of a specific EST. The contigs that match the criteria are
displayed as a multiple alignment of EST sequences, and putative SNPs are highlighted in red.
Several dialog interfaces were created to accompany the parser programs that import data into the database.
The programs behind these interfaces were discussed in section 3.4. These dialog interfaces mainly consist
of entry fields to enter the required data. Figure 16 shows the dialog interfaces for the EmblParser,
ParacelParser and Cap3AceParser programs which import EST, vector mask, and contig data, respectively.
Figure 14: LibrarySearch interface of the plantEST Explorer.
Figure 15: ContigSearch interface of the plantEST Explorer.
Figure 16: From top to bottom, from left to right: EmblParser, ParacelParser, and Cap3AceParser dialog
interfaces of the plantEST Explorer.
4
Conclusion and future work
4.1
Conclusion
The plantEST database was created as a tool for fast and flexible access to EST data. It contains EST
sequence and annotation data from the EMBL and UniLib databases. These data have been refined through
the addition of tissue ontology terms from the Gramene database. The plantEST Explorer program modules
were developed to provide easy access to part of the data in the plantEST database through user interfaces,
and to import data from various flat file formats into the plantEST database. The functionality of the
database in EST processing was demonstrated by building contigs from part of the EST data, and detecting
SNPs in these contigs. The data structure of the plantEST database was modified to store these data,
demonstrating the flexibility of the database with respect to adding new data types.
4.2
Future work
Several features were considered during the development of the plantEST database that were not included in
its final design due to time limitations. This section discusses the idea behind several of those features, and
how they can be incorporated in the plantEST database structure.
The first example of data that was considered valuable for the plantEST database, but was not included
therein, is public EST data from resources other than the EMBL database. While the EMBL database
contains the largest publicly available collection of plant EST data, some of the databases mentioned in
appendix 1 contain several thousands of EST sequences that are not available in the EMBL database. To
include these data in the plantEST database, parser modules should be written that extract the EST data from
the files in which this data is distributed, without creating duplicate records.
The second example of data to include in the plantEST database is additional ontology data. The annotation
of the EMBL and UniLib data is extremely heterogeneous, limiting the possibilities to make biologically
meaningful comparisons between the data. The Gramene Cereal Plant Anatomy Ontology that was
introduced into the plantEST database has improved the annotation of the EST data; however, additional
ontologies are available to further refine this annotation. An example of this is the Gramene Cereal Growth
Stages Ontology, which can be used to annotate the developmental stages of the plants from which the ESTs
were generated. These ontology terms should be linked to the EST libraries either through manual
annotation or by a parser that contains a dictionary of ontology terms and synonyms. Additionally, the DAG
structure of these ontologies can be included in or linked to the plantEST database. This can be used in the
retrieval of EST libraries from the database on the basis of ontology terms by automatically including ‘child’
ontology terms when selecting a ‘parent’ term. For example, when all EST libraries that contain the ‘leaf’
ontology term were to be retrieved from the database, all EST libraries that contain ontology terms that are
‘children’ of this term (such as ‘leaf axis’ and ‘leaf sheath’) would be retrieved as well.
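The proposed retrieval on the basis of the DAG structure can be sketched as follows. The function names, the dictionary representation of the parent-to-child relationships and the example library names are hypothetical; only the 'leaf', 'leaf axis' and 'leaf sheath' terms are taken from the example above.

```python
def with_descendants(term, children):
    """Return the term plus every term reachable from it via parent-to-child links."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current in found:
            continue                    # a DAG node can be reached along several paths
        found.add(current)
        stack.extend(children.get(current, ()))
    return found

def libraries_for(term, children, library_terms):
    """Return the libraries annotated with the term or any of its descendants."""
    selected = with_descendants(term, children)
    return sorted(
        library for library, terms in library_terms.items()
        if selected & set(terms)
    )

children = {"leaf": ["leaf axis", "leaf sheath"]}
library_terms = {"lib1": ["leaf sheath"], "lib2": ["root"], "lib3": ["leaf"]}
print(libraries_for("leaf", children, library_terms))
```

A search for the 'leaf' term then also returns the library that is annotated with 'leaf sheath' only.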
A third example is the electropherogram trace data that is generated when ESTs are sequenced. These trace
data indicate the quality of each base in the nucleotide sequence and can improve the sensitivity of the
contig building process. Some of these data can be retrieved from the NCBI Trace Archive
(http://www.ncbi.nlm.nih.gov/Traces/), while other data are available at the sequencing centers where the
EST sequences were generated. An important reason for not including these data is the volume that
they encompass, as every base in each EST sequence has a trace value.
Literature references
Batley J., Barker G., O'Sullivan H., Edwards K.J., Edwards D. (2003) Mining for Single Nucleotide
Polymorphisms and Insertions/Deletions in Maize Expressed Sequence Tag Data. Plant Physiol. 132(1): 84-91
Huang X., Madan A. (1999) CAP3: a DNA sequence assembly program. Genome Res. 9: 868-77
Kota R., Rudd S., Facius A., Kolesov G., Thiel T., Zhang H., Stein N., Mayer K., Graner A. (2003)
Snipping polymorphisms from large EST collections in barley (Hordeum vulgare L.). Mol Genet Genomics
270(1): 24-33
Kulikova T., Aldebert P., Althorpe N., Baker W., Bates K., Browne P., van den Broek A., Cochrane
G., Duggan K., Eberhardt R., Faruque N., Garcia-Pastor M., Harte N., Kanz C., Leinonen R., Lin Q.,
Lombard V., Lopez R., Mancuso R., McHale M., Nardone F., Silventoinen V., Stoehr P., Stoesser G.,
Tuli M.A., Tzouvara K., Vaughan R., Wu D., Zhu W., Apweiler R. (2004) The EMBL Nucleotide
Sequence Database. Nucleic Acids Res 32
Pertea G., Huang X., Liang F., Antonescu V., Sultana R., Karamycheva S., Lee Y., White J., Cheung
F., Parvizi B., Tsai J., Quackenbush J. (2003) TIGR Gene Indices clustering tools (TGICL): a software
system for fast clustering of large EST datasets. Bioinformatics 19(5): 651-2
Picoult-Newberg L., Ideker T.E., Pohl M.G., Taylor S.L., Donaldson M.A., Nickerson D.A.,
Boyce-Jacino M. (1999) Mining SNPs From EST Databases. Genome Res. 9(2): 167-74
Rudd S. (2003) Expressed Sequence Tags: alternative or complement to whole genome sequencing? Trends
Plant Sci. 8(7): 321-29
Torjek O., Berger D., Meyer R.C., Mussig C., Schmid K.J., Rosleff Sorensen T., Weisshaar B.,
Mitchell-Olds T., Altmann T. (2003) Establishment of a high-efficiency SNP-based framework marker set
for Arabidopsis. Plant J. 36(1): 122-40
Vincent P.L.D., Coe Jr E.H., Polacco M.L. (2003) Zea Mays ontology – a database of international terms.
Trends Plant Sci. 8(11): 517-20.
Ware D.H., Jaiswal P., Ni J., Yap I.V., Pan X., Clark K.Y., Teytelman L., Schmidt S.C., Zhao W.,
Chang K., Cartinhour S., Stein L.D., McCouch S.R. (2002) Gramene, a Tool for Grass Genomics. Plant
Physiol. 130(4): 1606-13
Zhang Z., Schwartz S., Wagner L., Miller W. (2000) A greedy algorithm for aligning DNA sequences. J.
Comput. Biol. 7: 203-14
Acknowledgements
The author would like to thank the following persons for the many ideas and contributions to this work, and
for the time they took answering my many questions:
Centre for Molecular and Biomolecular Informatics, Nijmegen
- Gert Vriend: supervision
- Maarten Hekkelman: databases, support of open-source software ;)
- Jos Boekhorst: Python programming language
- Sander Nabuurs: Python programming language
- Marc van Driel: data mining, Unix tutorial and help
KeyGene, Wageningen
- Antoine Jansen: general database ideas
- Pieter Vos: general database ideas
Plant Research International, Wageningen
- Paulien Adamse: ontologies
- Sander van der Krol: contigs
- Sander Peters: general database ideas
Wageningen University
- Peter Schaap: thesis coordinator
APPENDICES
Appendix 1: Overview of plant EST databases on the internet
NCBI dbEST (http://www.ncbi.nlm.nih.gov/dbEST)
EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/)
DDBJ (http://www.ddbj.nig.ac.jp/)
    General public databases of EST sequences
MIPS Sputnik (http://mips.gsf.de/proj/sputnik/)
    ‘A comprehensive resource for the functional annotation of clustered plant ESTs’
PlantGDB (http://www.plantgdb.org/)
    Plant ESTs and GSSs
GPiDB (http://genoplante-info.infobiogen.fr/Databases/GPDB/)
    ESTs and contigs from several plant species
International Triticeae EST Cooperative (http://wheat.pw.usda.gov/genome/)
    ESTs of Triticeae species
Solanaceae Genomics Network (http://www.sgn.cornell.edu/index.html)
    ESTs of Solanaceae species
GrainGenes wEST (http://wheat.pw.usda.gov/wEST/)
    Wheat EST, contig, and mapping information
KOMUGI (http://www.shigen.nig.ac.jp/wheat/komugi/top/top.jsp)
    Wheat nucleotide sequences and information
MaizeGDB (http://www.maizegdb.org/)
    Maize nucleotide sequences and information
CUGI (http://www.genome.clemson.edu/)
    ESTs of cotton, barley, peach, and almond
AGI (http://www.genome.arizona.edu/)
    ESTs of rice, barley, and cotton
ESTarray (http://www.estarray.org/)
    Rice and rice blast fungus ESTs
ESTDB (http://estdb.biology.ucla.edu/)
    ESTs from Bean (Phaseolus coccineus) and Petunia
Kazusa (http://www.kazusa.or.jp/)
    ESTs from Chlamydomonas reinhardtii, Lotus japonicus, Arabidopsis thaliana, Porphyra yezoensis
Appendix 2: Entities in the plantEST database
LEGEND
TABLE NAME
primary key
foreign key
indexed attribute
redundant attribute
Description of this attribute
SPECIES
species_no
species_name
Name of the species
EST_LIB
est_lib_no
species_no
unilib_no: Reference to the UniLib database
seq_count: Number of sequences in this EST library
avg_seq_length: Average length of all sequences in this EST library
clone_lib: Name of the EST library
sub_species: Sub-species from which the ESTs were extracted
strain: Strain from which the ESTs were extracted
variety: Variety from which the ESTs were extracted
cultivar: Cultivar from which the ESTs were extracted
cell_type: Cell type from which the ESTs were extracted
cell_line: Cell line from which the ESTs were extracted
sex: Sex of the organism from which the EST library was created
lab_host: Organism in which the EST sequences were cloned
vector: Vector in which the EST sequences were cloned
r_site1: Restriction site used to digest the vector and insert the EST
r_site2: Restriction site used to digest the vector and insert the EST
treatment: Treatment of the plants from which the ESTs were extracted
note: Any other comments
DEV_STAGE
devstage_no
dev_stage: Developmental stage of the plants
ELB_DEV
est_lib_no
devstage_no
TISSUE
tissue_no
tissue_type: Tissue from which the ESTs were extracted
ELB_TIS
est_lib_no
tissue_no
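The ELB_* entities carry only two key attributes each, which suggests they act as linking tables between EST_LIB and the entity they are named after. The following is a minimal sketch of that reading, using Python's built-in sqlite3 module rather than the database system used for plantEST. The column types, the composite primary key, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# In-memory database, only for illustrating the assumed table relationships.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EST_LIB (est_lib_no INTEGER PRIMARY KEY, clone_lib TEXT);
CREATE TABLE TISSUE  (tissue_no  INTEGER PRIMARY KEY, tissue_type TEXT);
-- Assumed: ELB_TIS links libraries to tissues (many-to-many).
CREATE TABLE ELB_TIS (
    est_lib_no INTEGER REFERENCES EST_LIB(est_lib_no),
    tissue_no  INTEGER REFERENCES TISSUE(tissue_no),
    PRIMARY KEY (est_lib_no, tissue_no)
);
-- Made-up sample rows, not taken from the plantEST database.
INSERT INTO EST_LIB VALUES (1, 'example_lib_A'), (2, 'example_lib_B');
INSERT INTO TISSUE  VALUES (1, 'leaf'), (2, 'root');
INSERT INTO ELB_TIS VALUES (1, 1), (1, 2), (2, 2);
""")

# List every library together with the tissues it is linked to.
rows = conn.execute("""
    SELECT l.clone_lib, t.tissue_type
    FROM EST_LIB l
    JOIN ELB_TIS lt ON lt.est_lib_no = l.est_lib_no
    JOIN TISSUE  t  ON t.tissue_no   = lt.tissue_no
    ORDER BY l.est_lib_no, t.tissue_no
""").fetchall()
```

Under the same assumption, ELB_DEV and ELB_TIO would be joined in the same way to DEV_STAGE and TIS_ONT.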
TIS_ONT
tis_ont_no
goid: Gramene Ontology Identifier number
tissue_ontology: Tissue ontology term
tissue_definition: Definition of the tissue ontology term
ELB_TIO
est_lib_no
tis_ont_no
DB_XREF
db_xref_no
est_no
database_name: Name of the database to which the cross-reference is made
primary_ID: Primary identifier of the cross-referenced EST in this database
secondary_ID: Secondary identifier
SQ_MASK
sq_mask_no
est_no
mask_start: First nucleotide of the sequence mask
mask_end: Last nucleotide of the sequence mask
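As an illustration of how the mask_start and mask_end attributes could be applied to an EST sequence, the sketch below replaces each masked region with a placeholder character. It assumes that both coordinates are 1-based and inclusive and that 'X' is the masking character. Neither is stated in this appendix, and the function name apply_masks is hypothetical.

```python
def apply_masks(sequence, masks, mask_char="X"):
    """Return `sequence` with every masked region replaced by `mask_char`.

    `masks` is an iterable of (mask_start, mask_end) pairs.
    Assumption: coordinates are 1-based and inclusive.
    """
    bases = list(sequence)
    for start, end in masks:
        if not 1 <= start <= end <= len(bases):
            raise ValueError(f"mask ({start}, {end}) lies outside the sequence")
        # Convert to Python's 0-based, end-exclusive slice.
        bases[start - 1:end] = mask_char * (end - start + 1)
    return "".join(bases)


masked = apply_masks("ACGTACGTAC", [(1, 3), (8, 10)])
```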
EST
est_no
est_lib_no
ID_entryname: Primary NCBI accession number
ID_dataclass: Data class from the EMBL database
ID_moleculetype: Molecule type from the EMBL database
ID_division: Division from the EMBL database
sequence_version: Version of the nucleotide sequence
entry_date: Date the EST record was entered into the plantEST database
DT_create_date: Date the EST record was entered into the EMBL database
DT_update_date: Date the EST record was last updated in the EMBL database
DT_update_version: Version of the EST record from the EMBL database
description: Description of the EST
comments: General comments, can contain any kind of data
clone: Name of the clone in which the EST was cloned
organelle: Organelle from which the EST was derived
SQ_sequencelength: Length of the EST sequence
SQ_count_A: Number of A nucleotides in the EST sequence
SQ_count_C: Number of C nucleotides in the EST sequence
SQ_count_G: Number of G nucleotides in the EST sequence
SQ_count_T: Number of T nucleotides in the EST sequence
SQ_count_other: Number of non-ACGT nucleotides in the EST sequence
SQ_sequence: EST nucleotide sequence
checksum: MD5 checksum of the nucleotide sequence
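The sequence-related attributes of EST (length, base counts, and checksum) can all be derived from SQ_sequence. The sketch below shows one way to compute them in Python. It assumes that counting is case-insensitive and that the checksum is the hexadecimal MD5 digest of the upper-cased sequence. The appendix does not specify either detail, and the function name derive_sequence_columns is hypothetical.

```python
import hashlib
from collections import Counter


def derive_sequence_columns(sequence):
    """Compute length, base counts, and MD5 checksum for one EST sequence."""
    seq = sequence.upper()  # assumption: upper- and lower-case bases are equivalent
    counts = Counter(seq)
    acgt_total = sum(counts[base] for base in "ACGT")
    return {
        "SQ_sequencelength": len(seq),
        "SQ_count_A": counts["A"],
        "SQ_count_C": counts["C"],
        "SQ_count_G": counts["G"],
        "SQ_count_T": counts["T"],
        # Everything that is not A, C, G, or T (for example N).
        "SQ_count_other": len(seq) - acgt_total,
        # Assumption: hexadecimal MD5 digest of the upper-cased sequence.
        "checksum": hashlib.md5(seq.encode("ascii")).hexdigest(),
    }


row = derive_sequence_columns("ACGTNNacgt")
```

A checksum stored this way would allow identical sequences to be found by comparing short fixed-length strings instead of full sequences.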
EST_LIT
est_no
lit_ref_no
LIT_REF
lit_ref_no
medline_no: Identifier of the literature reference in the MedLine database
pubmed_no: Identifier of the literature reference in the PubMed database
lit_group: Group that created the article
lit_title: Title of the article
lit_location: Location where the article was originally published
lit_comments: Comments on the literature reference
LIT_ATR
lit_atr_no
lit_ref_no
author_no
lit_atr_pos: Position of the author in the list of authors for this article
AUTHOR
author_no
author_name: Name of the author
EST_CNT
est_cnt_no
est_no
contig_no
padded_sequence: Modified sequence of the EST as it is aligned to the consensus
orientation: Orientation of the EST sequence in the contig
consensus_start: Nucleotide position on the consensus where this EST starts
align_start: First nucleotide of the EST that matches the consensus
align_end: Last nucleotide of the EST that matches the consensus
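To illustrate how padded_sequence and consensus_start could be combined when displaying a contig as text, the sketch below indents each padded EST so that it lines up under the consensus. It assumes that consensus_start is 1-based and never smaller than 1. The function name layout_row and the example sequences are made up for illustration.

```python
def layout_row(padded_sequence, consensus_start):
    """Indent a padded EST so that it lines up with the consensus.

    Assumption: consensus_start is 1-based and at least 1.
    """
    if consensus_start < 1:
        raise ValueError("this sketch does not handle starts before the consensus")
    return " " * (consensus_start - 1) + padded_sequence


# Illustrative data only: one consensus and two aligned ESTs.
consensus = "ACGTACGTAC"
lines = [consensus, layout_row("GTAC*TAC", 3), layout_row("ACGT", 1)]
alignment = "\n".join(lines)
```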
CONTIG
contig_no
build_no
consensus_sequence: Consensus sequence of the contig
contig_length: Length of the consensus sequence
contig_seq_count: Number of sequences in the contig
contig_avg_seq_length: Average length of the sequences in the contig
contig_snp_count: Number of SNPs in the contig
CNT_BLD
build_no
build_name: Name of the contig building project
build_date: Date of the contig building project
build_description: Description of the contig building project
program_name: Name of the program used to build the contigs
program_version: Version of the program used to build the contigs
program_parameters: Parameters used during contig building
avg_contig_length: Average length of the contigs in this project
snps_detected: Status of SNP detection within this contig build
PMORPH
pmorph_no
est_cnt_no
pmorph_startbase: First base of the polymorphism
pmorph_endbase: Last base of the polymorphism
Appendix 3: updateSpecies.sql – updates to tables SPECIES and EST_LIB