Contents

1 Introduction and goals
1.1 Introduction
1.2 Goals
2 Background
2.1 ESTs, contigs, and SNPs
2.2 Ontologies
3 Database and interface design
3.1 Overview
3.2 Data collection
3.2.1 EMBL EST data
3.2.2 UniLib EST library data
3.2.3 Gramene tissue ontology data
3.3 Data storage
3.3.1 EMBL EST data
3.3.2 UniLib EST library data
3.3.3 Gramene tissue ontology data
3.3.4 Sequence mask data
3.3.5 Contig and SNP data
3.3.6 Indices
3.3.7 Redundancy versus speed
3.4 Data import
3.4.1 EMBL EST data
3.4.2 UniLib EST library data
3.4.3 Gramene tissue ontology data
3.5 Data processing
3.5.1 Overview
3.5.2 Vector contamination screening
3.5.3 Contig building
3.5.4 SNP detection
3.5.5 Results
3.6 Data presentation
4 Conclusion and future work
4.1 Conclusion
4.2 Future work
Literature References
Acknowledgements
Appendices
Appendix 1: Overview of plant EST databases on the internet
Appendix 2: Entities in the plantEST database
Appendix 3: updateSpecies.sql – updates to tables SPECIES and EST_LIB

Summary

Expressed Sequence Tags (ESTs) are single-read sequence fragments produced from reverse-transcribed mRNA molecules. They are especially important in plant genetics, as the draft genome sequences of only two plant species are available. This low number is attributable to the fact that plant genomes are generally large and contain many repeats. During the last few years an increasing number of EST sequencing projects have been undertaken to cover the expressed part of the genome for a large number of plant species. The data generated in these projects is usually stored in large public sequence databases such as NCBI dbEST and EMBL.
The significant sequence redundancy in the EST datasets, in combination with the presence of sequences from many different varieties of a broad range of plant species, makes it attractive to screen plant ESTs for sequence features such as Single Nucleotide Polymorphisms (SNPs). SNPs are the most abundant sequence variations in most genomes, and can be used in association studies, in linkage analyses, and for generating high-density genetic maps that can be used in the study of disease-susceptibility genes. In most plant genomes SNPs are found at a frequency of roughly 1 in every 60 to 600 nucleotides.

The EST data from the public databases are available as flat text files, and although this is a suitable format for data distribution, it does not allow for convenient processing. The plantEST relational database was designed to store these EST data from the EMBL database in a data structure that allows easy retrieval, modification, and processing of the data therein. Data selection was improved through the addition of annotation data from the UniLib database, and ontology terms from the Gramene Cereal Plant Anatomy Ontology.

To evaluate the functionality of the plantEST database in the processing of EST sequences, a sample EST data set was extracted from the database and used to detect SNPs. To obtain these SNPs the EST sequences were screened for vector contamination sequences and clustered into contigs. The plantEST relational database structure was designed to allow storage of the data that was generated in the contig building process, in addition to the EST data. Additionally, interfaces were designed to view and extract data from the database in a user-friendly manner.

1 Introduction and goals

1.1 Introduction

This report describes the construction and evaluation of plantEST, a relational database system for plant EST, contig, and polymorphism data.
The database was developed during a six-month thesis project for a Bioinformatics Master's degree at the Centre for Molecular and Biomolecular Informatics (http://www.cmbi.kun.nl/) in Nijmegen. The plantEST database design was based on a request from KeyGene (http://www.keygene.com/) to develop a plant EST database for use in their EST processing pipelines. The database definitions of the plantEST database, as well as the source code for all the plantEST Explorer program modules described in this report, are available from the author on request.

1.2 Goals

The goals of this research are (1) to create a data structure that allows fast and flexible access to plant EST data, (2) to refine the annotation of this EST data, and (3) to prove the functionality of this data structure in EST processing. The limited time span of the project narrows these global goals to the concrete goals outlined below.

Realization of the first goal must result in a data structure, called the plantEST database, that allows the data to be searched on many criteria. This structure must be easily adaptable to store additional data types such as the contig and polymorphism data that will be produced during the realization of the third goal. The data must be accessible through a user-friendly interface that can be easily adapted to display the data in a different fashion, and to display other data types. The EST data that will be stored in this data structure consists of all plant EST sequences from the EMBL Nucleotide Sequence Database.

To achieve the second goal, the EMBL data must be complemented with additional annotation information from the UniLib database. Furthermore, the tissue type annotation from the EMBL data must be refined through the use of the Gramene Cereal Plant Anatomy Ontology.
The third goal must be realized through the creation of contigs from a part of the EST dataset and the detection of SNPs from these contig data. The data structure from the first goal must be adapted towards storing these data, and user interfaces must be developed to display the contig and polymorphism data.

2 Background

2.1 ESTs, contigs, and SNPs

Because the genome sequences of many plant species are not available, genetic analysis in these organisms relies on genetic markers. The most abundant markers that can be used for genetic mapping are single nucleotide polymorphisms (SNPs). A SNP is the substitution of a single nucleotide for another between two individuals of the same species. In important crop species such as maize and soybean, SNPs occur at frequencies in the order of magnitude of 1 SNP per 60 - 600 nucleotides (Kota et al, 2003). Besides their high abundance, advantages of SNPs over other molecular markers are their phenotypical neutrality, and the fact that they exist predominantly in only two variants, a property that is highly valuable in linkage analysis and genotyping (Torjek et al, 2003). Moreover, compared to other genetic markers such as tandem repeats, SNPs have a higher stability (Picoult-Newberg et al, 1999). The development of high-throughput detection methods for SNPs has led to their use as one of the most important genetic markers in agricultural breeding (Batley et al, 2003).

Traditional identification of SNPs is predominantly based on targeted DNA amplification and sequencing of single genes. These methods require prior knowledge of the sequence that contains the polymorphism (Torjek et al, 2003). Recently developed approaches scan complete genomes for this kind of polymorphism to generate high-density genetic maps. However, this requires the availability of a species' genome sequence.
As these are not available for important crop species such as soybean and maize, expressed sequence tags (ESTs) have recently gained popularity as a potential source for polymorphism data such as SNPs. ESTs are single-read sequences produced from reverse-transcribed mRNA molecules. The typical preparation of an EST library results in a highly redundant set of nucleotide sequences. These ESTs represent a sampling of the transcriptome of the organism at a specific developmental stage under specific conditions (Rudd, 2003). The relatively low cost and easy automation of EST sequencing have resulted in the availability of more than four million plant EST sequences in the three major public databases (NCBI, EMBL and DDBJ). Because the ESTs from the public databases have been obtained from many different individuals, comparisons of these sequences can lead to the discovery of polymorphisms such as SNPs.

There are two major limitations to EST sequences when compared to genome sequences for use in genetic analysis. The first involves the distribution of sampling of host genes within an EST library, which, under normal library construction conditions, represents the ratio of mRNAs present within a specific tissue under specific environmental conditions. Common housekeeping genes are ubiquitously expressed within almost any cell type and are overrepresented in most unnormalized EST collections, whereas specialized gene products may not be present at all because they were not expressed in detectable amounts.

The second limitation to ESTs is the overall sequence quality and length. The quality of individual nucleotides within an EST sequence is partly determined by the biochemistry of the reverse transcriptase polymerase chain reaction (RT-PCR) and sequencing reaction that are applied to generate these sequences.
Other quality-determining factors are the interpretation of the electropherogram trace data by the automatic sequencer, and the quality and quantity of the sample from which the nucleotide sequence is determined. Because the sequence is read single-pass it can contain a number of incorrectly sequenced residues. Additionally, EST sequences may contain stretches of vector or polylinker contaminations (Rudd, 2003).

The maximum length of reliable sequence of an EST is usually limited to several hundreds of nucleotides, which is significantly shorter than most full-length plant gene transcripts. Larger continuous stretches of sequence data can be generated by making use of the sequence redundancy that is present in large EST collections such as the EMBL database. This is achieved by building contigs from the EST sequences. The use of the word contig in this context requires a small explanation, as this word is usually used in the assembly of sequence fragments from genome sequencing projects. In the glossary of the NCBI Handbook (http://www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View..ShowSection&rid=handbook) a contig is defined as 'a contiguous segment of the genome made by joining overlapping clones or sequences'. In plantEST database terms, a contig consists of a collection of overlapping EST sequences which ideally represents a single gene transcript. The term cluster is not used as this term refers to a more general grouping of sequences based on certain criteria such as sequence homology or expression data.

A large set of methods to build contigs from EST collections is currently available, and most of these methods follow the same general outline. To create contigs, the available EST sequences from a species are first clustered on the basis of sequence similarity. In the next step a multiple alignment of each cluster of sequences is made, and a consensus sequence is deduced. The consensus sequence is of higher overall quality and length than the individual ESTs.
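The last two steps of this outline can be illustrated with a minimal Python sketch. It assumes that the ESTs of one cluster have already been aligned and padded to equal length with '-' as the gap character. The function names and the simple majority-vote rule are illustrative assumptions and do not represent the method of any particular contig building tool.

```python
from collections import Counter

GAP = "-"  # assumed gap/padding character in the aligned sequences


def consensus(aligned):
    """Return a majority-vote consensus for equally long, pre-aligned sequences."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for column in zip(*aligned):
        # Count only real nucleotides; padding does not vote.
        counts = Counter(base for base in column if base != GAP)
        result.append(counts.most_common(1)[0][0] if counts else GAP)
    return "".join(result)


def mismatches(aligned, cons):
    """List (sequence index, column, EST base, consensus base) per mismatch."""
    found = []
    for index, seq in enumerate(aligned):
        for pos, (base, ref) in enumerate(zip(seq, cons)):
            if base != GAP and base != ref:
                found.append((index, pos, base, ref))
    return found


# Three hypothetical, already aligned ESTs from one cluster.
ests = [
    "ACGTTGCA--",
    "--GTTGCAGT",
    "ACGATGCAGT",
]
cons = consensus(ests)
print(cons)                    # consensus spans more columns than EST 0 or 1
print(mismatches(ests, cons))  # candidate polymorphisms or sequencing errors
```

In this toy example the consensus covers all ten columns although two of the three ESTs cover only eight, and the single mismatch is the kind of position that would later be examined as a candidate SNP.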
The consensus sequence is determined from electropherogram trace data, from internal program statistics, or from a combination of both. Nucleotide mismatches between the consensus sequence and the individual EST sequences represent polymorphisms, sequencing errors or differences between highly homologous gene transcripts.

In most EST contig building approaches, the EST sequences are trimmed of vector contamination prior to clustering. Contamination of the EST sequence by vector sequence fragments can arise during the molecular cloning of the EST. During the excision of an EST from its cloning vector, small fragments of the vector DNA can remain attached to the EST. These short stretches of vector sequence can result in biologically false overlaps between ESTs during contig building, because two sequences that share a common vector contamination can be joined together.

Besides the removal of vector contaminations, or vector masking, other strategies can be employed to facilitate the contig building procedure. In addition to vector sequences, EST libraries can be contaminated with sequences from the cloning host. Furthermore, the presence of repeats in the EST sequences can hamper the contig building during the phase when overlaps between sequences are determined. Both these problems can be solved similarly to the vector contamination problem, namely by masking the undesired nucleotides in the EST sequences.

The large number of nucleotide mismatches that is usually present within a contig of EST sequences is partially attributable to the low quality of the EST sequences. A number of different software applications have been described that focus on separating meaningful sequence polymorphisms from sequencing errors (Picoult-Newberg et al, 1999; Batley et al, 2003; Kota et al, 2003). Many of these applications make use of the electropherogram trace data to distinguish between 'true' and 'false' polymorphisms.
However, recently developed algorithms such as SNiPper (Kota et al, 2003) can significantly enrich the pool of nucleotide mismatches for true SNPs without these trace data.

2.2 Ontologies

Clustering and comparing biologically related data can be performed at other levels than grouping nucleotide sequences. The annotation data that describes the biological origin of the EST sequences and libraries can be used to compare gene expression in various tissues under various environmental conditions. However, such comparisons are often hindered by the variability of terms used to describe comparable objects. Ontologies such as the Zea mays Plant Structure Ontology (Vincent et al, 2003) and the Gramene Cereal Plant Anatomy Ontology (Ware et al, 2002) have been designed to overcome this problem.

Vincent et al (2003) describe an ontology as 'a classification methodology for formalizing a subject's knowledge in a structured way'. In general, an ontology can be represented as a tree-like structure, or more specifically as a rooted and connected Directed Acyclic Graph (DAG) of terms with a defined meaning from a controlled vocabulary. Each element in the graph can be connected to other elements via hierarchical and associative relationships such as IS-A and PART-OF. Therefore it is important that the terms in an ontology are well defined and that the relationship between the terms is accurately represented. Correct and concise annotation of the controlled vocabulary terms by means of definitions and the inclusion of synonyms facilitate the consistent use of ontologies.

The Plant Ontology Consortium (http://www.plantontology.org/) curates the development of a number of controlled vocabularies representing various plant-based knowledge domains. This consortium attempts to represent the current biological understanding of plant tissues and developmental stages as DAGs of controlled vocabularies.
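Such a graph can be sketched in Python as follows. The terms and relationships below are invented for illustration and are not copied from any of the ontologies mentioned. The sketch only shows how IS-A and PART-OF edges in a DAG allow a term to be traced back to all of its ancestors, including cases where a term has more than one parent.

```python
# Each term maps to a list of (relationship, parent term) edges.
# Hypothetical example terms, not taken from a real ontology.
ONTOLOGY = {
    "plant structure": [],
    "organ": [("IS-A", "plant structure")],
    "tissue": [("IS-A", "plant structure")],
    "leaf": [("IS-A", "organ")],
    "epidermis": [("IS-A", "tissue")],
    "leaf epidermis": [("IS-A", "epidermis"), ("PART-OF", "leaf")],
}


def ancestors(term, graph=ONTOLOGY):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in graph[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# A term with two parents inherits the ancestors of both branches.
print(sorted(ancestors("leaf epidermis")))
```

Because the hypothetical term 'leaf epidermis' has two parents, the structure is a graph rather than a strict tree, which is why the text speaks of a DAG.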
Each element in these biological ontology graphs represents a concrete biological term such as 'leaf' or 'epidermis'. The consortium currently focuses on the integration of the taxon- and species-specific ontologies for grasses, Zea mays, and Arabidopsis thaliana (http://www.gramene.org/, http://www.maizegdb.org/, http://www.arabidopsis.org/) into common plant tissue and developmental stage ontologies.

3 Database and interface design

3.1 Overview

To realize the goals outlined in section 1.2 the approach represented in figure 1 was followed. This approach employs four basic steps that are discussed in more detail in the following sections. The first step is covered in section 3.2 and describes the collection of the EST data from the EMBL, UniLib and Gramene databases. The second step involves the development of the database structure (section 3.3) and the import of the EST data into this database (section 3.4). Section 3.5 covers the third step of the approach, which is the processing of the EST sequences into contig and SNP data. Finally, the fourth step involves the presentation of the EST, contig, and SNP data from the database to the user. This step is described in section 3.6.

Figure 1: Overview of the methods used to create and test the plantEST database and its interface. The first step involves the collection of EST sequence and annotation data from the EMBL, UniLib and Gramene databases. In the second step Python and PostgreSQL were used to develop a data structure to store these data. Contig and polymorphism data were derived from the EST data in the third step using the TGICL and VecScreen tools. The Python programming language was used to create parsers that import these data into the plantEST database. Interfaces to access the data were developed in the fourth step.

Figure 2: EMBL record for the EST with accession number BG443252.
Each line in the record starts with a two-letter code that indicates the data attribute, followed by that attribute’s value.

3.2 Data collection

The first step in the design of the plantEST database consists of the retrieval of EST sequence and annotation data from the EMBL, UniLib, and Gramene databases. The data sets that were used to create the plantEST database are described below.

3.2.1 EMBL EST data

A large number of plant EST sequencing projects have been undertaken recently. Most of these projects either produce data that is not publicly available or deposit their data in public sequence databases such as the EMBL Nucleotide Sequence Database (Kulikova et al, 2004). Appendix 1 shows an overview of public EST databases on the Internet. The EMBL database can be considered the most extensive resource for plant EST data. The data from the EMBL database, the contents of which are mirrored at the NCBI and DDBJ, are accessible as downloadable flat files from the EBI FTP server (ftp://ftp.ebi.ac.uk/pub/databases/embl/release/).

The data from the plant EST division of the EMBL database release 76 (September 2003) were used as the EST dataset for the plantEST database. These data consist of sequence and annotation data for 3,842,311 ESTs. The annotation data describes the biological source of the EST, such as the tissue type and developmental stage of the plant from which it was obtained, the environmental stresses that were applied to the plant, and the experimental conditions under which the EST was produced. An example of an EST record from the EMBL database is shown in figure 2.

The EST data was copied directly from the plant EST division of the EMBL database, which on closer inspection contains several non-plant species.
Examples of these are the oomycetes Hyaloperonospora parasitica, which causes the downy mildew disease in Arabidopsis thaliana, and Phytophthora infestans, which causes the late blight disease in Solanum tuberosum (potato). Because these pathogen species can be relevant in research on the affected plant species, the corresponding ESTs were copied as well.

3.2.2 UniLib EST library data

The records from the EMBL database often contain several cross-references to other databases such as GOA, MaizeDB, and SWISS-PROT, and almost every record contains a reference to the UniLib database. UniLib, the Unified Library Database, is maintained at the NCBI and contains biological and experimental annotation information of EST and SAGE libraries present in the dbEST, UniGene, and SAGEmap databases (ftp://ftp.ncbi.nih.gov/repository/UniLib/library.report).

Figure 3: UniLib record for the EST library with UniLib number 1285. Each line in the record contains an attribute:value pair that describes a property of the EST library.

The data from the 03-03-2004 release of the UniLib database were used to expand the annotation of the ESTs from the EMBL database. The UniLib dataset from this date was used instead of the one available at the time of the EMBL Release 76 dataset (September 2003), because the older UniLib data lacked entries for approximately 100 EST libraries that were present in the EMBL dataset. Data for these EST libraries is present in the 03-03-2004 release of the UniLib data.

The annotation of attributes such as tissue type, developmental stage and library treatment that are present within UniLib can also be found in the EMBL database. However, the information from the UniLib database is more detailed than the information in the EMBL database for a large number of EST libraries. In addition to the attributes present in the EMBL database the UniLib database contains distinct attributes for cloning vector and restriction sites.
In the EMBL database these data are either absent or available only as part of another attribute value. These data can be used, for example, to remove cloning vector sequence contaminations from the ESTs. Figure 3 displays an example of an EST library record from the UniLib database.

3.2.3 Gramene tissue ontology data

The annotation of most biological and experimental attributes of the EMBL data such as tissue, developmental stage, and treatment is written in human-readable form and often consists of multiple words or whole sentences. Because the data are submitted by many different researchers, different words and descriptions are used to refer to a single concept. For example, ESTs from Arabidopsis thaliana that are expressed in leaf tissue have tissue annotations such as 'leaf', 'leaves', 'sheath', or 'seedling green leaf'. This variability of terms used to describe a single concept hampers the clustering and comparison of data on the basis of biologically meaningful attributes.

The terms from the Gramene Cereal Plant Anatomy Ontology (ontology revision 2.0, definitions revision 2.1) were used in the plantEST database to describe the tissue type annotations from the EMBL data in a more consistent way. Gramene (http://www.gramene.org) is a database for comparative genomics of grasses, and uses a set of ontologies to describe the biological and biochemical properties of sequence data records such as ESTs. The Cereal Plant Anatomy Ontology is used as a controlled vocabulary of terms that are linked to tissue and cell type annotation data. This ontology covers a broad range of general plant tissues such as leaf and root, as well as several taxon-specific terms for the species in the Gramene database.

3.3 Data storage

The relational database structure of the plantEST database was designed to store EST, contig, and polymorphism data. The PostgreSQL Relational Data Base System (http://www.postgresql.org/) was used to create this database structure.
This section discusses the database structure that was designed to store the EST data from the EMBL, UniLib and Gramene databases.

3.3.1 EMBL EST data

The EST records from the EMBL Nucleotide Sequence Database contain sequence and annotation information. Because this information is stored as attribute:value pairs, a relational database is a logical data structure to store these data. As a first consideration, each attribute from the EMBL record can be used to form a column in a single database table. However, several attributes can have more than one value in a single EST record. Furthermore, multiple EST records can have an identical value for an attribute common to each of those records. These types of attributes require the introduction of new tables in the relational database. The relationships that exist between these attributes define the relationships between the tables. The relationships that were found between the attributes of the EMBL EST records are listed below. The black part of the database model from figure 4 was created from these relationships.

- one plant species can have many ESTs
- many ESTs together form one EST library from one plant species
- one EST can have many database cross-references
- one EST can have many literature references
- one literature reference can refer to many ESTs
- one literature reference can have many authors
- one author can contribute to many literature references

Closer inspection of the EMBL data revealed that the tissue_type attribute from the feature table of the EMBL records can have identical or similar values for several EST records. This is also true for the dev_stage attribute from the feature table. From the relational database perspective this would mean that these attributes should be modeled into tables having a one-to-many relationship to the EST_LIB table. Furthermore, EST libraries can be generated from more than one tissue, and during more than one developmental stage.
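The many-to-many relationship between EST libraries and tissue descriptions can be sketched with a junction table. The example below uses Python's built-in sqlite3 module with an in-memory database instead of PostgreSQL, so that it runs without a database server. Only the table names EST_LIB, TISSUE, and ELB_TIS follow the plantEST model; all column names, the helper function, and the example values are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE est_lib (lib_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tissue  (tis_id INTEGER PRIMARY KEY, description TEXT UNIQUE);
-- Junction table: one row per (library, tissue) pair.
CREATE TABLE elb_tis (
    lib_id INTEGER REFERENCES est_lib(lib_id),
    tis_id INTEGER REFERENCES tissue(tis_id),
    PRIMARY KEY (lib_id, tis_id)
);
""")
conn.executemany("INSERT INTO est_lib VALUES (?, ?)",
                 [(1, "library A"), (2, "library B")])
conn.executemany("INSERT INTO tissue VALUES (?, ?)",
                 [(1, "leaf"), (2, "root")])
# Library A was made from two tissues; 'leaf' occurs in two libraries.
conn.executemany("INSERT INTO elb_tis VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])


def libraries_for_tissue(description):
    """Return the names of all libraries linked to a tissue description."""
    rows = conn.execute(
        """SELECT l.name
           FROM est_lib l
           JOIN elb_tis lt ON lt.lib_id = l.lib_id
           JOIN tissue t ON t.tis_id = lt.tis_id
           WHERE t.description = ?
           ORDER BY l.name""",
        (description,),
    ).fetchall()
    return [name for (name,) in rows]


print(libraries_for_tissue("leaf"))
print(libraries_for_tissue("root"))
```

Storing each tissue description once and linking it through the junction table avoids repeating identical descriptions for every library that shares them.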
Based on these many-to-many relationships the tissue_type and dev_stage attributes were modeled into separate tables, as displayed in red in figure 4.

3.3.2 UniLib EST library data

The data from the UniLib database consists of the annotation of biological attributes from the EST libraries present in the EST division of the EMBL database. The UniLib database contains attributes similar to those found in the EMBL database. However, the vectors and restriction sites that are used in the molecular cloning of the ESTs are stored in separate attributes in the UniLib database, whereas these data are part of the note attribute in the EMBL database, when present. Because these data may be valuable for removing vector contamination from the EST sequences, these attributes were introduced in the EST table of the plantEST database.

3.3.3 Gramene tissue ontology data

The TIS_ONT table was introduced into the relational database structure as depicted in the green part of figure 4 to store the tissue ontology terms that will be associated with the tissue type descriptions from the EMBL files. The table is designed to store all ontology keywords from the Gramene Cereal Plant Anatomy Ontology, as well as any available definitions and comments. Each tissue type annotation from an EST can contain many tissue ontology keywords, and each ontology term can be linked to many tissue descriptions. This would suggest a many-to-many relationship between the TISSUE and the TIS_ONT tables. However, the TIS_ONT table is linked to the EST_LIB table rather than to the TISSUE table because other attributes from an EST library can also contain tissue annotation data. In this way, these annotations can also be used to link ontology keywords to EST libraries.

Figure 4: Database model for the plantEST database, designed to store EST information from the EMBL, UniLib, and Gramene databases. The EST table in this model stores all data that is unique for an EST.
Data that is common to all ESTs in an EST library is stored in the EST_LIB table. The TISSUE and DEVSTAGE tables (marked in red) store the tissue type and the developmental stage, respectively, from which the EST library was generated. The TIS_ONT table (marked in green) stores tissue ontology keywords from the Gramene database. The SPECIES table contains the name of the species to which the ESTs in an EST library belong. The DB_XREF table contains database cross-references, and the LIT_REF and AUTHOR tables hold the literature references for each EST record. Tables ELB_TIS, ELB_DEV, ELB_TIO, and LIT_ATR store the many-to-many relationships that exist between the two tables that are linked to each of these tables. The legend in the lower right corner of this figure applies to figures 4 - 6.

3.3.4 Sequence mask data

As outlined in section 1.2, the functionality of the database will be tested by building contigs from part of the EST dataset and detecting polymorphisms within these contigs. This section and the next describe how the plantEST database structure was extended to allow storage of the data that is generated during contig building and SNP detection.

Sequence masks, such as vector and repeat masks, are important when building contigs from EST data, as they prevent false overlaps between EST sequences. Many sequence masks can be applied to one EST sequence, as it can contain vector sequences on either end of the sequence, as well as one or more repeat sequences. Sequence mask data is stored in the SQ_MASK table, as displayed in red in the database model in figure 5.

3.3.5 Contig and SNP data

To incorporate contig and SNP data into the plantEST database, the relationships between the EST data and the contig and SNP data were examined. The initial relationships that were found are listed below, and the resulting database structure is displayed in the green area of figure 5.
- one contig consists of many ESTs
- one contig can contain many SNPs

Figure 5: Extension of the plantEST database model to allow storage of sequence mask, contig and polymorphism data. The SQ_MASK table (marked in red) stores the sequence mask data. Contig and polymorphism data is stored in tables CONTIG and P_MORPH (both marked in green) respectively.

There are two drawbacks to the database structure from figure 5. The first drawback is that this structure allows an EST to belong to only one contig. Using this database structure it is not possible to assign an EST to multiple contigs, for example when a set of EST sequences has been processed multiple times or by different contig building tools. The second drawback arises from the fact that SNPs are stored with only a reference to the contig to which they belong. With this kind of assignment it is unclear on which sequence within that contig the SNP is located. To resolve these problems, the final database structure from figure 6 was created based on the following relationships:

- one contig build consists of many contigs
- one contig consists of many ESTs
- one EST can belong to many contigs
- one EST-contig pair can have many polymorphisms

Figure 6: Final database model for the plantEST database. The relationships between the EST, CONTIG, and P_MORPH tables have been redefined to represent the biological relationships more accurately, as highlighted in the figure. The CNT_BLD table holds general information on a set of contigs that was produced during a single contig building process and can be considered a 'project description' table. The CONTIG table contains the data that is specific for each contig in a contig build, and table EST_CNT stores the many-to-many relationship between EST and CONTIG. Polymorphism data such as SNPs are stored in the P_MORPH table.

3.3.6 Indices

The plantEST database was designed to store sequence and annotation data on several million plant ESTs.
Although the data can be searched on every attribute, it is likely that some attributes are used more often as search criteria than others. To the users of the database it is important that the most common types of requests to the database are processed as fast as possible. To accelerate the most common types of queries, indices were created on the attributes that are used most often as search criteria. The attributes that are indexed, with the exception of the primary key attributes, are marked in green in the database overview in appendix 2.

The primary key and unique attributes of each table are automatically indexed by the PostgreSQL Data Base Management System. These indices are used when searching a table for one specific record on the sequence number or unique attribute value, respectively. Additional indices have been created on all foreign key attributes to increase the search speed when two tables are joined. Finally, some specific data attributes have been indexed to increase the performance of frequently used queries.

3.3.7 Redundancy versus speed

In addition to the indices discussed above, data retrieval from the database has been accelerated through the addition of some redundant data. Although a relational database design normally aims to reduce redundancy, the gain in search speed was judged to outweigh the increased redundancy. The redundant data that has been added to the database can be calculated or derived from the original data, as is explained by the three examples outlined below. Attributes that store redundant data are marked in red in the database overview in appendix 2.

The first example of redundant data in the database is the seq_count attribute in the EST_LIB table. This attribute stores the number of ESTs that belong to each EST library, which is often retrieved from the database.
The PostgreSQL implementation of the SQL function that calculates the number of ESTs per library is rather slow, and calculating these numbers on demand therefore requires a substantial amount of time. This is significantly reduced by calculating the number of ESTs per library whenever the database is updated and storing these numbers as an attribute value of the EST_LIB table. Another example of a redundant attribute that is used to increase database speed is the checksum attribute of the EST table. For each EST sequence that is stored in the plantEST database an MD5 checksum is calculated, which is used to detect the presence of duplicate sequences in the database. The MD5 algorithm is a message digest (hash) algorithm, not an encryption algorithm, that creates a 128-bit message digest from a message of arbitrary length. The RFC 1321 document on the MD5 algorithm (http://www.freesoft.org/CIE/RFC/1321/) states that 'it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest'. This means that any nucleotide sequence can be reduced to a practically unique 128-bit digest or checksum (written as 32 hexadecimal characters), which is significantly shorter than the EST nucleotide sequence itself. The checksums are used for rapid detection of duplicate sequences in the database, as comparisons between 32-character strings are faster than comparisons between strings of hundreds of characters in length. Duplicate sequences are not filtered out during the parsing of the EMBL data, because the EST records of duplicate sequences may contain non-overlapping annotation data. Furthermore, the presence of two identical sequences reduces the likelihood that those sequences contain sequencing errors. The third example is the inclusion of the padded_sequence attribute in table EST_CNT. This attribute stores a possibly modified copy of the original EST sequence.
These modifications are created during contig building to generate sequence alignments, and can consist of gap insertions, deletions of masked residues, and reverse complementation of the original EST sequence. A more flexible design would store each modification on the sequence separately, so that modifications can be altered, removed, or added to an EST sequence. However, this would impose two limitations on the data retrieval speed. The first limitation would be caused by the number of records that would have to be retrieved from the database. In the current database model, one record is retrieved for each modified EST sequence. If each modification were stored separately, one record would have to be retrieved from the database for each modification. The second limitation involves the sequence orientation. For many ESTs, the reverse complement of the original sequence is used in contig building. To store this property as an attribute, the concept of reverse complementation would have to be present in the database structure. Furthermore, each of these sequences would have to be reverse complemented whenever the corresponding contig information is requested. The drawback of adding these redundant data becomes apparent when the updating of the database is considered. For example, when new sequences are added to an existing EST library, the seq_count attribute value must be updated to reflect the new number of ESTs in that library. The programs (discussed in section 3.4) that import the data into the database also update these redundant attributes.

3.4 Data import

After the database structure described in section 3.3 was created, the EST and ontology data were inserted into the plantEST database. The parser programs that are discussed below were written to automate the data import.
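As an illustration of one redundant attribute that an import program has to maintain, the sketch below computes the MD5 checksum described in section 3.3.7 and uses it to group duplicate sequences. It is a minimal example built on Python's standard hashlib module, not the code of the plantEST parsers; the function names, the upper-casing of sequences before hashing, and the accession numbers in the usage example are assumptions.

```python
import hashlib


def sequence_checksum(sequence):
    """Return the 32-character hexadecimal MD5 digest of a nucleotide sequence.

    Assumption: sequences are stripped and upper-cased first, so that
    'acgt' and 'ACGT' receive the same checksum.
    """
    normalized = sequence.strip().upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def find_duplicates(sequences):
    """Group accession numbers that share a checksum.

    `sequences` maps accession number -> nucleotide sequence. Only groups
    with more than one member are returned.
    """
    groups = {}
    for accession, sequence in sequences.items():
        groups.setdefault(sequence_checksum(sequence), []).append(accession)
    return [sorted(members) for members in groups.values() if len(members) > 1]


# Usage with made-up accession numbers and sequences.
example = {"EX000001": "acgtacgt", "EX000002": "ACGTACGT", "EX000003": "TTTTGGGG"}
duplicates = find_duplicates(example)
```

Comparing the fixed-length checksums rather than the sequences themselves is what keeps the duplicate check cheap, at the cost of recomputing the checksum whenever a sequence changes.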
The programs were written in the Python programming language (http://www.python.org/), and the interfaces that accompany the parser programs were created using the Tkinter library, and are discussed in section 3.6. A general outline of the steps performed to enter the data into the database is shown in figure 7. The ‘manual processing’ label in this figure refers to the changes introduced in the species table. The species_name values from the SPECIES table were updated to a more uniform format that consists of the full scientific species name without variety or subspecies annotations. Annotations of species subdivisions such as cultivar or variety were stored in the appropriate attribute of the EST_LIB table. The SQL commands that introduced these changes in the SPECIES and EST_LIB tables are available in appendix 3.

Figure 7: Overview of data import into the plantEST database. First, the EmblParser program parsed the contents of the EMBL flat files into the plantEST database. Subsequently the UniLib data was processed by the UniLibParser script. After this step some manual processing of species data was performed. Finally, the ontology data were imported using the TissueOntologyParser script. The data flow is represented by the horizontally oriented arrows going from the EMBL, UniLib and Gramene databases to the plantEST database. The time flow is represented by the gray, vertically oriented arrows.

3.4.1 EMBL EST data

The EmblParser module parses data from EMBL flat files into the plantEST database. Each flat file contains a maximum of 100,000 EST records ordered by accession number, and each record in the file consists of a number of formatted lines. A detailed description of the line types in an EMBL flat file can be found in the EMBL Nucleotide Sequence Database User Manual (http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html). The EmblParser module reads the header of each line in the EMBL flat file and parses that line accordingly.
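The line-by-line approach can be sketched as follows. This is a simplified illustration rather than the EmblParser code: it relies only on the general EMBL flat-file conventions that each line starts with a two-letter line code, that sequence lines start with blanks, and that a record ends with a '//' line. The function name, the dictionary layout, and the sample record (including its accession number) are made up for this example.

```python
def parse_embl_records(lines):
    """Yield one dictionary per EMBL record.

    Keys are two-letter line codes (e.g. 'ID', 'AC', 'DE') mapped to the
    list of line contents; the nucleotide sequence is stored under the
    key 'sequence'.
    """
    record = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue                      # skip empty lines
        if line.startswith("//"):         # '//' terminates a record
            if record:
                yield record
            record = {}
            continue
        code, content = line[:2], line[5:]
        if code == "XX":                  # spacer line, carries no data
            continue
        if code == "  ":
            # Sequence line: keep the letters, drop blanks and the
            # position counter at the end of the line.
            bases = "".join(ch for ch in content if ch.isalpha())
            record["sequence"] = record.get("sequence", "") + bases
        else:
            record.setdefault(code, []).append(content.strip())
    if record:                            # file without a final '//'
        yield record


# Made-up example record, for illustration only.
sample = """\
ID   EX000001; SV 1; linear; mRNA; EST; PLN; 12 BP.
XX
AC   EX000001;
XX
DE   hypothetical example entry
XX
SQ   Sequence 12 BP; 3 A; 3 C; 3 G; 3 T; 0 other;
     acgtacgtac gt                                                            12
//
""".splitlines()

records = list(parse_embl_records(sample))
```

Collecting all lines of a record in memory before yielding it matches the strategy of writing each entry to the database only once the end of that entry has been reached.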
All information that is parsed from one EST record is temporarily stored, and at the end of an entry that entry is written to the database tables. During the writing a redundancy check is performed to determine whether any data from the currently parsed record is already present in the database. These checks reduce the time required for the parsing by reducing the number of write transactions to the database. The idea behind this strategy is as follows: the EST records in the flat file are ordered by accession number, and most EST libraries consist of one or more series of incrementally assigned accession numbers. This means that two consecutive accession numbers usually belong to the same EST library.

3.4.2 UniLib EST library data

The UniLib flat file is parsed by the UniLibParser, a 'quick-and-dirty' parser script. The UniLibParser retrieves a list of UniLib numbers from the EST_LIB table of the plantEST database. For each UniLib number in the list, the parser searches the UniLib file for information on that EST library. The data from the UniLib record is compared to the data from the plantEST database, and the database is updated where appropriate.

3.4.3 Gramene tissue ontology data

The ontology data consists of two flat files, an ontology file and a definitions file. The ontology file contains all tissue ontology terms and the relationships between these terms. The definitions file contains definitions and comments for part of the ontology terms present in the ontology file. Data from both files were imported into the TIS_ONT table of the plantEST database through the TissueOntologyParser program. After the ontology terms were imported into the database, the terms were manually assigned to EST libraries by examination of the tissue_type values that were associated with each EST library. A file was created that contains the tissue_no values of each record from the TISSUE table and the matching tis_ont_no values from the TIS_ONT table.
In this way, each tissue_type value from the TISSUE table (referenced through the corresponding tissue_no) was matched to any number of tissue_ontology values from the TIS_ONT table (referenced through the corresponding tis_ont_no). The data from this file were parsed into the database by the TissueOntologyConverter program. The 690 distinct tissue_type annotations from the TISSUE table were linked to 91 distinct tissue ontology terms by searching each tissue_type value for words that appeared in the set of tissue ontology keywords. Tissue descriptions in the tissue_type attribute that had no apparent match to ontology keywords were searched for synonyms of these keywords using Google (http://google.com).

3.5 Data processing

3.5.1 Overview

To determine the functionality of the plantEST database as a resource for EST data, and to extend the functionality of the database by integrating contig and polymorphism data, all 188,788 A. thaliana EST sequences from the database were assembled into contigs. The sequences were processed as outlined in figure 8, and parser modules were written to import the results from these steps into the plantEST database. The sequences from A. thaliana were chosen as a test dataset for two reasons. First, this plant species is in use as a model organism for plant molecular biology, and its complete genome sequence is publicly available. The large amount of available data on this species allows for validation of the contigs and polymorphisms. Second, the size of the dataset compared to the genome size of this organism suggests the presence of a sufficiently large amount of redundancy in the dataset, which is required to find overlaps between sequences during contig building.

Figure 8: Overview of contig and SNP generation from the A. thaliana EST sequences. First, the EST sequences were extracted from the database and screened for vector contamination. Next, contigs were built from the vector-masked A. thaliana EST sequences.
Finally, putative SNPs were detected from these contigs. The data that was generated in each step was stored in the plantEST database. The black arrows from and to the plantEST database indicate the data flow, whereas the vertically oriented gray arrows represent the time flow.

3.5.2 Vector contamination screening

Prior to contig building the A. thaliana EST dataset was screened for vector contamination using a BLASTN sequence identity comparison against the NCBI’s UniVec vector database (http://www.ncbi.nlm.nih.gov/VecScreen/UniVec.html). The UniVec database is a non-redundant set of vector sequences that are commonly used in molecular cloning experiments. For part of the ESTs from this dataset a description of the cloning vector was available in the plantEST database, which could be used to divide the EST dataset into smaller subsets and compare each of those subsets only to the corresponding vector sequences. However, due to the number of different cloning vectors used in the A. thaliana dataset, each EST sequence in the A. thaliana dataset was compared to all the vector sequences in the UniVec database. The vector masking was performed on a Paracel computer using the BLASTN 1.4.9-Paracel algorithm (Paracel BLAST User Manual, version 1.3). The parameters (-q -5 -G 3 -E 3 -F "m D" -e 700 -Y 1.75e12) that were used for the BLASTN algorithm were copied from the NCBI's VecScreen application (http://www.ncbi.nlm.nih.gov/VecScreen/). These parameters are optimized to find short nearly-exact matches while tolerating single-base deletion and insertion errors that are often generated in EST sequencing. An example of a vector contamination BLASTN result is shown in figure 9. The Paracel BLASTN software writes the output to a flat file. The ParacelParser module was written to parse these data into the database.
When an alignment between an EST sequence and a vector sequence was encountered by the parser, a sequence mask was created if the alignment was located within a user-specified distance from the 5' or 3' terminus of the EST sequence, as explained in figure 10. Subsequently, the plantEST database was queried for existing sequence masks for that EST sequence. If the new mask overlapped with an existing mask, that mask was extended; otherwise the new mask was stored in the database.

3.5.3 Contig building

The vector-masked A. thaliana EST sequences were assembled into contigs using the TGICL software (Pertea et al, 2003) with the default parameters. This software from The Institute for Genomic Research (TIGR) contains a modified version of the megablast clustering algorithm (Zhang et al, 2000), three custom clustering algorithms, and the CAP3 cluster assembly software (Huang and Madan, 1999). A command-line Perl script is wrapped around these programs to automate the contig building procedure. The contigs were built in two successive steps using the TGICL software. The clustering step consisted of all-versus-all pairwise similarity searches using the modified megablast program, which divided the large EST dataset into many smaller sets of homologous sequences, or clusters. In the assembly step, each of these clusters was submitted to the CAP3 assembly software, which built one or more contigs from each cluster. The CAP3 software does not require electropherogram trace data for the assembly of the contigs. This is important, as the trace data for the EST sequences are not available in the plantEST database and are difficult to retrieve from other resources. Clustering the sequences prior to the contig assembly significantly reduces the amount of time required for this process when compared to direct contig assembly from the complete EST dataset.
The sequence clustering method that is employed in the TGICL software detects similarity between sequences on the nucleotide level and creates clusters accordingly. With this approach it is possible that two EST sequences that are derived from distinct, but highly homologous, genes or gene fragments are clustered together. However, this small loss of sensitivity is outweighed by the increase in speed that is gained from this clustering step. Furthermore, the exact clustering strategy is not of critical importance to this research, as the goal of generating these contigs lies in testing the plantEST database. The CAP3 software writes the alignment information for the sequences in each contig to an ACE flat file. The Cap3AceParser module was written to parse the contents of the output file into the plantEST database. For each contig in the output file, the consensus sequence, the possibly modified instances of the EST sequences, and the alignment of the EST sequences to the consensus sequence were stored in the database. Modifications to the EST sequence consist of introduction of gaps, deletion of masked residues, or reverse complementation of the original sequence.

3.5.4 SNP detection

After contig building was completed, putative SNPs were detected within the contigs. A putative SNP is defined as a difference between a nucleotide on an EST sequence and the corresponding nucleotide on the consensus sequence of the contig to which that EST belongs. Because the overall quality of EST sequences is low, these sequences can contain errors such as incorrectly assigned residues and insertions or deletions of one or more bases. Several statistical rules can be applied to estimate the chance that a difference in nucleotide sequence between an EST and the corresponding consensus sequence is a true SNP or a sequencing error. A small number of software applications have been developed that use a set of statistical rules to enrich the pool of putative SNPs with true SNPs.
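To make the definition of a putative SNP given above concrete, the sketch below compares one padded EST sequence to the consensus sequence column by column and reports every mismatch. It is an illustration under stated assumptions, not the SNPSearch module itself: the function name and return format are made up, the gap character '*' and the 1-based consensus_start offset are assumptions about how the padded sequences are stored, and columns that contain a gap are skipped because they represent insertions or deletions rather than substitutions.

```python
def find_putative_snps(consensus, padded_est, consensus_start=1, gap="*"):
    """Return (est_position, consensus_base, est_base) for each mismatch.

    `padded_est` is the EST as aligned to the consensus (possibly with
    gap characters); `consensus_start` is the 1-based consensus position
    where the alignment begins (assumed to be >= 1). `est_position` is
    the 1-based position on the EST with gap characters not counted.
    """
    snps = []
    est_position = 0
    for offset, est_base in enumerate(padded_est):
        if est_base != gap:
            est_position += 1
        cons_index = consensus_start - 1 + offset
        if cons_index >= len(consensus):
            break                         # EST extends past the consensus
        cons_base = consensus[cons_index]
        if gap in (est_base, cons_base):
            continue                      # indel column, not a substitution
        if est_base.upper() != cons_base.upper():
            snps.append((est_position, cons_base.upper(), est_base.upper()))
    return snps


# Usage: the EST aligns from consensus position 3, contains one gap,
# and differs from the consensus at its last base.
example_snps = find_putative_snps("ACGTACGT", "GT*CGA", consensus_start=3)
```

No attempt is made here to separate true polymorphisms from sequencing errors; every mismatch is reported as a putative SNP.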
However, the issue of separating true SNPs from sequencing errors was not considered in this project due to time limitations. Putative SNP discovery was performed by the SNPSearch module, and the results were stored in the plantEST database. The SNPSearch module compares the consensus sequence of each contig to all the EST sequences in that contig. Whenever a difference is found between the consensus sequence and the EST sequence, the location and type of the aberrant nucleotide on the EST sequence are stored in the database.

Figure 9: Paracel BLASTN report for EST with accession number AA585898 against the UniVec database. The EST sequence shows homology to the pNS1 vector at one end of the sequence (bases 219 – 234).

Figure 10: Graphical overview of vector mask validation and extension. The green bar represents an EST sequence, and the red vertical lines indicate the position of the left and right limit attributes on the EST sequence. The blue bars (indicated A, B and C) directly below the EST sequence represent regions of the EST that show homology to a vector sequence. Homologous regions A and C are considered vector masks, as they are located between an end of the sequence and the left or right limit, respectively. Homologous region B is not considered a vector mask because it is not located within one of these regions. Regions A and C are extended towards the nearest end of the EST sequence (shaded blue bars on the lower row of the figure) as they fall within the limit regions.

3.5.5 Results

The focus of generating the contig and SNP data discussed above was on testing the plantEST database for its use in EST processing, rather than generating a set of highly accurate contigs and SNPs. Therefore, no effort was put into optimizing the parameters of the contig building programs. The contigs and SNPs that were generated in this experiment were not investigated in detail; however, some of the results from the contig building are discussed below.
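The mask validation and extension rule of figure 10, together with the merging of overlapping masks described in section 3.5.2, can be sketched as follows. This is one possible reading of the rule and not the ParacelParser code: a vector hit is assumed to qualify when its nearest edge lies within the limit distance of a sequence terminus, positions are assumed to be 1-based and inclusive, and both function names are hypothetical.

```python
def validate_hit(hit_start, hit_end, seq_length, left_limit=20, right_limit=20):
    """Turn a vector hit into a sequence mask, or return None.

    A hit qualifies when it starts within `left_limit` bases of the left
    terminus or ends within `right_limit` bases of the right terminus
    (an assumed interpretation of figure 10). A qualifying hit is
    extended to the nearest terminus.
    """
    near_left = hit_start <= left_limit
    near_right = hit_end > seq_length - right_limit
    if not (near_left or near_right):
        return None                       # internal hit, like region B
    mask_start = 1 if near_left else hit_start
    mask_end = seq_length if near_right else hit_end
    return (mask_start, mask_end)


def add_mask(existing_masks, new_mask):
    """Add a mask to a list of masks, extending any mask it overlaps."""
    start, end = new_mask
    kept = []
    for m_start, m_end in existing_masks:
        if m_start <= end and start <= m_end:     # the two masks overlap
            start, end = min(start, m_start), max(end, m_end)
        else:
            kept.append((m_start, m_end))
    kept.append((start, end))
    return sorted(kept)


# Usage on a 300-base EST with both limits set to 20.
masks = []
for hit in [(5, 30), (100, 150), (25, 40), (270, 290)]:
    mask = validate_hit(hit[0], hit[1], seq_length=300)
    if mask is not None:
        masks = add_mask(masks, mask)
```

In this example the hit at bases 100 to 150 is ignored because it lies outside both limit regions, and the second terminal hit only extends the mask created by the first one.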
A total of 2,830 vector masks were generated with the left and right limit parameters (as explained in figure 10) both set to 20. Sixteen EST sequences were masked on both ends of the sequence, and 28 sequences were completely masked. This last group consists of pure vector sequences that do not contain any plant DNA. Rather than only comparing each EST sequence to the sequence of the vector in which it was cloned, the complete A. thaliana EST dataset was compared to the UniVec database. This introduces a possibility of masking part of an EST that shows homology to the sequence of a different vector than the one it was cloned in. The 188,772 remaining A. thaliana EST sequences were assembled into 19,138 contigs. 199,783 putative SNPs were detected from these contigs. Due to the high error rate in EST sequencing this number does not necessarily reflect the number of true sequence polymorphisms. The overall fraction of true polymorphisms in this set could be increased by two methods. The first method involves the use of electropherogram trace data during contig building. These data assign a quality score to each base in an EST sequence, and can be used to discard putative SNPs that have a low quality score. The second method is based on the frequency of a polymorphism’s occurrence in a contig. This method is employed by the statistical SNP detection software mentioned in section 2.1.

3.6 Data presentation

The data that is present in the plantEST database can be retrieved in two ways. The first way involves the use of SQL commands to access the data via the psql interface of the PostgreSQL relational database management system. This text-only interface is included in the PostgreSQL software and displays database information as tables. An advantage of this interface is the high flexibility, as the user can retrieve any kind of information from the database as long as the SQL query used to retrieve the data is syntactically correct.
However, knowledge of SQL is required to use this interface to access data from the plantEST database. A second disadvantage is that the table format is not convenient for displaying complex information such as multiple sequence alignments. As an alternative, the plantEST Explorer modules were developed to access part of the data in the plantEST database via graphical menus. The modules that were designed to visualize data from the plantEST database can be accessed through the buttons from the plantEST Explorer main menu shown in figure 11. The three buttons on the left side of the plantEST Explorer provide access to data visualization modules. The buttons in the middle allow the user to import EST, vector mask, and contig data, and search the contigs in the database for SNPs. The right part of the interface contains buttons to access the help module, read the manual (this report) and exit the program.

Figure 11: Main interface of the plantEST Explorer program. Each button opens an interface to the corresponding module. The menu bar (top) provides shortcuts to several extra functions.

The Quick Help interface displayed in figure 12 can be accessed through the Help button of any plantEST module and informs the user about the usage of the plantEST Explorer interfaces. Each of the other interfaces is briefly discussed below. For a description of the usage of each interface, start the plantEST Explorer program and consult the Quick Help module. The EstSearch module searches the database for detailed information on an EST, based on a user-supplied accession number and detail options. The current implementation distinguishes between four levels of detail: EST, EST library, literature references, and contig information. The module retrieves data on the submitted accession number from the database based on the selected detail options, formats the data into human-readable text and displays the result on the screen. The output can be saved to a file.
Figure 13 displays the interface to the EstSearch module. The LibrarySearch module was designed to display information on the EST libraries in the database. The interface shown in figure 14 allows the user to select a set of EST libraries on the basis of identical characteristics such as tissue type and developmental stage, and display detailed information on the properties of these libraries. Additionally, the nucleotide sequences from the ESTs in the selected libraries can be filtered on properties such as sequence length and quality, and the final selection can be written to a FASTA file.

Figure 12: Quick Help interface of the plantEST Explorer.

Figure 13: EstSearch interface of the plantEST Explorer.

The interface to the ContigSearch module (figure 15) presents contig and polymorphism data to the user. It allows the user to search the database for contigs matching criteria such as consensus sequence length, number of sequences in the contig, or presence of a specific EST. The contigs that match the criteria are displayed as a multiple alignment of EST sequences, and putative SNPs are highlighted in red. Several dialog interfaces were created to accompany the parser programs that import data into the database. The programs behind these interfaces were discussed in sections 3.4 and 3.5. These dialog interfaces mainly consist of entry fields to enter the required data. Figure 16 shows the dialog interfaces for the EmblParser, ParacelParser and Cap3AceParser programs, which import EST, vector mask, and contig data, respectively.

Figure 14: LibrarySearch interface of the plantEST Explorer.

Figure 15: ContigSearch interface of the plantEST Explorer.

Figure 16: From top to bottom, from left to right: EmblParser, ParacelParser, and Cap3AceParser dialog interfaces of the plantEST Explorer.

4 Conclusion and future work

4.1 Conclusion

The plantEST database was created as a tool for fast and flexible access to EST data.
It contains EST sequence and annotation data from the EMBL and UniLib databases. These data have been refined through the addition of tissue ontology terms from the Gramene database. The plantEST Explorer program modules were developed to provide easy access to part of the data in the plantEST database through user interfaces, and to import data from various flat file formats into the plantEST database. The functionality of the database in EST processing was demonstrated by building contigs from part of the EST data, and detecting SNPs in these contigs. The data structure of the plantEST database was modified to store these data, which demonstrates the flexibility of the database with respect to adding new data types.

4.2 Future work

Several features were considered during the development of the plantEST database that were not included in its final design due to time limitations. This section discusses the idea behind several of those features, and how they can be incorporated in the plantEST database structure. The first example of data that was considered valuable for the plantEST database, but was not included therein, is public EST data from resources other than the EMBL database. While the EMBL database contains the largest publicly available collection of plant EST data, some of the databases mentioned in appendix 1 contain several thousands of EST sequences that are not available in the EMBL database. To include these data in the plantEST database, parser modules should be written that extract the EST data from the files in which these data are distributed, without creating duplicate records. The second example of data to include in the plantEST database is additional ontology data. The annotation of the EMBL and UniLib data is extremely heterogeneous, limiting the possibilities to make biologically meaningful comparisons between the data.
The Gramene Cereal Plant Anatomy Ontology that was introduced into the plantEST database has improved the annotation of the EST data; however, additional ontologies are available to further refine this annotation. An example of this is the Gramene Cereal Growth Stages Ontology, which can be used to annotate the developmental stages of the plants from which the ESTs were generated. These ontology terms should be linked to the EST libraries either through manual annotation or by a parser that contains a dictionary of ontology terms and synonyms. Additionally, the DAG structure of these ontologies can be included in or linked to the plantEST database. This can be used in the retrieval of EST libraries from the database on the basis of ontology terms by automatically including ‘child’ ontology terms when selecting a ‘parent’ term. For example, when all EST libraries that contain the ‘leaf’ ontology term were to be retrieved from the database, all EST libraries that contain ontology terms that are ‘children’ of this term (such as ‘leaf axis’ and ‘leaf sheath’) would be retrieved as well. A third example is the electropherogram trace data that is generated when ESTs are sequenced. These trace data indicate the quality of each base in the nucleotide sequence and can improve the sensitivity of the contig building process. Some of these data can be retrieved from the NCBI Trace Archive (http://www.ncbi.nlm.nih.gov/Traces/), while other data are available at the sequencing centers where the EST sequences were generated. An important reason for not including these data is the volume that they encompass, as every base in each EST sequence has a trace value.

Literature references

Batley J., Barker G., O'Sullivan H., Edwards K.J., Edwards D. (2003) Mining for Single Nucleotide Polymorphisms and Insertions/Deletions in Maize Expressed Sequence Tag Data. Plant Physiol. 132(1): 84-91

Huang X., Madan A. (1999) CAP3: a DNA sequence assembly program. Genome Res.
9: 868-77

Kota R., Rudd S., Facius A., Kolesov G., Thiel T., Zhang H., Stein N., Mayer K., Graner A. (2003) Snipping polymorphisms from large EST collections in barley (Hordeum vulgare L.). Mol Genet Genomics 270(1): 24-33

Kulikova T., Aldebert P., Althorpe N., Baker W., Bates K., Browne P., van den Broek A., Cochrane G., Duggan K., Eberhardt R., Faruque N., Garcia-Pastor M., Harte N., Kanz C., Leinonen R., Lin Q., Lombard V., Lopez R., Mancuso R., McHale M., Nardone F., Silventoinen V., Stoehr P., Stoesser G., Tuli M.A., Tzouvara K., Vaughan R., Wu D., Zhu W., Apweiler R. (2004) The EMBL Nucleotide Sequence Database. Nucleic Acids Res 32

Pertea G., Huang X., Liang F., Antonescu V., Sultana R., Karamycheva S., Lee Y., White J., Cheung F., Parvizi B., Tsai J., Quackenbush J. (2003) TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics 19(5): 651-2

Picoult-Newberg L., Ideker T.E., Pohl M.G., Taylor S.L., Donaldson M.A., Nickerson D.A., Boyce-Jacino M. (1999) Mining SNPs From EST Databases. Genome Res. 9(2): 167-74

Rudd S. (2003) Expressed Sequence tags: alternative or complement to whole genome sequencing? Trends Plant Sci. 8(7): 321-29

Torjek O., Berger D., Meyer R.C., Mussig C., Schmid K.J., Rosleff Sorensen T., Weisshaar B., Mitchell-Olds T., Altmann T. (2003) Establishment of a high-efficiency SNP-based framework marker set for Arabidopsis. Plant J. 36(1): 122-40

Vincent P.L.D., Coe Jr E.H., Polacco M.L. (2003) Zea Mays ontology – a database of international terms. Trends Plant Sci. 8(11): 517-20

Ware D.H., Jaiswal P., Ni J., Yap I.V., Pan X., Clark K.Y., Teytelman L., Schmidt S.C., Zhao W., Chang K., Cartinhour S., Stein L.D., McCouch S.R. (2002) Gramene, a Tool for Grass Genomics. Plant Physiol. 130(4): 1606-13

Zhang Z., Schwartz S., Wagner L., Miller W. (2000) A greedy algorithm for aligning DNA sequence. J. Comput. Biol.
7: 203-14

Acknowledgements

I would like to thank the following people for their many ideas and contributions to this work, and for the time they took to answer my many questions:

Centre for Molecular and Biomolecular Informatics, Nijmegen
    Gert Vriend             supervision
    Maarten Hekkelman       databases, support of open-source software ;)
    Jos Boekhorst           Python programming language
    Sander Nabuurs          Python programming language
    Marc van Driel          data mining, Unix tutorial and help

KeyGene, Wageningen
    Antoine Jansen          general database ideas
    Pieter Vos              general database ideas

Plant Research International, Wageningen
    Paulien Adamse          ontologies
    Sander van der Krol     contigs
    Sander Peters           general database ideas

Wageningen University
    Peter Schaap            thesis coordinator

APPENDICES

Appendix 1: Overview of plant EST databases on the internet

NCBI dbEST (http://www.ncbi.nlm.nih.gov/dbEST)
EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/)
DDBJ (http://www.ddbj.nig.ac.jp/)
    General public databases of EST sequences

MIPS Sputnik (http://mips.gsf.de/proj/sputnik/)
    ‘A comprehensive resource for the functional annotation of clustered plant ESTs’

PlantGDB (http://www.plantgdb.org/)
    Plant ESTs and GSSs

GPiDB (http://genoplante-info.infobiogen.fr/Databases/GPDB/)
    ESTs and contigs from several plant species

International Triticeae EST Cooperative (http://wheat.pw.usda.gov/genome/)
    ESTs of Triticeae species

Solanaceae Genomics Network (http://www.sgn.cornell.edu/index.html)
    ESTs of Solanaceae species

GrainGenes wEST (http://wheat.pw.usda.gov/wEST/)
    Wheat EST, contig, and mapping information

KOMUGI (http://www.shigen.nig.ac.jp/wheat/komugi/top/top.jsp)
    Wheat nucleotide sequences and information

MaizeGDB (http://www.maizegdb.org/)
    Maize nucleotide sequences and information

CUGI (http://www.genome.clemson.edu/)
    ESTs of cotton, barley, peach, and almond

AGI (http://www.genome.arizona.edu/)
    ESTs of rice, barley, and cotton

ESTarray (http://www.estarray.org/)
    Rice and rice blast fungus ESTs

ESTDB (http://estdb.biology.ucla.edu/)
    ESTs from bean (Phaseolus coccineus) and Petunia

Kazusa (http://www.kazusa.or.jp/)
    ESTs from Chlamydomonas reinhardtii, Lotus japonicus, Arabidopsis thaliana, and Porphyra yezoensis

Appendix 2: Entities in the plantEST database

LEGEND
TABLE NAME
    primary key
    foreign key
    indexed attribute
    redundant attribute     Description of this attribute

SPECIES
    species_no
    species_name            Name of the species

EST_LIB
    est_lib_no
    species_no
    unilib_no               Reference to the UniLib database
    seq_count               Number of sequences in this EST library
    avg_seq_length          Average length of all sequences in this EST library
    clone_lib               Name of the EST library
    sub_species             Sub-species from which the ESTs were extracted
    strain                  Strain from which the ESTs were extracted
    variety                 Variety from which the ESTs were extracted
    cultivar                Cultivar from which the ESTs were extracted
    cell_type               Cell type from which the ESTs were extracted
    cell_line               Cell line from which the ESTs were extracted
    sex                     Sex of the organism from which the EST library was created
    lab_host                Organism in which the EST sequences were cloned
    vector                  Vector in which the EST sequences were cloned
    r_site1                 Restriction site used to digest the vector and insert the EST
    r_site2                 Restriction site used to digest the vector and insert the EST
    treatment               Treatment of the plants from which the ESTs were extracted
    note                    Any other comments

DEV_STAGE
    devstage_no
    dev_stage               Developmental stage of the plants

ELB_DEV
    est_lib_no
    devstage_no

TISSUE
    tissue_no
    tissue_type             Tissue from which the ESTs were extracted

ELB_TIS
    est_lib_no
    tissue_no

TIS_ONT
    tis_ont_no
    goid                    Gramene Ontology Identifier number
    tissue_ontology         Tissue ontology term
    tissue_definition       Definition of the tissue ontology term

ELB_TIO
    est_lib_no
    tis_ont_no

DB_XREF
    db_xref_no
    est_no
    database_name           Name of the database to which the cross-reference is made
    primary_ID              Primary identifier of the cross-referenced EST in this database
    secondary_ID            Secondary identifier

SQ_MASK
    sq_mask_no
    est_no
    mask_start              First nucleotide of the sequence mask
    mask_end                Last nucleotide of the sequence mask

EST
    est_no
    est_lib_no
    ID_entryname            Primary NCBI accession number
    ID_dataclass            Data class from the EMBL database
    ID_moleculetype         Molecule type from the EMBL database
    ID_division             Division from the EMBL database
    sequence_version        Version of the nucleotide sequence
    entry_date              Date the EST record was entered into the plantEST database
    DT_create_date          Date the EST record was entered into the EMBL database
    DT_update_date          Date the EST record was last updated in the EMBL database
    DT_update_version       Version of the EST record from the EMBL database
    description             Description of the EST
    comments                General comments, can contain any kind of data
    clone                   Name of the clone in which the EST was cloned
    organelle               Organelle from which the EST was derived
    SQ_sequencelength       Length of the EST sequence
    SQ_count_A              Number of A nucleotides in the EST sequence
    SQ_count_C              Number of C nucleotides in the EST sequence
    SQ_count_G              Number of G nucleotides in the EST sequence
    SQ_count_T              Number of T nucleotides in the EST sequence
    SQ_count_other          Number of non-ACGT nucleotides in the EST sequence
    SQ_sequence             EST nucleotide sequence
    checksum                MD5 checksum of the nucleotide sequence

EST_LIT
    est_no
    lit_ref_no

LIT_REF
    lit_ref_no
    medline_no              Identifier of the literature reference in the MedLine database
    pubmed_no               Identifier of the literature reference in the PubMed database
    lit_group               Group that created the article
    lit_title               Title of the article
    lit_location            Location where the article was originally published
    lit_comments            Comments on the literature reference

LIT_ATR
    lit_atr_no
    lit_ref_no
    author_no
    lit_atr_pos             Position of the author in the list of authors for this article

AUTHOR
    author_no
    author_name             Name of the author

EST_CNT
    est_cnt_no
    est_no
    contig_no
    padded_sequence         Modified sequence of the EST as it is aligned to the consensus
    orientation             Orientation of the EST sequence in the contig
    consensus_start         Nucleotide position on the consensus where this EST starts
    align_start             First nucleotide of the EST that matches the consensus
    align_end               Last nucleotide of the EST that matches the consensus

CONTIG
    contig_no
    build_no
    consensus_sequence      Consensus sequence of the contig
    contig_length           Length of the consensus sequence
    contig_seq_count        Number of sequences in the contig
    contig_avg_seq_length   Average length of the sequences in the contig
    contig_snp_count        Number of SNPs in the contig

CNT_BLD
    build_no
    build_name              Name of the contig building project
    build_date              Date of the contig building project
    build_description       Description of the contig building project
    program_name            Name of the program used to build the contigs
    program_version         Version of the program used to build the contigs
    program_parameters      Parameters used during contig building
    avg_contig_length       Average length of the contigs in this project
    snps_detected           Status of SNP detection within this contig build

PMORPH
    pmorph_no
    est_cnt_no
    pmorph_startbase        First base of the polymorphism
    pmorph_endbase          Last base of the polymorphism

Appendix 3: updateSpecies.sql – updates to tables SPECIES and EST_LIB
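As a rough illustration of the two tables named in this appendix title, and not as a reproduction of updateSpecies.sql, the following Python sketch builds reduced versions of the SPECIES, EST_LIB, and EST entities from Appendix 2 in an in-memory SQLite database. It then refreshes the EST_LIB attributes seq_count and avg_seq_length from the EST rows, following their descriptions in Appendix 2. The use of SQLite, the column types, the reduced column set, the function names refresh_library_stats and build_demo, and the demo rows are all assumptions made for this example.

```python
import sqlite3

# Reduced, hypothetical SQLite rendering of three Appendix 2 entities.
# Column names follow Appendix 2; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE SPECIES (
    species_no   INTEGER PRIMARY KEY,
    species_name TEXT NOT NULL
);
CREATE TABLE EST_LIB (
    est_lib_no     INTEGER PRIMARY KEY,
    species_no     INTEGER NOT NULL REFERENCES SPECIES(species_no),
    clone_lib      TEXT,
    seq_count      INTEGER,
    avg_seq_length REAL
);
CREATE TABLE EST (
    est_no            INTEGER PRIMARY KEY,
    est_lib_no        INTEGER NOT NULL REFERENCES EST_LIB(est_lib_no),
    SQ_sequencelength INTEGER,
    SQ_sequence       TEXT
);
"""


def refresh_library_stats(conn):
    """Recompute EST_LIB.seq_count and avg_seq_length from the EST rows."""
    conn.execute(
        """
        UPDATE EST_LIB SET
            seq_count = (SELECT COUNT(*) FROM EST
                         WHERE EST.est_lib_no = EST_LIB.est_lib_no),
            avg_seq_length = (SELECT AVG(SQ_sequencelength) FROM EST
                              WHERE EST.est_lib_no = EST_LIB.est_lib_no)
        """
    )
    conn.commit()


def build_demo():
    """Create the tables, load made-up demo rows, and refresh the statistics."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Demo data only: one species, one library, two short invented sequences.
    conn.execute("INSERT INTO SPECIES VALUES (1, 'Solanum tuberosum')")
    conn.execute(
        "INSERT INTO EST_LIB (est_lib_no, species_no, clone_lib) "
        "VALUES (1, 1, 'demo library')"
    )
    for est_no, seq in enumerate(["ACGTAC", "ACGT"], start=1):
        conn.execute(
            "INSERT INTO EST VALUES (?, 1, ?, ?)", (est_no, len(seq), seq)
        )
    refresh_library_stats(conn)
    return conn


if __name__ == "__main__":
    demo = build_demo()
    print(demo.execute(
        "SELECT est_lib_no, seq_count, avg_seq_length FROM EST_LIB"
    ).fetchall())
```

Calling build_demo() returns a connection in which the single demo library has both statistics filled in. Storing such derived values in EST_LIB duplicates information that is already present in EST, in exchange for not having to aggregate over the EST table on every query.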