Project Overview

Introduction

This document provides an overview of the project. The first sections summarize the background and purpose of the project; the later sections describe the products and technologies that will be used to implement it.

Purpose

The purpose of this project is to use Java Servlets, JSP, XML, and XSL technologies to develop an EST model database (ESTMD) system, a web-based, user-friendly relational database system. It will allow K-State biologists to search expressed sequences and related information.

Background

What are ESTs

Expressed Sequence Tags (ESTs) are partial sequences of randomly chosen cDNA, each obtained from a single DNA sequencing reaction. ESTs are used to identify transcribed regions in genomic sequence, to characterize patterns of gene expression in the tissue that was the source of the cDNA, and as markers for genetic mapping. ESTs are an important resource for gene identification, genome annotation, and comparative genomics.

Laboratories are producing more and more EST sequences. Processing ESTs typically involves cleaning the original (raw) sequences by removing low-quality, vector, and adaptor sequence; assembling the cleaned sequences into unique sequences; and annotating the unique sequences and assigning them functions. Keeping track of and managing this information is a critical issue for many labs. Figure 1 shows the data processing steps of ESTMD.

[Figure 1: Trace Files -> Phred -> Raw (clone) Sequences -> Cross_match & PERL program -> Cleaned (EST) Sequences -> CAP3 -> Assembled (Unique) Sequences -> Blast -> Unique Sequences with hit]
Figure 1. EST data processing

Gene Ontology

Gene Ontology is a set of controlled vocabularies used to describe biological features within a specified domain of biological knowledge. The Gene Ontology describes the molecular functions, biological processes, and cellular components of genes. Biologists are interested in knowing the functions of these sequences.
Searching the Gene Ontology (GO) is an efficient way for them to find the functions of these sequences and their corresponding genes.

Pathway

A pathway is the sequence of enzyme-catalyzed reactions by which an energy-yielding substance is utilized by protoplasm.

Goals

In this project, an EST model database system (ESTMD) is being developed to meet the requirements of the Department of Biology, Kansas State University. Users want EST information for various organisms to be stored and made accessible over the Internet as a basis for further research. The main data have already been processed by biological software and programs. The database will maintain information about raw sequences, cleaned sequences, assembled sequences, and other related information. ESTMD will provide comprehensive search tools for mining raw, cleaned, and assembled EST sequences, Gene Ontology, and pathway information. It will also support data submission and sequence download.

The main goals of this project are:
- To help faculty and researchers access EST information efficiently and make further decisions.
- To develop a low-cost, reliable, and user-friendly web application.
- To simplify EST database development and management.

Technologies

The ESTMD system is a distributed, three-tier, web-based system (Figure 2).

The client tier, or presentation tier, is responsible for presenting data, receiving user events, and controlling the user interface. HTML with JavaScript will be used in the client tier.

The application-server tier, or business-logic tier, is responsible for recording and abstracting business processes in business objects. Java Servlets, JavaServer Pages (JSP), and JDBC will be used in this tier.

The data-server tier is responsible for data storage. MySQL 4.0 is chosen as the database server.

In this structure the client tier is not in direct communication with the database.
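
The tier separation described here can be illustrated with a minimal sketch in plain Java. It is not the ESTMD implementation: the names SequenceStore, InMemorySequenceStore, SearchService, and ThreeTierSketch are hypothetical, the sample records are made up, and an in-memory map stands in for the MySQL database that the real system would reach through JDBC. The point is only that the client-side code holds a reference to the application tier and never to the data tier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Client-tier stand-in: it only knows about SearchService, never about the store.
class ThreeTierSketch {
    public static void main(String[] args) {
        SearchService service = new SearchService(new InMemorySequenceStore());
        System.out.println(service.search(" Kinase "));
    }
}

// Application-server tier: validates the request, then delegates to the data tier.
class SearchService {
    private final SequenceStore store;

    SearchService(SequenceStore store) {
        this.store = store;
    }

    List<String> search(String rawKeyword) {
        if (rawKeyword == null || rawKeyword.trim().isEmpty()) {
            throw new IllegalArgumentException("keyword must not be empty");
        }
        // Normalize input before it reaches the data tier.
        return store.findIdsByKeyword(rawKeyword.trim().toLowerCase(Locale.ROOT));
    }
}

// Data-server tier contract (hypothetical). A JDBC-backed class could implement
// the same interface without any change to the tiers above it.
interface SequenceStore {
    List<String> findIdsByKeyword(String keyword);
}

// In-memory stand-in for the database so the sketch runs without a server.
class InMemorySequenceStore implements SequenceStore {
    private final Map<String, String> descriptionsById = new LinkedHashMap<>();

    InMemorySequenceStore() {
        // Made-up sample records, for illustration only.
        descriptionsById.put("EST0001", "putative kinase");
        descriptionsById.put("EST0002", "heat shock protein");
        descriptionsById.put("EST0003", "serine kinase homolog");
    }

    @Override
    public List<String> findIdsByKeyword(String keyword) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : descriptionsById.entrySet()) {
            if (entry.getValue().contains(keyword)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }
}
```

Because SearchService depends only on the SequenceStore interface, the storage behind it could be swapped (for example, for a JDBC-backed implementation) without touching the client-side code, which is the property the three-tier structure is meant to provide.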
To send or receive data, the client tier must communicate with the application-server tier, which in turn communicates with the data-server tier.

Figure 2. Three-tier Architecture

Qualities

Efficiency

With traditional CGI, a new process is started for each HTTP request. With servlets, the Java virtual machine stays running and handles each request with a lightweight Java thread. If there are N requests to the same CGI program, the code for the CGI program is loaded into memory N times; with servlets, only a single copy of the servlet class is loaded. This approach reduces server memory requirements and saves time by instantiating fewer objects. Servlets remain in memory even after they complete a response, so it is straightforward to store arbitrarily complex data between client requests.

Platform-independence

Servlets are the Java platform technology of choice for extending and enhancing web servers. They provide a component-based, platform-independent method for building web-based applications.

Convenience

A web interface makes the system easy to use. Users only need to know how to use a web browser and do not need to download, install, or learn any special software.

Reliability

HTML with JavaScript will validate user input on the client side. Exceptions and errors on the server side will be handled by Java exception handling.

Security

In traditional CGI, programs are often executed by operating system shells and processed by languages that do not automatically check array or string bounds. Servlets suffer from neither of these problems. Even if a servlet executes a system call to invoke a program on the local operating system, it does not use a shell to do so, and array bounds checking and other memory protection features are a central part of the Java programming language.

The three-tier structure also helps keep the data safe. The client tier is not in direct communication with the database.
To send or receive data, the client tier must communicate with the application-server tier, which in turn communicates with the data-server tier.

Risks

- The requirements may change continually.
- Some biology knowledge is needed for the project.
- Some new technologies in computer science need to be understood.

Constraints

The main constraint of the project is the MySQL 4.0 database. MySQL is faster than Oracle on small to medium-sized databases and is easy to administer, but it is less powerful on complex queries. Another constraint is that some data are not yet available: some related databases need to be downloaded, and some data need to be processed.

References

Marty Hall, Core Servlets and JavaServer Pages, Prentice Hall PTR, 2000.
CORBA web page, http://www.corba.ch/e/3tier.html
ESTAP, http://www.vbi.vt.edu/~estap