Project Overview - Kansas State University

Project Overview
Introduction
This document provides an overview of the project. The first few sections summarize the
background and purpose of the project; the later sections describe the products and
technologies that will be used to implement it.
Purpose
The purpose of this project is to use Java Servlets, JSP, XML and XSL technologies to
develop an EST model database (ESTMD) system, a web-based, user-friendly
relational database system. It will allow K-State biologists to search expressed sequences
and related information.
Background
What are ESTs
ESTs (Expressed Sequence Tags) are partial sequences of randomly chosen cDNA clones,
each obtained from a single DNA sequencing reaction. ESTs are used to identify
transcribed regions in genomic sequence, to characterize patterns of gene expression in
the tissue that was the source of the cDNA, and as markers for genetic mapping. They are
an important resource for gene identification, genome annotation and comparative
genomics, and laboratories are producing more and more of them.
Typically, processing ESTs involves cleaning the original (raw) sequences by removing
low-quality, vector and adaptor sequence; assembling the cleaned sequences into unique
sequences; and annotating the unique sequences and assigning functions to them. Keeping
track of and managing this information is a critical issue for many labs.
Figure 1 shows the data processing steps of ESTMD.
[Figure 1: flow diagram. Trace Files → Phred → Raw (clone) Sequences → Cross_match &
PERL program → Cleaned (EST) Sequences → CAP3 → Assembled (Unique) Sequences → Blast →
Unique Sequences with hit]
Figure 1. ESTs data processes
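The vector-removal part of the cleaning step described above can be sketched in Java as follows. This is only an illustration and not the project's PERL program: it assumes that an earlier screening step has already replaced vector and adaptor bases with the letter X, and the class name EstCleaner, the method name, and the minimum-length cutoff are all hypothetical.

```java
final class EstCleaner {
    /**
     * Returns the longest run of bases that contains no masked ('X' or 'x')
     * position, or an empty string if that run is shorter than minLength.
     * Assumption: masked positions were marked by an earlier screening step.
     */
    static String longestUnmaskedRun(String masked, int minLength) {
        String best = "";
        int start = 0; // index where the current unmasked run begins
        for (int i = 0; i <= masked.length(); i++) {
            boolean atEnd = (i == masked.length());
            if (atEnd || Character.toUpperCase(masked.charAt(i)) == 'X') {
                // The current run ends just before position i.
                if (i - start > best.length()) {
                    best = masked.substring(start, i);
                }
                start = i + 1;
            }
        }
        return best.length() >= minLength ? best : "";
    }
}
```

For example, with a cutoff of 5 the masked read XXXXACGTACGTXXAC yields ACGTACGT, while a cutoff of 10 rejects the read and returns an empty string.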
Gene Ontology
Gene Ontology (GO) is a set of controlled vocabularies used to describe biological features
within a specified domain of biological knowledge. It describes the molecular functions,
biological processes and cellular components of gene products. Biologists are often
interested in the functions of the sequences they study, and searching Gene Ontology
is an efficient way for them to find the functions of these sequences and their
corresponding genes.
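A GO search of the kind described above can be sketched as a filter over annotation records. The following is a minimal sketch under stated assumptions, not the ESTMD schema: the class GoSearch, the Annotation record layout, and the idea of holding annotations in an in-memory list are all hypothetical, and a real system would query the database instead.

```java
import java.util.ArrayList;
import java.util.List;

final class GoSearch {
    /** The three Gene Ontology vocabularies named in the text. */
    enum Aspect { MOLECULAR_FUNCTION, BIOLOGICAL_PROCESS, CELLULAR_COMPONENT }

    /** One annotation linking a sequence to a GO term (hypothetical layout). */
    static final class Annotation {
        final String sequenceId;
        final String goId;
        final String termName;
        final Aspect aspect;

        Annotation(String sequenceId, String goId, String termName, Aspect aspect) {
            this.sequenceId = sequenceId;
            this.goId = goId;
            this.termName = termName;
            this.aspect = aspect;
        }
    }

    /** Returns the names of the terms of one aspect annotated to one sequence. */
    static List<String> termsFor(List<Annotation> annotations, String sequenceId, Aspect aspect) {
        List<String> names = new ArrayList<>();
        for (Annotation a : annotations) {
            if (a.sequenceId.equals(sequenceId) && a.aspect == aspect) {
                names.add(a.termName);
            }
        }
        return names;
    }
}
```

Keeping the aspect on each record lets one search answer "what does this sequence do", "what process is it part of" and "where does it act" separately.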
Pathway
A pathway is the sequence of enzyme-catalyzed reactions by which an energy-yielding
substance is utilized by protoplasm.
Goals
In this project, an EST model database system (ESTMD) is being developed to satisfy the
requirements of the Department of Biology, Kansas State University. Users want EST
information for various organisms to be stored and made accessible over the Internet as
the basis for further research. The main data has already been processed by biological
software and programs. The database will maintain information about raw sequences,
cleaned sequences, assembled sequences, and other related data. ESTMD will provide
comprehensive search tools for mining raw, cleaned and assembled EST sequences,
Gene Ontology and pathway information. It will also support data submission and
sequence download.
The main goals of this project are:
• To help faculty and researchers access EST information efficiently and make
further decisions.
• To develop a low-cost, reliable, and user-friendly web application.
• To simplify EST database development and management.
Technologies
The ESTMD system is a distributed, three-tier, web-based system (Figure 2).
• The client tier, or presentation tier, is responsible for presenting data, receiving
user events and controlling the user interface. HTML with JavaScript will be used in
the client tier.
• The application-server tier, or business-logic tier, is responsible for implementing
business processes as business objects. Java Servlets, JavaServer Pages (JSP) and
JDBC will be used in this tier.
• The data-server tier is responsible for data storage. MySQL 4.0 is chosen as the
database server.
In this structure the client tier is not in direct communication with the database. In order
to send or receive data it must communicate with the application-server tier which in turn
communicates with the data server.
Figure 2. Three-tier Architecture
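The separation between the tiers can be illustrated with a small plain-Java sketch. It makes several assumptions for the sake of a self-contained example: the names SequenceStore, InMemoryStore and SearchService are hypothetical, and an in-memory map stands in for the MySQL database that the real system would reach through JDBC. The point it shows is that the caller (the client tier) only ever talks to the service, and only the service holds a reference to the store.

```java
import java.util.HashMap;
import java.util.Map;

final class ThreeTierSketch {
    /** Data-server tier: the only component that touches storage. */
    interface SequenceStore {
        String findSequence(String id); // returns null when the id is unknown
    }

    /** In-memory stand-in for the database (assumption for illustration). */
    static final class InMemoryStore implements SequenceStore {
        private final Map<String, String> rows = new HashMap<>();

        void put(String id, String sequence) {
            rows.put(id, sequence);
        }

        @Override
        public String findSequence(String id) {
            return rows.get(id);
        }
    }

    /** Application-server tier: business logic between client and data. */
    static final class SearchService {
        private final SequenceStore store;

        SearchService(SequenceStore store) {
            this.store = store;
        }

        /** Handles a request from the client tier and returns text to display. */
        String handleSearch(String requestedId) {
            if (requestedId == null || requestedId.trim().isEmpty()) {
                return "Please enter a sequence ID.";
            }
            String id = requestedId.trim();
            String sequence = store.findSequence(id);
            return sequence == null ? "No sequence found for " + id : id + ": " + sequence;
        }
    }
}
```

Because the service depends only on the SequenceStore interface, the storage behind it could be swapped without changing the client-facing code.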
Qualities

Efficiency
With traditional CGI, a new process is started for each HTTP request. However,
with servlets, the Java virtual machine stays running and handles each request
with a lightweight Java thread. If there are N requests to the same CGI program,
the code for the CGI program is loaded into memory N times. With servlets,
however, only a single copy of the servlet class would be loaded. This approach
reduces server memory requirements and saves time by instantiating fewer
objects. Servlets remain in memory even after they complete a response, so it is
straightforward to store arbitrarily complex data between client requests.
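The threading model described above can be imitated in plain Java. The sketch below does not use the Servlet API; the Handler class is a hypothetical stand-in for a servlet, and the pool size of 4 is an arbitrary choice. It shows one handler object being created once, shared by many request threads, and keeping state (a request counter) between requests.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedHandlerSketch {
    /** Stand-in for a servlet: one instance serves every request. */
    static final class Handler {
        // State that survives between requests; atomic because threads share it.
        private final AtomicInteger served = new AtomicInteger();

        void handle() {
            served.incrementAndGet();
        }

        int servedCount() {
            return served.get();
        }
    }

    /** Runs the given number of simulated requests against a single handler. */
    static Handler serve(int requests) {
        Handler handler = new Handler(); // created once, not once per request
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < requests; i++) {
            pool.execute(handler::handle); // each request runs on a pooled thread
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("requests did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return handler;
    }
}
```

The shared counter is also a reminder of the cost of this model: any state kept in the shared object must be safe for concurrent access.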

Platform-independence
Servlets are the Java platform technology of choice for extending and enhancing
web servers. They provide a component-based, platform-independent method for
building web-based applications.

Convenience
Web interfaces make the system easy to use. Users only need to know how to use
a web browser and do not need to download, install, or learn any special
software.

Reliability
HTML with JavaScript will validate user input on the client side. Exceptions and
errors on the server side will be handled by Java exception handling.
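Server-side handling of bad input can be sketched as follows. The example assumes a hypothetical "page size" request parameter that arrives as a string; the class and method names are illustrative only. Malformed input is caught with Java exception handling and replaced by a default, so a bad request does not fail the whole page.

```java
final class RequestValidation {
    /**
     * Parses a user-supplied page size. Returns defaultSize when the value is
     * missing, malformed, or below 1, and caps valid values at maxSize.
     */
    static int parsePageSize(String raw, int defaultSize, int maxSize) {
        if (raw == null) {
            return defaultSize;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            if (value < 1) {
                return defaultSize;
            }
            return Math.min(value, maxSize);
        } catch (NumberFormatException e) {
            // Malformed input is expected from users; recover with the default.
            return defaultSize;
        }
    }
}
```

Checking on the server as well as in the browser matters because client-side JavaScript validation can be bypassed or disabled.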

Security
In traditional CGI, the programs are often executed by operating system shells,
and processed by languages that do not automatically check array or string
bounds. Servlets suffer from neither of these problems. Even if a servlet executes
a system call to invoke a program on the local operating system, it does not use a
shell to do so. And array bounds checking and other memory protection features
are a central part of the Java programming language.
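Both points can be illustrated with a short sketch. The program name passed in and the --input flag are hypothetical placeholders, not a real tool; the sketch only builds the command and does not run it. ProcessBuilder takes the command as a list in which each element is exactly one argument, so text such as "; rm -rf /" in a user-supplied value is never interpreted by a shell. The second method shows the runtime rejecting an out-of-range array index.

```java
import java.util.Arrays;
import java.util.List;

final class NoShellSketch {
    /** Builds (but does not start) a command whose arguments bypass any shell. */
    static ProcessBuilder buildCommand(String program, String userSuppliedFile) {
        // Each list element becomes one argument; nothing is parsed by a shell.
        List<String> command = Arrays.asList(program, "--input", userSuppliedFile);
        return new ProcessBuilder(command);
    }

    /** Returns true if writing past the end of an array is stopped by the runtime. */
    static boolean outOfBoundsIsCaught() {
        int[] data = new int[4];
        try {
            int index = data.length; // one past the last valid index
            data[index] = 1;         // throws instead of corrupting memory
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }
}
```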
The three-tier structure also helps keep the data safe. The client tier is not in direct
communication with the database. In order to send or receive data, it must
communicate with the application-server tier, which in turn communicates with
the data-server tier.
Risks
• The requirements may change continually.
• Some biology knowledge is needed for the project.
• Some new technologies in computer science need to be understood.
Constraints
The main constraint of the project is the MySQL 4.0 database. MySQL is faster than Oracle
on small to medium-sized databases and is easy to administer, but it is less powerful on
complex queries; for example, MySQL 4.0 does not support subqueries, views, or stored
procedures.
Another constraint is that some data are not yet available. Some related databases need
to be downloaded, and some data need to be processed.
References
Marty Hall, Core Servlets and JavaServer Pages, Prentice Hall PTR, 2000.
CORBA web page, http://www.corba.ch/e/3tier.html
ESTAP, http://www.vbi.vt.edu/~estap