Lab 1 - ODU Computer Science

SCORPION Product Description
SCORPION – Blue Team
Old Dominion University
CS411 – Janet Brunelle
Authors: Stanley Zheng
Last Modified: Sept 20, 2014
Version: 1
TABLE OF CONTENTS
1 INTRODUCTION
2 SCORPION PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 IDENTIFICATION OF CASE STUDY
4 SCORPION PRODUCT PROTOTYPE DESCRIPTION
4.1 Prototype Architecture (Hardware/Software)
4.2 Prototype Features and Capabilities
4.3 Prototype Development Challenges
LIST OF FIGURES
Figure 1: Website Hardware Components
Figure 2: Protein Sequence Validation Algorithm
Figure 3: Estimated Protein Structure Prediction Time Algorithm
Figure 4: Valid sequence rules
GLOSSARY
REFERENCES
1 INTRODUCTION
Bioinformatics has accelerated the scientific community's understanding of genetics
and shortened the time between the lab and the market for breakthrough medical advances.
Fundamentally, all living organisms carry DNA, which is composed of millions of base
pairs. DNA encodes proteins, which are chains of amino acids that are sometimes thousands of
residues long ("RCSB PDB - Histograms", n.d.). How these amino acids interact with each
other determines how a protein folds and which structures it forms. At the intersection
of DNA and computer algorithms are tools such as neural network models that statistically
predict how an amino acid sequence folds into its resulting protein structure. The data on which
these tools are built improves daily as laboratories culture and sequence terabytes of DNA into
protein databases. Old Dominion University professor Dr. Yaohang Li created the SCORPION
(SeCOndaRy structure PredictION) neural network in pursuit of one of the field's
most accurate protein prediction models.
SCORPION is an actively used neural network, trained on the solved amino acid
sequences within the RCSB Protein Data Bank (Research Collaboratory for Structural
Bioinformatics database). The current interface to the system is the public-facing
website hosted by Dr. Li, which uses web forms to submit large amino acid sequences. This is not
dissimilar to what many of Dr. Li’s competitors offer, but improving the interfaces through which
researchers access SCORPION will reward and grow the community that uses his tools.
Therefore, the proposal includes giving SCORPION a modern website and a RESTful
public web service API. The website will include features that aid researchers in submitting,
cataloging, and retrieving protein sequence submissions. The API will allow researchers to use
their own tools and platforms to interact with SCORPION programmatically over HTTP.
2 SCORPION PRODUCT DESCRIPTION
The system is composed of three parts: a website, a web API service, and the program
known as SCORPION. SCORPION is a single compiled program hosted on a server run by Old
Dominion University. The website is currently a means of passing input to SCORPION through a
web form and receiving the results by email once the job has finished. SCORPION itself is a neural
network, a combination of algorithms and data designed to predict the secondary
structure of a protein. These predictions are produced many times faster than using
traditional X-ray crystallography to observe how a sequence folds. A neural
network is a type of machine learning algorithm that uses a system of nodes designed to
predict the classification of a collection of data. Neural networks are trained with training sets, a
series of previously classified and processed amino acid sequences whose structures are
known. This allows scientists to test the accuracy of the neural network by measuring the trained
model against well-documented sets of proteins. SCORPION is periodically retrained to take
into account larger sets of known protein structures that have been added to the international
protein knowledge base.
Replacing the SCORPION website and adding an API will not affect any part of the
training system. The system is modular, so the stakeholder can implement or decouple
either of the proposed solutions independently.
2.1 KEY PRODUCT FEATURES AND CAPABILITIES
The website will offer a better user experience and be easier to use once features such
as user profiles are added. With profiles in place, researchers can track their progress, and
the stakeholder gains a better idea of who is using the system. Profiles also allow other core
features such as job tracking and analytics to be built. However, this service will remain wholly
optional. The login feature will ask for volunteered information, but the site can be used without a
profile.
Adding an application programming interface (API) implemented over HTTP, the protocol
that underlies the Web, opens the SCORPION platform to new users. Representational
State Transfer (REST) implies using the same common HTTP verbs that browsers use to carry out
data transactions. For example, when a user visits google.com, the browser submits a GET
request to google.com in the background. In the context of SCORPION, the RESTful API will
allow a researcher to write a script in the language of their choice that automatically uploads a
spreadsheet of sequences from a program on their own computer.
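As a minimal sketch of what such a researcher script might look like, the Python below builds (but does not send) a POST request for a single sequence. The base URL, the /jobs path, and the JSON field names are hypothetical placeholders chosen for illustration; this document does not define the API's routes or payload format.

```python
import json
import urllib.request

# Hypothetical base URL; the real SCORPION API address is not specified here.
API_BASE = "https://scorpion.example.edu/api"


def build_submission(sequence: str, email: str = "") -> urllib.request.Request:
    """Build a POST request that would submit one amino acid sequence."""
    payload = {"sequence": sequence}  # assumed field name
    if email:
        payload["email"] = email  # assumed optional contact field
    return urllib.request.Request(
        url=API_BASE + "/jobs",  # assumed resource path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_submission("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGT")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

A script could loop over the rows of a spreadsheet and call build_submission once per sequence, which is the kind of bulk upload the web form does not support.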
2.2 MAJOR COMPONENTS (HARDWARE/SOFTWARE)
As stated previously, the system is composed of three parts: a website, a web API
service, and the program known as SCORPION.
After the neural network is trained and the software packaged, it can be executed on any
Linux system that meets its requirements. To run efficiently, the SCORPION software requires a
supercomputer with multiple high-powered graphics processing units (GPUs) and a Linux
operating system runtime. Using multiple GPUs, the program will be able to take advantage of
threading. Threading will allow multiple jobs to execute simultaneously, so that several protein
sequences are processed at once.
The other services, the website and the API, can live on a separate server networked
with SCORPION. The website portion of SCORPION will be hosted on a web server, which
will also host the user login system and analytics platform. The API will be hosted on the
same server. One caveat is that SCORPION requires PSI-BLAST, a third-party
service that converts a submitted protein sequence into a Position-Specific Scoring Matrix
(PSSM) before it can be submitted to the neural network. All hardware components are
already in use with the current implementation of SCORPION, and additional server resources
have been provisioned.
Figure 1: Website Hardware Components
Any computer with a web browser and an Internet connection will allow the user to access
SCORPION. The website layout will comply with Section 508 standards, as required by NSF
grant stipulations. Google Analytics, a third-party service, will record user statistics such as
webpage hits and visitors’ IP addresses. Another third-party service that will be required is
OpenID. It removes the need for SCORPION to implement and host user credentials while still
allowing lookup by profile.
The API will be accessible from any language or software that supports HTTP. The software
will be written in PHP and will expose a RESTful interface. The API will run alongside the
website, as both use the same runtime.
Both the API and the website will support two algorithms, each of which pertains
to the protein sequence submission process. The data sanitization algorithm (Figure 2)
validates the input of a protein sequence submission. It assists in identifying and
normalizing sequences for PSI-BLAST before submission, including removing special, invalid,
and whitespace characters. A second algorithm, sequence time prediction (Figure 3), will be
run on each job to give an estimated time of completion. It is known from
experimentation that each character adds approximately 2.13 seconds of processing time. An
estimated prediction time will be assigned to each submitted sequence.
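The time estimate can be illustrated with a short Python sketch. It assumes the estimate is strictly linear in sequence length with no fixed startup cost, which is a simplification; the text only states the approximate per-character figure. The function names are hypothetical.

```python
# Approximate per-character cost reported from the team's timing experiment.
SECONDS_PER_CHARACTER = 2.13


def estimate_seconds(sequence: str) -> float:
    """Estimate processing time, assuming cost grows linearly with length."""
    return SECONDS_PER_CHARACTER * len(sequence)


def format_estimate(seconds: float) -> str:
    """Render an estimate as whole minutes and seconds for display."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes} min {secs} s"


# A 300-character sequence: 300 * 2.13 = 639 seconds.
print(format_estimate(estimate_seconds("A" * 300)))
```

Under this assumption a 300-character submission would be quoted at roughly ten and a half minutes.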
Figure 2: Protein Sequence Validation Algorithm
Figure 3: Estimated Protein Structure Prediction Time Algorithm
3 IDENTIFICATION OF CASE STUDY
Dr. Yaohang Li and his team are constantly improving the algorithms and data available
to the neural networks in order to produce even better protein prediction results. The SCORPION
neural network was a project of his PhD student at Old Dominion University, Ashraf S. Yaseen, for
almost two years. Predecessors to SCORPION have used the same system architecture to
interface with their neural networks. Therefore, the new website and API can also serve as
the interface to future bioinformatics neural network prediction models. The API and website
improve the way the general public can access bioinformatics tools created at Old Dominion
University, which will encourage more usage by outside parties.
An API provides an open platform through which other researchers and companies can
integrate these tools into their preexisting systems. The new SCORPION system will be usable by
small labs or pharmaceutical companies that are looking to verify their own prediction results.
This makes the tool flexible and reduces the friction of accommodating new software. In the
future, public APIs will likely power a majority of traffic on the Internet (Jacobson).
4 SCORPION PRODUCT PROTOTYPE DESCRIPTION
The prototype of the SCORPION web services will demonstrate a better user
experience for users of the SCORPION platform. It will enable researchers to submit sequences
and retrieve predicted results programmatically. Reducing the time and complexity of integrating
with SCORPION makes researchers more likely to begin using SCORPION and to continue
using it.
4.1 PROTOTYPE ARCHITECTURE (HARDWARE/SOFTWARE)
The additional services for SCORPION will be developed in PHP. Parts of the current
system are written in PHP, which allows the existing codebase to be reused and extended.
PHP is a supported runtime that aligns with the other software available to the stakeholders. To
fulfill the requirement of storing user ID sessions for the website, a persistent data source is
required. SQLite was chosen as a lightweight database that offers familiar SQL syntax
without the overhead of a database server. It does not require setting up or configuring a
server, and the whole database is stored as a single portable file.
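To show how little setup a single-file database needs, here is a minimal sketch of a session store. It is written in Python with the standard sqlite3 module purely for illustration, since the prototype itself is planned in PHP; the table layout, column names, and function names are assumptions, not the project's schema.

```python
import sqlite3
import time


def open_session_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database file and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        " session_id TEXT PRIMARY KEY,"
        " openid_identity TEXT NOT NULL,"  # assumed column: identity URL from OpenID
        " created_at REAL NOT NULL)"
    )
    return conn


def save_session(conn: sqlite3.Connection, session_id: str, identity: str) -> None:
    """Insert or overwrite the session row for this session ID."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?)",
            (session_id, identity, time.time()),
        )


def find_identity(conn: sqlite3.Connection, session_id: str):
    """Return the stored identity for a session, or None if it is unknown."""
    row = conn.execute(
        "SELECT openid_identity FROM sessions WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    return row[0] if row else None


store = open_session_store()  # pass a file path to persist between runs
save_session(store, "abc123", "https://id.example.org/researcher")
print(find_identity(store, "abc123"))
```

Note that only a session ID and an identity reference are stored, never a password, which matches the privacy goal of delegating credentials to OpenID.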
The current website will be replaced with a more modern design that still uses
HTML and JavaScript. The website will use client-side JavaScript to validate protein sequence
submissions before they are sent to the server. Users who authenticate with the service will be
able to discover and retrieve their submissions in one interface. The system
currently relies on email as the sole method of retrieving and cataloging predicted
submissions.
To tie the services together, the frontend website will interact with SCORPION through
the API. This demonstrates that the API is platform agnostic and able to perform
all operations available through the web interface.
The API will open Dr. Li’s platform to more systems and will also
improve reliability by standardizing the way sequences are submitted. Tracking jobs in the system
ensures that all submitted sequences are eventually answered. Currently, SCORPION jobs remain
unfulfilled if the server is overloaded and resources are scarce.
4.2 PROTOTYPE FEATURES AND CAPABILITIES
The prototype will demonstrate a two-part solution to improve SCORPION: a
redesigned website, powered by the API, through which users interact with SCORPION.
The user-facing side of SCORPION will be a web application that replaces the current
design. The website will meet Section 508 web accessibility standards to comply with the federal
funding requirements of the National Science Foundation. The redesign will offer a
user experience similar to that of the original website, and a mirror of the original site will
remain available if the user prefers it.
One feature of the redesign will be client-side validation and data sanitization tools.
The validation aims to ensure that the input sequence is normalized to FASTA specifications.
The prototype will apply rule-based highlighting (Figure 4) as the sequence is entered, which
will aid the user in correcting sequences. Researchers commonly enter sequences by copying
and pasting into the text area field. Optional sanitization buttons will be available to clean up
large pasted sequences that may contain multiple inconsistencies and formatting issues.
Figure 4: Valid sequence rules
1. No special characters
2. No numerics
3. All alphabetical characters excluding (BJOUX)
4. Alphabet case insensitive
5. Minimum sequence length 40 characters
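The rules in Figure 4 can be sketched as a pair of small functions. The prototype plans to do this in client-side JavaScript; Python is used here only for illustration, and the function names and message wording are hypothetical. Whether excluded letters should be silently dropped by the sanitizer or only flagged is a design choice this document leaves open; the sketch drops them, following the statement that sanitization removes special, invalid, and whitespace characters.

```python
import string
from typing import List

EXCLUDED_LETTERS = set("BJOUX")  # rule 3
ALLOWED_LETTERS = set(string.ascii_uppercase) - EXCLUDED_LETTERS
MIN_LENGTH = 40  # rule 5


def validate_sequence(sequence: str) -> List[str]:
    """Return the rules a sequence breaks; an empty list means it is valid."""
    upper = sequence.upper()  # rule 4: case insensitive
    problems = []
    if any(ch in string.digits for ch in upper):
        problems.append("contains numerics")
    if any(ch in EXCLUDED_LETTERS for ch in upper):
        problems.append("contains excluded letters (B, J, O, U, X)")
    if any(ch not in string.ascii_uppercase and ch not in string.digits
           for ch in upper):
        problems.append("contains special or whitespace characters")
    if len(sequence) < MIN_LENGTH:
        problems.append("shorter than 40 characters")
    return problems


def sanitize_sequence(raw: str) -> str:
    """Uppercase the input and keep only the allowed letters."""
    return "".join(ch for ch in raw.upper() if ch in ALLOWED_LETTERS)


pasted = " acdefghiklmnpqrstvwy 12\nACDEFGHIKLMNPQRSTVWY-"
print(validate_sequence(pasted))
print(validate_sequence(sanitize_sequence(pasted)))
```

The list returned by validate_sequence could drive the highlighting, while sanitize_sequence corresponds to the optional cleanup button for pasted input.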
The prototype will demonstrate logged-in user capabilities using OpenID. Users who
opt in and log in will be able to manage their data on the site and be informed about changes to
the site. The login system will also allow the system stakeholder to measure usage of
the site.
The prototype will embed Google Analytics in its pages. Google Analytics supports
GeoIP and can determine where a visitor originates. In addition, it provides tracking data,
including the duration of a visit and how the user uses the site. Google Analytics gives the
stakeholder a way to review activity and all of this associated information within the application.
Finally, the prototype will offer an API following RESTful principles.
Programmatically, the user can submit requests and perform all actions available on the website.
Users who submit sequences with sufficient parameters will be returned a token URL with which
to find their results later.
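The token URL flow might work along the following lines. This is an in-memory Python sketch under stated assumptions: the base URL, the status values, and the class and method names are all hypothetical, and a real service would persist jobs rather than hold them in a dictionary.

```python
import secrets

# Hypothetical base for result URLs; not an address defined by this document.
BASE_URL = "https://scorpion.example.edu/api/jobs/"


class JobRegistry:
    """Tracks submitted sequences so that none is lost before it is answered."""

    def __init__(self) -> None:
        self._jobs = {}

    def submit(self, sequence: str) -> str:
        """Record a job and return the token URL the caller polls later."""
        token = secrets.token_urlsafe(16)  # unguessable, URL-safe identifier
        self._jobs[token] = {"sequence": sequence, "status": "queued", "result": None}
        return BASE_URL + token

    def complete(self, token: str, result: str) -> None:
        """Attach the prediction once SCORPION has finished the job."""
        self._jobs[token].update(status="done", result=result)

    def lookup(self, token_url: str) -> dict:
        """Return the status and result for a token URL."""
        token = token_url.rsplit("/", 1)[-1]
        job = self._jobs.get(token)
        if job is None:
            return {"status": "unknown", "result": None}
        return {"status": job["status"], "result": job["result"]}


registry = JobRegistry()
url = registry.submit("ACDEFGHIKLMNPQRSTVWYACDEFGHIKLMNPQRSTVWY")
print(registry.lookup(url)["status"])
```

Because every submission gets a tracked record, a job that has not been answered remains visible as queued instead of disappearing when the server is overloaded.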
Since this is a multiple-part solution, there are system and subsystem risks. At the system
level, the largest risks are failure of adoption and failure of implementation. To get the most
benefit out of the system, Dr. Li would need to adopt both parts of the solution. Throughout
development, a tenet of the project is to maintain backwards compatibility at all times.
The client-facing systems will maintain a similar user interface and the ability to use the
original design. The solution is built on proven technologies, so support and documentation are
plentiful. Modularity allows piecewise integration of the solution, and test coverage focused on
regression testing helps ensure a successful system transition. The server code currently in
production is not documented and would benefit from a refactor to reduce possible future bugs and
downtime.
4.3 PROTOTYPE DEVELOPMENT CHALLENGES (JASMINE JONES)
Challenges present themselves when integrating new features with a preexisting platform.
SCORPION is a well-used existing system, and new functionality should not break current use
cases. This restricts solutions to those that are modular and backwards compatible with the
existing architecture. The stakeholder, Dr. Li, wishes to maintain the uptime of SCORPION for all
current users. Integrating with the current software to add new
features is risky when the original developer is not readily available to answer questions.
Addressing the challenges starts with recognizing that the solution has two phases: implementing
a public API and replacing the current web interface. With the SCORPION website, the challenge
lies in privacy concerns and user acceptance.
The API is a new feature that streamlines interfacing with SCORPION. The fact that the desired
data is already present in the current system confirms that building a public API is feasible. The
API will provide insight into the jobs processed by SCORPION and help make sure that submitted
sequences are not lost. With regard to system uptime, Dr. Li’s system currently has no recourse
if the service becomes unavailable. Improving the software will increase stability and
reliability during high-usage periods. The API will reinforce this by creating a standard
pipeline for accessing SCORPION.
The stakeholder is concerned about collecting any personal data on the system. The
suggestion to implement OpenID offers the advantage of profiles without hosting any private
information. No passwords will be stored on Old Dominion University servers, and as an extra
safeguard OpenSSL will be used to encrypt user information. Overall, this system is opt-in, and
users are not required to log in to submit sequences to SCORPION.
User satisfaction and acceptance are addressed by easing adoption of the new interface. The
design will keep a layout similar to its predecessor's, with an option to revert. To assess
acceptance, analytics will be added to all pages so that stakeholders have a better understanding
of how the site is being used.
GLOSSARY
Amino Acids/Residues: The building blocks of proteins
API: Application Programming Interface (an abstract way for services to communicate)
Cross-validation Training: The process of dividing training data into k mutually exclusive subsets
(folds), of roughly equal size where some subsets are used for training, validating, and testing. The
process is repeated k times.
Data cleansing: The process of removing non-representative instances from the data set.
Dunbrack Lab: Part of the Fox Chase Cancer Research Center. Recognized for normalizing data
from the RCSB
ETL: Extract, Transform and Load; refers to the manipulation of data
FASTA: Format widely adopted in bioinformatics to make it easier to manipulate and parse
sequences
GeoIP: Uses a lookup table of Internet Protocol addresses with known municipalities and providers
to match IP origin
GUI: Graphical User Interface, the visual client side facing software
NSF: National Science Foundation
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool used for deriving the
PSSM
PSSM: Position-Specific Scoring Matrix which includes information about evolutionary relatives of
the original protein sequence
RCSB Protein Data Bank: Research Collaboratory for Structural Bioinformatics database. The
database holds experimentally determined structures of proteins and other biological
macromolecules.
REST: Representational State Transfer. A REST API is a set of operations that can be invoked by
means of any of the four HTTP verbs (GET, POST, PUT, DELETE), using the actual URI as the
parameter for the operation.
SCORPION: SeCOndaRy structure PredictION
STING: Streamlined Training In Neural-network GUI
Training set: Set of instances from the problem domain used to train the algorithm
508 Compliance: Adhering to guidelines established to make website content equally accessible to
people with disabilities
REFERENCES
Biological Macromolecular Resource. (n.d.). RCSB Protein Data Bank. Retrieved Feb. 20, 2014, from
http://www.rcsb.org/pdb/home/home.do
Blue Team. (n.d.). SCORPION Protein Prediction Timed Experiment. Retrieved February 11, 2014,
from www.cs.odu.edu/~410blue/CS410SCORPIONProteinPredictionTimeExperiment.xlsx
Cancer Research Funding - National Cancer Institute. (2013, August 23). Cancer Research Funding -
National Cancer Institute. Retrieved May 8, 2014, from
http://www.cancer.gov/cancertopics/factsheet/NCI/research-funding
Freitas, R. (1998, January 1). Nanomedicine. Chapter 3 page 1. Retrieved May 8, 2014, from
http://www.foresight.org/Nanomedicine/Ch03_1.html
Jacobson, Daniel. "1." APIs: A Strategy Guide. Sebastopol, CA: O'Reilly, 2012. N. Print.
Murphy, S. (2013, May 8). Deaths: Final Data for 2010. Retrieved May 8, 2014, from
http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_04.pdf
RCSB PDB - Histograms. (n.d.). RCSB PDB - Histograms. Retrieved May 8, 2014, from
http://www.rcsb.org/pdb/statistics/histogram.do?mdcat=mvStructure&mditem=residueCount&name=Residue%20Count
Section 508. (n.d.). United States Department of Health and Human Services. Retrieved March 15,
2014, from http://www.hhs.gov/web/508/index.html
Section 508 Of The Rehabilitation Act. (n.d.). Section 508 Home. Retrieved March 15, 2014, from
http://www.section508.gov/Section-508-Of-The-Rehabilitation-Act
Yaseen, A., & Li, Y. Context-based Features Enhance Protein Secondary Structure Prediction
Accuracy.