Lab 1 – STING Product Description - ODU Computer Science

Running head: Lab 1 – STING Product Description
Lab 1 – STING Product Description
Team Blue
Jasmine Jones
CS411W
Professor Janet Brunelle
October 16, 2014
Version: 2.0
TABLE OF CONTENTS
1 INTRODUCTION
2 STING PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 IDENTIFICATION OF CASE STUDY
4 STING PRODUCT PROTOTYPE DESCRIPTION
4.1 Prototype Architecture
4.2 Prototype Features and Capabilities
4.3 Prototype Development Challenges
GLOSSARY
REFERENCES
LIST OF FIGURES
Figure 1: SCORPION Hardware Components
Figure 2: Protein Sequence Sanitation Algorithm
Figure 3: Estimated Prediction Time Algorithm
Figure 4: Prototype Hardware and Software Components
Figure 5: Sitemap of Prototype Login
LIST OF TABLES
Table 1: Comparison Between Real-World-Product and Prototype
1 INTRODUCTION
Cancer, Alzheimer's, Parkinson's, ALS, and type 2 diabetes are just five of the more than
three hundred diseases which result from improper protein structures in the body (Northwestern
University). In 2010, cancer alone was the second-leading cause of death in the United States, taking
the lives of 576,691 people (Heron). Proteins, which are in every cell of the human body, play a
vital role in carrying out almost all of the body’s functions (What is protein). Some of these
functions include breaking down food for muscle support, sending signals through the brain to
control the body, and transporting nutrients through the blood (What is protein). When a protein
has an improper structure, it can lead to cellular dysfunction and cellular death (Northwestern
University).
Proteins are formed from a string of amino acids. These amino acids fold together to
create a protein structure. To understand the amino acid folding process more clearly, the following
simplified analogy can be used. Imagine a string of yarn. Imagine holding the string of yarn by one
end and letting it dangle down in a straight line. Envision that the yarn has many small magnets
which run down its length. Slowly lower the string onto a table, allowing it to coil into a shape. The
magnets will cling together and a structure will be formed. This structure is the protein, and the
magnets are the amino acids. The string of amino acids is referred to as the protein’s “primary
structure,” and the way in which the primary structure folds is called the protein’s “secondary
structure.” An identical amino acid sequence will always fold into the same secondary structure
(What is protein).
A protein’s secondary structure dictates what role it will play in the body (What is protein).
The structure can be thought of as a “key” which can only carry out a certain function if it is the
right shape to fit in the “lock” (What is protein). Predicting a protein’s secondary structure given its
primary structure is an important goal for curing protein-related diseases because it will allow
scientists to create new proteins which they can then use to combat disease-related proteins
(What is protein). Currently, the
most accurate protein secondary structure prediction service available is SCORPION at Old
Dominion University (ODU).
While SCORPION encompasses the most accurate protein prediction software available,
SCORPION’s website (SCORPION’s graphical user interface) lacks the professional design and user
features to match its high quality of service. Aesthetic appeal is critical; a study by Kent State
University found that in just 3.42 seconds, participants had judged a website’s credibility based on
aesthetic appeal (Robins). Another study by Northumbria University found that 94 percent of
participants mistrusted and rejected health-related websites based on their design factors
(Sillence). SCORPION has the most accurate protein secondary structure prediction service
available, but it may be overlooked because of its poor aesthetic appeal and limited interface
functionality.
The solution to this problem is to provide SCORPION with a professional Web design,
competitive Web-functionality, and additional Web tools, including an application-programming
interface (API). An API is a defined set of operations through which two applications can
communicate. Adding an API to SCORPION would enable users to bypass SCORPION’s graphical user
interface and instead access SCORPION’s prediction software directly. Improving SCORPION’s
aesthetic appeal, functionality, and accessibility will allow SCORPION to contribute its full potential
to the fight against protein-related diseases. This new, enhanced product will be named STING.
2 STING PRODUCT DESCRIPTION
STING is a Web service which will build upon the preexisting Web service SCORPION,
which provides protein secondary structure prediction. To utilize
SCORPION, a user visits the SCORPION website and submits a Web form with their email address
and a string of alphabetical characters. Each alphabetical character represents an amino acid;
amino acids are the building blocks of a protein. The string of alphabetical characters, representing
an amino acid sequence, is the input which is used by SCORPION to predict the protein’s secondary
structure. Once SCORPION has predicted the structure, a Web page which displays the results is
created and stored on SCORPION’s Web server. An email containing the URL of the results Web
page is then sent to the user. STING will enhance SCORPION’s preexisting service with Web tools, a
professional Web design, and competitive Web-features in order to improve SCORPION’s aesthetic
appeal, functionality, and accessibility.
Three Web tools will be utilized by STING: a representational state transfer application-programming
interface (RESTful API), website visitor statistics, and an administrator login through which
an administrator can access visitor statistics. A RESTful API is an API that follows the architectural
constraints of REST; it will make SCORPION’s submission processing more efficient and make SCORPION’s
prediction software accessible to other applications. In practical terms, a RESTful API will enable a
third party, such as a different website or a mobile application, to bypass SCORPION’s website and
instead submit a protein sequence directly to SCORPION’s prediction software.
STING will showcase a professional Web design which adheres to Section 508 standards.
Section 508 standards are a set of accessibility requirements, derived from Section 508 of the
Rehabilitation Act and described by the U.S. Department of Health & Human Services, which make
website content equally accessible to people with disabilities (Section 508).
STING will include three competitive Web-features: an estimated wait time for prediction
results, a login in which users can access previous prediction results, and automatic sequence
sanitation. Automatic sequence sanitation means that when a user enters a sequence of amino acids
(alphabetic letters) that contains invalid characters, the user will have the option to remove
(or “sanitize”) the invalid characters automatically, rather than removing them manually.
2.1 KEY PRODUCT FEATURES AND CAPABILITIES
One of the most important new tools is STING’s RESTful API. STING’s primary goal is to
improve accessibility to SCORPION’s prediction software. A RESTful API accomplishes this by
allowing other applications to utilize STING through a common interface. Additionally, because
every submission passes through the API, it offers a single point at which to guard against flooding
attacks, which would occur if a user attempted to submit a harmful number of amino acid sequences.
SCORPION’s Web design is also important. When a user is pleased with a website’s
aesthetic appeal, they are more likely to explore the website’s content and features. Additionally,
adhering to 508 standards will not only allow users with disabilities to access the SCORPION
website, but will also satisfy the U.S. law that applies to federally funded departments and agencies.
Implementing an optional user login will give users more convenient access to their
previous prediction results. As it stands, a user of SCORPION must access previous submissions by
searching through their email inbox. STING will ensure that previous submissions will not get lost
in a user’s email inbox. It will also give SCORPION’s administrators more information about
their users.
Automatic sequence sanitation will make using SCORPION much more convenient for users.
As it stands, if a SCORPION user enters an amino acid sequence which contains invalid characters,
such as non-alphabetic characters or whitespace, the submission is rejected. The user must
manually remove invalid characters to have their submission accepted. Instead of hand-typing a
protein sequence as input, users will often copy-and-paste large sequences which have been
preformatted to contain whitespace. Manually removing whitespace and invalid characters is time
consuming and makes the sequence submission process inconvenient. STING will provide the user
with the option to automatically remove invalid characters, making the submission process faster
and more efficient for the user.
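The sanitation option described above can be sketched in a few lines. This is only an illustration, written in Python rather than the JavaScript planned for the Web form. The function names and the decision to upper-case the result are assumptions, and every alphabetic character is treated as valid, as in the description above.

```python
def has_invalid_characters(raw: str) -> bool:
    """True when the input contains characters that sanitation would remove."""
    return any(not (ch.isascii() and ch.isalpha()) for ch in raw)


def sanitize_sequence(raw: str) -> str:
    """Return the sequence with every non-alphabetic character removed."""
    # Keep ASCII letters only; digits, punctuation, and whitespace are dropped.
    kept = [ch for ch in raw if ch.isascii() and ch.isalpha()]
    # Upper-casing is an assumption made here for a consistent result.
    return "".join(kept).upper()


# A pasted sequence that was preformatted with whitespace and position numbers.
pasted = "MKT AYI\n10 AKQ-RQI"
print(has_invalid_characters(pasted))  # True
print(sanitize_sequence(pasted))       # MKTAYIAKQRQI
```

A form could call `has_invalid_characters` first and offer the sanitized version only when it returns true, so that the user still chooses whether to accept the change.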
An estimated wait time for prediction results will provide users with an idea of when they
can expect to receive their prediction results. As it stands, a SCORPION user does not know if they
will receive their prediction results in a matter of hours, days, or weeks. Providing an estimated
prediction time enables users to have a concrete idea of when they will receive results. An
estimated prediction time may be especially helpful to a user who is working to meet a deadline.
Tracking visitor statistics such as page views and geographical demographics will give
SCORPION’s administrators feedback and insight as to how many users are utilizing SCORPION and
will provide information about those users. STING’s primary goal is to improve accessibility to
SCORPION’s prediction software. The best method for measuring whether or not STING has
succeeded in making SCORPION more accessible is to measure and record the number of people
using SCORPION’s service. Tracking visitor statistics will provide administrators with consistent
traffic feedback, allowing them to monitor traffic increases and declines. Recording specific page
hits will enable administrators to understand which content users find most useful and which
content may be unnecessary.
2.2 MAJOR COMPONENTS (HARDWARE/SOFTWARE)
Whether STING is accessed through its website or through its API, four hardware
components are necessary: a personal computer (PC), a PHP Web server, a service called “PSI-BLAST,”
and SCORPION’s prediction software, called a “Neural Network.” The relationship between
these components is illustrated in Figure 1. A PC will be necessary for a user to access STING’s user
interface (website or API). The PHP Web server will be used to host STING’s website and support
STING’s API. PSI-BLAST is a third-party service which reformats the submitted protein
sequence to prepare it for SCORPION’s prediction software. SCORPION’s Neural Network is the
software which predicts protein secondary structures.
Figure 1: SCORPION Hardware Components
There will also be six software/virtual components necessary for STING: a Web browser
with Internet connection, a RESTful API, a Web page template, a service called “OpenID,” a database,
and a service called “Google Analytics.” A Web browser with Internet connection will be necessary
for a user to access STING’s user interface. A RESTful API will be used to allow other applications to
use SCORPION’s prediction software. Specifically, the RESTful API will be incorporated into the
queuing of protein submissions (jobs) and incorporated into protein sequence sanitation. This will
enable other applications to view (GET) the list of current jobs and to submit (POST) a new job. A
Web page template will be used as SCORPION’s primary graphical user interface (GUI).
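The GET and POST operations on the job queue can be illustrated with a small in-memory sketch. It is written in Python for brevity, not in the PHP planned for STING. The `/jobs` path, the JSON field names, and the `JobQueue` class are hypothetical and are not taken from SCORPION's code.

```python
import itertools
import json


class JobQueue:
    """In-memory stand-in for the queue of protein submissions (jobs)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._jobs = []

    def handle(self, method: str, path: str, body: str = ""):
        """Return (status_code, json_text) for a request to the assumed /jobs resource."""
        if path != "/jobs":
            return 404, json.dumps({"error": "unknown resource"})
        if method == "GET":
            # View the list of current jobs.
            return 200, json.dumps(self._jobs)
        if method == "POST":
            # Submit a new job; the body carries the amino acid sequence.
            try:
                sequence = json.loads(body)["sequence"]
            except (ValueError, KeyError, TypeError):
                return 400, json.dumps({"error": "body must be JSON with a 'sequence' field"})
            if not isinstance(sequence, str) or not sequence.isalpha():
                return 400, json.dumps({"error": "sequence must be alphabetic"})
            job = {"id": next(self._ids), "sequence": sequence, "status": "queued"}
            self._jobs.append(job)
            return 201, json.dumps(job)
        return 405, json.dumps({"error": "method not allowed"})


queue = JobQueue()
print(queue.handle("POST", "/jobs", json.dumps({"sequence": "MKTAYIAKQR"})))
print(queue.handle("GET", "/jobs"))
```

Because the handler takes a method, a path, and a body and returns a status with a JSON reply, the same logic could sit behind a website form or a third-party client.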
OpenID is a third party service which will be used to implement the user login. OpenID will
allow a user to log in to STING with a preexisting third-party account such as Google or Facebook. A
database will be required to store logged-in user information and protein sequence prediction
results. Specifically, the database will require a table to store user login OpenIDs, a table to store
sequence submissions, a table to link OpenIDs to the sequence submission results, and a table to
store optionally provided user information. Google Analytics is a third-party service which will be
used to record user statistics such as Web page hits and visitors’ IP addresses.
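One possible layout for the four tables described above is sketched below, using Python's built-in `sqlite3` module with an in-memory database. All table and column names are assumptions chosen for illustration, as is the example OpenID value; the source specifies only what each table stores.

```python
import sqlite3

# Hypothetical schema: one table per item listed in the description.
SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    openid  TEXT NOT NULL UNIQUE
);
CREATE TABLE submissions (
    submission_id INTEGER PRIMARY KEY,
    sequence      TEXT NOT NULL,
    result        TEXT
);
CREATE TABLE user_submissions (
    user_id       INTEGER NOT NULL REFERENCES users(user_id),
    submission_id INTEGER NOT NULL REFERENCES submissions(submission_id),
    PRIMARY KEY (user_id, submission_id)
);
CREATE TABLE user_profiles (
    user_id INTEGER PRIMARY KEY REFERENCES users(user_id),
    email   TEXT,
    name    TEXT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Record one logged-in user and one of their submissions.
user_id = conn.execute(
    "INSERT INTO users (openid) VALUES (?)", ("https://example.org/id/alice",)
).lastrowid
sub_id = conn.execute(
    "INSERT INTO submissions (sequence) VALUES (?)", ("MKTAYIAKQR",)
).lastrowid
conn.execute("INSERT INTO user_submissions VALUES (?, ?)", (user_id, sub_id))

# Retrieve the user's prediction history through the link table.
history = conn.execute(
    "SELECT s.sequence FROM submissions s "
    "JOIN user_submissions us ON us.submission_id = s.submission_id "
    "WHERE us.user_id = ?",
    (user_id,),
).fetchall()
print(history)  # [('MKTAYIAKQR',)]
```

Keeping the link between OpenIDs and submissions in its own table means that anonymous submissions need no user row at all.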
STING will use two algorithms: A protein sequence sanitation algorithm and an estimated
prediction time algorithm. Both algorithms pertain to the protein sequence submission form. The
protein sequence sanitation algorithm, shown in Figure 2, is used to validate the input of a protein
sequence submission. The user is not required to provide an email address if they are logged in
because they can choose to view their protein structure prediction results in their user history area.
Additionally, a logged-in user can provide their email address through their user account area.
Figure 2: Protein Sequence Sanitation Algorithm
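The form rules described above (a valid sequence is always required, and an email address is required only for users who are not logged in) might be checked as in the following sketch. Figure 2 is not reproduced here, so this is an assumption about its general shape rather than a transcription of it. The function name and the crude `@` test are illustrative only.

```python
from typing import List


def validate_submission(sequence: str, email: str, logged_in: bool) -> List[str]:
    """Return a list of problems; an empty list means the form can be submitted."""
    problems = []
    if not sequence:
        problems.append("sequence is empty")
    elif not (sequence.isascii() and sequence.isalpha()):
        # This is the point at which the user would be offered automatic sanitation.
        problems.append("sequence contains invalid characters")
    # Anonymous users need an email address to receive the results URL.
    # Checking for "@" is a placeholder, not real email validation.
    if not logged_in and "@" not in email:
        problems.append("email address is required when not logged in")
    return problems


print(validate_submission("MKTAYI", "", logged_in=True))   # []
print(validate_submission("MKT AYI", "", logged_in=False))
```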
The estimated prediction time algorithm, shown in Figure 3, will be used to calculate the
estimated duration of time that the user will wait to receive their prediction results. The algorithm
is simple and is based on a timed experiment which concluded that each amino acid character will
add approximately 2.13 seconds to the estimated prediction time (CS410 Blue Team). The
prediction time will be displayed both on the submission form page, before the user has submitted
their sequence, and on the thank-you page, after the user has submitted their sequence.
Figure 3: Estimated Prediction Time Algorithm
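As a worked illustration of the estimate, the sketch below multiplies the sequence length by the 2.13 seconds per amino acid reported in the timed experiment. The function names and the display format are assumptions, and the sketch ignores any time a job might spend waiting behind other jobs.

```python
SECONDS_PER_RESIDUE = 2.13  # figure reported by the timed experiment


def estimated_prediction_seconds(sequence: str) -> float:
    """Estimate the prediction time as a linear function of sequence length."""
    return SECONDS_PER_RESIDUE * len(sequence)


def format_duration(seconds: float) -> str:
    """Render a number of seconds as hours, minutes, and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"


# A 300-residue sequence: 300 * 2.13 = 639 seconds.
sequence = "A" * 300
print(format_duration(estimated_prediction_seconds(sequence)))  # 0h 10m 39s
```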
3 IDENTIFICATION OF CASE STUDY
SCORPION was developed by Ashraf Yaseen, a PhD student at ODU, and his PhD advisor, Dr.
Yaohang Li. During SCORPION’s development, the focus was the accuracy of SCORPION’s protein
secondary structure prediction software, rather than SCORPION’s website functionality or design.
Dr. Li is overseeing the development of SCORPION’s new features, making him one of
SCORPION’s target customers; he will decide whether the new features
from the prototype will be implemented in the existing product. SCORPION will also be catering
to its primary users: computational biologists, pharmaceutical companies, research students, and
geneticists. These are the individuals who rely on protein secondary structure prediction to
progress in their work on a regular basis. Dr. Li has emphasized that in just one day, SCORPION can
enable the completion of work that might take years in a lab.
4 STING PRODUCT PROTOTYPE DESCRIPTION
SCORPION’s prototype will be nearly identical to the proposed end-product. It will maintain
nearly all of the functionality of the end-product, including protein sequence sanitation and
estimated prediction time. The essential distinction between the prototype and the end-product is
that the prototype will use a mock version of Dr. Li’s protein prediction software. Table 1 compares
the features of the real-world-product to STING’s prototype. Any feature which is not listed
will be identical between the real-world-product and prototype.
| Features/Components | Real-World-Product | Prototype |
|---|---|---|
| Protein Secondary Structure Prediction Results | Results will be accurate predictions made by Dr. Li’s Neural Network | Results will be randomly generated sequences intended to simulate prediction results |
| Web Server | The Web server will be hosted by ODU’s SCORPION Web server | The Web server will be hosted by ODU’s CS 411 Web server |
| User login | Users will have the ability to login through a third-party account | Users will have the ability to login through a Google account |
| User account | Users will have the ability to submit additional personal information about themselves to the user database | Users will have the ability to view their username and email address which have been retrieved from the user database |
| Estimated Prediction Time | The estimated prediction time will be based upon multiple timed experiments completed over the course of several weeks | The estimated prediction time will be based upon one timed experiment completed over the course of one week |
Table 1: Comparison Between Real-World-Product and Prototype
The goals and objectives of the prototype are to produce and demonstrate a fully functional
model of SCORPION’s new features. The website template will display the new layout and will be
508 compliant. The API will allow other services to take advantage of SCORPION’s features while
enabling them to create a custom platform over HTTP.
4.1 PROTOTYPE ARCHITECTURE
Similar to the end-product, the prototype will require a personal computer (PC) and PHP
Web server. The prototype will be accessible through either the API or website. The backend of the
API will use Dr. Li’s preexisting SCORPION binary code in the simulation of SCORPION’s prediction
software. PHP will be used for the submission (POST) and retrieval (GET) of protein sequences. The
submissions (jobs) and server load will be monitored, and each job will have a unique ID.
Monitoring the server load will help to prevent flooding attacks. The API will also be used when
emailing the predicted results to the user. The frontend of the API will utilize XML (Extensible
Markup Language) and JSON (JavaScript Object Notation), as both are very common data formats.
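To make the two output formats concrete, the sketch below gives a job a unique ID and serializes it as both JSON and XML using Python's standard library. The field names, the element names, and the use of a random UUID as the job ID are assumptions. The prototype itself is planned in PHP.

```python
import json
import uuid
import xml.etree.ElementTree as ET


def new_job(sequence: str) -> dict:
    """Create a job record with a unique ID (a random UUID is assumed here)."""
    return {"id": uuid.uuid4().hex, "sequence": sequence, "status": "queued"}


def to_json(job: dict) -> str:
    """Serialize the job as a JSON object."""
    return json.dumps(job, sort_keys=True)


def to_xml(job: dict) -> str:
    """Serialize the same job as a small XML document."""
    root = ET.Element("job", attrib={"id": job["id"]})
    ET.SubElement(root, "sequence").text = job["sequence"]
    ET.SubElement(root, "status").text = job["status"]
    return ET.tostring(root, encoding="unicode")


job = new_job("MKTAYIAKQR")
print(to_json(job))
print(to_xml(job))
```

Both serializers read from the same dictionary, so a client could request either format and receive identical information.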
The website will also use PHP for its server-side code, meaning PHP will be used when a
user submits a protein sequence through the Web form. PHP is already used in the preexisting
implementation of SCORPION, so the transition will be easy. Additionally, PHP is a module-based
language, which makes it easier to isolate and modify specific functions. The database portion of the
website will use SQL, specifically SQLite 3, which is suited to moderate-traffic websites that
require large data storage. SQLite 3 is also compatible with PHP, making it a suitable choice.
The website template will be designed using XHTML and CSS3, both of which are current
standards in Web design. 508 compliance will be ensured by using the 28 guidelines from the
checklist at webAIM.org, all of which are excerpted from Section 508 of the Rehabilitation Act,
§1194.22 (Section 508 Checklist). The Web form sequence sanitation will be completed using
JavaScript. JavaScript runs in the user’s browser, which will keep the burden off of the Web server.
OpenID will be used for the user login. OpenID is a commonly used tool, allowing users to log in with
a third-party account. The way in which the prototype hardware and software components will
interact with each other is illustrated in Figure 4.
Figure 4: Prototype Hardware and Software Components
4.2 PROTOTYPE FEATURES AND CAPABILITIES
As the prototype is nearly identical to the end-product, it will demonstrate all of STING’s
new functionality. The API will demonstrate how requests can be submitted over HTTP, how the
status of a single job or multiple jobs can be accessed, and that its documentation is public.
The website will display the professional, 508-compliant template design. It will show that
when a user submits a protein sequence, they will have the convenient option of
automatically sanitizing their input. A login, as illustrated by the sitemap in Figure 5, will give users a
way to connect with SCORPION and to easily retrieve their previous prediction results. Additionally,
the administrator will be able to log in and view user statistics from Google Analytics as well as
user information from the database about users who have logged in.
Figure 5: Sitemap of Prototype Login
To mitigate risks that may be faced, SCORPION’s new features have been designed to build
upon the preexisting SCORPION product. Everything that has been added is fully compatible with
the preexisting SCORPION. Additionally, for the user login and user statistics, highly established
third party services will be used.
4.3 PROTOTYPE DEVELOPMENT CHALLENGES
The development of the prototype website is expected to encounter four challenges:
Ensuring 508 compliance, addressing the new design, accurately estimating prediction time, and
securing logged-in user information.
It is important to ensure that the website meets 508 standards. If the standards are not met,
the website may be inaccessible to users with disabilities. Additionally, because all federally funded
websites are required to be 508 compliant, the project could lose federal funding if it fails to
comply. To ensure 508 compliance, the website will be extensively tested using 508 compliance
verification tools, such as those found at W3.org/WAI.
Another challenge will be addressing how users will react to the new design and features.
Users will be unfamiliar with the changes and may resist them. To counter this, the page layout will
remain similar to the previous design and the changes will be announced on the website’s
homepage. Additionally, instructions for using SCORPION will be provided.
Providing an accurate estimate of the protein structure prediction time is another
consideration. If the estimated prediction time is inaccurate, the user may become frustrated,
especially if they have a deadline to meet. To accommodate slight variations in the duration of the
prediction time, a time window rather than a single exact time will be given.
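One way to turn the point estimate into a window is to widen it by a fixed margin on each side, as sketched below. The 20 percent margin is purely hypothetical, since the source does not state how wide the window will be. The 2.13 seconds per amino acid is the figure from the timed experiment.

```python
import math

SECONDS_PER_RESIDUE = 2.13  # figure reported by the timed experiment
TOLERANCE = 0.20            # hypothetical margin; not specified in the source


def prediction_window_minutes(sequence_length: int) -> tuple:
    """Return (low, high) whole-minute bounds around the linear estimate."""
    estimate = SECONDS_PER_RESIDUE * sequence_length
    low = math.floor(estimate * (1 - TOLERANCE) / 60)
    high = math.ceil(estimate * (1 + TOLERANCE) / 60)
    return low, high


low, high = prediction_window_minutes(300)
print(f"Results expected in {low} to {high} minutes")  # Results expected in 8 to 13 minutes
```

Rounding the lower bound down and the upper bound up keeps the displayed window from being narrower than the computed one.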
Security of logged-in user information is also very important. To keep user information
secure, OpenID will be used for the user login, so no passwords will be stored on the SCORPION
system. Additionally, SSL will be implemented to protect any additional information the user
chooses to provide. Further, a user who does not have a third-party account is not required to
log in to benefit from SCORPION.
The API will have its own challenges, which will be: securing the API, ensuring proper
interfacing, and ensuring administrator access to user statistics. The API will be open to the public,
making it vulnerable to attacks. A RESTful API typically uses four HTTP methods: GET, POST, DELETE,
and PUT. Even though SCORPION’s API will only utilize the “GET” and “POST” methods, the two other
methods, “DELETE” and “PUT,” will still be explicitly defined as functions to make certain that an
attacker cannot exploit them.
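The idea of defining all four methods, including the two that are not offered, can be sketched as a dispatch table in which PUT and DELETE map to a handler that only refuses the request. The handler names and reply strings are assumptions. The 405 status code is the standard HTTP response for a method that is not allowed.

```python
from http import HTTPStatus


def get_jobs():
    """Handler for GET: would return the list of current jobs."""
    return HTTPStatus.OK, "list of current jobs"


def post_job():
    """Handler for POST: would accept a new job."""
    return HTTPStatus.CREATED, "job accepted"


def reject():
    """Handler for PUT and DELETE: defined, but only to refuse the request."""
    return HTTPStatus.METHOD_NOT_ALLOWED, "method not allowed"


# Every method has an explicit entry, so none falls through to undefined behavior.
HANDLERS = {"GET": get_jobs, "POST": post_job, "PUT": reject, "DELETE": reject}


def dispatch(method: str):
    """Route a request by method; unknown methods are refused as well."""
    handler = HANDLERS.get(method.upper(), reject)
    return handler()


print(dispatch("GET"))
print(dispatch("DELETE"))
```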
It may be a challenge to incorporate the API with the existing SCORPION resources. The API
will reference SCORPION’s resources by the address of their file location. If one of SCORPION’s files
is moved, this could pose a problem. To prevent this problem, the API will be structured such that if
a file address changes, the system will follow a defined fallback procedure rather than crash.
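A minimal sketch of such a fallback is shown below: the API checks a list of candidate locations and, if none exists, returns an error reply instead of raising an exception. The function names and the 503 reply are assumptions, and real candidate paths would come from STING's configuration.

```python
from pathlib import Path
from typing import Optional, Sequence


def locate_resource(candidates: Sequence[str]) -> Optional[Path]:
    """Return the first candidate path that exists as a file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


def run_prediction(candidates: Sequence[str]) -> dict:
    """Describe what the API would do, without crashing when the file has moved."""
    binary = locate_resource(candidates)
    if binary is None:
        # Report the problem to the caller rather than raising an exception.
        return {"status": 503, "error": "prediction software is unavailable"}
    return {"status": 200, "binary": str(binary)}


# Hypothetical locations; neither is expected to exist on the reader's machine.
print(run_prediction(["/no/such/dir/scorpion", "/another/missing/scorpion"]))
```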
GLOSSARY
508 Compliance: Adhering to guidelines established to make website content equally accessible to people with
disabilities
Amino Acids/Residues: The building blocks of proteins
API: Application Programming Interface (abstract way for services to communicate)
Cross-validation Training: The process of dividing training data into k mutually exclusive subsets (folds), of roughly
equal size where some subsets are used for training, validating, and testing. The process is repeated k times.
Data cleansing: The process of removing non-representative instances from the data set.
Dunbrack Lab: Part of the Fox Chase Cancer Research Center. Recognized for normalizing data from the RCSB
ETL: Extract, Transform, and Load; refers to the manipulation of data
FASTA: Format widely adopted in bioinformatics to make it easier to manipulate and parse sequences
Fold: The fold of an amino acid sequence forms the protein’s secondary structure
GeoIP: Uses a lookup table of Internet Protocol addresses with known municipalities and providers to match IP
origin
GUI: Graphical User Interface
JSON: JavaScript Object Notation
NSF: National Science Foundation
PC: Personal Computer
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool used for deriving the PSSM
PSSM: Position-Specific Scoring Matrix which includes information about evolutionary relatives of the original
protein sequence
RCSB Protein Data Bank: Research Collaboratory for Structural Bioinformatics database. The database holds all
known and recognized protein sequences.
REST: Representational State Transfer. A REST API is a set of operations that can be invoked by means of any of
the four verbs (GET, POST, PUT, DELETE), using the actual URI as parameters for the operations.
SCORPION: SeCOndaRy structure PredictION
Training set: Set of instances from the problem domain used to train the algorithm
VM: Virtual machine
XML: Extensible Markup Language
REFERENCES
Biological Macromolecular Resource. (n.d.). RCSB Protein Data Bank. Retrieved February 20, 2014, from
http://www.rcsb.org/pdb/home/home.do
Blue Team. (n.d.). SCORPION Protein Prediction Timed Experiment. Retrieved February 11, 2014,
from www.cs.odu.edu/~410blue/CS410SCORPIONProteinPredictionTimeExperiment.xlsx
Cancer Research Funding - National Cancer Institute. (2013, August 23). Cancer Research Funding - National Cancer Institute. Retrieved May 8, 2014, from
http://www.cancer.gov/cancertopics/factsheet/NCI/research-funding
Freitas, R. (1998, January 1). Nanomedicine. Chapter 3 page 1. Retrieved May 8, 2014, from
http://www.foresight.org/Nanomedicine/Ch03_1.html
Heron, M. (2014, July 14). Leading Causes of Death. Retrieved September 12, 2014, from
http://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm
Murphy, S. (2013, May 8). Deaths: Final Data for 2010. Retrieved May 8, 2014, from
http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_04.pdf
Northwestern University. (2012, January 8). New hope for diseases of protein folding such as
Alzheimer’s, Parkinson’s diseases, ALS, cancer and diabetes. ScienceDaily. Retrieved
September 12, 2014 from www.sciencedaily.com/releases/2012/01/120106135946.htm
RCSB PDB - Histograms. (n.d.). RCSB PDB - Histograms. Retrieved May 8, 2014, from
http://www.rcsb.org/pdb/statistics/histogram.do?mdcat=mvStructure&mditem=residueCount&name=Residue%20Count
Robins, D., & Holmes, J. (2008). Aesthetics and credibility in website design. Information Processing
And Management, 44(Evaluation of Interactive Information Retrieval Systems), 386-399.
doi:10.1016/j.ipm.2007.02.003
Section 508. (n.d.). United States Department of Health and Human Services. Retrieved March 15,
2014, from http://www.hhs.gov/web/508/index.html
Section 508 Checklist. (n.d.). Retrieved September 17, 2014, from
http://webaim.org/standards/508/checklist
Section 508 Of The Rehabilitation Act. (n.d.). Section 508 Home. Retrieved March 15, 2014, from
http://www.section508.gov/Section-508-Of-The-Rehabilitation-Act
Sillence, E., Briggs, P., Harris, P., & Fishwick, L. (2007). How do patients evaluate and make use of
online health information? Social Science & Medicine, 64, 1853-1862.
doi:10.1016/j.socscimed.2007.01.012
What is protein folding? (n.d.). Retrieved October 16, 2014,
from http://fold.it/portal/info/science
Yaseen, A., & Li, Y. Context-based Features Enhance Protein Secondary Structure Prediction
Accuracy.