Running head: Lab II – Prototype Product Specification For STING
CS 411W Lab II
Prototype Product Specification
For
STING
Prepared by: Jasmine, Blue Team
Date: 11/25/14
TABLE OF CONTENTS
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Definitions, Acronyms, and Abbreviations
1.4 References
1.5 Overview
2 General Description
2.1 Prototype Architecture Description
2.2 Prototype Functional Description
2.3 External Interfaces
2.3.1 Hardware Interfaces
2.3.2 Software Interfaces
2.3.3 User Interfaces
2.3.4 Communications Protocols and Interfaces
LIST OF FIGURES
Figure 1: STING Hardware and Software Components
Figure 2: Protein Sequence Sanitation Algorithm
Figure 3: Estimated Prediction Time Algorithm
Figure 4: Login Sitemap of Prototype
Figure 5: Main Sitemap of Prototype
Figure 6: Web site Prototype Visual Aid
LIST OF TABLES
Table 1: Comparison Between Real-World-Product and Prototype
1 Introduction
Cancer, Alzheimer's, Parkinson's, ALS, and type 2 diabetes are just five of the more than
three hundred diseases which result from improper protein structures in the body (Northwestern
University). In 2010, cancer alone was the second-leading cause of death in the United States, taking
the lives of 576,691 people (Heron). Proteins, which are in every cell of the human body, play a
vital role in carrying out almost all of the body’s functions (What is protein). Some of these
functions include breaking down food for muscle support, sending signals through the brain to
control the body, and transporting nutrients through the blood (What is protein). When a protein
has an improper structure, it can lead to cellular dysfunction and cellular death (Northwestern
University).
Proteins are formed from a string of amino acids. These amino acids fold together to
create a protein structure. To understand the amino acid folding process more clearly, the following
simplified analogy can be used. Imagine a string of yarn. Imagine holding the string of yarn by one
end and letting it dangle down in a straight line. Envision that the yarn has many small magnets
which run down its length. Slowly lower the string onto a table, allowing it to coil into a shape. The
magnets will cling together, and a structure will be formed. This structure is the protein, and the
magnets are the amino acids. The string of amino acids is referred to as the protein’s “primary
structure,” and the way in which the primary structure folds is called the protein’s “secondary
structure.” An identical amino acid sequence will always fold into the same secondary structure
(What is protein).
A protein’s secondary structure dictates what role it will play in the body (What is protein).
The structure can be thought of as a “key” which can only carry out a certain function if it is the
right shape to fit in the “lock” (What is protein). Predicting a protein’s secondary structure, given
its primary structure, is an important goal for curing protein-related diseases because it will allow
scientists to create new proteins, which they can then use to combat disease-related proteins
(What is protein). Currently, the
most accurate protein secondary structure prediction service available is SCORPION at Old
Dominion University (ODU).
While SCORPION encompasses the most accurate protein prediction software available,
SCORPION’s Web site (SCORPION’s graphical user interface) lacks the professional design and user
features to match its high quality of service. Aesthetic appeal is critical; a study by Kent State
University found that in just 3.42 seconds, participants had judged a Web site’s credibility based on
aesthetic appeal (Robins). Another study by Northumbria University found that 94 percent of
participants mistrusted and rejected health-related Web sites based on their design factors
(Sillence). SCORPION has the most accurate protein secondary structure prediction service
available, but users may overlook it because of its poor aesthetic appeal and limited interface
functionality.
1.1 Purpose
STING’s purpose is to improve SCORPION’s aesthetic appeal, functionality, and accessibility
so that SCORPION can contribute its full potential to the fight against protein-related diseases.
STING is a Web service which will build upon the preexisting Web service SCORPION. SCORPION is
a Web service which provides protein secondary structure prediction. To utilize SCORPION, a user
visits the SCORPION Web site and submits a Web form with their email address and a string of
alphabetical characters. Each alphabetical character represents an amino acid; amino acids are the
building blocks of a protein. The string of alphabetical characters, representing an amino acid
sequence, is the input which is used by SCORPION to predict the protein’s secondary structure.
Once SCORPION has predicted the structure, a Web page which displays the results is created and
stored on SCORPION’s Web server. An email containing the URL of the results Web page is then sent
to the user.
STING will implement a RESTful API in order to improve accessibility to SCORPION’s
prediction software. A RESTful API accomplishes this by allowing other applications to utilize
STING through a common interface. Additionally, because every submission passes through the
API, STING has a single point at which to limit request rates, which helps protect the service if a
user attempts to submit a harmful number of amino acid sequences.
STING will provide SCORPION with a new, aesthetically pleasing, and 508 compliant Web
site design. Adhering to Section 508 standards will not only allow users with disabilities to access
the SCORPION Web site but will also satisfy the U.S. law that applies to federally funded
departments and agencies.
STING will implement an optional user login which will give users more convenient access
to their previous prediction results. As it stands, a user of SCORPION must access previous
submissions by searching through their email inbox. STING will ensure that previous submissions
will not get lost in a user’s email inbox. It will also give SCORPION’s administrators more
information about who is using the service.
Automatic sequence sanitation will make using SCORPION much more convenient for users.
As it stands, if a SCORPION user enters an amino acid sequence which contains invalid characters,
such as non-alphabetic characters or whitespace, the submission is rejected. The user must
manually remove invalid characters to have their submission accepted. Instead of hand-typing a
protein sequence as input, users will often copy-and-paste large sequences which have been
preformatted to contain whitespace. Manually removing whitespace and invalid characters is time
consuming and makes the sequence submission process inconvenient. STING will provide the user
with the option to automatically remove invalid characters, making the submission process faster
and more efficient for the user.
An estimated wait time for prediction results will provide users with an idea of when they
can expect to receive their prediction results. As it stands, a SCORPION user does not know if they
will receive their prediction results in a matter of hours, days, or weeks. Providing an estimated
prediction time enables users to have a reasonable idea of when they will receive results. An
estimated prediction time may be especially helpful to a user who is working to meet a deadline.
Tracking visitor statistics such as page views and geographical demographics will give
SCORPION’s administrators feedback and insight as to how many users are utilizing SCORPION and
will provide information about those users. STING’s primary goal is to improve accessibility to
SCORPION’s prediction software. The best method for measuring whether or not STING has
succeeded in making SCORPION more accessible is to measure and record the number of people
using SCORPION’s service. Tracking visitor statistics will provide administrators with consistent
traffic feedback, allowing them to monitor traffic increases and declines. Recording specific page
hits will enable administrators to understand which content users find most useful and which
content may be unnecessary.
1.2 Scope
STING’s prototype will be nearly identical to the proposed end-product. It will maintain
nearly all of the functionality of the end-product, including protein sequence sanitation and
estimated prediction time. The essential distinction between the prototype and the end-product is
that the prototype will use a mock version of Dr. Li’s protein prediction software. Table 1 compares
the features of the real-world-product to those of STING’s prototype. Any feature which is not listed
will be identical in the real-world-product and the prototype.
Features/Components | Real-World-Product | Prototype
Protein Secondary Structure Prediction Results | Results will be accurate predictions made by Dr. Li’s Neural Network | Results will be randomly generated sequences intended to simulate prediction results
Web Server | The Web server will be hosted by ODU’s SCORPION Web server | The Web server will be hosted by ODU’s CS 411 Web server
User login | Users will have the ability to login through a third-party account | Users will have the ability to login through a Google account
User account | Users will have the ability to submit additional personal information about themselves to the user database | Users will have the ability to view their username and email address which have been retrieved from the user database
Estimated Prediction Time | The estimated prediction time will be based upon multiple timed experiments completed over the course of several weeks | The estimated prediction time will be based upon one timed experiment completed over the course of one week
Table 1: Comparison Between Real-World-Product and Prototype
The goals and objectives of the prototype are to produce and demonstrate a fully functional
model of SCORPION’s new features. The Web site template will display the new layout and will be
508 compliant. The API will allow other services to take advantage of SCORPION’s features while
enabling them to create a custom platform over HTTP.
1.3 Definitions, Acronyms, and Abbreviations
508 Compliance: Adhering to guidelines established to make Web site content equally accessible
to people with disabilities
Amino Acids/Residues: The building blocks of proteins
API: Application Programming Interface (abstract way for services to communicate)
Cross-validation Training: The process of dividing training data into k mutually exclusive subsets
(folds), of roughly equal size where some subsets are used for training, validating, and testing. The
process is repeated k times.
Data cleansing: The process of removing non-representative instances from the data set.
Dunbrack Lab: Part of the Fox Chase Cancer Research Center. Recognized for normalizing data
from the RCSB
ETL: Extract, Transform, and Load; refers to the manipulation of data
FASTA: Format widely adopted in bioinformatics to make it easier to manipulate and parse
sequences
Fold: The fold of an amino acid sequence forms the protein’s secondary structure
GeoIP: Uses a lookup table of Internet Protocol addresses with known municipalities and providers
to match IP origin
GUI: Graphical User Interface
JSON: JavaScript Object Notation
NSF: National Science Foundation
PC: Personal Computer
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool used for deriving the
PSSM
PSSM: Position-Specific Scoring Matrix which includes information about evolutionary relatives of
the original protein sequence
RCSB Protein Data Bank: Research Collaboratory for Structural Bioinformatics database. The
database holds all known and recognized protein sequences.
REST: Representational State Transfer. A REST API is a set of operations that can be invoked by
means of any of four HTTP verbs (GET, POST, PUT, DELETE), using the URI to identify what each
operation acts on.
SCORPION: SeCOndaRy structure PredictION
Training set: Set of instances from the problem domain used to train the algorithm
VM: Virtual machine
XML: Extensible Markup Language
1.4 References
Biological Macromolecular Resource. (n.d.). RCSB Protein Data Bank. Retrieved Feb. 20, 2014,
from http://www.rcsb.org/pdb/home/home.do
Blue Team. (n.d.). SCORPION Protein Prediction Timed Experiment. Retrieved February 11,
2014, from www.cs.odu.edu/~410blue/CS410SCORPIONProteinPredictionTimeExperiment.xlsx
Cancer Research Funding - National Cancer Institute. (2013, August 23). Cancer Research
Funding - National Cancer Institute. Retrieved May 8, 2014, from
http://www.cancer.gov/cancertopics/factsheet/NCI/research-funding
Freitas, R. (1998, January 1). Nanomedicine. Chapter 3 page 1. Retrieved May 8, 2014,
from http://www.foresight.org/Nanomedicine/Ch03_1.html
Heron, M. (2014, July 14). Leading Causes of Death. Retrieved September 12, 2014, from
http://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm
Lab 1 – STING Product Description. Version 2. (2014, October). STING Team. Blue Team.
CS411W: Jasmine Jones
Murphy, S. (2013, May 8). Deaths: Final Data for 2010. Retrieved May 8, 2014, from
http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_04.pdf
Northwestern University. (2012, January 8). New hope for diseases of protein folding such as
Alzheimer’s, Parkinson’s diseases, ALS, cancer and diabetes. ScienceDaily. Retrieved
September 12, 2014 from www.sciencedaily.com/releases/2012/01/120106135946.htm
RCSB PDB - Histograms. (n.d.). RCSB PDB - Histograms. Retrieved May 8, 2014, from
http://www.rcsb.org/pdb/statistics/histogram.do?mdcat=mvStructure&mditem=residueCount&name=Residue%20Count
Robins, D., & Holmes, J. (2008). Aesthetics and credibility in Web site design. Information
Processing And Management, 44(Evaluation of Interactive Information Retrieval Systems),
386-399. doi:10.1016/j.ipm.2007.02.003
Section 508. (n.d.). United States Department of Health and Human Services. Retrieved March
15, 2014, from http://www.hhs.gov/web/508/index.html
Section 508 Checklist. (n.d.). Retrieved September 17, 2014, from
http://webaim.org/standards/508/checklist
Section 508 Of The Rehabilitation Act. (n.d.). Section 508 Home. Retrieved March 15, 2014,
from http://www.section508.gov/Section-508-Of-The-Rehabilitation-Act
Sillence, E., Briggs, P., Harris, P., & Fishwick, L. (2007). How do patients evaluate and make use
of online health information? Social Science & Medicine, 64, 1853-1862.
doi:10.1016/j.socscimed.2007.01.012
What is protein folding? (n.d.). Retrieved October 16, 2014,
from http://fold.it/portal/info/science
Yaseen, A., & Li, Y. Context-based Features Enhance Protein Secondary Structure Prediction
Accuracy.
1.5 Overview
This product specification provides a description of STING’s prototype architecture,
functionality, algorithms, and interfaces. The hardware and software which will be used are
described in detail. Additionally, in-depth requirements depicting how to implement STING are
described in section 3.1.
2 General Description
STING is a Web service which builds upon a preexisting protein secondary structure
prediction service which is called SCORPION. STING will consist of a Web site, STING’s primary
user interface, and an API, STING’s secondary user interface. STING’s primary user interface will
have a professional design and be fully 508 compliant. STING will allow a user to submit a protein
sequence and email address, through either of STING’s two interfaces. STING will return mock
prediction results to a user via email, and optionally, via a user login account. STING will provide
users with the ability to log in and access previous prediction results. STING will also track user
statistics and allow administrators to log in and view those statistics.
2.1 Prototype Architecture Description
Whether STING is accessed through its Web site or through its API, two hardware
components are necessary: a personal computer (PC) and a PHP Web server. A PC will be necessary
for a user to access STING’s user interface (Web site or API), and the PHP Web server will be used
to host STING’s Web site and support STING’s API. There will also be six software/virtual
components necessary for STING: a Web browser with Internet connection, a RESTful API, a Web
page template, a service called “OpenID,” a database, and a service called “Google Analytics.” The
relationship between these components is illustrated in Figure 1.
Figure 1: STING Hardware and Software Components
2.2 Prototype Functional Description
A Web browser with Internet connection will be necessary for a user to access STING’s user
interface. A RESTful API will be used to allow other applications to use STING’s mock prediction
software. Specifically, the RESTful API will be incorporated into the queuing of protein submissions
(jobs) and incorporated into protein sequence sanitation. This will enable other applications to
view (GET) the list of current jobs and to submit (POST) a new job. A Web page template will be
used as SCORPION’s primary graphical user interface (GUI).
OpenID is a third party service which will be used to implement the user login. OpenID will
allow a user to log in to STING with a preexisting third-party account such as Google or Facebook. A
database will be required to store logged-in user information and protein sequence prediction
results. Specifically, the database will require a table to store user login OpenIDs, a table to store
sequence submissions, a table to link OpenIDs to the sequence submission results, and a table to
store optionally provided user information. Google Analytics is a third party service which will be
used to record user statistics such as Web page hits and visitors’ IP addresses.
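The four tables described above can be sketched as a small relational schema. The following is a minimal illustration only, written in Python with the standard sqlite3 module rather than the PHP stack the prototype targets. Every table name, column name, and function name here is a hypothetical choice made for the example and is not taken from STING’s actual design.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE users (
    openid TEXT PRIMARY KEY            -- identifier returned by the login provider
);
CREATE TABLE submissions (
    job_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    sequence TEXT NOT NULL,            -- submitted amino acid sequence
    result   TEXT                      -- prediction result, NULL until available
);
CREATE TABLE user_submissions (        -- links a login to its submissions
    openid TEXT NOT NULL REFERENCES users(openid),
    job_id INTEGER NOT NULL REFERENCES submissions(job_id),
    PRIMARY KEY (openid, job_id)
);
CREATE TABLE user_info (               -- optionally provided user information
    openid TEXT PRIMARY KEY REFERENCES users(openid),
    email  TEXT,
    name   TEXT
);
"""


def create_database(path=":memory:"):
    """Open a database and create the four illustrative tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def history_for(conn, openid):
    """Return (job_id, sequence, result) rows for one logged-in user."""
    rows = conn.execute(
        "SELECT s.job_id, s.sequence, s.result "
        "FROM submissions s "
        "JOIN user_submissions us ON us.job_id = s.job_id "
        "WHERE us.openid = ? ORDER BY s.job_id",
        (openid,),
    )
    return rows.fetchall()
```

Keeping the link between logins and submissions in its own table means a submission made without logging in needs no user row at all, which fits the optional login described in section 1.1.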
2.3 External Interfaces
STING will utilize two user interfaces: a Web site and an API. Additionally, STING will use
two algorithms: a protein sequence sanitation algorithm and an estimated prediction time
algorithm. Both algorithms pertain to the protein sequence submission form. The protein
sequence sanitation algorithm, shown in Figure 2, is used to validate the input of a protein
sequence submission. The user is not required to provide an email address if they are logged in
because they can choose to view their protein structure prediction results in their user history area.
Additionally, a logged-in user can provide their email address through their user account area.
Figure 2: Protein Sequence Sanitation Algorithm
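As a rough illustration of the sanitation step, the sketch below treats any whitespace or non-alphabetic character as invalid, following the description in section 1.1. It is written in Python for brevity and is not a transcription of Figure 2; the function names and the `auto_sanitize` flag are assumptions introduced for this example.

```python
def is_valid_sequence(sequence):
    """Accept only non-empty, purely alphabetic (ASCII) sequences."""
    return sequence.isascii() and sequence.isalpha()


def sanitize_sequence(raw):
    """Drop whitespace and every other non-alphabetic character."""
    return "".join(ch for ch in raw if ch.isascii() and ch.isalpha())


def handle_submission(raw, auto_sanitize):
    """Return the sequence to queue, or raise ValueError if it is rejected.

    auto_sanitize models the user's option to have invalid characters
    removed automatically instead of having the submission rejected.
    """
    sequence = sanitize_sequence(raw) if auto_sanitize else raw
    if not is_valid_sequence(sequence):
        raise ValueError("sequence is empty or contains invalid characters")
    return sequence
```

With the option enabled, a pasted sequence such as "MKV LA" followed by a line break is reduced to "MKVLA" before it is queued. With the option disabled, the same input is rejected, which matches SCORPION’s current behavior.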
The estimated prediction time algorithm, shown in Figure 3, will be used to calculate the
estimated duration of time that the user will wait to receive their prediction results. The algorithm
is simple and is based on a timed experiment which concluded that each amino acid character will
add approximately 2.13 seconds to the estimated prediction time (CS410 Blue Team). The
prediction time will be displayed both on the submission form page, before the user has submitted
their sequence, and on the thank you page, after the user has submitted their sequence.
Figure 3: Estimated Prediction Time Algorithm
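The estimate reduces to one multiplication, which the following Python sketch makes concrete. It assumes a purely linear model (2.13 seconds per amino acid character, with no fixed overhead and no allowance for jobs already waiting in the queue). The function names and the display format are illustrative assumptions rather than part of the specification.

```python
SECONDS_PER_RESIDUE = 2.13  # per-character figure from the timed experiment


def estimated_prediction_seconds(sequence):
    """Estimate the wait in seconds for a sanitized amino acid sequence."""
    return len(sequence) * SECONDS_PER_RESIDUE


def format_estimate(seconds):
    """Render a duration in seconds as hours, minutes, and seconds."""
    total = round(seconds)
    minutes, secs = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"
```

For example, a 100-character sequence yields an estimate of roughly 213 seconds, which `format_estimate` renders as "0h 3m 33s".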
2.3.1 Hardware Interfaces
STING will be hosted by a Web server which contains all of the hardware components
necessary to produce STING’s services. No hardware interfacing is anticipated to produce STING’s
services. To utilize STING, the user will need to have a PC with an internet connection.
2.3.2 Software Interfaces
STING will communicate with Google Analytics third-party software through the Google
Analytics API. STING will also communicate with Google’s oAuth2 Master API in order to provide a
user login which enables a user to log in through their Google account. How Google’s login and
Google Analytics will be incorporated into STING’s Web site is illustrated in the sitemap in Figure 4.
Figure 4: Login Sitemap of Prototype
2.3.3 User Interfaces
STING’s primary user interface will be STING’s Web site. The Web site will consist of a Web
site template, protein sequence submission form, thank you Web page, home Web page, contact
Web page, about Web page, login Web page, admin account Web page, user information Web page,
Google Analytics Web page, user account Web page, set of results Web pages, and an expired results
Web page. How these Web site elements connect to each other is illustrated below in Figure 5 and
in Figure 4 located in section 2.3.2.
Figure 5: Main Sitemap of Prototype
Every Web page that is part of STING’s Web site must be 508 compliant. The protein
sequence submission form is the principal method by which a user can submit a protein sequence
to STING’s prediction software. The central method by which a user will receive their prediction
results will also be provided through this Web site. Additionally, users and administrators of STING
will be able to log in to STING through this Web site. A visual aid of what STING’s
Web site could look like is shown in Figure 6.
Figure 6: Web site Prototype Visual Aid
2.3.4 Communications Protocols and Interfaces
STING’s RESTful API will allow other applications to communicate directly with STING over
Hypertext Transfer Protocol (HTTP). The backend of the API will use Dr. Li’s preexisting SCORPION
binary code in the simulation of SCORPION’s prediction software. PHP will be used for the
submission (POST) and retrieval (GET) of protein sequences. The submissions (jobs) and server
load will be monitored, and each job will have a unique ID. The API will also be used when emailing
the predicted results to the user. The frontend of the API will exchange data as XML (Extensible
Markup Language) and JSON (JavaScript Object Notation), as both are widely used data formats.
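The job handling described above can be sketched independently of any Web framework. The Python example below models only the data side of the two operations, a POST that creates a job with a unique ID and a GET that lists current jobs, along with rendering in both output formats. The class name, the field names, and the use of random hexadecimal IDs are assumptions made for illustration. Email delivery, server load monitoring, and the HTTP layer itself are omitted.

```python
import json
import uuid
import xml.etree.ElementTree as ET


class JobQueue:
    """In-memory stand-in for the queue of protein submissions (jobs)."""

    def __init__(self):
        self._jobs = {}

    def post_job(self, sequence):
        """Model of POST: queue a new job and return its record."""
        job_id = uuid.uuid4().hex  # hypothetical ID scheme
        job = {"id": job_id, "sequence": sequence, "status": "queued"}
        self._jobs[job_id] = job
        return job

    def get_jobs(self):
        """Model of GET: list the current jobs in submission order."""
        return list(self._jobs.values())


def to_json(jobs):
    """Serialize a list of jobs as a JSON document."""
    return json.dumps({"jobs": jobs})


def to_xml(jobs):
    """Serialize the same list of jobs as an XML document."""
    root = ET.Element("jobs")
    for job in jobs:
        node = ET.SubElement(root, "job", id=job["id"])
        ET.SubElement(node, "sequence").text = job["sequence"]
        ET.SubElement(node, "status").text = job["status"]
    return ET.tostring(root, encoding="unicode")
```

Because both serializers read the same job records, a client could choose either format without the queue logic changing.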