Running head: Lab II – Prototype Product Specification For STING

CS 411W
Lab II Prototype Product Specification For STING
Prepared by: Jasmine, Blue Team
Date: 11/25/14

TABLE OF CONTENTS
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Definitions, Acronyms, and Abbreviations
1.4 References
1.5 Overview
2 General Description
2.1 Prototype Architecture Description
2.2 Prototype Functional Description
2.3 External Interfaces
2.3.1 Hardware Interfaces
2.3.2 Software Interfaces
2.3.3 User Interfaces
2.3.4 Communications Protocols and Interfaces

LIST OF FIGURES
Figure 1: STING Hardware and Software Components
Figure 2: Protein Sequence Sanitation Algorithm
Figure 3: Estimated Prediction Time Algorithm
Figure 4: Login Sitemap of Prototype
Figure 5: Main Sitemap of Prototype
Figure 6: Web site Prototype Visual Aid

LIST OF TABLES
Table 1: Comparison Between Real-World-Product and Prototype

1 Introduction

Cancer, Alzheimer's, Parkinson's, ALS, and type 2 diabetes are just five of the more than three hundred diseases which result from improper protein structures in the body (Northwestern University).
In 2010, cancer alone was the second-leading cause of death in the United States, taking the lives of 576,691 people (Heron). Proteins, which are in every cell of the human body, play a vital role in carrying out almost all of the body’s functions (What is protein). Some of these functions include breaking down food for muscle support, sending signals through the brain to control the body, and transporting nutrients through the blood (What is protein). When a protein has an improper structure, it can lead to cellular dysfunction and cellular death (Northwestern University).

Proteins are formed from a string of amino acids, which fold together to create a protein structure. The following simplified analogy illustrates the folding process. Imagine holding a string of yarn by one end and letting it dangle in a straight line. Envision that the yarn has many small magnets running down its length. As the string is slowly lowered onto a table, it coils into a shape; the magnets cling together, and a structure is formed. This structure is the protein, and the magnets are the amino acids. The string of amino acids is referred to as the protein’s “primary structure,” and the way in which the primary structure folds is called the protein’s “secondary structure.” An identical amino acid sequence will always fold into the same secondary structure (What is protein).

A protein’s secondary structure dictates what role it will play in the body (What is protein). The structure can be thought of as a “key” which can only carry out a certain function if it is the right shape to fit in the “lock” (What is protein).
Predicting a protein’s secondary structure from its primary structure is an important goal for curing protein-related diseases, because it will allow scientists to create new proteins which they can then use to combat disease-related proteins (What is protein). Currently, the most accurate protein secondary structure prediction service available is SCORPION at Old Dominion University (ODU). While SCORPION encompasses the most accurate protein prediction software available, SCORPION’s Web site (SCORPION’s graphical user interface) lacks the professional design and user features to match its high quality of service. Aesthetic appeal is critical; a study by Kent State University found that participants judged a Web site’s credibility based on aesthetic appeal in just 3.42 seconds (Robins). Another study by Northumbria University found that 94 percent of participants mistrusted and rejected health-related Web sites based on their design factors (Sillence). SCORPION has the most accurate protein secondary structure prediction service available, but users may overlook it because of its poor aesthetic appeal and interface functionality. This is the problem STING addresses.

1.1 Purpose

STING’s purpose is to improve SCORPION’s aesthetic appeal, functionality, and accessibility so that SCORPION can contribute its full potential to the fight against protein-related diseases. STING is a Web service which will build upon the preexisting Web service SCORPION, which provides protein secondary structure prediction. To use SCORPION, a user visits the SCORPION Web site and submits a Web form with their email address and a string of alphabetical characters. Each alphabetical character represents an amino acid; amino acids are the building blocks of a protein.
The string of alphabetical characters, representing an amino acid sequence, is the input which SCORPION uses to predict the protein’s secondary structure. Once SCORPION has predicted the structure, a Web page which displays the results is created and stored on SCORPION’s Web server. An email containing the URL of the results Web page is then sent to the user.

STING will implement a RESTful API in order to improve accessibility to SCORPION’s prediction software. A RESTful API accomplishes this by allowing other applications to utilize STING through a common interface. Additionally, routing submissions through the API gives STING a single point at which to limit requests, which helps protect against flooding attacks in which a user attempts to submit a harmful number of amino acid sequences.

STING will provide SCORPION with a new, aesthetically pleasing, and 508 compliant Web site design. Adhering to 508 standards will allow users with disabilities to access the SCORPION Web site and will also satisfy the U.S. law that applies to all federally funded departments and agencies.

STING will implement an optional user login which will give users more convenient access to their previous prediction results. As it stands, a user of SCORPION must access previous submissions by searching through their email inbox. STING will ensure that previous submissions do not get lost in a user’s email inbox. The login will also allow SCORPION’s administrators to learn more about their users.

Automatic sequence sanitation will make using SCORPION much more convenient for users. As it stands, if a SCORPION user enters an amino acid sequence which contains invalid characters, such as non-alphabetic characters or whitespace, the submission is rejected. The user must manually remove invalid characters to have their submission accepted.
Instead of hand-typing a protein sequence as input, users will often copy and paste large sequences which have been preformatted to contain whitespace. Manually removing whitespace and invalid characters is time consuming and makes the sequence submission process inconvenient. STING will provide the user with the option to automatically remove invalid characters, making the submission process faster and more efficient for the user.

An estimated wait time will give users an idea of when they can expect to receive their prediction results. As it stands, a SCORPION user does not know if they will receive their prediction results in a matter of hours, days, or weeks. An estimated prediction time may be especially helpful to a user who is working to meet a deadline.

Tracking visitor statistics such as page views and geographical demographics will give SCORPION’s administrators insight into how many users are utilizing SCORPION and who those users are. STING’s primary goal is to improve accessibility to SCORPION’s prediction software. The best method for measuring whether STING has succeeded in making SCORPION more accessible is to measure and record the number of people using SCORPION’s service. Tracking visitor statistics will provide administrators with consistent traffic feedback, allowing them to monitor traffic increases and declines. Recording specific page hits will enable administrators to understand which content users find most useful and which content may be unnecessary.

1.2 Scope

STING’s prototype will be nearly identical to the proposed end-product. It will maintain nearly all of the functionality of the end-product, including protein sequence sanitation and estimated prediction time.
The essential distinction between the prototype and the end-product is that the prototype will use a mock version of Dr. Li’s protein prediction software. Table 1 compares the features of the real-world-product to STING’s prototype. Any feature which is not listed will be identical in the real-world-product and the prototype.

| Features/Components | Real-World-Product | Prototype |
|---|---|---|
| Protein Secondary Structure Prediction Results | Results will be accurate predictions made by Dr. Li’s Neural Network | Results will be randomly generated sequences intended to simulate prediction results |
| Web Server | The Web server will be hosted by ODU’s SCORPION Web server | The Web server will be hosted by ODU’s CS 411 Web server |
| User login | Users will have the ability to log in through a third-party account | Users will have the ability to log in through a Google account |
| User account | Users will have the ability to submit additional personal information about themselves to the user database | Users will have the ability to view their username and email address, which have been retrieved from the user database |
| Estimated Prediction Time | The estimated prediction time will be based upon multiple timed experiments completed over the course of several weeks | The estimated prediction time will be based upon one timed experiment completed over the course of one week |

Table 1: Comparison Between Real-World-Product and Prototype

The goals and objectives of the prototype are to produce and demonstrate a fully functional model of SCORPION’s new features. The Web site template will display the new layout and will be 508 compliant. The API will allow other services to take advantage of SCORPION’s features while enabling them to create a custom platform over HTTP.
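Table 1 states that the prototype returns randomly generated sequences in place of real predictions. A minimal Python sketch of such a stand-in is given here for illustration only. The three-letter output alphabet (H for helix, E for strand, C for coil), the function name mock_prediction, and the optional seed parameter are all assumptions; this specification does not fix the output alphabet or the implementation language of the mock predictor.

```python
import random

# Assumed three-state secondary-structure alphabet:
# H = helix, E = strand, C = coil. Not specified by this document.
STATES = "HEC"


def mock_prediction(sequence, seed=None):
    """Return one randomly chosen structure label per residue.

    The result only simulates a prediction; it carries no
    biological meaning. Passing a seed makes the output repeatable,
    which is convenient when demonstrating the prototype.
    """
    rng = random.Random(seed)
    return "".join(rng.choice(STATES) for _ in sequence)


if __name__ == "__main__":
    seq = "MKTAYIAKQR"  # hypothetical ten-residue input
    print(seq)
    print(mock_prediction(seq, seed=411))
```

Keeping the mock output the same length as the input lets the results Web pages be laid out exactly as they would be for a real prediction.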
1.3 Definitions, Acronyms, and Abbreviations

508 Compliance: Adhering to guidelines established to make Web site content equally accessible to people with disabilities
Amino Acids/Residues: The building blocks of proteins
API: Application Programming Interface (an abstract way for services to communicate)
Cross-validation Training: The process of dividing training data into k mutually exclusive subsets (folds) of roughly equal size, where some subsets are used for training, validating, and testing. The process is repeated k times.
Data cleansing: The process of removing non-representative instances from the data set
Dunbrack Lab: Part of the Fox Chase Cancer Research Center. Recognized for normalizing data from the RCSB.
ETL: Extract, Transform and Load. Refers to the manipulation of data.
FASTA: Format widely adopted in bioinformatics to make it easier to manipulate and parse sequences
Fold: The fold of an amino acid sequence forms the protein’s secondary structure
GeoIP: Uses a lookup table of Internet Protocol addresses with known municipalities and providers to match IP origin
GUI: Graphical User Interface
JSON: JavaScript Object Notation
NSF: National Science Foundation
PC: Personal Computer
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool used for deriving the PSSM
PSSM: Position-Specific Scoring Matrix which includes information about evolutionary relatives of the original protein sequence
RCSB Protein Data Bank: Research Collaboratory for Structural Bioinformatics database. The database holds all known and recognized protein sequences.
REST: A REST API is a set of operations that can be invoked by means of any of the four verbs, using the actual URI as parameters for the operations.
The four verbs are GET, POST, PUT, and DELETE.
SCORPION: SeCOndaRy structure PredictION
Training set: Set of instances from the problem domain used to train the algorithm
VM: Virtual machine
XML: Extensible Markup Language

1.4 References

Biological Macromolecular Resource. (n.d.). RCSB Protein Data Bank. Retrieved February 20, 2014, from http://www.rcsb.org/pdb/home/home.do

Blue Team. (n.d.). SCORPION Protein Prediction Timed Experiment. Retrieved February 11, 2014, from www.cs.odu.edu/~410blue/CS410SCORPIONProteinPredictionTimeExperiment.xlsx

Cancer Research Funding - National Cancer Institute. (2013, August 23). Cancer Research Funding - National Cancer Institute. Retrieved May 8, 2014, from http://www.cancer.gov/cancertopics/factsheet/NCI/research-funding

Freitas, R. (1998, January 1). Nanomedicine. Chapter 3 page 1. Retrieved May 8, 2014, from http://www.foresight.org/Nanomedicine/Ch03_1.html

Heron, M. (2014, July 14). Leading Causes of Death. Retrieved September 12, 2014, from http://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm

Lab 1 – STING Product Description. Version 2. (2014, October). STING Team. Blue Team. CS411W: Jasmine Jones

Murphy, S. (2013, May 8). Deaths: Final Data for 2010. Retrieved May 8, 2014, from http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_04.pdf

Northwestern University. (2012, January 8). New hope for diseases of protein folding such as Alzheimer’s, Parkinson’s diseases, ALS, cancer and diabetes. ScienceDaily. Retrieved September 12, 2014, from www.sciencedaily.com/releases/2012/01/120106135946.htm

RCSB PDB - Histograms. (n.d.). RCSB PDB - Histograms. Retrieved May 8, 2014, from http://www.rcsb.org/pdb/statistics/histogram.do?mdcat=mvStructure&mditem=residueCount&name=Residue%20Count

Robins, D., & Holmes, J. (2008). Aesthetics and credibility in Web site design.
Information Processing and Management, 44(Evaluation of Interactive Information Retrieval Systems), 386-399. doi:10.1016/j.ipm.2007.02.003

Section 508. (n.d.). United States Department of Health and Human Services. Retrieved March 15, 2014, from http://www.hhs.gov/web/508/index.html

Section 508 Checklist. (n.d.). Retrieved September 17, 2014, from http://webaim.org/standards/508/checklist

Section 508 Of The Rehabilitation Act. (n.d.). Section 508 Home. Retrieved March 15, 2014, from http://www.section508.gov/Section-508-Of-The-Rehabilitation-Act

Sillence, E., Briggs, P., Harris, P., & Fishwick, L. (2007). How do patients evaluate and make use of online health information? Social Science & Medicine, 64, 1853-1862. doi:10.1016/j.socscimed.2007.01.012

What is protein folding? (n.d.). Retrieved October 16, 2014, from http://fold.it/portal/info/science

Yaseen, A., & Li, Y. Context-based Features Enhance Protein Secondary Structure Prediction Accuracy.

1.5 Overview

This product specification provides a description of STING’s prototype architecture, functionality, algorithms, and interfaces. The hardware and software which will be used are described in detail. Additionally, in-depth requirements depicting how to implement STING are described in section 3.1.

2 General Description

STING is a Web service which builds upon a preexisting protein secondary structure prediction service called SCORPION. STING will consist of a Web site, STING’s primary user interface, and an API, STING’s secondary user interface. STING’s primary user interface will have a professional design and be fully 508 compliant. STING will allow a user to submit a protein sequence and email address through either of STING’s two interfaces. STING will return mock prediction results to a user via email and, optionally, via a user login account. STING will provide users with the ability to log in and access previous prediction results.
STING will also track user statistics and allow administrators to log in and view them.

2.1 Prototype Architecture Description

Whether STING is accessed through its Web site or through its API, two hardware components are necessary: a personal computer (PC) and a PHP Web server. A PC will be necessary for a user to access STING’s user interface (Web site or API), and the PHP Web server will be used to host STING’s Web site and support STING’s API. There will also be six software/virtual components necessary for STING: a Web browser with an Internet connection, a RESTful API, a Web page template, a service called “OpenID,” a database, and a service called “Google Analytics.” The relationship between these components is illustrated in Figure 1.

Figure 1: STING Hardware and Software Components

2.2 Prototype Functional Description

A Web browser with an Internet connection will be necessary for a user to access STING’s user interface. A RESTful API will be used to allow other applications to use STING’s mock prediction software. Specifically, the RESTful API will be incorporated into the queuing of protein submissions (jobs) and into protein sequence sanitation. This will enable other applications to view (GET) the list of current jobs and to submit (POST) a new job. A Web page template will be used as SCORPION’s primary graphical user interface (GUI). OpenID is a third-party service which will be used to implement the user login. OpenID will allow a user to log in to STING with a preexisting third-party account such as Google or Facebook. A database will be required to store logged-in user information and protein sequence prediction results.
Specifically, the database will require a table to store user login OpenIDs, a table to store sequence submissions, a table to link OpenIDs to the sequence submission results, and a table to store optionally provided user information. Google Analytics is a third-party service which will be used to record user statistics such as Web page hits and visitors’ IP addresses.

2.3 External Interfaces

STING will utilize two user interfaces: a Web site and an API. Additionally, STING will use two algorithms: a protein sequence sanitation algorithm and an estimated prediction time algorithm. Both algorithms pertain to the protein sequence submission form.

The protein sequence sanitation algorithm, shown in Figure 2, is used to validate the input of a protein sequence submission. The user is not required to provide an email address if they are logged in because they can choose to view their protein structure prediction results in their user history area. Additionally, a logged-in user can provide their email address through their user account area.

Figure 2: Protein Sequence Sanitation Algorithm

The estimated prediction time algorithm, shown in Figure 3, will be used to calculate how long the user can expect to wait for their prediction results. The algorithm is simple and is based on a timed experiment which concluded that each amino acid character adds approximately 2.13 seconds to the estimated prediction time (CS410 Blue Team). For example, a 100-character sequence yields an estimate of roughly 213 seconds. The prediction time will be displayed both on the submission form page, before the user has submitted their sequence, and on the thank you page, after the user has submitted their sequence.

Figure 3: Estimated Prediction Time Algorithm

2.3.1 Hardware Interfaces

STING will be hosted by a Web server which contains all of the hardware components necessary to produce STING’s services. No hardware interfacing is anticipated to produce STING’s services.
To utilize STING, the user will need to have a PC with an Internet connection.

2.3.2 Software Interfaces

STING will communicate with the Google Analytics third-party software through the Google Analytics API. STING will also communicate with Google’s OAuth2 Master API in order to provide a user login which enables a user to log in through their Google account. The sitemap in Figure 4 illustrates how Google’s login and Google Analytics will be incorporated into STING’s Web site.

Figure 4: Login Sitemap of Prototype

2.3.3 User Interfaces

STING’s primary user interface will be STING’s Web site. The Web site will consist of a Web site template, protein sequence submission form, thank you Web page, home Web page, contact Web page, about Web page, login Web page, admin account Web page, user information Web page, Google Analytics Web page, user account Web page, set of results Web pages, and an expired results Web page. How these Web site elements connect to each other is illustrated in Figure 5 and in Figure 4, located in section 2.3.2.

Figure 5: Main Sitemap of Prototype

Every Web page that is a part of STING’s Web site must be 508 compliant. The protein sequence submission form is the principal method by which a user can submit a protein sequence to STING’s prediction software. The central method by which a user will receive their prediction results will also be provided through this Web site. Additionally, users and administrators of STING will be able to log in to STING through this Web site. A visual aid of what STING’s Web site could look like is illustrated in Figure 6.
Figure 6: Web site Prototype Visual Aid

2.3.4 Communications Protocols and Interfaces

STING’s RESTful API will allow other applications to communicate directly with STING over Hypertext Transfer Protocol (HTTP). The backend of the API will use Dr. Li’s preexisting SCORPION binary code in the simulation of SCORPION’s prediction software. PHP will be used for the submission (POST) and retrieval (GET) of protein sequences. The submissions (jobs) and server load will be monitored, and each job will have a unique ID. The API will also be used when emailing the predicted results to the user. The frontend of the API will utilize XML (Extensible Markup Language) and JSON (JavaScript Object Notation), as both are very common data formats.
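
The job flow described in sections 2.2 through 2.3.4 (sanitize the sequence, estimate the wait, assign a unique ID, and expose the queue to POST and GET requests) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the prototype's PHP implementation. The names sanitize_sequence, estimated_seconds, JobQueue, submit, and list_jobs are hypothetical, as are the JSON field names, the uppercasing of the sequence, and the use of a random UUID as the job ID. The only values taken from this document are the rule that non-alphabetic characters and whitespace are invalid and the figure of 2.13 seconds per amino acid character.

```python
import json
import uuid

# Per-character estimate reported by the timed experiment in section 2.3.
SECONDS_PER_RESIDUE = 2.13


def sanitize_sequence(raw):
    """Remove whitespace and every other non-alphabetic character.

    Uppercasing the result is an assumption made for consistency.
    """
    return "".join(ch for ch in raw if ch.isascii() and ch.isalpha()).upper()


def estimated_seconds(sequence):
    """Estimated prediction time: sequence length times 2.13 seconds."""
    return round(len(sequence) * SECONDS_PER_RESIDUE, 2)


class JobQueue:
    """In-memory stand-in for the job store behind the API (hypothetical)."""

    def __init__(self):
        self._jobs = {}

    def submit(self, raw_sequence, email=None):
        """Body of a POST handler: sanitize, queue, and answer in JSON."""
        sequence = sanitize_sequence(raw_sequence)
        if not sequence:
            return json.dumps({"error": "no amino acid characters in sequence"})
        job_id = uuid.uuid4().hex  # assumed scheme for the unique job ID
        self._jobs[job_id] = {
            "id": job_id,
            "sequence": sequence,
            "email": email,
            "estimated_seconds": estimated_seconds(sequence),
            "status": "queued",
        }
        return json.dumps(self._jobs[job_id])

    def list_jobs(self):
        """Body of a GET handler: all current jobs as a JSON array."""
        return json.dumps(list(self._jobs.values()))


if __name__ == "__main__":
    queue = JobQueue()
    # Pasted input with whitespace and a stray digit, as described in 1.1.
    print(queue.submit("mkt ayi\nakq r1", email="user@example.com"))
    print(queue.list_jobs())
```

Keeping sanitation and the time estimate as plain functions, separate from the queue, would let the Web form and the API share the same logic; an XML response could be produced from the same job dictionary.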