Running head: Lab 1 – STING Product Description

Lab 1 – STING Product Description
Team Blue
Jasmine Jones
CS411W
Professors Janet Brunelle
October 16, 2014
Version: 2.0

TABLE OF CONTENTS
1 INTRODUCTION
2 STING PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 IDENTIFICATION OF CASE STUDY
4 STING PRODUCT PROTOTYPE DESCRIPTION
4.1 Prototype Architecture
4.2 Prototype Features and Capabilities
4.3 Prototype Development Challenges

LIST OF FIGURES
Figure 1: SCORPION Hardware Components
Figure 2: Protein Sequence Sanitation Algorithm
Figure 3: Estimated Prediction Time Algorithm
Figure 4: Prototype Hardware and Software Components
Figure 5: Sitemap of Prototype Login

LIST OF TABLES
Table 1: Comparison Between Real-World-Product and Prototype

GLOSSARY
REFERENCES

1 INTRODUCTION

Cancer, Alzheimer's, Parkinson's, ALS, and type 2 diabetes are just five of the more than three hundred diseases which result from improper protein structures in the body (Northwestern University). In 2010, cancer alone was the second-leading cause of death in the United States, taking the lives of 576,691 people (Heron). Proteins, which are in every cell of the human body, play a vital role in carrying out almost all of the body’s functions (What is protein). Some of these functions include breaking down food for muscle support, sending signals through the brain to control the body, and transporting nutrients through the blood (What is protein). When a protein has an improper structure, it can lead to cellular dysfunction and cellular death (Northwestern University).

Proteins are formed from a string of amino acids, which fold together to create a protein structure.
To understand the amino acid folding process more clearly, consider the following simplified analogy. Imagine holding a string of yarn by one end and letting it dangle down in a straight line. Envision that many small magnets run down the length of the yarn. Slowly lower the string onto a table, allowing it to coil into a shape. The magnets will cling together and a structure will be formed. This structure is the protein, and the magnets are the amino acids. The string of amino acids is referred to as the protein’s “primary structure,” and the way in which the primary structure folds is called the protein’s “secondary structure.” An identical amino acid sequence will always fold into the same secondary structure (What is protein).

A protein’s secondary structure dictates what role it will play in the body (What is protein). The structure can be thought of as a “key” which can only carry out a certain function if it is the right shape to fit in the “lock” (What is protein). Predicting a protein’s secondary structure from its primary structure is an important goal for curing protein-related diseases, because it will allow scientists to create new proteins which they can then use to combat disease-related proteins (What is protein).

Currently, the most accurate protein secondary structure prediction service available is SCORPION at Old Dominion University (ODU). While SCORPION encompasses the most accurate protein prediction software available, SCORPION’s website (its graphical user interface) lacks the professional design and user features to match its high quality of service. Aesthetic appeal is critical; a study by Kent State University found that in just 3.42 seconds, participants had judged a website’s credibility based on aesthetic appeal (Robins).
Another study by Northumbria University found that 94 percent of participants mistrusted and rejected health-related websites based on their design factors (Sillence). SCORPION has the most accurate protein secondary structure prediction service available, but it may be overlooked because of its poor aesthetic appeal and interface functionality.

The solution to this problem is to provide SCORPION with a professional Web design, competitive Web functionality, and additional Web tools, including an application programming interface (API). An API is a defined set of operations through which one application can communicate with another. Adding an API to SCORPION would enable users to bypass SCORPION’s graphical user interface and instead access SCORPION’s prediction software directly. Improving SCORPION’s aesthetic appeal, functionality, and accessibility will allow SCORPION to contribute its full potential to the fight against protein-related diseases. This new, enhanced product will be named STING.

2 STING PRODUCT DESCRIPTION

STING is a Web service which will build upon the preexisting Web service SCORPION. SCORPION is a Web service which provides protein secondary structure prediction. To utilize SCORPION, a user visits the SCORPION website and submits a Web form with their email address and a string of alphabetical characters. Each alphabetical character represents an amino acid; amino acids are the building blocks of a protein. The string of alphabetical characters, representing an amino acid sequence, is the input which SCORPION uses to predict the protein’s secondary structure. Once SCORPION has predicted the structure, a Web page which displays the results is created and stored on SCORPION’s Web server. An email containing the URL of the results Web page is then sent to the user.
STING will enhance SCORPION’s preexisting service with Web tools, a professional Web design, and competitive Web features in order to improve SCORPION’s aesthetic appeal, functionality, and accessibility.

Three Web tools will be utilized by STING: a representational state transfer application programming interface (RESTful API), website visitor statistics, and an administrator login through which an administrator can access visitor statistics. A RESTful API is an API that follows the REST architectural constraints; it will make SCORPION’s submission processing more efficient and make SCORPION’s prediction software accessible to other applications. In practical terms, a RESTful API will enable a third party, such as a different website or a mobile application, to bypass SCORPION’s website and instead submit a protein sequence directly to SCORPION’s prediction software.

STING will showcase a professional Web design which adheres to Section 508 standards. Section 508 standards are a set of accessibility requirements, drawn from Section 508 of the Rehabilitation Act, which make website content equally accessible to people with disabilities (Section 508).

STING will include three competitive Web features: an estimated wait time for prediction results, a login through which users can access previous prediction results, and automatic sequence sanitation. Automatic sequence sanitation means that when a user enters a sequence of amino acids (alphabetic letters) which contains invalid characters, they will have the option to automatically remove (or “sanitize”) the invalid characters, rather than removing them manually.

2.1 KEY PRODUCT FEATURES AND CAPABILITIES

One of the most important new tools is STING’s RESTful API. STING’s primary goal is to improve accessibility to SCORPION’s prediction software. A RESTful API accomplishes this by allowing other applications to utilize STING through a common interface.
Additionally, because submissions made through the API are queued and monitored, the API helps protect against flooding attacks, which would occur if a user attempted to submit a harmful number of amino acid sequences.

SCORPION’s Web design is also important. When a user is pleased with a website’s aesthetic appeal, they are more likely to explore the website’s content and features. Additionally, adhering to Section 508 standards will not only allow users with disabilities to access the SCORPION website, but will also satisfy the U.S. law which applies to all federally funded departments and agencies.

Implementing an optional user login will give users more convenient access to their previous prediction results. As it stands, a user of SCORPION must access previous submissions by searching through their email inbox. STING will ensure that previous submissions do not get lost in a user’s email inbox. It will also allow SCORPION’s administrators to gain more information about their users.

Automatic sequence sanitation will make using SCORPION much more convenient. As it stands, if a SCORPION user enters an amino acid sequence which contains invalid characters, such as non-alphabetic characters or whitespace, the submission is rejected. The user must manually remove invalid characters to have their submission accepted. Instead of hand-typing a protein sequence as input, users will often copy and paste large sequences which have been preformatted to contain whitespace. Manually removing whitespace and invalid characters is time consuming and makes the sequence submission process inconvenient. STING will provide the user with the option to automatically remove invalid characters, making the submission process faster and more efficient for the user.

An estimated wait time for prediction results will provide users with an idea of when they can expect to receive their prediction results.
As it stands, a SCORPION user does not know if they will receive their prediction results in a matter of hours, days, or weeks. Providing an estimated prediction time gives users a concrete idea of when they will receive results, which may be especially helpful to a user who is working to meet a deadline.

Tracking visitor statistics such as page views and geographical demographics will give SCORPION’s administrators insight into how many users are utilizing SCORPION and who those users are. Because STING’s primary goal is to improve accessibility to SCORPION’s prediction software, the best method for measuring whether STING has succeeded is to measure and record the number of people using SCORPION’s service. Tracking visitor statistics will provide administrators with consistent traffic feedback, allowing them to monitor traffic increases and declines. Recording specific page hits will enable administrators to understand which content users find most useful and which content may be unnecessary.

2.2 MAJOR COMPONENTS (HARDWARE/SOFTWARE)

Whether STING is accessed through its website or through its API, four hardware components are necessary: a personal computer (PC), a PHP Web server, a service called “PSI-BLAST,” and SCORPION’s prediction software, called a “Neural Network.” The relationship between these components is illustrated in Figure 1. A PC will be necessary for a user to access STING’s user interface (website or API). The PHP Web server will be used to host STING’s website and support STING’s API. PSI-BLAST is a third-party service which reformats the submitted protein sequence to prepare it for SCORPION’s prediction software. SCORPION’s Neural Network is the software which predicts protein secondary structures.
Figure 1: SCORPION Hardware Components

There will also be six software/virtual components necessary for STING: a Web browser with Internet connection, a RESTful API, a Web page template, a service called “OpenID,” a database, and a service called “Google Analytics.” A Web browser with Internet connection will be necessary for a user to access STING’s user interface. A RESTful API will be used to allow other applications to use SCORPION’s prediction software. Specifically, the RESTful API will be incorporated into the queuing of protein submissions (jobs) and into protein sequence sanitation. This will enable other applications to view (GET) the list of current jobs and to submit (POST) a new job. A Web page template will be used as SCORPION’s primary graphical user interface (GUI). OpenID is a third-party service which will be used to implement the user login. OpenID will allow a user to log in to STING with a preexisting third-party account such as Google or Facebook. A database will be required to store logged-in user information and protein sequence prediction results. Specifically, the database will require a table to store user login OpenIDs, a table to store sequence submissions, a table to link OpenIDs to the sequence submission results, and a table to store optionally provided user information. Google Analytics is a third-party service which will be used to record user statistics such as Web page hits and visitors’ IP addresses.

STING will use two algorithms: a protein sequence sanitation algorithm and an estimated prediction time algorithm. Both algorithms pertain to the protein sequence submission form. The protein sequence sanitation algorithm, shown in Figure 2, is used to validate the input of a protein sequence submission.
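As a rough illustration of the sanitation step in Figure 2, the following Python sketch validates a submitted sequence and, when the user opts in, strips invalid characters. The prototype itself is planned to perform this step in JavaScript; Python is used here only for illustration. The function names, and the rule that every ASCII letter counts as a valid amino acid character, are assumptions, since the report states only that non-alphabetic characters and whitespace are invalid.

```python
def is_valid_sequence(sequence: str) -> bool:
    """Return True if the sequence is non-empty and contains only ASCII letters."""
    return sequence.isascii() and sequence.isalpha()


def sanitize_sequence(sequence: str) -> str:
    """Remove every character that is not an ASCII letter (digits, whitespace, symbols)."""
    return "".join(ch for ch in sequence if ch.isascii() and ch.isalpha())


def process_submission(sequence: str, auto_sanitize: bool) -> str:
    """Accept a valid sequence as-is; otherwise sanitize it if the user opted in.

    Raises ValueError when the sequence is invalid and cannot (or may not) be
    sanitized, mirroring the rejection of an invalid submission.
    """
    if is_valid_sequence(sequence):
        return sequence
    if auto_sanitize:
        cleaned = sanitize_sequence(sequence)
        if cleaned:  # reject input that contained no letters at all
            return cleaned
    raise ValueError("sequence contains invalid characters")
```

With `auto_sanitize` enabled, a pasted sequence such as `"MKT AY\n12IAK"` would be accepted as `"MKTAYIAK"`; with it disabled, the same input would be rejected.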
The user is not required to provide an email address if they are logged in, because they can choose to view their protein structure prediction results in their user history area. Additionally, a logged-in user can provide their email address through their user account area.

Figure 2: Protein Sequence Sanitation Algorithm

The estimated prediction time algorithm, shown in Figure 3, will be used to calculate the estimated duration of time that the user will wait to receive their prediction results. The algorithm is simple and is based on a timed experiment which concluded that each amino acid character adds approximately 2.13 seconds to the estimated prediction time (CS410 Blue Team). The prediction time will be displayed both on the submission form page, before the user has submitted their sequence, and on the thank-you page, after the user has submitted their sequence.

Figure 3: Estimated Prediction Time Algorithm

3 IDENTIFICATION OF CASE STUDY

SCORPION was developed by Ashraf Yaseen, a PhD student at ODU, and his PhD advisor, Dr. Yaohang Li. During SCORPION’s development, the focus was the accuracy of SCORPION’s protein secondary structure prediction software, rather than SCORPION’s website functionality or design. Dr. Li is overseeing the development of SCORPION’s new features, making him one of SCORPION’s target customers; he will decide whether or not the new features from the prototype will be implemented into the existing product.

SCORPION will also be catering to its primary users: computational biologists, pharmaceutical companies, research students, and geneticists. These are the individuals who rely on protein secondary structure prediction on a regular basis to progress in their work. Dr. Li has emphasized that in just one day, SCORPION can enable the completion of work that might take years in a lab.
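The estimated prediction time calculation described in Section 2.2 (Figure 3) reduces to a single multiplication. The Python sketch below assumes the estimate is simply the sequence length multiplied by the approximately 2.13 seconds per amino acid character reported by the timed experiment. The function names and the human-readable formatting are illustrative assumptions and do not come from the report.

```python
SECONDS_PER_RESIDUE = 2.13  # per-character cost reported by the timed experiment


def estimate_seconds(sequence: str) -> float:
    """Estimated prediction time in seconds: one fixed cost per amino acid character."""
    return len(sequence) * SECONDS_PER_RESIDUE


def format_estimate(seconds: float) -> str:
    """Render a duration as hours, minutes, and seconds (hypothetical display format)."""
    total = round(seconds)
    minutes, secs = divmod(total, 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes}m {secs}s"
```

Under these assumptions, a 100-character sequence yields an estimate of roughly 213 seconds, or about three and a half minutes.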
4 STING PRODUCT PROTOTYPE DESCRIPTION

STING’s prototype will be nearly identical to the proposed end product. It will maintain nearly all of the functionality of the end product, including protein sequence sanitation and estimated prediction time. The essential distinction between the prototype and the end product is that the prototype will use a mock version of Dr. Li’s protein prediction software. Table 1 compares the features of the real-world product to those of STING’s prototype. Any feature which is not listed will be identical in the real-world product and the prototype.

Feature/Component: Protein Secondary Structure Prediction Results
Real-World-Product: Results will be accurate predictions made by Dr. Li’s Neural Network
Prototype: Results will be randomly generated sequences intended to simulate prediction results

Feature/Component: Web Server
Real-World-Product: The Web server will be hosted by ODU’s SCORPION Web server
Prototype: The Web server will be hosted by ODU’s CS 411 Web server

Feature/Component: User login
Real-World-Product: Users will have the ability to log in through a third-party account
Prototype: Users will have the ability to log in through a Google account

Feature/Component: User account
Real-World-Product: Users will have the ability to submit additional personal information about themselves to the user database
Prototype: Users will have the ability to view their username and email address which have been retrieved from the user database

Feature/Component: Estimated Prediction Time
Real-World-Product: The estimated prediction time will be based upon multiple timed experiments completed over the course of several weeks
Prototype: The estimated prediction time will be based upon one timed experiment completed over the course of one week

Table 1: Comparison Between Real-World-Product and Prototype

The goals and objectives of the prototype are to produce and demonstrate a fully functional model of SCORPION’s new features. The website template will display the new layout and will be 508 compliant.
The API will allow other services to take advantage of SCORPION’s features while enabling them to create a custom platform over HTTP.

4.1 PROTOTYPE ARCHITECTURE

Similar to the end product, the prototype will require a personal computer (PC) and a PHP Web server. The prototype will be accessible through either the API or the website.

The backend of the API will use Dr. Li’s preexisting SCORPION binary code in the simulation of SCORPION’s prediction software. PHP will be used for the submission (POST) and retrieval (GET) of protein sequences. The submissions (jobs) and the server load will be monitored, and each job will have a unique ID; monitoring the server load will help to prevent flooding attacks. The API will also be used when emailing the predicted results to the user. The frontend of the API will utilize XML (Extensible Markup Language) and JSON (JavaScript Object Notation), as both are very common data formats.

The website will also use PHP for its server-side code, meaning PHP will be used when a user submits a protein sequence through the Web form. PHP is already used in the preexisting implementation of SCORPION, so the transition will be easy. Additionally, PHP code can be organized into modules, which makes it easier to isolate and modify specific functions. The database portion of the website will use SQL, specifically SQLite 3, which is suited to moderate-traffic websites that require large data storage. SQLite 3 is also supported by PHP, making it a practical choice. The website template will be designed using XHTML and CSS3, both of which are current standards in Web design. 508 compliance will be ensured by using the 28 guidelines from the checklist at WebAIM.org, all of which are excerpted from Section 508 of the Rehabilitation Act, §1194.22 (Section 508 Checklist). The Web form sequence sanitation will be completed using JavaScript. Because JavaScript runs in the user’s browser, this will keep the burden off of the Web server.
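To make the database design more concrete, the four tables described in Section 2.2 (user login OpenIDs, sequence submissions, a link between the two, and optionally provided user information) could be laid out in SQLite along the following lines. This is only a sketch, written with Python's built-in sqlite3 module for illustration; every table name, column name, and helper function is hypothetical, since the report does not specify a schema.

```python
import sqlite3

# Hypothetical schema: one table per item listed in Section 2.2.
SCHEMA = """
CREATE TABLE users (
    user_id INTEGER PRIMARY KEY,
    openid  TEXT NOT NULL UNIQUE          -- identifier returned by the OpenID login
);
CREATE TABLE submissions (
    submission_id INTEGER PRIMARY KEY,
    sequence      TEXT NOT NULL,          -- submitted amino acid sequence
    result        TEXT                    -- prediction result, NULL until finished
);
CREATE TABLE user_submissions (           -- links OpenIDs to submission results
    user_id       INTEGER NOT NULL REFERENCES users(user_id),
    submission_id INTEGER NOT NULL REFERENCES submissions(submission_id),
    PRIMARY KEY (user_id, submission_id)
);
CREATE TABLE user_info (                  -- optionally provided user information
    user_id INTEGER PRIMARY KEY REFERENCES users(user_id),
    email   TEXT,
    name    TEXT
);
"""


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the database and install the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def history_for(conn: sqlite3.Connection, openid: str) -> list:
    """Return the sequences previously submitted by the user with this OpenID."""
    rows = conn.execute(
        """SELECT s.sequence
           FROM submissions s
           JOIN user_submissions us ON us.submission_id = s.submission_id
           JOIN users u ON u.user_id = us.user_id
           WHERE u.openid = ?
           ORDER BY s.submission_id""",
        (openid,),
    )
    return [row[0] for row in rows]
```

Keeping the link in its own table lets a submission exist without a logged-in user, which matches the report's statement that the login is optional.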
OpenID will be used for the user login. OpenID is a commonly used tool which allows users to log in with a third-party account. The way in which the prototype hardware and software components will interact with each other is illustrated in Figure 4.

Figure 4: Prototype Hardware and Software Components

4.2 PROTOTYPE FEATURES AND CAPABILITIES

As the prototype is nearly identical to the end product, it will demonstrate all of STING’s new functionality. The API will demonstrate how requests can be submitted over HTTP, how the status of a single job or multiple jobs can be accessed, and that its documentation is public. The website will display the professional, 508-compliant template design. It will show that when a user submits a protein sequence, they will have the convenient option of choosing to automatically sanitize their input. A login, as illustrated by the sitemap in Figure 5, will give users a way to connect with SCORPION and to easily retrieve their previous prediction results. Additionally, the administrator will be able to log in and view user statistics from Google Analytics, as well as view information from the database about users who have logged in.

Figure 5: Sitemap of Prototype Login

To mitigate risks that may be faced, SCORPION’s new features have been designed to build upon the preexisting SCORPION product. Everything that has been added is fully compatible with the preexisting SCORPION. Additionally, for the user login and user statistics, well-established third-party services will be used.

4.3 PROTOTYPE DEVELOPMENT CHALLENGES

The development of the prototype website is expected to encounter four challenges: ensuring 508 compliance, addressing the new design, accurately estimating prediction time, and securing logged-in user information. It is important to ensure that the website meets 508 standards.
If the standards are not met, the website may be inaccessible to users with disabilities. Additionally, because all federally funded websites are required to be 508 compliant, the project could lose federal funding if it fails to comply. To ensure 508 compliance, the website will be extensively tested using 508 compliance verification tools, such as those found at W3.org/WAI.

Another challenge will be addressing how users will react to the new design and features. Users will be unfamiliar with the changes and may resist them. To counter this, the page layout will remain similar to the previous design and the changes will be announced on the website’s homepage. Additionally, instructions for using SCORPION will be provided.

Providing an accurate estimate of the protein structure prediction time is another consideration. If the estimated prediction time is inaccurate, the user may become frustrated, especially if they have a deadline to meet. To accommodate slight variations in the duration of the prediction time, a time window rather than a single fixed time will be given.

Security of logged-in user information is also very important. To keep user information secure, OpenID will be used for the user login, so no passwords will be stored on the SCORPION system. Additionally, SSL will be implemented to protect any additional information the user chooses to provide. Further, a user who does not have a third-party account is not required to log in to benefit from SCORPION.

The API will have its own challenges: securing the API, ensuring proper interfacing, and ensuring administrator access to user statistics. The API will be open to the public, making it vulnerable to attacks. A RESTful API typically exposes four commands, which correspond to HTTP methods: GET, POST, DELETE, and PUT.
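A minimal sketch of how a job API might dispatch on these four commands is shown below in Python (the prototype itself is planned in PHP). It assumes an in-memory job list, sequential job IDs, and JSON responses. The class name, the status codes chosen, and the decision to answer DELETE and PUT with an explicit refusal are illustrative assumptions rather than the team's design.

```python
import json
from itertools import count


class JobApi:
    """Hypothetical in-memory job queue that dispatches on the HTTP method name."""

    def __init__(self):
        self._jobs = {}         # job ID -> job record
        self._ids = count(1)    # source of unique, sequential job IDs

    def get(self, job_id=None):
        """GET: list all jobs, or return one job when an ID is given."""
        if job_id is None:
            return 200, json.dumps(list(self._jobs.values()))
        job = self._jobs.get(job_id)
        if job is None:
            return 404, json.dumps({"error": "no such job"})
        return 200, json.dumps(job)

    def post(self, sequence):
        """POST: queue a new job and give it a unique ID."""
        job_id = next(self._ids)
        self._jobs[job_id] = {"id": job_id, "sequence": sequence, "status": "queued"}
        return 201, json.dumps(self._jobs[job_id])

    def put(self, *args):
        """PUT: defined explicitly so the request is refused, never left undefined."""
        return 405, json.dumps({"error": "PUT is not supported"})

    def delete(self, *args):
        """DELETE: defined explicitly so the request is refused, never left undefined."""
        return 405, json.dumps({"error": "DELETE is not supported"})

    def handle(self, method, *args):
        """Route a request to its handler; unrecognized methods are rejected."""
        handlers = {"GET": self.get, "POST": self.post,
                    "PUT": self.put, "DELETE": self.delete}
        handler = handlers.get(method.upper())
        if handler is None:
            return 501, json.dumps({"error": "method not implemented"})
        return handler(*args)
```

In this sketch, `handle("POST", "MKTAYIAK")` queues a job, `handle("GET")` lists the queue, and any attempt to modify or remove a job is answered with a refusal instead of reaching the job store.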
Even though SCORPION’s API will only utilize the GET and POST commands, the other two commands, DELETE and PUT, will still be explicitly defined as functions, to make certain that an attacker cannot exploit or alter these commands.

It may be a challenge to incorporate the API with the existing SCORPION resources. The API will reference SCORPION’s resources by the address of their file location. If one of SCORPION’s files is moved, this could pose a problem. To prevent this problem, the API will be structured so that if a file address changes, the system will have instructions for what to do and will not crash.

GLOSSARY

508 Compliance: Adhering to guidelines established to make website content equally accessible to people with disabilities
Amino Acids/Residues: The building blocks of proteins
API: Application Programming Interface (an abstract way for services to communicate)
Cross-validation Training: The process of dividing training data into k mutually exclusive subsets (folds) of roughly equal size, where some subsets are used for training, validating, and testing. The process is repeated k times.
Data cleansing: The process of removing non-representative instances from the data set
Dunbrack Lab: Part of the Fox Chase Cancer Research Center. Recognized for normalizing data from the RCSB
ETL: Extract, Transform and Load. Refers to the manipulation of data
FASTA: Format widely adopted in bioinformatics to make it easier to manipulate and parse sequences
Fold: The fold of an amino acid sequence forms the protein’s secondary structure
GeoIP: Uses a lookup table of Internet Protocol addresses with known municipalities and providers to match IP origin
GUI: Graphical User Interface
JSON: JavaScript Object Notation
NSF: National Science Foundation
PC: Personal Computer
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool, used for deriving the PSSM
PSSM: Position-Specific Scoring Matrix, which includes information about evolutionary relatives of the original protein sequence
RCSB Protein Data Bank: Research Collaboratory for Structural Bioinformatics database. The database holds all known and recognized protein sequences.
REST: Representational State Transfer. A REST API is a set of operations that can be invoked by means of any of the four verbs (GET, POST, PUT, DELETE), using the URI as the parameter for the operation.
SCORPION: SeCOndaRy structure PredictION
Training set: Set of instances from the problem domain used to train the algorithm
VM: Virtual machine
XML: Extensible Markup Language

REFERENCES

Biological Macromolecular Resource. (n.d.). RCSB Protein Data Bank. Retrieved February 20, 2014, from http://www.rcsb.org/pdb/home/home.do

Blue Team. (n.d.). SCORPION Protein Prediction Timed Experiment. Retrieved February 11, 2014, from www.cs.odu.edu/~410blue/CS410SCORPIONProteinPredictionTimeExperiment.xlsx

Cancer Research Funding - National Cancer Institute. (2013, August 23). Cancer Research Funding - National Cancer Institute. Retrieved May 8, 2014, from http://www.cancer.gov/cancertopics/factsheet/NCI/research-funding

Freitas, R. (1998, January 1). Nanomedicine. Chapter 3 page 1. Retrieved May 8, 2014, from http://www.foresight.org/Nanomedicine/Ch03_1.html

Heron, M. (2014, July 14). Leading Causes of Death. Retrieved September 12, 2014, from http://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm

Murphy, S. (2013, May 8). Deaths: Final Data for 2010. Retrieved May 8, 2014, from http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_04.pdf

Northwestern University. (2012, January 8). New hope for diseases of protein folding such as Alzheimer’s, Parkinson’s diseases, ALS, cancer and diabetes. ScienceDaily. Retrieved September 12, 2014, from www.sciencedaily.com/releases/2012/01/120106135946.htm

RCSB PDB - Histograms. (n.d.). RCSB PDB - Histograms. Retrieved May 8, 2014, from http://www.rcsb.org/pdb/statistics/histogram.do?mdcat=mvStructure&mditem=residueCount&name=Residue%20Count

Robins, D., & Holmes, J. (2008). Aesthetics and credibility in website design. Information Processing and Management, 44(Evaluation of Interactive Information Retrieval Systems), 386-399. doi:10.1016/j.ipm.2007.02.003

Section 508. (n.d.). United States Department of Health and Human Services. Retrieved March 15, 2014, from http://www.hhs.gov/web/508/index.html

Section 508 Checklist. (n.d.). Retrieved September 17, 2014, from http://webaim.org/standards/508/checklist

Section 508 Of The Rehabilitation Act. (n.d.). Section 508 Home. Retrieved March 15, 2014, from http://www.section508.gov/Section-508-Of-The-Rehabilitation-Act

Sillence, E., Briggs, P., Harris, P., & Fishwick, L. (2007). How do patients evaluate and make use of online health information? Social Science & Medicine, 64, 1853-1862. doi:10.1016/j.socscimed.2007.01.012

What is protein folding? (n.d.). Retrieved October 16, 2014, from http://fold.it/portal/info/science

Yaseen, A., & Li, Y. Context-based Features Enhance Protein Secondary Structure Prediction Accuracy.