SCORPION Product Description

SCORPION – Blue Team
Old Dominion University
CS411 – Janet Brunelle
Authors: Stanley Zheng
Last Modified: Sept 20, 2014
Version: 1

TABLE OF CONTENTS
1 INTRODUCTION
2 SCORPION PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 IDENTIFICATION OF CASE STUDY
4 SCORPION PRODUCT PROTOTYPE DESCRIPTION
4.1 Prototype Architecture (Hardware/Software)
4.2 Prototype Features and Capabilities
4.3 Prototype Development Challenges
GLOSSARY
REFERENCES

LIST OF FIGURES
Figure 1: Website Hardware Components
Figure 2: Protein Sequence Validation Algorithm
Figure 3: Estimated Protein Structure Prediction Time Algorithm
Figure 4: Valid sequence rules

1 INTRODUCTION

Bioinformatics has accelerated the scientific community's understanding of genetics and shortened the time between the lab and the market for breakthrough medical advances. Fundamentally, all living organisms carry DNA, which is composed of millions of base pairs and encodes proteins. Proteins are chains of amino acids, sometimes thousands of residues long ("RCSB PDB - Histograms", n.d.). How these amino acids interact with each other determines how a protein folds and which structures it forms. At the intersection of DNA and computer algorithms are tools such as neural network models that statistically predict how an amino acid sequence folds into its resulting protein. The data on which these tools are built improves daily as laboratories culture and sequence terabytes of DNA into protein databases.

Old Dominion University Professor Dr. Yaohang Li created the SCORPION (SeCOndaRy structure PredictION) neural network in pursuit of one of the field's most accurate protein prediction models. SCORPION is an actively used neural network, trained on the solved amino acid sequences within the RCSB Protein Data Bank (Research Collaboratory for Structural Bioinformatics database).
The current interface to the system is the public-facing website hosted by Dr. Li, which uses web forms to submit large amino acid sequences. This is not dissimilar to many of Dr. Li’s competitors, but improving the interfaces through which researchers access SCORPION will reward and grow the community that uses his tools. Therefore, the proposal includes giving SCORPION a modern website and a RESTful public web service API. The website will include features that aid researchers in submitting, cataloging and retrieving protein sequence submissions. The API will allow researchers to use their own tools and platforms to interact programmatically with SCORPION over HTTP.

2 SCORPION PRODUCT DESCRIPTION

The system is composed of three sections: a website, a web API service and the program known as SCORPION. SCORPION is a single compiled program hosted on a server run by Old Dominion University. The website is currently a means of passing input to SCORPION through a web form and receiving the results by email once the job has finished.

SCORPION itself is a neural network composed of a combination of algorithms and data designed to predict secondary structure protein folding. These predictions are many times faster than using traditional X-ray crystallography to determine experimentally how these sequences fold. A neural network is a type of machine learning algorithm that uses a system of nodes designed to predict the classification of a collection of data. Neural networks are trained with training sets, a series of previously classified and processed amino acid sequences whose structures are known. This allows scientists to test the accuracy of the neural network by measuring the trained model against well-documented sets of proteins. SCORPION is periodically retrained to take into account larger sets of known protein structures that have been added to the international protein knowledge base.
Replacing the SCORPION website and adding an API will not affect any part of the training system. The system is highly modular, so the stakeholder can implement or decouple either of the proposed solutions independently.

2.1 KEY PRODUCT FEATURES AND CAPABILITIES

The website will offer a better user experience and be easier to use once features such as user profiles are added. With profiles in place, researchers can track their progress, and the stakeholder gains a better idea of who is using the system. Profiles also allow other core features such as job tracking and analytics to be built. However, this service will remain wholly optional. The login feature will ask for volunteered information, but the site can be used without a profile.

Adding an Application Programming Interface (API) implemented over HTTP, the protocol that underlies the Web, opens the SCORPION platform to new users. Representational State Transfer (REST) implies using the common verbs that browsers use to facilitate data transactions. For example, when a user visits google.com, the browser submits a GET request to google.com in the background. In the context of SCORPION, the RESTful API will allow a researcher to write a script in the language of their choice that automatically uploads a spreadsheet of sequences from a program on their computer.

2.2 MAJOR COMPONENTS (HARDWARE/SOFTWARE)

As stated previously, the system is composed of three sections: a website, a web API service and the program known as SCORPION. After the neural network is trained and the software packaged, it can be executed on any Linux system that meets the requirements. To run efficiently, the SCORPION software requires a supercomputer with multiple high-powered graphics processing units (GPUs) and a Linux operating system runtime. Using multiple GPUs, the program will be able to take advantage of threading.
Threading will allow multiple processes to execute simultaneously and process several protein sequence jobs at once.

The other services, the website and the API, can live on a separate server, networked with SCORPION. The website portion of SCORPION will be hosted on a web server, which will also host the user login system and analytics platform. The API will be hosted on the same server. One caveat is that SCORPION requires PSI-BLAST, a third-party service that formats a submitted protein sequence into a Position-Specific Scoring Matrix (PSSM) before it can be submitted to the neural network. All hardware components are already in use with the current implementation of SCORPION, and additional server resources have been provisioned.

Figure 1: Website Hardware Components

Any computer with a web browser and an Internet connection will allow the user to access SCORPION. The website layout will comply with Section 508 standards, as required by the NSF grant stipulations. Google Analytics, a third-party service, will record user statistics such as webpage hits and visitors’ IP addresses. Another required third-party service is OpenID. It removes the need for SCORPION to implement and host user credentials while still allowing lookup by profile.

The API will be accessible from any language or software that supports HTTP. The software will be written in PHP and will support a RESTful interface. The API will run alongside the website, as both use the same runtime.

Both the API and the website will support two algorithms, each of which pertains to the protein sequence submission process. The data sanitization algorithm (Figure 2) validates the input of a protein sequence submission. It assists in identifying and normalizing sequences for PSI-BLAST before submission, including removing special, invalid and whitespace characters.
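
The sanitization step described above can be sketched roughly as follows. This is a minimal illustration and not the project's implementation: Python is used only for brevity (the document states the services will be written in PHP), and the names `sanitize_sequence` and `validation_errors` are hypothetical. The allowed alphabet and the 40-character minimum are taken from the valid sequence rules in Figure 4. The sketch also assumes that sanitization strips only whitespace, digits and special characters, while disallowed letters are reported to the user, since silently deleting letters would change the sequence.

```python
import string

# Assumption: valid residues are A-Z minus B, J, O, U, X (Figure 4 rules).
EXCLUDED_LETTERS = frozenset("BJOUX")
VALID_RESIDUES = frozenset(string.ascii_uppercase) - EXCLUDED_LETTERS
MIN_LENGTH = 40  # minimum sequence length from the Figure 4 rules


def sanitize_sequence(raw: str) -> str:
    """Keep ASCII letters only and uppercase them.

    Whitespace, digits and special characters are dropped.
    """
    return "".join(ch.upper() for ch in raw if ch in string.ascii_letters)


def validation_errors(sequence: str) -> list:
    """Return human-readable rule violations for a sanitized sequence."""
    errors = []
    invalid = sorted(set(sequence) - VALID_RESIDUES)
    if invalid:
        errors.append("invalid characters: " + "".join(invalid))
    if len(sequence) < MIN_LENGTH:
        errors.append(f"sequence shorter than {MIN_LENGTH} characters")
    return errors
```

For example, `sanitize_sequence(" acd 12-ef\n")` returns `"ACDEF"`, which `validation_errors` would then flag as too short. Keeping the two steps separate lets a web form offer sanitization as an optional action while validation always runs.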
An additional sequence time prediction algorithm (Figure 3) will be run on each job to give an estimated time of completion. Experimentation has shown that each character adds approximately 2.13 seconds. An estimated prediction time will be assigned to each submitted sequence.

Figure 2: Protein Sequence Validation Algorithm

Figure 3: Estimated Protein Structure Prediction Time Algorithm

3 IDENTIFICATION OF CASE STUDY

Dr. Yaohang Li and his team are constantly improving the algorithms and data available to the neural networks to produce even better protein prediction results. The SCORPION neural network was a project of his PhD student at Old Dominion University, Ashraf S. Yaseen, for almost two years. Predecessors to SCORPION have used the same system architecture to interface with their neural networks. Therefore, the proposed website and API can also serve as a way to interface with future bioinformatics neural network prediction models.

The API and website improve the way the general public can access bioinformatics tools created by Old Dominion University, which will encourage more usage by outside parties. An API provides an open platform for other researchers and companies to integrate the tools into their preexisting systems. The new SCORPION system will be usable by small labs or pharmaceutical companies looking to verify their own prediction results. This makes the tool flexible and reduces the friction of accommodating new software. In the future, public APIs will likely power a majority of traffic on the Internet (Jacobson).

4 SCORPION PRODUCT PROTOTYPE DESCRIPTION

The prototype of the SCORPION web services will demonstrate a better user experience for users of the SCORPION platform. It will enable researchers to submit sequences and retrieve predicted results programmatically.
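
To make programmatic submission concrete, the sketch below builds (but does not send) an HTTP POST request for one sequence and computes the estimated completion time from the figure of approximately 2.13 seconds per character reported in section 2.2. It is illustrative only: the endpoint URL, the JSON payload shape and the function names are assumptions, since this document does not specify the API's routes or formats, and Python stands in for whatever language a researcher prefers.

```python
import json
import urllib.request

# Hypothetical endpoint; the document does not define the API's URL scheme.
API_URL = "https://scorpion.example.org/api/jobs"

# Approximate per-character cost reported from experimentation (section 2.2).
SECONDS_PER_CHARACTER = 2.13


def estimate_seconds(sequence: str) -> float:
    """Estimated prediction time, linear in sequence length."""
    return len(sequence) * SECONDS_PER_CHARACTER


def build_submission(sequence: str) -> urllib.request.Request:
    """Build a POST request carrying the sequence as JSON (payload shape assumed)."""
    body = json.dumps({"sequence": sequence}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A script could pass the returned request to `urllib.request.urlopen` and use `estimate_seconds` to decide when to check back for results. The shape of the server's response is left open here because the document does not define it.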
Reducing the time and complexity of integrating with SCORPION makes it more valuable for researchers to begin and continue using SCORPION.

4.1 PROTOTYPE ARCHITECTURE (HARDWARE/SOFTWARE)

The additional services for SCORPION will be developed in PHP. Parts of the current system are written in PHP, which allows reuse and extension of the existing codebase. PHP is a supported runtime that aligns with the other software available to the stakeholders. To fulfill the requirement of storing user ID sessions for the website, a persistent data source is required. SQLite was chosen as a lightweight database that offers familiar SQL syntax without the overhead of a database server such as MySQL. It does not require setting up or configuring a database server and is stored as a single portable file.

The current website will be replaced with a more mainstream design that still uses HTML and JavaScript. The website will use client-side JavaScript to validate protein sequence submissions before they are sent to the server. The website will allow users who authenticate with the service to discover and retrieve submissions in one interface. The system currently relies heavily on email as the sole method of retrieving and cataloging predicted submissions.

To tie the services together, the frontend website will interact with SCORPION through the API. This demonstrates that the API is platform agnostic and able to perform every operation available through the web interface. The API will open Dr. Li’s platform to more systems and will also improve reliability by stabilizing the way sequences are submitted. Tracking jobs in the system ensures that all submitted sequences are eventually answered. Currently, SCORPION jobs remain unfulfilled if the server is overloaded and resources are scarce.

4.2 PROTOTYPE FEATURES AND CAPABILITIES

The prototype will demonstrate a two-part solution to improve SCORPION.
The redesigned website, powered by the API, will handle users’ interactions with SCORPION.

The user-facing side of SCORPION will be a web application that replaces the current design. The website will meet Section 508 web accessibility standards to comply with the federal funding requirements of the National Science Foundation. The redesign will offer a user experience similar to the original website and will provide a mirror of the original site if the user prefers it.

One feature of the redesign will be client-side validation and data sanitization tools. Ideally, the validation ensures that the input sequence is normalized to FASTA specifications. The prototype will provide rule-based highlighting on input (Figure 4) that helps the user correct sequences. A common way researchers enter sequences is to copy and paste them into the text area field. Optional sanitization buttons will be available for large pasted sequences that may have multiple inconsistencies and formatting issues.

Rules
1. No special characters
2. No numerics
3. All alphabetical characters excluding B, J, O, U, X
4. Alphabet case insensitive
5. Minimum sequence length 40 characters
Figure 4: Valid sequence rules

The prototype will demonstrate logged-in user capabilities using OpenID. Users who opt in to log in will be able to manage their data on the site and be informed about changes to the site. This login system will also let the system stakeholder measure usage of the site.

The prototype will use Google Analytics embedded in its pages. Google Analytics supports GeoIP and can determine where a visitor originates. In addition, it can provide tracking, including duration of visit and user utilization of the site.
Google Analytics gives the stakeholder a way to review activity and all of this associated information within the application.

Finally, the prototype will offer an API following RESTful principles. Programmatically, the user can submit requests and perform all actions available on the website. Users submitting sequences with sufficient parameters will be returned a token URL at which to find their results later.

Since this is a multiple-part solution, there are system and subsystem risks. On a system level, the largest risks are failure of adoption and failure of implementation. To get the most benefit out of the system, Dr. Li would adopt both parts of the solution. Throughout development, a tenet of the project is to maintain backwards compatibility at all times. The client-facing systems will maintain a similar user interface and the ability to use the original design. The solution was designed around proven technologies, making support and documentation plentiful. Modularity allows piecewise integration of the solution, and test coverage focused on regression testing supports a successful system transition. The server code currently in production is not documented and would benefit from a refactor to reduce possible future bugs and downtime.

4.3 PROTOTYPE DEVELOPMENT CHALLENGES (JASMINE JONES)

Challenges present themselves when integrating new features with a preexisting platform. SCORPION is a well-used existing system, and new functionality should not break current use cases. This restricts solutions to those that are modular and backwards compatible with the existing architecture. The stakeholder, Dr. Li, wishes to ensure continued uptime of SCORPION for all of its current users. Integrating with the current software to deliver new features is treacherous when the original developer is not readily available to answer questions.
Addressing the challenges starts from recognizing that the solution has two phases: implementing a public API and replacing the current web interface. With the SCORPION website, the challenge lies in privacy concerns and user acceptance.

The API is a new feature to streamline interfacing with SCORPION. The fact that the desired data is already present in the current system confirms that building a public API is feasible. The API would provide insight into the jobs processed by SCORPION by making sure submitted sequences are not lost. With regard to system uptime, Dr. Li’s system currently has no recourse if the service becomes unavailable. Improving the software will increase stability and reliability during high-usage periods. The API will reinforce this by creating a standard pipeline for accessing SCORPION.

The stakeholder is worried about collecting any personal data on the system. The suggestion to implement OpenID offers the advantage of profiles without hosting any private information. No passwords will be stored on Old Dominion University servers, and as an extra safeguard OpenSSL will be used to encrypt user information. Overall, this system is opt-in, and users are not required to log in to submit sequences to SCORPION.

User satisfaction and acceptance are addressed through adoption of the new interface. The design will keep a layout similar to its predecessor, with an option to revert. To assess acceptance, analytics will be added to all pages so that stakeholders better understand how the site is being used.

GLOSSARY

Amino Acids/Residues: The building blocks of proteins
API: Application Programming Interface (abstract way for services to communicate)
Cross-validation Training: The process of dividing training data into k mutually exclusive subsets (folds) of roughly equal size, where some subsets are used for training, validating, and testing. The process is repeated k times.
Data cleansing: The process of removing non-representative instances from the data set.
Dunbrack Lab: Part of the Fox Chase Cancer Research Center. Recognized for normalizing data from the RCSB
ETL: Extract, Transform and Load. Refers to the manipulation of data
FASTA: Format widely adopted in bioinformatics to make it easier to manipulate and parse sequences
GeoIP: Uses a lookup table of Internet Protocol addresses with known municipalities and providers to match IP origin
GUI: Graphical User Interface, the visual client-facing software
NSF: National Science Foundation
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool, used for deriving the PSSM
PSSM: Position-Specific Scoring Matrix, which includes information about evolutionary relatives of the original protein sequence
RCSB Protein Data Bank: Research Collaboratory for Structural Bioinformatics database. The database holds known and recognized protein structures and their sequences.
REST: A REST API is a set of operations that can be invoked by means of any of the four verbs (GET, POST, PUT, DELETE), using the actual URI as parameters for the operations.
SCORPION: SeCOndaRy structure PredictION
STING: Streamlined Training In Neural-network GUI
Training set: Set of instances from the problem domain used to train the algorithm
508 Compliance: Adhering to guidelines established to make website content equally accessible to people with disabilities

REFERENCES

Biological Macromolecular Resource. (n.d.). RCSB Protein Data Bank. Retrieved Feb. 20, 2014, from http://www.rcsb.org/pdb/home/home.do
Blue Team. (n.d.). SCORPION Protein Prediction Timed Experiment. Retrieved February 11, 2014, from www.cs.odu.edu/~410blue/CS410SCORPIONProteinPredictionTimeExperiment.xlsx
Cancer Research Funding - National Cancer Institute. (2013, August 23). Cancer Research Funding National Cancer Institute. Retrieved May 8, 2014, from http://www.cancer.gov/cancertopics/factsheet/NCI/research-funding
Freitas, R. (1998, January 1). Nanomedicine. Chapter 3 page 1. Retrieved May 8, 2014, from http://www.foresight.org/Nanomedicine/Ch03_1.html
Jacobson, Daniel. "1." APIs: A Strategy Guide. Sebastopol, CA: O'Reilly, 2012. N. Print.
Murphy, S. (2013, May 8). Deaths: Final Data for 2010. Retrieved May 8, 2014, from http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_04.pdf
RCSB PDB - Histograms. (n.d.). RCSB PDB - Histograms. Retrieved May 8, 2014, from http://www.rcsb.org/pdb/statistics/histogram.do?mdcat=mvStructure&mditem=residueCount&name=Residue%20Count
Section 508. (n.d.). United States Department of Health and Human Services. Retrieved March 15, 2014, from http://www.hhs.gov/web/508/index.html
Section 508 Of The Rehabilitation Act. (n.d.). Section 508 Home. Retrieved March 15, 2014, from http://www.section508.gov/Section-508-Of-The-Rehabilitation-Act
Yaseen, A., & Li, Y. Context-based Features Enhance Protein Secondary Structure Prediction Accuracy.