CS 411W LAB I - PRODUCT DESCRIPTION DOCUMENT

1 INTRODUCTION
  1. Societal Problem
     a. Biology: protein prediction
     b. Website
  2. Solutions
     a. API
        i. Extends Scorpion
        ii. Adds to ODU research standing
     b. Website
        i. More professional
        ii. Added functionality
        iii. User statistics

2 SCORPION PRODUCT DESCRIPTION
  1. RESTful API
     a. Application agnostic; communicates over HTTP
     b. Queues and schedules jobs
  2. Website
     a. User friendly
     b. Added features (login)
     c. Tracking and statistics

2.1 Key Product Features and Capabilities
  1. API
     a. API allows more people to use Scorpion
     b. Security and protection
     c. Common interface
  2. Website Functionality
     a. Improved design
     b. Competitive features
        i. History tracking
        ii. Automated input sanitization

2.2 Major Components (Hardware/Software)
  1. Hardware - API
     a. Scorpion Neural Network
     b. PHP Server
     c. PSI-BLAST
     d. PC
  2. Software/Virtual - API
     a. RCSB Protein Database (Provided)
     b. Dunbrack Labs Protein Database (Provided)
     c. RESTful API
        i. GET/POST
        ii. Input sanitization
        iii. Queueing jobs
  3. Hardware - Website
     a. PC
     b. PHP Web Server
     c. PSI-BLAST (Third Party/Provided)
     d. Scorpion Neural Network
  4. Software/Virtual - Website
     a. Web browser + Internet connection
     b. Webpage design template
     c. Google Analytics (Third Party/Provided)
     d. Database
        i. Protein sequence & user database
           1. OpenID table
           2. Sequence submissions table
           3. User data table
           4. User sequence submissions link table
  5. Algorithms - Website
     a. Protein sequence submission
        i. Input sanitization
        ii. Estimated prediction time

3 IDENTIFICATION OF CASE STUDY
  1. Scorpion
     a. History
     b. Highest accuracy
  2. Target Audience
     a. Professor Li
     b. Biologists around the world (overall scientific knowledge)
     c. More researchers

4 SCORPION PRODUCT PROTOTYPE DESCRIPTION
  1. Interface
     a. Protein chain input - no reduced functionality
        i. Data sanitization
        ii. Character counting to produce an estimated prediction time
  2. PROTOTYPE FUNCTIONAL GOALS and OBJECTIVES
     1. Improve the interface to Scorpion
        a. Website
           i. Improved website 508 compliance
        b. API
           i. Platform agnostic: with an API, users can use their language and platform of choice over HTTP. Provides a hook into Scorpion.

4.1 Prototype Architecture (Hardware/Software)
<How will the prototype be structured to demonstrate key features of the 410 product. Prototype MFCD provided and described.>
  1. API
     a. API - Backend
        i. Dr. Li's c3Scorpion Binary
        ii. Handling POST requests
           1. Security (attacks)
           2. General POST handling - PHP
           3. Returning job ID and information - PHP
        iii. Sending data through the NN
        iv. Getting job information
           1. One job
           2. All current jobs
        v. Emailing result
     b. API - Frontend
        i. Format of GET responses
           1. XML/JSON
           2. User friendly
  2. Website
     a. Server-side code in PHP
        i. Already used in the website before the solution
        ii. Module based, for ease of adding and removing functions
     b. SQL
        i. Standard in the database profession
        ii. Why SQL instead of NoSQL
        iii. SQLite3 chosen for ease of use
        iv. Database kept separate from the training part of the project
     c. HTML/JavaScript
        i. Benefit: processing runs on the client instead of the server
        ii. jQuery - standard in web development
        iii. openid.net for login - a known product used by other websites

4.2 Prototype Features and Capabilities
Demonstrates
  1. API
     a. Submit and schedule requests over HTTP
     b. Get the status of all jobs on the queue
     c. Get the status of a single job
     d. Public documentation
  2. Website
     a. Web layout in compliance with 508 web standards
     b. Client-side tools for users to sanitize input
     c. Opt-in user registration system
     d. User tracking solution covering usage by registered and unregistered users
Mitigate Risk
  1. Backwards compatible with the existing C3 Scorpion system
  2. Use free, open-source tools for user registration and tracking

4.3 Prototype Development Challenges
  1. Website
     a. 508 compliance
     b. New, unfamiliar webpage design
     c. Accuracy of estimated prediction time
     d. User security
  2. API
     a. Public-facing API has to be secured
     b. Interfacing with the existing Scorpion infrastructure and codebase
     c. Ensuring that the admin can access statistics

Glossary:
Amino Acids/Residues: The building blocks of proteins.
Training set: Set of instances from the problem domain used to train the algorithm.
Data cleansing: The process of removing non-representative instances from the data set.
Cross-validation Training: The process of dividing training data into k mutually exclusive subsets (folds) of roughly equal size, where some subsets are used for training, validating, and testing. The process is repeated k times.
PSSM: Position-Specific Scoring Matrix, which includes information about evolutionary relatives of the original protein sequence.
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool, used for deriving the PSSM.

References:
1. "Biological Macromolecular Resource." RCSB Protein Data Bank. N.p., n.d. Web. 20 Feb. 2014. <http://www.rcsb.org/pdb/home/home.do>.
2. CS410 Blue Team. "Scorpion Protein Prediction Timed Experiment." N.p., 11 Feb. 2014. Web. <www.cs.odu.edu/~410blue/CS410ScorpionProteinPredictionTimeExperiment.xlsx>.
3. Yaseen, Ashraf, and Yaohang Li. "Context-based Features Enhance Protein Secondary Structure Prediction Accuracy."
4. "Section 508." United States Department of Health and Human Services. N.p., n.d. Web. 15 Mar. 2014. <http://www.hhs.gov/web/508/index.html>.
5. "Section 508 Of The Rehabilitation Act." Section 508 Home. N.p., n.d. Web. 15 Mar. 2014. <http://www.section508.gov/Section-508-Of-The-RehabilitationAct>.
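
Illustrative sketch: The outline lists input sanitization and a character-count-based estimated prediction time for protein sequence submission, but does not say how either is computed. The following is a minimal sketch under stated assumptions, not the project's implementation. It is written in Python for brevity, although the prototype names PHP and JavaScript. The function names, the restriction to the 20 standard one-letter amino acid codes, and the linear timing model with caller-supplied coefficients are all hypothetical; real coefficients would have to come from measurements such as the timed experiment in the references.

```python
# Hypothetical sketch: names and the linear timing model are assumptions,
# not taken from the Scorpion codebase.

# One-letter codes of the 20 standard amino acids.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def sanitize_sequence(raw: str) -> str:
    """Strip whitespace, upper-case the input, and reject unknown residue codes."""
    cleaned = "".join(raw.split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    invalid = sorted(set(cleaned) - STANDARD_RESIDUES)
    if invalid:
        raise ValueError("invalid residue codes: " + ", ".join(invalid))
    return cleaned


def estimate_seconds(sequence: str, seconds_per_residue: float,
                     overhead_seconds: float = 0.0) -> float:
    """Estimate prediction time from the character count.

    Assumes (hypothetically) that time grows linearly with sequence length;
    both coefficients are placeholders to be fitted from measured runs.
    """
    return overhead_seconds + seconds_per_residue * len(sequence)
```

A caller would sanitize first and estimate second, for example `estimate_seconds(sanitize_sequence(user_input), seconds_per_residue=rate)`, where `rate` is a measured value. Rejecting invalid input before it is queued keeps malformed sequences away from the neural network.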