Lab 1 – SCORPION Product Description
SCORPION - 1
SCORPION Product Description
SCORPION – Blue Team
Old Dominion University
CS410 – Janet Brunelle
Authors: David Eason, Jasmine Jones, Jack Muratore, Alexander Rudasi, Kelly Shott, Stanley Zheng
Last Modified: May 7, 2014
Version: 1
TABLE OF CONTENTS
1 INTRODUCTION (Alexander Rudasi)
2 SCORPION PRODUCT DESCRIPTION (Kelly Shott)
2.1 Key Product Features and Capabilities (Jack Muratore)
2.2 Major Components (Hardware/Software) (STING: David Eason, Website: Jasmine Jones)
3 IDENTIFICATION OF CASE STUDY (Alexander Rudasi)
4 SCORPION PRODUCT PROTOTYPE DESCRIPTION (Kelly Shott)
4.1 Prototype Architecture (Hardware/Software) (STING: David Eason, Website: Jack Muratore)
4.2 Prototype Features and Capabilities (Stanley Zheng)
4.3 Prototype Development Challenges (Jasmine Jones)
LIST OF FIGURES
Figure 1: Website Hardware Components
Figure 2: Estimated Protein Structure Prediction Time Algorithm
Figure 3: Protein Sequence Validation Algorithm
Figure 4: Improved Process for Training
GLOSSARY
REFERENCES
1 INTRODUCTION (ALEXANDER RUDASI)
Cancer was the second leading cause of death in the United States in 2010 according to the
United States Centers for Disease Control and Prevention and is expected to overtake heart disease in
the following years (Murphy, 2013). Billions of dollars per year are spent on cancer research, with the
National Cancer Institute alone spending $4.9 billion per year ("Cancer Research Funding", 2013).
Solving this large-scale human problem will require a fuller understanding of how the
human body works. Because, excluding water, the majority of our bodies is composed of
various proteins, understanding proteins is paramount to solving this problem (Freitas, 1998).
Proteins are composed of strings of amino acids, sometimes thousands of amino acids long ("RCSB
PDB - Histograms", n.d.), and how these amino acids interact with each other determines how the
protein will fold. How the protein folds determines what it does within an organism and how it
interacts with its surroundings. Statistically predicting how an amino acid sequence folds
into its resulting protein has been a primary goal of bioinformatics researchers for many
decades.
One of the most accurate recent efforts toward this goal is the SCORPION
(SeCOndaRy structure PredictION) neural network at Old Dominion University (ODU).
SCORPION is accessible to the public through a subsection of ODU's website. Its neural
network is trained using the currently solved amino acid sequences within the RCSB Protein
Data Bank (Research Collaboratory for Structural Bioinformatics database).
SCORPION's neural network should be regularly retrained to maintain the highest possible accuracy.
Retraining is currently done manually, which makes it time consuming and tedious and prevents
frequent retraining. Additionally, SCORPION's website lacks the professional design and user
features to match its high quality of service.
STING (Streamlined Training In Neural-network GUI) and the new SCORPION website are
designed to address both of these issues. STING will automate the retraining of the neural network,
and the new SCORPION website will add multiple features to make it more user friendly.
2 SCORPION PRODUCT DESCRIPTION (KELLY SHOTT)
SCORPION is software designed to predict the secondary structure of proteins using a neural
network. A neural network is a type of machine learning algorithm that uses a system of
interconnected nodes to predict the classification of input data. The data structure that makes up
the neural network is a graph of nodes that are connected to one another through a system
of weighted links. Sometimes the nodes of the network are called ‘neurons’ because the
design of a neural network’s decision-making ability is loosely based on that of a biological brain.
Classification begins when the instance data is passed into the input nodes. Each node applies
a function to the weighted values it receives and passes the result on to the nodes connected to it.
This passing and processing of data continues until the output nodes are reached, and their
activations indicate the determined classification. SCORPION is a feedforward neural network, which
means that data is passed only in the direction from input to output.
Neural networks must be trained before they can be used to make accurate
predictions. The process of training involves inputting a series of data instances for which the
classification is already known, called a training set. During the training process, the weights on
the connections between nodes are adjusted so that the network's predictions match the known
classifications. After
the training set has been completely processed, the testing set is input into the neural network to
ascertain how successful the training was. The test set is another set of instances with
known classifications. A correctly identified test set assures designers that the training correctly
calculated node weights, allowing activations to proceed in the appropriate sequence and resulting
in correct classifications. SCORPION received training a year ago and needs to be retrained to take
into account the larger set of known protein structures that have been added to the international
knowledge base. The current process for training SCORPION requires the preparation of the test
set and the calculation of data values by hand. The process is lengthy and limits the number of
times SCORPION can be retrained and the size of the test set that can reasonably be used.
2.1 KEY PRODUCT FEATURES AND CAPABILITIES (JACK MURATORE)
The overall goal of the project is to improve the feature set of the SCORPION website
and of its training portion, which has been named STING. The SCORPION website will be given a new
design with the improved functionality that users expect from an up-to-date website. User
website access statistics will be added to monitor its usage. An optional login feature will ask
for volunteered information and offer a collection of the user's past submissions. A separate
database that will not interfere with the Administrator login and tools will be provided to help
gather and send information to registered users.
STING is a new standalone application that can be run on a Windows computer to provide easy
access to new training automation features. In the past the training process was done manually, but
with STING different types of training options can be run with a single click of the mouse.
Schedules and past training runs will be stored so they can be viewed in the application. STING will
be a one-click training solution that will allow administrators to spend their time on other pressing
matters.
2.2 MAJOR COMPONENTS (HARDWARE/SOFTWARE) (STING: DAVID EASON, WEBSITE: JASMINE JONES)
The training portion of SCORPION will use three major hardware components. The first
component is the user’s personal computer (PC). This PC can be either a desktop or a laptop and must
run a Windows operating system. The next component is the network connection. The network
connection can be an intranet, the Internet, or both. All network connections need to be broadband or
faster to support large file transfers. The third hardware component is a supercomputer. The
supercomputer must have a Graphics Processing Unit (GPU) and run a Linux operating system.
With a GPU, the programs will be able to take advantage of threading. Threading will allow multiple
threads to execute simultaneously.
There are five major software components required for training: the graphical user interface
(GUI), interfacing, tracking, data manipulation, and training. Each of
the five components will consist of separate algorithms. The GUI component will have algorithms
that change display options in the window, send and receive information, show tracking
information, and initialize the interfacing component. The interfacing
component will involve the passing of user account credentials. These credentials will
allow the connection from the user’s PC to the supercomputer through the GUI. All remote
procedure calls will occur inside the interfacing component. The first set of remote procedure
calls makes up the data manipulation component, which will consist of
six algorithms. These algorithms will prepare training data for neural network processing. The
training component will take the manipulated data, start a MATLAB neural network application, and
feed the application the manipulated data seven times.
There are five main frontend algorithms associated with the GUI. The first algorithm,
WindowStart, will set up the basic style and organization of buttons in each window instance. This
algorithm will encompass all other algorithms involved in the frontend and initial interfacing
portion of the program. Depending on which actions are taking place between the user and the
application, some functions and buttons will be hidden to avoid user confusion. The second
algorithm, Tracking, reads and writes to an XML file during a training instance. This algorithm will
also allow the user to view this data as a table and to create a comma-separated
values (CSV) file from the XML. The third algorithm, dataSelect, allows the user to select a folder
containing all initial training data from the Dunbrack lab protein databank. The algorithm will show a
pop-up window for folder selection, accept only folders whose files have the .fasta file extension, and
parse the files in the folder. The fourth frontend algorithm, startSSH, will initialize the interfacing
session between the user’s PC and the supercomputer. This algorithm will allow the user to
input their username and password for connecting, create a secure shell (SSH) tunnel connection
between the systems, and call the remote procedure that exists on the supercomputer. The fifth and
final frontend algorithm, dataCleanse, will call a Perl script on the user’s PC that will normalize the
.fasta files that were selected by the dataSelect algorithm. After the script has been run, the
normalized files will be placed on a shared network drive used by both the user’s PC and the
supercomputer.
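As a rough illustration of the export step in the Tracking algorithm, the following Python sketch converts tracking XML into CSV text. The XML layout (a `runs` root with `run` elements carrying an `id` attribute and `started` and `status` children) and the function name are hypothetical, since the report does not specify the file format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical tracking fields; the real STING XML schema is not given in the report.
FIELDS = ["id", "started", "status"]


def tracking_xml_to_csv(xml_text: str) -> str:
    """Return CSV text with one row per <run> element in the tracking XML."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(FIELDS)
    for run in root.findall("run"):
        # "id" is read from the attribute, the other fields from child elements.
        row = [run.get("id", "")]
        row += [run.findtext(name, default="") for name in FIELDS[1:]]
        writer.writerow(row)
    return out.getvalue()
```

The same parsed rows could feed the data table shown in the GUI, so the table and the CSV export would stay in sync.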
The backend is where operations are performed out of the user’s view. This section of the
software will run entirely on the supercomputer and will be started by a single
procedure call from the user’s personal computer. There are five algorithms associated with the
training process in the backend. The first algorithm, TrainingStart, will encompass all other
algorithms involved in the backend training. This algorithm will control all other algorithms and
ensure that they execute in a specific sequential order. If there are any errors or exceptions during
training, this algorithm will log them and notify the user of a problem. The first algorithm within
this process, PSIBLAST, will iterate through all the files in the folder specified by the frontend. Here,
each file will be changed from having a single line to multiple lines. The files will have their file
extension changed to .PSSM and will now be matrix files. After this operation completes without
error, the next algorithm, contextClues, will start. This algorithm will run three Perl
scripts in sequence that will add columns to each of the .PSSM files, essentially adding clues to the
data for the neural network. After successful completion, the next algorithm,
sevenFold, will take all of the .PSSM files and randomize them into seven folders, labeled training0
through training6. Each folder will be divided into three sub-folders. Two of the sub-folders will
each contain 1/7 of the data: one will be labeled and used for testing, the other labeled and used for
validation. The remaining 5/7 of the data will be placed into the third sub-folder, labeled and used for
training. Once this is complete, the last algorithm, startNN, will start a MATLAB neural network
application. This algorithm will instantiate a MATLAB neural network via the command line and create a
folder in the directory named with WEIGHTS and the current day’s date. MATLAB will then be
given each folder, training0 through training6, in turn, moving on from each upon successful
completion. During this process, each training folder will be entered and each of its three sub-folders
will be parsed for .PSSM file input. After each training folder is successfully parsed, the
results will be put into a file named weight0 through weight6 inside the WEIGHTS folder.
Once training is finished, an email will be sent to notify the customer.
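The split performed by sevenFold can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the report does not say how the testing and validation sevenths are chosen for each folder, so the rotation used here (fold i tests on part i and validates on part i+1) is an assumption, as are the function name and the use of in-memory lists instead of folders on disk.

```python
import random


def seven_fold(files, seed=0):
    """Assign files to training0..training6, each with testing/validation/training."""
    k = 7
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)      # randomize the order of the .PSSM files
    parts = [shuffled[i::k] for i in range(k)]  # seven roughly equal parts
    folds = {}
    for i in range(k):
        test_idx, val_idx = i, (i + 1) % k      # assumed rotation, not from the report
        training = [
            name
            for j, part in enumerate(parts)
            if j not in (test_idx, val_idx)
            for name in part
        ]
        folds[f"training{i}"] = {
            "testing": parts[test_idx],         # 1/7 of the data
            "validation": parts[val_idx],       # 1/7 of the data
            "training": training,               # remaining 5/7 of the data
        }
    return folds
```

With this rotation every file is used for testing exactly once across the seven folders, which is the usual motivation for cross-validation.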
The website portion of SCORPION will require four hardware components, as shown in
Figure 1. The first component is a personal computer for the user to access the SCORPION website.
The second component is a PHP Web Server to host the website and allow added PHP functionality.
The third component is PSI-BLAST, a third-party service which formats each submitted protein
sequence into a Position Specific Scoring Matrix (PSSM) so that it is ready for our fourth
component. Our fourth and final component is the SCORPION Neural Network. The SCORPION
Neural Network is the component that predicts the structure of the protein sequence. All four of
these hardware components are already in use with the current implementation of SCORPION.
Figure 1: Website Hardware Components
There will also be four software requirements for the website portion of SCORPION. The
first is a Web browser with an Internet connection so that the user is able to access the website. The
second is a GUI or webpage template that will project ODU’s professional image and provide
accessibility to users with disabilities by complying with Section 508 standards. The third software
component is Google Analytics, a third-party service which will record user statistics such as
webpage hits and visitors’ IP addresses. The fourth component is a database to store logged-in user
information and protein sequence prediction results. Specifically, we will need a table in the
database to store user login OpenIDs, a table to store sequence submissions, a table to link
OpenIDs to sequence submissions, and a table to store optionally provided user information.
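The four tables described above could be laid out roughly as in the following sketch. It uses Python's built-in sqlite3 module purely for illustration (the site itself is planned in PHP), and every table and column name here is hypothetical, since the report names only the purpose of each table.

```python
import sqlite3

# Hypothetical schema: one table per purpose listed in the report.
SCHEMA = """
CREATE TABLE users (
    openid TEXT PRIMARY KEY                  -- user login OpenID
);
CREATE TABLE submissions (
    id INTEGER PRIMARY KEY,
    sequence TEXT NOT NULL,                  -- submitted protein sequence
    result TEXT                              -- prediction result, once available
);
CREATE TABLE user_submissions (              -- links OpenIDs to submissions
    openid TEXT NOT NULL REFERENCES users(openid),
    submission_id INTEGER NOT NULL REFERENCES submissions(id),
    PRIMARY KEY (openid, submission_id)
);
CREATE TABLE user_info (                     -- optionally provided information
    openid TEXT PRIMARY KEY REFERENCES users(openid),
    email TEXT,
    organization TEXT
);
"""


def create_database(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Keeping the link table separate from the submissions table lets anonymous submissions exist without any OpenID attached.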
For the website portion of SCORPION, we will be using two algorithms, both of which
pertain to the protein sequence submission form. The first algorithm, shown in Figure 3,
validates the input of a protein sequence submission. The user is not required to
provide an email address if they are logged in because they can choose to view their protein
structure prediction results in their user history area. Additionally, a logged-in user can provide
their email address through their user account area. The second algorithm, shown in
Figure 2, estimates the protein structure prediction time. The estimated prediction
time will be displayed to the user both on the submission form page before they have submitted
their sequence and on the thank-you page after they have submitted their sequence. The
algorithm is simple: based on our timed experiment, each
character (amino acid) adds approximately 2.13 seconds to the estimated prediction time. Once
we have the estimated prediction time in seconds, we convert it to the normalized form
of hours, minutes, and seconds.
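The estimate described above amounts to a multiplication followed by a unit conversion, as in this Python sketch. The 2.13 seconds per character comes from the text; the function name and the choice to round to the nearest whole second are assumptions.

```python
SECONDS_PER_RESIDUE = 2.13  # from the timed experiment described in the text


def estimate_prediction_time(sequence: str):
    """Return the estimated prediction time as (hours, minutes, seconds)."""
    # Rounding to whole seconds is an assumption; the report does not specify it.
    total_seconds = round(len(sequence) * SECONDS_PER_RESIDUE)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds
```

For example, a 100-residue sequence gives 213 seconds, which normalizes to 0 hours, 3 minutes, and 33 seconds.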
Figure 3: Protein Sequence Validation Algorithm
Figure 2: Estimated Protein Structure Prediction Time Algorithm
3 IDENTIFICATION OF CASE STUDY (ALEXANDER RUDASI)
The SCORPION neural network was a project of Ashraf S. Yaseen, a PhD student at Old
Dominion University, for almost two years. His PhD advisor, Dr. Yaohang Li, has been maintaining
the SCORPION neural network since its relative completion a year ago. Dr. Li, looking for a better
method to retrain the neural network, proposed turning the work into a Computer
Science 410 class project. During the initial development of SCORPION, finding
methods of increasing prediction accuracy was a higher priority than developing the system with
ease of retraining in mind. STING's only primary customers are Ashraf Yaseen and Dr.
Yaohang Li. Neural networks are used for other prediction systems at Old Dominion University's
Computer Science Department, which means it is possible that STING could be modified to automate
the training of other bioinformatics neural networks.
The original SCORPION website was developed as the SCORPION neural network was being
finished. It was initially designed with only basic functionality in mind, as the emphasis of SCORPION
was on being the most accurate secondary structure prediction model to date. The new
SCORPION website aims to add features which will encourage more usage from the general
public. The people most likely to use the new SCORPION website are other researchers working on
different or similar bioinformatics topics. The most likely business application of SCORPION
would be for pharmaceutical companies trying to shorten the often decade-long drug
testing process by predicting how an amino acid sequence will fold instead of using X-ray
crystallography to observe how the sequence folds. This makes it likely that the
new SCORPION website will be used by researchers and pharmaceutical companies who are
looking to verify their own prediction results.
4 SCORPION PRODUCT PROTOTYPE DESCRIPTION (KELLY SHOTT)
The prototype will demonstrate that the lengthy and complex process of training SCORPION
can be reduced to the simple and streamlined process of uploading a file and clicking a command
button. A user will be able to upload a file obtained from the PISCES server using a laptop and a GUI
running from an ODU server. This file will contain the raw, unmanipulated training set.
Training sets can be quite large, so a reasonably shortened training set will be used for purposes of
demonstration. An algorithm will be designed to take the initial training data set and remove from
it any instances that are fewer than 40 residues long or more than 1000 residues long. The
output of this data sanitizing algorithm will be a file of instances between 40 and 1000 residues
long in no specific order or pattern. The newly trimmed data set file will then be passed to a
routine designed to submit the set to the PSI-BLAST service, where a position specific scoring
matrix (PSSM) is generated for each instance. The training program will wait for these results. The
time taken to obtain PSSM scores cannot be reduced because it depends on the
PSI-BLAST service, which is not controlled in-house. Because of this, for purposes of the prototype
PSSM data will be provided and not obtained from PSI-BLAST. The new data file with PSSM scores
will be passed to a routine that will generate context-based scores for each residue
according to Dr. Li’s existing algorithms. The final training file will be organized into a matrix file
that can be accepted by the SCORPION software.
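The length-based sanitizing step described above can be sketched in a few lines of Python. The 40 and 1000 residue bounds come from the text; representing each instance as a (header, sequence) pair and the function name are assumptions made for illustration.

```python
MIN_RESIDUES = 40    # instances shorter than this are removed
MAX_RESIDUES = 1000  # instances longer than this are removed


def filter_by_length(records):
    """Keep (header, sequence) pairs whose sequence is 40 to 1000 residues long."""
    return [
        (header, sequence)
        for header, sequence in records
        if MIN_RESIDUES <= len(sequence) <= MAX_RESIDUES
    ]
```

Both bounds are inclusive here, matching the statement that the output contains instances between 40 and 1000 residues long.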
The prototype of the automated SCORPION training process will show that it is possible to
use code-based solutions to prepare and formulate the training data so that valuable time can be
saved. It is important that SCORPION be regularly re-trained to reflect the most current state of the
protein secondary structure knowledge base. If the time and complexity of the training process can
be reduced, more frequent retraining will become possible without imposing undue strain on
Dr. Li’s team.
4.1 PROTOTYPE ARCHITECTURE (HARDWARE/SOFTWARE) (STING: DAVID EASON, WEBSITE: JACK MURATORE)
The STING portion of the prototype’s hardware architecture is where training of the neural
network will occur. The prototype will need a PC, a network connection, and a separate Linux virtual
machine (VM). The PC and the network connection shall be the same as those the final version will
use. Instead of a supercomputer, a Linux VM will be used. The prototype will not perform as
many calculations as the finished product and therefore will not need a GPU cluster for performance.
Also, using a GPU cluster for the prototype would require another cluster separate from
the one currently in use. Running the prototype on the current cluster could cause a
denial of service to the customer and anyone else needing to use that resource.
The software required for the prototype will consist of a GUI for the frontend and a separate
executable for the backend. The GUI will exist on a PC with a Microsoft Windows operating system.
The backend executable will exist on the Linux VM, where all but one algorithm shall be run. The GUI
will establish a network connection between the PC and the Linux VM. The
connection will be created over SSH using the user’s credentials.
The frontend will be a set of four windows. Three windows will be used for training
functionality and one for tracking. The first window will initialize the application
and let the user navigate to training or tracking. When the user selects the
option to train, a window will appear in which the user can enter credentials and select a folder with a
small set of .fasta files within it. Once a folder is selected, the last training window will open. This
window will show a status bar while the dataset is normalized on the PC and then put in a directory on
the mapped Z drive. Once this is complete, a secure shell connection will be made to the Linux VM,
where the same Z drive is mapped, and the backend executable will be started. Once the connection is
made and the executable started, the window will display a message notifying the user that the window
and application may be closed. The user can then select the tracking option from the default window to
see the status and history of all training instances that have occurred through the application. The
user will also be able to export this data to a file of their choice.
The executable on the Linux VM shall run PSIBLAST on the small dataset, then three Perl
scripts that will add Dr. Li’s context-based scores. Once these are completed, the randomization
algorithm will separate the dataset into the seven-fold cross-validation training data. Due to slower
performance on the Linux VM than on the GPU cluster, the executable will then start a single MATLAB
neural network and process only one of the seven datasets. Upon completion, the resulting weights will
be put into a file and then emailed to the user, using their username for the SSH connection as a basis
for the email address.
Moving on to the SCORPION website portion, PHP was chosen for all server-side
development. PHP is a standard in web development and has already been used by SCORPION
before. The solution allows module-based programming so that additions can be turned on or off
based on an administrator's preference. Unit testing will be done on all functionality to help with
overall design and stability. The PHP code is the back end of the web experience. All database
queries will be done through PHP.
To store user information, SQLite3 will be used as the database. SQLite3 is well suited to
sites that have moderate traffic but may potentially have a large amount of data to store. It runs
on demand through PHP so no separate process has to be maintained. Other types of databases
were explored, such as MySQL and NoSQL-based solutions. It was determined that, due to familiarity
and low upkeep, SQLite3 was the best choice.
The choice to stick with HTML/JavaScript was made to keep the site's accessibility exactly the
same. This will allow users to use the new features right away. Using a Java applet or
Flash would offer more power but at the cost of adding a new requirement. Many users keep Java
and Flash turned off because of security concerns. HTML/JavaScript also allows a fully functional
RESTful API to be made. This will enable other programs to access the back-end database without the
need for a web browser.
4.2 PROTOTYPE FEATURES AND CAPABILITIES (STANLEY ZHENG)
The prototype will demonstrate a two-part solution to operate SCORPION. One half of the
solution is a GUI client to streamline training of the neural network. The other half is the
redesign of the website through which users utilize and interact with SCORPION.
The training operation, codenamed STING, will simplify a tedious multistep process into
three screens. The program will be accessible by the stakeholder using their personal computer.
A compiled binary will be provided, which the stakeholder can use to start the client. On start,
the user will be able to perform three fundamental actions:
1. Automatically train the neural network
2. View statistics and usage of the current neural network
3. Test the current neural network
Automatically training the neural network is a core feature of STING’s implementation. A
previously manual, multi-environment process will be streamlined into a single workflow. After
supplying data files from the protein database and credentials for the remote neural network, the
stakeholder can then wait for the approximately two-week process to finish. In the background,
the remote server runs the algorithms that execute all the necessary scripts and processes.
For a system stakeholder, statistics about the current neural network training also provide
insights into factors for retraining the model. This enables the stakeholder to use
recency, usage, and effectiveness to determine when to retrain. By collecting this data, the
prototype will be able to accommodate an API that shares statistics with other services. A
simple example of this would be the website’s last-training data updating automatically upon the
completion of retraining.
In addition, the system should let testers analyze the current
training model by testing the system. Following a similar one-button automatic
experience, a user will be given an option on client startup to run tests. Running tests will fire
off a battery of sequences to test responses from the current neural network. The results will be
available for viewing in the GUI, for the aforementioned tester to review.
Figure 4: Improved Process for Training
The user-facing side of SCORPION will be a website application that replaces the current
design. The website will comply with Section 508 web accessibility standards to meet the federal
funding requirements held by the National Science Foundation. The redesign will incorporate a
user experience similar to that of the original website, and will offer a mirror of the original site
if the user prefers.
A feature of the redesign will be client-side validation and data sanitization. Ideally, the
validation ensures that input sequences are formatted to FASTA specifications. The prototype
will handle on-input, rule-based highlighting that will aid the user in correcting
sequences. On-highlight rules are primarily used to handle keyboard input.
Optional sanitization buttons will be available for the first three rules to target large
copied and pasted sequences that could have multiple inconsistencies and formatting issues.
Rules
1. No special characters
2. No numerals
3. All alphabetical characters except B, J, O, U, and X
4. Alphabetical characters are case insensitive
5. Minimum sequence length of 40 characters
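The five rules above can be checked with a short routine like the following Python sketch. The rule set comes from the list; the function name, the error labels, and the decision to treat whitespace as a special character are assumptions, and the production site would perform this check client-side in JavaScript rather than in Python.

```python
import string

EXCLUDED_LETTERS = set("BJOUX")                                # rule 3
ALLOWED_LETTERS = set(string.ascii_uppercase) - EXCLUDED_LETTERS
MIN_LENGTH = 40                                                # rule 5


def validate_sequence(sequence: str):
    """Return a list of rule violations; an empty list means the input is valid."""
    errors = []
    upper = sequence.upper()                                   # rule 4: case insensitive
    if any(not ch.isalnum() for ch in upper):
        errors.append("special character")                     # rule 1
    if any(ch.isdigit() for ch in upper):
        errors.append("numeric")                               # rule 2
    if any(ch.isalpha() and ch not in ALLOWED_LETTERS for ch in upper):
        errors.append("excluded letter")                       # rule 3
    if len(upper) < MIN_LENGTH:
        errors.append("too short")                             # rule 5
    return errors
```

Returning every violated rule, rather than stopping at the first, matches the idea of highlighting each problem so the user can correct a pasted sequence in one pass.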
The prototype will demonstrate logged-in user capabilities utilizing OpenID. Users who opt
in to log in will be able to manage their data on the site and be informed about changes to the site.
This user login system will also give the system stakeholder a way to measure usage of the
site.
The prototype will utilize Google Analytics embedded in pages. This will provide information
not only about who is using SCORPION but also about how. Google Analytics supports GeoIP and
can determine where a visitor originates. In addition, it can provide tracking including
duration of visit and user utilization of the site. Google Analytics provides a way for the stakeholder
to review activity and all of this associated information within the application.
Finally, the prototype will offer an API following RESTful principles. Through it, the user can
programmatically submit requests and perform all actions available on the website.
Users who submit sequences with sufficient parameters will be returned a token URL that they can use
to find their results later on.
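The token URL idea can be sketched as follows. This is only an illustration under our own assumptions: the base URL, the SubmissionStore class and its in-memory dictionary are hypothetical stand-ins, and a real deployment would sit behind a web framework and persistent storage.

```python
import secrets

# Hypothetical base address for result lookups; not the real SCORPION URL.
BASE_URL = "https://scorpion.example/api/results/"


class SubmissionStore:
    """In-memory stand-in for the server side of the submission API."""

    def __init__(self):
        self._jobs = {}

    def submit(self, sequence):
        """Register a sequence and return the token URL for its results."""
        token = secrets.token_urlsafe(16)  # unguessable, URL-safe token
        self._jobs[token] = {"sequence": sequence, "result": None}
        return BASE_URL + token

    def complete(self, token, result):
        """Record the prediction result once processing has finished."""
        self._jobs[token]["result"] = result

    def lookup(self, url):
        """Report the state of the job that a token URL points to."""
        token = url.rsplit("/", 1)[-1]
        job = self._jobs.get(token)
        if job is None:
            return {"status": "unknown"}
        if job["result"] is None:
            return {"status": "pending"}
        return {"status": "done", "result": job["result"]}
```

Using a random token rather than a sequential job number means that a user who does not log in can still retrieve results, while other visitors cannot easily guess the address.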
Since this is a multi-part solution, there are both system and subsystem risks. At the system
level, the largest risks are failure of adoption and failure of implementation. We hope Dr. Li will be
able to use our end product in production to its full specifications. To mitigate these system risks,
we use agile development with backwards compatibility in mind. We allow the client-facing
systems to maintain a similar user interface, or let users opt in to the original design. Using agile,
we are able to continuously shape our prototype with our stakeholders' input in mind. In addition, we
will be building on proven open source solutions, making support and documentation plentiful.
In production, we hope to mitigate the risk posed by training downtime by providing an
API endpoint that reflects the current status of SCORPION. Beyond that, it would be difficult to
mitigate the multi-week down period other than by informing the user to visit later on.
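A status endpoint of this kind might return a small JSON document. The sketch below shows one possible shape. The function name and the field names (status, accepting_submissions, percent_complete) are assumptions made for illustration, not a defined SCORPION interface.

```python
import json


def status_payload(training_in_progress, percent_complete=None):
    """Build the JSON body a hypothetical status endpoint could return."""
    if training_in_progress:
        body = {"status": "training", "accepting_submissions": False}
        if percent_complete is not None:
            # Optional hint so clients can judge when to come back.
            body["percent_complete"] = percent_complete
    else:
        body = {"status": "available", "accepting_submissions": True}
    return json.dumps(body, sort_keys=True)
```

A client of the API could poll this endpoint before submitting a sequence and postpone the request while training is under way.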
4.3 PROTOTYPE DEVELOPMENT CHALLENGES (JASMINE JONES)
We have divided our project into two parts: STING and the SCORPION website. We will face
development challenges specific to each part. The challenges we will face when developing STING
involve efficiently utilizing the available server resources, correctly automating training of the
Neural Network and ensuring that our client is satisfied with our solution. Challenges we anticipate
in the development of the SCORPION website include complying with 508 standards, providing an
accurate estimated prediction time, providing a secure user login and ensuring our users remain
satisfied throughout our changes.
The first STING challenge we will face is efficiently utilizing the available server resources.
We will be adding more files to the server, and if we are not careful, this could overload the system.
To prevent this, we will ensure that there will be enough space available on the server prior to
adding any files. Additionally, we will be using larger data sets to train the neural network. This
could reduce the efficiency of the system and lead to slower processing. As a precaution, we will
allow only data sizes that are within the processing abilities of the system, and will warn users that
large data sets will require longer processing time. Further, it is beyond our control that during
training, the server must devote all of its resources to training and cannot accept protein sequence
submissions from the SCORPION website. While we cannot change this, there is a bigger problem
that can result: if the program is in the process of training and is then canceled by an administrator,
training will have to start all over from the beginning, causing longer server downtime. To prevent
this, our GUI for STING will display a warning if an administrator attempts to close the program
while training is in progress. Additionally, the STING GUI will display a progress bar indicating what
percentage of the training is complete.
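Two of the precautions above, checking for free space before adding files and reporting training progress as a percentage, can be sketched in a few lines of Python. The function names and the reserve_bytes parameter are hypothetical, and how STING would actually count training steps is not specified here.

```python
import shutil


def has_room_for(path, incoming_bytes, reserve_bytes=0):
    """Return True if the disk holding `path` can take `incoming_bytes`
    while still leaving `reserve_bytes` free."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage.free - reserve_bytes >= incoming_bytes


def percent_complete(steps_done, total_steps):
    """Convert a step count into a 0 to 100 value for a progress bar."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(100, max(0, round(100 * steps_done / total_steps)))
```

For example, has_room_for(".", 5 * 1024**3) asks whether five gibibytes can be written to the current directory's disk. Leaving a reserve avoids filling the server completely.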
Another STING challenge is ensuring that our automated solution to train the Neural
Network is correct. It is extremely important that our program runs the proper algorithms, in the
correct sequence and with the correct formatting. If our automated solution leads to improper
training of the Neural Network, it would produce incorrect protein structure prediction results, and
that would be catastrophic. We will extensively test our program to ensure that it functions
properly.
Our third STING challenge is ensuring that our client, Dr. Li, approves of our solution. If he
does not approve, our solution will not be implemented. Because our solution is specific to Dr. Li’s
work and because our prototype will be the actual product itself, if our solution is not implemented,
our efforts would go to waste. To prevent this, we will be working in an agile development
environment, ensuring that Dr. Li is happy at every step in the development process until
completion.
Moving on to our first SCORPION website challenge, we must ensure that the website is 508
compliant. If we fail to meet 508 compliance standards, our website may be inaccessible to users
with disabilities. Further, because all federally funded websites are required to be 508 compliant,
we could lose federal funding. To make certain that the website is 508 compliant, we will
extensively test our website using 508 compliance verification tools, such as those found at
W3.org/WAI.
Our second challenge in the development of the SCORPION website is providing an accurate
protein structure estimated prediction time. If the estimated prediction time is inaccurate, the user
may become frustrated, especially if they have a deadline to meet. We have already completed
studies to analyze the duration of SCORPION’s protein structure prediction; however, we will
complete further studies with more data sets and over a longer period of time to ensure that our
estimates are as accurate as they can be. We will also provide a prediction time window
rather than a single point estimate. This will accommodate slight variations in the duration of the prediction.
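One simple way to turn timing measurements into a window is to take the mean of past durations and widen it by their standard deviation. The sketch below illustrates that idea only. It is not the estimation algorithm used by SCORPION, and the function name and the spread parameter are our own assumptions.

```python
import statistics


def prediction_window(past_durations, spread=1.0):
    """Return (low, high) in the same units as `past_durations`.

    The window is the mean plus or minus `spread` sample standard
    deviations, with the lower bound clamped at zero.
    """
    if len(past_durations) < 2:
        raise ValueError("need at least two past durations")
    mean = statistics.mean(past_durations)
    margin = spread * statistics.stdev(past_durations)
    return max(0.0, mean - margin), mean + margin
```

A larger spread gives a wider and therefore safer window, at the cost of a less precise estimate for the user.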
Security of logged-in user information is another challenge in the development of the
SCORPION website. It would be a serious problem if user information were leaked. However, we believe
keeping user information secure should not be a major hurdle, because we will be using
OpenID for our user login, so no passwords will be stored on our system. Additionally, we will be
implementing SSL, which will secure any additional information the user chooses to provide. For
our OpenID login, we will be using mainstream third-party providers; however, if the user does
not have a third-party account, they are not required to log in to benefit from SCORPION.
Our final SCORPION website challenge is addressing user satisfaction, specifically
how users will react to all of the changes we will implement. Because we will be
providing a new webpage design, users will be unfamiliar with the new design and may resist the
change. We will counter this by keeping the design in a similar layout, announcing the new page
design on our homepage and providing updated instructions on how to use the SCORPION service.
Keeping our changes simple will also enable easy maintenance of the website after our initial
implementation. Additionally, STING may lead to more frequent training of the Neural Network
which means more frequent server downtime. To ensure user satisfaction during this time, we will
send an email notification to users and post an announcement on the website homepage in advance
of a scheduled training.
GLOSSARY
Amino Acids/Residues: The building blocks of proteins
API: Application Programming Interface (an abstract way for services to communicate)
Cross-validation Training: The process of dividing training data into k mutually exclusive subsets
(folds) of roughly equal size, where some subsets are used for training, validating, and testing. The
process is repeated k times.
Data cleansing: The process of removing non-representative instances from the data set.
Dunbrack Lab: Part of the Fox Chase Cancer Research Center. Recognized for normalizing data
from the RCSB
ETL: Extract, Transform and Load; refers to the manipulation of data
FASTA: Format widely adopted in bioinformatics to make it easier to manipulate and parse
sequences
GeoIP: Uses a lookup table of Internet Protocol addresses with known municipalities and providers
to match IP origin
GUI: Graphical User Interface
NSF: National Science Foundation
PSI-BLAST: Position-Specific Iterative Basic Local Alignment Search Tool used for deriving the
PSSM
PSSM: Position-Specific Scoring Matrix which includes information about evolutionary relatives of
the original protein sequence
RCSB Protein Data Bank: Research Collaboratory for Structural Bioinformatics database. The
database holds all known and recognized protein sequences.
REST: A REST API is a set of operations that can be invoked by means of any of the four verbs
(GET, POST, PUT, DELETE), using the actual URI as parameters for your operations.
SCORPION: SeCOndaRy structure PredictION
STING: Streamlined Training In Neural-network GUI
Training set: Set of instances from the problem domain used to train the algorithm
508 Compliance: Adhering to guidelines established to make website content equally accessible to
people with disabilities
REFERENCES
Biological Macromolecular Resource. (n.d.). RCSB Protein Data Bank. Retrieved February 20, 2014, from
http://www.rcsb.org/pdb/home/home.do
Blue Team. (n.d.). SCORPION Protein Prediction Timed Experiment. Retrieved February 11, 2014,
from www.cs.odu.edu/~410blue/CS410SCORPIONProteinPredictionTimeExperiment.xlsx
Cancer Research Funding - National Cancer Institute. (2013, August 23). Cancer Research Funding - National Cancer Institute. Retrieved May 8, 2014, from
http://www.cancer.gov/cancertopics/factsheet/NCI/research-funding
Freitas, R. (1998, January 1). Nanomedicine. Chapter 3 page 1. Retrieved May 8, 2014, from
http://www.foresight.org/Nanomedicine/Ch03_1.html
Murphy, S. (2013, May 8). Deaths: Final Data for 2010. Retrieved May 8, 2014, from
http://www.cdc.gov/nchs/data/nvsr/nvsr61/nvsr61_04.pdf
RCSB PDB - Histograms. (n.d.). RCSB PDB - Histograms. Retrieved May 8, 2014, from
http://www.rcsb.org/pdb/statistics/histogram.do?mdcat=mvStructure&mditem=residueCount&name=Residue%20Count
Section 508. (n.d.). United States Department of Health and Human Services. Retrieved March 15,
2014, from http://www.hhs.gov/web/508/index.html
Section 508 Of The Rehabilitation Act. (n.d.). Section 508 Home. Retrieved March 15, 2014, from
http://www.section508.gov/Section-508-Of-The-Rehabilitation-Act
Yaseen, A., & Li, Y. Context-based Features Enhance Protein Secondary Structure Prediction
Accuracy.