Computational Analysis of Protein Structure Prediction

IRACST - International Journal of Computer Science and Information Technology & Security (IJCSITS), ISSN: 2249-9555
Vol. 4, No.5, October 2014
Computational Analysis of Protein Structure
Prediction and Folding
D. Ramyachitra
Assistant Professor
Department of Computer Science
Bharathiar University
Coimbatore, India
jaichitra1@yahoo.co.in
V. Veeralakshmi
M.Phil Research Scholar
Department of Computer Science
Bharathiar University
Coimbatore, India
veeralakshmi13@gmail.com
ABSTRACT: The protein structure prediction (PSP) problem is computationally challenging. Predicting a protein's structure from its sequence information is of great significance, because the properties of proteins are critically determined by their structures. All the information necessary to fold a protein into its native structure is contained in its amino-acid sequence, yet the native structure itself is not known in advance. The protein folding problem is the problem of predicting a protein's tertiary structure. Misfolding occurs when the protein folds into a 3D structure that does not represent its correct native structure. In the HP model, each amino acid is classified, based on its hydrophobicity, as H (hydrophobic or non-polar) or P (hydrophilic or polar). The HP energy model focuses the search towards structures that have hydrophobic cores. Many algorithms have been used to solve the PSP problem by finding the lowest-energy conformations. In this paper we review the protein structure prediction problem and some of the techniques used to predict the structure.
Key words: Protein structure prediction, HP energy model, Protein folding.
I. INTRODUCTION
The primary structure of a protein is a linear sequence of amino acids connected by peptide bonds. Protein structures are determined experimentally by techniques such as NMR (nuclear magnetic resonance) spectroscopy and X-ray crystallography. These techniques require isolation, purification and, for X-ray crystallography, crystallization of the target protein [1]. The levels of protein structure are shown in Fig 1.
Fig.1 The Levels of Protein Structure Prediction
The protein structure prediction problem has two major sources of difficulty: (1) finding good measures for the quality of candidate structures, and (2) given such measures, determining optimal or close-to-optimal structures for a given amino-acid sequence (Krasnogor, Hart, Smith, & Pelta, 1999). Computational approaches to protein structure prediction are very attractive [1, 2].
The optimal conformation in the HP model is the one that has the maximum number of H–H contacts (Fig 2), which gives the lowest energy value [2]. The protein folding problem in the 2D HP model has been proved to be NP-hard. In 1993, Unger and Moult showed that finding the native conformation is NP-hard in a number of simplified models [3]. In the AB off-lattice model, the hydrophobic residues are labeled A and the polar or hydrophilic ones B. Fibonacci sequences of A and B residues were studied using potentials that include bending and Lennard–Jones energy terms [4]. For the 2D HP model, many algorithms have been explored to find the minimum-energy configuration of small proteins.
Fig.2 An optimal conformation for the sequence "(HP)2PH(HP)2(PH)2HP(PH)2" in a 2D lattice model [2]
The remaining sections of this paper are organized as follows. Section 2 gives an overview of protein structure and section 3 describes the performance metrics. Finally, section 4 gives the conclusion.
II. AN OVERVIEW OF PROTEIN STRUCTURE
Proteins perform a variety of biological tasks. A protein's structure determines its function. Protein structure is more conserved than protein sequence and more closely related to function.
A protein is a linear polypeptide chain composed of 20 different kinds of amino acids, represented by a sequence of letters (Fig 3, left). It folds into a tertiary (3-D) structure (middle) composed of three kinds of local secondary structure elements (helix, red; beta strand, yellow; loop, green).
Fig.3 Protein sequence-structure-function relationship [5]
The protein with its native 3-D structure can carry out several biological functions in the cell (right).
A. PROTEIN STRUCTURE HIERARCHY:
The four levels of protein structure (shown in Fig 4) are the primary, secondary, tertiary and quaternary structures.
a) Primary Structure:
A protein is a sequence of amino acid building blocks arranged in a linear chain and joined together by peptide bonds. This linear polypeptide sequence is called the primary structure of the protein. The primary structure is typically represented by a sequence of letters over a 20-letter alphabet associated with the 20 naturally occurring amino acids [6]. Protein sequences range in length from 30 to 30,000 amino acids; most contain a few hundred.
b) Secondary Structure:
Secondary structure prediction is the task of predicting the conformational state of each amino acid in a protein sequence [7]. The protein folds into local secondary structures, including alpha helices (H) and beta strands (E), which may be connected by loop regions or coils.
Thang N. Bui et al. proposed an efficient genetic algorithm for the protein folding problem using the HP model on the two-dimensional square lattice [41]. The algorithm performs very well against existing evolutionary algorithms and Monte Carlo algorithms.
Fig.4 Protein Structure Hierarchy
Alpha helix:
An alpha helix is a tightly coiled, rod-like structure. It is formed from one continuous region of the chain through hydrogen bonds between the carbonyl group of the residue at position i and the NH group of residue i+4 [8].
L. Howard Holley and Martin Karplus assigned helix to any group of four or more contiguous residues (the minimum helix in the Kabsch and Sander classification) whose helix output values are greater than their sheet output values and greater than a threshold value [9].
Beta strand:
A beta strand is a single extended fragment of a sheet-like structure. Beta sheets are formed by linking two or more beta strands through hydrogen bonds. The side chains of adjacent residues point in opposite directions, since only trans peptide bonds place the R groups on opposite sides. A beta strand cannot exist on its own in a protein: a sheet needs two or more strands, and typically 4-5 strands make up a beta sheet. Beta sheets may consist of parallel strands, antiparallel strands, or a mixture of parallel and antiparallel strands.
Qian et al. reported a maximum overall prediction accuracy of 63.2% on the training set. Prediction accuracy increased for residues near the amino-terminus and for highly buried versus partially exposed β-strands, and residues with higher output activities were found to be more accurately predicted [10].
β-Turns, as described by Richardson, are a specific class of chain reversals localized over a four-residue sequence. Network predictions for β-turns begin with the hypothesis that the information necessary to force the sequence of amino acids into a β-turn exists locally in a small window of residues. The low values for the overall prediction accuracy reflect the stringent requirement that all four residues in the β-turn must be correctly predicted [11].
Coils:
Coils have no fixed regular shape. Secondary structure elements commonly combine into super-secondary structure arrangements such as the helix-loop-helix.
L. Howard Holley and Martin Karplus considered residues that are not assigned to helices or beta strands to be coil. The threshold parameter value is adjusted on the training set to maximize the accuracy of secondary structure assignment [9].
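The assignment rule attributed to Holley and Karplus above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the original implementation: the function names, the per-residue output lists, the default threshold of 0.5 and the minimum strand run of two residues are hypothetical. Only the four-residue minimum helix and the rule that unassigned residues are coil come from the text.

```python
def keep_long_runs(mask, min_len):
    """Keep only the True runs of `mask` that are at least `min_len` long."""
    keep = [False] * len(mask)
    start = None
    for i, flag in enumerate(list(mask) + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                for k in range(start, i):
                    keep[k] = True
            start = None
    return keep


def assign_states(helix_out, sheet_out, threshold=0.5, min_helix=4, min_sheet=2):
    """Label each residue H (helix), E (strand) or - (coil).

    threshold and min_sheet are assumed values chosen for illustration.
    """
    helix_mask = [h > s and h > threshold for h, s in zip(helix_out, sheet_out)]
    sheet_mask = [s > h and s > threshold for h, s in zip(helix_out, sheet_out)]
    helix = keep_long_runs(helix_mask, min_helix)
    sheet = keep_long_runs(sheet_mask, min_sheet)
    # Anything not assigned to a helix or a strand is treated as coil.
    return "".join(
        "H" if helix[i] else "E" if sheet[i] else "-"
        for i in range(len(helix_mask))
    )
```

In this sketch a run of three helix-like residues is too short to count as a helix and therefore falls back to coil, which mirrors the minimum-length requirement described in the text.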
c) Tertiary Structure:
The tertiary structure is described [7] by the x, y and z coordinates of all the atoms [12] of a protein or, in a coarser description, by the coordinates of the backbone atoms. The three-dimensional conformation results from the secondary structures folding together.
Ivan Kondov proposed particle swarm optimization for computer-aided prediction of proteins' three-dimensional structure. Asynchronous parallelization speeds up the simulation more than synchronous parallelization and reduces the effective time for predictions [14].
d) Quaternary Structure:
A protein with a quaternary structure consists of more than one subunit, often practically identical, not joined by strong bonds. The quaternary structure describes the spatial packing of several folded polypeptides [13]. Not all proteins have a quaternary level of structure. An example of a quaternary structure is human hemoglobin, which is made up of four distinct subunits, each an individual chain of amino acids, that function as a single complex.
B. PROTEIN MODELS:
Protein structure can be specified at different levels of the hierarchy. Because of the complexity of the problem and limited computing resources, simplified models are also used; protein structure models fall into two categories (given in Fig 5).
All Atom Model:
Protein structures are represented by a list of the 3D coordinates of all the atoms in a protein. An all-atom model is the desired result of structure prediction, but at this level of detail it is very difficult to identify similar substructures across different proteins, and to generalize and abstract.
Ivan Kondov [14] used an all-atom force field and improved the performance of the method by applying periodic boundary conditions to the search space. The standard algorithm, as implemented in the ArFlock library, was used to find the low-energy conformations of several peptides.
Fig.5 Protein structure models
Simplified Models:
All-atom models are often not computationally feasible, so simplified models are used to produce approximate solutions. In a lattice model, each amino acid of the sequence occupies a point on the lattice, and the chain forms a continuous self-avoiding walk [15]. Simplified models include very abstract models such as the HP model. The classification of simplified models is shown in Fig 6.
K. A. Dill used the HP model on the 3D cubic lattice, known as the 3D HP model. Each amino acid is classified, based on its hydrophobicity, as H (hydrophobic or non-polar) or P (hydrophilic or polar). The objective of the protein folding problem is to determine a conformation of minimum energy. A conformation of a protein in the HP model is embedded as a self-avoiding walk in either a two-dimensional or a three-dimensional lattice [16].
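To make the notion of a self-avoiding walk concrete, the following Python sketch places a chain on the 2D square lattice and checks that no lattice point is visited twice. The move encoding (U, D, L, R) and the function names are assumptions introduced here for illustration; they do not come from any of the cited works.

```python
# Unit steps on the 2D square lattice; the letters are a hypothetical encoding.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def walk_to_coords(moves):
    """Place residue 0 at the origin and follow one move per bond."""
    x, y = 0, 0
    coords = [(x, y)]
    for m in moves:
        dx, dy = MOVES[m]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def is_self_avoiding(coords):
    """A conformation is valid only if no lattice point is used twice."""
    return len(set(coords)) == len(coords)
```

For example, the three moves R, U, L fold a four-residue chain around a unit square and give a valid conformation, whereas adding a fourth move D returns the chain to the origin and makes the walk invalid.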
Mahmood A. Rashid et al. developed a genetic algorithm that mainly uses a high-resolution energy model for protein structure evaluation but uses the low-resolution HP energy model to focus the search towards structures that have hydrophobic cores [17].
Berger et al. studied the protein folding problem in the HP model, called the HP-Protein Folding problem: for a given protein, find a valid conformation [18] on the Cartesian lattice such that the energy is minimum. The HP-Protein Folding problem is NP-hard.
Mahmood A. Rashid et al. used an HP-based energy model on the 3D FCC lattice to simplify the problem. Their GA+ uses three enhancements: i) an exhaustive generation approach to diversify the search, ii) a novel hydrophobic core-directed macro move to intensify the search, and iii) a random-walk based approach to recover from stagnation. The state-of-the-art results on the face-centered cubic (FCC) lattice with the hydrophobic-polar (HP) energy model have been achieved by local search (LS) methods [15].
Alena Shmygelska et al. proposed an algorithm for the HP Protein Folding Problem that incorporates a local search phase, which takes the initially built protein conformation and attempts to optimize its energy using probabilistic long-range moves [19].
Cheng-Jian Lin et al. used an efficient hybrid Taguchi-genetic algorithm (HTGA) for solving the protein folding problem in the 2D HP model. The Taguchi method is used to improve the crossover operation so as to select better genes, and the merits of PSO are used to improve the mutation mechanism [2].
Off Lattice:
Xiaolong Zhang et al. proposed a genetic tabu search method for predicting protein structure. Two important issues in PSP are the design of the structure model and the optimization technology. Because realistic protein structures are complex, this study uses a simplified model, the AB off-lattice model, to search for the best conformation of a protein sequence [20].
Jingfa Liu et al. developed a heuristic-based tabu search (HTS) algorithm that integrates a heuristic initialization mechanism, a heuristic conformation updating mechanism, and the gradient method into an improved TS algorithm. The HTS algorithm is quite promising for finding the ground states of AB off-lattice model proteins [4].
HP Model:
Jian Lin et al. proposed an efficient artificial bee colony algorithm for protein structure prediction on lattice models. Their modified ABC algorithm has been applied to the protein folding problem based on the hydrophobic-polar lattice model [1].
Fig.6 Classification of simplified models
C. PROTEIN STRUCTURE PREDICTION TECHNIQUES:
The difficulty of protein structure prediction is usually
tackled in 2 main steps:
1. Protein secondary structure prediction
2. Protein tertiary structure prediction.
a) Protein Secondary Structure Prediction:
Lattice:
Many techniques are used to solve the protein secondary structure prediction problem. Some of these techniques are given in Fig 7.
Fig 7 Protein Structure Prediction Techniques
STATISTICAL METHOD:
Chou-Fasman (CF) Method:
The Chou-Fasman method [21] is one of the first methods implemented for protein secondary structure prediction. The method uses a table of two kinds of values for each amino acid: propensity values, which reflect how likely the amino acid is to appear within a given structure, and frequency values, which reflect how often it is found in a hairpin turn. Taking these values into account, the method predicts regions of α-helices, regions of β-sheets, and positions where β-turns may appear.
Chou, P.Y. and Fasman, G.D. predicted alpha-helices and beta-strands by setting a cutoff on the total propensity of a window of four residues. Residues were classified as helix or strand formers and breakers: formers contribute positively to the formation of the structural element, whereas breakers prevent or stop its formation [22].
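The windowed cutoff described above can be illustrated with a minimal Python sketch. It is not the published Chou-Fasman procedure: the function name, the cutoff of 4.0 and the three propensity values are hypothetical and chosen only so that the example is easy to follow. Only the idea of summing propensities over a window of four residues and comparing the total with a cutoff comes from the text.

```python
def window_hits(sequence, propensity, window=4, cutoff=4.0):
    """Return start indices of windows whose total propensity exceeds the cutoff."""
    hits = []
    for start in range(len(sequence) - window + 1):
        total = sum(propensity[aa] for aa in sequence[start:start + window])
        if total > cutoff:
            hits.append(start)
    return hits


# Hypothetical helix propensities for three residue types (not published values):
# values above 1 mark a "former", values below 1 a "breaker".
TOY_HELIX = {"A": 1.4, "E": 1.5, "G": 0.5}

print(window_hits("AEAEGGGG", TOY_HELIX))
```

With these toy values, only the windows that are dominated by the two formers pass the cutoff, while windows containing two or more copies of the breaker do not.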
Garnier-Osguthorpe-Robson (GOR) Method:
Jean Garnier et al. proposed the GOR method [23], one of the most popular secondary structure prediction methods. It was the first secondary structure prediction method implemented as a computer program. The addition of homologous sequence information through multiple alignments has given a significant boost to the accuracy of secondary structure predictions.
Taner Z. Sen et al. developed the GOR V web server for protein secondary structure prediction. The algorithm combines Bayesian statistics, information theory and evolutionary information. Although GOR V has been among the most successful methods, its unavailability online had been a restraint on its popularity [24].
A. Kloczkowski et al. developed the GOR V algorithm [25], which was released as an online prediction server. The prediction was limited to 375 sequences having 59 PSI-BLAST alignments.
MACHINE LEARNING ALGORITHM:
Jacek Błażewicz et al. applied machine learning methods [26] such as LAD, LEM2, and MODLEM to protein secondary structure prediction in order to handle large data sets. LEM2 and MODLEM are rule induction algorithms that generate a minimal set of rules from a set of positive examples and a set of negative examples. The aim was to identify which method is more suitable for the analyzed data and to find rules that predict the secondary structure. The best average results were obtained using the LAD algorithm.
The two types of machine learning algorithm are shown in Fig 8.
Fig 8 Machine Learning Algorithm Types
Support Vector Machine:
J. J. Ward et al. investigated the applicability of SVMs in order to develop a reliable prediction method using an alternative technique. The SVM performs similarly to the 'state-of-the-art' PSIPRED prediction method on a non-homologous test set of 121 proteins, in spite of being trained on considerably fewer examples. A simple consensus of the SVM, PSIPRED and PROFsec achieves higher prediction accuracy than the individual methods [27].
Minh N. Nguyen et al. investigated multi-class SVM methods, which must solve a much larger optimization problem and are applicable to small datasets. Multi-class SVM methods were found to be more suitable for protein secondary structure (PSS) prediction [28] than the other methods, including binary SVMs. Prediction accuracy can be further improved by adding a second-stage multi-class SVM to capture the contextual information among secondary structural elements.
Long-Hui Wang et al. proposed a support vector machine kernel method that takes into account the physical-chemical properties and structural properties of amino acids. The SVM classifiers could be further improved by using larger training sets that contain new protein structures, although this requires more memory to store data points. It is one of the top-ranking methods for predicting protein secondary structure [29].
Hae-Jin Hu et al. applied the SVM learning machine to improve the prediction accuracy of secondary structure. In the first approach, new encoding schemes are applied and optimized. In the second approach, a new tertiary classifier that combines the results of one-versus-one binary classifiers is designed, and its efficiency is compared with that of existing tertiary classifiers. The tertiary classification can be decomposed into a set of binary classifications. These approaches may also improve performance in other areas such as pattern recognition, data mining, and machine learning [30].
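The decomposition of a three-class problem into one-versus-one binary decisions can be illustrated as follows. This sketch uses plain majority voting, which is a generic combination scheme and not the classifier designed by Hu et al.; the class labels, the function name and the coil fallback for three-way ties are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

CLASSES = ("H", "E", "C")  # helix, strand, coil


def combine_one_vs_one(pairwise):
    """Combine one-versus-one decisions into a single three-class label.

    `pairwise` maps each unordered class pair, e.g. ("H", "E"), to the class
    that the corresponding binary classifier preferred.
    """
    votes = Counter(pairwise[pair] for pair in combinations(CLASSES, 2))
    best, count = votes.most_common(1)[0]
    # With three classes a 1-1-1 tie is possible; fall back to coil (assumption).
    return best if count > 1 else "C"
```

For a single residue, three binary classifiers (H/E, H/C and E/C) each cast one vote; the class with two votes wins, and a three-way tie is resolved by the assumed fallback.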
Blaise Gassend et al. proposed Hidden Markov Support Vector Machines (HM-SVMs) [31]. The HMM is trained using a support vector machine method that iteratively picks a cost function based on a set of constraints and uses the predictions resulting from this cost function to generate new constraints for the next iteration. Unlike most secondary structure methods, it can predict not only which residues participate in a beta sheet but also which residues form hydrogen bonds between adjacent sheets.
Sujun Hua et al. applied the SVM, a new approach to supervised pattern classification that has been applied to pattern recognition problems including object recognition, speaker identification, and gene function prediction from microarray expression profiles. The SVM method achieved good segment overlap (SOV) accuracy through sevenfold cross-validation on a database of 513 non-homologous protein chains with multiple sequence alignments [32].
Neural Network:
Pierre Baldi et al. proposed several classes of recursive artificial neural network (RNN) [33] architectures for large-scale applications, derived using the directed acyclic graph (DAG-RNN) approach. These are used to derive state-of-the-art predictors for protein structural features such as secondary structure (1D) and both fine- and coarse-grained contact maps (2D). The internal deterministic dynamics allows efficient propagation of information, as well as training by gradient descent, in order to tackle large-scale problems.
L. Howard Holley et al. applied neural networks to protein secondary structure prediction. Specializing a neural network to a particular problem involves choosing the network topology (that is, the number of layers, the size of each layer, and the pattern of connections) and assigning connection strengths to each pair of connected units and thresholds to each unit. The method predicts helix, sheet, and coil [9].
Ning Qian et al. developed a new method for predicting the secondary structure of globular proteins based on non-linear neural network models. The goal of the method is to use the available information in the database of known protein structures to help predict the secondary structure of proteins for which no homologous structures exist [10].
FUZZY SETS:
Armando Blanco et al. proposed a fuzzy adaptive neighborhood search (FANS) to address one of the most important problems in computational biology: the protein structure prediction problem. Regarding the application of heuristics to PSP, the same results could potentially be obtained by discarding the population and applying mutations to a single individual [34].
Rajkumar Bondugula et al. proposed a prediction system based on a generalized nearest neighbor method that takes the position-specific scoring matrices (PSSMs) of the query protein sequence as input.
Jyh-Shing Roger Jang proposed adaptive neuro-fuzzy inference systems (ANFIS), one of the most popular types of fuzzy neural networks [37]; ANFIS combines the advantages of fuzzy systems and neural networks in modeling non-linear control systems. Yongxian Wang described a hybrid of a neural network and a fuzzy system, using ANFIS for three-class secondary structure prediction of proteins to produce better results.
ARTIFICIAL IMMUNE SYSTEM (AIS):
Sree PK et al. proposed an Artificial Immune System (AIS-MACA), a novel computational intelligence technique that can be used to strengthen automated protein prediction systems [44].
EVOLUTIONARY ALGORITHM:
Genetic Algorithm:
Subhendu Bhusan Rout et al. proposed a genetic algorithm technique for protein structure prediction. The technique helps to work with huge amounts of data and to predict protein structure on a large scale. By analyzing changes in protein structure and providing a metaphor of the processes, the genetic algorithm is very useful for drug design, since it can process an enormous amount of data in a short time [38].
Mahmood A. Rashid et al. proposed a new genetic algorithm for the protein structure prediction problem using the face-centered cubic lattice [17] and the hydrophobic-polar energy model. The results were compared with the state-of-the-art local search algorithm for simplified PSP; the final algorithm, GA+, uses a combination of all three enhancements discussed under the HP energy model.
Cheng-Jian Lin et al. developed an efficient hybrid Taguchi-genetic algorithm that combines a genetic algorithm, the Taguchi method, and particle swarm optimization (PSO). The GA provides powerful global exploration, the Taguchi method exploits the optimum offspring, and a PSO-inspired mutation mechanism is used within the genetic algorithm. The algorithm was applied successfully to the protein folding problem based on the hydrophobic-hydrophilic lattice model, and the simulation results show that it performs very well against existing evolutionary algorithms [2].
Camelia Chira et al. addressed the hydrophobic-polar model of the protein folding problem with hill-climbing genetic operators. Crossover and mutation are applied using a steepest-ascent hill-climbing approach [39]. The evolutionary algorithm with hill-climbing operators was successfully applied to the protein structure prediction problem for a set of difficult bidimensional instances from lattice models.
A. Tantara et al. proposed a bi-criterion parallel hybrid genetic algorithm (GA) that efficiently solves the problem using a computational grid. It determines not only the ground-state energy conformation of a molecule but also the ensemble of potential low-energy conformations [40].
Trent Higgs et al. presented a feature-based resampling genetic algorithm to refine structures output by PSP software. The two structural measures used are RMSD and TM-score [42].
Mahmood A. Rashid et al. presented a genetic algorithm for protein structure prediction on the 3D face-centered cubic lattice. A low-resolution energy model could effectively bias the search towards certain promising directions [15].
SWARM INTELLIGENCE:
Artificial Bee Colony Algorithm:
Karaboga et al. presented the Artificial Bee Colony (ABC) algorithm for constrained optimization problems. The ABC algorithm was used to solve constrained optimization problems and produced the best results [43].
b) Protein Tertiary Structure Prediction:
For many proteins and protein domains, prediction of the three-dimensional (3D) or "tertiary" structure from the amino acid sequence should be feasible for an increasing number of sequences. Tertiary structure prediction techniques are shown in Fig 7.
TEMPLATE MODELING:
Homology Modeling:
Zhexin Xiang reviewed homology modeling. Advances in detecting distant homologues, aligning sequences with template structures, modeling loops and side chains, and detecting errors in a model have contributed to reliable prediction of protein structure [45].
Threading:
C.A. Floudas reviewed threading, which generalizes the technique of homology modeling by aligning the unknown sequence to known structures. It is also known as the 'fold recognition' algorithm [49] or 'inverse folding'. Threading methods aim at fitting a target sequence to a known structure in a library of folds.
TEMPLATE FREE MODELING:
Ab Initio Structure Prediction:
David Baker and Andrej Sali classified the methods for protein structure prediction into two main categories; the methods in one of them predict the structure without relying on similarity at the fold level between the target sequence and those of the known structures [46].
Jooyoung Lee et al. discussed ab initio modelling as a route to a complete solution of the protein structure prediction problem. Predicting protein 3D structures from the amino acid sequence by ab initio modeling also helps us to understand the physicochemical principles of how proteins fold in nature [47].
M. Meissner et al. used ab initio prediction of a set of small protein structures to demonstrate the usefulness of PSO in applied protein structure prediction. With an appropriate energy function, ab initio protein structure prediction should be feasible [48].
EVOLUTIONARY ALGORITHM:
Genetic Algorithm:
Xiaolong Zhang et al. investigated a genetic tabu search algorithm in order to develop an efficient optimization algorithm. The crossover and mutation operators improve the local search capability, and a variable population size strategy maintains the diversity of the population. The algorithm also uses a ranking selection strategy [20].
SWARM INTELLIGENCE:
Ant Colony Optimization Algorithm:
Stefka Fidanova and Ivan Lirkov developed an ant colony algorithm for the 3D HP protein folding problem. They studied how the components of the algorithm contribute to its performance, which is affected by the heuristic function and the selectivity of pheromone updating. The aim is to achieve more realistic folding [50].
Alena Shmygelska et al. investigated a new algorithm, dubbed ACO-HPPFP-3, which is based on very simple structure components. The run-time required by ACO-HPPFP-3 to find the best known energy conformations scales worse with sequence length than that of PERM in 3D [19].
Artificial Bee Colony Algorithm:
C. Vargas et al. proposed parallel artificial bee colony algorithm approaches for protein structure prediction using the 3DHP-SC model [51]. The two parallel approaches for the ABC are the master-slave and the hybrid-hierarchical approaches. The parallel models achieve a good level of efficiency, and the hybrid-hierarchical approach improved the quality of solutions.
Particle Swarm Optimization Algorithm:
Nashat Mansour et al. presented a particle swarm optimization (PSO) based algorithm for predicting protein structures in the 3D hydrophobic-polar model. The PSO algorithm performs better than previous algorithms by finding lower-energy structures or by performing fewer energy evaluations [52].
Xin Chen et al. introduced Lévy flight to improve the precision and enhance the capability of jumping out of local optima through a particle mutation mechanism [53].
M. Meissner et al. introduced Particle Swarm Optimization (PSO) to protein structure prediction, finding the global optimum in the free energy landscape of protein structures and yielding near-native structures for two small sample proteins [48].
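For readers unfamiliar with PSO, the basic velocity and position update can be sketched as follows. This is a generic textbook form of the algorithm applied to a toy objective (the sphere function), not any of the protein-specific variants cited above; the function names, the parameter values (w, c1, c2, swarm size, iteration count) and the fixed random seed are assumptions made for illustration.

```python
import random


def pso_minimize(f, dim, bounds, n_particles=20, iters=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise f over `dim` real variables with a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]            # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull towards personal best + pull towards global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)
```

In a structure prediction setting the objective `f` would be an energy function over conformational variables rather than the sphere function used here.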
PROTEIN DATABASES
Some of the protein databases used in protein structure prediction are given below.
a) Protein Data Bank (PDB):
The PDB is a key resource in structural biology. The Protein Data Bank (PDB) is a repository for the 3D structural data of large biological molecules, such as proteins and nucleic acids. The file format initially used by the PDB is called the PDB file format [54, 55].
b) PDBsum:
PDBsum is a pictorial database that provides an at-a-glance overview of the contents of each 3D structure deposited in the Protein Data Bank (PDB). Entries can be accessed by their 4-character PDB code [54, 55, 56, 57].
c) SCOP:
SCOP is the Structural Classification of Proteins database. The SCOP hierarchy contains four main levels: class, fold, superfamily and family. The SCOP database, created by manual inspection and aided by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins [58].
d) SwissProt:
It is a protein sequence database that provides a
high level of integration with other databases and also has a
very low level of redundancy [59].
e) NCBI:
The National Center for Biotechnology
Information advances science and health by providing
access to biomedical and genomic information. The NCBI
has a series of databases relevant to biotechnology and
biomedicine [60].
f) PDBe:
PDBe is the European resource for the collection, organisation and dissemination of data on biological macromolecular structures. PDBe also works actively with the X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy and cryo-Electron Microscopy (EM) communities [55, 56, 57, 61].
III. EVALUATION METRICS
A. HP Energy Model
The HP energy model is based on the
hydrophobicity of the amino acids. In the HP model, when
two non-consecutive hydrophobic amino acids become
topologically neighbours, they release a certain amount of
energy, which for simplicity is shown as −1. The total
free-energy E of a conformation, based on the HP model,
becomes the sum of the energy released by all pairs of
non-consecutive hydrophobic amino acids [15].
g) Protein Quaternary Structure Database (PQS):
The Protein Quaternary Structure file server (PQS)
is an internet resource that makes available coordinates for
likely quaternary states for structures contained in the
Brookhaven Protein Data Bank that were determined by Xray crystallography [55, 61].
h) Homology-derived Structures of Proteins (HSSP):
HSSP is a derived database that merges structural
and sequence protein information. Proteins commencing
the Protein Data Bank are correlated with sequence
homologues which share the same 3D structures [61].
(1)
Here, cij = 1 if ith and jth amino acids are nonconsecutive in the sequence but are neighbours on the
lattice, otherwise 0; and eij = −1 if ith and jth amino acids
are both hydrophobic, otherwise 0.
i) Research Collaboratory for Structural Bioinformatics
(RCSB):
The Research Collaboratory for Structural
Bioinformatics (RCSB) is a non-profit consortium
dedicated to improving the understanding of the function
of biological systems through the study of the 3-D structure
of biological macromolecules [56, 61].
j) Protein Data Bank Japan (PDBj):
PDBj (Protein Data Bank Japan) maintains a
centralized PDB archive of macromolecular structures and
provides integrated tools, in collaboration with the RCSB and
PDBe in the EU. PDBj is supported by JST-NBDC and Osaka
University [54, 56].
B. Free Energy
The most popular lattice model is the HP lattice
model. The HP model has two bead types: black beads
denote hydrophobic amino acids and white beads
denote hydrophilic ones. The dotted lines denote the H-H
contacts in the conformation. The free energy is minimal
when the number of H-H contacts is maximal [1]. Each H-H
contact is assigned a free energy value of −1. The optimal
conformation in the HP model (Fig 9) has the maximum
number of H-H contacts, which gives the lowest energy value.
The free energy of the protein can be calculated by,
k) OCA:
OCA is a browser database for protein
structure/function. OCA integrates information from
the Kyoto Encyclopedia of Genes and Genomes (KEGG),
a collection of online databases dealing with genomes and
biological chemicals, as well as OMIM, PDBselect, Pfam,
PubMed, etc. [57, 61].
(2)
(3)
where the parameter
l) TOPSAN:
The TOPSAN project was created to collect,
share, and distribute information about protein 3D
structures [57].
(4)
Hence, the protein folding problem can be
transformed into an optimization problem, i.e., calculating
the minimal free energy of the protein folding
conformation. Given an HP sequence $s = s_1 s_2 \ldots s_n$,
the task is to find an energy-minimizing conformation of $s$,
i.e., to find $c^* \in C(s)$ such that
$E(c^*) = \min\{E(c) \mid c \in C(s)\}$, where $C(s)$ is the
set of valid conformations.
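For very short sequences, this minimization can be carried out by exhaustive enumeration. The sketch below is an illustration only, assuming a 2D square lattice and treating the valid conformations as self-avoiding walks; the names `energy` and `best_conformation` are hypothetical. The search is exponential in the sequence length, which is why heuristic algorithms are used in practice.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def energy(sequence, path):
    """E(c): -1 for every non-consecutive H-H pair on adjacent lattice sites."""
    pos = {p: i for i, p in enumerate(path)}
    total = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = pos.get((x + dx, y + dy))
            # j > i + 1 counts each pair once and skips chain neighbours
            if j is not None and j > i + 1 and sequence[j] == "H":
                total -= 1
    return total


def best_conformation(sequence):
    """Enumerate the self-avoiding walks for `sequence`; return (E(c*), c*)."""
    if not sequence:
        raise ValueError("sequence must be non-empty")
    best = (1, None)  # every valid walk has energy <= 0, so 1 is a safe start

    def extend(path, occupied):
        nonlocal best
        if len(path) == len(sequence):
            e = energy(sequence, path)
            if e < best[0]:
                best = (e, list(path))
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:  # validity: no lattice site is used twice
                path.append(nxt)
                occupied.add(nxt)
                extend(path, occupied)
                occupied.remove(nxt)
                path.pop()

    extend([(0, 0)], {(0, 0)})
    return best


e_min, c_star = best_conformation("HPPHPH")
print(e_min, c_star)
```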
The minimum free energy function of the 2D HP lattice
model [42] is calculated under the following conditions:
an orthogonal array and instead use the signal-to-noise ratio as the main evaluation criterion.
D. Measure of prediction accuracy:
Root Mean Square Deviation (RMSD) measures
the average distance between corresponding atoms after the
predicted and the real [42, 62] structures have been
optimally superimposed on each other. The formula is
given by

$$\mathrm{RMSD}(a, b) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert r_{ai} - r_{bi} \rVert^{2}} \tag{8}$$

where $n$ is the length of the protein sequence, and $r_{ai}$ and
$r_{bi}$ are the positions of atom $i$ in structures $a$ and $b$,
respectively.
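A minimal Python sketch of this measure follows. It assumes the two structures are given as equally long lists of coordinate tuples that have already been superimposed; no alignment step is performed, and the function name `rmsd` and the sample coordinates are hypothetical.

```python
import math


def rmsd(a, b):
    """Root mean square deviation between two equally long coordinate lists.

    Assumes the structures are already superimposed; no alignment is done here.
    """
    if len(a) != len(b) or not a:
        raise ValueError("structures must be non-empty and of equal length")
    # Sum of squared distances between corresponding atoms
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(squared / len(a))


# Hypothetical three-atom structures that differ in one position only.
predicted = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
native = [(0, 0, 0), (1, 0, 0), (1, 1, 1)]
print(rmsd(predicted, native))
```

An RMSD of 0 means the two coordinate sets coincide; larger values indicate a poorer match between the predicted and the real structure.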
IV. CONCLUSION:
The aim of the protein structure prediction
problem is to determine the structure from a given amino acid
sequence. This paper has reviewed many of the evolutionary
algorithms used to predict the structure, and has also listed
the protein databases and tools. Using the protein databases,
one can easily find a particular protein ID and the
information about that specific protein. The tools are
used to predict the secondary structure, including alpha,
turn and coil values. Finally, the performance measures for
evaluating the algorithms were presented.
Fig. 9 An optimal conformation for the sequence
“(HP)2PH(HP)2(PH)2HP(PH)2” in the 2D HP lattice model [1]
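The compact notation in this caption, where a number after a parenthesised group is a repeat count, can be expanded mechanically. The following Python sketch is an illustration only; the helper name `expand_hp` is hypothetical, and it handles only parenthesised groups followed by a count and bare runs of H/P letters.

```python
import re


def expand_hp(shorthand):
    """Expand shorthand such as '(HP)2PH' into the plain string 'HPHPPH'."""
    compact = shorthand.replace(" ", "")
    # Each token is either a parenthesised group with a repeat count
    # or a run of bare H/P letters.
    tokens = re.findall(r"\(([HP]+)\)(\d+)|([HP]+)", compact)
    return "".join(group * int(count) if group else plain
                   for group, count, plain in tokens)


sequence = expand_hp("(HP)2PH(HP)2 (PH)2HP(PH)2")
print(sequence, len(sequence))
```

Expanded this way, the caption's sequence is a chain of 20 residues.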
C. Signal to Noise Ratio
The signal-to-noise ratio (SNR) is a quality index
used in the communications industry to evaluate
communications systems [2]. The SNR is an index of
robustness: it measures the quality of energy
transformation. Depending on the type of characteristic, the
SNR falls into several categories: lower is better (LB),
nominal is best (NB), and higher is better (HB). The equations
for calculating the SNR ($\eta$) for LB and HB characteristics are:
REFERENCES:
1. Cheng-Jian Lin and Shih-Chieh Su, “Using An Efficient Artificial Bee Colony Algorithm For Protein Structure Prediction On Lattice Models”, International Journal of Innovative Computing, Information and Control, ICIC International © 2012, ISSN 1349-4198, Volume 8, Number 3(B).
2. Cheng-Jian Lin, Ming-Hua Hsieh, “An efficient hybrid Taguchi-genetic algorithm for protein folding simulation”, Expert Systems with Applications (2009) 36, 12446–12453.
3. Jacek Blazewicz, Ken Dill, Piotr Lukasiak and Maciej Milostan, “A Tabu Search Strategy For Finding Low Energy Structures Of Proteins In HP-Model”, Computational Methods in Science and Technology (2004), 10, 7-19.
4. Jingfa Liu, Yuanyuan Sun, Gang Li, Beibei Song, Weibo Huang, “Heuristic-based tabu search algorithm for folding two-dimensional AB off-lattice model proteins”, Computational Biology and Chemistry (2013) 47, 142–148.
(i) Lower is Better (LB):
(6)
(ii) Higher is Better (HB):
(7)
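As a hedged illustration of the LB and HB cases, the sketch below uses the standard Taguchi definitions, $\eta_{LB} = -10 \log_{10}\big(\tfrac{1}{n}\sum_{i} y_i^{2}\big)$ and $\eta_{HB} = -10 \log_{10}\big(\tfrac{1}{n}\sum_{i} 1/y_i^{2}\big)$. These forms, the symbol $y_i$ for the $i$-th observed value, and the function names are assumptions made for this example; they are not taken from the cited work [2].

```python
import math


def snr_lower_is_better(values):
    """eta = -10 * log10(mean(y_i ** 2)); assumed Taguchi LB form."""
    return -10.0 * math.log10(sum(y * y for y in values) / len(values))


def snr_higher_is_better(values):
    """eta = -10 * log10(mean(1 / y_i ** 2)); assumed Taguchi HB form.

    Every value must be non-zero, otherwise the reciprocal is undefined.
    """
    return -10.0 * math.log10(sum(1.0 / (y * y) for y in values) / len(values))


# Hypothetical observations from repeated runs of one parameter setting.
observations = [10.0, 10.0, 10.0]
print(snr_lower_is_better(observations))
print(snr_higher_is_better(observations))
```

In both cases a larger value of the ratio indicates the preferred setting, which is why the optimization aims to maximize it.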
An orthogonal array is used for optimization, i.e.,
to maximize the signal-to-noise ratio. It’s necessary to use
5. Jianlin Cheng, Allison N. Tegge, and Pierre Baldi, “Machine Learning Methods for Protein Structure Prediction”, IEEE Reviews In Biomedical Engineering (2008) Vol. 1.
6. Pauling, L., and Corey, R. B., “The pleated sheet, a new layer configuration of the polypeptide chain”, Proc. Nat. Acad. Sci (1951) 37, pp. 251–256.
7. Ashish Ghosh, Bijnan Parai, “Protein secondary structure prediction using distance based classifiers”, International Journal of Approximate Reasoning (2008), 47, 37–44, doi:10.1016/j.ijar.2007.03.007.
8. Pauling, L., Corey, R. B., and Branson, H. R., “The structure of proteins: Two hydrogen bonded helical configurations of the polypeptide chain”, Proc. Nat. Acad. Sci (1951) Vol 37, pp. 205–211.
9. Howard Holley, L., and Martin Karplus, “Protein secondary structure prediction with a neural network”, Proc. Natl. Acad. Sci. USA (1989), Vol. 86, pp. 152-156, Biophysics.
10. Ning Qian and Terrence J. Sejnowski, “Predicting the Secondary Structure of Globular Proteins Using Neural Network Models”, J. Mol. Biol (1988), 202, 865-884.
11. Richardson, J. S., “The Anatomy and Taxonomy of Protein Structure”, Adv. in Prot. Chem., 34, 167-339. (Tertiary Structure Used)
12. Kendrew, C., Dickerson, Strandberg, B. E., Hart, R. J., Davies, D. R., Phillips, D. C., and Shore, V.C., “Structure of myoglobin: A three-dimensional Fourier synthesis at 2 Å resolution”, Nature (1960), vol. 185, pp. 422–427.
13. file:///F:/charcteristic/Protein%20Structure%20%20Primary,%20Secondary,%20Tertiary,%20Quatemary%20Structures.htm
14. Ivan Kondov, “Protein structure prediction using distributed parallel particle swarm optimization”, Nat Comput (2013), 12:29–41, DOI 10.1007/s11047-012-9325-x.
15. Mahmood A Rashid, Md Tamjidul Hoque, Hakim Newton M.A., Duc Nghia Pham, Abdul Sattar, “A New Genetic Algorithm for Simplified Protein Structure Prediction”, Copyright (2012) Springer Berlin/Heidelberg, doi.org/10.1007/978-3-642-35101-3_10.
16. Dill, K. A., “Theory for the Folding and Stability of Globular Proteins”, Biochemistry, 24(6), March (1985), pp. 1501–1509.
17. Mahmood A. Rashid, Hakim Newton, M. A., Md. Tamjidul Hoque, and Abdul Sattar, “Mixing Energy Models in Genetic Algorithms for On-Lattice Protein Structure Prediction”, Hindawi Publishing Corporation, BioMed Research International, Volume (2013), Article ID 924137, 15 pages, http://dx.doi.org/10.1155/2013/924137.
18. Berger, B., Leighton, T., “Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete”, J. Comp. Biol (1998) V5, N1, pp. 27–40.
19. Alena Shmygelska, and Holger H Hoos, “An ant colony optimization algorithm for the 2D and 3D hydrophobic polar protein folding problem”, BMC Bioinformatics (2005), doi:10.1186/1471-2105-6-30.
20. Xiaolong Zhang, Ting Wang, Huiping Luo, Jack Y Yang, Youping Deng, Jinshan Tang, Mary Qu Yang, “3D Protein structure prediction with genetic tabu search algorithm”, BMC Systems Biology (2010), 4(Suppl1):S6, http://www.biomedcentral.com/1752-0509/4/S1/S6.
21. Chou P. Y., and Fasman G. D., “Conformational Parameters for Amino Acids in Helical, β-Sheet, and Random Coil Regions Calculated from Proteins”, Biochemistry (1974), 13(2), 211-222.
22. Chou, P.Y. and Fasman G.D., “The Chou-Fasman Method for Secondary Structure Prediction”, Prediction of protein conformation, Biochemistry 13(2), 222-45 (1974), Protein Physics SI2700 - Spring 2012.
23. Jean Garnier, Jean-François Gibrat, and Barry Robson, “GOR Method for Predicting Protein Secondary Structure from Amino Acid Sequence”, Methods In Enzymology, Vol. 266.
24. Taner Z. Sen, Robert L. Jernigan, Jean Garnier and Andrzej Kloczkowski, “GOR V server for protein secondary structure prediction”, Applications Note (2005) Vol. 21 no. 11, pages 2787–2788, doi:10.1093/bioinformatics/bti408.
25. Kloczkowski, A., Ting, K-L., Jernigan, R.L., and Garnier, J., “Information for Protein Secondary Structure Prediction From Amino Acid Sequence”, Proteins: Structure, Function, and Genetics (2002) 49:154–166.
26. Jacek Błażewicz, Piotr Łukasiak and Szymon Wilk, “New machine learning methods for prediction of protein secondary structures”, Control and Cybernetics, vol. 36 (2007) No. 1.
27. Ward, J. J., McGuffin, L. J., Buxton B. F., and Jones, D. T., “Secondary structure prediction with support vector machines”, (2003) Vol. 19 no. 13, pages 1650–1655, DOI: 10.1093/bioinformatics/btg223.
28. Minh N. Nguyen, Jagath C. Rajapakse, “Multi-Class Support Vector Machines for Protein Secondary Structure Prediction”, Genome Informatics (2003) 14: 218–227.
29. Long-Hui Wang, Juan Liu, “Predicting Protein Secondary Structure by a Support Vector Machine Based on a New Coding Scheme”, Genome Informatics (2004) 15(2): 181–190.
30. Hae-Jin Hu, Yi Pan, Robert Harrison, and Phang C. Tai, “Improved Protein Secondary Structure Prediction Using Support Vector Machine With a New Encoding Scheme and an Advanced Tertiary Classifier”, IEEE Transactions On Nanobioscience, December (2004) Vol. 3, No. 4, 265.
31. Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas, “Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines”, copyright Springer-Verlag Berlin Heidelberg (2006), pp. 93–104.
32. Sujun Hua and Zhirong Sun, “A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: Support Vector Machine Approach”, J. Mol. Biol. (2001) 308, 397–407, doi:10.1006/jmbi.2001.4580.
33. Pierre Baldi and Gianluca Pollastri, “The Principled Design of Large-Scale Recursive Neural Network Architectures–DAG-RNNs and the Protein Structure Prediction Problem”, Journal of Machine Learning Research 4 (2003) 575-602, Submitted 2/02; Revised 4/03; Published 9/03.
34. Armando Blanco, David A. Pelta, José-L. Verdegay, “Applying a Fuzzy Sets-based Heuristic to the Protein Structure Prediction Problem”, International Journal Of Intelligent Systems (2002), Vol. 17, 629–643, DOI: 10.1002/int.10042.
35. Rajkumar Bondugula, Ognen Duzlevski, and Dong Xu, “Profiles And Fuzzy K-Nearest Neighbor Algorithm For Protein Secondary Structure Prediction”, In Proc. of the Third Asia Pacific Bioinformatics Conference, 2005.
36. Seung-Yeon Kim, Jaehyun Sim, and Julian Lee, “Fuzzy k-Nearest Neighbor Method for Protein Secondary Structure Prediction and Its Parallel Implementation”, in D.-S. Huang, K. Li, and G.W. Irwin (Eds.), ICIC 2006, LNBI 4115, pp. 444–453, 2006, copyright Springer-Verlag Berlin Heidelberg.
37. Jyh-Shing Roger Jang, “ANFIS: Adaptive-network-based fuzzy inference system”, IEEE Transactions on Systems, Man and Cybernetics, 23 (0018-9472): 665–685, 1993.
38. Subhendu Bhusan Rout, Satchidananda Dehury, Bhabani Sankar Prasad Mishra, “Protein Structure Prediction using Genetic Algorithm”, IJCSMC, Vol. 2, Issue 6, June 2013, pg. 187–192.
39. Camelia Chira, Dragos Horvath, Dumitru Dumitrescu, “Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics”, Lecture Notes in Computer Science Volume 6023 (2010), pp 38-49.
40. Tantar, A., Melab, N., Talbi, G., Parent, B., Horvath, D., “A parallel hybrid genetic algorithm for protein structure prediction on the computational grid”, Future Generation Computer Systems 23 (2007) 398–409.
41. Thang N. Bui and Gnanasekaran Sundarraj, “An Efficient Genetic Algorithm for Predicting Protein Tertiary Structures in the 2D HP Model”, GECCO ’05 Proceedings of the 7th annual conference on Genetic and Evolutionary computation, Pages 385-392, ISBN: 1-59593-010-8, doi:10.1145/1068009.1068072.
42. Trent Higgs, Bela Stantic, Md Tamjidul Hoque and Abdul Sattar, “Genetic Algorithm Feature-Based Resampling for Protein Structure Prediction”, WCCI 2010 IEEE World Congress on Computational Intelligence, July 18-23 (2010), CCIB, Barcelona, Spain.
43. Karaboga N, Cetinkaya MB, “A novel and efficient algorithm for adaptive filtering: Artificial bee colony algorithm”, Turk J Electr Eng Comput Sci 19 (2011) (1):175–190.
44. Sree PK, Babu IR, Devi NS., “Investigating an Artificial Immune System to strengthen protein structure prediction and protein coding region identification using the cellular automata classifier”, Int J Bioinform Res Appl (2009); 5(6):647-62.
45. Zhexin Xiang, “Advance in protein homology modeling”, Curr Protein Pept Sci (2006) June; 7(3):217-227.
46. David Baker and Andrej Sali, “Protein structure prediction and structural Genomics”, Science (2001) 294(5540):93–96.
47. Jooyoung Lee, Sitao Wu, and Yang Zhang, “Ab Initio Protein Structure Prediction”, © Springer Science + Business Media B.V (2009).
48. Meissner, M., and Schneider, G., “Protein Folding Simulation by Particle Swarm Optimization”, The Open Structural Biology Journal (2007) 1, 1-6.
49. C.A. Floudas, “Computational Methods in Protein Structure Prediction”, Biotechnol. Bioeng (2007), 97: 207–213, Wiley Periodicals, Inc.
50. Stefka Fidanova, Ivan Lirkov, “Ant Colony System Approach for Protein Folding”, Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 887–891, ISBN 978-83-60810-14-9, ISSN 1896-7094.
51. Vargas Benitez, C., and Lopes, H., “Parallel artificial bee colony algorithm approaches for protein structure prediction using the 3dhp-sc model”, Intelligent Distributed Computing, 4 (2010) 255-264.
52. Nashat Mansour, Fatima Kanj, Hassan Khachfe, “Particle swarm optimization approach for protein structure prediction in the 3D HP model”, Interdisciplinary Sciences: Computational Life Sciences, September (2012), Volume 4, Issue 3, pp 190-200.
53. Xin Chen, Mingwei Lv, Lihui Zhao and Xudong Zhang, “An Improved Particle Swarm Optimization for Protein Folding Prediction”, I.J. Information Engineering and Electronic Business (2011) 1, 1-8.
54. http://www.bioinformaticsweb.net/datalink.html
55. http://www.science.co.il/Biomedical/Structure-Databases.asp
56. http://en.wikipedia.org/wiki/List_of_biological_databases#Protein_structure_databases
57. file:///F:/algorithms/extra/Protein%20structure%20database%20-%20Wikipedia,%20the%20free%20encyclopedia.htm
58. http://scop.mrc-lmb.cam.ac.uk.
59. http://www.bioinformaticsweb.net/data.html
60. file:///F:/Untitled%20Document.htm
61. file:///F:/allover/algorithms/extra/PDBsum%20entry%20%201g8p.htm
62. Fogel, G.B., and Corne, D.W., “Evolutionary Computation in Bioinformatics”, Elsevier, 2003.