IRACST - International Journal of Computer Science and Information Technology & Security (IJCSITS), ISSN: 2249-9555, Vol. 4, No. 5, October 2014

Computational Analysis of Protein Structure Prediction and Folding

D. Ramyachitra, Assistant Professor, Department of Computer Science, Bharathiar University, Coimbatore, India, jaichitra1@yahoo.co.in
V. Veeralakshmi, M.Phil Research Scholar, Department of Computer Science, Bharathiar University, Coimbatore, India, veeralakshmi13@gmail.com

ABSTRACT: The protein structure prediction (PSP) problem is computationally challenging. Predicting protein structure from sequence information is of great significance, because the properties of proteins are critically determined by their structures. All the information necessary to fold a protein into its native structure is contained in its amino-acid sequence, yet the native structure itself is generally not known in advance. The protein folding problem is the problem of predicting a protein's tertiary structure. Misfolding occurs when the protein folds into a 3D structure that does not represent its correct native structure. In the HP model, each amino acid is classified by its hydrophobicity as H (hydrophobic or non-polar) or P (hydrophilic or polar). The HP energy model focuses the search on structures that have hydrophobic cores. Many algorithms have been used to solve the PSP problem by finding the lowest-energy conformations. In this paper we review the protein structure prediction problem and some of the techniques used to predict structure.

Key words: Protein structure prediction, HP energy model, Protein folding.

I. INTRODUCTION

The primary structure of a protein is a linear sequence of amino acids connected by peptide bonds. Protein structures are determined experimentally by techniques such as nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography. These techniques require isolation and purification of the target protein and, for X-ray crystallography, its crystallization [1]. The levels of protein structure are shown in Fig 1.

Fig. 1 The Levels of Protein Structure Prediction

The protein structure prediction problem comprises two major sub-problems: (1) finding good measures for the quality of candidate structures, and (2) given such measures, determining optimal or close-to-optimal structures for a given amino-acid sequence (Krasnogor, Hart, Smith, & Pelta, 1999). The computational approach to protein structure is very attractive [1, 2]. The optimal conformation in the HP model is the one that has the maximum number of H–H contacts (Fig 2), which gives the lowest energy value [2].

Fig. 2 An optimal conformation for the sequence "(HP)2PH(HP)2(PH)2HP(PH)2" in a 2D lattice model [2]

The protein folding problem in the 2D HP model has been proved to be NP-hard. In 1993, Unger and Moult showed that finding the native conformation in a number of simplified models is an NP-hard problem [3]. In the AB off-lattice model, hydrophobic residues are labeled A and polar (hydrophilic) residues are labeled B. Fibonacci sequences of A and B residues have been studied using potentials that include bending and Lennard–Jones energy [4]. In 2D HP models, many algorithms have been explored to find the minimum-energy configuration of small proteins.

The remaining sections of this paper are organized as follows. Section 2 gives an overview of protein structure and Section 3 describes the performance metrics. Finally, Section 4 gives the conclusion.

II. AN OVERVIEW OF PROTEIN STRUCTURE

Proteins perform a variety of biological tasks. The structure of a protein determines its function. Protein structure is more conserved than protein sequence, and more closely related to function. A protein is a linear polypeptide chain composed of 20 different kinds of amino acids, represented by a sequence of letters (Fig 3, left).
It folds into a tertiary (3-D) structure (Fig 3, middle) composed of three kinds of local secondary structure elements (helix: red; beta strand: yellow; loop: green). The protein with its native 3-D structure can carry out several biological functions in the cell (Fig 3, right).

Fig. 3 Protein sequence-structure-function relationship [5]

A. PROTEIN STRUCTURE HIERARCHY:
The four levels of protein structure (shown in Fig 4) are primary, secondary, tertiary and quaternary structure.

Fig. 4 Protein Structure Hierarchy

a) Primary Structure:
A protein is a sequence of amino acid building blocks arranged in a linear chain and joined together by peptide bonds. This linear polypeptide chain is called the primary structure of the protein. The primary structure is typically represented by a sequence of letters over a 20-letter alphabet associated with the 20 naturally occurring amino acids [6]. Protein sequences range in length from about 30 to about 30,000 amino acids; most have a few hundred.

b) Secondary Structure:
Secondary structure prediction is the task of predicting the conformational state of each amino acid in a protein sequence [7]. The protein folds into local secondary structures including alpha helices (H) and beta strands (E), which may be connected by loop regions or coils.

Thang N. Bui et al. proposed an efficient genetic algorithm for the protein folding problem using the HP model on the two-dimensional square lattice [41]. The algorithm performs very well against existing evolutionary algorithms and Monte Carlo algorithms.

Alpha helix: An alpha helix is a tightly coiled, rod-like structure. It is formed from one continuous region of the chain through hydrogen bonds between the carbonyl (C=O) group [8] of the residue at position i and the NH group of residue i+4. L. Howard Holley and Martin Karplus assigned helix to any group of four or more contiguous residues (the minimum helix in the Kabsch and Sander classification) having helix output values greater than the sheet outputs and greater than a threshold value [9].

Beta strand: A beta strand is a fragment of a sheet-like structure. Beta sheets are formed by linking two or more beta strands through hydrogen bonds. The side chains of adjacent residues point in opposite directions, since trans peptide bonds place the R groups on opposite sides. A beta strand does not exist on its own; there must be two or more, and in proteins 4-5 strands typically make up a beta sheet. Beta sheets may consist of parallel strands, anti-parallel strands, or a mixture of both. Qian et al. report a maximum overall prediction accuracy on the training set of 63.2%. Prediction accuracy increased for residues near the amino-terminus and for highly buried versus partially exposed β-strands, and residues with higher output activities were found to be more accurately predicted [10]. Richardson describes β-turns as a specific class of chain reversals localized over a four-residue sequence. Network predictions for β-turns begin with the hypothesis that the information necessary to force the sequence of amino acids into a β-turn exists locally in a small window of residues. The low values for the overall prediction accuracy reflect the stringent requirement that all four residues in the β-turn must be correctly predicted [11].

Coils: Coils have no fixed regular shape. Super-secondary structures are commonly found arrangements of secondary structure elements, such as the helix-loop-helix. L. Howard Holley and Martin Karplus considered residues that are not assigned to helices or beta strands to be coil. The threshold parameter value is adjusted to maximize the accuracy of secondary structure assignment on the training set [9].
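The assignment rule attributed to Holley and Karplus above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `assign_states`, the per-residue input lists and the parameter names are hypothetical, and the minimum run length for strands (`min_sheet=2`) is an assumption, since the text above states only the four-residue minimum for helices.

```python
from itertools import groupby


def assign_states(helix_out, sheet_out, threshold, min_helix=4, min_sheet=2):
    """Turn per-residue helix/sheet output values into an H/E/C string.

    A residue is provisionally helix (H) if its helix output exceeds both its
    sheet output and the threshold, and provisionally strand (E) in the
    symmetric case. Runs shorter than the minimum length fall back to coil (C).
    min_sheet is an illustrative assumption, not taken from the text.
    """
    if len(helix_out) != len(sheet_out):
        raise ValueError("helix_out and sheet_out must have the same length")

    raw = []
    for h, s in zip(helix_out, sheet_out):
        if h > s and h > threshold:
            raw.append("H")
        elif s > h and s > threshold:
            raw.append("E")
        else:
            raw.append("C")

    min_run = {"H": min_helix, "E": min_sheet, "C": 1}
    result = []
    for state, run in groupby(raw):
        length = len(list(run))
        # Runs that are too short are relabelled as coil.
        result.append((state if length >= min_run[state] else "C") * length)
    return "".join(result)
```

With a threshold of 0.5, four consecutive residues with high helix output are kept as helix, while an isolated pair of such residues is relabelled as coil. In practice the threshold would be tuned on a training set, as the text describes.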
c) Tertiary Structure:
The tertiary structure is described [7] by the x, y and z coordinates of all the atoms [12] of a protein or, in a coarser description, by the coordinates of the backbone atoms. The three-dimensional conformation results from the secondary structures folding together. Ivan Kondov proposed particle swarm optimization for computer-aided prediction of proteins' three-dimensional structure. Asynchronous parallelization speeds up the simulation more than synchronous parallelization and reduces the effective time for predictions [14].

d) Quaternary Structure:
A protein with a quaternary structure consists of more than one practically identical sub-unit, not joined by strong bonds. The quaternary structure describes the spatial packing of several folded polypeptides [13]. Not all proteins have a quaternary level of structure. An example of a quaternary structure is human hemoglobin, which is made up of four distinct subunits, each an individual chain of amino acids, but functions as a single complex.

B. PROTEIN MODELS:
Protein structure can be specified at different levels of the hierarchy. Because of the complexity of the problem and limited computing resources, simplified models are used alongside all-atom models, so that protein structure is represented using two categories of model (given in Fig 5).

All Atom Model: Protein structures are represented by a list of the 3D coordinates of all atoms in the protein. An all-atom model is the desired result of structure prediction, but at this level of detail it is very difficult to identify similar substructures across different proteins, and hence to generalize and abstract. Ivan Kondov [14] used an all-atom force field and found that the performance of the method improves when periodic boundary conditions are applied to the search space. The standard algorithm, as implemented in the ArFlock library, was used to find low-energy conformations of several peptides.
Fig. 5 Protein structure models

Simplified Models: All-atom models are often not computationally feasible, so simplified models are used to produce approximate solutions. Each amino acid of the sequence occupies a point on the lattice, and the chain forms a continuous self-avoiding walk [15]. Simplified models include very abstract models such as the HP model. The classification of simplified models is shown in Fig 6.

Lattice: Dill, K. A. used the HP model on the 3D cubic lattice as the 3D HP model. Each amino acid is classified by its hydrophobicity as H (hydrophobic or non-polar) or P (hydrophilic or polar). The objective of the protein folding problem is to determine a conformation of minimum energy. The conformation of a protein in the HP model is embedded as a self-avoiding walk in either a two-dimensional or a three-dimensional lattice [16]. Mahmood A. Rashid et al. developed a genetic algorithm that mainly uses a high-resolution energy model for protein structure evaluation but uses a low-resolution HP energy model to focus the search on structures that have hydrophobic cores [17]. Berger et al. studied protein folding in the HP model, called the HP-Protein Folding problem: for a given protein, find a valid conformation [18] on the Cartesian lattice such that the energy is minimum. The HP-Protein Folding problem is NP-hard. Mahmood A. Rashid et al. used an HP-based energy model on the 3D FCC lattice to simplify the problem. Their algorithm, GA+, uses three enhancements: i) an exhaustive generation approach to diversify the search, ii) a novel hydrophobic core-directed macro move to intensify the search, and iii) a random-walk based approach to recover from stagnation. The state-of-the-art results for the face-centered cubic (FCC) lattice based hydrophobic-polar (HP) energy model have been achieved by local search (LS) methods [15].
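To make the HP lattice model concrete, the following minimal Python sketch evaluates the energy of a 2D square-lattice conformation. It assumes the usual convention that each contact between two hydrophobic residues that are lattice neighbours but not consecutive in the chain contributes −1. The function names and the encoding of a conformation as a list of (x, y) integer coordinates are illustrative assumptions, not taken from any of the cited papers.

```python
def is_self_avoiding_walk(coords):
    """True if no lattice point repeats and consecutive residues are adjacent."""
    if len(set(coords)) != len(coords):
        return False
    return all(
        abs(x1 - x2) + abs(y1 - y2) == 1
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )


def hp_energy(sequence, coords):
    """Energy of a 2D HP conformation: -1 per non-consecutive H-H contact.

    sequence: string over {'H', 'P'}; coords: list of (x, y) lattice points,
    one per residue, in chain order (an assumed encoding).
    """
    if len(sequence) != len(coords) or not is_self_avoiding_walk(coords):
        raise ValueError("conformation is not a valid self-avoiding walk")

    index_at = {point: i for i, point in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every neighbouring pair is seen once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                energy -= 1
    return energy
```

For example, the four-residue chain "HPPH" folded around a unit square brings its two H residues into contact and has energy −1, whereas the same chain laid out in a straight line has energy 0. A search algorithm such as those surveyed here would call an evaluation of this kind on each candidate conformation.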
Alena Shmygelska et al. addressed the HP protein folding problem with an algorithm that incorporates a local search phase, which takes the initially built protein conformation and attempts to optimize its energy using probabilistic long-range moves [19]. Cheng-Jian Lin et al. used an efficient hybrid Taguchi-genetic algorithm (HTGA) to solve the protein folding problem in the 2D HP model. The Taguchi method is used to improve the crossover operation so that better genes are selected, and the merits of particle swarm optimization (PSO) are used to improve the mutation mechanism [2].

Off Lattice: Xiaolong Zhang et al. proposed a genetic tabu search method for predicting protein structure. Two important issues in PSP are the design of the structure model and the optimization technique. The structure model reflects the complexity of realistic protein structure. In that study a simplified model, the AB off-lattice model, is used to search for the best conformation of a protein sequence [20]. Jingfa Liu et al. developed a heuristic-based tabu search (HTS) algorithm by integrating a heuristic initialization mechanism, a heuristic conformation updating mechanism, and the gradient method into an improved TS algorithm. The HTS algorithm is quite promising for finding ground states of AB off-lattice model proteins [4].

HP Model: Cheng-Jian Lin et al. proposed an efficient artificial bee colony algorithm for protein structure prediction on lattice models. The modified ABC algorithm is applied to the protein folding problem based on the hydrophobic-polar lattice model [1].

Fig. 6 Classification of simplified models

C. PROTEIN STRUCTURE PREDICTION TECHNIQUES:
The difficulty of protein structure prediction is usually tackled in two main steps:
1. Protein secondary structure prediction
2. Protein tertiary structure prediction
a) Protein Secondary Structure Prediction:
Many techniques are used to solve the protein secondary structure prediction problem. Some of them are given in Fig 7.

Fig. 7 Protein Structure Prediction Techniques

STATISTICAL METHOD:

Chou-Fasman (CF) Method: The Chou-Fasman method [21] is one of the first methods implemented for protein secondary structure prediction. The method relies on two kinds of values for each amino acid: propensity values, which express how likely the amino acid is to appear within a given structure, and frequency values, which express how often it is found in a hairpin turn. Taking these values into account, the method predicts regions of α-helices, regions of β-sheets, and positions where β-turns may appear. Chou, P.Y. and Fasman, G.D. predicted alpha helices and beta strands by setting a cut-off for the total propensity of a window of four residues. Residues were classified as helix or strand formers and breakers. Formers contribute positively to the formation of the structural element, whereas breakers prevent or stop its formation [22].

Garnier-Osguthorpe-Robson (GOR) Method: Jean Garnier et al. proposed the GOR method [23], one of the most popular methods of secondary structure prediction. It was the first real secondary structure prediction method implemented as a computer program. The addition of homologous sequence information through multiple alignments has given a significant boost to the accuracy of secondary structure predictions. Taner Z. Sen et al. developed the GOR V web server for protein secondary structure prediction. The algorithm combines Bayesian statistics, information theory and evolutionary information. Although GOR V has been among the most successful methods, its unavailability online has limited its popularity [24]. A. Kloczkowski et al. released the new GOR V algorithm [25] on an online prediction server, limiting the prediction to 375 sequences that have 59 PSI-BLAST alignments.

MACHINE LEARNING ALGORITHM:
Jacek Błażewicz et al. applied machine learning methods [26] such as LAD, LEM2 and MODLEM to protein secondary structure prediction in order to handle large data sets. LEM2 and MODLEM are rule induction algorithms that generate a minimal set of rules given a set of positive examples and a set of negative examples. The aim is to identify which method is more suitable for the analyzed data and to find rules that predict the secondary structure. The best average results were obtained using the LAD algorithm. The two types of machine learning algorithm are shown in Fig 8.

Fig. 8 Machine Learning Algorithm Types

Support Vector Machine: J. J. Ward et al. set out to develop a reliable prediction method using an alternative technique and to investigate the applicability of SVMs. The SVM performs similarly to the state-of-the-art PSIPRED prediction method on a non-homologous test set of 121 proteins, in spite of being trained on considerably fewer examples. A simple consensus of the SVM, PSIPRED and PROFsec achieves higher prediction accuracy than the individual methods [27].

Minh N. Nguyen et al. investigated multi-class SVM methods, which involve solving a much larger optimization problem and are applicable to small datasets. The multi-class SVM methods are more suitable for protein secondary structure (PSS) prediction [28] than the other methods, including binary SVMs. It is feasible to improve the prediction accuracy by adding a second-stage multi-class SVM to capture the contextual information among secondary structural elements.

Long-Hui Wang et al. proposed a kernel-method support vector machine that takes into account the physical-chemical properties and structural properties of amino acids.
The SVM classifiers would be improved further by using larger training sets that contain new protein structures, although this requires more memory to store the data points. It is among the best-performing methods for predicting protein secondary structure [29].

Hae-Jin Hu et al. investigated the SVM learning machine as a means of improving the prediction accuracy of secondary structure. In the first approach, new encoding schemes are applied and optimized. In the second approach, a new tertiary classifier that combines the results of one-versus-one binary classifiers is designed, and its efficiency is compared with that of existing tertiary classifiers. The tertiary classification can be decomposed into a set of binary classifications. SVMs have also improved performance in many other areas such as pattern recognition, data mining, and machine learning [30].

Blaise Gassend et al. proposed Hidden Markov Support Vector Machines (HM-SVMs) [31]. The HMM is trained using a support vector machine method that iteratively picks a cost function based on a set of constraints, and uses the predictions resulting from this cost function to generate new constraints for the next iteration. Unlike most secondary structure methods, it can predict not only which residues participate in a beta sheet but also which of these residues form hydrogen bonds between adjacent sheets.

Sujun Hua et al. applied the SVM, a new approach to supervised pattern classification that has been applied to pattern recognition problems including object recognition, speaker identification, and gene function prediction with microarray expression profiles. The SVM method achieved good segment overlap accuracy (SOV) through sevenfold cross-validation on a database of 513 non-homologous protein chains with multiple sequence alignments [32].
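The decomposition of three-state classification into one-versus-one binary classifiers mentioned above can be illustrated with a simple majority vote. The sketch below is not the classifier of Hu et al. or of any other cited work. It assumes three pairwise classifiers (H/E, E/C and C/H) whose winning labels are already available for each residue, and it resolves a three-way tie by falling back to coil, which is an arbitrary choice made for the example. All names are hypothetical.

```python
from collections import Counter


def vote(pairwise_winners, tie_label="C"):
    """Combine the winners of pairwise (one-versus-one) classifiers.

    pairwise_winners: labels such as ("H", "E", "H"), one per binary classifier.
    If the top two labels receive equal votes, tie_label is returned
    (an assumed tie-breaking rule).
    """
    if not pairwise_winners:
        raise ValueError("at least one pairwise result is required")
    ranked = Counter(pairwise_winners).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]


def combine_sequence(per_residue_winners, tie_label="C"):
    """Apply the vote to every residue and return an H/E/C string."""
    return "".join(vote(w, tie_label) for w in per_residue_winners)
```

For a residue where the H/E and C/H classifiers both pick H, the vote returns H regardless of the E/C result. When all three classifiers disagree, the residue is labelled with the fallback state.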
Neural Network: Pierre Baldi et al. proposed several classes of recursive artificial neural network (RNN) [33] architectures for large-scale applications, derived using the directed acyclic graph (DAG-RNN) approach. The approach is used to derive state-of-the-art predictors for protein structural features such as secondary structure (1D) and both fine- and coarse-grained contact maps (2D). The internal deterministic dynamics allows efficient propagation of information, as well as training by gradient descent, to tackle large-scale problems.

L. Howard Holley et al. applied neural networks to protein secondary structure prediction. Specialization of a neural network to a particular problem involves the choice of network topology (that is, the number of layers, the size of each layer, and the pattern of connections) and the assignment of connection strengths to each pair of connected units and of thresholds to each unit. The method predicts helix, sheet, and coil [9].

Ning Qian et al. developed a new method for predicting the secondary structure of globular proteins based on non-linear neural network models. The goal of the method is to use the available information in the database of known protein structures to help predict the secondary structure of proteins for which no homologous structures exist [10].

FUZZY SETS:
Armando Blanco et al. proposed a fuzzy adaptive neighborhood search (FANS) to analyze one of the most important problems in computational biology: the protein structure prediction problem. Regarding the application of heuristics to PSP, the same results could potentially be obtained by discarding the population and applying mutations to a single individual [34].

Rajkumar Bondugula et al. proposed a prediction system based on a generalized nearest neighbor method that uses the position-specific scoring matrices (PSSMs) of the query protein sequence as input.

Jyh-Shing Roger Jang proposed adaptive neuro-fuzzy inference systems (ANFIS), one of the most popular types of fuzzy neural networks [37]. ANFIS combines the advantages of fuzzy systems and neural networks in modeling non-linear control systems. Yongxian Wang described a hybrid of a neural network and a fuzzy system, using ANFIS for three-class secondary structure prediction of proteins to produce a better result.

ARTIFICIAL IMMUNE SYSTEM (AIS):
Sree PK et al. proposed an Artificial Immune System (AIS-MACA), a novel computational intelligence technique that can be used to strengthen automated protein prediction systems [44].

EVOLUTIONARY ALGORITHM:

Genetic Algorithm: Subhendu Bhusan Rout et al. proposed a genetic algorithm technique for the prediction of protein structure. This technique helps to work with huge amounts of data and to predict protein structure on a large scale. By analyzing changes in protein structure and providing a metaphor for the underlying processes, the genetic algorithm is very useful for drug design, since it can process an enormous amount of data in a short time [38].

Mahmood A. Rashid et al. proposed a new genetic algorithm for the protein structure prediction problem using the face-centered cubic lattice [17] and the hydrophobic-polar energy model. The results were compared with the state-of-the-art local search algorithm for simplified PSP. The final algorithm, GA+, uses a combination of all three enhancements discussed under the HP model.

Cheng-Jian Lin et al. developed an efficient hybrid Taguchi-genetic algorithm that combines a genetic algorithm, the Taguchi method, and particle swarm optimization (PSO). PSO inspires the mutation mechanism of the genetic algorithm; the GA has a powerful capability for global exploration, while the Taguchi method exploits the optimum offspring. The algorithm can be applied successfully to the protein folding problem based on the hydrophobic-hydrophilic lattice model, and in simulations it performs very well against existing evolutionary algorithms [2].

Camelia Chira et al. addressed the hydrophobic-polar model of the protein folding problem using hill-climbing genetic operators. Crossover and mutation are applied using a steepest-ascent hill-climbing approach [39]. The evolutionary algorithm with hill-climbing operators is successfully applied to the protein structure prediction problem for a set of difficult bidimensional instances from lattice models.

A. Tantara et al. proposed a bicriterion parallel hybrid genetic algorithm (GA) that solves the problem efficiently using a computational grid. It aims to determine not only the ground-state energy conformation of a molecule but also the ensemble of potential low-energy conformations [40].

Trent Higgs et al. presented a feature-based resampling genetic algorithm to refine structures output by PSP software. The two structural measures used are RMSD and TM-Score [42].

Mahmood A. Rashid et al. presented a genetic algorithm for protein structure prediction on the 3D face-centered cubic lattice. A low-resolution energy model could effectively bias the search towards certain promising directions [15].

SWARM INTELLIGENCE:

Artificial Bee Colony Algorithm: Karaboga et al. presented the Artificial Bee Colony (ABC) algorithm for constrained optimization problems. The ABC algorithm was used to solve constrained optimization problems and is reported to produce the best results [43].
b) Protein Tertiary Structure Prediction:
For many proteins and protein domains, and for an increasing number of sequences, prediction of the three-dimensional (3D) or "tertiary" structure from the amino acid sequence should be feasible. Tertiary structure prediction techniques are shown in Fig 7.

TEMPLATE MODELING:

Homology Modeling: Zhexin Xiang reviewed homology modeling. Progress in detecting distant homologues, aligning sequences with template structures, modeling loops and side chains, and detecting errors in a model has contributed to reliable prediction of protein structure [45].

Threading: C.A. Floudas reviewed threading, which generalizes the technique of homology modeling and aligns the unknown sequence to known structures. It is also known as "fold recognition" [49] or "inverse folding". Threading methods aim at fitting a target sequence to a known structure in a library of folds.

TEMPLATE FREE MODELING:

Ab Initio Structure Prediction: David Baker and Andrej Sali classified the methods for protein structure prediction into two main categories; the second, ab initio prediction, works without relying on similarity at the fold level between the target sequence and those of the known structures [46]. Jooyoung Lee et al. reviewed ab initio modelling as a route to a complete solution of the protein structure prediction problem. Predicting protein 3D structures from the amino acid sequence by ab initio modeling helps us to understand the physicochemical principles of how proteins fold in nature [47]. M. Meissner et al. performed ab initio prediction of a set of small protein structures to demonstrate the usefulness of PSO in applied protein structure prediction. With an appropriate energy function, ab initio protein structure prediction should be feasible [48].
EVOLUTIONARY ALGORITHM:

Genetic Algorithm: Xiaolong Zhang et al. investigated the genetic tabu search algorithm in order to develop an efficient optimization algorithm. The crossover and mutation operators improve the local search capability, while the variable population size strategy and the ranking selection strategy maintain the diversity of the population [20].

SWARM INTELLIGENCE:

Ant Colony Optimization Algorithm: Stefka Fidanova and Ivan Lirkov developed an ant algorithm for the 3D HP protein folding problem. They studied how the components of the algorithm contribute to its performance, which is affected by the heuristic function and the selectivity of pheromone updating. The aim is to achieve more realistic folding [50]. Alena Shmygelska et al. investigated a new algorithm, dubbed ACO-HPPFP-3, which is based on very simple structure components. The run-time required by ACO-HPPFP-3 to find best-known energy conformations scales worse with sequence length than that of PERM in 3D [19].

Artificial Bee Colony Algorithm: C. Vargas et al. proposed parallel artificial bee colony algorithm approaches for protein structure prediction using the 3DHP-SC model [51]. The two parallel approaches for the ABC are master-slave and hybrid-hierarchical. The parallel models achieve a good level of efficiency, and the hybrid-hierarchical approach improves the quality of solutions.

Particle Swarm Optimization Algorithm: Nashat Mansour et al. presented a particle swarm optimization (PSO) based algorithm for predicting protein structures in the 3D hydrophobic-polar model. The PSO algorithm performs better than previous algorithms, either by finding lower-energy structures or by performing fewer energy evaluations [52]. Xin Chen et al. introduced a Lévy flight to improve precision and to enhance the capability of escaping local optima through a particle mutation mechanism [53]. M. Meissner et al. introduced particle swarm optimization (PSO) to protein structure prediction, finding the global optimum in the free energy landscape of protein structures and yielding near-native structures for two small sample proteins [48].

PROTEIN DATABASES
Some of the protein databases used in protein structure prediction are given below.

a) Protein Data Bank (PDB): The PDB is a key resource in structural biology. It is a repository for the 3D structural data of large biological molecules, such as proteins and nucleic acids. The file format initially used by the PDB was called the PDB file format [54, 55].

b) PDBsum: PDBsum is a pictorial database that provides an at-a-glance overview of the contents of each 3D structure deposited in the Protein Data Bank (PDB). Entries are accessed by their 4-character PDB code [54, 55, 56, 57].

c) SCOP: SCOP is a structural classification of proteins. The SCOP hierarchy contains four main levels: class, fold, superfamily and family. The SCOP database, created by manual inspection and aided by a battery of automated methods, aims to provide an in-depth and comprehensive description of the structural and evolutionary relationships between all proteins [58].

d) SwissProt: SwissProt is a protein sequence database that provides a high level of integration with other databases and has a very low level of redundancy [59].

e) NCBI: The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information. The NCBI has a series of databases relevant to biotechnology and biomedicine [60].

f) PDBe: PDBe is the European resource for the collection, organization and dissemination of data on biological macromolecular structures.
PDBe also works actively with the X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy and cryo-Electron Microscopy (EM) communities [55, 56, 57, 61]. III. EVALUATION METRICS A. HP Energy Model The HP energy model is based on the hydrophobicity of the amino acids. In the HP model, when two non-consecutive hydrophobic amino acids become topologically neighbours, they release a certain amount of energy, which for simplicity is shown as −1. The total free-energy E of a conformation, based on the HP model, becomes the sum of the energy released by all pairs of non-consecutive hydrophobic amino acids [15]. g) Protein Quaternary Structure Database (PQS): The Protein Quaternary Structure file server (PQS) is an internet resource that makes available coordinates for likely quaternary states for structures contained in the Brookhaven Protein Data Bank that were determined by Xray crystallography [55, 61]. h) Homology-derived Structures of Proteins (HSSP): HSSP is a derived database that merges structural and sequence protein information. Proteins commencing the Protein Data Bank are correlated with sequence homologues which share the same 3D structures [61]. (1) Here, cij = 1 if ith and jth amino acids are nonconsecutive in the sequence but are neighbours on the lattice, otherwise 0; and eij = −1 if ith and jth amino acids are both hydrophobic, otherwise 0. i) Research Collaboratory for Structural Bioinformatics (RCSB): The Research Collaboratory for Structural Bioinformatics (RCSB) is a non-profit consortium enthusiastic to improving the understanding of the function of biological systems through the study of the 3-D structure of biological macromolecules [56, 61]. j) Protein Data Bank Japan (PDBj): PDBj (Protein Data Bank Japan) maintains a centralized PDB archieve of macromolecular structures and provides integrated tools, in alliance with the RSCB and PDBe in EU. PDBj is supported by JST-NBDC and Osaka University [54, 56]. B. 
Free Energy The most popular lattice model is HP lattice model. The HP model has 2 bead types. The black beads denote the hydrophobic amino acid and white beads denotes the hydrophilic. The dotted line denotes the H-H contacts in the conformation. The free energy is minimum, the number of H-H contacts is maximum [1]. The assigned free energy value is -1.The optimal conformation in the HP model (Fig 9) has the maximum number of H-H contacts which gives the lowest energy value. The free energy for the protein can be intended by, k) OCA: OCA is a browser database for protein structure/function. The OCA integrates information from from Kyoto Encyclopedia of Genes and Genomes or K.E.G.G., as it is commonly called; a collection of online database dealing with genomes and biological chemicals OMIM, PDBselect, Pfam, PubMed etc [57, 61]. (2) (3) where the parameter l) TOPSAN: The TOPSAN project was residential to collect, share, and dispense information about protein 3D structures [57]. (4 ) 123 IRACST - International Journal of Computer Science and Information Technology & Security (IJCSITS), ISSN: 2249-9555 Vol. 4, No.5, October 2014 Hence, the protein folding problem can be transformed into an optimization problem, i.e., to calculate the minimal free energy of the protein folding conformation. HP sequence s=s2, s2, , , , , sn, find an energy conformation of s; to find c* such that E(c*)=min{E(c)|c , where C(s) is the set of valid conformations. The minimum free energy function of the 2D HP lattice [42] model with calculation conditions as follows: an orthogonal array and instead use the signal signal-tonoise ratio as the mainly import valuation criteria. D. Measure of prediction accuracy: Root Mean Square Deviation (RMSD) measures the average distance between corresponding atoms after the predicted and the real [42, 62] structure have been optimally super imposed on each other. 
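The contact-based HP energy described in this section (a contribution of −1 for every pair of hydrophobic residues that are lattice neighbours but not consecutive in the sequence) can be illustrated with a short Python sketch. This is a minimal illustration, not the implementation used in any of the cited works. It assumes a 2D square lattice and a conformation encoded as a string of moves (U, D, L, R), and it finds the minimum by exhaustive enumeration, which is practical only for very short sequences. The names `fold`, `hp_energy` and `min_energy` are hypothetical.

```python
from itertools import product

# Assumed encoding: one unit step on the 2D square lattice per move letter.
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold(moves):
    """Convert a move string into lattice coordinates.

    Returns None when the walk visits a lattice site twice,
    i.e. when the conformation is not valid.
    """
    coords = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        nxt = (coords[-1][0] + dx, coords[-1][1] + dy)
        if nxt in coords:
            return None
        coords.append(nxt)
    return coords


def hp_energy(seq, coords):
    """Sum c_ij * e_ij over all non-consecutive residue pairs i < j."""
    energy = 0
    n = len(seq)
    for i in range(n):
        for j in range(i + 2, n):  # j >= i + 2 skips consecutive residues
            (xi, yi), (xj, yj) = coords[i], coords[j]
            c_ij = 1 if abs(xi - xj) + abs(yi - yj) == 1 else 0
            e_ij = -1 if seq[i] == "H" and seq[j] == "H" else 0
            energy += c_ij * e_ij
    return energy


def min_energy(seq):
    """Brute-force search over all valid conformations of a short sequence."""
    best_e, best_coords = None, None
    for moves in product("UDLR", repeat=len(seq) - 1):
        coords = fold(moves)
        if coords is None:
            continue
        e = hp_energy(seq, coords)
        if best_e is None or e < best_e:
            best_e, best_coords = e, coords
    return best_e, best_coords
```

For instance, `hp_energy("HPPH", fold("RUL"))` folds the four residues around a unit square, so the first and last residues form a single H-H contact. The enumeration in `min_energy` grows as 4^(n−1), which is why heuristic search methods are used for sequences of realistic length.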
Fig.9 An optimal conformation for the sequence “(HP)2PH(HP)2(PH)2HP(PH)2” in the 2D HP lattice model [1]

C. Signal to Noise Ratio

The signal-to-noise ratio (SNR) is a quality index that is used in the communications industry to evaluate communications systems [2]. The SNR is an index of robustness; it measures the quality of energy transformation. Depending on the type of characteristic, the SNR has several categories: lower is better (LB), nominal is best (NB), and higher is better (HB). In the standard Taguchi formulation, the equations for calculating the SNR ($\eta$) for the LB and HB characteristics are:

(i) Lower is Better (LB):

$$\eta = -10 \log_{10}\left(\frac{1}{m}\sum_{t=1}^{m} y_t^{2}\right) \qquad (6)$$

(ii) Higher is Better (HB):

$$\eta = -10 \log_{10}\left(\frac{1}{m}\sum_{t=1}^{m} \frac{1}{y_t^{2}}\right) \qquad (7)$$

where $y_1, \ldots, y_m$ are the m observed values of the quality characteristic. An orthogonal array is used for optimization, i.e., to maximize the signal-to-noise ratio. It is necessary to use an orthogonal array and to use the signal-to-noise ratio as the main evaluation criterion.

D. Measure of prediction accuracy

Root Mean Square Deviation (RMSD) measures the average distance between corresponding atoms after the predicted and the real structures have been optimally superimposed on each other [42, 62]. The formula is given by

$$\mathrm{RMSD}(a, b) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \lVert r_{ai} - r_{bi} \rVert^{2}} \qquad (8)$$

where n is the number of corresponding atoms (the length of the protein sequence when each residue is represented by a single atom), and $r_{ai}$ and $r_{bi}$ are the positions of atom i in structures a and b, respectively.

IV. CONCLUSION

The aim of the protein structure prediction problem is to determine the structure from a given amino acid sequence. This paper has reviewed many of the evolutionary algorithms that are used to predict the structure, and has listed the protein databases and tools. From a protein database, one can easily find a particular protein ID and the information about that specific protein. The tools are used to predict the secondary structure, i.e., the alpha, turn and coil values. Finally, the performance measures for evaluating the algorithms have been presented.

REFERENCES:

1. Cheng-Jian Lin and Shih-Chieh Su, “Using An Efficient Artificial Bee Colony Algorithm For Protein Structure Prediction On Lattice Models”, International Journal of Innovative Computing, Information and Control, ICIC International © 2012, ISSN 1349-4198, Volume 8, Number 3(B).
2. Cheng-Jian Lin, Ming-Hua Hsieh, “An efficient hybrid Taguchi-genetic algorithm for protein folding simulation”, Expert Systems with Applications (2009) 36, 12446–12453.
3. Jacek Blazewicz, Ken Dill, Piotr Lukasiak and Maciej Milostan, “A Tabu Search Strategy For Finding Low Energy Structures Of Proteins In HP-Model”, Computational Methods in Science and Technology (2004), 10, 7-19.
4. Jingfa Liu, Yuanyuan Sun, Gang Li, Beibei Song, Weibo Huang, “Heuristic-based tabu search algorithm for folding two-dimensional AB off-lattice model proteins”, Computational Biology and Chemistry (2013) 47, 142–148.
5. Jianlin Cheng, Allison N. Tegge, and Pierre Baldi, “Machine Learning Methods for Protein Structure Prediction”, IEEE Reviews in Biomedical Engineering (2008) Vol. 1.
6. Pauling, L., and Corey, R. B., “The pleated sheet, a new layer configuration of the polypeptide chain”, Proc. Nat. Acad. Sci. (1951) 37, pp. 251–256.
7. Ashish Ghosh, Bijnan Parai, “Protein secondary structure prediction using distance based classifiers”, International Journal of Approximate Reasoning (2008), 47, 37–44, doi:10.1016/j.ijar.2007.03.007.
8. Pauling, L., Corey, R. B., and Branson, H. R., “The structure of proteins: Two hydrogen bonded helical configurations of the polypeptide chain”, Proc. Nat. Acad. Sci. (1951) Vol 37, pp. 205–211.
9. Howard Holley, L., and Martin Karplus, “Protein secondary structure prediction with a neural network”, Proc. Natl. Acad. Sci. USA (1989), Vol. 86, pp. 152-156, Biophysics.
10. Ning Qian and Terrence J. Sejnowski, “Predicting the Secondary Structure of Globular Proteins Using Neural Network Models”, J. Mol. Biol. (1988), 202, 865-884.
11. Richardson, J. S., “The Anatomy and Taxonomy of Protein Structure”, Adv. in Prot. Chem., 34, 167-339. (Tertiary Structure Used)
12. Kendrew, C., Dickerson, Strandberg, B. E., Hart, R. J., Davies, D. R., Phillips, D. C., and Shore, V. C., “Structure of myoglobin: A three-dimensional Fourier synthesis at 2 Å resolution”, Nature (1960), vol. 185, pp. 422–427.
13. file:///F:/charcteristic/Protein%20Structure%20%20Primary,%20Secondary,%20Tertiary,%20Quatemary%20Structures.htm
14. Ivan Kondov, “Protein structure prediction using distributed parallel particle swarm optimization”, Nat Comput (2013), 12:29–41, DOI 10.1007/s11047-012-9325-x.
15. Mahmood A Rashid, Md Tamjidul Hoque, Hakim Newton M.A., Duc Nghia Pham, Abdul Sattar, “A New Genetic Algorithm for Simplified Protein Structure Prediction”, Copyright (2012) Springer, Berlin/Heidelberg, doi.org/10.1007/978-3-642-35101-3_10.
16. Dill, K. A., “Theory for the Folding and Stability of Globular Proteins”, Biochemistry, 24(6), March (1985), pp. 1501–1509.
17. Mahmood A. Rashid, Hakim Newton, M. A., Md. Tamjidul Hoque, and Abdul Sattar, “Mixing Energy Models in Genetic Algorithms for On-Lattice Protein Structure Prediction”, Hindawi Publishing Corporation, BioMed Research International, Volume (2013), Article ID 924137, 15 pages, http://dx.doi.org/10.1155/2013/924137.
18. Berger, B., Leighton, T., “Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete”, J. Comp. Biol. (1998) V5, N1, pp. 27-40.
19. Alena Shmygelska and Holger H Hoos, “An ant colony optimization algorithm for the 2D and 3D hydrophobic polar protein folding problem”, BMC Bioinformatics (2005), doi:10.1186/1471-2105-6-30.
20. Xiaolong Zhang, Ting Wang, Huiping Luo, Jack Y Yang, Youping Deng, Jinshan Tang, Mary Qu Yang, “3D Protein structure prediction with genetic tabu search algorithm”, BMC Systems Biology (2010), 4(Suppl 1):S6, http://www.biomedcentral.com/1752-0509/4/S1/S6.
21. Chou, P. Y., and Fasman, G. D., “Conformational Parameters for Amino Acids in Helical, β-Sheet, and Random Coil Regions Calculated from Proteins”, Biochemistry (1974), 13(2), 211-222.
22. Chou, P. Y., and Fasman, G. D., “The Chou-Fasman Method for Secondary Structure Prediction”, Prediction of protein conformation, Biochemistry 13(2), 222-45 (1974), Protein Physics SI2700 - Spring 2012.
23. Jean Garnier, Jean-François Gibrat, and Barry Robson, “GOR Method for Predicting Protein Secondary Structure from Amino Acid Sequence”, Methods in Enzymology, Vol. 266.
24. Taner Z. Sen, Robert L. Jernigan, Jean Garnier and Andrzej Kloczkowski, “GOR V server for protein secondary structure prediction”, Bioinformatics, Applications Note (2005) Vol. 21 no. 11, pages 2787–2788, doi:10.1093/bioinformatics/bti408.
25. Kloczkowski, A., Ting, K-L., Jernigan, R. L., and Garnier, J., “Information for Protein Secondary Structure Prediction From Amino Acid Sequence”, Proteins: Structure, Function, and Genetics (2002) 49:154–166.
26. Jacek Błażewicz, Piotr Łukasiak and Szymon Wilk, “New machine learning methods for prediction of protein secondary structures”, Control and Cybernetics, vol. 36 (2007) No. 1.
27. Ward, J. J., McGuffin, L. J., Buxton, B. F., and Jones, D. T., “Secondary structure prediction with support vector machines”, Bioinformatics (2003) Vol. 19 no. 13, pages 1650–1655, DOI: 10.1093/bioinformatics/btg223.
28. Minh N. Nguyen, Jagath C. Rajapakse, “Multi-Class Support Vector Machines for Protein Secondary Structure Prediction”, Genome Informatics (2003) 14: 218–227.
29. Long-Hui Wang, Juan Liu, “Predicting Protein Secondary Structure by a Support Vector Machine Based on a New Coding Scheme”, Genome Informatics (2004) 15(2): 181–190.
30. Hae-Jin Hu, Yi Pan, Robert Harrison, and Phang C. Tai, “Improved Protein Secondary Structure Prediction Using Support Vector Machine With a New Encoding Scheme and an Advanced Tertiary Classifier”, IEEE Transactions on NanoBioscience, December (2004) Vol. 3, No. 4, 265.
31. Blaise Gassend, Charles W. O'Donnell, William Thies, Andrew Lee, Marten van Dijk, and Srinivas Devadas, “Predicting Secondary Structure of All-Helical Proteins Using Hidden Markov Support Vector Machines”, Springer-Verlag Berlin Heidelberg (2006), pp. 93-104.
32. Sujun Hua and Zhirong Sun, “A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: Support Vector Machine Approach”, J. Mol. Biol. (2001) 308, 397-407, doi:10.1006/jmbi.2001.4580.
33. Pierre Baldi and Gianluca Pollastri, “The Principled Design of Large-Scale Recursive Neural Network Architectures – DAG-RNNs and the Protein Structure Prediction Problem”, Journal of Machine Learning Research 4 (2003) 575-602, Submitted 2/02; Revised 4/03; Published 9/03.
34. Armando Blanco, David A. Pelta, José-L. Verdegay, “Applying a Fuzzy Sets-based Heuristic to the Protein Structure Prediction Problem”, International Journal of Intelligent Systems (2002), Vol. 17, 629–643, DOI: 10.1002/int.10042.
35. Rajkumar Bondugula, Ognen Duzlevski, and Dong Xu, “Profiles And Fuzzy K-Nearest Neighbor Algorithm For Protein Secondary Structure Prediction”, in Proc. of the Third Asia Pacific Bioinformatics Conference, 2005.
36. Seung-Yeon Kim, Jaehyun Sim, and Julian Lee, “Fuzzy k-Nearest Neighbor Method for Protein Secondary Structure Prediction and Its Parallel Implementation”, in D.-S. Huang, K. Li, and G. W. Irwin (Eds.), ICIC 2006, LNBI 4115, pp. 444–453, 2006, Springer-Verlag Berlin Heidelberg.
37. Jyh-Shing Roger Jang, “ANFIS: Adaptive-network-based fuzzy inference system”, IEEE Transactions on Systems, Man and Cybernetics, 23 (ISSN 0018-9472): 665–685, 1993.
38. Subhendu Bhusan Rout, Satchidananda Dehury, Bhabani Sankar Prasad Mishra, “Protein Structure Prediction using Genetic Algorithm”, IJCSMC, Vol. 2, Issue 6, June 2013, pg. 187–192.
39. Camelia Chira, Dragos Horvath, Dumitru Dumitrescu, in Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, Lecture Notes in Computer Science, Volume 6023 (2010), pp. 38-49.
40. Tantar, A., Melab, N., Talbi, G., Parent, B., Horvath, D., “A parallel hybrid genetic algorithm for protein structure prediction on the computational grid”, Future Generation Computer Systems 23 (2007) 398–409.
41. Thang N. Bui and Gnanasekaran Sundarraj, “An Efficient Genetic Algorithm for Predicting Protein Tertiary Structures in the 2D HP Model”, GECCO ’05 Proceedings of the 7th annual conference on Genetic and Evolutionary Computation, pages 385-392, ISBN 1-59593-010-8, doi:10.1145/1068009.1068072.
42. Trent Higgs, Bela Stantic, Md Tamjidul Hoque and Abdul Sattar, “Genetic Algorithm Feature-Based Resampling for Protein Structure Prediction”, WCCI 2010 IEEE World Congress on Computational Intelligence, July 18-23 (2010), CCIB, Barcelona, Spain.
43. Karaboga, N., Cetinkaya, M. B., “A novel and efficient algorithm for adaptive filtering: Artificial bee colony algorithm”, Turk J Electr Eng Comput Sci 19 (2011) (1):175–190.
44. Sree, P. K., Babu, I. R., Devi, N. S., “Investigating an Artificial Immune System to strengthen protein structure prediction and protein coding region identification using the cellular automata classifier”, Int J Bioinform Res Appl (2009); 5(6):647-62.
45. Zhexin Xiang, “Advance in protein homology modeling”, Curr Protein Pept Sci (2006) June; 7(3):217-227.
46. David Baker and Andrej Sali, “Protein structure prediction and structural Genomics”, Science (2001) 294(5540):93–96.
47. Jooyoung Lee, Sitao Wu, and Yang Zhang, “Ab Initio Protein Structure Prediction”, © Springer Science + Business Media B.V. (2009).
48. Meissner, M., and Schneider, G., “Protein Folding Simulation by Particle Swarm Optimization”, The Open Structural Biology Journal (2007) 1, 1-6.
49. C. A. Floudas, “Computational Methods in Protein Structure Prediction”, Biotechnol. Bioeng. (2007), 97: 207–213, Wiley Periodicals, Inc.
50. Stefka Fidanova, Ivan Lirkov, “Ant Colony System Approach for Protein Folding”, Proceedings of the International Multiconference on Computer Science and Information Technology, pp. 887–891, ISBN 978-83-60810-14-9, ISSN 1896-7094.
51. Vargas Benitez, C., and Lopes, H., “Parallel artificial bee colony algorithm approaches for protein structure prediction using the 3dhp-sc model”, Intelligent Distributed Computing, 4 (2010) 255-264.
52. Nashat Mansour, Fatima Kanj, Hassan Khachfe, “Particle swarm optimization approach for protein structure prediction in the 3D HP model”, Interdisciplinary Sciences: Computational Life Sciences, September (2012), Volume 4, Issue 3, pp. 190-200.
53. Xin Chen, Mingwei Lv, Lihui Zhao and Xudong Zhang, “An Improved Particle Swarm Optimization for Protein Folding Prediction”, I.J. Information Engineering and Electronic Business (2011) 1, 1-8.
54. http://www.bioinformaticsweb.net/datalink.html
55. http://www.science.co.il/Biomedical/Structure-Databases.asp
56. http://en.wikipedia.org/wiki/List_of_biological_databases#Protein_structure_databases
57. file:///F:/algorithms/extra/Protein%20structure%20database%20-%20Wikipedia,%20the%20free%20encyclopedia.htm
58. http://scop.mrc-lmb.cam.ac.uk.
59. http://www.bioinformaticsweb.net/data.html
60. file:///F:/Untitled%20Document.htm
61. file:///F:/allover/algorithms/extra/PDBsum%20entry%20%201g8p.htm
62. Fogel, G. B., and Corne, D. W., “Evolutionary Computation in Bioinformatics”, Elsevier, 2003.