A simple and fast secondary structure prediction method with hidden neural networks
Authors: Kuang Lin, Victor A. Simossis, William R. Taylor and Jaap Heringa
Presented by: Rashmi Singh, Sowmya Tirukkovaluru

Motivation
• We introduce a secondary structure prediction method called YASPIN.
• YASPIN uses a single neural network to predict secondary structure elements in a 7-state local structure scheme (Hb, H, He, C, Eb, E and Ee).
• It optimizes the output using a hidden Markov model.

Results
• YASPIN was compared with current secondary structure prediction methods such as PHDpsi, PROFsec, JNET, SSPro2 and PSIPRED.
• YASPIN shows the highest accuracy in terms of Q3 and SOV scores for strand prediction.

HISTORY
• Qian and Sejnowski (1988) introduced one of the earliest artificial neural network based methods, which used a single sequence as input.
• From the 1990s up to the present, secondary structure prediction accuracy has improved to over 70% by using the evolutionary information found in multiple sequence alignments (MSAs).
• Unlike single-sequence input, an MSA offers a much improved means to recognise positional physicochemical features such as hydrophobicity patterns.
• However, the improvement in prediction accuracy obtained from MSAs is also directly connected to database size and search accuracy.

PSI-BLAST
• The top-performing methods to date (PHD, PHDpsi, PROFsec, SSPro2, JNET, PSIPRED) use various types of neural networks for prediction and employ the database-searching tool PSI-BLAST with the same search parameters.
• In YASPIN, all sequences were sequentially used as queries in a PSI-BLAST search.
• PSI-BLAST (Position-Specific Iterated BLAST) generates a profile of homologous sequences in the form of a position-specific scoring matrix (PSSM).
• All secondary structure prediction methods involved were tested on the same PSI-BLAST results to make the comparison as unbiased as possible.

YASPIN
• Uses a single neural network, in contrast with many other NN-based methods, which use several feed-forward multilayer perceptron networks trained with the back-propagation algorithm.
• Back-propagation algorithm:
  1) Feed forward from input to output.
  2) Calculate and back-propagate the error (the difference between the network output and the target output).
  3) Adjust the weights in the steepest-descent direction along the error surface to decrease the error:
     w(t+1) = w(t) - r * [∂E/∂w] + μ * Δw(t-1)
     where r is a positive constant called the learning rate and μ is the momentum term.

YASPIN
• The problem with a single-NN method is that its predictions contain broken secondary structure elements, even elements of only one residue.
• This is not desirable, as most observed secondary structure elements are composed of more than three residues.
• YASPIN overcomes this problem by using a hidden Markov model to filter the secondary structure elements predicted by the NN.
• Finally, the prediction results are converted into 3-state secondary structure predictions (H: helix, E: strand, '-': other).

Algorithm
• YASPIN uses a single feed-forward perceptron network with one hidden layer.
• The YASPIN NN uses the softmax output function with a window of 15 residues.
• [Figure: schematic of a perceptron unit with input links x1…xn, weights w1…wn, an input function and an activation function.]
  Input function: A_k = Σ_j w_kj x_j
  Softmax output: o_k = exp(A_k) / Σ_k' exp(A_k'), where o_k is the output of the kth output unit.

Algorithm
• For each residue in the window, 20 units are used for the scores in the PSSM (in a PSSM, each column is a vector of 20 values specifying the frequencies of the 20 amino acids at that column of the MSA) and 1 unit marks where the window spans the terminals of the protein chain.
• The input layer therefore has 21 * 15 = 315 units.
• The hidden layer has 15 units.
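The network shape described above (a 15-residue window with 21 input units per residue, 15 hidden units and a softmax over the local structure states) can be sketched in a few lines. This is a minimal illustration, not the published model: the sigmoid hidden activation and the random weights are assumptions, and the momentum update at the end just restates the slide's weight-update formula.

```python
import numpy as np

rng = np.random.default_rng(0)

WINDOW, UNITS_PER_POS = 15, 21                       # 20 PSSM scores + 1 terminal marker
N_IN, N_HID, N_OUT = WINDOW * UNITS_PER_POS, 15, 7   # 315 -> 15 -> 7 units

def softmax(a):
    # o_k = exp(A_k) / sum_k' exp(A_k'), shifted by max(a) for numerical stability
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """One forward pass: 315 inputs -> 15 hidden units -> 7-state softmax."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # hidden layer (sigmoid assumed)
    return softmax(W2 @ h + b2)                # probabilities over the 7 states

# hypothetical weights and one synthetic input window
W1, b1 = rng.normal(0.0, 0.1, (N_HID, N_IN)), np.zeros(N_HID)
W2, b2 = rng.normal(0.0, 0.1, (N_OUT, N_HID)), np.zeros(N_OUT)
x = rng.normal(size=N_IN)

probs = forward(x, W1, b1, W2, b2)             # sums to 1 across the 7 states

# Weight update with momentum, as in the slides:
#   w(t+1) = w(t) - r * dE/dw + mu * dw(t-1)
r, mu = 1e-4, 0.5                              # learning rate, momentum term
grad, prev_step = rng.normal(size=W2.shape), np.zeros(W2.shape)
step = -r * grad + mu * prev_step
W2 = W2 + step
```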
• The output layer has 7 units corresponding to the seven local structure states: helix beginning (Hb), helix (H), helix ending (He), strand beginning (Eb), strand (E), strand ending (Ee) and coil (C).

State Diagram
• The 7-state output of the NN is passed through an HMM.
• The HMM uses the Viterbi algorithm to optimally segment the 7-state predictions.

Test and training datasets
• YASPIN was trained and tested using the SCOP 1.65 (Structural Classification of Proteins) database.
• A PDB25 set was constructed in which the maximal pairwise identity was limited to 25%, making it a non-redundant set.
• All transmembrane entries (which lack the required structural information) were removed from PDB25, resulting in a set of 4256 proteins with known structures.
• The test and training sets were built from the PDB25 set, with sequences grouped into superfamilies using ASTRAL.

PDB25 test set
• TEST SET: the test set was extracted before training by random selection from the complete PDB25 set at a ratio of about 1:8.
• The 535 sequences selected for the test set were at most 25% identical to the training set, due to the nature of the PDB25 dataset.
• They were not part of the same superfamily as any of the remaining 3721 sequences of the training set.

EVA5 sequence set
• EVA5 is the independent 'common_set 5' dataset from EVA, consisting of 217 sequences.
• To make a more accurate comparison between all methods, including YASPIN, we further benchmarked all methods on EVA5.
• The final YASPIN training set contained 3553 sequences with known structures, after removing all sequences found in the EVA5 sequence set.

NN Training
• The on-line back-propagation algorithm and 6-fold cross-validation were used to train the YASPIN neural network.
• In a single iteration, each of the 6 subsets was used in turn for testing while the remaining 5 were used for training the NN.
• At the end of each iteration, the average prediction error of the networks over all six test subsets was recorded.
• The training was stopped when the average prediction error started increasing.
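The Viterbi segmentation step above can be sketched as follows. The transition matrix here is an illustrative assumption: it simply encodes the begin/interior/end grammar (e.g. every helix must pass through Hb, H and He), not the transition probabilities YASPIN actually learned, and the emission matrix stands in for the NN's per-residue 7-state probabilities.

```python
import numpy as np

STATES = ["Hb", "H", "He", "Eb", "E", "Ee", "C"]
idx = {s: i for i, s in enumerate(STATES)}

def viterbi(emission, trans, init):
    """Most likely state path given per-residue state probabilities (emission),
    a 7x7 transition matrix and an initial distribution; computed in log space."""
    L, K = emission.shape
    logT = np.log(trans + 1e-12)
    v = np.log(init + 1e-12) + np.log(emission[0] + 1e-12)
    back = np.zeros((L, K), dtype=int)
    for t in range(1, L):
        scores = v[:, None] + logT              # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        v = scores.max(axis=0) + np.log(emission[t] + 1e-12)
    path = [int(v.argmax())]
    for t in range(L - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [STATES[k] for k in reversed(path)]

# Hypothetical grammar: only these transitions are allowed, so predicted
# helices/strands are at least 3 residues long (begin -> interior -> end).
T = np.zeros((7, 7))
for a, b in [("C", "C"), ("C", "Hb"), ("C", "Eb"),
             ("Hb", "H"), ("H", "H"), ("H", "He"), ("He", "C"),
             ("Eb", "E"), ("E", "E"), ("E", "Ee"), ("Ee", "C")]:
    T[idx[a], idx[b]] = 1.0
T /= T.sum(axis=1, keepdims=True)

# Synthetic NN output: the middle residues strongly favour the helix state.
L = 9
E = np.full((L, 7), 0.02)
for t in range(2, 7):
    E[t, idx["H"]] = 0.8
E /= E.sum(axis=1, keepdims=True)

path = viterbi(E, T, init=np.eye(7)[idx["C"]])
```

Because the transition matrix zeroes out illegal moves, the Viterbi path can never emit an isolated one-residue helix or strand, which is exactly the filtering role the HMM plays after the NN.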
• A momentum term of 0.5 and a learning rate of 0.0001 were used.

HMM Training
• The secondary structure states used to train the HMM were obtained using DSSP.
• The DSSP 8-state secondary structure representation (H, G, E, B, I, S, T, -) was grouped according to the 3-state scheme: H and G as helix (H), E and B as strand (E), and the others as coil (C).
• These 3-state definitions were then converted into the 7-state local structure scheme.

Reliability scores
• YASPIN provides four different position-specific prediction confidence scores, generated from the NN-predicted probabilities of each residue being in one of the 7 states.
• The first three scores are secondary-structure-specific scores, representing helix, strand and coil prediction confidence. Each is generated as the sum of the probabilities of the respective secondary structure type and normalized to a 0-9 scale.
• The fourth score is the position-specific prediction confidence, which is the score of the state the Viterbi algorithm has chosen in its optimal segmentation path.

Prediction accuracy scores
• Q3: Q_i = (number of residues correctly predicted in state i / number of residues observed in state i) * 100
• SOV: the segment overlap measure for a single conformational state i:
  SOV(i) = 100 * (1/N(i)) * Σ_{S(i)} [ (MINOV(S1,S2) + DELTA(S1,S2)) / MAXOV(S1,S2) ] * LEN(S1)
  where the sum runs over all pairs of overlapping observed/predicted segments (S1, S2) in state i.
• MCC: Matthews correlation coefficient.

Testing with the PDB25 test set
• The 535 sequences of the PDB25 test set were used to compare YASPIN to the current top-performing methods.
• 409 sequences were found to be common to all methods.
• This comparison was relatively unfair to YASPIN, since many of these state-of-the-art methods had used sequences from this test set for their training.
• YASPIN is the best at strand prediction and also outperforms most methods at helix prediction, except SSPro2 and PSIPRED, which are clearly superior to YASPIN.
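The DSSP 8-state grouping and the per-state Q_i measure defined above can be sketched directly; the example sequences below are made up for illustration.

```python
# DSSP 8-state -> 3-state grouping as described in HMM Training:
# H, G -> helix (H); E, B -> strand (E); everything else -> coil (C).
DSSP_TO_3 = {"H": "H", "G": "H", "E": "E", "B": "E",
             "I": "C", "S": "C", "T": "C", "-": "C"}

def q_state(predicted, observed, state):
    """Q_i = residues correctly predicted in state i / residues observed in i * 100."""
    obs_idx = [k for k, o in enumerate(observed) if o == state]
    hits = sum(predicted[k] == state for k in obs_idx)
    return 100.0 * hits / len(obs_idx)

def q3(predicted, observed):
    """Overall Q3: percentage of residues whose 3-state label is correct."""
    hits = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)

observed = [DSSP_TO_3[c] for c in "HHHHGGTT-EEEB"]   # DSSP string -> 3-state labels
predicted = list("HHHHHHCCCEEEC")                    # hypothetical prediction

print(round(q3(predicted, observed), 1))             # 12 of 13 residues correct -> 92.3
print(round(q_state(predicted, observed, "E"), 1))   # 3 of 4 observed strand residues -> 75.0
```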
[Table: the average Q3 and SOV scores for the predictions of the 409 PDB25 common sequences from the test set.]

Testing with the EVA5 test set
• EVA5 sequences were removed from the YASPIN training set, as they were from the other methods' training sets.
• Of the 217 sequences in the EVA5 test set, 188 were found to be common to all methods.
• The Q3 prediction accuracy results for the separate SSEs showed that PSIPRED and SSPro2 were best at helix prediction, while YASPIN was better than all the remaining methods.
• In addition, YASPIN was better at strand prediction.

[Table: the average Q3 and SOV scores for the predictions of the 188 common sequences from the EVA5 set. (a) Q3, (b) SOV.]

Calculating prediction errors
Prediction errors for helix and strand are classified into four types:
1) Wrong prediction (w)
2) Over-prediction (o)
3) Under-prediction (u)
4) Length errors (l)

Different error types and MCCs for YASPIN
Investigation of the errors made by all methods on the EVA5 test set showed that:
1) All methods miss strand segments at roughly the same rate (Eu).
2) PSIPRED and SSPro2 more frequently over-predict helix segments, while the rest under-predict them (Ho/Hu).
3) YASPIN mistakes helices for strands (Hw), over-elongates strand segments (Elo) and keeps helices too short (Hlu).
The MCCs showed that YASPIN and PROFsec are equivalent in prediction quality, which suggests that the prediction error types made by each method are not accurately reflected in the Q3 and SOV scores.

[Table: the extent of errors made by each method on the independent EVA5 common dataset.]
Legend: H/Ew: wrong prediction; H/Eo: over-prediction; H/Eu: under-prediction; H/Elo: length over-prediction; H/Elu: length under-prediction.

YASPIN position-specific reliability measures
• The reliability scoring scheme applied in YASPIN correlated well with Q3.
• The relationship between the assigned reliability scores and their corresponding average prediction accuracy (Q3) was almost linear.
• This means the YASPIN confidence scoring scheme accurately describes the reliability of each prediction.
• Approximately 48% of the predicted residues had a confidence value of 5 or greater; of these, 90% were accurately predicted.

[Figure: Q3 and the percentage of residues plotted against the cumulative reliability index (reliability greater than or equal to the value shown) from the YASPIN method.]

DISCUSSION
• The difference between YASPIN and other classical NN-based programs is the hidden neural network (HNN) model.
• In YASPIN, the NN and HMM components of the HNN model were trained separately, while in other approaches they are trained in combination.
• In YASPIN, the initial predictions are 7-state predictions of protein local structure, instead of the commonly used 3 states.
• This matters because the termini of SSEs, especially helices, have a statistically significantly different composition from other parts of the protein sequence.
• Hence, the network used in YASPIN is trained to capture these differences and provide additional information by producing these 7-state predictions.

DISCUSSION
• A problem with our model is that the strands predicted by YASPIN must be at least 3 residues long, whereas by the DSSP definition β-bridges can consist of a single residue.
• To overcome this problem, two different Markov models were designed, each having fewer strand states than the three currently used (Eb, E and Ee), but this decreased the prediction accuracy.
• This suggests that the sequence signal of the strand termini is important for the prediction.