BIOINFORMATICS
A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation
Wang Honggang (王红刚) 14S051054

Outline
- Introduction
- Materials and Methods
- Results and Discussion
- Conclusions

Introduction
Importance
- Most important tasks in computational biology
- Fill in this gap
- The detection of homologies with low sequence identity remains a challenging problem
Some solutions
- The general sequence comparison methods
- Thread the query sequence onto the template structures
- Improve prediction performance by either incorporating new features or developing novel algorithms

Introduction
Problem
- Traditional sequence comparison methods fail to identify reliable homologies with low sequence identity
- The taxonomic methods are effective alternatives, but their prediction accuracies are around 70%, which is still relatively low for practical usage
Authors' solution
- Protein sequences have a univariate direction from beginning to end, which is analogous to time sequences of process data
- Autocross-covariance (ACC) transformation
- SVM

Introduction
Pipeline: each sequence -> PSI-BLAST -> PSSM -> ACC -> fixed-length vector -> SVM -> classification results
- PSSM: position-specific score matrices
- The element S_{i,j} in the matrix reflects the probability of amino acid i occurring at position j
Feature
- Only the evolutionary information, represented in the form of PSSM, is used
- It alone can achieve promising results

Materials and methods
To evaluate the proposed method and compare it with existing methods, five datasets are used here:
- the D-B dataset
- the extended D-B dataset
- the F86 dataset
- the F199 dataset
- the Lindahl dataset

The D-B dataset:
- 311 proteins for training, 383 proteins for testing
- <40% identity
- Each fold has at least seven members
- <35% identity in the training set
- Classes: all α, all β, α/β, α+β and small proteins

The extended D-B dataset:
- 27 folds
- <40% identity
- Contains 3202 sequences
[Table: the fold names and the number of proteins]

F86 and F199:
- The F86 dataset contains 86 folds and 5671 sequences; each fold has at least 25 members
- The F199 dataset contains 199 folds and 7437 sequences; each fold has at least 10 members

The Lindahl dataset:
- Used as a benchmark to compare the taxonomic fold recognition methods with the threading methods
- 976 sequences in this dataset, identity <40%

ACC transformation
- ACC can transform PSSMs of different lengths into fixed-length vectors by measuring the correlation between any two properties
- ACC results in two kinds of variables:
  - auto-covariance (AC): between the same property
  - cross-covariance (CC): between two different properties

AC variable: the correlation of the same property between two residues separated by a distance of lg along the sequence, which can be calculated as:

$$AC(i, lg) = \frac{1}{L - lg} \sum_{j=1}^{L-lg} (S_{i,j} - \bar{S}_i)(S_{i,j+lg} - \bar{S}_i)$$

- i: one of the residues (amino acids)
- L: the length of the protein sequence
- S_{i,j}: the PSSM score of amino acid i at position j
- \bar{S}_i: the average score for amino acid i along the whole sequence, $\bar{S}_i = \frac{1}{L} \sum_{j=1}^{L} S_{i,j}$
In this way, the number of AC variables is 20∗LG, where LG is the maximum of lg (lg = 1, 2, ..., LG).

CC variable: measures the correlation of two different properties between two residues separated by lg along the sequence:

$$CC(i_1, i_2, lg) = \frac{1}{L - lg} \sum_{j=1}^{L-lg} (S_{i_1,j} - \bar{S}_{i_1})(S_{i_2,j+lg} - \bar{S}_{i_2})$$

- i_1, i_2: two different amino acids
- The total number of CC variables: 380∗LG
Each protein sequence is represented as a vector of either AC variables or ACC variables, where ACC is the combination of AC and CC.

Materials and methods
- Support vector machine
- Performance metrics: the overall accuracy (Q), sensitivity (Sn) and specificity (Sp)

RESULTS AND DISCUSSIONS
- The impact of LG
- Performance comparison with existing taxonomy-based methods
- Performance comparison with threading methods

The impact of LG
- The maximum value of LG is the length of the shortest sequence minus one
- D-B dataset: the optimal values of LG for AC and
- Extended dataset: the best value of LG is 10
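The AC and CC definitions from the ACC transformation slides can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function name `acc_features`, the row-per-position PSSM layout and the ordering of variables in the output vector are all assumptions made for this example.

```python
from typing import List, Sequence


def acc_features(pssm: Sequence[Sequence[float]], max_lag: int,
                 include_cc: bool = True) -> List[float]:
    """Turn an L x n score matrix into a fixed-length AC or ACC vector.

    Assumed layout: pssm[j][i] is the score S_{i,j} of property (amino acid)
    i at sequence position j. max_lag plays the role of LG.
    """
    length = len(pssm)
    if length < 2:
        raise ValueError("need at least two positions")
    if not 1 <= max_lag <= length - 1:
        raise ValueError("max_lag must be between 1 and L - 1")
    n_props = len(pssm[0])

    # Average score of each property along the whole sequence.
    means = [sum(row[i] for row in pssm) / length for i in range(n_props)]
    centered = [[row[i] - means[i] for i in range(n_props)] for row in pssm]

    def covariance(i1: int, i2: int, lag: int) -> float:
        # Correlation of property i1 at position j with property i2 at j + lag.
        total = sum(centered[j][i1] * centered[j + lag][i2]
                    for j in range(length - lag))
        return total / (length - lag)

    # AC variables: same property, n_props * max_lag values.
    features = [covariance(i, i, lag)
                for lag in range(1, max_lag + 1)
                for i in range(n_props)]

    if include_cc:
        # CC variables: ordered pairs of different properties,
        # n_props * (n_props - 1) * max_lag values.
        features += [covariance(i1, i2, lag)
                     for lag in range(1, max_lag + 1)
                     for i1 in range(n_props)
                     for i2 in range(n_props)
                     if i1 != i2]
    return features


if __name__ == "__main__":
    # Two made-up score matrices of different lengths (30 and 50 positions).
    short = [[(i * j) % 7 for i in range(20)] for j in range(30)]
    long_ = [[(i + 2 * j) % 5 for i in range(20)] for j in range(50)]
    print(len(acc_features(short, max_lag=10)),
          len(acc_features(long_, max_lag=10)))
```

With 20 amino acid columns, the vector length depends only on LG (20∗LG for AC, 20∗LG + 380∗LG for ACC), so sequences of different lengths map to vectors of the same size that can be passed to an SVM.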
Performance comparison with existing taxonomy-based methods

Results on the D-B dataset
- The detailed results are given in the Supplementary Material
- To give a more comprehensive comparison, we consider several other methods in the literature
- The proposed ACCFold method outperforms these methods by 2–14%

Results on the extended D-B dataset
- Extended D-B dataset: the same folds, more sequences (3202)
- All the methods improve
- ACCFold is higher than the other methods by 9–17%
- In particular, the performance on the folds in the α/β, α+β and small proteins classes is significantly improved

Results on the F86 and F199 datasets
- More folds: 86 folds, 199 folds
- Time complexity: SWPSSM: O(n^2 * L^2); ACC: O(n * L^2 + n^2 * L)
- The results indicate that the proposed method can be applied to cases with a large number of folds without significantly affecting its performance, as long as the number of samples in each fold is not too small

Performance comparison with threading methods
- Threading methods use the sequence–template alignments to detect the remote homologies of proteins

Results on the Lindahl dataset
- At the family level, we select the families that contain at least two samples

Performance comparison with threading methods
- Taxonomic methods are not as good as threading methods
- They are difficult to apply to practical fold recognition
- However, the total number of folds is limited while the number of proteins with known structure keeps increasing, so there is more space and chance to exploit taxonomic methods to develop an effective fold recognition system

Conclusions
- A method that combines SVM with ACC is introduced for taxonomic protein fold recognition
- ACC transformation is used to convert the PSSMs into fixed-length vectors
- The results obtained here stand for the state-of-the-art performance of taxonomic protein fold recognition
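As a rough illustration of the time-complexity comparison in the F86 and F199 results, the two growth terms can be evaluated for a concrete dataset size. This is a sketch under stated assumptions: n is taken to be the number of sequences and L a typical sequence length, the value L = 200 is hypothetical, and constant factors are ignored, so only the ratio is meaningful.

```python
def swpssm_cost(n: int, seq_len: int) -> int:
    """Growth term O(n^2 * L^2) quoted for SWPSSM (constants ignored)."""
    return n * n * seq_len * seq_len


def acc_cost(n: int, seq_len: int) -> int:
    """Growth term O(n * L^2 + n^2 * L) quoted for ACC (constants ignored)."""
    return n * seq_len * seq_len + n * n * seq_len


if __name__ == "__main__":
    n = 7437        # number of sequences in the F199 dataset
    seq_len = 200   # hypothetical typical sequence length (assumption)
    ratio = swpssm_cost(n, seq_len) / acc_cost(n, seq_len)
    print(f"SWPSSM / ACC growth-term ratio: {ratio:.1f}")
```

Under these assumptions the ACC term is smaller by roughly a factor of L when n is much larger than L, since the n^2 * L part dominates its cost.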