Protein Fold Recognition
A Coursework Project on Data Mining

Badri Adhikari
Department of Computer Science, University of Missouri-Columbia
5/10/2012

Abstract
Protein fold recognition is a well-known problem in bioinformatics. There are many different approaches and algorithms for predicting whether a given protein sequence belongs to a protein family, superfamily, or, more specifically, a fold. Here we discuss a classification approach using support vector machines for predicting whether two proteins (a pair) belong to the same fold/superfamily/family (any), using an existing dataset. The radial basis function was chosen as the kernel. Through a 10-fold cross-validation approach to training and testing, the best values of the gamma and c parameters were found to be 0.0175 and 0.5, respectively. Results show that although the accuracy and sensitivity are above 80%, the precision is extremely low, below 5%. The area under the curve for the final model was found to be 0.874. The project is primarily based on previously published work (Cheng & Baldi, 2006) and is being continued as a research project.

Supplementary Information: The supplementary report, code and source data are available at http://web.missouri.edu/~bap54/protein_fold_recognition_2012/

1 Introduction
Protein fold recognition methods are developed to identify the structural relationship between proteins. They are usually prediction-based methods, structural methods, sequence-based methods, or a combination of these (Lindahl & Elofsson, 2000). Instead of developing one specialized alignment method for fold recognition, one can also use information retrieval methods that leverage features extracted using existing prediction tools (Cheng & Baldi, 2006). This project is not about extracting feature values using tools from the field or structural alignment of the protein pairs. Instead, it is about using the already extracted features (1) to derive a model that represents the examples, and (2) to evaluate the model so derived.

2 Methods
2.1 Data Description
The input data used for the mining was obtained from previously published research (Cheng & Baldi, 2006) and is available at http://casp.rnet.missouri.edu/download/linda_lob.bin2. The feature values in the data were computed using different methods and tools. The sequences and models (Lindahl & Elofsson, 2000) used to generate these features are derived from the SCOP database.

The dataset has 951,600 examples, of which 7,438 are +1 labeled examples and the remaining 944,162 are -1 labeled examples. There are 84 features with different information gain values. Each example has two rows: a title row, which begins with a hash and contains the protein pair name along with the structure classification, and a feature row, which contains the feature values describing the protein pair named in the title row.

Figure 1 A portion of the input dataset. Title rows begin with a hash and the others are feature rows.

2.2 Data Preprocessing
The examples were evenly split into 10 subsets for 10-fold cross-validation. However, this was not done randomly. First, examples having the same query protein were grouped together, which produced 976 clusters because there are 976 unique query proteins in the available data. The 10 subsets were then created from these 976 clusters without breaking any cluster, so that the training dataset never contains a query protein that will be used in testing.
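To make this preprocessing step concrete, the following Perl sketch shows one way the clustering and splitting could be scripted. It is an illustrative sketch rather than the project's actual preprocessing code: the input file name is hypothetical, and it assumes that each title row names the query protein as its first token and is immediately followed by its feature row.

#!/usr/bin/perl
# Sketch: group examples by query protein and split the clusters into
# 10 subsets for cross-validation without breaking any cluster.
use strict;
use warnings;

my $infile = 'fold_pairs.dat';    # hypothetical plain-text copy of the dataset
my %cluster;                      # query protein -> its examples (title + feature rows)

open my $in, '<', $infile or die "Cannot open $infile: $!";
while (my $title = <$in>) {
    next unless $title =~ /^#/;            # title rows begin with a hash
    my $features = <$in>;                  # the paired feature row
    last unless defined $features;
    my ($query) = $title =~ /^#\s*(\S+)/;  # assumed: query protein is the first token
    push @{ $cluster{$query} }, $title, $features;
}
close $in;

# Open one output file per subset.
my @fold_fh;
for my $i (0 .. 9) {
    open my $fh, '>', "subset_$i.dat" or die "Cannot open subset_$i.dat: $!";
    $fold_fh[$i] = $fh;
}

# Assign whole clusters (976 of them for this dataset) to subsets round-robin.
my $i = 0;
for my $query (sort keys %cluster) {
    print { $fold_fh[ $i++ % 10 ] } @{ $cluster{$query} };
}
close $_ for @fold_fh;

Assigning whole clusters keeps every query protein confined to a single subset, which is exactly the property the cross-validation design above requires; the actual script may additionally balance the number of examples per subset.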
2.3 Training and Classification
While using SVMlight, instead of randomly picking kernels and trying different parameters, a support vector classification guide (Hsu, Chang, & Lin, 2010) was followed. The RBF kernel, K(x, y) = exp(-γ||x - y||²), was considered for generating the model. Then, 10-fold cross-validation was performed to determine the values of gamma and c. In each iteration, 9 subsets were used for training and the remaining subset was used as the test set. To find the value of gamma, values between 0 and 1 were first considered with a step size of 0.1. Better performance was observed at values close to 0.1, so more precise values of gamma were used to repeat the process. Figure 3 shows the performance of the classifier for values of gamma between 0 and 0.19 with a step size of 0.01.

Figure 3 Average values of sensitivity and specificity for different values of gamma.

To deal with the problem of biased examples, a test was conducted to see whether balancing helps improve performance. Two random training and test sets were prepared. Using the previously known values of gamma, learning and classification were performed; the areas under the curve for the two models were 0.82 and 0.87, as shown in the ROC curves in Figure 2. Keeping the two test sets as they were, the training sets were then balanced by filtering away most of the negative examples so that there were equal numbers of negative and positive examples. Again using the same value of gamma, learning and classification were performed with exactly the same test sets, and the areas under the curve rose to 0.94 and 0.93. Surprisingly, balancing the examples actually improves the performance of training and classification.

Figure 2 Comparison of four models: filtered 01 and 02 are models generated from balanced training data, and unfiltered 01 and 02 are models generated from the training data as is. The same test data were used for all four models.

Upon finding that the classifier performs best near gamma values of 0.01, another round of training and testing was conducted with gamma ranging from 0.0025 to 0.0475 to find a more precise value. Figure 4 shows the performance. It is evident that there is no distinguishably best value of gamma; however, sensitivity and specificity are highest at gamma equal to 0.0175.

Figure 4 More precise values of gamma against average sensitivity and specificity.

To determine the best value of c, training and classification were performed with a range of values of c from 0 to 1, keeping gamma constant at 0.0175, as shown in Figure 5. We observe that, at c-values higher than 0.05, the choice of c has little impact on any of the measurements; any value of c greater than 0.1 and less than 1 seems appropriate.

Figure 5 Average values of sensitivity, accuracy and precision against a range of values of the c parameter of the RBF kernel.
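The balancing experiment described in this section reduces to randomly discarding negative examples until the two classes are the same size. The Perl sketch below illustrates that filtering step on a training file already converted to SVMlight format (one "label feature:value ..." line per example); the file names are assumptions, not the project's actual scripts.

#!/usr/bin/perl
# Sketch: balance an SVMlight-format training file by keeping all +1
# examples and an equally sized random sample of the -1 examples.
use strict;
use warnings;
use List::Util qw(shuffle);

my ($infile, $outfile) = ('train.dat', 'train_balanced.dat');    # assumed names

my (@pos, @neg);
open my $in, '<', $infile or die "Cannot open $infile: $!";
while (my $line = <$in>) {
    if ($line =~ /^\s*\+?1\s/) { push @pos, $line }    # positive example
    else                       { push @neg, $line }    # negative example
}
close $in;

# Randomly keep only as many negatives as there are positives.
my @kept_neg = (shuffle @neg)[0 .. $#pos];

open my $out, '>', $outfile or die "Cannot open $outfile: $!";
print {$out} shuffle(@pos, @kept_neg);
close $out;

The balanced file can then be substituted for the original training file when calling svm_learn with the RBF kernel (selected with -t 2, with the gamma and c values supplied through the -g and -c options).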
2.4 Tools Used
The support vector machine implementation used was SVMlight. The tool is basically two programs: svm_learn for training and svm_classify for classification. Using the tool is quite straightforward:

$ svm_learn example1/train.dat example1/model
$ svm_classify example1/test.dat example1/model example1/prediction

Perl was used as the scripting language for all data transformations, pre-processing and calculations. R was used to plot the ROC curves and to calculate the areas under the curves. MS Excel was used for the other graphs.

3 Results
The final model was built using gamma equal to 0.0175 and c equal to 0.5. Figure 6 shows that the area under the curve for the model is 0.874.

Figure 6 ROC curve for the final model, with area under the curve equal to 0.874, shown with different cutoff values represented by different colors.

4 Future Work
The following future work is planned:
1. Perform more precise balancing of the data by selectively removing examples instead of randomly filtering them out. Rather than filtering the whole training set, the filtering should be applied to individual clusters of examples. This could make the examples more discriminative.
2. Perform a grid search to find the best values of gamma and c for the RBF kernel.
3. Perform classification at more specific levels: the family, superfamily and fold levels.
4. Apply neural network algorithms for classification.
5. Apply the random forest algorithm for classification.
6. Use different feature selection methods to improve accuracy.
7. Generate new features for each pair of proteins to improve prediction accuracy.

References
Cheng, J., & Baldi, P. (2006). A machine learning information retrieval approach to protein fold recognition. Bioinformatics.
Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2010). A Practical Guide to Support Vector Classification. National Taiwan University.
Lindahl, E., & Elofsson, A. (2000). Identification of Related Proteins on Family, Superfamily and Fold Level. JMB.
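Supplementary note: the sensitivity, specificity, accuracy and precision values reported in Sections 2.3 and 3 are derived from the svm_classify output (one decision value per test example, whose sign gives the predicted class) and the labels in the test file; as noted in Section 2.4, the project used Perl and R for these calculations. The sketch below shows one way such a comparison could be written. It is not the project's actual evaluation code: the file names and the zero decision-value threshold are assumptions.

#!/usr/bin/perl
# Sketch: compute sensitivity, specificity, precision and accuracy by
# comparing svm_classify decision values with the true test labels.
use strict;
use warnings;

my ($test_file, $pred_file) = ('test.dat', 'prediction');    # assumed names

open my $tf, '<', $test_file or die "Cannot open $test_file: $!";
open my $pf, '<', $pred_file or die "Cannot open $pred_file: $!";

my ($tp, $tn, $fp, $fn) = (0, 0, 0, 0);
while (defined(my $t = <$tf>) and defined(my $p = <$pf>)) {
    my ($label) = split ' ', $t;      # first token of the test line: +1 or -1
    my $pred = $p > 0 ? 1 : -1;       # threshold the decision value at 0
    if    ($label > 0 && $pred > 0) { $tp++ }
    elsif ($label < 0 && $pred < 0) { $tn++ }
    elsif ($label < 0 && $pred > 0) { $fp++ }
    else                            { $fn++ }
}
close $tf;
close $pf;

# Report each measure as a percentage, guarding against empty denominators.
sub pct { my ($num, $den) = @_; return $den ? sprintf('%.2f%%', 100 * $num / $den) : 'n/a' }
print 'Sensitivity: ', pct($tp, $tp + $fn), "\n";
print 'Specificity: ', pct($tn, $tn + $fp), "\n";
print 'Precision:   ', pct($tp, $tp + $fp), "\n";
print 'Accuracy:    ', pct($tp + $tn, $tp + $tn + $fp + $fn), "\n";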