Protein Fold Recognition as a Data Mining Coursework Project

Badri Adhikari
Department of Computer Science
University of Missouri-Columbia

Overview
• Problem Description
• Data Description and Specification
• Data Preprocessing
• Methodology
• Discussion of Results

The Problem
Protein [Structure] Database: Protein 1, Protein 2, Protein 3
Identify if two proteins belong to the same fold [Binary Classification]

Protein Pair         Feature 1   Feature 2   Feature 3   Same Fold?
Protein1-Protein2    0.005       0.065       0.79        Y
Protein2-Protein3    0.034       0.034       0.152       N

The Problem
Recognizing Potential New Customers

Customer Identification   Feature 1   Feature 2   Feature 3   Potential Customer
Customer 0001             0.005       0.065       0.79        Y
Customer 0002             0.034       0.034       0.152       Y
Customer 0003             0.005       0.065       0.79        N

Protein fold recognition

Protein Pair         Feature 1   Feature 2   Feature 3   Same Fold?
Protein1-Protein2    0.005       0.065       0.79        Y
Protein2-Protein3    0.034       0.034       0.152       N

Data Specification
• File size: 1.5G
• Examples count: 951600
• Positive (+1) labels: 7438
• Negative (-1) labels: 944162
• Number of features: 84

Data Description
(first 7 of 84 features shown)
#119l-d119l 1alo-d1alo_1 -1 1:1.62 2:1.13 3:0.913193847041375 4:0.592859664812671 5:0.900250016135814 6:0.373223645145762 7:0.1888487926
#119l-d119l 1chka-d1chka +1 1:1.62 2:2.38 3:0.940068662681886 4:0.750255408177719 5:0.915009668297068 6:0.42843991595408 7:0.21728602097
#119l-d119l 1fcdc-d1fcdc1 -1 1:1.62 2:0.8 3:0.817167643633653 4:0.232414817929789 5:0.856456855019105 6:0.218979560580544 7:0.03189285886
#119l-d119l 1gal-d1gal_1 -1 1:1.62 2:3.22 3:0.917750359046243 4:0.658669958746871 5:0.901110932746401 6:0.415795355014216 7:0.1805356684
#119l-d119l 1gbs-d1gbs +1 1:1.62 2:1.85 3:0.933021070719294 4:0.703993132134445 5:0.911155732036236 6:0.358955863036017 7:0.1302872472
#119l-d119l 1iov-d2dln_2 -1 1:1.62 2:2.1 3:0.893588888139836 4:0.603748017846831 5:0.886157496493048 6:0.374372299060332 7:0.15693565952
#119l-d119l 1kte-d1kte -1 1:1.62 2:1.05 3:0.884958967781326 4:0.527597954487768 5:0.883935799166059 6:0.238710961583473 7:0.0332240052
#119l-d119l 1sly-d1sly_2 +1 1:1.62 2:1.68 3:0.914572798274142 4:0.607628373774305 5:0.900757073008603 6:0.294119634790149 7:0.0624425214
#119l-d119l 2baa-d2baa +1 1:1.62 2:2.43 3:0.856257831918934 4:0.450623745727366 5:0.869561412073741 6:0.378932373372537 7:0.1569895990
#119l-d119l 6lyt-d193l +1 1:1.62 2:1.29 3:0.904095075521014 4:0.602608563805903 5:0.893625220106668 6:0.337268512070013 7:0.1408736485

Each row consists of:
• Protein query-target pair as example id
• Label (+1 or -1)
• Feature values for the example

Preprocessing Task 1: Group related data rows
Problem: The records are not all independent of each other.

#119l-d119l 1alo-d1alo_1 -1 1:1.62 2:1.13 3:0.913193847041375 4:0.592859664812671 5:0.900250016135814 6:0.373223645145762 7:0.1888487926
#119l-d119l 1chka-d1chka +1 1:1.62 2:2.38 3:0.940068662681886 4:0.750255408177719 5:0.915009668297068 6:0.42843991595408 7:0.21728602097
#1aab-d1aab 1cyx-d1cyx -1 1:0.83 2:1.58 3:0.771248274633033 4:0.362259086646671 5:0.824117832571076 6:0.248160387073783
#119l-d119l 1fcdc-d1fcdc1 -1 1:1.62 2:0.8 3:0.817167643633653 4:0.232414817929789 5:0.856456855019105 6:0.218979560580544 7:0.03189285886

Solution: Group records with the same query template together, so that they all fall either in the test data set or in the training data set.
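The grouping step in Preprocessing Task 1 could be sketched in Python roughly as follows. This is only an illustration, not the code used in the project: it assumes each record sits on one line in the layout shown on the slides (`#<query> <target> <label> <index>:<value> ...`), and the helper names `query_id` and `split_by_query`, the `test_fraction` parameter, and the shortened sample records are hypothetical.

```python
import random
from collections import defaultdict


def query_id(record: str) -> str:
    """Return the query identifier (e.g. '119l-d119l') of a record line.

    Assumes the layout seen on the slides:
    '#<query> <target> <label> <index>:<value> ...'
    """
    return record.split()[0].lstrip("#")


def split_by_query(records, test_fraction=0.1, seed=0):
    """Split records so that all rows of one query land on the same side."""
    groups = defaultdict(list)
    for rec in records:
        groups[query_id(rec)].append(rec)

    # Shuffle the queries (not the rows) so whole groups move together.
    queries = sorted(groups)
    random.Random(seed).shuffle(queries)

    n_test = max(1, round(len(queries) * test_fraction))
    test = [r for q in queries[:n_test] for r in groups[q]]
    train = [r for q in queries[n_test:] for r in groups[q]]
    return train, test


# Shortened versions of the rows shown above (feature lists truncated).
records = [
    "#119l-d119l 1alo-d1alo_1 -1 1:1.62 2:1.13",
    "#119l-d119l 1chka-d1chka +1 1:1.62 2:2.38",
    "#1aab-d1aab 1cyx-d1cyx -1 1:0.83 2:1.58",
    "#119l-d119l 1fcdc-d1fcdc1 -1 1:1.62 2:0.8",
]
train, test = split_by_query(records, test_fraction=0.5)
```

Splitting on the query identifier rather than on individual rows keeps the 1aab record apart from the three 119l records, so no query contributes rows to both sets.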
Preprocessing Task 2: Balance the positive and negative data
• Problem: Dataset has just 0.78% positive examples

[Figure: Balancing the number of positive and negative examples. All examples are divided into all +ve examples and the -ve examples. A random sample of -ve examples (equal in number to the +ve examples) is combined with all +ve examples to form the balanced examples. The remaining -ve examples are used only for testing.]

Methodology: SVMlight as the mining tool
• SVMlight is an implementation of Support Vector Machines (SVMs) in C

    $ svm_learn example1/train.dat example1/model
    $ svm_classify example1/test.dat example1/model example1/prediction

Methodology: Process for deciding the Kernel Function
• Many different kernels: linear, polynomial, radial basis function, or user defined.
• Consider the RBF kernel $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$
• Parameters to consider:
    -m   memory size of cache for kernel evaluations
    -g   gamma value for RBF kernel
    -c   trade-off between training error and margin
• Use cross-validation to find the best parameters C and γ
• Use the best parameters C and γ to train on the whole training set

Methodology: Parameter Determination
• Determining the gamma parameter:
  – Ran training and testing for 100 gamma values between 0 and 1
  – Found gamma = 0.15 as the best value
  – Ran again for 120 values from 0 to 0.3 to find a more precise gamma
  – Found the best value of gamma to be 0.1
• Used SVMlight's default C (reported as 0; when -c is not set, SVMlight derives C from the data as [avg. x*x]^-1)

Evaluation with 10-fold cross-validation
• For each threshold value, the average sensitivity and specificity were computed from the values in each fold
• ROC curve

[Figure: ROC curve, sensitivity against 1-specificity]
[Figure: Sensitivity and specificity against threshold]

For threshold = -1.02:

Threshold   Sensitivity   Specificity   FPR     Accuracy   Precision
-1.02       0.786         0.769         0.231   0.769      0.027

References
• Jianlin Cheng and Pierre Baldi, A machine learning information retrieval approach to protein fold recognition
• Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support Vector Classification, available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
• Payam Refaeilzadeh, Lei Tang, and Huan Liu, Cross-Validation, available at http://www.public.asu.edu/~ltang9/papers/ency-cross-validation.pdf
• Classroom slides at http://people.cs.missouri.edu/~chengji/datamining2012/Chapter4_Classification_Prediction.ppt

Thank you for your time
Questions and comments are welcome.
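
Backup: checking the reported metrics
The very low precision in the threshold = -1.02 row, next to reasonable sensitivity and specificity, is what the class imbalance would lead one to expect. The sketch below is a rough consistency check, not the evaluation code used in the project. It assumes the reported sensitivity and specificity apply to the full class counts from the Data Specification slide (the slides do not state the exact composition of the test folds), and the function name `metrics_from_rates` is hypothetical.

```python
def metrics_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Derive FPR, accuracy and precision from rates and class counts."""
    tp = sensitivity * n_pos   # expected true positives
    tn = specificity * n_neg   # expected true negatives
    fp = n_neg - tn            # expected false positives
    return {
        "fpr": fp / n_neg,
        "accuracy": (tp + tn) / (n_pos + n_neg),
        "precision": tp / (tp + fp),
    }


# Rates from the threshold = -1.02 row; counts from the Data Specification slide.
m = metrics_from_rates(0.786, 0.769, n_pos=7438, n_neg=944162)
print({name: round(value, 3) for name, value in m.items()})
```

Under this assumption the accuracy comes out at about 0.769 and the precision at about 0.026, close to the reported 0.769 and 0.027. With roughly 127 negatives for every positive, even a false positive rate of 0.231 produces far more false positives than true positives.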