Protein Fold Recognition as a Data Mining Coursework Project

Badri Adhikari
Department of Computer Science
University of Missouri-Columbia
Overview
• Problem Description
• Data Description and Specification
• Data Preprocessing
• Methodology
• Discussion of Results
The Problem
Protein [Structure] Database: Protein 1, Protein 2, Protein 3
Identify if two proteins belong to the same fold [Binary Classification]

Protein Pair       Feature 1  Feature 2  Feature 3  Same Fold?
Protein1-Protein2  0.005      0.065      0.79       Y
Protein2-Protein3  0.034      0.034      0.152      N
The Problem
Identify if two proteins belong to the same fold [Binary Classification]

Recognizing Potential New Customers (Customer Identification database)
Customer       Feature 1  Feature 2  Feature 3  Potential Customer
Customer 0001  0.005      0.065      0.79       Y
Customer 0002  0.034      0.034      0.152      Y
Customer 0003  0.005      0.065      0.79       N

Protein fold recognition
Protein Pair       Feature 1  Feature 2  Feature 3  Same Fold?
Protein1-Protein2  0.005      0.065      0.79       Y
Protein2-Protein3  0.034      0.034      0.152      N
Data Specification
• File size: 1.5 GB
• Number of examples: 951,600
• Positive (+1) labels: 7,438
• Negative (-1) labels: 944,162
• Number of features: 84
Data Description
#119l-d119l 1alo-d1alo_1
-1 1:1.62 2:1.13 3:0.913193847041375 4:0.592859664812671 5:0.900250016135814 6:0.373223645145762 7:0.1888487926
#119l-d119l 1chka-d1chka
+1 1:1.62 2:2.38 3:0.940068662681886 4:0.750255408177719 5:0.915009668297068 6:0.42843991595408 7:0.21728602097
#119l-d119l 1fcdc-d1fcdc1
-1 1:1.62 2:0.8 3:0.817167643633653 4:0.232414817929789 5:0.856456855019105 6:0.218979560580544 7:0.03189285886
#119l-d119l 1gal-d1gal_1
-1 1:1.62 2:3.22 3:0.917750359046243 4:0.658669958746871 5:0.901110932746401 6:0.415795355014216 7:0.1805356684
#119l-d119l 1gbs-d1gbs
+1 1:1.62 2:1.85 3:0.933021070719294 4:0.703993132134445 5:0.911155732036236 6:0.358955863036017 7:0.1302872472
#119l-d119l 1iov-d2dln_2
-1 1:1.62 2:2.1 3:0.893588888139836 4:0.603748017846831 5:0.886157496493048 6:0.374372299060332 7:0.15693565952
#119l-d119l 1kte-d1kte
-1 1:1.62 2:1.05 3:0.884958967781326 4:0.527597954487768 5:0.883935799166059 6:0.238710961583473 7:0.0332240052
#119l-d119l 1sly-d1sly_2
+1 1:1.62 2:1.68 3:0.914572798274142 4:0.607628373774305 5:0.900757073008603 6:0.294119634790149 7:0.0624425214
#119l-d119l 2baa-d2baa
+1 1:1.62 2:2.43 3:0.856257831918934 4:0.450623745727366 5:0.869561412073741 6:0.378932373372537 7:0.1569895990
#119l-d119l 6lyt-d193l
+1 1:1.62 2:1.29 3:0.904095075521014 4:0.602608563805903 5:0.893625220106668 6:0.337268512070013 7:0.1408736485
• Lines beginning with #: the query-target protein pair, which serves as the example id
• First field of each data line: the label (+1 or -1)
• Remaining fields: the feature values for each example, as index:value pairs
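The layout shown above can be read with a few lines of Python. This is only a minimal sketch: it assumes each example is a "#query target" line followed by one data line, as in the excerpt, and the names parse_examples and Example are illustrative, not part of the original project.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# (query id, target id, label, {feature index: value})
Example = Tuple[str, str, int, Dict[int, float]]


def parse_examples(lines: Iterable[str]) -> Iterator[Example]:
    """Yield one example per '#query target' line followed by a data line."""
    header: Optional[Tuple[str, str]] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Id line: "#<query> <target>"
            parts = line[1:].split()
            header = (parts[0], parts[1]) if len(parts) >= 2 else None
            continue
        if header is None:
            continue  # data line without an id line; skipped in this sketch
        fields = line.split()
        label = int(fields[0])  # "+1" parses to 1, "-1" to -1
        features: Dict[int, float] = {}
        for item in fields[1:]:
            index, value = item.split(":", 1)
            features[int(index)] = float(value)
        yield header[0], header[1], label, features
        header = None


if __name__ == "__main__":
    sample = [
        "#119l-d119l 1alo-d1alo_1",
        "-1 1:1.62 2:1.13 3:0.913193847041375",
        "#119l-d119l 1chka-d1chka",
        "+1 1:1.62 2:2.38 3:0.940068662681886",
    ]
    for query, target, label, features in parse_examples(sample):
        print(query, target, label, len(features))
```

Reading the file line by line, rather than loading it whole, keeps memory use modest for a file of this size.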
Preprocessing Task 1: Group related data rows
Problem: Not all records are independent of each other; many share the same query.
#119l-d119l 1alo-d1alo_1
-1 1:1.62 2:1.13 3:0.913193847041375 4:0.592859664812671 5:0.900250016135814 6:0.373223645145762 7:0.1888487926
#119l-d119l 1chka-d1chka
+1 1:1.62 2:2.38 3:0.940068662681886 4:0.750255408177719 5:0.915009668297068 6:0.42843991595408 7:0.21728602097
#1aab-d1aab 1cyx-d1cyx
-1 1:0.83 2:1.58 3:0.771248274633033 4:0.362259086646671 5:0.824117832571076 6:0.248160387073783
#119l-d119l 1fcdc-d1fcdc1
-1 1:1.62 2:0.8 3:0.817167643633653 4:0.232414817929789 5:0.856456855019105 6:0.218979560580544 7:0.03189285886
Solution: Group records with the same query together, so that they all fall in either the training set or the test set.
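One way to keep all records of a query on the same side of a split is to assign folds per query id rather than per record. The sketch below assumes examples are available as (query, record) pairs; assign_folds and split_by_query are hypothetical helper names, and the original project may have done the grouping differently.

```python
import random
from typing import Dict, Iterable, List, Tuple


def assign_folds(queries: Iterable[str], n_folds: int, seed: int = 0) -> Dict[str, int]:
    """Map each distinct query id to one fold index in [0, n_folds)."""
    unique = sorted(set(queries))
    random.Random(seed).shuffle(unique)
    return {query: i % n_folds for i, query in enumerate(unique)}


def split_by_query(
    examples: List[Tuple[str, object]],
    test_fold: int,
    folds: Dict[str, int],
) -> Tuple[List[Tuple[str, object]], List[Tuple[str, object]]]:
    """Split (query, record) pairs so that no query appears on both sides."""
    train, test = [], []
    for query, record in examples:
        (test if folds[query] == test_fold else train).append((query, record))
    return train, test


if __name__ == "__main__":
    data = [
        ("119l-d119l", "row 1"),
        ("119l-d119l", "row 2"),
        ("1aab-d1aab", "row 3"),
        ("119l-d119l", "row 4"),
    ]
    folds = assign_folds((q for q, _ in data), n_folds=2)
    train, test = split_by_query(data, test_fold=0, folds=folds)
    print(len(train), len(test))
```

Because the fold is looked up by query id, interleaved rows such as the 1aab-d1aab record in the excerpt above cannot separate records of the same query.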
Preprocessing Task 2: Balance the positive and negative data
• Problem: The dataset has just 0.78% positive examples
• Balancing the number of positive and negative examples:
– All examples are divided into all +ve examples and all -ve examples
– Balanced examples = all +ve examples + a random sample of -ve examples (equal in number to the +ve examples)
– The remaining -ve examples are used only for testing
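The balancing step amounts to random undersampling of the negative class. The following is a minimal sketch under the assumption that examples are (label, record) pairs with labels +1 and -1; the function name balance and the fixed seed are illustrative choices.

```python
import random
from typing import List, Tuple

Labeled = Tuple[int, object]  # (label, record), label is +1 or -1


def balance(examples: List[Labeled], seed: int = 0) -> Tuple[List[Labeled], List[Labeled]]:
    """Return (balanced examples, remaining negatives kept only for testing)."""
    positives = [e for e in examples if e[0] == 1]
    negatives = [e for e in examples if e[0] == -1]
    rng = random.Random(seed)
    rng.shuffle(negatives)
    keep = min(len(positives), len(negatives))
    balanced = positives + negatives[:keep]  # equal numbers of +ve and -ve
    rng.shuffle(balanced)
    return balanced, negatives[keep:]


if __name__ == "__main__":
    toy = [(1, "p%d" % i) for i in range(3)] + [(-1, "n%d" % i) for i in range(10)]
    balanced, held_out = balance(toy)
    print(len(balanced), len(held_out))
```

If every positive example in the deck's dataset were used this way, the balanced set would hold 2 × 7,438 = 14,876 examples and 944,162 − 7,438 = 936,724 negatives would remain for testing only.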
Methodology
SVMlight as the mining tool
• SVMlight is an implementation of Support Vector Machines (SVMs) in C
$ svm_learn example1/train.dat example1/model
$ svm_classify example1/test.dat example1/model example1/prediction
Methodology: Process for Deciding the Kernel Function
• Many different kernels: linear, polynomial, radial basis function (RBF), or user defined.
• Consider the RBF kernel $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$
• Parameters to consider:
– -m: memory size of cache for kernel evaluations
– -g: gamma value for the RBF kernel
– -c: trade-off between training error and margin
• Use cross-validation to find the best parameters C and γ
• Use the best C and γ to train on the whole training set
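To make the role of gamma concrete, the sketch below evaluates the RBF kernel for two feature vectors and assembles an svm_learn argument list. The -g, -c and -m options are the ones listed above; the "-t 2" option, which selects the RBF kernel in SVMlight, is not shown in the slides and is included here as an assumption. The helper names rbf_kernel and svm_learn_args are hypothetical, and nothing is executed.

```python
import math
from typing import List, Optional, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """K(x, y) = exp(-gamma * ||x - y||^2) for equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def svm_learn_args(
    train_file: str,
    model_file: str,
    gamma: float,
    c: Optional[float] = None,
    cache_mb: Optional[int] = None,
) -> List[str]:
    """Build (but do not run) an svm_learn command line for an RBF kernel."""
    args = ["svm_learn", "-t", "2", "-g", str(gamma)]  # -t 2: RBF (assumption)
    if c is not None:
        args += ["-c", str(c)]  # omitted: svm_learn falls back to its default C
    if cache_mb is not None:
        args += ["-m", str(cache_mb)]
    return args + [train_file, model_file]


if __name__ == "__main__":
    print(rbf_kernel([1.62, 1.13], [1.62, 2.38], gamma=0.1))
    print(" ".join(svm_learn_args("example1/train.dat", "example1/model", gamma=0.1)))
```

A larger gamma makes the kernel value fall off faster with distance, which is why gamma is worth tuning by cross-validation rather than fixing in advance.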
Methodology: Parameter Determination
• Determining the gamma parameter:
– Ran training and testing for 100 gamma values between 0 and 1
– Found gamma = 0.15 as the best value
– Ran again with 120 values from 0 to 0.3 to find a more precise gamma
– Found the best value of gamma to be 0.1
• Used default C value of 0
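The two-pass gamma search can be sketched as a coarse grid followed by a finer grid over a narrower range. Several things here are assumptions rather than details from the project: the grids skip gamma = 0, the second range is taken as twice the coarse optimum (mirroring 0.15 leading to a search up to 0.3), and score stands in for whatever train-and-test run produced the quality measure. The toy score in the usage section is invented purely to make the sketch runnable.

```python
from typing import Callable, List, Tuple


def gamma_grid(upper: float, count: int) -> List[float]:
    """Return count evenly spaced gamma values in (0, upper]; 0 is skipped."""
    return [upper * i / count for i in range(1, count + 1)]


def coarse_to_fine(
    score: Callable[[float], float],
    coarse_upper: float = 1.0,
    coarse_count: int = 100,
    fine_count: int = 120,
) -> Tuple[float, float]:
    """Pick the best gamma on a coarse grid, then refine on a finer grid."""
    coarse_best = max(gamma_grid(coarse_upper, coarse_count), key=score)
    fine_upper = 2 * coarse_best  # assumption: search from 0 to twice the coarse optimum
    fine_best = max(gamma_grid(fine_upper, fine_count), key=score)
    return coarse_best, fine_best


if __name__ == "__main__":
    # Toy stand-in for "train with this gamma, return test-set quality".
    def toy_score(gamma: float) -> float:
        return -(gamma - 0.1) ** 2

    print(coarse_to_fine(toy_score))
```

In practice score would wrap a full cross-validation run, so each grid point costs one round of training and testing.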
Evaluation with 10-fold cross-validation
• For different threshold values, the average sensitivity and specificity were computed from the values in each fold
• ROC curve
[Figure: ROC curve, sensitivity vs. 1-specificity]
[Figure: sensitivity and specificity vs. threshold]
For threshold = -1.02
Threshold  Sensitivity  Specificity  FPR    Accuracy  Precision
-1.02      0.786        0.769        0.231  0.769     0.027
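The reported measures all follow from the four confusion counts at a given threshold. The sketch below assumes that an example is predicted positive when its SVM output is at or above the threshold; the function names and the toy scores are illustrative and not taken from the project.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), predicting +1 when score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the measures from the table; empty denominators give 0.0."""
    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    specificity = ratio(tn, tn + fp)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": specificity,
        "fpr": 1.0 - specificity,
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
    }


if __name__ == "__main__":
    toy_scores = [0.9, 0.2, -0.5, -1.5, -1.1]
    toy_labels = [1, 1, -1, -1, 1]
    print(metrics(*confusion_counts(toy_scores, toy_labels, threshold=-1.02)))
```

Sweeping the threshold and plotting sensitivity against FPR gives the ROC curve. The low precision is what the class imbalance would predict: with about 0.78% positives, sensitivity 0.786 and FPR 0.231 imply a precision of roughly 0.026, close to the reported 0.027.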
Thank you for your time
Questions and comments are welcome.