An LP-based classifier

Wendy Testaburger
Math 364
December 5, 2013
Introduction
This project requires the student to determine whether or not a chemical is usable for the
development of a particular drug. Each chemical has 40 descriptors that describe its
characteristics. The students were given 90 chemicals with different descriptor values and
use this data to fit a linear function that predicts whether or not a chemical is usable.
The linear function to be fitted is
$$f(x) = w_0 + \sum_{j=1}^{40} w_j x_j,$$
where $x_j$ is the value of descriptor $j$. The student must then use a linear programming model
to find the $w_j$ and $w_0$ that give the best fit for the linear function to determine whether
the chemical is usable or not.
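As a rough illustration of how such a linear function classifies a chemical, the sketch below is written in Python rather than the MATLAB and AMPL used in the appendix. The names `linear_score` and `classify`, the default threshold of 0, and the 3-descriptor example values are assumptions made for illustration only; they are not part of the project files.

```python
from typing import Sequence


def linear_score(x: Sequence[float], w: Sequence[float], w0: float) -> float:
    """Return w0 + sum_j w[j] * x[j] for one chemical's descriptor values."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    return w0 + sum(wj * xj for wj, xj in zip(w, x))


def classify(x: Sequence[float], w: Sequence[float], w0: float,
             threshold: float = 0.0) -> int:
    """Label a chemical +1 (usable) if its score reaches the threshold, else -1."""
    return 1 if linear_score(x, w, w0) >= threshold else -1


# Hypothetical 3-descriptor example (the project itself uses 40 descriptors).
w = [0.5, -1.0, 0.25]
w0 = 1.0
print(classify([2.0, 1.0, 4.0], w, w0))  # score is 2.0, so the label is 1
print(classify([0.0, 3.0, 0.0], w, w0))  # score is -2.0, so the label is -1
```

The threshold argument plays the same role as the delta parameter used later in the report.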
Data Organization
The first thing I did with the training set data was use Excel to sort the data into usable
chemicals and un-usable chemicals. For each descriptor I took the average, minimum, maximum,
and median over the usable chemicals and the un-usable chemicals separately.
For the usable data set I took the difference between the minimum and the maximum of each
descriptor. All descriptors except 2, 3, and 17 had differences of about 90 or more in
magnitude. This led me to believe that for most descriptors, say descriptor 1, the values are
too spread out to make a strong argument that a chemical is usable because of that descriptor.
I then took the minimum and maximum differences for the un-usable chemicals, and the result was
the same: only descriptors 2, 3, and 17 had fairly constant values. The tables below provide
the values:
Usable Chemicals

| | Descriptor 2 | Descriptor 3 | Descriptor 17 |
|---|---|---|---|
| Average | 9.04125 | -32.7916 | 8.090725 |
| Minimum | 8.006 | -34.884 | 3.03 |
| Maximum | 9.998 | -30.335 | 12.889 |
| Median | 8.9875 | -32.81 | 8.407 |
| Min/Max Difference | -1.992 | -4.549 | -9.859 |

Table 1: Usable Chemical Information

Un-Usable Chemicals

| | Descriptor 2 | Descriptor 3 | Descriptor 17 |
|---|---|---|---|
| Average | -10.8772 | 28.02896 | -17.28 |
| Minimum | -11.948 | 25.319 | -22.886 |
| Maximum | -10.005 | 46.287 | 12.526 |
| Median | -10.8515 | 27.649 | -17.128 |
| Min/Max Difference | -1.943 | -20.968 | -35.412 |

Table 2: Un-Usable Chemical Information
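The per-descriptor statistics in Tables 1 and 2 were computed in Excel. A minimal Python sketch of the same five quantities is shown here for reference; the function name `summarize` and the sample values are hypothetical, and the difference is taken as minimum minus maximum to match the sign used in the tables.

```python
import statistics


def summarize(values):
    """Return the five summary quantities used in Tables 1 and 2."""
    lo, hi = min(values), max(values)
    return {
        "average": statistics.mean(values),
        "minimum": lo,
        "maximum": hi,
        "median": statistics.median(values),
        "min_max_difference": lo - hi,  # minimum minus maximum, so never positive
    }


# Hypothetical values for one descriptor over four chemicals.
sample = [8.0, 9.0, 10.0, 9.5]
for name, value in summarize(sample).items():
    print(name, value)
```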
To double-check how much the values vary across all the descriptors, I took the data from each
descriptor and sorted the values from least to greatest. As I suspected, when the difference
between the minimum and maximum was large, the data usually did spread over a wide range and
did not give a clear value for that descriptor.
When plotting the descriptors with a large minimum and maximum difference, I found it hard to
distinguish where the chemicals would be usable or un-usable because the values were
scattered. For descriptors 2, 3, and 17, it was easy to distinguish whether the chemical was
usable or un-usable. The plot below displays the data for just descriptors 2, 3, and 17.
[Figure: "Usable vs. Un-usable Chemical data". x-axis: Descriptors (0 to 20); y-axis: Descriptor Values (-40 to 40). Series: X2, X3, and X17, each for usable and non-usable chemicals.]
Figure 1: Displays the locations of the usable and unusable data
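One way to check the separation numerically is to test whether the usable and un-usable ranges of a descriptor overlap. The sketch below uses the minimum and maximum values from Tables 1 and 2; the helper `ranges_overlap` is a hypothetical name, and comparing only the extreme values is an assumption of this sketch, since a single outlier can make two otherwise well-separated groups overlap.

```python
def ranges_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (minimum, maximum) per descriptor, copied from Tables 1 and 2.
usable = {2: (8.006, 9.998), 3: (-34.884, -30.335), 17: (3.03, 12.889)}
unusable = {2: (-11.948, -10.005), 3: (25.319, 46.287), 17: (-22.886, 12.526)}

for d in (2, 3, 17):
    print("descriptor", d, "ranges overlap:", ranges_overlap(usable[d], unusable[d]))
```

With these values the ranges for descriptors 2 and 3 do not overlap, while the ranges for descriptor 17 do, because the un-usable maximum of 12.526 falls inside the usable range even though the un-usable median is -17.128.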
Organizing the data gave me a feel for where the important data points are and where the linear
function will need to look in order to classify the chemicals. It should be noted that this
method is a good start, but it would be better to have all the data in the plot in order to see
the ranges and the error spacing needed to correctly predict as many usable chemicals as
possible.
LP Development
The LP that I have based my model on, as written in the model file in the appendix, is
$$\max \sum_{i=1}^{n} \epsilon_i \quad \text{subject to} \quad y_i\Big(\sum_{j=1}^{m} w_j x_{ij} + w_0\Big) \ge 1 + \epsilon_i, \quad -u \le \epsilon_i \le u, \quad i = 1, \dots, n,$$
where $n$ is the number of chemicals and $m$ is the number of descriptors.
In the model I created using the AMPL software, two parameters represent the number of
chemicals and the number of descriptors in the training set data. A parameter called TrData
sets up a matrix of chemicals by descriptors. It has one more column than there are
descriptors, so that the first column, which states whether or not the chemical is usable, can
be skipped when reading the descriptor values. I have also set parameters for the chemicals and
descriptors in the testing data. The epsilon variable is the spacing error for each chemical,
and the w variable holds the values fitted to the descriptor data: w[0] is the constant term
and w[j] multiplies descriptor j. The spacing error defines a region of interest: if the data
falls into the region then the chemical is considered usable, otherwise un-usable. The smaller
the spacing error, the smaller the region becomes, and the more likely it is that the linear
function will not detect that a chemical is usable.
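To make the role of the constraints in LPBasedClass.mod concrete, the Python sketch below takes a fixed set of weights and computes, for each chemical, the largest epsilon allowed by `y[i]*(sum_j x[i,j]*w[j] + w[0]) >= 1 + epsilon[i]` together with the bounds `-u <= epsilon[i] <= u`. It does not solve the LP, which also has to choose the weights; the function name `best_epsilons` and the one-descriptor data are hypothetical.

```python
def best_epsilons(X, y, w, w0, u):
    """For fixed weights, return the largest feasible epsilon per chemical.

    Returns None if some chemical cannot satisfy the constraint even with
    epsilon at its lower bound -u, meaning these weights are infeasible.
    """
    eps = []
    for xi, yi in zip(X, y):
        margin = yi * (w0 + sum(wj * xj for wj, xj in zip(w, xi)))
        e = min(u, margin - 1.0)  # constraint gives e <= margin - 1, bound gives e <= u
        if e < -u:
            return None
        eps.append(e)
    return eps


# Hypothetical data: three chemicals, one descriptor each.
X = [[2.0], [-3.0], [0.5]]
y = [1, -1, 1]
eps = best_epsilons(X, y, w=[1.0], w0=0.0, u=1.0)
print(eps, sum(eps))  # [1.0, 1.0, -0.5] with objective value 1.5
print(best_epsilons(X, y, w=[1.0], w0=0.0, u=0.25))  # None: third chemical is infeasible
```

A larger u lets poorly separated chemicals stay feasible with a negative epsilon, which is why the choice of u changes the fitted weights.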
I then varied the delta and u parameters and took a sample for each combination of values.
Both delta and u ranged from 1 to 5.
The AMPL code is available in the appendix under the section "LPBasedClass.mod".
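The counting step in the AMPL session in the appendix labels a chemical 1 when its linear score is at least delta and -1 otherwise, then counts how many labels match the known ones. A Python sketch of that step is given below; `count_correct` is a hypothetical name, and the scores and labels are made-up values, not project data.

```python
def count_correct(scores, labels, delta):
    """Count chemicals whose thresholded score matches the known label."""
    predictions = [1 if s >= delta else -1 for s in scores]
    return sum(1 for p, t in zip(predictions, labels) if p == t)


# Hypothetical linear scores and true labels for five chemicals.
scores = [7.2, 3.1, -4.0, 5.5, 0.9]
labels = [1, 1, -1, -1, -1]

for delta in range(1, 6):
    print("delta =", delta, "correct =", count_correct(scores, labels, delta))
```

In this sketch only delta changes. In the project, changing u also requires solving the LP again, because u changes the fitted weights and therefore the scores.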
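Results/Conclusion

Number of correct predictions in the training set data (rows: Delta values; columns: U values)

| Delta \ U | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 50 | 62 | 50 | 61 | 66 |
| 2 | 50 | 65 | 50 | 50 | 65 |
| 3 | 50 | 64 | 68 | 50 | 50 |
| 4 | 50 | 50 | 66 | 68 | 50 |
| 5 | 50 | 50 | 56 | 50 | 50 |

Table 3: The number of correct predictions in the training set data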
I first obtained the number of correct predictions in the training set data. From the results,
I noticed that when the u value and the delta value were both set to 3, the LP was able to
predict the classification correctly 68 of 90 times. The second-highest counts, 66, occurred
when the U values were 3 and 5 and the delta values were 4 and 1 respectively.
Table 3 displays the data.
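Number of correct predictions in the test set data (rows: Delta values; columns: U values)

| Delta \ U | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 1 | 4 | 4 | 4 | 4 | 4 |
| 2 | 4 | 5 | 4 | 4 | 4 |
| 3 | 4 | 5 | 5 | 4 | 4 |
| 4 | 4 | 4 | 5 | 4 | 4 |
| 5 | 4 | 4 | 5 | 4 | 4 |

Table 4: The number of correct predictions in the test set data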
Next I looked at the number of correct predictions in the test set data. Using the information
from Table 3, I am now primarily looking at the case when delta and u were both set to 3, and
the cases when the U values were 3 and 5 and the delta values were 4 and 1 respectively.
Table 4 displays the data.
When delta and u are both set to 3, I find that the LP was able to predict the chemical
classification correctly 5 out of 10 times. When the U value was 3 and the delta value was 4,
the LP was able to predict 4 out of 10 correctly. And when the U value was 5 and the delta
value was 1, the LP was able to predict 4 out of 10 correctly.
Comparing the two tables, it is clear that setting u to 3 and delta to 3 gives the most
accurate classification model. With those values, the LP was able to predict 68 out of 90
correctly for the training set and 5 out of 10 correctly for the testing file.
So in conclusion, in order to obtain the most correct predictions, I would need to set u to 3
and delta to 3.
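The comparison of the two tables can also be written as a short selection step. The Python sketch below copies the counts from Tables 3 and 4 and picks the (delta, u) pair with the most correct training predictions. Table 3 also lists 68 for delta = 4, u = 4, so using the test-set count as a tie-breaker is an assumption of this sketch rather than a rule stated in the project. The name `best_setting` is hypothetical.

```python
# Rows are delta = 1..5 and columns are u = 1..5, copied from Tables 3 and 4.
train_rows = [
    [50, 62, 50, 61, 66],
    [50, 65, 50, 50, 65],
    [50, 64, 68, 50, 50],
    [50, 50, 66, 68, 50],
    [50, 50, 56, 50, 50],
]
test_rows = [
    [4, 4, 4, 4, 4],
    [4, 5, 4, 4, 4],
    [4, 5, 5, 4, 4],
    [4, 4, 5, 4, 4],
    [4, 4, 5, 4, 4],
]


def to_grid(rows):
    """Map (delta, u) to the count stored at that row and column."""
    return {(d + 1, u + 1): rows[d][u] for d in range(5) for u in range(5)}


def best_setting(train, test):
    """Pick the (delta, u) with the highest training count, then test count."""
    return max(train, key=lambda k: (train[k], test[k]))


train, test = to_grid(train_rows), to_grid(test_rows)
best = best_setting(train, test)
print("delta, u =", best, "train:", train[best], "of 90, test:", test[best], "of 10")
```

With the table values this selects delta = 3 and u = 3, with 68 of 90 on the training set and 5 of 10 on the test set.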
Appendix
% Usable chemicals (rows 2-41): plot positions (x...) and values (y...) for descriptors 2, 3, 17
x2u = xlsread('TS.xlsx', 'B2:B41');
y2u = xlsread('TS.xlsx', 'C2:C41');
x3u = xlsread('TS.xlsx', 'D2:D41');
y3u = xlsread('TS.xlsx', 'E2:E41');
x17u = xlsread('TS.xlsx', 'F2:F41');
y17u = xlsread('TS.xlsx', 'G2:G41');
% Un-usable chemicals (rows 45-95)
x2un = xlsread('TS.xlsx', 'B45:B95');
y2un = xlsread('TS.xlsx', 'C45:C95');
x3un = xlsread('TS.xlsx', 'D45:D95');
y3un = xlsread('TS.xlsx', 'E45:E95');
x17un = xlsread('TS.xlsx', 'F45:F95');
y17un = xlsread('TS.xlsx', 'G45:G95');
% Test chemicals (rows 99-108)
x2test = xlsread('TS.xlsx', 'B99:B108');
y2test = xlsread('TS.xlsx', 'C99:C108');
x3test = xlsread('TS.xlsx', 'D99:D108');
y3test = xlsread('TS.xlsx', 'E99:E108');
x17test = xlsread('TS.xlsx', 'F99:F108');
y17test = xlsread('TS.xlsx', 'G99:G108');
% Green = usable, blue = un-usable, red = test
plot (x2u, y2u, 'go', x3u, y3u, 'go', x17u, y17u, 'go',...
x2un, y2un, 'bo', x3un, y3un, 'bo', x17un, y17un, 'bo',...
x2test, y2test, 'ro', x3test, y3test, 'ro', x17test, y17test, 'ro')
xlabel('Descriptors')
ylabel('Descriptor Values')
title ('Testing Fit')
grid on
axis ([0 20 -40 40])
% Classify the 10 test chemicals (rows 99-108) using the usable ranges of
% descriptors 2 and 17 from Table 1. The vector y replaces the separate
% variables y1..y10: y(k) = 1 means usable, y(k) = -1 means un-usable.
chemx2 = xlsread('TS.xlsx', 'Sheet1', 'C99:C108');    % descriptor 2 values
chemx17 = xlsread('TS.xlsx', 'Sheet1', 'G99:G108');   % descriptor 17 values
inRange = (chemx2 >= 8.006) & (chemx2 <= 9.998) & ...
          (chemx17 >= 3.03) & (chemx17 <= 12.889);
y = -ones(size(chemx2));
y(inRange) = 1;
disp(y)
LPBasedClass.mod
Model File
param chemicals;      # Number of chemicals
param descriptors;    # Number of descriptors
# Read text file; descriptors plus 1 to remove identifications
param TrData {1..chemicals, 1..descriptors+1};
# Assign data in training file to parameters (x: descriptor values, y: labels)
param x{i in 1..chemicals, j in 1..descriptors} := TrData [i, j+1];
param y{i in 1..chemicals} := TrData [i, 1];
param u;
param delta;
param Test_chem;      # Test data chemicals
param Test_desc;      # Test data descriptors
# Read test file; descriptors plus 1 to remove identifications
param Test_Data{1..Test_chem, 1..Test_desc+1};
# Assign data in test file to parameters (Testx: labels, Testy: descriptor values)
param Testx{i in 1..Test_chem} := Test_Data [i,1];
param Testy{i in 1..Test_chem, j in 1..Test_desc} := Test_Data [i, j+1];
var epsilon{1..chemicals};    # Spacing error for the chemical data
var w{0..descriptors};        # Values fitted to the descriptor data: w[0] and w[j]
var prefn{1..chemicals};      # Predicted classification of the training chemicals
var TrainingUsable;           # Number of correct predictions in the training set
var TestUsable;               # Number of correct predictions in the testing set
var Test_Func {1..Test_chem}; # Predicted classification of the test chemicals
maximize Sum_Epsilon: sum{i in 1..chemicals} epsilon[i];
subject to Constraint{i in 1..chemicals}: y[i]*(sum{j in 1..descriptors}x[i, j]*w[j]+w[0]) >= 1 + epsilon[i];
subject to UpperBound {i in 1..chemicals}: epsilon[i] <= u;
subject to LowerBound {i in 1..chemicals}: epsilon[i] >= -u;
Command Window
The following session shows only one sample. The code for the rest of the data is the same
except for the values of the delta and u parameters.
sw: ampl
ampl: option solver cplex;
ampl: model C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\LPBasedClass.mod.txt;
ampl: let chemicals := 90;
ampl: let descriptors := 40;
ampl: let u := 6;
ampl: let Test_chem :=10;
ampl: let Test_desc := 40;
ampl: read {i in 1..chemicals, j in 1..descriptors+1} TrData[i,j] < C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\TrainingSet.txt;
ampl: close C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\TrainingSet.txt;
ampl: read {i in 1..Test_chem, j in 1..Test_desc+1} Test_Data[i,j] < C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\Testing.txt;
ampl: close C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\Testing.txt;
ampl: solve;
CPLEX 12.5.1.0: optimal solution; objective 190.1754694
158 dual simplex iterations (0 in phase I)
ampl: let delta := 6;
ampl: let {i in 1..chemicals} prefn [i] := (if sum{j in 1..descriptors}x[i,j]*w[j]+w[0] >= delta then 1 else -1);
ampl: let TrainingUsable := sum{i in 1..chemicals} (if prefn[i] = y[i] then 1 else 0);
ampl: display TrainingUsable;
TrainingUsable = 69
ampl: let {i in 1..Test_chem} Test_Func [i] := (if sum{j in 1..descriptors}Testy[i,j]*w[j]+w[0] >= delta then 1 else -1);
ampl: let TestUsable := sum{i in 1..Test_chem} (if Test_Func [i] = Testx[i] then 1 else 0);
ampl: display TestUsable;
TestUsable = 4