Wendy Testaburger
Math 364
December 5, 2013

An LP-based classifier

Introduction

This project requires the student to determine whether or not a chemical is usable for development of a particular drug. Each chemical has 40 descriptors that describe its characteristics. The students were given 90 different chemicals with different descriptor values, and they will use the data to fit a linear function that can predict whether or not a chemical is usable. The linear function to be fitted is $f(x) = \sum_{j=1}^{40} w_j x_j + w_0$, where $x_j$ is descriptor $j$ of the chemical. The student must then use a linear programming model to find the $w_j$ and $w_0$ that give the best fit, so that the linear function can determine whether a chemical is usable or not.

Data Organization

The first thing I did with the training set data was use Excel to sort the data into usable and un-usable chemicals. For each descriptor I took the average, minimum, maximum, and median over the usable chemicals and over the un-usable chemicals separately. For the usable data set I then took the difference between the minimum and the maximum of each descriptor. All but descriptors 2, 3, and 17 had differences of about 90 or more. This led me to believe that for such a descriptor, say descriptor 1, the values are too spread out to support a strong argument that a chemical is usable because of that descriptor. I then took the minimum and maximum differences for the un-usable chemicals. The results were the same: only descriptors 2, 3, and 17 had fairly constant values.
The tables below provide the values:

Usable Chemicals
                      Descriptor 2   Descriptor 3   Descriptor 17
Average                    9.04125       -32.7916        8.090725
Minimum                    8.006         -34.884         3.03
Maximum                    9.998         -30.335        12.889
Median                     8.9875        -32.81          8.407
Min/Max Difference        -1.992          -4.549        -9.859

Table 1: Usable Chemical Information

Un-Usable Chemicals
                      Descriptor 2   Descriptor 3   Descriptor 17
Average                  -10.8772        28.02896      -17.28
Minimum                  -11.948         25.319        -22.886
Maximum                  -10.005         46.287         12.526
Median                   -10.8515        27.649        -17.128
Min/Max Difference        -1.943        -20.968        -35.412

Table 2: Un-Usable Chemical Information

To double check how much the values vary within each descriptor, I took the data for all the descriptors and sorted the values from least to greatest. As I suspected, when the difference between the minimum and maximum was large, the data usually did spread over a wide range and did not give a clear value for that descriptor. When plotting the descriptors with a large minimum and maximum difference, I found it hard to distinguish the usable chemicals from the un-usable ones because the values were scattered. For descriptors 2, 3, and 17, it was easy to distinguish whether a chemical was usable or un-usable. The plot below displays the data for just descriptors 2, 3, and 17.

[Plot "Usable vs. Un-usable Chemical data": x-axis "Descriptors", y-axis "Descriptor Values"; series X2, X3, and X17 for the usable and non-usable chemicals]

Figure 1: Displays the locations of the usable and unusable data

Organizing the data gave me a feel for where the important data points are and where the linear function will need to look in order to classify the chemicals. It should be noted that this method is a good start, but it would be better to have all the data in the plot in order to see the ranges and the error spacing needed to predict as many usable chemicals as possible.
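The per-class summary described above (average, minimum, maximum, median, and minimum minus maximum for each descriptor) was done in Excel. The same computation could be sketched in Python as follows. This is only a minimal sketch, not the workflow actually used: the function name `summarize` and the toy rows are hypothetical, and each row is assumed to hold the label (1 for usable, -1 for un-usable) followed by the descriptor values.

```python
from statistics import mean, median


def summarize(rows):
    """Group rows of [label, d1, d2, ...] by label and summarize each descriptor."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row[0], []).append(row[1:])

    summary = {}
    for label, group in by_label.items():
        stats = []
        for column in zip(*group):  # one tuple of values per descriptor
            lo, hi = min(column), max(column)
            stats.append({
                "average": mean(column),
                "minimum": lo,
                "maximum": hi,
                "median": median(column),
                "difference": lo - hi,  # minimum minus maximum, as in Tables 1 and 2
            })
        summary[label] = stats
    return summary


# Hypothetical toy data: label first, then two descriptor values.
toy_rows = [
    [1, 8.0, -50.0],
    [1, 9.0, 40.0],
    [1, 10.0, 10.0],
    [-1, -11.0, 5.0],
    [-1, -10.0, -60.0],
]
result = summarize(toy_rows)
for label, stats in result.items():
    for j, s in enumerate(stats, start=1):
        print(label, j, s)
```

In the toy data the first descriptor has a small difference within the usable class while the second is widely spread, which is the pattern used above to single out descriptors 2, 3, and 17.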
LP Development

The LP on which I based my model, written out as it is implemented in the AMPL model in the appendix, is

$$\max \sum_{i=1}^{n} \epsilon_i \quad \text{subject to} \quad y_i \Big( \sum_{j=1}^{m} x_{ij} w_j + w_0 \Big) \ge 1 + \epsilon_i, \quad -u \le \epsilon_i \le u, \quad i = 1, \dots, n,$$

where $n$ is the number of chemicals, $m$ is the number of descriptors, $x_{ij}$ is descriptor $j$ of chemical $i$, and $y_i$ is the classification of chemical $i$.

In the model I created using the AMPL software, I have two parameters that represent the number of chemicals and the number of descriptors in the training set data. Then I have a parameter called TrData that sets up a matrix for the chemicals and the descriptors. The matrix has one more column than there are descriptors, because the first column states whether or not the chemical is usable; the descriptors are read starting from the second column. I have also set parameters for the chemicals and descriptors in the testing data. The variable epsilon is the spacing error for each chemical, and the variables w[j] are the coefficients applied to the descriptors, with w[0] as the constant term. The spacing error defines a region of interest: if the data falls into the region then the chemical is considered usable, otherwise unusable. The smaller the spacing error, the smaller the region becomes, and the less likely the linear function is to detect that a chemical is usable. The u parameter bounds each epsilon between -u and u, and the delta parameter is the threshold that the fitted function must reach for a chemical to be predicted usable. I then varied the delta and u parameters and took a sample for each combination. Both delta and u ranged from 1 to 5. The AMPL code is available in the appendix under the section "LPBasedClass.mod".

Results/Conclusion

Number of correct predictions in the training set data

                        U values
Delta values     1     2     3     4     5
     1          50    62    50    61    66
     2          50    65    50    50    65
     3          50    64    68    50    50
     4          50    50    66    68    50
     5          50    50    56    50    50

Table 3: The number of correct predictions in the training set data

I first obtained the number of correct predictions in the training set data. From the results, I noticed that when the u value was increased to 3 and the delta value was increased to 3 as well, the LP was able to predict the classification correctly 68 of 90 times.
The second highest were when the U values were at 3 and 5, and Delta values were 4 and 1 respectively. Table 3 displays the appropriate data.

Number of correct predictions in the test set data

                        U values
Delta values     1     2     3     4     5
     1           4     4     4     4     4
     2           4     5     4     4     4
     3           4     5     5     4     4
     4           4     4     5     4     4
     5           4     4     5     4     4

Table 4: The number of correct predictions in the test set data

Next I looked at the number of correct predictions in the test set data. Using the information from Table 3, I am now primarily looking at the case where the delta and u variables were both set to 3, and the cases where the U values were at 3 and 5 and the Delta values were 4 and 1 respectively. Table 4 displays the appropriate data. When the delta and u variables are both set to 3, the LP was able to predict the chemical classification correctly 5 out of 10 times. When the U value was at 3 and the delta value was at 4, the LP was able to predict 4 out of 10 correctly. And when the U value was set to 5 and the delta value was set to 1, the LP was able to predict 4 out of 10 correctly.

Comparing the two tables, it is clear that setting the variable u to 3 and the delta to 3 gives the most accurate classification model. With those values, the LP was able to predict 68 out of 90 correctly for the training set and 5 out of 10 correctly for the test set. So in conclusion, in order to obtain the most correct predictions, I would need to set the u variable to 3 and the delta variable to 3.
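The comparison of Tables 3 and 4 can be written down as a small selection rule. The sketch below is one possible way to formalize it, not a procedure stated in this report: it picks the grid cell with the most correct training predictions and, as an assumed tie-break, prefers the cell with more correct test predictions. The names `TRAIN`, `TEST`, and `best_cell` are hypothetical; the counts are copied from the two tables and indexed by table row and column.

```python
# Counts copied from Table 3 (training set, out of 90), one list per table row.
TRAIN = [
    [50, 62, 50, 61, 66],
    [50, 65, 50, 50, 65],
    [50, 64, 68, 50, 50],
    [50, 50, 66, 68, 50],
    [50, 50, 56, 50, 50],
]

# Counts copied from Table 4 (test set, out of 10), one list per table row.
TEST = [
    [4, 4, 4, 4, 4],
    [4, 5, 4, 4, 4],
    [4, 5, 5, 4, 4],
    [4, 4, 5, 4, 4],
    [4, 4, 5, 4, 4],
]


def best_cell(train, test):
    """Return the 1-based (row, column) maximizing (training count, test count)."""
    cells = [(r, c) for r in range(len(train)) for c in range(len(train[0]))]
    r, c = max(cells, key=lambda rc: (train[rc[0]][rc[1]], test[rc[0]][rc[1]]))
    return r + 1, c + 1


print(best_cell(TRAIN, TEST))
```

With these counts the rule returns row 3, column 3. Because that cell lies on the diagonal, it reads as u = 3 and delta = 3 whichever axis is taken as u.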
Appendix

x2u = xlsread('TS.xlsx', 'B2:B41');
y2u = xlsread('TS.xlsx', 'C2:C41');
x3u = xlsread('TS.xlsx', 'D2:D41');
y3u = xlsread('TS.xlsx', 'E2:E41');
x17u = xlsread('TS.xlsx', 'F2:F41');
y17u = xlsread('TS.xlsx', 'G2:G41');
x2un = xlsread('TS.xlsx', 'B45:B95');
y2un = xlsread('TS.xlsx', 'C45:C95');
x3un = xlsread('TS.xlsx', 'D45:D95');
y3un = xlsread('TS.xlsx', 'E45:E95');
x17un = xlsread('TS.xlsx', 'F45:F95');
y17un = xlsread('TS.xlsx', 'G45:G95');
x2test = xlsread('TS.xlsx', 'B99:B108');
y2test = xlsread('TS.xlsx', 'C99:C108');
x3test = xlsread('TS.xlsx', 'D99:D108');
y3test = xlsread('TS.xlsx', 'E99:E108');
x17test = xlsread('TS.xlsx', 'F99:F108');
y17test = xlsread('TS.xlsx', 'G99:G108');

plot (x2u, y2u, 'go', x3u, y3u, 'go', x17u, y17u, 'go',...
    x2un, y2un, 'bo', x3un, y3un, 'bo', x17un, y17un, 'bo',...
    x2test, y2test, 'ro', x3test, y3test, 'ro', x17test, y17test, 'ro')
xlabel('Descriptors')
ylabel('Descriptor Values')
title ('Testing Fit')
grid on
axis ([0 20 -40 40])

chem1x2 = xlsread('TS.xlsx','Sheet1','C99');
chem1x17 = xlsread('TS.xlsx','Sheet1','G99');
chem2x2 = xlsread('TS.xlsx','Sheet1','C100');
chem2x17 = xlsread('TS.xlsx','Sheet1','G100');
chem3x2 = xlsread('TS.xlsx','Sheet1','C101');
chem3x17 = xlsread('TS.xlsx','Sheet1','G101');
chem4x2 = xlsread('TS.xlsx','Sheet1','C102');
chem4x17 = xlsread('TS.xlsx','Sheet1','G102');
chem5x2 = xlsread('TS.xlsx','Sheet1','C103');
chem5x17 = xlsread('TS.xlsx','Sheet1','G103');
chem6x2 = xlsread('TS.xlsx','Sheet1','C104');
chem6x17 = xlsread('TS.xlsx','Sheet1','G104');
chem7x2 = xlsread('TS.xlsx','Sheet1','C105');
chem7x17 = xlsread('TS.xlsx','Sheet1','G105');
chem8x2 = xlsread('TS.xlsx','Sheet1','C106');
chem8x17 = xlsread('TS.xlsx','Sheet1','G106');
chem9x2 = xlsread('TS.xlsx','Sheet1','C107');
chem9x17 = xlsread('TS.xlsx','Sheet1','G107');
chem10x2 = xlsread('TS.xlsx','Sheet1','C108');
chem10x17 = xlsread('TS.xlsx','Sheet1','G108');

% A test chemical is marked usable (1) when descriptors 2 and 17 both fall
% inside the minimum/maximum range of the usable chemicals in Table 1.
if ((chem1x2>=8.006)&(chem1x2<=9.998))&...
        ((chem1x17>=3.03)&(chem1x17<=12.889))
    y1 = 1
else
    y1 = -1
end
if ((chem2x2>=8.006)&(chem2x2<=9.998))&...
        ((chem2x17>=3.03)&(chem2x17<=12.889))
    y2 = 1
else
    y2 = -1
end
if ((chem3x2>=8.006)&(chem3x2<=9.998))&...
        ((chem3x17>=3.03)&(chem3x17<=12.889))
    y3 = 1
else
    y3 = -1
end
if ((chem4x2>=8.006)&(chem4x2<=9.998))&...
        ((chem4x17>=3.03)&(chem4x17<=12.889))
    y4 = 1
else
    y4 = -1
end
if ((chem5x2>=8.006)&(chem5x2<=9.998))&...
        ((chem5x17>=3.03)&(chem5x17<=12.889))
    y5 = 1
else
    y5 = -1
end
if ((chem6x2>=8.006)&(chem6x2<=9.998))&...
        ((chem6x17>=3.03)&(chem6x17<=12.889))
    y6 = 1
else
    y6 = -1
end
if ((chem7x2>=8.006)&(chem7x2<=9.998))&...
        ((chem7x17>=3.03)&(chem7x17<=12.889))
    y7 = 1
else
    y7 = -1
end
if ((chem8x2>=8.006)&(chem8x2<=9.998))&...
        ((chem8x17>=3.03)&(chem8x17<=12.889))
    y8 = 1
else
    y8 = -1
end
if ((chem9x2>=8.006)&(chem9x2<=9.998))&...
        ((chem9x17>=3.03)&(chem9x17<=12.889))
    y9 = 1
else
    y9 = -1
end
if ((chem10x2>=8.006)&(chem10x2<=9.998))&...
        ((chem10x17>=3.03)&(chem10x17<=12.889))
    y10 = 1
else
    y10 = -1
end

LPBasedClass.mod

Model File

param chemicals;      #Number of chemicals
param descriptors;    #Number of descriptors
param TrData {1..chemicals, 1..descriptors+1};    #Read Text file and descriptors plus 1 to remove identifications
param x{i in 1..chemicals, j in 1..descriptors} := TrData [i, j+1];    #Assign data in training file to parameter
param y{i in 1..chemicals} := TrData [i, 1];    #Assign data in training file to parameter
param u;
param delta;
param Test_chem;    #Test Data chemicals
param Test_desc;    #Test Data Descriptors
param Test_Data{1..Test_chem, 1..Test_desc+1};    #Read Test file and descriptors plus 1 to remove identifications
param Testx{i in 1..Test_chem} := Test_Data [i,1];    #Assign data in test file to parameter
param Testy{i in 1..Test_chem, j in 1..Test_desc} := Test_Data [i, j+1];    #Assign data in test file to parameter
var epsilon{1..chemicals};    #Spacing error for the chemical data
var w{0..descriptors};    #Coefficients of the linear function: constant term w[0] and descriptor weights w[j]
var prefn{1..chemicals};    #Predicted classification of the training chemicals
var TrainingUsable;    #Number of correct predictions in the training set
var TestUsable;    #Number of correct predictions in the testing set
var Test_Func {1..Test_chem};    #Predicted classification of the test chemicals

maximize Sum_Epsilon: sum{i in 1..chemicals} epsilon[i];
subject to Constraint{i in 1..chemicals}:
    y[i]*(sum{j in 1..descriptors}x[i, j]*w[j]+w[0]) >= 1 + epsilon[i];
subject to UpperBound {i in 1..chemicals}: epsilon[i] <= u;
subject to LowerBound {i in 1..chemicals}: epsilon[i] >= -u;

Command Window

The following session shows only one sample. The code is essentially the same for the rest of the data; the only things that change are the delta and u values.

sw: ampl
ampl: option solver cplex;
ampl: model C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\LPBasedClass.mod.txt;
ampl: let chemicals := 90;
ampl: let descriptors := 40;
ampl: let u := 6;
ampl: let Test_chem :=10;
ampl: let Test_desc := 40;
ampl: read {i in 1..chemicals, j in 1..descriptors+1} TrData[i,j] < C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\TrainingSet.txt;
ampl: close C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\TrainingSet.txt;
ampl: read {i in 1..Test_chem, j in 1..Test_desc+1} Test_Data[i,j] < C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\Testing.txt;
ampl: close C:\Users\wendy2\Documents\amplcml\MODELS\PROJECT\Testing.txt;
ampl: solve;
CPLEX 12.5.1.0: optimal solution; objective 190.1754694
158 dual simplex iterations (0 in phase I)
ampl: let delta := 6;
ampl: let {i in 1..chemicals} prefn [i] := (if sum{j in 1..descriptors}x[i,j]*w[j]+w[0] >= delta then 1 else -1);
ampl: let TrainingUsable := sum{i in 1..chemicals} (if prefn[i] = y[i] then 1 else 0);
ampl: display TrainingUsable;
TrainingUsable = 69
ampl: let {i in 1..Test_chem} Test_Func [i] := (if sum{j in 1..descriptors}Testy[i,j]*w[j]+w[0] >= delta then 1 else -1);
ampl: let TestUsable := sum{i in 1..Test_chem} (if Test_Func [i] = Testx[i] then 1 else 0);
ampl: display TestUsable;
TestUsable = 4
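
The post-solve `let` statements in the command window turn the fitted linear function into class predictions and count how many match the known classifications. A minimal Python sketch of that step is given below. It assumes the weights have already been obtained from the LP; the function names `predict` and `count_correct` and all of the toy numbers are hypothetical and are not taken from the project data.

```python
def predict(x, w, w0, delta):
    """Return 1 if the linear function reaches the threshold delta, else -1."""
    score = sum(xj * wj for xj, wj in zip(x, w)) + w0
    return 1 if score >= delta else -1


def count_correct(X, y, w, w0, delta):
    """Count the rows of X whose predicted class equals the known class in y."""
    return sum(1 for xi, yi in zip(X, y) if predict(xi, w, w0, delta) == yi)


# Hypothetical toy example with two descriptors per chemical.
w = [1.0, -0.5]   # assumed weights, standing in for w[1..descriptors]
w0 = 0.5          # assumed constant term, standing in for w[0]
X = [[4.0, 1.0], [0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
y = [1, -1, -1, 1]

for delta in (1, 3):
    print(delta, count_correct(X, y, w, w0, delta))
```

In this toy example the count changes with delta even though the weights stay fixed, which is why each (u, delta) pair in Tables 3 and 4 can give a different number of correct predictions.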