Readme

SecretP v.2.1 identifies different types of Gram-negative bacterial secreted proteins in two steps.
------------------------------------------------------------------------------------------------------
The first step transforms protein sequences into numerical vectors that serve as the inputs of the SVM models. The transformation is based on PseAA and is implemented by the Matlab program PseAA.m.

Usage: PseAA (inputfile, lg, q)
-inputfile is the file of primary sequences;
-lg is the maximum distance between two considered amino acid residues;
-q is the number of physicochemical properties of amino acids;
For example: PseAA ('input.txt', 5, 5)

In this way, a 45-D vector is obtained for each protein sequence and listed in the Farrayfile.

Here, the multi-class problem is solved with the "one-versus-one" strategy, which builds one SVM model for each pair of secretory types, fifteen models in total. To satisfy the format required by libSVM, the two types of secreted proteins in each training data set are labeled by adding "1" and "-1" at the start of their vectors, respectively. The proteins in the test data sets are labeled by adding "0" at the start of their vectors; the label "0" carries no meaning and is only there to satisfy the libSVM format.

For example, the training sets of T1SPs and T2SPs in this paper are labeled by the following program:

Here, index12file is a random matrix used to shuffle the order of the data. All labeled vectors are listed in train12.txt and used as the input for libSVM.
------------------------------------------------------------------------------------------------------
The second step runs the libSVM programs. The prediction result is written to a file named resultfile. Open a DOS window and run the following commands.
---------------------------------
First, scale the vectors in the training data set to [-1, 1] with svm-scale.exe.
Usage: svm-scale -l -1 -u 1 -s range inputfile > inputfile.scale
-l is the lower limit (default -1);
-u is the upper limit (default +1);
-s is the file in which the scaling parameters are saved;
-inputfile is train12.txt here;
For example: svm-scale -l -1 -u 1 -s range12 train12.txt > train12.scale
---------------------------------
Second, optimize the SVM parameters with grid.py.
Usage: grid.py train12.scale
---------------------------------
Third, train the SVM model with svm-train.exe. Note that the data sets of T1SPs and T2SPs are unbalanced, so different weights are assigned to them. The numbers of the two types of secreted proteins are 112 and 99, respectively. Their ratio is approximately 11:10, and the least common multiple of 11 and 10 is 110. The weights are therefore 10 and 11 (110/11 and 110/10), which are inversely proportional to the ratio.
Usage: svm-train -c value1 -g value2 -wi valuei inputfile.scale modelfile
-c is the optimal regularization parameter C;
-g is the optimal kernel width parameter γ;
-wi is the weight for the i-th type of proteins;
For example: svm-train -c 8 -g 0.125 -w1 10 -w-1 11 train12.scale model12
---------------------------------
Then, scale the vectors in the test sets with svm-scale.exe.
Usage: svm-scale -r range inputfile > inputfile.scale
-r is the file from which the scaling parameters are restored;
-inputfile is test.txt here;
---------------------------------
In the end, predict the test data sets with svm-predict.exe.
Usage: svm-predict inputfile.scale modelfile resultfile
------------------------------------------------------------------------------------------------------
Finally, the prediction result is listed in the file resultfile.
------------------------------------------------------------------------------------------------------
A query protein is submitted to all fifteen SVM models, and each model gives a prediction result.
Then, the number of times each secretory type is predicted is counted, and the protein is assigned to the secretory type with the highest score (5 times at most, since each type takes part in five of the fifteen pairwise models). If two secretory types get the same score (4 times), the numbers of proteins secreted via the two secretion systems in the training set are compared. Considering the data imbalance, the protein is assigned to the secretory type with the smaller number of training proteins.
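
The voting rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the SecretP distribution: the function name assign_secretory_type, the labels type3 to type5, and the idea of collecting the fifteen winners in a list are assumptions made for the example. The training-set sizes 112 (T1SPs) and 99 (T2SPs) are the ones given in this readme. Ties among more than two types are resolved the same way, which the readme does not state explicitly.

```python
from collections import Counter


def assign_secretory_type(pairwise_winners, training_counts):
    """Pick a secretory type from the winners of the pairwise SVM models.

    pairwise_winners: list with the winning type of each pairwise model
                      (hypothetical input format, one entry per model).
    training_counts:  dict mapping a type to its number of training proteins;
                      it only needs entries for types that can end up tied.
    """
    votes = Counter(pairwise_winners)
    top_score = max(votes.values())
    tied = [t for t, n in votes.items() if n == top_score]
    if len(tied) == 1:
        return tied[0]
    # Tie: prefer the type with the smaller training set (data imbalance).
    return min(tied, key=lambda t: training_counts[t])


# Training-set sizes taken from this readme.
counts = {"T1SP": 112, "T2SP": 99}

# Hypothetical outcome of the fifteen models: T1SP and T2SP tie at 4 votes.
winners = ["T1SP"] * 4 + ["T2SP"] * 4 + ["type3"] * 3 + ["type4"] * 2 + ["type5"] * 2
print(assign_secretory_type(winners, counts))
```

In this made-up case both T1SP and T2SP receive four votes, so the rule falls back to the training-set sizes and selects the type with fewer training proteins.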
