Readme

SecretP v.2.1 identifies different types of Gram-negative bacterial secreted
proteins in two steps.
-------------------------------------------------------------------------------------------------------
The first step is to transform protein sequences into numerical vectors that serve as
the inputs of the SVM models. The sequence representation is based on PseAA and is
implemented in the Matlab program PseAA.m.
Usage: PseAA (inputfile, lg, q),
-inputfile is the file of primary sequences;
-lg is the maximum distance between two considered amino acid residues;
-q is the number of physicochemical properties of amino acids;
For example: PseAA('input.txt', 5, 5)
In this way, a 45-D vector is obtained for each protein sequence, and the vectors are
listed in the Farrayfile.
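The 45 dimensions in the example are consistent with 20 amino acid composition terms plus lg x q sequence-order terms (20 + 5 x 5). This decomposition is an assumption inferred from the example, not taken from PseAA.m, and the function name below is hypothetical:

```python
def pseaa_dimension(lg, q, n_amino_acids=20):
    """Assumed vector length: composition terms plus lg * q sequence-order terms."""
    return n_amino_acids + lg * q


# Matches the example call PseAA('input.txt', 5, 5)
print(pseaa_dimension(5, 5))  # 45
```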
Here, we use the "one-versus-one" strategy to solve the multi-class problem, and build
fifteen SVM models, one for each pair of secretion types. To satisfy the format
required by libSVM, the two types of secreted proteins in each training data set are
labeled by adding "1" and "-1", respectively, at the start of their vectors. The
proteins in the test data sets are labeled by adding "0" at the start of their vectors;
the label "0" has no meaning and only serves to satisfy the libSVM format.
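Fifteen pairwise models correspond to six secretion types, since 6 x 5 / 2 = 15. A minimal sketch of the pairing follows; only T1SP and T2SP are named in this README, so the remaining labels are hypothetical placeholders:

```python
from itertools import combinations

# Only T1SP and T2SP appear in this README; the other four names are placeholders.
secretion_types = ["T1SP", "T2SP", "type3", "type4", "type5", "type6"]

# One binary SVM per unordered pair of types
pairs = list(combinations(secretion_types, 2))
print(len(pairs))  # 15
print(pairs[0])    # ('T1SP', 'T2SP')
```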
For example, the training sets of T1SPs and T2SPs in this paper are labeled by the
following program:
Here, index12file is a random matrix used to shuffle the order of the data. All labeled
vectors are listed in train12.txt, which is used as the input for libSVM.
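The labeling and shuffling can be sketched in Python as below. This is not the original Matlab program: the function names are hypothetical, and a seeded random shuffle stands in for index12file. Each output line follows the libSVM text format, "label index:value ...", with 1-based feature indices.

```python
import random


def to_libsvm_line(label, vector):
    """Format one vector as 'label 1:v1 2:v2 ...' (1-based feature indices)."""
    features = " ".join(f"{i}:{v:g}" for i, v in enumerate(vector, start=1))
    return f"{label} {features}"


def label_and_shuffle(first_type, second_type, seed=0):
    """Label the two classes 1 and -1, then shuffle the order of the lines."""
    lines = [to_libsvm_line(1, v) for v in first_type]
    lines += [to_libsvm_line(-1, v) for v in second_type]
    random.Random(seed).shuffle(lines)  # stands in for index12file
    return lines


# Toy 2-D vectors for illustration; real vectors are 45-D.
lines = label_and_shuffle([[0.5, 0.25]], [[0.125, 0.75]])
# Writing these lines to train12.txt would give the libSVM training input.
```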
-------------------------------------------------------------------------------------------------------
The second step is to run the libSVM programs. The prediction result is
written to a file named resultfile. After opening a DOS window, proceed as follows.
---------------------------------
First, scale the vectors in the training data set to [-1, 1] with svm-scale.exe.
Usage: svm-scale -l -1 -u 1 -s range inputfile > inputfile.scale
-l is the lower limit (default -1);
-u is the upper limit (default +1);
-s is the file in which the scaling parameters are saved;
-inputfile is train12.txt here;
For example: svm-scale -l -1 -u 1 -s range12 train12.txt > train12.scale
---------------------------------
Second, optimize the SVM parameters with grid.py.
Usage: grid.py train12.scale
---------------------------------
Third, train the SVM model with svm-train.exe. Note that the data sets of
T1SPs and T2SPs are unbalanced, so different weights are assigned to them. The
numbers of the two types of secreted proteins are 112 and 99, respectively. Their
ratio is approximately 11:10, and the least common multiple of 11 and 10 is 110. The
weights are therefore 10 and 11, which are inversely proportional to the ratio.
Usage: svm-train -c value1 -g value2 -wi valuei inputfile.scale modelfile
-c is the optimal regularization parameter C;
-g is the optimal kernel width parameter γ;
-wi is the weight for the i-th type of proteins;
For example: svm-train -c 8 -g 0.125 -w1 10 -w-1 11 train12.scale model12
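The weight arithmetic above can be checked with a short sketch; the function name is hypothetical, and the rounding of 112:99 to 11:10 is taken from the text rather than computed:

```python
from math import gcd


def inverse_ratio_weights(ratio_a, ratio_b):
    """Return weights inversely proportional to the ratio ratio_a:ratio_b."""
    lcm = ratio_a * ratio_b // gcd(ratio_a, ratio_b)
    return lcm // ratio_a, lcm // ratio_b


n_t1sp, n_t2sp = 112, 99
print(round(n_t1sp / n_t2sp, 2))      # 1.13, close to 11/10 = 1.1
print(inverse_ratio_weights(11, 10))  # (10, 11), used as -w1 10 -w-1 11
```

The larger class (T1SPs, labeled 1) gets the smaller weight, so errors on the smaller class are penalized slightly more.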
---------------------------------
Then, scale the vectors in the test sets with svm-scale.exe, using the scaling
parameters saved from the training set.
Usage: svm-scale -r range inputfile > inputfile.scale
-r is the file from which the scaling parameters are restored;
-inputfile is test.txt here;
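The reason for restoring the training ranges instead of rescaling the test set on its own is that both sets must go through the same linear map. The sketch below illustrates that idea in Python; it is not the svm-scale implementation, the function names are hypothetical, and mapping constant features to 0 is a simplification made here.

```python
def fit_ranges(rows):
    """Per-feature (min, max) computed from the training vectors only."""
    return [(min(col), max(col)) for col in zip(*rows)]


def scale_rows(rows, ranges, lower=-1.0, upper=1.0):
    """Map each feature linearly so the training min/max land on lower/upper."""
    scaled = []
    for row in rows:
        out = []
        for value, (lo, hi) in zip(row, ranges):
            if hi == lo:
                out.append(0.0)  # constant feature: simplification in this sketch
            else:
                out.append(lower + (upper - lower) * (value - lo) / (hi - lo))
        scaled.append(out)
    return scaled


train = [[0.0, 10.0], [2.0, 20.0]]
ranges = fit_ranges(train)                       # analogous to the saved range file
train_scaled = scale_rows(train, ranges)
test_scaled = scale_rows([[1.0, 25.0]], ranges)  # reuses the training ranges
```

A test value outside the training range, such as 25.0 here, is mapped outside [-1, 1]; that is expected when the training ranges are reused.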
---------------------------------
Last, predict the test data sets with svm-predict.exe.
Usage: svm-predict inputfile.scale modelfile resultfile
-------------------------------------------------------------------------------------------------------
Finally, the prediction result is listed in the file: resultfile.
-------------------------------------------------------------------------------------------------------
A query protein should be submitted to all fifteen SVM models, each of which
gives a prediction result. The number of times each secretory type is predicted is
then counted, and the protein is assigned to the secretory type with the highest
score (5 times). If two secretory types get the same score (4 times), the
numbers of proteins released via the two secretion systems in the training set are
compared. To account for the data imbalance, the protein is assigned to the
secretory type with the smaller number.
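The voting rule can be sketched as follows. The function name and the placeholder type labels are hypothetical; the training-set sizes 112 and 99 are the T1SP and T2SP counts given earlier in this README.

```python
from collections import Counter


def assign_type(pairwise_winners, training_counts):
    """Majority vote over the pairwise predictions; ties go to the smaller class."""
    votes = Counter(pairwise_winners)
    best = max(votes.values())
    tied = [t for t, n in votes.items() if n == best]
    if len(tied) == 1:
        return tied[0]
    # Tie: prefer the type with fewer training proteins
    return min(tied, key=lambda t: training_counts[t])


training_counts = {"T1SP": 112, "T2SP": 99}

# Fifteen pairwise predictions with T1SP and T2SP tied at 4 votes each
winners = (["T1SP"] * 4 + ["T2SP"] * 4 + ["type3"] * 3
           + ["type4"] * 2 + ["type5"] + ["type6"])
print(assign_type(winners, training_counts))  # T2SP
```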