Supplementary Material - Springer Static Content Server

Supplementary Material
Performance measures
Performance of the HBC-Evo system is reported using the following quality measures:
accuracy (Acc), sensitivity (Sn), specificity (Sp), and F-score. These measures, together with
the precision (Prc) used in the F-score, are defined as:
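\[ \mathrm{Sn} = \frac{TP}{TP + FN} \tag{S1} \]
\[ \mathrm{Sp} = \frac{TN}{TN + FP} \tag{S2} \]
\[ \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{S3} \]
\[ \mathrm{Prc} = \frac{TP}{TP + FP} \tag{S4} \]
\[ \mathrm{Fscore} = 2\,\frac{\mathrm{Prc} \cdot \mathrm{Sn}}{\mathrm{Prc} + \mathrm{Sn}} \tag{S5} \]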
where, TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.
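As a minimal sketch, the measures in Eqs. (S1)-(S5) can be computed directly from the four confusion-matrix counts. The function name `quality_measures` and the helper `_ratio` are illustrative, and returning 0.0 when a denominator is zero is an assumption that the text does not specify.

```python
def _ratio(numerator, denominator):
    # Assumption: return 0.0 when the denominator is zero (not specified in the text).
    return numerator / denominator if denominator else 0.0


def quality_measures(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc, Prc and F-score from confusion-matrix counts."""
    sn = _ratio(tp, tp + fn)                    # sensitivity, Eq. (S1)
    sp = _ratio(tn, tn + fp)                    # specificity, Eq. (S2)
    acc = _ratio(tp + tn, tp + tn + fp + fn)    # accuracy, Eq. (S3)
    prc = _ratio(tp, tp + fp)                   # precision, Eq. (S4)
    fscore = 2 * _ratio(prc * sn, prc + sn)     # F-score, Eq. (S5)
    return {"Sn": sn, "Sp": sp, "Acc": acc, "Prc": prc, "Fscore": fscore}


if __name__ == "__main__":
    # Toy counts, chosen only for illustration.
    print(quality_measures(tp=90, tn=80, fp=20, fn=10))
```

The values are returned as fractions in [0, 1]; multiplying by 100 gives percentages like those reported in the supplementary tables.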
Yule's Q-statistic is used to measure the diversity among the individual classifiers. Such
diversity is expected to enhance the prediction performance of the proposed ensemble. For two
base classifiers $C_i$ and $C_j$, the statistic is defined as
$Q_{i,j} = \frac{ad - bc}{ad + bc}$, where a and d are the scores for the cases in which both
$C_i$ and $C_j$ predict correctly and both predict incorrectly, respectively; b is the score
when $C_i$ is correct and $C_j$ is incorrect; and c is the score when $C_j$ is correct and
$C_i$ is incorrect. This measure is related to the distance measure that finds the normalized
difference between the agreement and disagreement of the two classifiers. Q values lie in the
range [-1, 1]. The closer the value of Q is to zero, the more independent the two classifiers
are. A positive value of Q indicates that the two classifiers tend to agree on the same
decisions, whereas a negative value indicates that they tend to commit errors on different
objects. To generalize this pairwise diversity measure to an entire ensemble, we averaged the
Q values over all pairs of individual classifiers. For m base classifiers, the averaged Q over
all pairs of classifiers is calculated as:
\[ Q_{avg} = \frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{k=i+1}^{m} Q_{i,k} \tag{S6} \]
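The pairwise statistic and its ensemble average in Eq. (S6) can be sketched as follows. Here each classifier is represented by a hypothetical list of per-sample flags (1 if the prediction was correct, 0 otherwise), and a, b, c, and d are taken as sample counts. Returning 0.0 when ad + bc = 0 is an assumption of this sketch, not something the text specifies.

```python
from itertools import combinations


def q_statistic(correct_i, correct_j):
    """Yule's Q for two classifiers given per-sample correctness flags."""
    a = b = c = d = 0
    for ci, cj in zip(correct_i, correct_j):
        if ci and cj:
            a += 1      # both classifiers correct
        elif ci:
            b += 1      # C_i correct, C_j incorrect
        elif cj:
            c += 1      # C_j correct, C_i incorrect
        else:
            d += 1      # both classifiers incorrect
    denominator = a * d + b * c
    if denominator == 0:
        return 0.0      # assumption: treat the undefined case as Q = 0
    return (a * d - b * c) / denominator


def average_q(correct_flags):
    """Average Q over all pairs (i, k) with i < k, as in Eq. (S6)."""
    m = len(correct_flags)
    total = sum(q_statistic(correct_flags[i], correct_flags[k])
                for i, k in combinations(range(m), 2))
    return 2.0 * total / (m * (m - 1))


if __name__ == "__main__":
    # Toy correctness flags for three hypothetical classifiers.
    c1 = [1, 1, 1, 0, 0, 1]
    c2 = [1, 1, 0, 1, 0, 1]
    print(q_statistic(c1, c2), average_q([c1, c2, c1]))
```

Because `combinations(range(m), 2)` enumerates exactly the m(m-1)/2 pairs with i < k, the factor 2/(m(m-1)) turns the sum into a plain mean over pairs.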
Supplementary Table S1 Single letter codes and numerical values of Hydrophobicity (Hd) and
Hydrophilicity (Hb) of twenty native amino acid molecules of protein
Sr. No.  Amino acid     Single letter code  Hd (Tanford 1962)  Hb (Hopp and Woods 1981)
1.       Alanine        A                    0.62              -0.5
2.       Cysteine       C                    0.29              -1
3.       Aspartic acid  D                   -0.9                3
4.       Glutamic acid  E                   -0.74               3.0
5.       Phenylalanine  F                    1.19              -2.5
6.       Glycine        G                    0.48               0
7.       Histidine      H                   -0.4               -0.5
8.       Isoleucine     I                    1.38              -1.8
9.       Lysine         K                   -1.5                3.0
10.      Leucine        L                    1.06              -1.8
11.      Methionine     M                    0.64              -1.3
12.      Asparagine     N                   -0.78               0.2
13.      Proline        P                    0.12               0
14.      Glutamine      Q                   -0.85               0.2
15.      Arginine       R                   -2.53               3.0
16.      Serine         S                   -0.18               0.3
17.      Threonine      T                   -0.05              -0.4
18.      Valine         V                    1.08              -1.5
19.      Tryptophan     W                    0.81              -3.4
20.      Tyrosine       Y                    0.26              -2.3
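For illustration only, the values of Supplementary Table S1 can be held in two lookup tables keyed by the single letter code. The helper `encode` and the example sequence are hypothetical. How these values enter the feature extraction is not described in this section, so the sketch only shows the lookup itself.

```python
# Hd values (Tanford 1962) from Supplementary Table S1.
HD = {
    "A": 0.62, "C": 0.29, "D": -0.9, "E": -0.74, "F": 1.19,
    "G": 0.48, "H": -0.4, "I": 1.38, "K": -1.5, "L": 1.06,
    "M": 0.64, "N": -0.78, "P": 0.12, "Q": -0.85, "R": -2.53,
    "S": -0.18, "T": -0.05, "V": 1.08, "W": 0.81, "Y": 0.26,
}

# Hb values (Hopp and Woods 1981) from Supplementary Table S1.
HB = {
    "A": -0.5, "C": -1.0, "D": 3.0, "E": 3.0, "F": -2.5,
    "G": 0.0, "H": -0.5, "I": -1.8, "K": 3.0, "L": -1.8,
    "M": -1.3, "N": 0.2, "P": 0.0, "Q": 0.2, "R": 3.0,
    "S": 0.3, "T": -0.4, "V": -1.5, "W": -3.4, "Y": -2.3,
}


def encode(sequence, scale):
    """Map each residue of a sequence to its value on the given scale."""
    # Raises KeyError for letters outside the twenty codes in Table S1.
    return [scale[residue] for residue in sequence.upper()]


if __name__ == "__main__":
    # Hypothetical short sequence, used only to show the lookup.
    print(encode("ACDK", HD))
    print(encode("ACDK", HB))
```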
[Figure: bar chart titled "Variation of amino acids in cancerous proteins". x-axis: amino acids by single letter code (A to Y); y-axis: percentage difference (-40 to 60); series: breast-cancer protein and cancer protein.]
Supplementary Fig. S1 Variation of amino acid molecules in the cancer and breast-cancer protein
sequence datasets relative to non-cancer protein sequences
We individually trained four diverse types of computational algorithms (NB, KNN, SVM, and RF)
in the feature spaces of AAC (60 dimensions), SAAC (60 dimensions), PseAAC-S (60 dimensions),
and PseAAC-P (60 dimensions). Detailed information on the NB, KNN, SVM, and RF algorithms is
available in the computational intelligence literature. The LIBSVM software is employed to
implement SVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm) (Chang and Lin 2011).
Supplementary Table S2 summarizes the parameters needed to generate optimal numerical
expressions. Several GP simulations were performed to adjust the parameters using the GPLAB3
toolbox, which can be found at http://gplab.sourceforge.net/download.html. Finally, these
numerical expressions are used to predict healthy and cancerous tissues.
Supplementary Table S2 Summary of GP parameter settings for the HBC-Evo system
GP Parameters                Values
Terminals/non-terminal set   Set of predictions of base-level classifiers,
                             $\hat{X} = \{X_1, X_2, X_3, X_4\}$, where $X_1$, $X_2$, $X_3$, and $X_4$
                             indicate the predicted values of the four base-level classifiers.
                             Parameters set: {+, -, ×, ÷, log, sin, cos, exp, power, constants},
                             where constants are random numbers in the range of [0, 1]
Fitness criterion            "AUC"
Selection method             "Generational"
Pop. size                    80
Gen.                         180 and 140 for C/NC and B/NBC datasets, respectively
Pop. Init.                   "Ramped half and half"
Max. Tree depth              30
Sampling                     "Tournament"
Expected offspring           "rank89"
Operators probabilities      Automatically adjustable, variable crossover/mutation ratio
Survival criterion           Keep the best candidate solution
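To make the settings in Supplementary Table S2 more concrete, the sketch below scores one candidate numerical expression over the four base-level predictions using AUC as the fitness criterion. The expression in `candidate` is hypothetical and is not the expression evolved by HBC-Evo. The protected division is an assumption of this sketch, and the AUC is computed by plain pairwise comparison rather than by any toolbox routine.

```python
import math


def protected_div(x, y):
    # Assumption: division returns 1.0 when the divisor is (near) zero.
    return x / y if abs(y) > 1e-12 else 1.0


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def candidate(x1, x2, x3, x4):
    # Hypothetical expression built from +, *, division and exp.
    return protected_div(x1 + x2 * x3, 1.0 + math.exp(-x4))


def fitness(expression, predictions, labels):
    """Evaluate an expression on each row (X1..X4) and score it by AUC."""
    return auc([expression(*row) for row in predictions], labels)


if __name__ == "__main__":
    # Toy base-level predictions (X1, X2, X3, X4) and class labels.
    X = [(0.9, 0.8, 0.7, 0.9), (0.2, 0.1, 0.3, 0.2),
         (0.6, 0.7, 0.8, 0.4), (0.4, 0.3, 0.1, 0.5)]
    y = [1, 0, 1, 0]
    print(fitness(candidate, X, y))
```

In a GP run, a fitness of this kind would be evaluated for every individual in the population at each generation. The pairwise AUC shown here is quadratic in the number of samples and is meant only for small illustrations.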
Supplementary Table S3 Performance of the base-level classifiers in terms of AUC, accuracy
(Acc), specificity (Sp), sensitivity (Sn), and F-score measures for C/NC and B/NBC datasets of
amino acid sequences
                          C/NC                                  B/NBC
Feature    Method   AUC    Acc    Sp     Sn     F-score   AUC    Acc    Sp     Sn     F-score
AAC        NB       96.59  90.76  82.06  99.40  91.50     97.98  93.82  88.96  98.70  94.14
           KNN      98.02  93.11  93.29  92.97  93.10     98.20  92.60  93.60  91.66  92.52
           SVM      98.29  96.99  95.50  98.50  97.10     98.30  94.01  91.31  96.69  94.18
           RF       99.33  96.76  96.78  96.78  96.78     99.01  96.45  97.30  95.60  96.42
SAAC       NB       97.82  88.98  77.90  100.0  90.10     98.70  93.46  86.92  100.0  93.85
           KNN      96.88  91.06  90.73  91.31  91.10     97.70  91.88  90.49  93.27  91.98
           SVM      98.92  96.63  95.28  98.02  96.70     98.89  94.10  88.20  100.0  94.42
           RF       99.40  96.84  97.20  96.40  96.79     98.96  96.40  96.80  96.03  96.39
PseAAC-S   NB       97.29  90.82  96.43  85.19  90.25     94.32  88.10  90.50  85.67  87.78
           KNN      98.57  95.41  96.51  94.32  95.36     98.82  94.90  94.98  94.85  94.90
           SVM      99.10  92.12  84.53  99.78  92.69     98.50  94.07  90.89  97.20  94.22
           RF       99.56  97.93  98.28  97.56  97.93     99.20  96.82  97.62  96.04  96.84
PseAAC-P   NB       97.47  93.14  86.21  100.0  93.58     98.21  94.31  90.17  98.49  94.57
           KNN      98.30  94.50  96.29  92.71  94.40     98.85  95.47  94.85  96.03  95.50
           SVM      99.10  96.20  93.20  99.20  96.29     98.66  94.31  89.81  98.84  94.58
           RF       99.49  96.92  97.70  96.20  96.90     99.20  97.10  96.89  97.33  97.10
Supplementary Table S4 Prediction performance of the proposed HBC-Evo in terms of
specificity (Sp), sensitivity (Sn), and F-score measures for C/NC and B/NBC datasets of amino
acid sequences
                         C/NC                     B/NBC
Proposed HBC-Evo   Sp     Sn     F-score    Sp     Sn     F-score
EnsAAC             98.60  99.10  98.85      98.73  96.91  97.80
EnsSAAC            98.60  99.20  98.89      97.65  97.74  97.69
EnsPseAAC-S        98.74  99.31  99.01      99.35  97.45  98.36
EnsPseAAC-P        97.33  99.41  98.39      98.20  98.39  98.30
Best GP individual complexity
For the C/NC dataset, Supplementary Fig. S2a shows the complexity of the best GP individual in
each generation against the number of generations for the various feature spaces. After 18
generations, the best-fit individual for the PseAAC-S space converges to the optimal point.
Supplementary Fig. S2b depicts the complexity of the best individual, in terms of number of
nodes and tree depth, against the number of generations. For the SAAC space, the average tree
depth is large. In contrast, PseAAC-S yields a relatively small average number of nodes and
tree depth compared with the other feature spaces.
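The two complexity measures discussed above, number of nodes and tree depth, can be sketched for an expression tree stored as nested tuples. The representation `(operator, child, ...)` and the example tree are hypothetical. Counting a single terminal as depth 1 is an assumed convention and may differ from the one used to produce the figure.

```python
def count_nodes(tree):
    """Number of nodes (operators and terminals) in an expression tree."""
    if not isinstance(tree, tuple):
        return 1                        # a terminal such as "X1" or a constant
    return 1 + sum(count_nodes(child) for child in tree[1:])


def tree_depth(tree):
    """Depth of the tree; a lone terminal has depth 1 (assumed convention)."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + max((tree_depth(child) for child in tree[1:]), default=0)


if __name__ == "__main__":
    # Hypothetical individual: X1 + X2 * sin(X4)
    individual = ("plus", "X1", ("times", "X2", ("sin", "X4")))
    print(count_nodes(individual), tree_depth(individual))
```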
Supplementary Fig. S2 For cancer dataset, complexity of the best GP individual in each
generation with respect to (a) fitness criterion (b) number of nodes and level of tree depth against
the number of generations
[Figure: expression tree of the best individual, $F_{PseAAC-P}(\hat{X})$]
Supplementary Fig. S3 For cancer dataset, tree of the best individual of the HBC-Evo in
PseAAC-P feature space
Supplementary References
Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Transactions
on Intelligent Systems and Technology 2 (3):27:1-27:27
Hopp TP, Woods KR (1981) Prediction of protein antigenic determinants from amino acid
sequences. Proceedings of the National Academy of Sciences 78 (6):3824-3828
Tanford C (1962) Contribution of hydrophobic interactions to the stability of the globular
conformation of proteins. Journal of the American Chemical Society 84 (22):4240-4247