Ricardo W. Pino Urias,a,b Stephen J. Barigye,c Yovani Marrero-Ponce,a,d,e,* César R. García-Jacas,a,f José R. Valdes-Martiní b and Facundo Perez-Gimenez d
a Unit of Computer-Aided Molecular “Biosilico” Discovery and Bioinformatic Research (CAMD-BIR International), Cartagena de Indias, Bolivar, Colombia.
b Faculty of Mathematics, Physics and Computation, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, 54830, Villa Clara, Cuba.
c Departamento de Química, Universidade Federal de Lavras, UFLA, Caixa Postal 3037, 37200-000 Lavras, MG, Brazil.
d Facultad de Farmacia, Universitat de València, Burjasot, 46100, València, Spain.
e Grupo de Investigación en Estudios Químicos y Biológicos, Facultad de Ciencias Básicas, Universidad Tecnológica de Bolívar, Cartagena de Indias, Bolívar, Colombia.
f Grupo de Investigación de Bioinformática, Centro de Estudio de Matemática Computacional (CEMC), Universidad de las Ciencias Informáticas, La Habana, Cuba.
* Corresponding author:
E-mail: ymarrero77@yahoo.es or ymponce@gmail.com
URL: http://www.uv.es/yoma/
CONTENTS
Theoretical background
SI1. Shannon Entropy (SE) and Scaled Shannon’s Entropy.
Shannon’s entropy
Recently, Godden and Bajorath have proposed an information-theoretic methodology for evaluating the relevance of Molecular Descriptors (MDs), using the concept of Shannon’s entropy. This approach evaluates how the MD values for a set of molecular structures are distributed over a defined number of discrete intervals (bins). MDs are derived from a diverse range of theories and mathematical procedures, so for a given dataset their values, whether discrete or continuous, span very different ranges. To achieve an unbiased comparison, it is crucial that the method employed does not depend on the type or the range of the MD values. For this reason, a regular discretization scheme that assigns an equal number of discrete intervals (bins) to each MD is proposed. Shannon’s entropy for a distribution $P(i)$ is defined as

$$H = -\sum_{i=1}^{N} p_i \log_2 p_i$$

where $p_i$ is the probability that a case c adopts a value within a specific data interval i ($i = 1, \dots, N$) for a given MD. Zero entropy means that all MD values for a given compound dataset fall in the same discrete interval, while maximum entropy is obtained when the MD values are evenly distributed among the bins.
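As a minimal sketch of the binning scheme described above, the following Python function distributes the values of one descriptor over a fixed number of equal-width bins and computes the entropy from the resulting bin frequencies. The function name `shannon_entropy`, the default of 10 bins, and the choice of equal-width bins over the observed minimum-maximum range are illustrative assumptions and are not taken from the original work.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins=10):
    """Shannon entropy (in bits) of values binned into n_bins equal-width intervals.

    Assumption: bins span the observed [min, max] range of the descriptor.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # every value falls in the same interval
    width = (hi - lo) / n_bins
    # Map each value to a bin index; the maximum value goes into the last bin.
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    # H = -sum_i p_i * log2(p_i); empty bins never appear in `counts`.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A descriptor whose values all coincide gives zero entropy, while values spread evenly over four bins give the maximum of log2(4) = 2 bits, in line with the two limiting cases stated in the text.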
Scaled Shannon’s Entropy (sSE)
Scaled Shannon’s Entropy, also known as standardized Shannon’s entropy, is a normalization of Shannon’s entropy with respect to the maximum SE value, and it is calculated as:

$$sSE = \frac{SE}{\log_2 n}$$

where n is the number of bins, so that $\log_2 n$ is the maximum SE value.
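The normalization can be illustrated with a short Python sketch that starts from the bin counts of a single descriptor. The function name `scaled_shannon_entropy` and the convention of returning 0 when fewer than two bins are given are assumptions made for this example only.

```python
import math


def scaled_shannon_entropy(bin_counts):
    """sSE = SE / log2(n), where n is the number of bins (len(bin_counts))."""
    n = len(bin_counts)
    total = sum(bin_counts)
    if n < 2 or total == 0:
        return 0.0  # assumed convention: scaling is undefined for n < 2
    # Empty bins contribute nothing to the entropy sum.
    probs = [c / total for c in bin_counts if c > 0]
    se = -sum(p * math.log2(p) for p in probs)
    return se / math.log2(n)
```

Because the result lies between 0 and 1 regardless of n, descriptors discretized with different numbers of bins can be compared on the same scale.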
SI2. The percentages of correct classification using combinations of supervised and unsupervised feature selection approaches, using KNN1 and SVM classifiers, respectively.
Methods
DSE-DGSE
DSE-DV
DSE-EDSE
DSE-gSE
DSE-rSE
DSE-SE
DSE-SVDE
GR-DGSE
GR-DV
GR-EDSE
GR-gSE
GR-rSE
GR-SE
GR-SVDE
IG-DGSE
IG-DV
IG-EDSE
IG-gSE
IG-rSE
IG-SE
IG-SVDE
JI-DGSE
JI-DV
JI-EDSE
JI-gSE
JI-rSE
JI-SE
JI-SVDE
MIDSE-DGSE
MIDSE-DV
MIDSE-EDSE
MIDSE-gSE
MIDSE-rSE
MIDSE-SE
MIDSE-SVDE
56
85
52
66
85
85
81
85
81
59
82
81
54
82
82
82
66
66
58
71
66
76
72
67
54
77
78
70
78
72
SVM                  KNN1
Correct  Incorrect   Correct  Incorrect
78       22          77       23
78       22          77       23
78       22          77       23
78       22          78       22
55       45          66       34
22
30
22
28
28
33
46
23
77
69
76
72
73
64
67
73
27
36
33
27
23
31
24
28
46
18
18
18
19
41
18
19
15
15
19
15
44
15
48
34
34
34
42
29
34
24
67
72
72
72
74
66
72
76
67
67
63
66
67
67
66
72
72
72
72
74
72
71
28
28
28
26
28
29
33
33
34
28
33
33
37
34
26
34
28
24
33
28
28
28
Methods   SVM Correct  SVM Incorrect  KNN1 Correct  KNN1 Incorrect
SU-DGSE   77           23             70            30
SU-DV     78           22             73            27
SU-EDSE   75           25             77            23
SU-gSE    68           32             66            34
SU-rSE    57           43             67            33
SU-SE     79           21             74            26
SU-SVDE   73           27             81            19
SI3. Matrix showing the percentages of correct classification for all pair-wise combinations for the 15 feature selection tools.
cluster  cluster measure
1   W-Relieff-DV
2   JI-DGSE
3   W-ChiSquare-DGSE
4   W-SU
5   W-IG
6   JI
7   W-OneR-DV
8   W-OneR-SVDE
9   W-IG-rSE
10  DSE-SVDE
11  SE
12  GR
13  SU-EDSE
14  MIDSE
15  DSE
1
2
3
W-Relieff-DV
JI-DGSE
W-ChiSquare-
DGSE
65
77
85
75
64
71
73
84
74
77
78
80
80
73
78
69
83
79
83
85
83
85
82
80
81
79
86
83
78
75
82
82
86
84
80
78
77
67
82
79
67
76
71
77
75
68
82
84
81
73
81
79
83
79
KNN1
80 80 73
84 81 73
86 84 80
83 86 85
86 81 81
85 81 67
83 80 78
78 79 71
84 79 73
87 79 70
82 79 73
79 77 67
85 83 80
80 77 72
84 81 74
SVM
83 76 74
85 82 85
84 83 80
82
83
82
84
77
78
77
78
76
77
81
74
81
76
83
80
78
82
77
68
82
77
72
74
78
81
73
74
73
84
79
73
78
74
75
75
72
70
74
75
74
70
73
74
78
79
71
77
78 75 64
82 79 67
82 81 79
87 82 79
79 79 77
70 73 67
76 77 78
74 72 70
73 77 72
69 70 76
70 74 71
76 71 67
76 78 77
70 72 74
80 71 75
71
76
86
78
77
77
75
76
74
74
76
85
83
80
75
80
84
79
74
80
81
77 79 69
80 77 79
81 76 79
82
84
80
78
81
78
71
75
75
78
82
77
78
80
84
81
74
77
73
71
83
72
74
75
72
77
75
78
70
80
77
72
78
76
73
78
82
81
81
6
7
4
5
W-SU
W-IG
JI
W-OneR-DV
8
9
W-OneR-SVDE
W-IG-rSE
10 DSE-SVDE
11 SE
12 GR
13 SU-EDSE
14 MIDSE
15 DSE
79
84
73
84
80
80
77
85
82
85
83
81
69
82
76
80
74
77
79
83
76
74
82
82
79
80
78
79
81
81
76
84
83
80
82
81
80 82 82
82 76 79
82 79 56
83 80 82
83 81 82
86 78 69
78 84 70
82 76 71
76 74 42
80 76 74
79 71 69
83 79 79
82
81
78
82
80
80
81
83
80
82
82
84
65
79
77
83
66
79
74
86
78
69
80
81
79
83
80
81
83
81
78
83
81
82
82
82
78 82 76
84 76 74
70 71 42
80 81 82
81 78 79
79 74 65
70 69 66
69 71 73
66 73 54
80 79 77
69 65 62
73 72 79
77
75
75
83
79
80
79
80
76
74
81
77
79
77
74
82
81
73
72
83
79
79
84
78
62
75
66
80
77
69
65
79
71
69
78
74