IMMAN: Free Software for Information Theory-based Chemometric Analysis SUPPORTING INFORMATION



Ricardo W. Pino Urias (a,b), Stephen J. Barigye (c), Yovani Marrero-Ponce (a,d,e,*), César R. García-Jacas (a,f), José R. Valdes-Martiní (b) and Facundo Perez-Gimenez (d)

(a) Unit of Computer-Aided Molecular “Biosilico” Discovery and Bioinformatic Research (CAMD-BIR International), Cartagena de Indias, Bolivar, Colombia.

(b) Faculty of Mathematics Physics and Computation, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, 54830, Villa Clara, Cuba.

(c) Departamento de Química, Universidade Federal de Lavras, UFLA, Caixa Postal 3037, 37200-000 Lavras, MG, Brazil.

(d) Facultad de Farmacia, Universitat de València, Burjasot, 46100, València, Spain.

(e) Grupo de Investigación en Estudios Químicos y Biológicos, Facultad de Ciencias Básicas, Universidad Tecnológica de Bolívar, Cartagena de Indias, Bolívar, Colombia.

(f) Grupo de Investigación de Bioinformática, Centro de Estudio de Matemática Computacional (CEMC), Universidad de las Ciencias Informáticas, La Habana, Cuba.

* Corresponding author:

E-mail : ymarrero77@yahoo.es or ymponce@gmail.com

URL : http://www.uv.es/yoma/

CONTENTS

Theoretical background

SI1. Shannon Entropy (SE) and Scaled Shannon’s Entropy.

Shannon’s entropy

Recently, Godden and Bajorath proposed an information-theoretic methodology for evaluating the relevance of Molecular Descriptors (MDs) using the concept of Shannon’s entropy. This approach evaluates the distribution of MD values for molecular structures over a defined number of discrete intervals (bins). However, MDs are derived from a diverse range of theories and mathematical procedures, so it is logical to find varying ranges of discrete and continuous MD values for a given dataset. To achieve an unbiased comparison, it is crucial that the method employed does not depend on the type or the range of the MD values. In this sense, a regular discretization scheme that assigns an equal number of discrete intervals (bins) to each MD is proposed. Shannon’s entropy for a distribution P(i) is defined as

$$H = -\sum_{i=1}^{N} p_i \log_2 p_i$$

where p_i is the probability that case c adopts a value within a specific data interval i for a given MD. Zero entropy means that all MD values for a given compound dataset fall in the same discrete interval, while maximum entropy is obtained when the MD values are evenly distributed among the bins.
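The definition above can be illustrated with a minimal Python sketch. It is not IMMAN's implementation: the function name `shannon_entropy` is hypothetical, and the choice of equal-width bins spanning the observed minimum and maximum of the descriptor is an assumption made here for illustration.

```python
import math
from collections import Counter


def shannon_entropy(values, n_bins):
    """Shannon entropy (in bits) of descriptor values placed in n_bins bins.

    Assumption: bins have equal width and span [min(values), max(values)].
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Every value falls in the same interval, so the entropy is zero.
        return 0.0
    width = (hi - lo) / n_bins
    # Map each value to a bin index; the maximum value goes in the last bin.
    counts = Counter(min(int((v - lo) / width), n_bins - 1) for v in values)
    total = len(values)
    # H = -sum(p_i * log2(p_i)) over the occupied bins (empty bins add 0).
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

With four bins, a descriptor whose values are spread evenly, such as `[0, 1, 2, 3]`, reaches the maximum of 2 bits, while a constant descriptor gives 0.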

Scaled Shannon’s Entropy (sSE)

Scaled Shannon’s Entropy, also known as standardized Shannon’s entropy, is a normalization of Shannon’s entropy with respect to the maximum attainable SE value. It is calculated as:

$$sSE = \frac{SE}{SE_{max}} = \frac{SE}{\log_2 n}$$

where n is the number of bins.
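As a hedged illustration of the scaling, the sketch below divides SE by log2 n, the largest entropy that n bins can yield. The function name `scaled_shannon_entropy` and the use of raw bin counts as input are assumptions made for this example, not part of the IMMAN software.

```python
import math


def scaled_shannon_entropy(bin_counts):
    """Scaled Shannon entropy from per-bin counts (one entry per bin)."""
    n = len(bin_counts)
    total = sum(bin_counts)
    if n < 2 or total == 0:
        # With fewer than two bins (or no data) the maximum entropy is zero,
        # so the ratio is undefined; this sketch returns 0.0 in that case.
        return 0.0
    # SE = -sum(p_i * log2(p_i)); empty bins contribute nothing.
    se = -sum((c / total) * math.log2(c / total) for c in bin_counts if c > 0)
    # Divide by the maximum attainable entropy, log2(n).
    return se / math.log2(n)
```

Because the result lies between 0 and 1 whatever the number of bins, descriptors discretized with different n can be compared on the same scale.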

SI2. Percentages of correct classification obtained with combinations of supervised and unsupervised feature selection approaches, using the KNN1 and SVM classifiers.

Methods

DSE-DGSE

DSE-DV

DSE-EDSE

DSE-gSE

DSE-rSE

DSE-SE

DSE-SVDE

GR-DGSE

GR-DV

GR-EDSE

GR-gSE

GR-rSE

GR-SE

GR-SVDE

IG-DGSE

IG-DV

IG-EDSE

IG-gSE

IG-rSE

IG-SE

IG-SVDE

JI-DGSE

JI-DV

JI-EDSE

JI-gSE

JI-rSE

JI-SE

JI-SVDE

MIDSE-DGSE

MIDSE-DV

MIDSE-EDSE

MIDSE-gSE

MIDSE-rSE

MIDSE-SE

MIDSE-SVDE

56

85

52

66

85

85

81

85

81

59

82

81

54

82

82

82

66

66

58

71

66

76

72

67

54

77

78

70

78

72

SVM KNN1

Correct Incorrect Correct Incorrect

78 22 77 23

78

78

78

55

22

22

22

45

77

77

78

66

23

23

22

34

22

30

22

28

28

33

46

23

77

69

76

72

73

64

67

73

27

36

33

27

23

31

24

28

46

18

18

18

19

41

18

19

15

15

19

15

44

15

48

34

34

34

42

29

34

24

67

72

72

72

74

66

72

76

67

67

63

66

67

67

66

72

72

72

72

74

72

71

28

28

28

26

28

29

33

33

34

28

33

33

37

34

26

34

28

24

33

28

28

28

Methods     SVM                   KNN1
            Correct  Incorrect    Correct  Incorrect
SU-DGSE     77       23           70       30
SU-DV       78       22           73       27
SU-EDSE     75       25           77       23
SU-gSE      68       32           66       34
SU-rSE      57       43           67       33
SU-SE       79       21           74       26
SU-SVDE     73       27           81       19

SI3. Matrix showing the percentages of correct classification for all pair-wise combinations for the 15 feature selection tools.

cluster 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 cluster measure

W-

Relieff-

DV

JI-

DGSE

W-

ChiSquare-

DGSE

W-

SU

W-

IG

JI

W-

OneR-

DV

W-

OneR-

SVDE

W-

IGrSE

DSE-

SVDE

SE GR

SU-

EDSE

MIDSE DSE

1

2

3

4

5

6

W-Relieff-DV

JI-DGSE

W-ChiSquare-

DGSE

W-SU

W-IG

JI

7

8

W-OneR-DV

W-OneR-SVDE

9 W-IG-rSE

10 DSE-SVDE

11 SE

12 GR

13 SU-EDSE

14 MIDSE

15 DSE

1

2

3

W-Relieff-DV

JI-DGSE

W-ChiSquare-

DGSE

65

77

85

75

64

71

73

84

74

77

78

80

80

73

78

69

83

79

83

85

83

85

82

80

81

79

86

83

78

75

82

82

86

84

80

78

77

67

82

79

67

76

71

77

75

68

82

84

81

73

81

79

83

79

KNN1

80 80 73

84 81 73

86 84 80

83 86 85

86 81 81

85 81 67

83 80 78

78 79 71

84 79 73

87 79 70

82 79 73

79 77 67

85 83 80

80 77 72

84 81 74

SVM

83 76 74

85 82 85

84 83 80

82

83

82

84

77

78

77

78

76

77

81

74

81

76

83

80

78

82

77

68

82

77

72

74

78

81

73

74

73

84

79

73

78

74

75

75

72

70

74

75

74

70

73

74

78

79

71

77

78 75 64

82 79 67

82 81 79

87 82 79

79 79 77

70 73 67

76 77 78

74 72 70

73 77 72

69 70 76

70 74 71

76 71 67

76 78 77

70 72 74

80 71 75

71

76

86

78

77

77

75

76

74

74

76

85

83

80

75

80

84

79

74

80

81

77 79 69

80 77 79

81 76 79

82

84

80

78

81

78

71

75

75

78

82

77

78

80

84

81

74

77

73

71

83

72

74

75

72

77

75

78

70

80

77

72

78

76

73

78

82

81

81

6

7

4

5

W-SU

W-IG

JI

W-OneR-DV

8

9

W-OneR-SVDE

W-IG-rSE

10 DSE-SVDE

11 SE

12 GR

13 SU-EDSE

14 MIDSE

15 DSE

79

84

73

84

80

80

77

85

82

85

83

81

69

82

76

80

74

77

79

83

76

74

82

82

79

80

78

79

81

81

76

84

83

80

82

81

80 82 82

82 76 79

82 79 56

83 80 82

83 81 82

86 78 69

78 84 70

82 76 71

76 74 42

80 76 74

79 71 69

83 79 79

82

81

78

82

80

80

81

83

80

82

82

84

65

79

77

83

66

79

74

86

78

69

80

81

79

83

80

81

83

81

78

83

81

82

82

82

78 82 76

84 76 74

70 71 42

80 81 82

81 78 79

79 74 65

70 69 66

69 71 73

66 73 54

80 79 77

69 65 62

73 72 79

77

75

75

83

79

80

79

80

76

74

81

77

79

77

74

82

81

73

72

83

79

79

84

78

62

75

66

80

77

69

65

79

71

69

78

74
