DEVELOPMENT OF QSAR MODELS FOR PREDICTING BIOLOGICAL
ACTIVITY OF CHEMICAL COMPOUNDS FROM NATURAL PRODUCTS AND
ITS APPLICATION IN DATABASE MINING
NENI FRIMAYANTI
A thesis submitted in fulfilment of the
requirements for the award of the degree of
Master of Science (Chemistry)
Faculty of Science
Universiti Teknologi Malaysia
AUGUST 2005
Specially Dedicated To
My Beloved Mama (Emzalia Farida)
Papa (Bahdar Johan, S.Pd) and My Sweet Sister
(Belinda Monalisa, S.Si)
ACKNOWLEDGEMENT
Vision, values and courage are the main gift of this thesis. I am grateful for
the inspiration and wisdom of many thoughts that have been instrumental in its
formulation.
First of all, I acknowledge and thank Allah SWT, the Omnipotent and
Omniscient, who created everything and gave me the ability to begin and
complete this project. I also wish to express my sincere appreciation to
my supervisor, Assoc. Prof. Dr. Mohamed Noor Hasan, for his guidance, advice,
motivation, criticism and friendship. Without his help, this thesis would not have
been the same as presented here.
I would like to thank Dr. Farediah Ahmad and uni Deni Susanti, M.Sc for
teaching me and explaining bioactive compounds. I would also like to thank
Dr. Fahrul Zaman Huyop and his research group for the many useful discussions and
help with the biological tests. I am also indebted to Universiti Teknologi Malaysia (UTM)
for providing the research grant for this project, entitled “Structure-Activity
Relationship Studies of Bioactive Compounds in Natural Products” (IRPA Vot 74089).
My sincere appreciation is also extended to abang Ir. Henry Nasution, MT
and om Afitra Jaya, SH for their help and kindness, which allowed me to pursue my
study here. Many thanks to all my beautiful sisters in S36 UTM and my best friend
Kiki; I cannot forget our closeness and friendship.
Last but certainly not least, I want to thank my mama, papa, my sweet sister
Mona, my lovely grandma (mak, tino, and nenek) and om Rahman and all of my big
family, for their affection, prayer and support throughout my study. I love you all.
ABSTRACT
Due to drug resistance problems, there is an urgent need to discover and
develop new anti bacterial and anti tuberculosis lead compounds. Quantitative
structure-activity relationship (QSAR) methodology has been used to develop
models that correlate the biological activity of chemicals derived from natural
products with their molecular structure. The approach started with the generation of
a series of descriptors from three-dimensional representations of the compounds in
the data set. In this study, the first data set consisted of 56 compounds isolated from
natural products with their minimum inhibition concentration (MIC, µg/mL) against
Escherichia coli. The second data set consisted of 122 plant terpenoids with
moderate to high activity against Mycobacterium tuberculosis. Genetic
algorithm-partial least squares (GAPLS) and multiple linear regression analysis (MLRA)
techniques were used in model development. The validated QSAR models
were applied to mine chemicals in a large database. The same set of descriptors
that appeared in the QSAR models was used in a chemical similarity search (based on
Euclidean distance) comparing active compounds of the training set with those in the
database. The selected compounds were short-listed by applying the applicability
domain criterion to reduce the number of candidates to be tested. Finally, the
biological activity of these compounds was determined experimentally using the disk
diffusion method to confirm their predicted MIC values.
ABSTRAK
The problem of drug ineffectiveness has created a need for the discovery and
development of anti bacterial and anti tuberculosis compounds. The quantitative
structure-activity relationship (QSAR) method was used to develop models that
relate the biological activity of chemical compounds from natural products to their
molecular structure. The method began with the generation of descriptors from
three-dimensional models of the compounds in the data set. In this study, the first
data set consisted of 56 compounds isolated from natural products, with their
minimum inhibition concentration (MIC, µg/mL) against Escherichia coli. The second
data set consisted of 122 plant terpenes with moderate to high activity against
Mycobacterium tuberculosis. Genetic algorithm-partial least squares (GAPLS) and
multiple linear regression analysis (MLRA) techniques were used to build the models.
The validated QSAR models were then used to search for chemical compounds in a
large database. The same set of descriptors as in the QSAR models was used to search
for similar chemical compounds (based on Euclidean distance) between the active
compounds of the data set and the compounds in the database. The selected compounds
were short-listed by applying the applicability domain criterion to reduce the number
of compounds to be tested. Finally, the biological activity of these compounds was
tested experimentally using the disk diffusion technique to confirm the predicted
MIC values.
TABLE OF CONTENTS

CHAPTER   TITLE

          DECLARATION
          DEDICATION
          ACKNOWLEDGEMENT
          ABSTRACT
          ABSTRAK
          TABLE OF CONTENTS
          LIST OF TABLES
          LIST OF FIGURES
          LIST OF SYMBOLS
          LIST OF ACRONYMS
          LIST OF APPENDICES

1         INTRODUCTION
          1.1   Introduction
          1.2   Quantitative Structure Activity Relationship (QSAR)
          1.3   History and Development of QSAR
                1.3.1 Data Set
                1.3.2 Descriptors
                      1.3.2.1 Topological Descriptors
                      1.3.2.2 Electronic Descriptors
                      1.3.2.3 Geometric Descriptors
          1.4   Feature Selection
                1.4.1 Genetic Algorithm (GA)
          1.5   Tools and Techniques of QSAR
                1.5.1 Multiple Linear Regression Analysis
                1.5.2 Partial Least Squares
          1.6   Applications of QSAR
          1.7   Overview of Multidrug Resistance Mycobacterium tuberculosis
                1.7.1 Mycobacterium tuberculosis
                1.7.2 How Does Tuberculosis Spread
          1.8   Minimum Inhibition Concentration (MIC)
                1.8.1 Escherichia coli
          1.9   Database Mining
          1.10  Research Scope
          1.11  Research Objectives
          1.12  Significance of Research
          1.13  Layout of the Thesis

2         RESEARCH METHODOLOGY
          2.1   Introduction
          2.2   Data Set
          2.3   Structure Entry and Molecular Modeling
          2.4   Descriptor Generation
          2.5   Feature Selection
                2.5.1 Objective Feature Selection
                2.5.2 Subjective Feature Selection
          2.6   Model Development
                2.6.1 Multiple Linear Regression Analysis
                2.6.2 Partial Least Squares
          2.7   Model Validation
          2.8   Application of QSAR Models to Database Mining
                2.8.1 Molecular Descriptors and Similarity Calculation
                2.8.2 Applicability Domain of QSAR Models
                2.8.3 Biological Activity Predicted Using QSAR Models
          2.9   Laboratory Testing
                2.9.1 Material and Method of Agar Diffusion

3         DEVELOPMENT OF QSAR MODELS AND DATABASE MINING FOR ANTI BACTERIAL AGENTS
          3.1   Introduction
          3.2   Selection of Descriptors and Feature Selection
          3.3   Model Development Using MLRA Method
          3.4   Model Development Using PLS Method
          3.5   Model Validation
          3.6   Application of QSAR Models to Database Mining
                3.6.1 Application of QSAR Models in AmicBase Database Mining (without Scaling)
                3.6.2 Application of QSAR Models in AmicBase Database Mining (with Scaling)
          3.7   Experimental Validation
          3.8   Effects of Range Scaling and Applicability Domain to Search New Agents

4         DEVELOPMENT OF QSAR MODELS AND DATABASE MINING FOR ANTI TUBERCULOSIS AGENTS
          4.1   Introduction
          4.2   Descriptors Generation and Objective Feature Selection
          4.3   Development of QSAR Models by using MLRA Method
          4.4   Development of QSAR Models by using PLS Technique
          4.5   Model Validation
          4.6   Application of QSAR Models to Database Mining
                4.6.1 Application of QSAR Models in AmicBase Database Mining (without Scaling)
                4.6.2 Application of QSAR Models in AmicBase Database Mining (with Scaling)
                4.6.3 Effects of Applicability Domain to Search New Agents
          4.7   Experimental Validation

5         CONCLUSIONS AND RECOMMENDATION
          5.1   Introduction
          5.2   Conclusion
          5.3   Limitation of the Study
          5.4   Future Research Recommendation

          REFERENCES
          APPENDIX A
          APPENDIX B
          APPENDIX C
LIST OF TABLES

TABLE NO.   TITLE

2.1    Type of descriptors in TSAR
3.1    List of selected descriptors and their statistical analysis
3.2    Correlation matrix of descriptors
3.3    Statistical output of MLRA model
3.4    Descriptors which were included in the QSAR model by using of MLRA
3.5    Statistical analysis of MLRA method
3.6    Statistical output of GA-PLS for each dimension
3.7    Statistical output of PLS model
3.8    Descriptors which were included in the QSAR model by using of PLS
3.9    Calculated MIC for compounds in the prediction set
3.10   List of probe compounds for database mining
3.11   Selected compounds with predicted MIC value
3.12   Selected compound with their biological activity predicted
3.13   MIC value of selected compounds (without scaling) using agar diffusion method
3.14   MIC value of selected compounds (with scaling) using agar diffusion
4.1    List of selected descriptors and their statistics analysis
4.2    Correlation matrix of descriptors
4.3    Statistical output of MLRA model
4.4    Descriptors which were included in the MLRA model
4.5    Statistical plot output of GA-PLS for each dimension
4.6    Statistic of the PLS model
4.7    Descriptors which were included in the PLS model
4.8    Calculated MIC for compounds in the prediction set
4.9    Selected compounds with their predicted anti tuberculosis activity
4.10   List of probe compounds for database mining
4.11   Selected compounds with their predicted MIC value
4.12   MIC value of selected compounds (without scaling) using agar diffusion method
4.13   MIC value of selected compounds (with scaling) using agar diffusion method
LIST OF FIGURES

FIGURE NO.   TITLE

1.1    The general QSAR problem
1.2    Flow diagram for the genetic algorithm (GA)
1.3    Illustration of the difference between PCR and PLS
1.4    Structure of E. coli
2.1    General QSAR methodology
2.2    Genetic algorithm process
2.3    Flowchart for the general model building process in QSAR studies
2.4    Flowchart of database mining that employs predictive QSAR models
3.1    Plot of experimental vs. predicted MIC for MLRA model
3.2    Plot of predicted value vs. standard residual for MLRA model
3.3    Plot PRESS vs. No. of component
3.4    Plot of experimental vs. predicted MIC for PLS model
3.5    Plot of predicted value vs. standard residual for PLS model
3.6    Flowchart to select new compounds in AmicBase database
3.7    Flowchart to select new compounds in AmicBase database
3.8    Inhibition zone of E. coli using (a) m-cresol and (b) eugenol methyl ether
3.9    Inhibition zone of E. coli using selective compounds
4.1    Plot of experimental value vs. predicted MIC for MLRA
4.2    Plot of predicted value vs. standard residual for MLRA model
4.3    Plot PRESS vs. No. of component
4.4    Plot of experimental vs. predicted MIC for PLS model
4.5    Plot of predicted value vs. standard residual for PLS model
4.6    Step to select new compounds against M. tuberculosis
4.7    Inhibition zone of active and inactive agents
LIST OF SYMBOLS

a, b, c, d                 - Regression coefficient
$\hat{b}$, $\tilde{b}$     - Regression vector
$\hat{\tilde{b}}$          - The estimate of $\tilde{b}$
$\hat{c}$                  - Activity of unknown compounds
DT                         - Applicability domain
Es                         - Steric component
ρ                          - Proportionality reaction constant
σ                          - Electronic properties of aromatic compounds, standard deviation of Euclidean distance
π                          - Hydrophobicity of substituents
px                         - Partition coefficients of derivative molecule
pH                         - Partition coefficients of parent molecule
r²                         - How closely equation fits the data
r² (CV)                    - Predictive power of the model
runk                       - Matrix of the known descriptor
χ                          - Molecular connectivity indices
$\bar{X}$                  - Mean value
y                          - Activity observed value
$\bar{y}$                  - Mean value, average Euclidean distance
$\hat{y}$                  - Predicted value
C                          - Concentration of molecule
D                          - Distance matrix
F                          - Degrees of freedom
R                          - Matrix of descriptor
RT                         - Pseudo-inverse of matrix descriptor
S                          - A diagonal matrix, standard error of the regression model
s.d                        - Standard deviation
U                          - Score matrix from PCA
V                          - Matrix containing the loading
W                          - Wiener index
Z                          - An arbitrary parameter to control the significance level
LIST OF ACRONYMS

BC3       - Benzo[c]quinolizin-3-ones
CADD      - Computer assisted drug design
CAMD      - Computer assisted molecular design
DAT       - Dopamine transporter
EC50      - Effect concentration
ED        - Euclidean distance
EDCs      - Endocrine disrupting chemicals
EIEC      - Enteroinvasive
EPEC      - Enteropathogenic
ETEC      - Enterotoxigenic
GA        - Genetic algorithm
GA-MLRA   - Genetic algorithm-multiple linear regression analysis
GAPLS     - Genetic algorithm partial least squares
GSA       - Genetic simulated annealing
HOMO      - Highest occupied molecular orbital
IC50      - Inhibition concentration
KNN       - K-nearest neighbor
LDA       - Linear discriminant analysis
LFER      - Linear free energy relationship
LUMO      - Lowest unoccupied molecular orbital
MDR       - Multi drug resistant
MIC       - Minimum inhibition concentration
MLRA      - Multiple linear regression analysis
MLR       - Multivariate linear regression
MRA       - Multiple regression analysis
NCI       - National Cancer Institute
PCA       - Principal component analysis
PCR       - Principal component regression
PLS       - Partial least squares
PRESS     - Predictive sum of squares
QSAR      - Quantitative structure activity relationship
QSPR      - Quantitative structure property relationship
RSS       - Residual sum of squares
TCH       - Thiophene-2-carboxylic acid hydrazide
SSR       - Sum of squares
SST       - Total sum of squares
VTEC      - Verotoxigenic
VOCs      - Volatile organic compounds
CHAPTER 1
INTRODUCTION
1.1 Introduction
Malaysia is rich in natural products of great chemical diversity. It is
estimated that about 12,000 species of plants are found in this country, and more
than 1000 of these species are said to have therapeutic properties [1]. Many of these
resources are still untapped, although a number of research groups have been actively
involved in systematically studying their chemical and biological properties. Some
of these compounds and their derivatives have been shown to have antibacterial
properties [2, 3]. For example, bioactive compounds can be obtained from the
families Rubiaceae, Verbenaceae, Zingiberaceae and Piperaceae.
Tuberculosis, mainly caused by Mycobacterium tuberculosis, is the leading
killer among all infectious diseases worldwide and is responsible for more than two
million deaths annually. The recent increase in the number of multi-drug resistant
clinical isolates of M. tuberculosis has created an urgent need for the discovery and
development of new anti tuberculosis lead compounds. It is expected that the
quantitative structure-activity relationship (QSAR) approach, which has been
successfully applied to study the factors that determine the chemical properties or
biological activities of chemical compounds, can be applied here [4].
In a typical structure-activity relationship study, one is interested in developing
models that can correlate the structural features of a series of chemical compounds
with their physicochemical properties or biological activities. These correlation
models can be used to predict the activity of new compounds as well as to form a
basis for understanding the factors affecting their activities [5, 6].
QSAR models are constructed by analyzing known or computed property
data and a series of numerical descriptors representing the structural characteristics.
Descriptors are quantitative properties that depend on the structure of the molecule.
Various physicochemical parameters, including thermodynamic properties (such as
system energies), electronic properties (e.g. the energies of the highest occupied
molecular orbital (HOMO) and the lowest unoccupied molecular orbital (LUMO)),
molecular shape (e.g. surface area, length-to-breadth ratio) and simple structural
characteristics (e.g. number of bonds, connectivity indices), have been used to obtain
robust models able to predict the biological activity of unknown molecules [6].
In this study, the structure-activity relationship approach described above was
implemented to develop models that correlate the structural features of
compounds isolated from plants with their anti bacterial activity. Good models
developed using this method were applied to screen a large chemical database.
The results of the screening can be used to select and to postulate the structures of
lead molecules that can be synthesized in the production of new drugs in the
pharmaceutical industry.
1.2 Quantitative Structure Activity Relationship (QSAR)
Drugs exert their biological effects by participating in a series of events
which include transport, binding with the receptor and metabolism to an inactive
species. Since the interaction mechanism between the molecule and the putative
receptor is unknown in most cases (i.e., no bound crystal structure), one is reduced
to making inferences from properties which can easily be obtained (molecular
properties and descriptors) to explain these interactions for unknown molecules.
Pharmaceutical companies need to continuously discover and develop
new drugs, particularly in the field of anti-infective agents, in order to fight the
increase of resistance to older drugs and newly discovered types of infection such as
mutated bacteria and viral infections. Both traditional and novel approaches are used
in drug discovery; they can be grouped into three categories [7]:
1. Random screening of a large number of compounds in search of desired
biological properties.
2. Structural modification of lead compounds, through the substitution, addition or
elimination of chemical groups.
3. Rational drug design, including different approaches and techniques, most of
them with an important computational component.
These approaches are not necessarily incompatible, and most companies try
to use new methods to accelerate the discovery of new compounds. QSAR is a
technique based on the reasonable premise that the biological activity of a compound
is a consequence of its molecular structure, provided we can identify those aspects of
molecular structure that are relevant to a particular biological activity.
QSAR is part of the chemometrics discipline and represents an attempt to
correlate structural or property descriptors of compounds with activities. It
encompasses a wide range of techniques, procedures and ideas, all of which
attempt to summarise chemical and biological information in a form that allows
one to generate and test hypotheses and so facilitates an understanding of
interactions between molecules. QSAR can also be described as the statistical
analysis of potential relationships between chemical structure and biological
activity.
The goal of a structure-activity relationship study is to analyse and detect the
factors that determine the measured activity for a particular system, in order to gain
insight into the mechanism and behaviour of the studied system. For this purpose,
the strategy is to generate mathematical models that correlate experimental
measurements with a set of chemical descriptors determined from the molecular
structure for a set of compounds.
The formulation of thousands of equations using QSAR methodology attests
to the validity of its concepts and its utility in elucidating the mechanism of
action of drugs at the molecular level and in reaching a more complete understanding
of physicochemical phenomena such as hydrophobicity. It is now possible not only to
develop models for a system but also to compare models from a biological database
and to draw analogies with models from a physical organic database.
1.3 History and Development of QSAR
More than a century ago, Crum-Brown and Fraser expressed the idea that the
physiological action of a substance was a function of its chemical composition and
constitution [8]. In 1863, Cros at the University of Strasbourg observed that the toxicity
of alcohols to mammals increased as the water solubility of the alcohols decreased, while
in the 1890s Hans Horst Meyer of the University of Marburg and Charles Ernest Overton
of the University of Zurich, working independently, noted that the toxicity of organic
compounds depended on their lipophilicity [9]. Based on biological experiments,
they correlated partition coefficients with anesthetic potencies. In addition, Overton
determined the effect of functional groups on the increase or decrease of partition
coefficients [10]. Afterwards, Lazarev in St. Petersburg continued where Overton
and Meyer left off, applying partition coefficients to the development of industrial
hygiene standards. Lazarev reported correlations on a log scale and developed a
system for estimating partition coefficients from chemical structure.
In 1893, Richet showed that the cytotoxicities of a diverse set of simple
organic molecules were inversely related to their corresponding water solubilities.
The earliest mathematical formulation is attributed to Ferguson, who in 1939
announced a principle for toxicity [8]. He observed that anesthetic potency
increased when ascending a homologous series of either n-alkanes or alkanols up to a
point where a loss of potency, or at least no further increase, occurred, using physical
properties such as solubility in water, distribution between phases, capillarity and
vapour pressure.
Little additional development of QSAR occurred until the work of Louis
Hammett (1937) in the field of organic chemistry, who observed that the addition
of substituents to the aromatic ring of benzoic acid had an orderly and quantitative
effect on the dissociation constant. He also correlated the electronic properties of
organic acids and bases with their equilibrium constants and reactivity. From
empirical observation, he derived the following linear relationship, the
so-called Hammett equation:
$$\log \frac{K}{K_0} = \rho\sigma \tag{1.1}$$
where K and K0 are the equilibrium constants of the substituted and unsubstituted
compounds, respectively. The slope ρ is a proportionality reaction constant pertaining
to a given equilibrium; it relates the effect of substituents on that equilibrium to
their effect on the benzoic acid equilibrium. σ is a parameter that describes the
electronic properties of aromatic substituents, i.e. their electron-donating or
electron-withdrawing power. Based on Hammett’s relationship, the electronic
properties were utilized as descriptors of structure [9].
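As a worked illustration of Equation 1.1, the sketch below derives a substituent constant from acid dissociation data and then uses it to predict a pKa. It assumes the usual convention that ρ = 1 for the ionization of benzoic acids in water at 25 °C, so that σ equals the pKa of benzoic acid minus the pKa of the substituted acid. The pKa values are illustrative inputs rather than data from this study, and the function names are hypothetical.

```python
def hammett_sigma(pka_parent: float, pka_substituted: float, rho: float = 1.0) -> float:
    """Substituent constant from Eq. 1.1, log(K/K0) = rho * sigma.

    Because pKa = -log10(K), log(K/K0) = pKa_parent - pKa_substituted.
    """
    return (pka_parent - pka_substituted) / rho


def predict_pka(pka_parent: float, sigma: float, rho: float) -> float:
    """Eq. 1.1 rearranged: pKa = pKa_parent - rho * sigma."""
    return pka_parent - rho * sigma


# Illustrative (assumed) pKa values in water at 25 C.
PKA_BENZOIC = 4.20
PKA_P_NITROBENZOIC = 3.44

sigma_p_no2 = hammett_sigma(PKA_BENZOIC, PKA_P_NITROBENZOIC)
print(f"sigma(p-NO2) = {sigma_p_no2:.2f}")
```

A positive σ, as obtained here, corresponds to an electron-withdrawing substituent that strengthens the acid.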
Taft devised a way of separating polar, steric and resonance effects and
introduced the first steric parameter, Es [11]. Working in the same direction,
Swain studied the effects of field and resonance. He investigated the variation in
reactivity of a given electrophilic substrate towards a series of nucleophilic reagents
[10].
Free and Wilson partitioned the molecule in a different manner from Hammett.
They postulated that the biological activity of a set of molecules can be related to the
addition of substituents, taking into account their number, type and position in the
parent skeleton [10].
In 1962 Hansch and Muir published their brilliant study on the
structure-activity relationship of plant growth regulators and their dependency on
Hammett constants and hydrophobicity. The parameter π, which is the relative
hydrophobicity of a substituent, was defined in a manner analogous to the definition
of sigma:
$$\pi_X = \log P_X - \log P_H \tag{1.2}$$
PX and PH represent the partition coefficients of the derivative and the parent
molecule, respectively. In 1964 Hansch and Fujita combined these hydrophobic
constants with Hammett’s electronic constants to yield the linear Hansch equation.
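Equation 1.2 can be checked with a short calculation. The sketch below computes π for a chloro substituent from the log P values of chlorobenzene and benzene; the two log P values are illustrative inputs assumed for this example, and the function name is hypothetical.

```python
def pi_substituent(log_p_derivative: float, log_p_parent: float) -> float:
    """Eq. 1.2: pi_X = log P_X - log P_H."""
    return log_p_derivative - log_p_parent


# Illustrative (assumed) octanol/water log P values.
LOG_P_BENZENE = 2.13        # parent molecule (H)
LOG_P_CHLOROBENZENE = 2.84  # derivative (X = Cl)

pi_cl = pi_substituent(LOG_P_CHLOROBENZENE, LOG_P_BENZENE)
print(f"pi(Cl) = {pi_cl:.2f}")
```

A positive π means the substituent makes the molecule more hydrophobic than the parent; a negative value means it is more hydrophilic.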
Hansch analysis is a powerful technique for optimizing the activity of
lead compounds. All physicochemical factors that relate to transport and receptor
interaction can be broken down into hydrophobic, electronic and steric components.
The correlation between these components and biological activity can be summarized
in an equation such as the following:
$$\log \frac{1}{C} = a\pi + b\sigma + cE_s + d \tag{1.3}$$
where C is the molar concentration of the compound; π, σ and Es are the hydrophobic,
electronic and steric components; and a, b, c and d are regression coefficients. The
combination of Hansch and Free-Wilson analysis in a mixed approach widens the
applicability of both QSAR methods.
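To make the role of the regression coefficients concrete, the following minimal sketch fits a, b, c and d of Equation 1.3 by ordinary least squares, solving the normal equations with Gaussian elimination. The substituent constants and activities are synthetic numbers invented for this illustration (the activities are generated from known coefficients so that the fit can be checked); the function names are hypothetical and are not taken from any QSAR package.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_hansch(pi, sigma, es, log_inv_c):
    """Least-squares [a, b, c, d] of Eq. 1.3 via (X^T X) beta = X^T y."""
    X = [[p, s, e, 1.0] for p, s, e in zip(pi, sigma, es)]
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(4)] for i in range(4)]
    Xty = [sum(row[i] * y for row, y in zip(X, log_inv_c)) for i in range(4)]
    return solve_linear(XtX, Xty)


# Synthetic (invented) descriptor values for seven hypothetical analogues.
PI = [0.00, 0.56, 0.71, 0.86, -0.67, -0.02, 1.12]
SIGMA = [0.00, -0.17, 0.23, 0.23, -0.37, -0.27, 0.54]
ES = [0.00, -1.24, -0.97, -1.16, -0.55, -0.55, -2.40]
TRUE_COEFFS = [0.9, -1.2, 0.4, 3.0]  # a, b, c, d used to generate the data

ACTIVITY = [TRUE_COEFFS[0] * p + TRUE_COEFFS[1] * s + TRUE_COEFFS[2] * e + TRUE_COEFFS[3]
            for p, s, e in zip(PI, SIGMA, ES)]

estimated = fit_hansch(PI, SIGMA, ES, ACTIVITY)
print([round(v, 3) for v in estimated])
```

With real, noisy activity data the recovered coefficients would only approximate the underlying trend, and the quality of the fit would have to be judged with statistics such as r2 and cross-validation.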
The linear free energy relationship (LFER) approach was the
first attempt to predict the property of a compound from an analysis of its structure
[12]. LFER methods are widely used for the development of quantitative models for
energy-based properties such as partition coefficients, binding constants or reaction
rate constants. This is based on the pioneering work of Hammett, who introduced this
method for the prediction of chemical reactivity. The basic assumption is that the
influence of a structural feature on the free energy change of a chemical process is
constant for a congeneric series of compounds. The basic LFER approach was later
extended to the more general concept of fragmentation. Molecules are dissected into
substructures, and each substructure is seen to contribute a constant increment to the
free-energy-based property. The premise of strict linearity does not hold true in most
cases, so corrections have to be applied in the majority of methods based on the
fragmentation approach. Correction terms are often related to long-range interactions
such as resonance or steric effects.
Computer-assisted drug design (CADD), also called computer-assisted
molecular design (CAMD), represents a more recent application of computers as tools
in the drug design process [9]. It is important to emphasize that computers cannot
substitute for a clear understanding of the system being studied. A computer should
therefore be considered as an additional tool to gain better insight into the chemistry
and biology of the problem at hand. This tool has enabled the rapid synthesis of
large numbers of molecules, and massive amounts of data can be generated in a
relatively short period of time.
In the middle of the 20th century, two QSAR approaches now considered
classical were developed [7]:
1. Techniques based on the recognition of molecular features (fragments, groups or
sites) and calculation, generally by regression analysis, of the contribution that
these features make to activity, assuming additivity of the effects.
2. Techniques based on physicochemical parameters as structural descriptors. The
rationale of this method is that the biological responses of living
organisms to drugs are frequently controlled by lipophilicity, electronic and steric
properties.
1.3.1 Data Set
A data set consists of compounds with known molecular structure and biological
activity; the compounds are divided between a training set and a test set. In one
reported approach, approximately 40 % of the compounds were selected with a maximum
dissimilarity algorithm and assigned to the test set, with the remaining 60 % assigned
to the training set [13]. The training set is used for QSAR model development and the
test set for model validation.
Other techniques that can be used to divide a data set into
training and test sets are based on sphere-exclusion algorithms [14]. The procedure
implemented in this method starts with the calculation of the distance matrix D
between representative points in the descriptor space. Each probe sphere radius
corresponds to one division into training set and prediction set. A sphere-exclusion
algorithm consists of the following steps:
1. Select the compound with the highest activity.
2. Include this compound in the training set.
3. Construct a probe sphere around this compound.
4. Include compounds corresponding to representative points within this sphere,
except for the sphere center, in the test set.
5. Exclude all points within this sphere from the initial set of compounds.
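The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this study: it assumes that the five steps are repeated until no compounds remain, that distances are Euclidean, and that "highest activity" means the largest activity value supplied. The function name and the toy data are hypothetical.

```python
import math


def sphere_exclusion_split(descriptors, activities, radius):
    """Return (training, test) lists of compound indices.

    Assumption of this sketch: steps 1-5 are repeated on the compounds
    that remain until the pool is empty.
    """
    remaining = set(range(len(descriptors)))
    training, test = [], []
    while remaining:
        # Steps 1-2: the most active remaining compound becomes a sphere centre
        # and joins the training set (ties go to the lowest index).
        centre = max(remaining, key=lambda i: (activities[i], -i))
        training.append(centre)
        # Steps 3-4: every other compound inside the probe sphere joins the test set.
        inside = [i for i in remaining
                  if i != centre
                  and math.dist(descriptors[i], descriptors[centre]) <= radius]
        test.extend(sorted(inside))
        # Step 5: remove the whole sphere from the pool.
        remaining -= set(inside) | {centre}
    return training, test


# Toy two-descriptor data set (invented values).
toy_descriptors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (3.0, 3.0)]
toy_activities = [5.0, 4.0, 3.0, 2.5, 1.0]

train_idx, test_idx = sphere_exclusion_split(toy_descriptors, toy_activities, radius=0.5)
print(train_idx, test_idx)
```

A larger radius places more compounds in the test set, which is why each probe sphere radius corresponds to a different division of the data.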
The division of a data set can also be done by sorting the compounds in order of
increasing biological activity. The odd-numbered compounds are then
assigned to the training set and the even-numbered compounds to the prediction set,
or vice versa.
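The activity-sorted division can be illustrated with the short sketch below. The function name and the MIC values are hypothetical; positions are counted from 1 after sorting, so the first, third, fifth, ... compounds go to the training set.

```python
def alternating_split(activities):
    """Return (training, prediction) index lists after sorting by activity."""
    order = sorted(range(len(activities)), key=lambda i: activities[i])
    training = order[0::2]    # 1st, 3rd, 5th, ... compound in the sorted list
    prediction = order[1::2]  # 2nd, 4th, 6th, ... compound in the sorted list
    return training, prediction


# Invented MIC values (ug/mL) for six hypothetical compounds.
mic = [32.0, 4.0, 128.0, 16.0, 8.0, 64.0]

training_set, prediction_set = alternating_split(mic)
print(training_set, prediction_set)
```

Swapping the two returned lists gives the alternative assignment, with even-numbered compounds used for training. Either way, both sets span the full range of activity values.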
1.3.2 Descriptors
QSAR models are constructed by analyzing known or computed property
data and a series of descriptors representing the characteristics of the system. An
important class of these descriptors belongs to the empirical parameter category
derived from physical organic chemistry. These parameters focus on how chemical
reaction rates depend on differences in molecular structure.
Encoding the molecules numerically allows an indirect link between structure
and activity to be established. Descriptors are numerical quantities that characterize
properties of molecules [11]; they can also be defined as numerical values that
encode certain aspects of molecular structure [12, 15]. For each structure in the data
set, more than 200 descriptors can be calculated, ranging from atom and bond counts
to more detailed combinations of structural information. The relationship between
biological activity and descriptors is:
$$\text{Molecular activity} = f(\text{molecular structure}) = f(\text{descriptors}) \tag{1.4}$$
The QSAR methodology begins with calculation of numerical descriptors for
a set of compounds. Figure 1.1 shows how the generation of structural descriptors
establishes the relationships between molecular structures and properties or
biological activities.
Figure 1.1: The general QSAR problem (diagram relating molecular structures,
structural descriptors and properties through representation and feature selection)
A descriptor can be any quantitative property that depends on the structure of the
molecule. Various physicochemical parameters such as heat of formation,
polarizability, hyperpolarizability and vibrational frequencies have been used jointly
with connectivity, topological and geometrical indices in order to obtain good
models able to predict anti bacterial activity [16].
The development of molecular structure descriptors is the most important part
of any structure-activity investigation because the descriptors must contain enough
information to permit the correct classification of the compounds under study.
Descriptors fall into three main categories: topological, electronic and geometric
[17]. The following sections provide information and examples for each
descriptor class to convey a clearer understanding of the descriptor routines.
1.3.2.1 Topological Descriptors
The structures of organic compounds can be represented as graphs. The
theorems of graph theory can then be applied to generate graph invariants, which in
the context of chemistry are called topological descriptors. The topological
description of a molecule contains information on the atom-atom connectivity in the
molecule, and encodes the size, shape, branching, heteroatoms and the presence of
multiple bonds [18, 19]. This graph description of molecules neglects information
on bond lengths, bond angles and torsion angles, but is able to encode in numerical
form the important atom connectivity information that determines a wide range of
physical, chemical and biological properties. Topological indices are widely used as
structural descriptors in quantitative structure-property relationship (QSPR) and
QSAR models.
The Wiener index, W, defined in 1947, is widely used in QSAR and QSPR
models as a topological descriptor, and it still represents an important source
of inspiration for defining new topological indices. The path number W is defined as
the sum of the distances between any two carbon atoms in the molecule, in terms of
carbon-carbon bonds [17]. Hosoya extended the application of the Wiener index by
defining it from the distance matrix as half the sum of the off-diagonal elements of
the distance matrix of the hydrogen-depleted molecular graph [20].
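The Wiener index is simple enough to compute by hand for small molecules, and the sketch below does so programmatically. It builds the topological distance matrix of a hydrogen-depleted graph by breadth-first search and returns half the sum of all its entries (the diagonal entries are zero, so this equals half the sum of the off-diagonal elements). The adjacency-list encoding and the function names are assumptions of this example, and the graph is assumed to be connected.

```python
from collections import deque


def distance_matrix(adjacency):
    """Topological distances (number of bonds) between all atom pairs."""
    n = len(adjacency)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def wiener_index(adjacency):
    """Half the sum of the distance-matrix elements of a connected graph."""
    return sum(sum(row) for row in distance_matrix(adjacency)) // 2


# Hydrogen-depleted graphs: adjacency[i] lists the neighbours of carbon i.
n_butane = [[1], [0, 2], [1, 3], [2]]   # C-C-C-C
isobutane = [[1], [0, 2, 3], [1], [1]]  # central carbon bonded to three CH3

print(wiener_index(n_butane), wiener_index(isobutane))
```

The more compact, branched isomer has the smaller W, which is the behaviour that makes the index useful as a descriptor of molecular size and branching.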
Randić first introduced the concept of molecular connectivity in 1975. The resulting
index, also called the connectivity index or branching index, was designed to provide
a topological index that could characterize the amount of branching in hydrocarbon
molecules. This initial concept was extended by Kier and Hall to develop the
well-known χ indices [18]. They found that the branching index provides some basic
information concerning the overall composition of the molecule.
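The branching index can be illustrated with a short Python sketch. It uses the usual first-order form, in which each bond contributes 1/sqrt(d_i d_j), where d_i and d_j are the numbers of heavy-atom neighbours of the two bonded atoms. The function name `randic_index` and the example graphs are assumptions introduced here for illustration.

```python
from math import sqrt


def randic_index(adjacency):
    """First-order connectivity index: sum over bonds of 1/sqrt(d_i * d_j)."""
    degree = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    chi = 0.0
    for atom, nbrs in adjacency.items():
        for nbr in nbrs:
            if atom < nbr:  # visit each bond only once
                chi += 1.0 / sqrt(degree[atom] * degree[nbr])
    return chi


# Hydrogen-depleted skeletons of two C4 isomers (illustrative inputs).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(round(randic_index(n_butane), 4))   # 1.9142
print(round(randic_index(isobutane), 4))  # 1.7321
```

As with the Wiener index, the branched isomer yields the lower value.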
Another example of a topological descriptor is the electrotopological state.
This descriptor is a numerical value computed for each atom in a molecule, which
encodes information about both the topological environment of the atom and the
electronic interactions due to all other atoms in the molecule [12].
1.3.2.2 Electronic Descriptors
A large variety of whole-molecule electronic descriptors have been used to
encode electronic features in QSAR investigations. The electronic environment
of each molecule is estimated with the electronic descriptor routines. Electronic
descriptors provide information about the overall charge distribution by calculating
values such as the partial charges on each atom.
A number of electronic descriptors may encode the effects or strengths of
intermolecular interactions. The more commonly recognized intermolecular forces
arise from the following interactions: ion-ion, ion-dipole, dipole-dipole, etc. An
example of this kind of descriptor is the electric dipole moment, which encodes
the strength of polar-type interactions.
Molecular polarizability and molar refractivity are closely related properties
that measure a molecule’s susceptibility to becoming polarized. While descriptors
related to intermolecular interactions are useful for predicting bulk physical
properties and certain types of biological activities, they provide little direct
information about the reactivities of compounds. This information is available
through molecular orbital calculations [20].
The HOMO energy is roughly related to the ionization potential of a
molecule, while the LUMO energy is related to the electron affinity. The magnitudes
of these quantities are measures of the overall susceptibility of the molecule to losing
a pair of electrons to an electrophile or accepting a pair of electrons from a
nucleophile.
1.3.2.3 Geometric Descriptors
Biological activity is often related to the shape and size of the active
compounds as well as the degree of complementarity between the compound and a
receptor. Once three-dimensional molecular models of the compounds have been
generated, these models can be used to develop geometric descriptors.
Geometric descriptors capture information about the overall three-dimensional
size and shape of molecules. As the name implies, they require that the
molecules reside in accurate, three-dimensional geometric conformations before
descriptor generation. Examples of geometric descriptors include the
solvent accessible surface area and volume and the moments of inertia. These
descriptors are useful in encoding steric effects that can occur between molecules.
Geometric descriptors appear frequently in QSAR studies of biological activity,
especially when solvent accessible surface area information is used in conjunction
with partial charge information to form the polar surface area descriptors. Surface
area has a prominent effect on the interactions which occur between a drug molecule
and its surroundings [20]. Another descriptor calculated for biological activity
investigations is the molecular volume. The total molecular volume is taken as the
sum of the contributions of each atom in the structure. The volume contributions of
attached hydrogen atoms are also included in the final volume.
1.4 Feature Selection
Each descriptor contains useful information, but not all of these descriptors
will be used to develop QSAR models. Feature selection is needed to reduce the
number of descriptors. It is a step carried out in many analyses to reduce an
initial, too-large set of descriptors down to a smaller number that are felt to
include the descriptors that matter [21].
The objective of feature selection is to identify the best subset of descriptors
and to reduce the descriptor pool to a reasonable number; several stages of statistical
testing are performed to remove descriptors that contain redundant information. Two
methods are used to achieve feature selection:
i) Objective feature selection uses only the independent variables; the goal is to
remove redundancy amongst the descriptors and to deter chance effects during
model development. Pairwise correlation coefficients are calculated for all pairs
of descriptors; if the r2 value is greater than 0.8, one of the two descriptors is
rejected randomly.
ii) Subjective feature selection, which also uses the dependent variable, is used to
identify the most information-rich descriptor subsets which best map an accurate
link between structure and a property of interest.
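The objective selection step described in item i) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the descriptor names and values are invented, the function names are hypothetical, and, to keep the output reproducible, the later descriptor of a correlated pair is dropped instead of a randomly chosen one.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant descriptor: correlation undefined, treated as 0 here
    return sxy / sqrt(sxx * syy)


def objective_selection(descriptors, r2_cutoff=0.8):
    """Return the names of descriptors kept after pairwise-correlation filtering.

    `descriptors` maps a descriptor name to its values, one per compound.
    """
    kept = []
    for name, values in descriptors.items():
        # Keep a candidate only if it is not highly correlated with a kept one.
        if all(pearson_r(values, descriptors[k]) ** 2 <= r2_cutoff for k in kept):
            kept.append(name)
    return kept


# Invented descriptor pool for five compounds.
pool = {
    "mol_weight": [100.0, 120.0, 140.0, 160.0, 180.0],
    "n_atoms": [10.0, 12.0, 14.0, 16.0, 18.5],  # nearly collinear with mol_weight
    "dipole": [1.2, 0.3, 2.5, 0.9, 1.7],
}
print(objective_selection(pool))  # ['mol_weight', 'dipole']
```

Only the descriptor values enter this filter; the biological activity is not consulted, which is what distinguishes objective from subjective selection.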
The genetic algorithm (GA) method can also be used to select the optimum
number of descriptors for use in regression analysis. The GA can be a useful
technique for searching a large probability space with a large number of descriptors
for a small number of molecules. For example, this technique has been successfully
applied to select the descriptors used to correlate and predict effect
concentration (EC50) values of fluorovinyloxyacetamide compounds [22].
K-nearest-neighbor (KNN) analysis has also been used as a variable selection
procedure. In principle, this technique seeks to optimize simultaneously the selection
of variables from the original pool of all molecular descriptors that are used to
calculate similarities between compounds. The KNN technique has been applied to select
descriptors and establish QSAR models for predicting the anticonvulsant activity
of functionalized amino acids [23].
Searching all combinations of descriptors is impractical, so a logical approach
is taken by combining the search with an optimization routine. This approach has been
shown to be very efficient in screening the reduced pool to identify optimal models
[24]. Generalized simulated annealing (GSA) attempts to find models with the best
configuration of descriptors that will produce low error for the training set
compounds [25]. Once the initial model is evaluated for fitness, a perturbation is
made by randomly replacing one (or more) descriptors with others from the reduced
pool. If the new model is better than the first, the step is accepted and a third
model is produced via perturbation of the descriptors in the second model.
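The perturb-and-accept loop described above can be sketched as follows. This is a plain simulated annealing sketch, not the GSA implementation of reference [25]: the Boltzmann acceptance rule for worse models, the cooling schedule, the function name `anneal_subset` and the toy error function are all assumptions made for illustration.

```python
import math
import random


def anneal_subset(pool, k, error, steps=500, t0=1.0, cooling=0.99, seed=0):
    """Search for a k-descriptor subset of `pool` with a low `error(subset)`.

    `error` is a user-supplied function returning a training-set error
    (hypothetical here); lower is better. Requires k < len(pool).
    """
    rng = random.Random(seed)
    current = rng.sample(pool, k)
    current_err = error(current)
    best, best_err = list(current), current_err
    temp = t0
    for _ in range(steps):
        # Perturbation: swap one descriptor for another from the reduced pool.
        candidate = list(current)
        outside = [d for d in pool if d not in candidate]
        candidate[rng.randrange(k)] = rng.choice(outside)
        cand_err = error(candidate)
        delta = cand_err - current_err
        # Improvements are always accepted; worse models are accepted with a
        # probability that shrinks as the temperature is lowered.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_err = candidate, cand_err
            if current_err < best_err:
                best, best_err = list(current), current_err
        temp *= cooling
    return sorted(best), best_err


# Toy error: pretend that descriptors 2, 5 and 7 are the informative ones.
informative = {2, 5, 7}


def toy_error(subset):
    return len(informative - set(subset))


subset, err = anneal_subset(list(range(10)), k=3, error=toy_error)
print(subset, err)
```

In a real study the error function would fit a regression model on the training set for each candidate subset and return, for example, its root-mean-square error.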
Multiple linear regression analysis can only handle data sets where the
number of descriptors is smaller than the number of molecules, unless a
preselection of descriptors is carried out (e.g. by using GA). The genetic algorithm
and multiple linear regression analysis have been combined (GA-MLRA) to make a new
classification and regression tool for predicting a compound’s quantitative or
categorical biological activity based on a quantitative description of the compound’s
molecular structure [26].
1.4.1 Genetic Algorithm (GA)
The GA approach is a general optimization method, first developed by
Holland [27], that involves an iterative mutation/scoring/selection procedure on a
constant-size population of individuals. The theory behind GA originates from the
‘survival of the fittest’ principle. Darwinian theory states that individuals who
possess dominant features will prevail in a population and produce children with
even more superior features. In GA, models represent chromosomes while the
descriptors comprising the model represent the genes encoding each chromosome.
Mating and mutation allow GA to efficiently scan an error surface and assess the
fitness of thousands of models.
The advantages of GA methods are that they search the descriptor space
efficiently and can find models containing combinations of descriptors or features
that predict well as a group but poorly individually [28, 29]. GA methods were used to
select the optimum number of descriptors for use in regression analysis. The general
GA scheme is shown in Figure 1.2.
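The initialization, scoring, selection, mating and mutation steps can be sketched in Python as below. This is a generic illustration, not the GA implementation used in this work: the bit-string encoding, truncation selection, one-point crossover, elitism, parameter values and the toy fitness function are all assumptions.

```python
import random


def ga_select(n_descriptors, fitness, pop_size=20, generations=40,
              mutation_rate=0.05, seed=0):
    """Evolve bit-string chromosomes; bit i = 1 means descriptor i is in the model.

    Requires n_descriptors >= 2 and pop_size >= 4.
    """
    rng = random.Random(seed)
    # Initialization: a random population of constant size.
    population = [[rng.randint(0, 1) for _ in range(n_descriptors)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Scoring: rank the chromosomes by fitness (higher is better).
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]   # selection of the fitter half
        children = [ranked[0]]             # elitism: carry over the best model
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_descriptors)  # one-point crossover (mating)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return max(population, key=fitness)


# Toy fitness: reward informative descriptors, penalise uninformative ones.
informative = {1, 4, 6}


def toy_fitness(chromosome):
    chosen = {i for i, gene in enumerate(chromosome) if gene}
    return 2 * len(chosen & informative) - len(chosen - informative)


best = ga_select(8, toy_fitness)
print(best, toy_fitness(best))
```

In a QSAR setting the fitness function would typically be derived from the quality of a regression model built on the descriptors selected by the chromosome.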
1.5 Tools and Techniques of QSAR
QSAR studies involve a mathematical correlation between molecular structure
and activity. For quantitative modeling, two methods are primarily used to
develop QSAR/QSPR models. Model complexity generally increases during the
model development stages: simple methods requiring low computational resources
are examined first, with the more complex and computationally demanding
techniques being employed last in an effort to increase model quality.
Figure 1.2: Flow diagram for the GA (boxes: Initialization (random), Population,
Fitness function, Mating and mutation, Best model)
The first and most widely used mathematical technique in QSAR analysis is
multiple regression analysis (MRA). Regression analysis is a powerful means for
establishing a correlation between independent variables, which in this case usually
include physicochemical parameters, and a dependent variable such as biological
activity [22, 30].
1.5.1 Multiple Linear Regression Analysis (MLRA)
The goal of MLRA is to find the best subset of descriptors which provide
accurate predictions for each compound in the training set. For each model, the
values for descriptor coefficients and y-intercept are found that provide the most
accurate mapping between input descriptors and property of interest. Generally,
linear regression is represented by the equation below:
$$ c = Rb \tag{1.5} $$
where c is the vector of molecular activities (n samples x 1), R is the matrix of
descriptors (n samples x n descriptors) and b is the vector of model coefficients
(n descriptors x 1).
Using the descriptor matrix R and the known activities c of the compounds, the
regression coefficients in equation 1.5 can be estimated as:
$$ \hat{b} = (R^{T}R)^{-1}R^{T}c \tag{1.6} $$
where $\hat{b}$ is the estimated regression vector, R is the matrix of descriptors,
$(R^{T}R)^{-1}R^{T}$ is the pseudo-inverse of R and c is the vector of activities of
the compounds.
$$ \hat{c} = r_{unk}\hat{b} \tag{1.7} $$
Here $r_{unk}$ is the vector of known descriptors of an unknown compound. The estimated
regression vector $\hat{b}$ can thus be used to predict the activity of unknown
compounds ($\hat{c}$) by using equation 1.7.
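Equations 1.5 to 1.7 can be illustrated numerically with NumPy. The descriptor values, the coefficients and the variable names below are invented for this sketch, and the activities are generated without noise so that the recovered coefficients can be checked by eye. As in equation 1.5, no intercept term is included.

```python
import numpy as np

# Hypothetical training data: 5 compounds x 2 descriptors.
R = np.array([[1.0, 0.5],
              [2.0, 1.0],
              [3.0, 2.5],
              [4.0, 3.0],
              [5.0, 5.5]])
b_true = np.array([2.0, -1.0])
c = R @ b_true  # noise-free activities, for illustration only

# Equation 1.6: b_hat = (R^T R)^-1 R^T c, solved without forming the inverse.
b_hat = np.linalg.solve(R.T @ R, R.T @ c)

# Equation 1.7: predict the activity of an unknown compound from its descriptors.
r_unk = np.array([6.0, 4.0])
c_hat = r_unk @ b_hat

print(b_hat)  # approximately [ 2. -1.]
print(c_hat)  # approximately 8.0
```

Solving the normal equations with `np.linalg.solve` avoids computing the explicit inverse of R^T R, which is numerically safer when descriptors are strongly correlated.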
Multiple regression calculates an equation describing the relationship
between a single dependent y variable and several explanatory x variables [31]. In
this case the physicochemical parameters are usually taken as the independent
variables and the biological data as the dependent variable. The analysis
derives an equation of the form [11]:
$$ Y = a_1x_1 + a_2x_2 + a_3x_3 + \dots + a_nx_n + e \tag{1.8} $$
The squared multiple correlation coefficient r2 describes how closely the equation
fits the data. If the regression equation describes the data perfectly then r2 will
be 1.0 [32, 33].
$$ r^{2} = \frac{SSR}{SST} \tag{1.9} $$
where SSR is the explained sum of squares of y and SST is the total sum of squared
differences between the observed y values and their mean:
$$ SST = \sum_{i=1}^{n} (y_i - \bar{y})^{2} \tag{1.10} $$
SSR is the sum of squared differences between the predicted y values ($\hat{y}$) and
the mean:
$$ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^{2} \tag{1.11} $$
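Equations 1.9 to 1.11 can be checked with a few lines of Python. The function name `r_squared` and the sample values are assumptions made for this illustration. Note that SSR/SST coincides with the familiar 1 - SSE/SST only for a least-squares fit that includes an intercept.

```python
def r_squared(y_obs, y_pred):
    """Ratio of explained to total sum of squares (equations 1.9 to 1.11)."""
    mean_y = sum(y_obs) / len(y_obs)
    sst = sum((y - mean_y) ** 2 for y in y_obs)    # equation 1.10
    ssr = sum((yp - mean_y) ** 2 for yp in y_pred)  # equation 1.11
    return ssr / sst                                # equation 1.9


y_obs = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_obs, y_obs))      # 1.0 for a perfect fit
print(r_squared(y_obs, [2.5] * 4))  # 0.0 when only the mean is predicted
```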
The major drawback of regression analysis is the danger of overfitting. This
is the risk that an apparently good regression equation will be found, based on a
chance numerical relationship between the y variable and one or more of the x
variables, rather than a genuine predictive relationship. When an overfitted model is
used predictively, the predicted values for untested compounds will not be accurate
predictions of the true values.
1.5.2 Partial Least Squares (PLS)
PLS was developed in the 1960s by Herman Wold as an econometric
technique, but its most avid users are chemical engineers and chemometricians [33].
PLS has been applied to monitoring and controlling industrial processes; a large
process can easily have hundreds of controllable variables and dozens of outputs.
PLS analysis calculates equations describing the relationship between one or
more dependent variables and a group of explanatory variables [34]. PLS comprises
a two-step procedure: principal component analysis (PCA) and multivariate
linear regression (MLR).
PLS analysis can be used in exactly the same way as regression: a single y
(dependent) variable and two or more x (independent) variables are specified. PLS
always includes all x variables in the analysis. As with regression, an equation is
derived that allows the y values for unknown compounds to be predicted from known
x values [35]. Therefore, PLS is able to investigate complex structure-activity
problems, to analyze data in a more realistic way, and to interpret how molecular
structure influences biological activity [10].
An important feature of the method is that usually fewer factors (variables)
are required. The precise number of extracted factors is usually chosen by some
heuristic technique based on the amount of residual variation. Another approach is
to construct the PLS model for a given number of factors on one set of data and then
to test it on another, choosing the number of extracted factors for which the total
prediction error is minimized.
Recall that the form of the linear regression model is c = Rb (equation 1.5). The
difficulty often encountered when solving for b is that the matrix $R^{T}R$ is not
invertible because of redundancy in the variables. Principal component regression
(PCR) eliminates this redundancy by constructing a new matrix U whose columns
are linear combinations of the original columns in R. Using the U matrix, a new
model is written as shown in equation 1.12:
$$ c = U\tilde{b} \tag{1.12} $$
The technique of PLS is similar to PCR, with the crucial difference that the
quantities calculated are chosen to explain not only the variation in the independent
(X) variables but also the variation in the dependent (Y) variables. PCR
produces a weight matrix reflecting only the covariance of the predictor variables,
while PLS regression includes the response variable Y in the process of reducing
the variables, so that the covariance is between the independent and dependent
variables.
PCR and PLS use different approaches for choosing the linear combinations
of variables for the columns of U. PCR uses only the R matrix to determine the
linear combinations of variables, but in the PLS technique the covariance of the
descriptors with the activities is used in addition to the variance in R to
generate U [36]. The difference between PCR and PLS is illustrated in
Figure 1.3.
Figure 1.3: Illustration of the difference between PCR and PLS (Step 1: U is
generated from R alone in PCR, and from R and c in PLS; Step 2: c = Ub for both
PCR and PLS)
U is the score matrix from PCA, which defines the location of the samples relative to
one another in row space. The score matrix is related to the original matrix R (matrix
of descriptors) in the following manner:
$$ R = USV^{T} \tag{1.13} $$
where U is the score matrix, V is a matrix containing the loadings and S is a diagonal
matrix. The orthonormal property of V (i.e., $V^{T}V = I$) can be used to solve
equation 1.13 for U as follows:
$$ U = RVS^{-1} \tag{1.14} $$
Equation 1.12 can then be solved as follows, and the result can be used to predict
the activity of unknown compounds:
$$ \hat{\tilde{b}} = U^{T}c \tag{1.15} $$
where c is the vector of activities, $U^{T}$ is the pseudo-inverse of the score matrix
from PCA and $\hat{\tilde{b}}$ is the regression vector.
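Equations 1.12 to 1.15 can be traced with NumPy's singular value decomposition. In this sketch the descriptor matrix is random, one descriptor is deliberately made redundant so that R^T R is singular, and the number of retained factors k is fixed by hand; all of these choices, together with the variable names, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(8, 3))                   # 8 compounds x 3 descriptors
R = np.column_stack([R, R[:, 0] + R[:, 1]])   # 4th descriptor is redundant
c = R[:, :3] @ np.array([1.0, -2.0, 0.5])     # noise-free activities

# Equation 1.13: R = U S V^T, keeping only the k significant factors.
U_full, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 3
U = U_full[:, :k]
S_inv = np.diag(1.0 / s[:k])
V = Vt[:k].T

# Equation 1.14: the scores can also be obtained as U = R V S^-1.
scores = R @ V @ S_inv

# Equation 1.15: the regression vector in score space.
b_tilde = U.T @ c

# Equation 1.12: fitted activities, and a prediction for an "unknown" compound
# (the first training compound is reused here as a stand-in).
c_fit = U @ b_tilde
r_unk = R[0]
c_unk = r_unk @ V @ S_inv @ b_tilde

print(np.allclose(scores, U), np.allclose(c_fit, c), np.isclose(c_unk, c[0]))
```

Because the columns of U are orthonormal, no matrix inversion is needed in equation 1.15, even though R^T R itself cannot be inverted for this data set.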
1.6 Applications of QSAR
The major goal of QSAR in chemical research is to predict the behavior of
new molecules, using relationships derived from analysis of the properties of
previously tested molecules. QSAR studies represent one of the best methodologies
in computer-based drug design, offering valuable information about biological
activity and providing a computationally inexpensive methodology for the design of
potential bioactive drugs.
MRA has been used to generate QSAR models. These models were
constructed by correlating the topological descriptors and anti tumor activity of 20
(S)-camptothecin derivatives. Good QSAR models can be used to guide the
design of new analogues and to predict their anti tumor activity [37].
The QSAR approach can also be used to search for new agents against
Mycobacterium tuberculosis (M. tuberculosis) and other typical mycobacteria. This is
significant due to the lack of effectiveness of known anti tuberculosis agents against
opportunistic pathogens as a consequence of rapidly emerging resistance [38].
A QSAR method using hydrophobicity and electrophilicity as parameters was
employed to investigate the structural features that affect the
toxicity of nitrobenzene derivatives to yeast, and response-surface analysis was
performed to develop a robust QSAR for predictive use [39]. QSAR models have also
been developed relating hydrazide potencies against Escherichia coli (E. coli)
and Saccharomyces cerevisiae. The study shows that an extrathermodynamic
relationship can be established between two different cell systems [40].
The inflammatory process is necessary for survival against pathogens and
injury, but sometimes the inflammatory response is aggravated and sustained without
benefit. A large number of homoisoflavanones have been isolated from several genera
within the Hyacinthaceae family and have anti-inflammatory properties. The
biological data were then correlated to the physicochemical descriptors of the
compounds by applying statistical regression analysis, in order to establish a
quantitative structure-activity relationship model with reliable predictive ability
for the potential degree of anti-inflammatory activity of compounds within this class
[41].
QSAR studies are also being applied in environmental assessment; toxicity to
aquatic life forms is one of the crucial factors in evaluating the environmental
risks of man-made chemicals. Chemicals could jointly cause toxic effects to fish at
concentrations as low as 2% of their individual inhibition concentration (IC50).
QSAR models derived from single-chemical toxicity assays are used to
predict the concentrations of components in mixtures that would jointly cause 50%
inhibition of microbial respiration [42].
Chemical and biological transformations and degradations play a role in the
transport and mobility of such chemicals in the environment. Volatile organic
compounds (VOCs) are a class of organic chemicals largely present in the
troposphere because of their vapor pressure. QSAR modeling can be used
to predict the rate constant for hydroxyl radical tropospheric degradation of 46
heterogeneous organic compounds [43]. A variety of QSAR paradigms have been
presented as possible computational tools to aid the rapid assessment of the
endocrine disruption potential of environmentally relevant compounds [44].
A large number of environmental chemicals known as endocrine-disrupting
chemicals (EDCs) are suspected of disrupting endocrine functions by mimicking or
antagonizing natural hormones in experimental animals, wildlife, and humans.
EDCs may exert adverse effects through a variety of mechanisms, including estrogen
receptor (ER)-mediated mechanisms of toxicity. Consequently, more than 58,000
environmental and industrial chemicals have been identified as candidates for
possible experimental testing. QSAR could be used as an inexpensive prescreening
tool to prioritize the chemicals for further testing and to classify chemicals
according to their ability to bind the estrogen receptor [45].
A newer approach is to develop a drug discovery
strategy that employs QSAR models for chemical database mining. The approach
classifies the lead molecules into active and inactive classes and also predicts
their biological activity [12].
1.7 Overview of Multidrug Resistance Mycobacterium tuberculosis
TB, or tuberculosis, is a disease caused by the bacterium Mycobacterium
tuberculosis (M. tuberculosis). It can affect several organs of the human body,
including the brain, the kidneys and the bones, but most commonly it affects the lungs
(pulmonary tuberculosis). It is estimated that one-third of the world’s population is
infected with this bacterium. While only a small percentage of infected individuals
will develop clinical tuberculosis, each year there are approximately eight million
new cases and two million deaths. M. tuberculosis is thus responsible for more
human mortality than any other single bacterial species [46].
Since tuberculosis spreads easily when people are in close contact with an
infected person, it was more common in towns than in the countryside. People often
came to towns to trade their goods or do other business. Between the sixteenth and
nineteenth centuries, many of the new arrivals in the major cities of Europe were
consumed by tuberculosis and other infectious diseases. A city’s population was
maintained only by a steady supply of healthy young people coming to make their
fortunes. The Industrial Revolution, which began in the late eighteenth century in
England and some decades later in the United States, brought more people
into the urban areas, and city life became more perilous. People could not escape the
risk of tuberculosis infection even in their own homes, away from the factories. All
these factors created the perfect breeding ground for tuberculosis, which became an
epidemic in Europe and later in the United States.
A number of efficacious anti tuberculosis agents were discovered in the late
1940s and 1950s, with the last, rifampin, introduced in the 1960s [47]. These
agents have reasonable efficacy and, when used in combination, should preclude the
development of drug resistance. However, in 1962 Eleanor Roosevelt died of
tuberculosis, and it was learned that her M. tuberculosis strain was drug resistant.
The use (or in most cases misuse) of these drugs has led over the years to an
increasing prevalence of multi-drug resistant (MDR) strains and there is now an urgent
need to develop new effective agents.
1.7.1 Mycobacterium tuberculosis
The genus Mycobacterium consists of gram positive bacilli with a distinctive
cell wall characterized by the presence of unusual glycolipids. A number of
mycobacteria are pathogenic for man, but the most important is undoubtedly M.
tuberculosis, the causative agent of tuberculosis [47].
M. tuberculosis is one of the tubercle bacilli; it grows well (eugonic) on
egg media containing glycerol or pyruvate. Colonies resemble breadcrumbs and are
cream colored. Films show clumping and cord formation, especially on moist
medium. It is usually resistant to thiophene-2-carboxylic acid hydrazide (TCH), is
nitratase positive, aerobic and susceptible to pyrazinamide.
M. tuberculosis is an obligate aerobe which grows at different rates within
cavities, caseous foci, and macrophages. The doubling time is 12-14 hours, as
compared to 0.25-1 hour for most other pathogenic bacteria. Since the efficiency of
many antibacterial agents is directly proportional to the rate of growth, eradication
of infection requires prolonged therapy (6-18 months).
1.7.2 How Does Tuberculosis Spread
The TB germ is carried on droplets in the air, and can enter the body through
the airway. A person with active pulmonary tuberculosis can spread the disease by
coughing or sneezing. To become infected, a person has to come in close contact
with another person having active tuberculosis.
The process of catching tuberculosis involves two stages. The first stage of the
infection usually lasts for several months. During this period, the body’s natural
defenses (immune system) resist the disease, and most or all of the bacteria are
walled in by a fibrous capsule that develops around the area. Before the initial attack
is over, a few bacteria may escape into the bloodstream and be carried elsewhere in
the body, where they are again walled in. In many cases, the disease never develops
beyond this stage, which is referred to as TB infection. If the immune system fails to
stop the infection and it is left untreated, the disease progresses to the second
stage, active disease. There, the germ multiplies rapidly and destroys the tissues of
the lungs (or the other affected organ). Sometimes, the latent period is many years,
and the bacteria become active when the opportunity presents itself, especially when
immunity is low.
The second stage of the disease is manifested by destruction or consumption
of the tissues of the affected organ. When the lungs are affected, the result is
diminished respiratory capacity. When other organs are affected, the disease, even if
treated adequately, may leave permanent, disabling scar tissue [48].
Usually, the initial diagnostic/screening test for tuberculosis is the skin test.
A small amount of fluid is injected under the skin of the forearm; the fluid contains a
protein derived from the microorganism causing TB, and is absolutely harmless to
the body. The area is visually examined by a health professional after 48-72 hours to
determine the result of the test.
1.8 Minimum Inhibition Concentration (MIC)
QSAR has been used widely to predict the hazard of untested chemicals
from already tested chemicals by developing statistical relationships between
molecular structure descriptors and biological activity [49]. The principles of
determining the effectiveness of noxious agents against a bacterium were well
enumerated at the turn of the century, but the discovery of antibiotics made these
tests (or their modifications) too cumbersome for the large numbers of tests that
would have to be set up as a routine analysis.
Diffusion tests, widely used to determine the susceptibility of organisms
isolated from clinical specimens, have their limitations; when equivocal results are
obtained, or in prolonged serious infections such as bacterial endocarditis, the
quantitation of antibiotic action needs to be more precise. The way to a precise
assessment is to determine the MIC of the antibiotic for the organism concerned.
MIC is the lowest concentration of the antibiotic which will inhibit the
growth of microbes. Dilution methods are used to determine the MIC of
antimicrobial agents [50]. In a dilution test, microorganisms are tested for their
ability to produce visible growth on a series of agar plates (agar dilution) or in
microplate wells of broth (broth microdilution) containing dilutions of the
antimicrobial agent. The lowest concentration of an antimicrobial agent which will
inhibit the visible growth of a microorganism is known as the MIC.
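Reading the MIC from a dilution series can be expressed as a small Python function. The function name, the two-fold dilution series and the growth observations below are invented for illustration and do not come from the experiments in this work.

```python
def minimum_inhibitory_concentration(growth_by_concentration):
    """Return the lowest tested concentration with no visible growth, or None.

    `growth_by_concentration` maps each tested concentration to True if
    visible growth was observed at that concentration and False otherwise.
    """
    inhibiting = [conc for conc, grew in growth_by_concentration.items()
                  if not grew]
    return min(inhibiting) if inhibiting else None


# Hypothetical two-fold dilution series (ug/mL) -> visible growth observed?
series = {64: False, 32: False, 16: False, 8: True, 4: True, 2: True}
print(minimum_inhibitory_concentration(series))  # 16
```

If growth is observed at every tested concentration, the function returns `None`, meaning that the MIC lies above the highest concentration in the series.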
MIC methods are widely used in the comparative testing of new agents. In
clinical laboratories they are used to establish the susceptibility of organisms that
give equivocal results in disk tests, for tests on organisms where disk tests may be
unreliable, and when a more accurate result is required for clinical management.
1.8.1 Escherichia coli
The bacterium E. coli was named after the Austrian doctor Theodor von
Escherich (1857-1911), who first isolated the genus of bacteria belonging to the
family Enterobacteriaceae, tribe Escherichieae. This bacterium is a common
inhabitant of the intestinal tract of man and other animals; it is needed to break
down cellulose and it assists in the absorption of vitamin K, the blood-clotting
vitamin [51].
E. coli is a motile species. It can produce acid and gas from lactose at 44 °C
and at lower temperatures, is indole positive at 37 °C, MR positive, fails to grow in
citrate and is malonate and gluconate negative. It is H2S negative and usually
decarboxylates lysine [52]. The structure of E. coli [53] is shown in Figure 1.4:
Figure 1.4: Structure of E. coli
E. coli is part of the normal bacterial flora of the gastrointestinal tract of
poultry. Ten to fifteen percent of the intestinal coliforms in chickens are of
pathogenic serotypes. Colibacillosis is a common systemic infection caused by E.
coli in poultry. The disease causes considerable economic damage to poultry
production worldwide. At least four types of E. coli cause gastrointestinal disease
in humans: enteropathogenic (EPEC), enterotoxigenic (ETEC), enteroinvasive (EIEC)
and verotoxigenic (VTEC).
The EPEC strains have been associated with outbreaks of infantile diarrhea
and are identified serologically. ETEC strains are thought to cause gastroenteritis in
both adults and children, while EIEC strains cause diarrhea similar to that in
shigellosis. The strains associated with invasive enteric infections are less reactive
than typical E. coli. VTEC strains derive their name from their cytotoxicity on Vero
cells in tissue culture; they have been associated with haemolytic uraemic syndrome
and haemorrhagic colitis.
1.9 Database Mining
Pharmaceutical lead compounds traditionally have been discovered by
isolation of natural products from fermentation broths and plant extracts, and by
screening corporate chemical databases. Recently, this process has been assisted by
structure-based rational drug design technology. Drug design is one of the most
important fields of study for bioinformatics. A major goal of drug design is to
discover and optimize novel chemical substances that specifically interact with target
molecules and, as a consequence, compensate or reverse disease processes [54].
Drug abuse continues to be one of the most difficult and costly issues of
modern society, and cocaine is among the most heavily abused and devastating illicit
substances. QSAR models have been developed to correlate structural features of
dopamine transporter (DAT) ligands with their biological activities. They have also
been employed to search for new lead compounds in the National Cancer Institute (NCI)
database, yielding five compounds that are suitable for development as novel
DAT inhibitors [55].
Another application is in the search for anticonvulsant agents to treat
epilepsy. Epilepsy is a chronic disorder characterized by recurrent unprovoked
seizures. Currently, the main treatment for epileptic disorder is the long-term and
consistent administration of anticonvulsant drugs. Existing therapies have failed to
adequately control this disorder, documenting the need for new agents with different
mechanisms of action. Variable selection KNN QSAR models have been developed and
used to mine external chemical databases or virtual libraries for lead
identification. This strategy was successfully applied to the discovery of novel
anticonvulsant agents [56, 57].
The National Cancer Institute (NCI), USA, has been carrying out in vitro
screening of compounds to determine their inhibitory activity on cell growth
in the NCI 60 human cancer cell lines for the purpose of anticancer drug discovery.
Web-based data mining tools allow robust analysis of the correlation between
the in vitro anticancer activity of the drugs in the NCI anticancer database and the
protein levels and mRNA levels of molecular targets (genes) in the NCI 60 human
cancer cell lines, for identification of potential lead compounds for specific
molecular targets and for study of the molecular mechanism of action of a drug
molecule [58].
1.10 Research Scope
This study focused on developing QSAR models that correlate biological
activity (e.g., anti bacterial, anti tuberculosis) and chemical structures. The validated
QSAR models were applied to mining chemicals in a large database. Database
mining is one of the most important follow-up applications of QSAR model
development. The proposed models can be utilized to select compounds that have
similar structural attributes to the active compounds in the training set and that are
expected to demonstrate anti bacterial and anti tuberculosis activity. The compounds
used in the first data set were limited to those that have been extracted from natural
products, while the second data set consists of compounds derived from plant
terpenoids.
1.11 Research Objectives
The main objectives of this research were:
1. To develop computer models that correlate biological activity of chemical
compounds in natural products with their structural characteristics.
2. To apply the QSAR models in screening a large library of compounds
(database mining).
1.12 Significance of Research
One potential contribution of this research is in the utilization of our natural
products to develop anti bacterial agents. Successful development of new agents will
undoubtedly increase the value of our natural resources. Terpenoids are also a class
of compounds that have been extracted from natural products; they can be used to
combat the growth of M. tuberculosis bacteria. As discussed previously, there is an
urgent need to develop new effective agents against M. tuberculosis. Billions of
dollars are spent each year by the drug and chemical companies of the world in the
effort to study the relationships between molecular structure and bioactivity for
generating new drugs. By using QSAR models, we can correlate structural features of
the compounds isolated from plants with their biological activity, and the models can
be used to predict the activity of new compounds. Knowledge from this study can be
used in the production of new drugs in the pharmaceutical industry.
1.13 Layout of the Thesis
This thesis is organized into five chapters. Chapter 1 describes the
background of the research and reviews the literature in order to understand the issues
and formulate the research problem. The review covers QSAR, GA, the tools and
techniques of QSAR, and an overview of multidrug-resistant M. tuberculosis and MIC.
It also presents the research scope, research objectives, significance of the research,
and layout of the thesis.
Chapter 2 presents the research methodology employed in conducting the
study. Two main approaches were adopted: development of QSAR models and
application of these models in database mining. The QSAR models are used to
predict the biological activity of compounds not included in QSAR model
development (the prediction set) and can be further exploited for the design and
discovery of new potent antibacterial agents. Database mining can be used to search
for compounds that have attributes similar to those of the active compounds in the
training set.
Chapter 3 presents the results for the data set consisting of compounds
isolated from natural products with biological activity against E. coli. It describes
the development of QSAR models to predict the antibacterial activity of unknown
compounds, followed by their application to searching for new compounds by database
mining. Results of biological testing of selected compounds are also presented.
Chapter 4 presents the results of the study in which plant terpenoids active against
M. tuberculosis were used as the data set. It explains the QSAR model development
and its validation by predicting the anti-tuberculosis activity of compounds not
included in the model development process. These models were later used to search
for new potent anti-tuberculosis agents. Finally, the activities of the new lead
molecules were tested experimentally using the disk diffusion method.
Chapter 5 presents the conclusions of this study. The report ends with
some suggestions for future research.
CHAPTER 2
RESEARCH METHODOLOGY
2.1 Introduction
One method for gaining insight into the potential biological activity of a
molecule is to compare its physical properties with those of a group of similar
chemical compounds. QSAR can be used to predict the properties of molecules before
they are actually synthesized and tested for biological activity.
QSAR models can also be used to screen large libraries of compounds
efficiently to identify those with desired characteristics. The ultimate goal of any
QSAR model is to establish a precise structure-activity link from a set of training
compounds, so that accurate predictions can be made for unknown compounds based
on structure alone.
The QSAR paradigm is based on the assumption that there is an underlying
relationship between molecular structure and biological activity, so that systematic
variation in structure gives rise to systematic variation in activity. It is also assumed
that the multivariate physicochemical description of the set of compounds reveals
these analogies. In principle, the physical, chemical, and biological properties of a
chemical substance are determined by its molecular structure, which can be encoded
in numerical form with the aid of various descriptors.
In this research, data on bioactive compounds from natural products were
used as samples to construct QSAR models. The important steps in the QSAR
methodology were structure entry and molecular modeling, descriptor generation,
feature selection, model construction and model validation, as illustrated in Figure
2.1.
[Flow chart: Structure entry and molecular modeling → Descriptor generation →
Feature selection → Model construction → Model validation]
Figure 2.1: General QSAR methodology
2.2 Data Set
The first stage in QSAR model development is the selection of a molecular
data set. In this study, structural and biological activity data were collected from
the literature and from previous projects carried out by other researchers in the
Department of Chemistry, Universiti Teknologi Malaysia. The compounds used in the
data set were limited to those that have been extracted from natural products.
The first data set consisted of 56 compounds isolated from Piper advuncum
leaves, Piper guineense root, Piper pedicellosum, Piper ungaromense, Premna
integrifolia leaves, Vitex pubescens bark, Lantana camara leaves, and Macaranga
triloba bark, whose MIC values (µg/mL) against E. coli were measured as described
in references [59-61]. The compounds in the data set were divided into a training set
(28 compounds) for model development and a prediction set (28 compounds) for
model validation, as shown in Appendix A.
The second data set comprised 122 plant terpenoids (isolated from Salvia
multicaulis, Borrichia frutescens, Melia volkensii, Inula helenium and Rudbeckia
subtomentosa) with moderate to high activity (MIC, µg/mL) against M. tuberculosis,
taken from the literature [62]. They were divided into a training set (61 compounds)
to establish the QSAR model and a prediction set (61 compounds) for testing the
accuracy of the constructed model.
The data set was divided into a training set and a prediction set by first sorting
the list of compounds in increasing order of biological activity. Next, the
even-numbered compounds were assigned to the training set and the odd-numbered
compounds to the prediction set. This method was chosen so that both sets span the
full range of activity, giving a more representative sample in the training set.
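The splitting rule described above can be sketched in a few lines of Python. This is only an illustration: the function name, the compound labels and the MIC values are hypothetical, and numbering is assumed to start at 1 after sorting.

```python
def split_by_sorted_activity(compounds):
    """Split (name, activity) pairs into training and prediction sets.

    The list is sorted by increasing activity and numbered from 1
    (an assumption); even-numbered entries go to the training set
    and odd-numbered entries go to the prediction set.
    """
    ranked = sorted(compounds, key=lambda pair: pair[1])
    training = [c for rank, c in enumerate(ranked, start=1) if rank % 2 == 0]
    prediction = [c for rank, c in enumerate(ranked, start=1) if rank % 2 == 1]
    return training, prediction


# Hypothetical MIC values (ug/mL), for illustration only.
example = [("A", 50.0), ("B", 12.5), ("C", 100.0),
           ("D", 25.0), ("E", 6.25), ("F", 200.0)]
training_set, prediction_set = split_by_sorted_activity(example)
```

Because neighbouring compounds in the sorted list go to different sets, both sets cover the whole activity range rather than clustering at one end.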
2.3 Structure Entry and Molecular Modeling
Structure entry and molecular modeling is an important stage in QSAR
methodology, and several software packages were used at this stage. First,
ChemDraw Ultra 6.0 (CambridgeSoft) was used to draw the 2D molecular
structures of the compounds. Within the TSAR 3.3 software package (Accelrys),
Corina was used to convert the 2D structures to 3D structures and COSMIC
was used to optimize the structures of the compounds. Generation of molecular
descriptors, feature selection and model generation were also carried out using
TSAR 3.3. The software was run under Microsoft Windows XP on a Pentium IV
computer.
2.4 Descriptor Generation
A common issue in QSAR is how to describe molecules and their properties.
The nature of the descriptors used, and the extent to which they encode the structural
features related to the biological activity, are crucial parts of a QSAR study.
Descriptors can be defined as [11, 12]:
1. Numerical quantities generated to represent the molecular structures.
2. Numerical values that encode certain aspects of molecular structure.
3. Physicochemical properties that describe some aspect of the chemical
structure.
The TSAR 3.3 software package (Accelrys) can calculate 316 types of
descriptors [33], which are summarized in Table 2.1.
Table 2.1: Type of descriptors in TSAR
Type of descriptor  | Explanation
--------------------|------------------------------------------------------------
Molecular attributes | Simple descriptors including mass, surface area, volume, Verloop parameters, measures of inertia, dipole, molar refractivity, lipophilicity and lipole.
Molecular indices   | Molecular indices describing connectivity, shape, topology and electrotopology.
Atom counts         | The number of each specified atom in the selected structures.
Ring counts         | The number of rings of each size in the selected structures.
Group counts        | The number of each specified group in the selected structures.
ADME screen         | Absorption, distribution, metabolism and excretion behaviour of structures based on selected principles.
ASP similarities    | An optional program that calculates similarity indices.
Vamp electrostatics | An optional semi-empirical molecular orbital package used to calculate electrostatic properties and perform structure optimization.
It is unrealistic to think that all of the descriptors contain useful information.
Therefore, after numerical descriptors have been calculated for each compound, the
number of descriptors was reduced to a set of descriptors that are information rich
but as small as possible.
2.5 Feature Selection
The number of descriptors that can be generated is usually very large, and there
is most probably a high degree of correlation among them. Therefore, several stages
of statistical testing were performed on the descriptor pool to remove descriptors that
contain redundant information and to reduce the pool to a more manageable size
[17]. The purpose of descriptor selection is to ensure the stability of a model.
The feature selection procedure can be broken into two steps. The first step,
objective feature selection, reduces the descriptor pool to a level at which the
likelihood of finding a chance correlation is minimal. The reduced pool is then
used in the second step, subjective feature selection.
2.5.1 Objective Feature Selection
Objective feature selection examines only the independent variable values (i.e.
descriptor values). The goal was to remove redundancy among descriptors that
were highly correlated (i.e. contain the same information) and to deter chance effects
during model development.
A correlation matrix is a table of all possible pairwise correlation coefficients
for a set of variables. It can be used to identify highly correlated pairs of
variables, and thus to identify redundancy in the data set. Here both the rows and
the columns of the correlation matrix consist of descriptors.
Pairwise correlations were examined between all pairs of descriptors in the
reduced pool. If two descriptors were highly correlated, typically with a correlation
coefficient above 0.9, one of them was removed at random and the descriptors with
the highest desirability were retained [33].
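A minimal sketch of such a pairwise correlation filter is given below. It is not the TSAR procedure: the function names and descriptor values are hypothetical, and for repeatability the later descriptor of each correlated pair is dropped, whereas the text describes a random choice.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(descriptors, cutoff=0.9):
    """Return the descriptor names kept after pairwise correlation filtering.

    `descriptors` maps a descriptor name to its list of values.
    Assumption: the later descriptor of each correlated pair is dropped.
    """
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor values for five compounds.
pool = {
    "mass": [100.0, 150.0, 200.0, 250.0, 300.0],
    "volume": [51.0, 74.0, 102.0, 124.0, 151.0],  # nearly proportional to mass
    "logP": [2.1, 0.4, 3.3, 1.0, 2.7],
}
reduced_pool = drop_correlated(pool)
```

In this toy pool, "volume" is almost perfectly correlated with "mass" and is removed, while "logP" is only weakly correlated with "mass" and is kept.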
2.5.2 Subjective Feature Selection
Subjective feature selection was used to identify the most information-rich
descriptor subset, the one that best maps an accurate link between structure and the
property of interest [17]. Searching all combinations of descriptors is impractical, so
a logical approach is to combine an optimization routine with a fitness evaluator to
find a good model.
The GA method was used to select the optimum number of descriptors for use in
regression analysis [22]. GA is a simulation method based on ideas from Darwin's
theory of natural selection and evolution. The algorithm consists of the following
steps [13, 28]:
1. Each chromosome is represented by a binary bit string, and an initial population
of chromosomes is created at random.
2. The value of the fitness function is evaluated for each chromosome.
3. According to the values of the fitness function, the chromosomes of the next
generation are produced by selection, crossover and mutation operations.
In GA, QSAR models represent chromosomes, while the descriptors
comprising the models represent the genes encoding each chromosome. An initial
population of models is randomly generated with descriptors from the reduced pool.
The first population is evaluated for fitness, with the best models being recorded in
the list of top models. An average cost function is computed using all the models in
the first population. The models whose cost functions are lower than the average are
combined with another set of random models and passed through a mating and
mutation stage to form the second population. The second population is evaluated
for fitness, and if any of the second-generation models are better than those in the
first, they take their appropriate place in the ranking of the top models.
Mating and mutation are the processes that generate a new population of child
strings; each population of strings serves as parents to the subsequent population of
child strings. To illustrate the process of mating and mutation, an example is
given in Figure 2.2. There are two subsets, called parent 1 and parent 2, each
containing five descriptors. The algorithm determines a fixed split point and performs
a cross-over mating process in which the first two descriptors of parent 1 and the last
three descriptors of parent 2 are combined to form child 1. The remaining descriptors
from these two subsets are combined to form child 2. Occasionally, a
mutation can arise which replaces one of the descriptors in a model with another
randomly drawn from the reduced pool.
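The mating and mutation operations can be sketched as follows, using the parent subsets from Figure 2.2. The function names, the random seed and the pool of descriptor indices are assumptions made for the illustration; the actual TSAR implementation may differ.

```python
import random


def crossover(parent1, parent2, split):
    """Single-point crossover at a fixed split index."""
    child1 = parent1[:split] + parent2[split:]
    child2 = parent2[:split] + parent1[split:]
    return child1, child2


def mutate(model, pool, rng):
    """Replace one randomly chosen descriptor with one drawn from the pool."""
    candidates = [d for d in pool if d not in model]
    mutated = list(model)
    position = rng.randrange(len(mutated))
    mutated[position] = rng.choice(candidates)
    return mutated


# Descriptor subsets taken from Figure 2.2.
parent1 = [7, 15, 25, 33, 46]
parent2 = [3, 19, 23, 39, 52]
child1, child2 = crossover(parent1, parent2, split=2)

rng = random.Random(0)       # fixed seed so the illustration is repeatable
pool = list(range(1, 59))    # hypothetical indices for the reduced pool
mutant = mutate(child2, pool, rng)
```

With a split after the second descriptor, `child1` and `child2` reproduce the children shown in the figure; `mutant` differs from `child2` in exactly one position.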
MLRA can only handle data sets in which the number of descriptors is smaller
than the number of molecules, unless a preselection of descriptors is carried out (e.g.
by using a genetic algorithm). GA and MLRA have been combined into a
regression tool for predicting a compound's quantitative or categorical biological
activity from a quantitative description of its molecular structure [26].
PLS and GA have also been combined into a regression technique. This hybrid
approach, which integrates GA as a powerful optimization tool with PLS as a
robust statistical method, is applied to variable selection and modeling [14, 63].
Genetic algorithm partial least squares (GAPLS) has been applied to QSPR studies of
PCBs, and many good models were generated [28].
2.6 Model Development
After the necessary subsets of descriptors had been selected, statistical models
were generated. The descriptors generated from compounds in the training set were
used to build the models. For quantitative modeling, two methods, multiple linear
regression analysis and partial least squares, were primarily used to develop the
QSAR models. The goal was to find the best subsets of descriptors, which produce
stable QSAR models that are able to predict the properties of unknown compounds.
[Diagram. Mating: Parent 1 (7, 15, 25, 33, 46) and Parent 2 (3, 19, 23, 39, 52)
give Child 1 (7, 15, 23, 39, 52) and Child 2 (3, 19, 25, 33, 46).
Mutation: Child 2 becomes (3, 19, 25, 27, 46).]
Figure 2.2: Genetic Algorithm process
2.6.1 Multiple Linear Regression Analysis (MLRA)
A multivariate regression procedure estimates the value of a dependent variable
(biological activity) from independent variables, here the different
molecular descriptors. The first part of the data analysis consists of using the data to
determine the values of the parameters in the models so that the models fit the data
well.
Stepwise regression was chosen to construct the QSAR models. In stepwise
multiple regression, a selection algorithm is used to choose a subset of the input X
variables. The advantage of estimating a model with stepwise MLRA is that only a
few variables are selected, giving a simple QSAR model [36]. The stepwise method
combines two approaches, forward and backward stepping.
At each step, partial F values were calculated for each variable. In forward
stepping, the partial F values of all variables outside the model were calculated. If
any variable had a value greater than the F-to-enter threshold, the variable with the
highest partial F value was added to the model. The process was continued until no
more variables qualified to enter the model. In backward stepping, the partial F
values of all variables inside the model were calculated. The variable with the lowest
partial F value was removed from the model if it qualified to leave. The process was
continued until no more variables qualified to leave the model. In general, a model
can be accepted if it has fewer variables and better predictive power ($r^2_{CV}$).
Cross validation provides a rigorous internal check on the models derived
using regression or partial least squares analysis. It is used to give an estimate of the
true predictive power of the model, i.e. how reliable the predicted values for untested
compounds are. The calculation of cross validation is similar for multiple regression
and PLS analysis. The TSAR 3.3 software package (Accelrys) leaves out groups of
rows in a fixed pattern, using three cross validation groups of rows. One third of the
data is deleted and the values for these rows are predicted using the rest of the data.
This is then repeated for the second and third groups. The model is judged on the
basis of these predictions.
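The three-group cross validation described above can be sketched as follows. This is an illustration under stated assumptions, not the TSAR routine: a one-descriptor least-squares line stands in for the regression model, row i is assigned to group i mod 3 as one possible "fixed pattern", and the cross-validated r² is computed as 1 − PRESS/SS, where SS is the total sum of squares about the mean.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def cross_validated_r2(xs, ys, groups=3):
    """Leave-group-out cross validation; row i belongs to group i % groups."""
    press = 0.0
    for g in range(groups):
        train = [i for i in range(len(xs)) if i % groups != g]
        test = [i for i in range(len(xs)) if i % groups == g]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Accumulate squared prediction errors for the left-out rows.
        press += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test)
    mean_y = sum(ys) / len(ys)
    total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / total


# Hypothetical descriptor values and activities for nine compounds.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys_exact = [2.0 * x + 1.0 for x in xs]
ys_noisy = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1]

q2_exact = cross_validated_r2(xs, ys_exact)
q2_noisy = cross_validated_r2(xs, ys_noisy)
```

For exactly linear data the left-out rows are predicted perfectly and the value approaches 1; any noise in the activities lowers it.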
2.6.2 Partial Least Squares (PLS)
PLS is frequently used as a regression method in QSAR model development.
Compared with other methods applied to such data sets (e.g. PCR, LDA), PLS has
several advantages: it is less influenced by noise, more stable, more predictive and
easier to interpret. It is also insensitive to collinearity among the predictor variables
and can handle data sets in which the number of variables is larger than the number
of observations [35, 64].
PLS is able to investigate complex structure-activity problems, to analyze
data in a more realistic way, and to interpret how molecular structure influences
biological activity. The goal of PLS is to seek the direction in the space of X that
yields the largest covariance between X and Y.
Cross validation in this technique is similar to the method described for
MLRA earlier. By default, TSAR stops the PLS iteration when the statistical
significance of the current vector goes above a fixed value (1.0 by default). This is a
sensible criterion, but not the only possible one. Other sensible methods for
choosing the number of components include:
1. Stopping at the lowest value of the predictive sum of squares (PRESS).
2. Stopping when the PRESS value first starts to increase.
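The two alternative stopping rules can give different answers, as the following sketch shows. The function names and the PRESS values are hypothetical; the list is assumed to hold the PRESS of models with 1, 2, 3, ... components.

```python
def components_at_min_press(press_values):
    """Number of components at the global minimum of PRESS (1-based)."""
    return press_values.index(min(press_values)) + 1


def components_at_first_increase(press_values):
    """Number of components just before PRESS first increases (1-based)."""
    for k in range(1, len(press_values)):
        if press_values[k] > press_values[k - 1]:
            return k
    return len(press_values)


# Hypothetical PRESS values for models with 1, 2, 3, 4 and 5 components.
press = [12.4, 9.8, 10.1, 9.5, 11.0]
n_min = components_at_min_press(press)
n_first = components_at_first_increase(press)
```

Here the first rule selects four components and the second selects two, because PRESS rises at the third component before falling again; the second rule is the more conservative of the two.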
Finally, a model has to be tested using an independent data set of
compounds that are completely unknown to the model: the prediction set. The
complete process of building a prediction model is depicted in Figure 2.3.
2.7 Model Validation
The final part of QSAR model development is model validation, in which the
predictive power of the model, and hence its ability to reproduce the biological
activities (e.g. antibacterial) of untested compounds, is established. While internal
checks such as cross validation provide some assessment of the goodness of fit of a
model, they do not provide a thorough and independent assessment of how well a
model may predict new compounds. To assess such predictivity, the use of a
prediction set is essential.
Validation of a model involves demonstrating predictive ability by
predicting the property of interest for compounds not used during the generation of
the models, that is, an external prediction set of compounds. For the most part, the
prediction set error should be on par with or slightly above the error of the training
set. A second method of validation involves performing randomization experiments
to test for chance correlations.
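A randomization experiment can be sketched as follows: the activities are shuffled so that any real structure-activity link is destroyed, the model is refitted, and the resulting fit is compared with that of the true model. The code is an illustration only; the one-descriptor model, the data, the number of trials and the seed are all assumptions.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. r^2 of a one-descriptor linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


def randomization_test(xs, ys, trials, rng):
    """Refit after shuffling the activities; returns the r^2 of each trial."""
    scores = []
    for _ in range(trials):
        shuffled = list(ys)
        rng.shuffle(shuffled)
        scores.append(r_squared(xs, shuffled))
    return scores


# Hypothetical descriptor values and activities for ten compounds.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
activity = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 20.9]

true_r2 = r_squared(descriptor, activity)
scrambled_r2 = randomization_test(descriptor, activity, trials=20,
                                  rng=random.Random(1))
```

If the scrambled models fit about as well as the true model, the original correlation is likely to be a chance effect; a true model that clearly outperforms the scrambled ones supports a genuine relationship.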
[Flow chart: Structures and experimental data → Calculation of molecular descriptors →
Descriptor analysis and optimization → Model development → Evaluation (predictive
quality sufficient? Yes/No) → Test (predictive quality sufficient? Yes/No) →
Final model validation]
Figure 2.3: Flow chart for the general model building process in QSAR studies
2.8 Application of QSAR Models to Database Mining
Various QSAR models have the common purpose of establishing a meaningful
correlation between activity and quantitative descriptors of chemical structure. Thus,
the successful development of alternative QSAR models confirms the existence of a
structure-activity relationship intrinsic to a data set.
To be effective in the modern drug discovery and development process, QSAR
models must meet two sometimes conflicting requirements: they must be easy to
update with the information produced by synthesis and biological testing, and they
must be able to virtually screen large structural databases in a short time [65].
Structure-activity relationship studies can also be used in principle to identify active
and inactive untested compounds and to design new active compounds. Braga and
Galvao [66] used structure-activity relationship studies to directly correlate some
molecular descriptors with the biological activity of benzo[c]quinolizin-3-ones (BC3);
this information can be used to classify untested compounds as active or inactive.
Database mining is an obvious future application of QSAR models, in which a
database is searched for compounds that have structural attributes similar to those of
the active compounds in the training set. In this study, the antimicrobial database
AmicBase, which consists of 3339 compounds [67], was used as a source in which to
search for new compounds active against E. coli and M. tuberculosis.
The general flowchart of the database mining procedure is summarized in
Figure 2.4 and includes the following major steps.
1. Develop predictive QSAR models for a data set of compounds with known
structures and activities.
2. Select probe molecules (i.e. biologically active compounds) and calculate their
chemical descriptors.
3. Compute the same chemical descriptors for all compounds in the database.
4. Calculate chemical similarity values (here, the Euclidean distance) between all
active probes and every structure in the database.
5. Rank all database structures by their similarity to a probe and select the M
structures within a certain similarity cutoff value.
6. Predict the biological activity values of these M structures with the pre-constructed
QSAR models, using the applicability domain as an additional similarity threshold.
7. Select as hits the structures predicted by all of the QSAR models to have high
values of biological activity.
[Flow chart. Database branch: a database of N structures (e.g., AmicBase) → calculate
chemical descriptors. QSAR branch: experimental QSAR data set (e.g., antibacterial) →
calculate chemical descriptors → develop QSAR model. Common path: evaluate
similarity → select M (of N) structures with the highest similarity to active
compounds → predict bioactivity → select hits.]
Figure 2.4: Flowchart of database mining that employs predictive QSAR models
2.8.1 Molecular Descriptors and Similarity Calculation
Molecular descriptors were calculated with the TSAR 3.3 software package
(Accelrys) for the probe molecules and the compounds in the database. Next, the
similarity between the active compounds in the training set and the compounds in
the database was calculated using the descriptors that were included in the QSAR
model [56].
The perception of structural similarity is relative and should always be
considered in the context of a particular biological target. In similarity searching
systems, as in QSAR, a molecule is represented by a set of M numerical
descriptors, denoted $X_1, X_2, X_3, \ldots, X_M$, where the $X_k$ are the values of the
individual descriptors [24]. Thus, a molecule can be represented geometrically by a
point in M-dimensional descriptor space with coordinates $X_1, X_2, X_3, \ldots, X_M$.
Many QSAR methods require scaling of the original data to extract
significant and useful information and to remove unimportant features. Scaling the
descriptors is a very delicate procedure, because for most descriptors the underlying
relationship between descriptor and activity is not known, and so the influence of
these manipulations cannot be foreseen [68]. In this research, range scaling was used
to avoid giving descriptors with significantly larger ranges a disproportionately
higher weight in the molecular similarity calculations. It was calculated as follows:
$$ y_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (2.1) $$
where $y_i$ is the scaled value, $x_i$ is the original value, and $\min(x)$ and $\max(x)$ are
the minimum and maximum of the collection of x values.
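Equation 2.1 can be sketched as a small function that maps the values of one descriptor onto the interval from 0 to 1. The function name and the sample values are hypothetical, and the handling of a constant descriptor (all zeros) is an assumption, since the equation is undefined in that case.

```python
def range_scale(values):
    """Scale one descriptor column to [0, 1] following Equation 2.1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # Assumption: a constant descriptor is mapped to zeros,
        # because Equation 2.1 would divide by zero here.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


# Hypothetical values of a single descriptor for four compounds.
scaled = range_scale([2.0, 4.0, 6.0, 10.0])
```

After scaling, every descriptor spans the same range, so no single descriptor dominates the distance calculation simply because of its units.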
Euclidean distance was used as the measure of similarity in the
multidimensional descriptor space between all active probes (i.e., molecules used for
QSAR model development) and every structure in the database. The distance
$d_{ij}$ between any two compounds i and j in N-dimensional descriptor space was
calculated using the following equation:
$$ d_{ij} = \sqrt{\sum_{n=1}^{N} (X_{in} - X_{jn})^2} \qquad (2.2) $$
where $X_{in}$ and $X_{jn}$ are the values of the nth descriptor for compounds i and j,
respectively, and the summation is over all descriptors. Compounds with the
smallest distance (highest similarity) from the active probes were considered as hits,
and their activity was then predicted using the QSAR models.
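A minimal sketch of the distance calculation in Equation 2.2 and of ranking database structures by their distance from a probe is shown below. The function names, compound labels and scaled descriptor values are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors (Equation 2.2)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_by_similarity(probe, database):
    """Return (distance, name) pairs sorted by increasing distance."""
    return sorted((euclidean(probe, vector), name)
                  for name, vector in database.items())


# Hypothetical range-scaled descriptor vectors.
probe = [0.2, 0.4]
database = {
    "cmpd_3": [1.0, 1.0],
    "cmpd_1": [0.2, 0.4],
    "cmpd_2": [0.5, 0.8],
}
ranking = rank_by_similarity(probe, database)
```

The structure at the top of the ranking is the one most similar to the probe; a cutoff on the distance then selects the candidates passed on to activity prediction.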
2.8.2 Applicability Domain of QSAR Model
Formally, a QSAR model can predict the target property for any compound
for which chemical descriptors can be calculated. In practice, a prediction is expected
to be reliable only when the compound is near the training set in descriptor space.
This nearness is measured by an appropriate distance metric (e.g. a molecular
similarity measure as applied to the classification of molecular structures). The
structural similarity between the bioactive compounds in the training set and the
compounds in the database was calculated, and a similarity threshold was introduced
to avoid making predictions for compounds that differ substantially from the training
set molecules. The threshold ($D_T$) was calculated from the training set as follows:
$$ D_T = \bar{y} + Z\sigma \qquad (2.3) $$
where $\bar{y}$ is the average Euclidean distance between compounds in the training set,
σ is the standard deviation of these Euclidean distances and Z is an arbitrary
parameter that controls the significance level. The default value of Z is 0.5 [56, 69].
If the distance of an external compound from at least one of its nearest neighbors in
the training set exceeds this threshold, its activity cannot be evaluated accurately,
and the compound was excluded from consideration [23].
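The threshold in Equation 2.3 can be sketched as follows. Several details are assumptions made for the illustration and are not taken from the source: the average and standard deviation are taken over all pairwise distances in the training set, the population standard deviation is used, and a candidate is tested against its single nearest training compound. The descriptor vectors are hypothetical.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def domain_threshold(training, z=0.5):
    """D_T = mean + Z * sigma over pairwise training-set distances."""
    distances = [euclidean(a, b) for a, b in combinations(training, 2)]
    return mean(distances) + z * pstdev(distances)


def in_domain(candidate, training, threshold):
    """True if the nearest training compound lies within the threshold."""
    return min(euclidean(candidate, t) for t in training) <= threshold


# Hypothetical range-scaled descriptor vectors for four training compounds.
training = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
d_t = domain_threshold(training)

near = in_domain([1.5, 1.5], training, d_t)
far = in_domain([3.0, 3.0], training, d_t)
```

In this toy example the first candidate lies close to a training compound and is kept, while the second lies well outside the training region and would be excluded from prediction.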
2.8.3 Biological Activity Predicted using QSAR Models
The QSAR technique consists of constructing a mathematical model that relates
the structure of molecules from a database to a property or a biological activity by
means of statistical tools. Once a correlation has been established, it can be used to
predict the property or biological activity of new molecules. The essential
characteristic of any QSAR model is its predictive power, defined as the ability of
the model to predict the activity of unknown compounds in the prediction set.
The biological activities (in this case, antibacterial and anti-tuberculosis
activity) of the compounds selected by database mining were predicted using the
QSAR models. The average of the activities predicted by the individual models was
used as the predicted MIC value. The success of the QSAR models was also assessed
by laboratory testing of the new compounds found in the database.
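The averaging of predictions across models can be sketched as below. The cutoff used to decide what counts as a hit is a hypothetical value introduced for the illustration, as are the function name and the predicted MIC values; a lower MIC is taken to mean higher activity.

```python
def consensus_hits(predictions, mic_cutoff):
    """Average the predicted MIC of each compound over all models.

    `predictions` maps a compound name to the MIC values predicted by
    each QSAR model. A compound is kept only if every model predicts an
    MIC at or below `mic_cutoff` (a hypothetical threshold).
    """
    hits = {}
    for name, values in predictions.items():
        if all(v <= mic_cutoff for v in values):
            hits[name] = sum(values) / len(values)
    return hits


# Hypothetical MIC predictions (ug/mL) from two models.
predicted = {
    "cmpd_1": [10.0, 20.0],
    "cmpd_2": [30.0, 80.0],
    "cmpd_3": [40.0, 50.0],
}
hits = consensus_hits(predicted, mic_cutoff=50.0)
```

Requiring agreement between all models before averaging is a conservative choice: a compound that only one model rates as active, such as `cmpd_2` here, is not reported as a hit.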
2.9 Laboratory Testing
Mining of the database using the best QSAR models yielded some probable
active compounds. To confirm the computational results, it was necessary to
perform laboratory experiments to test the biological activity of the selected
compounds against E. coli for the first data set and M. tuberculosis for the second
data set. A measure of the effectiveness of a chemotherapeutic agent against a
pathogen can be obtained from the minimum inhibitory concentration (MIC). MIC
testing can show which agents are most effective against a pathogen and gives an
estimate of the proper therapeutic dose. The agar diffusion method was used to
determine the response of E. coli and M. tuberculosis to the test agents.
2.9.1 Material and Method of Agar Diffusion
In this study, the activity of the compounds was tested against two
types of bacteria. The antibacterial activity was tested against E. coli BL21
(Promega) as a representative of gram-negative bacteria, while the anti-tuberculosis
activity was to be tested against M. tuberculosis as a representative of gram-positive
bacteria. However, M. tuberculosis is a pathogenic bacterium and it is highly risky to
use it in the laboratory. Therefore Rhodococcus sp., which is similar to
Mycobacterium since both belong to the same class (Actinobacteria) and order
(Corynebacterineae) [70], was used instead. Furthermore, it was readily available in
the Department of Biology, Universiti Teknologi Malaysia.
Some reagents were needed to measure the activity of the bacteria, such as
nutrient agar, which was used to grow the bacteria. Agar preparation included the
following steps [71]:
1. Nutrient agar (Sigma) was prepared from a commercially available dehydrated
form.
2. Immediately after autoclaving, it was allowed to cool in a 45 to 50 °C water bath.
3. The freshly prepared and cooled medium was poured into glass or plastic
flat-bottomed Petri dishes on a level, horizontal surface to give a uniform depth of
approximately 4 mm. This corresponds to 60 to 70 mL of medium for plates
with a diameter of 150 mm and 25 to 30 mL for plates with a diameter of 100
mm.
4. The agar medium was allowed to cool to room temperature and, unless the plate
was used the same day, it was stored in a refrigerator (2 to 8 °C).
5. Plates were used within seven days of preparation unless adequate precautions,
such as wrapping in plastic, had been taken to minimize drying of the agar.
6. A representative sample of each batch of plates was examined for sterility by
incubating at 25 to 30 °C for 24 hours or longer.
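As a rough check on the volumes quoted in step 3, the medium can be treated as a cylinder of depth h = 4 mm. The sketch below assumes that the inner diameter of the plate equals the nominal diameter d, which slightly overestimates the volume because the true inner diameter is a little smaller.

```latex
V = \pi \left(\frac{d}{2}\right)^{2} h

d = 150~\text{mm}: \quad V = \pi \,(7.5~\text{cm})^{2}\,(0.4~\text{cm}) \approx 70.7~\text{cm}^{3} = 70.7~\text{mL}

d = 100~\text{mm}: \quad V = \pi \,(5.0~\text{cm})^{2}\,(0.4~\text{cm}) \approx 31.4~\text{cm}^{3} = 31.4~\text{mL}
```

Both values sit just above the quoted ranges of 60 to 70 mL and 25 to 30 mL, which is what would be expected for plates whose inner diameter is slightly smaller than the nominal one.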
Antibacterial agents were accurately weighed and dissolved in the appropriate
diluents to yield the required concentration, using sterile glassware. Standard strains
from stock cultures were used to evaluate the antibacterial stock solution. If
satisfactory, the stock can be aliquoted in 5 mL volumes and frozen at -20 °C or
-60 °C.
Normal Whatman filter paper was used to prepare discs approximately 0.5 cm
in diameter, which were placed in a Petri dish and sterilized in a hot air oven. The
loop used for delivering the antibacterial agent was made of 20 gauge wire and had a
diameter of 2 mm; it delivers 10 µL of antibacterial solution to each disc. Each
treatment disc was placed with flame-sterilized forceps onto the inoculated plate with
sufficient space between the discs. No more than 8 or 9 discs (with one in the middle)
should be placed on a 100 mm diameter plate, in order to accommodate the resulting
zones of inhibition without significant overlap of adjacent zones.
During incubation, the agent diffuses from the filter paper into the agar. The
farther it travels from the filter paper, the lower its concentration. At
some distance from the disc the MIC is reached, and a zone of inhibition is thus
created. The diameter of the zone is proportional to the amount of antimicrobial
agent added to the disc.
CHAPTER 3
DEVELOPMENT OF QSAR MODELS AND DATABASE MINING
FOR ANTI BACTERIAL AGENTS
3.1 Introduction
This chapter describes the results of the development of QSAR models using
GAPLS and MLRA, and the application of the models to mining chemicals in a
database. It begins with the selection of descriptors, using a correlation matrix for
objective feature selection and a genetic algorithm for subjective feature selection.
The statistical results of the QSAR models built with the GAPLS and MLRA
techniques are described in the next section. This is followed by validation of the
QSAR models by predicting the biological activity of compounds that were not
included in the model development process. The results of mining chemicals in a
database are presented in section 3.6. The activity of the new active agents measured
using the agar diffusion method, and the related discussion, are described in the last
section.
3.2 Selection of Descriptors and Feature Selection
The development of molecular structure descriptors is the most important part
of any structure-activity investigation, because the descriptors must contain enough
information to permit the correct characterization of the compounds. Descriptors are
numerical values that characterize properties of molecules. Common descriptors
used for QSAR models include topological, geometrical and electronic descriptors.
In this study, 316 descriptors were calculated for the compounds in the data
set, but not all descriptors were used to develop the model. Once a set of suitably
distributed, non-correlated descriptors has been identified, it is necessary to decide
which should be incorporated into the QSAR equation. Feature selection was applied
to reduce the pool and select the best descriptors to include in the model development
process.
Descriptors were selected by correlating each descriptor with every other descriptor using
data reduction techniques [20, 33]. Redundant or highly correlated descriptors were removed
from the descriptor pool during objective feature selection. Redundancy lessens the
discriminating power of descriptors, thereby reducing their worth in model development.
The pairwise comparison yields a correlation matrix for all input variables. A coefficient
of 1.0 indicates that two variables are perfectly correlated, while a coefficient of 0.0
indicates no correlation. Pair-wise correlations were calculated for the members of the
descriptor pool, and one of the two descriptors was removed at random if their correlation
coefficient exceeded 0.90. The reduced descriptor pool used to develop the models reported
in this work contained 58 descriptors; it is summarized in Table 3.1. The correlation matrix
of these descriptors is presented in Table 3.2. This pool of descriptors was held constant
throughout the entire model building process.
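The pairwise pruning step described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the routine used in this work: the
function names, the toy descriptor values and the random seed are hypothetical, and the use
of the absolute value of the correlation coefficient is an assumption (the text only states
that the coefficient must exceed 0.90).

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / math.sqrt(var_a * var_b)


def prune_correlated(descriptors, cutoff=0.90, seed=0):
    """Return the names of descriptors kept after pairwise correlation pruning.

    descriptors: dict mapping descriptor name -> list of values (one per compound).
    For every pair whose |r| exceeds `cutoff` (assumption: absolute value),
    one of the two descriptors is dropped at random.
    """
    rng = random.Random(seed)
    names = list(descriptors)
    kept = list(names)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first not in kept or second not in kept:
                continue  # one member of the pair was already removed
            r = pearson(descriptors[first], descriptors[second])
            if abs(r) > cutoff:
                kept.remove(rng.choice([first, second]))
    return kept


# Hypothetical toy pool: "a" and "b" are almost perfectly correlated, "c" is not.
toy_pool = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.1],
    "c": [5, 1, 4, 2, 3],
}
print(prune_correlated(toy_pool))
```

With the toy pool, exactly one of "a" and "b" survives together with "c"; which of the two
is dropped depends on the random seed.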
Table 3.1: List of selected descriptors and their statistical analysis
Descriptor | X | S.d | Descriptor | X | S.d
Inertia moment 1 size | 133.30 | 81.83 | Inertia moment 2 length | 3.74 | 0.86
Inertia moment 3 length | 3.17 | 0.68 | Ellipsoidal volume | 1102.32 | 1432.25
Verloop B1 (sub. 1) | 1.63 | 0.11 | Verloop B1 (sub. 2) | 1.62 | 0.16
Verloop B1 (sub. 3) | 1.34 | 0.28 | Verloop B2 (sub. 1) | 1.87 | 0.41
Verloop B2 (sub. 2) | 1.71 | 0.28 | Verloop B3 (sub. 1) | 2.38 | 0.68
Verloop B3 (sub. 2) | 2.04 | 0.59 | Verloop B5 (sub. 1) | 3.22 | 1.43
Verloop B5 (sub. 2) | 2.37 | 0.85 | Verloop B5 (sub. 2) | 1.57 | 0.65
Total dipole moment | 2.71 | 1.57 | Dipole moment X | -0.42 | 2.29
Dipole moment Y | 0.02 | 1.74 | Dipole moment Z | -0.31 | 1.23
Log P | 3.41 | 2.04 | Total lipole | 4.22 | 3.50
Lipole X component | 0.33 | 4.34 | Lipole Y component | -1.02 | 2.60
Lipole Z component | -0.56 | 1.88 | Kier Chi6 (ring) | 0.08 | 0.06
Kier ChiV3 (ring) | 0.01 | 0.03 | Kier ChiV5 (ring) | 0.02 | 0.03
Kier ChiV6 (ring) | 0.04 | 0.04 | KAlpha 3 index | 4.21 | 3.41
Balaban topological | 1.79 | 0.39 | Vamp heat of formation | -98.32 | 85.45
Vamp LUMO | 0.03 | 0.68 | Vamp HOMO | -9.31 | 0.46
Vamp pol. XY | 0.94 | 2.26 | Vamp pol. XZ | 0.54 | 2.34
Vamp pol. YY | 39.57 | 11.33 | Vamp pol. YZ | -0.39 | 2.59
Vamp pol. ZZ | 31.81 | 13.59 | Vamp quadpole XX | 4.97 | 16.01
Vamp quadpole XY | 1.30 | 16.14 | Vamp quadpole XZ | -2.64 | 10.25
Vamp quadpole YY | -1.82 | 16.23 | Vamp quadpole YZ | -2.97 | 9.61
Vamp quadpole ZZ | -3.15 | 10.96 | Vamp octupole XXX | -58.83 | 177.63
Vamp octupole XXY | -13.19 | 70.52 | Vamp octupole XXZ | -0.88 | 61.7
Vamp octupole YYX | 9.84 | 103.40 | Vamp octupole YYY | -58.39 | 213.74
Vamp octupole YYZ | -12.50 | 50.36 | Vamp octupole ZZX | -6.28 | 74.26
Vamp octupole ZZY | 19.18 | 67.40 | Vamp octupole ZZZ | 3.44 | 109.72
Vamp octupole XYZ | -0.21 | 67.18 | ADME weight | 266.42 | 81.36
ADME H bond acceptors | 2.71 | 2.33 | ADME H bond donors | 0.96 | 1.29
ADME violation | 0.28 | 0.46 | Cosmic total energy | -87.71 | 115.50
3.3 Model Development Using MLRA Method
After descriptor generation, subsets of descriptors were examined to form predictive models
using two computational methodologies, i.e. MLRA and PLS. In multiple regression, a
selection algorithm is used to choose a subset of the input X variables [32]. It finds a
correlation between molecular structures and their corresponding property through a linear
combination of structural descriptors, and only the chosen descriptors are included in the
model. This means that only variables which appear to be highly significant are retained in
the final model.
The final model developed using the MLRA method and the results of some statistical tests
are presented in Table 3.3. These statistics are the key to judging the ability of the model
to predict the biological activity of the compounds in the prediction set.
Multiple regression calculates an equation describing the relationship between a single
dependent y variable and several explanatory x variables. The dependent variable in this
case is MIC. The best model generated using MLRA for the first data set has an r² value of
0.874 and an r²(CV) of 0.748. The equation is:
\[
\begin{aligned}
Y ={}& 0.021 \times \text{Verloop B1 (subst. 3)} + 0.002 \times \text{Lipole Z component} \\
&+ 0.063 \times \text{Kier Chi6 (ring)} + 0.0007 \times \text{Vamp Quadpole YZ} \\
&+ 0.007 \times \text{ADME H bond donors} + 0.005
\end{aligned}
\tag{3.1}
\]
Table 3.3: Statistical output of MLRA model
Statistical output | Value
r² | 0.874
Cross validation r²(CV) | 0.748
Residual sum of squares (RSS) | 2.452
Predictive sum of squares (PRESS) | 4.902
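The four statistics in Table 3.3 are linked by two simple relations, and checking them is a
quick sanity test of the reported values. The sketch below assumes the usual definitions
r² = 1 - RSS/TSS and r²(CV) = 1 - PRESS/TSS, where TSS is the total sum of squares of the
observed activities. These definitions are not stated explicitly in the text, so they are an
assumption here, and the function names are hypothetical.

```python
def total_sum_of_squares(r2, rss):
    """Recover TSS from the assumed relation r2 = 1 - RSS / TSS."""
    return rss / (1.0 - r2)


def cross_validated_r2(press, tss):
    """Assumed relation: r2(CV) = 1 - PRESS / TSS."""
    return 1.0 - press / tss


# Values taken from Table 3.3 (MLRA model).
tss = total_sum_of_squares(r2=0.874, rss=2.452)
q2 = cross_validated_r2(press=4.902, tss=tss)
print(round(tss, 2), round(q2, 3))
```

Under these assumptions the calculation returns an r²(CV) of about 0.748, in agreement with
the tabulated value, which suggests the table is internally consistent.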
Five variables were included in the QSAR model developed using the MLRA technique. These
descriptors are explained in Table 3.4. A plot of experimental vs. predicted MIC is shown in
Figure 3.1, while a plot of predicted value vs. standard residual is presented in Figure 3.2.
Although this is a very good model in terms of r², the cross-validated r² is noticeably
smaller, which could indicate an unstable model that might not be very useful for
prediction. Furthermore, the five-term equation will be very dependent on the trend of the
data in the training set. A brief explanation of the statistical analysis of the MLRA method
is given in Table 3.5.
Table 3.4: Descriptors which were included in the QSAR model developed using MLRA
Descriptor | Symbol | Explanation
Verloop parameter | Verloop B1 (substituent 3) | The smallest distance from the axis of the attachment bond, measured perpendicularly to the edge of the substituent.
Connectivity indices | Kier Chi6 (ring) | Numeric descriptor derived from molecular topology that reflects the atom identities, bonding environment and number of bonded hydrogens.
Electrostatic parameter | Vamp Quadpole YZ | Properties of a molecule arising from the interaction between a charge probe, such as a positive unit point charge representing a proton, and the target molecule.
ADME parameter | ADME H bond donor | Absorption, distribution, metabolism and excretion: number of H bond donors.
Figure 3.1: Plot of experimental vs. predicted MIC for MLRA model
Figure 3.2: Plot of predicted value vs. standard residual for MLRA model
Table 3.5: Statistical analysis of the MLRA method
Component | Explanation
S value | Standard error of the regression model. For a model with good predictive power, this is an estimate of how accurately the model predicts unknown y values.
F value | Derived from the sum of squares values and the degrees of freedom.
r² | The fraction of the total variance of the y variable that is explained by the regression equation. The closer the value is to 1.0, the better the regression equation explains the y variable.
r²(CV) | A key measure of the predictive power of the model. The closer the value is to 1.0, the better the predictive power. For a good model, r²(CV) should be fairly close to r² (it will usually be lower).
Residual sum of squares | The variance of the residuals not explained by the regression equation.
Predictive sum of squares | A measure of how well the fitted values of a subset model can predict the observed responses Yi.
3.4 Model Development Using PLS Method
To overcome the limitations of the MLRA model, the PLS technique was also used to develop
a QSAR model. PLS is insensitive to collinearity among the predictor variables and can
handle data sets in which the number of variables is larger than the number of observations
[72].
PLS analysis calculates an equation describing the relationship between one or more
dependent variables and a group of explanatory variables. PLS may also be used in exactly
the same way as MLRA: a single Y (dependent) variable and two or more X (independent)
variables are specified.
The PLS method was aided by the GA technique, which selected the descriptors to be included
in the model [15, 28]. The PLS routine in TSAR stops iterating when one of the following
criteria is met [30, 33]: the lowest value of PRESS is reached, or the PRESS value starts
to increase.
Table 3.6 shows the statistical output of GA-PLS for each dimension, and a plot of PRESS vs.
number of PLS components is shown in Figure 3.3. According to these data and the plot, the
best model generated using PLS for this data set has six components, with an r² of 0.968 and
an r²(CV) of 0.861. The high value of r²(CV) and the lowest value of PRESS indicate a model
that is more stable and more suitable for predicting compounds not included in the training
set.
Table 3.6: Statistical output of GA-PLS for each dimension
Statistical output | PLS dim 1 | PLS dim 2 | PLS dim 3 | PLS dim 4 | PLS dim 5 | PLS dim 6 | PLS dim 7
r² | 0.661 | 0.864 | 0.927 | 0.955 | 0.967 | 0.968 | 0.968
r²(CV) | 0.397 | 0.605 | 0.824 | 0.853 | 0.858 | 0.861 | 0.855
RSS | 8.474 | 3.393 | 1.819 | 1.112 | 0.824 | 0.781 | 0.781
PRESS | 15.070 | 9.851 | 4.417 | 3.681 | 3.539 | 3.471 | 3.634
Figure 3.3: Plot PRESS vs. No. of component
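The stopping rule described for the PLS routine (keep adding components until PRESS stops
decreasing) can be illustrated with the PRESS values of Table 3.6. The sketch below is only
an illustration of that rule, not the TSAR implementation; the function name
`select_components` is hypothetical.

```python
# PRESS values for PLS dimensions 1 to 7, taken from Table 3.6.
press_by_dim = {1: 15.070, 2: 9.851, 3: 4.417, 4: 3.681,
                5: 3.539, 6: 3.471, 7: 3.634}


def select_components(press):
    """Return the last dimension reached before PRESS starts to increase."""
    dims = sorted(press)
    best = dims[0]
    for dim in dims[1:]:
        if press[dim] >= press[best]:
            break  # PRESS no longer decreases: stop adding components
        best = dim
    return best


print(select_components(press_by_dim))
```

For the values in Table 3.6 the rule selects six components, which matches the model chosen
in the text.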
The statistical output of the PLS model is shown in Table 3.7, the descriptors included in
the QSAR model built with the PLS technique are listed in Table 3.8, and a plot of
experimental vs. predicted MIC is shown in Figure 3.4.
Table 3.7: Statistical output of PLS model
Statistical output | Value
Fraction of variance (r²) | 0.9687
Cross validation r²(CV) | 0.8611
Residual sum of squares (RSS) | 0.7817
Predictive sum of squares (PRESS) | 3.4714
Figure 3.4: Plot of experimental vs. predicted MIC for PLS model
This plot displays the activity predicted by the QSAR model against the experimentally
measured (observed) activity. The data are plotted as a scatter plot, where each point
represents one compound of the data set. Ideally, the points form a straight line. From the
plot (Figure 3.4), it can be concluded that the PLS technique has generated a QSAR model
with a high degree of accuracy, as confirmed by the close spread of the points around the
ideal line.
Table 3.8: Descriptors which were included in the QSAR model developed using PLS
Descriptor | Symbol | Explanation
Verloop parameter | Verloop B1 (substituent 3) | The smallest distance from the axis of the attachment bond, measured perpendicularly to the edge of the substituent.
Molecular attributes | Lipole Z component | Measure of the lipophilic distribution. It is calculated using the substituent's point of attachment as the origin, with this bond placed along the x-axis.
Connectivity indices | Kier Chi6 (ring) | Numeric descriptor derived from molecular topology that reflects the atom identities, bonding environment and number of bonded hydrogens.
Connectivity indices | Kier ChiV3 (ring) | Numeric descriptor index derived from the number of skeletal neighbors of each atom, including information about atomic identities.
Electrostatic parameter | Vamp polarization XY | Calculated by an optional semi-empirical molecular orbital routine that performs structure optimization.
ADME parameter | ADME H bond donor | Absorption, distribution, metabolism and excretion: number of H bond donors.
The plot of predicted value vs. standard residual is presented in Figure 3.5.
The residuals are the difference between predicted and observed activities.
According to this plot the residuals were evenly distributed and there was no
observation that can be considered as an outlier.
Figure 3.5: Plot of predicted value vs. standard residual for PLS model
3.5 Model Validation
It is important to evaluate the robustness and the predictive capacity (validity) of a model
before using it for interpretation and for prediction of biological activity. The purpose of
model validation is to confirm that the model can predict the biological activities of
compounds that were not used to build it.
The models were validated by predicting the MIC of the compounds in the prediction set. The
calculated MIC values are shown in Table 3.9, and the correlation coefficients (r²) between
the predicted and experimental values were also calculated for both models.
The high value of r² (0.88) between the calculated and experimental values indicated that
both models were stable and capable of predicting the anti bacterial activity of compounds
not included in the model development process.
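The external validation statistic can be recomputed from the values listed in Table 3.9. The
sketch below assumes that r² here means the squared Pearson correlation between the
experimental and calculated MIC values; the text does not state the exact formula, so this
is an assumption, and the function name is hypothetical.

```python
import math

# Experimental and calculated MIC values for compounds 1-28 (Table 3.9).
experimental = [0.10, 0.07, 0.06, 0.06, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05,
                0.05, 0.05, 0.05, 0.05, 0.05, 0.04, 0.04, 0.04, 0.04, 0.04,
                0.03, 0.03, 0.03, 0.03, 0.026, 0.025, 0.025, 0.020]
pls = [0.091, 0.064, 0.061, 0.047, 0.059, 0.049, 0.054, 0.051, 0.044, 0.043,
       0.042, 0.057, 0.052, 0.051, 0.049, 0.041, 0.032, 0.047, 0.039, 0.036,
       0.027, 0.033, 0.035, 0.025, 0.027, 0.035, 0.025, 0.020]
mlra = [0.089, 0.067, 0.058, 0.055, 0.056, 0.048, 0.055, 0.053, 0.042, 0.046,
        0.040, 0.048, 0.055, 0.048, 0.046, 0.045, 0.034, 0.044, 0.036, 0.039,
        0.031, 0.034, 0.034, 0.024, 0.023, 0.034, 0.027, 0.023]


def squared_pearson(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


print("PLS  r2:", round(squared_pearson(experimental, pls), 2))
print("MLRA r2:", round(squared_pearson(experimental, mlra), 2))
```

Because the exact statistic used in the thesis is not specified, the numbers printed by this
sketch should be read as a cross-check rather than as a reproduction of the reported 0.88.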
Table 3.9: Calculated MIC for compounds in the prediction set
Compound No. | Experimental MIC | Calculated MIC (PLS) | Calculated MIC (MLRA)
1 | 0.10 | 0.091 | 0.089
2 | 0.07 | 0.064 | 0.067
3 | 0.06 | 0.061 | 0.058
4 | 0.06 | 0.047 | 0.055
5 | 0.05 | 0.059 | 0.056
6 | 0.05 | 0.049 | 0.048
7 | 0.05 | 0.054 | 0.055
8 | 0.05 | 0.051 | 0.053
9 | 0.05 | 0.044 | 0.042
10 | 0.05 | 0.043 | 0.046
11 | 0.05 | 0.042 | 0.040
12 | 0.05 | 0.057 | 0.048
13 | 0.05 | 0.052 | 0.055
14 | 0.05 | 0.051 | 0.048
15 | 0.05 | 0.049 | 0.046
16 | 0.04 | 0.041 | 0.045
17 | 0.04 | 0.032 | 0.034
18 | 0.04 | 0.047 | 0.044
19 | 0.04 | 0.039 | 0.036
20 | 0.04 | 0.036 | 0.039
21 | 0.03 | 0.027 | 0.031
22 | 0.03 | 0.033 | 0.034
23 | 0.03 | 0.035 | 0.034
24 | 0.03 | 0.025 | 0.024
25 | 0.026 | 0.027 | 0.023
26 | 0.025 | 0.035 | 0.034
27 | 0.025 | 0.025 | 0.027
28 | 0.020 | 0.020 | 0.023
3.6 Application of QSAR Models to Database Mining
One popular computational approach to rational drug discovery is database mining, which
relies on the structures of known active molecules as queries. Applications of QSAR can be
extended to any molecular design purpose, including environmental sciences, prediction of
different kinds of biological activity by correlation of congeneric series of compounds, lead
compound optimization, classification, diagnosis and elucidation of mechanisms of drug
action, and prediction of novel structural leads in drug discovery.
The developed QSAR models were capable of predicting the anti bacterial activity of the 28
excluded compounds in the prediction set with a high degree of accuracy. In the next stage,
the models were applied to search for biologically active compounds in a large database.
Potentially active compounds in the database were selected based on their similarity to the
active compounds in the training set (Table 3.10). Compounds that demonstrated a minimum
inhibitory concentration (MIC) of 64 µg/mL or lower [62] were selected as the similarity
probes for database mining.
Table 3.10: List of probe compounds for database mining
[Structure drawings are not reproduced here.]
No | MIC (µg/mL)
1 | 0.1
2 | 0.07
3 | 0.06
4 | 0.06
5 | 0.05
6 | 0.05
7 | 0.05
8 | 0.05
9 | 0.05
10 | 0.05
11 | 0.05
12 | 0.05
13 | 0.05
14 | 0.05
15 | 0.05
16 | 0.04
17 | 0.04
18 | 0.04
19 | 0.04
20 | 0.04
21 | 0.03
22 | 0.03
23 | 0.03
24 | 0.03
25 | 0.026
26 | 0.025
27 | 0.025
28 | 0.02
3.6.1 Application of QSAR Models in AmicBase Database Mining (without scaling)
Mining of a database containing a large number of compounds was used to discover new active
anti bacterial agents. Similarity searching, which measures inter-molecular structural
similarity [74], was the concept used to search for new agents in the database.
Twenty eight compounds with anti bacterial activity (MIC) of less than 64 µg/mL were selected
as the similarity probes for database mining; their structures and activities are shown in
Table 3.10. The degree of similarity, based on the Euclidean distance between the active
compounds in the data set (28 compounds) and those in the database, was calculated using the
same set of descriptors used in the QSAR models. Out of 3339 compounds in the AmicBase
database [67], 659 compounds were found to be within the chosen similarity cutoff value (0.5
Euclidean distance units) of any of the 28 probes. These compounds were then subjected to the
consensus hits criterion (i.e. selection using the descriptors from both models), which
left only 16 compounds.
Finally, after applying the applicability domain criterion, only three compounds were
selected and their anti bacterial activity was predicted. The complete process and the
number of compounds selected at each step are shown in Figure 3.6.
[Flowchart: 3339 compounds -> Euclidean distance (Dij < 0.5) -> 659 compounds -> Consensus
hits -> 16 compounds -> Applicability domain -> 3 compounds]
Figure 3.6: Flowchart to select new compounds in AmicBase Database
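The first two screening steps (distance cutoff and consensus hits) can be sketched as
follows. This is a minimal illustration, not the procedure as implemented in this work: the
compound identifiers and descriptor vectors are hypothetical, and reading "consensus hits"
as the intersection of the hit lists obtained with the descriptors of each model is an
assumption.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def screen(database, probes, cutoff=0.5):
    """Return ids of database compounds closer than `cutoff` to any probe."""
    return [compound_id for compound_id, vector in database.items()
            if any(euclidean(vector, probe) < cutoff for probe in probes)]


def consensus(hits_model_a, hits_model_b):
    """Assumed reading of 'consensus hits': compounds found with both models."""
    return sorted(set(hits_model_a) & set(hits_model_b))


# Hypothetical two-descriptor example with a single probe at the origin.
probes = [(0.0, 0.0)]
database = {"cpd_a": (0.1, 0.2), "cpd_b": (1.0, 1.0), "cpd_c": (0.3, 0.3)}
hits = screen(database, probes)
print(hits)
print(consensus(hits, ["cpd_c", "cpd_d"]))
```

In practice one hit list would be computed per model (MLRA descriptors and PLS descriptors)
before taking the consensus.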
The final stage of applying the QSAR models in database mining is to confirm their ability
to predict the biological activity of the compounds selected from the database. Three
compounds were selected from the AmicBase database, and their predicted anti bacterial
activities are shown in Table 3.11.
Table 3.11: Selected compounds with predicted MIC value
No. of compound in database | Compound | Predicted MIC (µg/mL)
1515 | eugenol methyl ether | 29.72
1893 | 4-methyl guaiacol | 37
2488 | m-cresol | 37.75
Based on the model calculations, the predicted anti bacterial activity values (MIC) of these
selected compounds were less than 64 µg/mL. This indicates that the QSAR models were able to
select compounds that have the same properties as the active compounds in the training set.
Their predicted MIC values also suggest that these compounds are able to inhibit the growth
of E. coli at very low concentrations. Two of these compounds (eugenol methyl ether and
m-cresol) were chosen as test compounds to determine their MIC values by laboratory analysis.
3.6.2 Application of QSAR Models in AmicBase Database Mining (with scaling)
Active compounds in the training set were used as probes to calculate the degree of
similarity with compounds in the database. Euclidean distance was employed to measure this
similarity, using the same set of descriptors that appeared in the QSAR models. Scaling was
done to reduce the risk of overfitting [68].
The Euclidean distance between each probe and every compound in the database [67] was
calculated using the descriptors that appeared in the QSAR models. A total of 1138 compounds
in the database were within the chosen similarity cutoff value (0.5 Euclidean distance units
in multidimensional descriptor space). This initial list was further refined by selecting
consensus hits, i.e. those found using the descriptors from both models, which reduced the
number of candidates to 80 compounds.
Subsequently, the 80 compounds were subjected to the applicability domain criterion, i.e. a
similarity threshold, and the number of possible candidates was narrowed down further to 8
compounds. The applicability domain is specific to each QSAR model: if the distance of a
compound in the database from at least one of its neighbors in the training set exceeds this
threshold, the prediction is considered unreliable and the compound is rejected. Figure 3.7
summarizes the steps used to obtain new lead molecules from the AmicBase database.
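Range scaling followed by an applicability domain check can be sketched as below. Several
points are assumptions rather than details given in the text: range scaling is taken to be
(x - min) / (max - min) with the minimum and maximum from the training set, the distance to
the nearest training compound is used as the domain measure, and the threshold is a
free parameter. All names and numbers in the example are hypothetical.

```python
import math


def fit_range(training):
    """Per-descriptor minimum and maximum over the training set."""
    columns = list(zip(*training))
    return [min(c) for c in columns], [max(c) for c in columns]


def range_scale(vector, lows, highs):
    """Scale each descriptor to the training range (assumed formula)."""
    return [(x - lo) / (hi - lo) if hi > lo else 0.0
            for x, lo, hi in zip(vector, lows, highs)]


def nearest_distance(vector, scaled_training):
    """Euclidean distance to the closest training compound."""
    return min(math.dist(vector, t) for t in scaled_training)


def in_domain(vector, scaled_training, threshold):
    """Accept a compound only if it lies within `threshold` of the training set."""
    return nearest_distance(vector, scaled_training) <= threshold


# Hypothetical training set with two descriptors on very different scales.
training = [[100.0, 1.0], [200.0, 3.0], [300.0, 2.0]]
lows, highs = fit_range(training)
scaled_training = [range_scale(row, lows, highs) for row in training]

candidate = range_scale([150.0, 2.0], lows, highs)
outlier = range_scale([1000.0, 10.0], lows, highs)
print(in_domain(candidate, scaled_training, threshold=0.6))
print(in_domain(outlier, scaled_training, threshold=0.6))
```

Without scaling, the first descriptor in this example would dominate every distance simply
because its values are two orders of magnitude larger than those of the second.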
The rigorously validated QSAR models were used to predict the anti bacterial activity of new
molecules, i.e. to screen a large group of molecules with unknown activity. Usually, the
prediction model, built with parameters calculated for a well-characterized training set, is
applied to the unknown test set. If the training set is a sufficiently representative
pattern of the system, it can be assumed that the introduction of new elements with an
unknown property will not affect the stability of the model and that confident prediction
can be attempted. The structures and predicted activities of these 8 compounds are shown in
Table 3.12.
Structure-activity relationship studies can provide information on the activity of new
candidate molecules from database mining [74]. Based on the QSAR calculation, the predicted
anti bacterial activity values (MIC) of these selected compounds were less than 64 µg/mL.
This indicated that the QSAR models were capable of finding compounds that have the same
properties as the active compounds in the training set. Their predicted MIC values also
suggest that these compounds are able to inhibit the growth of E. coli at low
concentrations.
[Flowchart: Database (3339 compounds) and training set active compounds (28 compounds) ->
Generate descriptors -> Range scaling -> Euclidean distance < 0.5 -> Initial hits (1138
compounds) -> Appear in both models -> Consensus hits (80 compounds) -> Applicability
domain -> 8 compounds]
Figure 3.7: Flowchart to select new compounds in AmicBase Database
Table 3.12: Selected compounds with their predicted biological activity
No. | Compound | Predicted MIC
145 | 2-cis-6-cis-Farnesol | 0.047
185 | (5-Isopropenyl-2-methyl-cyclohex-1-enyl)-acetic acid | 0.057
283 | 3,5a,9-Trimethyl-3a,5,5a,9b-tetrahydro-3H,4H-naphtho[1,2-b]furan-2,8-dione | 0.038
444 | Methyl 4-hydroxy-3-(3-methyl-but-2-enyl)-benzoate | 0.037
675 | Benzal acetylacetone | 0.040
814 | 2-Benzylideneglutaraldehyde | 0.051
1106 | Dec-3-enoic acid | 0.033
1201 | tridecanoic acid | 0.034
3.7 Experimental Validation
The QSAR models predicted the minimum inhibitory concentration of the molecules chosen from
the database. The agar diffusion technique was used to confirm the predicted activity (MIC
value) of these compounds. The concentrations of the test compounds were varied around the
predicted range, which served as the reference for measuring the activity. Ampicillin was
used as the positive control, while distilled water was used as the negative control.
Table 3.13 shows the results of the laboratory analysis for the compounds that were selected
without scaling.
Table 3.13: MIC value of selected compounds (without scaling) using agar diffusion method
No | Compound | MIC (µg/mL) | Zone diameter (mm)
1 | eugenol methyl ether | >128 | 0.9
2 | m-cresol | 38-50 | 1.0
One of the selected compounds (i.e. eugenol methyl ether) was not active against E. coli
BL21, but it might be active against other strains of E. coli or other gram negative
bacteria, depending on the strain of bacteria that was used to measure the MIC values of the
active compounds in the training set.
In general, the QSAR model was able to predict the activity of the hit compounds from
database mining. For m-cresol, the difference between the MIC value obtained using the agar
diffusion method and the MIC predicted by the QSAR models was not large, and both values were
less than 64 µg/mL. m-Cresol can therefore be classified as an active compound, because it
was able to inhibit the growth of E. coli BL21 and attack its cell wall at low
concentration. Figure 3.8 shows the inhibition zones of the hit compounds from database
mining.
Figure 3.8: Inhibition zone of E. coli using: (a) m-cresol and (b) eugenol methyl ether
Basically, the QSAR models were able to choose compounds that are similar to the active
compounds in the training set and could also predict the anti bacterial activity of these
compounds against E. coli. To verify this by laboratory analysis, however, it would be
better to use the same strain of E. coli that was used to determine the MIC values of the
compounds in the training set.
Laboratory testing was also done for the hit compounds selected using scaling, to confirm
the biological activity predicted by the QSAR models. Table 3.14 shows the biological
activity measured using the agar diffusion method.
Table 3.14: MIC value of selected compounds (with scaling) using agar diffusion method
No | Compound | MIC (µg/mL) | Zone diameter (mm)
1 | (5-Isopropenyl-2-methyl-cyclohex-1-enyl)-acetic acid or linalyl acetate | 47 | 0.25
2 | 2-cis-6-cis farnesol | 0.348 | 0.80
3 | Methyl 4-hydroxy-3-(3-methyl-but-2-enyl)-benzoate or ethoxycinnamate | >62.5 | No inhibition
4 | Benzal acetylacetone | 0.049 | 3.40
5 | tridecanoic acid | >62.5 | No inhibition
Five of the eight compounds selected using scaling were taken for laboratory testing. One of
them, tridecanoic acid, was not tested because it was not soluble in water. As shown in
Table 3.14, three compounds (i.e. linalyl acetate, 2-cis-6-cis farnesol and benzal
acetylacetone) were able to inhibit the growth of E. coli BL21 at low concentration. The
QSAR model was able to predict the MIC values of these compounds accurately, as confirmed by
the laboratory testing. The inhibition zones of these selected agents are shown in Figure
3.9.
Figure 3.9: Inhibition zone of E.coli using selective compounds
3.8 Effects of Range Scaling and Applicability Domain to Search New Agents
The combination of range scaling and applicability domain in QSAR models applied to mining
chemicals in a large database was found to be effective in accurately predicting MIC values.
Active compounds in the training set were used, through the similarity concept, to search
for active agents in the database. The descriptors included in the QSAR models span a wide
range of values; scaling was therefore needed to keep descriptors with large values from
dominating the others.
QSAR models developed using the MLRA and GAPLS techniques have a specific applicability
domain. This domain can be thought of as the area in which the model is applicable, i.e.
where the prediction of activity will be reliable. Screened compounds with distances larger
than the applicability domain were rejected because they were expected to be 'different'
from the majority of the active compounds in the training set. Mining chemicals in a large
database without applying criteria such as these will result in the discovery of new agents
that fail the laboratory test.
CHAPTER 4
DEVELOPMENT OF QSAR MODELS AND DATABASE MINING
FOR ANTI TUBERCULOSIS AGENTS
4.1 Introduction
This chapter presents the results of the development of QSAR models using the anti
tuberculosis data set, followed by their application in database mining. The results of the
data analysis are presented and described in the next five sections. Section 4.2 presents
the descriptors chosen to generate the QSAR models. Section 4.3 describes the statistical
analysis of the QSAR models generated using the MLRA and GAPLS techniques. Validation of
both QSAR models by predicting the anti tuberculosis activity of the compounds in the
prediction set is presented in the next section. Section 4.5 describes the application of
the QSAR models to discover new active agents against M. tuberculosis. Finally, the chapter
presents the activity of the new agents measured using the agar diffusion method.
4.2 Descriptors Generation and Objective Feature Selection
Numerical descriptors that encode topological, electronic and geometric features of each
molecule were calculated using the descriptor generation routines in TSAR. Initially, 316
descriptors were generated for the compounds in the data set. Objective feature selection
was carried out to remove descriptors that contain identical information or that are highly
correlated with other descriptors. A descriptor was removed if it had the same value for
over 90% of the training set compounds [21]. Furthermore, highly correlated descriptors
provide nearly identical information, and only one of them is needed for model development.
Pair-wise correlations were examined to remove descriptors that were highly correlated.
Objective feature selection reduced the number of descriptors available for QSAR model
development to 56, which are summarized in Table 4.1. The correlation matrix of these
descriptors is presented in Table 4.2.
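The first filter described above (removing descriptors that take the same value for over 90%
of the training compounds) is simple to express in code. The following is a minimal sketch
with hypothetical function names and toy data, not the TSAR routine itself.

```python
from collections import Counter


def is_near_constant(values, max_fraction=0.90):
    """True if one value accounts for more than `max_fraction` of the entries."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values) > max_fraction


def drop_near_constant(descriptors, max_fraction=0.90):
    """Keep only descriptors that vary enough across the training set."""
    return {name: values for name, values in descriptors.items()
            if not is_near_constant(values, max_fraction)}


# Hypothetical pool for 20 compounds: "ring_count" is zero for 19 of them.
pool = {
    "ring_count": [0] * 19 + [1],
    "log_p": [0.1 * i for i in range(20)],
}
print(sorted(drop_near_constant(pool)))
```

A descriptor with the same value for exactly 90% of the compounds is kept, following the
"over 90%" wording of the criterion.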
Table 4.1: List of selected descriptors and their statistical analysis
Descriptor | X | S.d | Descriptor | X | S.d
Molecular volume | 253.18 | 85.93 | Inertia moment 1 size | 169.17 | 99.03
Inertia moment 2 size | 727.15 | 596.04 | Inertia moment 1 length | 19.14 | 23.36
Inertia moment 2 length | 3.83 | 0.78 | Inertia moment 3 length | 3.20 | .064
Ellipsoidal volume | 876.79 | 735.96 | Verloop L (sub. 1) | 3.65 | 1.09
Verloop L (sub. 2) | 3.04 | 0.76 | Verloop L (sub. 3) | 2.52 | 0.61
Verloop B1 (sub. 1) | 1.61 | 0.08 | Verloop B1 (sub. 2) | 1.51 | 0.25
Verloop B1 (sub. 3) | 1.25 | 0.28 | Verloop B2 (sub. 1) | 1.79 | 0.36
Verloop B2 (sub. 2) | 1.61 | 0.43 | Verloop B3 (sub. 1) | 2.06 | 0.63
Total dipole moment | 3.53 | 1.90 | Dipole moment X | 0.67 | 2.58
Dipole moment Y | 0.59 | 1.87 | Dipole moment Z | 0.18 | 2.30
Log P | 3.81 | 2.41 | Total lipole | 5.27 | 3.96
Lipole X component | 2.40 | 4.49 | Lipole Y component | -0.99 | 3.18
Lipole Z component | 0.58 | 2.53 | Kier ChiV6 (path) | 3.10 | 2.38
Kier Chi3 (ring) | 0.04 | 0.11 | Kier Chi5 (ring) | 0.06 | 0.06
Kier Chi6 (ring) | 0.07 | 0.07 | Kappa 2 (index) | 6.73 | 3.21
Balaban topological | 1.86 | 0.71 | Vamp heat of formation | -122.60 | 62.82
Vamp LUMO | 0.73 | 1.07 | Vamp HOMO | -9.76 | 0.53
Vamp pol. XY | -1.22 | 2.09 | Vamp pol. XZ | -0.02 | 2.10
Vamp pol. YZ | 0.76 | 2.11 | Vamp quadpole XX | -2.98 | 16.52
Vamp quadpole XY | -4.97 | 11.92 | Vamp quadpole XZ | 0.28 | 11.24
Vamp quadpole YY | 1.55 | 12.74 | Vamp quadpole YZ | -1.79 | 10.48
Vamp quadpole ZZ | 1.43 | 11.28 | Vamp octupole XXX | 12.18 | 254.73
Vamp octupole XXY | -11.51 | 68.14 | Vamp octupole XXZ | 1.97 | 66.53
Vamp octupole YYX | -1.86 | 86.57 | Vamp octupole YYY | 24.58 | 145.93
Vamp octupole YYZ | 3.02 | 50.11 | Vamp octupole ZZX | 0.32 | 64.91
Vamp octupole ZZY | 1.78 | 45.03 | Vamp octupole ZZZ | 10.60 | 98.94
Vamp octupole XYZ | 7.76 | 36.76 | ADME H bond acceptors | 2.75 | 1.16
ADME H bond donors | 1.01 | 1.05 | Cosmic total energy | -10.30 | 95.27
4.3 Development of QSAR Model Using MLRA Method
Mathematical structure-activity relationships quantify the connection between the structures
and the properties of molecules. The relationships are expressed as mathematical models that
allow the prediction of properties from structural parameters [26]. Multi-parameter
regression analysis has been used in QSAR studies on a series of analogues of the
tuberculosis drug isonicotinic acid hydrazide [75].
The best QSAR model developed using the MLRA technique has an r² of 0.772 and an r²(CV) of
0.729. The equation is:
\[
\begin{aligned}
Y ={}& -0.671 \times \text{Inertia moment 1 length} + 16.389 \times \text{Verloop L (subst. 2)} \\
&- 144.683 \times \text{Verloop B1 (subst. 3)} - 10.412 \times \text{Dipole moment Y component} \\
&+ 8.853 \times \text{ADME H bond donors} + 207.345
\end{aligned}
\tag{4.1}
\]
A summary of the model statistics is provided in Table 4.3. The MLRA method requires at least
as many molecules as independent variables. However, to produce reliable results and to
minimize collinearities and the possibility of chance correlations, the ratio of compounds
to variables should typically be at least five to one [76]. When the number of independent
variables is greater than the number of molecules, MLRA cannot be applied. Brief
descriptions of the descriptors included in the QSAR model are given in Table 4.4.
A QSAR model developed using the MLRA technique can be accepted if it has an r²(CV) greater
than 0.5 and an r² greater than 0.6 [73]. The present model meets both criteria and is
therefore still capable of predicting the activities of compounds that were not included in
the model development process. A plot of experimental vs. predicted MIC is shown in Figure
4.1, while a plot of standard residual vs. predicted value (residual plot) is presented in
Figure 4.2.
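The two rules of thumb quoted in this section (at least five compounds per variable, and
r² > 0.6 with r²(CV) > 0.5) can be collected into a small check. The function names are
hypothetical and the compound count used in the example is invented purely for
illustration; only the r² and r²(CV) values come from Table 4.3.

```python
def ratio_is_adequate(n_compounds, n_variables, min_ratio=5.0):
    """Compounds-to-variables ratio should be at least `min_ratio` to one."""
    return n_compounds / n_variables >= min_ratio


def model_is_acceptable(r2, r2_cv, r2_min=0.6, r2_cv_min=0.5):
    """Acceptance thresholds quoted in the text: r2 > 0.6 and r2(CV) > 0.5."""
    return r2 > r2_min and r2_cv > r2_cv_min


# r2 and r2(CV) from Table 4.3; 40 compounds is a hypothetical training set size.
print(model_is_acceptable(r2=0.772, r2_cv=0.729))
print(ratio_is_adequate(n_compounds=40, n_variables=5))
```

Both checks are necessary rather than sufficient conditions, so a model passing them still
needs external validation on a prediction set.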
Table 4.3: Statistical output of MLRA model
Statistical output | Value
r² | 0.772
Cross validation r²(CV) | 0.729
Residual sum of squares (RSS) | 2.409
Predictive sum of squares (PRESS) | 2.856
Table 4.4: Descriptors which were included in the MLRA model
Descriptor class | Symbol | Explanation
Molecular attributes | Inertia moment 1 length | Indicates the strength and orientation behavior of a molecule in an electrostatic field.
Verloop parameter | Verloop L (subst. 2) | The maximum length of the substituent along the axis of the bond between the first atom of the substituent and the parent molecule.
Verloop parameter | Verloop B1 (subst. 3) | The smallest distance from the axis of the attachment bond, measured perpendicularly to the edge of the substituent.
Molecular attributes | Dipole moment Y component | The moments are calculated using the substituent's point of attachment as the origin, with this bond placed along the x-axis.
ADME parameter | ADME H bond donor | Absorption, distribution, metabolism and excretion: number of H bond donors.
Figure 4.1: Plot of experimental vs. predicted MIC for MLRA model
Figure 4.2: Plot of predicted value vs. standard residual for MLRA model
4.4
Development of QSAR Model Using PLS Technique
PLS is a model development technique of particular interest in QSAR because,
unlike MLRA, it can analyse data that are strongly collinear, noisy or contain
numerous X variables [72]. PLS is therefore able to investigate complex
structure-activity problems, to analyse data in a more realistic way, and to interpret
how molecular structure influences biological activity. PLS can also be used quite
effectively as a tool for interpreting QSAR models, and the information extracted is
much more detailed than that obtained by simply considering the overall model equation.
The Genetic Algorithm (GA) technique was used to select the descriptors for the
second data set, which consisted of compounds with moderate to high activity against
M. tuberculosis. The statistics for each dimension of GA-PLS are shown in Table 4.5.
The PLS model with three dimensions was selected because it has the lowest PRESS
value and the highest r2 (CV) value. The resulting QSAR model was stable and can be
used for predicting the activity of compounds that were not included in the training set.
Table 4.5: Statistical plot output of GA-PLS for each dimension
Statistical output | PLS dim1 | PLS dim2 | PLS dim3 | PLS dim4
r2 | 0.818 | 0.819 | 0.819 | 0.820
r2 (CV) | 0.798 | 0.799 | 0.801 | 0.796
RSS | 0.780 | 0.776 | 0.777 | 0.772
PRESS | 0.873 | 0.860 | 0.857 | 0.876
A plot of PRESS vs. number of PLS components is shown in Figure 4.3. The best
QSAR model was the one with the lowest PRESS value and the highest r2 (CV) value
(PLS dim3). The highest r2 (CV) value indicated that this PLS model has high
predictive power for the activity of compounds not included in the training set. In
the last component (PLS dim4), by contrast, the PRESS value increased, which suggests
that the PLS model with four dimensions was no longer stable.
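The selection rule described above can be written as a few lines of Python. This is
only a sketch of the criterion (lowest PRESS, checked against highest r2 (CV)) using
the values reported in Table 4.5; the variable and function names are hypothetical.

```python
# PRESS and r2 (CV) values from Table 4.5, keyed by number of PLS dimensions
press_by_dim = {1: 0.873, 2: 0.860, 3: 0.857, 4: 0.876}
r2_cv_by_dim = {1: 0.798, 2: 0.799, 3: 0.801, 4: 0.796}

def select_dimension(press):
    """Return the number of PLS dimensions with the lowest PRESS."""
    return min(press, key=press.get)

best_dim = select_dimension(press_by_dim)
# The same dimension should also have the highest cross-validated r2
agrees_with_r2_cv = best_dim == max(r2_cv_by_dim, key=r2_cv_by_dim.get)
```

Both criteria point to the model with three dimensions, and the rise in PRESS at the
fourth dimension is the signal that adding further components no longer improves
prediction.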
Figure 4.3: Plot of PRESS vs. number of components
The combination of GA and PLS produced a model with an r2 value of 0.819 and an
r2 (CV) value of 0.801 using three PLS components. The statistical diagnostics of
the model are shown in Table 4.6, a plot of experimental vs. predicted MIC is shown
in Figure 4.4, and a plot of standard residual vs. predicted value is shown in
Figure 4.5. Prior to the acceptance of a final model, PLS analysis was performed to
ensure that the model was not overfit. An overfit model can predict the activities of
the training set but may not accurately predict the activity of unknown samples [77].
Table 4.6: Statistic of the PLS model
Parameter | Value
Fraction of variance (r2) | 0.819
Cross validated r2 (CV) | 0.801
Residual sum of square (RSS) | 0.776
Predictive sum of square (PRESS) | 0.857
Based on the summary of statistical tests for the PLS technique (Table 4.6), the
high value of r2 (CV) indicated a stable model that is able to predict the activity
of compounds that were not included in the training set. Although the PLS model was
slightly better, both models can be used as predictive models in database mining.
Brief descriptions of the descriptors included in the PLS model are given in
Table 4.7.
Figure 4.4: Plot of experimental vs. predicted MIC for PLS model
Figure 4.5: Plot of predicted value vs. standard residual for PLS model
Table 4.7: Descriptors which were included in the PLS model
Descriptor class | Symbol | Explanation
Molecular attributes | Inertia moment 1 length | Indicates the strength and orientation behaviors of molecule in an electrostatic field.
Verloop parameter | Verloop B1 (subst. 3) | The smallest distance from the axis of the attachment bond, measured perpendicularly to the edge of the substituent.
Molecular attributes | Dipole Y component | The moments are calculated using the substituent's point of attachment as an origin, with this bond placed along the x-axis.
4.5 Model Validation
The most widely used way to determine the stability of a predictive model is to
analyse the influence of each of its elements upon the final model. Any model, even
one with excellent goodness of fit and satisfactory predictions, may lack a real
relationship between structural descriptors and activity. Because such chance
correlations can exist, a reliable validation procedure must be carried out. The
definitive validity of the model is examined by means of external validation, which
evaluates how well the equation generalizes.
Both models were validated by predicting the anti tuberculosis activity of 61
compounds excluded from the model development process (prediction set) (Table
4.8). The correlation coefficient (r2) between predicted and experimental values
was also calculated. The high value of r2 (0.93) indicated that both models were
capable of predicting the activity of unknown compounds in the prediction set.
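As an illustration of this external validation step, the following Python sketch
computes the squared correlation between experimental and predicted MIC values. It
assumes that r2 is the squared Pearson correlation computed on the MIC values
themselves (not on a log scale), which the text does not state. The six data points
are compounds 22, 32, 42, 49, 56 and 61 from Table 4.8 (PLS predictions), chosen only
for illustration, so the result is not expected to reproduce the reported 0.93.

```python
import numpy as np

def squared_correlation(experimental, predicted):
    """Squared Pearson correlation between two equally long sequences."""
    r = np.corrcoef(experimental, predicted)[0, 1]
    return float(r ** 2)

# Illustrative subset of Table 4.8: experimental MIC and PLS-calculated MIC
experimental_mic = [96.0, 32.0, 16.0, 8.0, 2.0, 0.25]
pls_predicted_mic = [98.4, 33.1, 11.2, 7.4, 1.2, 0.2]

r2_subset = squared_correlation(experimental_mic, pls_predicted_mic)
```

Running the same function over all compounds of the prediction set, once for each
model, would give the kind of r2 value quoted above.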
Table 4.8: Calculated MIC for compounds in the prediction set
Compound No. | Experimental MIC | Calculated MIC (PLS) | Calculated MIC (MLRA)
1 | 128 | 92.3 | 92.3
2 | 128 | 140.3 | 140.3
3 | 128 | 122.2 | 122.2
4 | 128 | 108.7 | 108.7
5 | 128 | 133.8 | 133.8
6 | 128 | 120.2 | 120.1
7 | 128 | 104.1 | 104.1
8 | 128 | 96.1 | 96.1
9 | 128 | 128.7 | 128.6
10 | 128 | 123.5 | 123.5
11 | 128 | 107.0 | 107.0
12 | 128 | 117.2 | 117.2
13 | 128 | 83.6 | 83.6
14 | 128 | 126.5 | 126.5
15 | 128 | 126.9 | 126.9
16 | 128 | 106.4 | 106.4
17 | 128 | 108.5 | 108.5
18 | 128 | 108.7 | 108.7
19 | 128 | 84.1 | 84.0
20 | 128 | 76.1 | 76.2
21 | 128 | 127.8 | 127.8
22 | 96.0 | 98.4 | 98.4
23 | 64.0 | 52.9 | 52.8
24 | 64.0 | 64.0 | 62.9
25 | 64.0 | 64.7 | 64.7
26 | 64.0 | 70.0 | 69.0
27 | 64.0 | 65.8 | 65.8
28 | 64.0 | 67.9 | 67.9
29 | 64.0 | 61.1 | 61.1
30 | 64.0 | 53.6 | 53.6
31 | 64.0 | 59.2 | 59.2
32 | 32.0 | 33.1 | 33.2
33 | 32.0 | 26.3 | 26.4
34 | 32.0 | 35.0 | 35.0
35 | 32.0 | 26.5 | 26.5
36 | 32.0 | 75.7 | 75.8
37 | 32.0 | 54.3 | 54.3
38 | 32.0 | 38.5 | 38.5
39 | 32.0 | 38.5 | 38.5
40 | 32.0 | 38.8 | 38.8
41 | 20.0 | 26.2 | 26.3
42 | 16.0 | 11.2 | 11.2
43 | 16.0 | 12.1 | 12.2
44 | 16.0 | 13.6 | 13.5
45 | 16.0 | 10.7 | 10.6
46 | 16.0 | 18.6 | 18.6
47 | 16.0 | 17.9 | 17.9
48 | 15.0 | 12.4 | 12.4
49 | 8.0 | 7.4 | 7.4
50 | 8.0 | 7.7 | 7.7
51 | 8.0 | 9.5 | 9.5
52 | 8.0 | 7.4 | 7.3
53 | 7.3 | 12.4 | 12.5
54 | 5.6 | 8.1 | 8.2
55 | 4.0 | 20.2 | 20.2
56 | 2.0 | 1.2 | 1.2
57 | 2.0 | 6.5 | 6.5
58 | 2.0 | 1.8 | 1.2
60 | 1.0 | 1.1 | 1.1
61 | 0.25 | 0.2 | 0.2
Based on the predicted MIC values for each model in Table 4.8, the combination
of GA and PLS produced better predictions than MLRA, although the difference
between the predicted values of MLRA and GA-PLS was not large. The RSS and
PRESS values of the MLRA model were higher than those of GA-PLS, indicating that
the MLRA model has higher residuals (differences between actual and predicted values)
and was not as good as the PLS model at predicting the activity of unknown compounds.
4.6 Application of QSAR Models to Database Mining
QSAR models can be used in database mining, i.e. finding molecular structures
that are similar to the probe molecules and/or predicting the activities of the
compounds in a database [74]. A QSAR model with a high degree of accuracy can be
used as a means of screening compounds from existing databases for anti
tuberculosis activity. Alternatively, the variables selected by QSAR optimization
can be used for similarity searches to improve the performance of database mining
methods. In this study, the effect of range scaling (applied before calculation of
the Euclidean distances) on the molecular structure and properties of the new lead
compounds found by database mining was also examined.
4.6.1 Application of QSAR Models in AmicBase Database Mining (Without Scaling)
The applicability of the QSAR models to mining chemicals in a database was
tested. This stage began with the generation of descriptors for all compounds in
the database, using the same set of descriptors that appeared in the QSAR model.
Euclidean distances between the 32 probe compounds and the 3339 compounds in the
database were calculated to measure their similarity. A distance of 0.5 units in
multidimensional descriptor space was chosen as the similarity cutoff value,
resulting in 36 compounds.
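The similarity search can be sketched in Python as follows. The sketch assumes that
a database compound counts as a hit when its Euclidean distance to at least one probe
compound is below the cutoff; the text does not spell out this rule, and the function
name is hypothetical.

```python
import numpy as np

def similarity_hits(probes, database, cutoff=0.5):
    """Return indices of database compounds within `cutoff` of any probe.

    probes:   array of shape (n_probes, n_descriptors)
    database: array of shape (n_database, n_descriptors)
    """
    probes = np.asarray(probes, dtype=float)
    database = np.asarray(database, dtype=float)
    # Pairwise Euclidean distances, shape (n_database, n_probes)
    diff = database[:, None, :] - probes[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))
    nearest = distances.min(axis=1)          # distance to the closest probe
    return np.flatnonzero(nearest < cutoff), nearest
```

In this study, the descriptor columns would be the descriptors of one QSAR model, so
the search is run once per model and the two hit lists are compared afterwards.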
The initial list was further refined by selecting consensus hits, i.e. molecules
found with both models, which reduced it to 18 compounds. The anti tuberculosis
activity of these 18 consensus hits was predicted using the two best QSAR models,
each of which has its own applicability domain criteria. This step produced four
compounds (579, 1792, 2399 and 2918), which are summarized in Table 4.9. The list
of probe compounds is presented in Table 4.10.
Table 4.9: Selected compounds with their predicted anti tuberculosis activity
AmicBase entry No. | Name of compound | Predicted MIC (µg/mL)
579 | 3-hexanol | 48.8
1792 | 2-isobutyl-4,5-dimethyl-phenol | 24.2
2399 | Vanillin | 17.0
2918 | Cineole | 16.4
[Structure drawings not reproduced.]
Table 4.10: List of probe compounds for database mining
[Table of 32 probe compounds listing compound number, structure and MIC (µg/mL). The
structure drawings are not reproduced. The listed MIC values range from 62.5 µg/mL
down to 0.25 µg/mL.]
Based on Table 4.9, the QSAR models were able to search for new lead compounds
and predict their biological activity, and all of the selected compounds can be
classified as active agents against M. tuberculosis. To validate these results, it
was necessary to measure the biological activity of these agents experimentally.
Three of these compounds (3-hexanol, vanillin and cineole) were chosen as test
compounds against gram positive bacteria (e.g. M. tuberculosis, Rhodococcus sp)
[47].
4.6.2 Application of QSAR Models in AmicBase Database Mining (With Scaling)
New compounds with high activity against M. tuberculosis can be found by
applying the QSAR models to mining chemicals in a database (i.e. AmicBase) [67],
which consisted of 3339 chemicals. The similarity search was based on Euclidean
distances between active plant terpenoids and the compounds in the database, using
the set of descriptors that appeared in the QSAR model.
An active plant terpenoid has an MIC value of less than 64 µg/mL [21, 62], and
there were 32 plant terpenoids in the training set with this level of activity
(Table 4.10). The similarity cutoff value was set to 0.5 units and a total of 545
compounds were short-listed.
If the values of the descriptors in the QSAR models differ significantly in
magnitude, the descriptors with larger values will dominate the Euclidean distance
calculation. Scaling was therefore needed to prevent one descriptor from dominating
another.
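A minimal Python sketch of range scaling is given below. It assumes the common form
(x - min) / (max - min), with the minimum and maximum of each descriptor taken from a
reference set (for example the probe compounds) and then applied to the other set.
The text does not state which set defines the range, so this choice and the function
name are assumptions.

```python
import numpy as np

def range_scale(reference, other):
    """Scale each descriptor to [0, 1] using the range of `reference`.

    Returns the scaled reference set and the scaled `other` set.
    """
    reference = np.asarray(reference, dtype=float)
    other = np.asarray(other, dtype=float)
    low = reference.min(axis=0)
    span = reference.max(axis=0) - low
    span[span == 0] = 1.0      # constant descriptor: avoid division by zero
    return (reference - low) / span, (other - low) / span
```

After scaling, a descriptor measured in hundreds and one measured in tenths contribute
on the same footing to the Euclidean distance, which is why the same cutoff of 0.5
units selects a different set of compounds than in the unscaled search.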
Out of the 545 compounds selected as initial hits, 12 compounds appeared in both
models. The anti tuberculosis activity of these 12 consensus hits was predicted
using the two best QSAR models, each of which has its own applicability domain
criteria. This step produced 5 compounds. Figure 4.6 summarizes the steps used to
discover new lead compounds against M. tuberculosis.
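The consensus step amounts to intersecting the two hit lists. The short Python sketch
below shows this with made-up database entry numbers; the identifiers and the function
name are hypothetical and are not taken from the study.

```python
def consensus_hits(hits_model_a, hits_model_b):
    """Return the database entries that appear in the hit lists of both models."""
    return sorted(set(hits_model_a) & set(hits_model_b))

# Hypothetical entry numbers, for illustration only
hits_pls = [3, 7, 11, 20]
hits_mlra = [7, 20, 42]
shared = consensus_hits(hits_pls, hits_mlra)
```

Only the entries common to both lists are carried forward to the applicability domain
check.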
The anti tuberculosis activity of these selected compounds was predicted using
both QSAR models. The structures and predicted MIC values of these compounds are
shown in Table 4.11. Based on the predicted MIC values, the five compounds can be
classified as active compounds that are able to prevent the growth of
M. tuberculosis. Laboratory testing (agar diffusion method) of the biological
activity of the selected compounds was needed to confirm this and to ensure the
applicability of the QSAR models.
[Flowchart. Inputs: database (3339 compounds; one box is labelled "3319 compounds")
and training set (61 compounds), of which 32 are active compounds. Steps: generate
descriptors; range scaling; Euclidean distance < 0.5; initial hits (545 compounds);
appear in both models; consensus hits (12 compounds); applicability domain;
5 compounds.]
Figure 4.6: Step to select new compounds against M. tuberculosis
Table 4.11: Selected Compounds with their predicted MIC value
No. | Compound | Predicted MIC
1061 | Geranyl geraniol | 5.7774
1437 | 8,9a-Dihydroxy-3,6,9-trimethylene-decahydro-azuleno[4,5-b]furan-2-one | 60.602
2181 | 2-(3,7,11-Trimethyl-dodecyl)-hydroquinone | 8.5275
2393 | 3,6,9a-Trimethyl-3a,4,5,6,6a,7,9a,9b-octahydro-3H-azuleno[4,5-b]furan-2-one | 56.737
3101 | 1-[(2-Hydroxy-ethyl)-methyl-amino]-dodecan-2-ol | 61.707
[Structure drawings not reproduced.]
4.6.3 Effects of Applicability Domain to Search New Agents
QSAR models based on the mechanism of action approach tend to rely on
expert judgment to define the domain. The applicability domain may be defined in
terms of general properties, or on a much more detailed structural basis for
specific toxicities. For a prediction to be valid, the compound must fall within the
applicability domain of the models [78].
Both QSAR models (GA-PLS and MLRA) consisted of a set of descriptors, which
were used to measure similarity. The applicability domain can be used to ensure
that the prediction for a new compound is reliable. A compound whose distance is
larger than the applicability domain has properties that are not similar to those
of the active compounds in the QSAR model. Such compounds must therefore be
rejected.
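One way such a distance-based applicability domain could be implemented is sketched
below in Python. The threshold formula (mean plus z times the standard deviation of
the nearest-neighbour distances within the training set) is an assumption borrowed
from common practice in the QSAR literature; the text does not give the formula or the
threshold values used for each model, and all names are hypothetical.

```python
import numpy as np

def domain_threshold(training, z=0.5):
    """Assumed threshold: mean + z * std of nearest-neighbour distances."""
    training = np.asarray(training, dtype=float)
    diff = training[:, None, :] - training[None, :, :]
    distances = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(distances, np.inf)      # ignore distance to itself
    nearest = distances.min(axis=1)
    return float(nearest.mean() + z * nearest.std())

def within_domain(candidates, training, threshold):
    """True for candidates whose nearest training compound is within threshold."""
    candidates = np.asarray(candidates, dtype=float)
    training = np.asarray(training, dtype=float)
    diff = candidates[:, None, :] - training[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return nearest <= threshold
```

Candidates for which `within_domain` returns False would be rejected, since their
predicted activity is an extrapolation beyond the compounds used to build the model.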
4.7 Experimental Validation
Rigorously validated QSAR models are confirmed if they are able to predict the
biological activity of unknown compounds in the prediction set and can be used to
search for new molecules in database mining and predict their anti tuberculosis
activity [76]. M. tuberculosis is a pathogenic bacterium that is very dangerous to
humans; activity testing was therefore carried out on Rhodococcus sp, which has
properties similar to those of M. tuberculosis. Furthermore, Rhodococcus sp was much
easier to purchase and more readily available in the Department of Biology, Faculty
of Science, Universiti Teknologi Malaysia.
Table 4.12 shows the biological activity of the selected compounds (without
scaling) determined using the agar diffusion method. Ampicillin was used as the
positive control and distilled water as the negative control [70]. The inhibition
zones of active and inactive agents are shown in Figure 4.7. These test compounds did
not inhibit Rhodococcus bacteria around the active concentration. The MIC values of
these molecules were more than 128 µg/mL, indicating that they cannot be classified
as active agents. The QSAR models predicted them as inactive agents.
Figure 4.7: Inhibition zone of active agents and inactive agents
Table 4.12: MIC value of selected compounds (without scaling) using agar
diffusion method
No | Compound | MIC (µg/mL) | Diameter zone (mm)
1 | 3-hexanol | >128 | No inhibition
2 | Vanillin | >128 | No inhibition
3 | Cineole | 128 | No inhibition
[Structure drawings not reproduced.]
The agar diffusion method, with ampicillin as the positive control and distilled
water as the negative control, was also used to determine the MIC values of the
compounds that were selected using range-scaled descriptors. Three of the selected
compounds, i.e. geranyl geraniol, 8,9a-dihydroxy-3,6,9-trimethylene-decahydro-azuleno[4,5-b]furan-2-one
and 2-(3,7,11-trimethyl-dodecyl)-hydroquinone, were chosen as test compounds.
Table 4.13 presents the minimum inhibition concentrations of these selected compounds.
2-(3,7,11-Trimethyl-dodecyl)-hydroquinone was not tested because it was not
soluble in water, but the other two compounds, i.e. geranyl geraniol and
leucomicine, were confirmed as active agents against M. tuberculosis with MIC values
of less than 64 µg/mL. This showed that applying the QSAR models with range scaling
of the descriptors prior to the Euclidean distance calculation made it possible to
find similar compounds with accurate biological activity predictions.
Table 4.13: MIC value of selected compounds (with scaling) using agar diffusion
method
No | Compound | MIC (µg/mL) | Diameter zone (mm)
1 | Geranyl geraniol | 25 | 3.20
2 | 8,9a-Dihydroxy-3,6,9-trimethylene-decahydro-azuleno[4,5-b]furan-2-one or leucomicine | 32 | 3.90
3 | 2-(3,7,11-Trimethyl-dodecyl)-hydroquinone or phenantren | >128 | No inhibition
[Structure drawings not reproduced.]
CHAPTER 5
CONCLUSIONS AND RECOMMENDATION
5.1 Introduction
This chapter presents the conclusions of this study. The first section summarizes
the research findings in relation to the research objectives. The next section
addresses the limitations of the study and the last section presents potential
areas for future research.
5.2 Conclusion
The main objective of this study was to develop QSAR models that correlate
the biological activity of chemical compounds found in natural products with their
chemical structure. The models were used to search for new active agents against
M. tuberculosis and E. coli through database mining.
The quantitative structure activity relationship (QSAR) approach can be used to
develop models with high predictive power for the activity of compounds that are
not included in the training set. Very good models that correlate structural
descriptors with anti bacterial and anti tuberculosis activity have been developed
using genetic algorithm-partial least squares (GA-PLS) and multiple linear regression
analysis (MLRA). Better models (in terms of predictive ability) were produced when a
genetic algorithm (GA) was used to select the descriptors in the model development
process.
QSAR models with a high degree of accuracy were applied to screening and
searching for new active agents in a large database using the structural similarity
concept, i.e. using Euclidean distance to measure similarity. The variables selected
in the QSAR models (descriptors) were used to measure the similarity between active
compounds in the data set and those in the database. The domination of descriptors
with significantly larger values was eliminated by using range scaling.
The applicability domain was used in the last step of database mining to make
sure that the predictions for new compounds are reliable. The applicability domain
has a specific value for each model, and it can be used to reject compounds that
are not similar to the active compounds.
The biological activities of the selected compounds were calculated using the
QSAR models. The predicted values were later compared with experimental values.
Using the agar diffusion method, geranyl geraniol and leucomicine were confirmed as
new agents with high potential to inhibit the growth of Rhodococcus sp (which has
characteristics similar to those of M. tuberculosis) at low concentration. In
addition, 2-cis-6-cis farnesol and pentanedione were confirmed as new anti bacterial
agents from the database. Based on these results, the QSAR concept can be used in
the development of new drugs in the pharmaceutical industry.
5.3 Limitation of the Study
Some limitations and weaknesses were found during the course of this research.
Structure entry and molecular modeling is the first step in the QSAR approach. A
long time was required to optimize the energy of the molecular structures in the data
set and to generate some of the electrostatic descriptors. The same can be said
about the feature selection process, especially objective feature selection. The
main problem is how to select the set of descriptors that should be included in the
model development process and how to reject the poor descriptors. Due to the large
number of descriptors that can be generated, this step requires a lot of judgment
from the researcher and cannot simply be automated.
5.4 Future Research Recommendation
Future studies on QSAR models and database mining could emphasize the
development of new methodology to improve model accuracy, since the quantitative
agreement between actual and predicted biological activity (i.e. anti bacterial, anti
tuberculosis) is not excellent for all compounds. In principle, approaches other than
MLRA or PLS, such as KNN or any other rigorous model building technique, could
also be adopted for this kind of study.
The main concept in the database mining process is that similar compounds have
similar biological activities. In this study, the degree of similarity was determined
using the Euclidean distance calculated from the descriptors that appeared in the
QSAR models. Other similarity measures could be applied in future research; one
popular measure is the Tanimoto coefficient [79].
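For comparison with the Euclidean distance used in this study, the Tanimoto
coefficient can be sketched in Python as follows. The sketch assumes that each
molecule is represented by the set of structural features (fingerprint bits) it
contains, which is a different representation from the continuous descriptors used in
the QSAR models; the function name is hypothetical.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient of two feature sets: |A and B| / |A or B|."""
    a, b = set(features_a), set(features_b)
    union = len(a | b)
    if union == 0:
        return 0.0   # convention chosen here for two empty fingerprints
    return len(a & b) / union
```

The coefficient ranges from 0 (no shared features) to 1 (identical feature sets), so a
similarity cutoff would be applied as a lower bound rather than as an upper bound on
distance.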
For future studies on the specific application to anti bacterial and anti
tuberculosis agents, we propose examining derivatives of the compounds identified
by database mining and developing QSAR models for these compounds. It is hoped
that more active agents can be discovered from these derivatives.
REFERENCES
1. Said, I. M. Sebatian Semula Jadi daripada Tumbuhan : Potensi, Prospek dan
Kenyataan. Bangi.: Penerbit Universiti Kebangsaan Malaysia. 1995.
2. Kawai, T., Kinoshita, K. and Takahashi, K. Anti-emtic Principles of Magnolia
obovota and Zingiber officinale rhizome. Planta Med. 1994. 60: 17-20.
3. Sirat, H. M., Hong, L. F. and Khaw, S. H. Chemical Composition of the Essential
Oil of the Fruits of Amomum tetraceum ridl. J. Essent. Oil Res. 2001. 13: 86.
4. Besalu, E., Ponec, R., Vicente, J. Virtual Generation of Agents against
Mycobacterium tuberculosis. A QSAR study. Mol. Diversity. 2003. 6: 107-120.
5. Parvu, L. QSAR-a Piece of Drug Design. J. Cell. Mol. Med. 2003. 7(3):333-335.
6. W. J. Dunn lll, Quantitative Structure Activity Relationships in Chemical and
Biochemical System. Chemom. Intell. Lab. Syst. 1989. 6: 181-190.
7. Gozalbes, R., Doucet, J. P., Derouin, F. Application of Topological Descriptors
in QSAR and Drug Design: History and New trends. Current Drug Targets:
Infect. Disord. 2002. 2: 93-102.
8. Selassie, C. D. History of Quantitative Structure Activity Relationship. Burger’s
Medicinal Chemistry and Drug Discovery 6th. ed. New York: Wiley Interscience.
2003.
9. Bevan, D. R. QSAR and Drug Design. Netsci Home Page,
http://www.netsci.org/science/compchem/fetaure12.html (accessed 5th January
2004).
10. http://www.tdx.cesca.es/tesis.UDG/available/tdx-1210104-133736/tags2de4.pdf
(accessed 26th November 2004).
11. Stuper, A. J., Brugger, W. E., Jurs, P. C. Computer Assisted Studies of Chemical
Structure and Biological Function. New York: Wiley Interscience. 1979.
12. Gasteiger, J., Engel, T. Chemoinformatic. Weinhein: Wiley-VCH GmbH and Co.
KgaA. 2003.
13. Kovatcheva, A., Golbraikh, A., Oloff, S., Xiao, Y. D., Zheng, W., Wolschan, P.,
Buchbauer, G., and Tropsha, A. Combinatorial QSAR of Ambergris Fragnance
Compounds. J. Chem. Inf. Comput. Sci. 2004. 44: 582-595.
14. Sutherland, J. J., Weaver, D. F. Development of Quantitative Structure-Activity
Relationships and Classification Models for Anticonvulsant Activity of
Hydantoin Analogues. J. Chem. Inf. Comput. Sci. 2003. 43: 1028-1036.
15. Mattioni, B. E. The Development of QSAR Model for Physical Property and
Biological Activity Prediction of Organic Compounds. Ph.D Thesis.
Pennsylvania State University; 2003.
16. Mishra, R. K. Getting Discriminant Functions of Antibacterial Activity from
Physicochemical and Topological Parameter. J. Chem. Inf. Comput. Sci. 2001.
41: 387-393.
17. Gasteiger, J. Handbook of Chemoinformatics. Vol.3. Weinheim: Wiley VCH
verlag GmbH and Co. 2003.
18. Kier, L. B., Hall, L. H. The Meaning of Molecular Connectivity: a Biomolecular
Accessibility Model. Croat. Chem. Acta. 2003. 75 (2): 371-382.
19. Liu, S., Cao, C., Li, Z. Approach to Estimation and Prediction for Normal
Boiling Point (NBP) of Alkanes Based on a Novel Molecular Distance-Edge
(MDE) Vector, λ. J. Chem. Inf. Comput. Sci. 1998. 38: 387-394.
20. Waterbeemd, H. V. D. Chemometric Methods in Molecular Design. Weinheim:
Wiley VCH verlag GmbH and Co. 1995.
21. Wessel, M. D. Computer-Asisted Development of Quantitative Structure
Property Relationships and Design of Feature Selection Routines. Ph.D thesis.
Pennsylvania State University; 1997.
22. Cho, D. H., Lee, S. K., Kim, B. T., No, K. T. Quantitative Structure-Activity
Relationship (QSAR) Study of New Fluorovinyloxycetamides. Bull. Korean
Chem. Soc. 2001. 22(4): 388-394.
23. Shen, M., LeTiran, A., Xiao, Y., Golbraikh, A., Kohn, H., Tropsha, A.
Quantitative Structure Activity Relationship Analysis of Functionalized Amino
Acid Anticonvulsant Agents Using k Nearest Neighbor and Simulated Annealing
PLS Methods. J. Med. Chem. 2002. 45: 2811-2823.
24. Tropsha, A., Zheng, W. Identification of the Descriptor Pharmacophores Using
Variable Selection QSAR: Applications to Database Mining. Curr. Pharm. Des.
2001. 7: 599-612.
25. Sutter, J. M., Kalivas, J. H., Jurs, P. C. Automated Descriptors Selection for
Quantitative Structure Activity Relationship Using Generalized Simulated
Annealing. J. Chem. Inf. Comput. Sci. 1995. 35: 77-84.
26. Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridon, R. P., Feuston, B. P.
Random Forest : A Classification and Regression Tool for Compound
Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003. 43: 1947-1958.
27. Dianati, M., Song, I., and Treiber, M. An Introduction to Genetic Algorithm and
Evolution Strategies.
http://www.swen.uwaterloo.com/~mdianati/articles.pdf (access on 2th May 2004).
28. Daren, Z. QSPR Studies of PCBs by the Combination of Genetic Algorithm and
PLS Analysis. J. Comp. Chem. 2001. 25: 197-204.
29. Leardi, R. Genetic Algorithm in Chemometrics and Chemistry: a Review. J.
Chemom. 2001. 15: 559-56.
30. Hawkins, D. M., Basak, S. C., and Shi, X. QSAR with Few Compounds and
Many Features. J. Chem. Inf. Comput. Sci. 2001. 41: 663-670.
31. Srivastava, M. S. Methods of Multivariate Statistics. New York: John Wiley &
Sons. Inc. 2002.
32. Kutner, M. M., Nachtsheim, C. J., Neter. J. Applied Linear Regression Models.
New York: MC. GrawHill. 2004.
33. Oxford Molecular. TSAR 3.3 for Windows Reference Guide. UK: Oxford
Molecular, Ltd. 2000.
34. Accelrys Home Page, http://www.accelrys.com/tool/QSAR (accessed 6th January
2004).
35. Tobias, R. D. An Introduction to Partial Least Squares Regression.
http://support.sas.com/techsup/technote/ts509.pdf (accessed 14th June 2004).
36. Beebe, K. R., Pell, R. J., Seasholtz, M. B. Chemometrics, a Practical Guide. New
York: Wiley Interscience. 1998.
37. Li, M. J., Jiang, C., Li, M. Z., You, T. P. QSAR Studies of 20(S)-Campotechin
Analogues as Antitumor Agents. J. Mol. Struct: THEOCHEM, 2005. 723: 165-170.
38. Ragno, R., Marshall, G. R., Santo, R. D., Costu, R., Massa, S., Rompei, R.,
Artico, M. Antimycobacterial Pyroles: Synthesis, Anti Mycobacterium
tuberculosis Activity and QSAR Studies. J. Bioorg. Med. Chem. 2000. 8: 1423-1432.
39. Montanari, M. L. C., Beezer, A. E., Montanari, C. A and Verloso, D. P. QSAR
Based on Biological Microcalorimetry. J. Med. Chem. 2000. 43: 3448-3452.
40. Wang, X., Yin, C., Wang, L. Structure Activity Relationship and Response
Surface Analysis of Nitro Aromatics Toxicity to the Yeast (Sacharomyces
cerevicae). Chemosphere. 2002. 46: 1045-1051.
41. Xu, S., Nirmalakhandan, N. Use of QSAR Models in Predicting Joint Effect in
Multi-Component Mixtures of Organic Chemicals. J. Wat. Res. 1998. 32: 2391-2399.
42. Gramatica, P., Pilutti, P., and Papa, E. Validated QSAR Prediction of OH
Trophosperic Degradation of VOCs, Splitting Into Training-Test Set and
Consensus Modeling. J. Chem. Inf. Comput. Sci. 2004. 44: 1794-1802.
43. Waller, C. H. A Comparative QSAR Study Using CoMFA, HQSAR and
FRED/SKEYS Paradigms for Estrogen Receptor Binding Affinities of
Structurally Diverse Compounds. J. Chem. Inf. Comput. Sci. 2004. 44: 758-765.
44. Du Toit, K., Elgorashi, E. E., Malan, S. F., Drewes, S. E., Van Staden, J.,
Crouch, N. R., Mulholland, D. A. Anti-Inflammatory Activity and QSAR Studies
of Compounds Isolated from Hyacinthaceae Species and Tachiadenus
longiflorus grisb. (gentianaceae). J. Bioorg. Med. Chem. 2005. 13: 2561-2568.
45. Tong, W., Xie, Q., Hong, H., Shi, L., Fang, H., Perkins, R. Assessment of
Prediction Confidence and Domain Extrapolation of Two Structure Activity
Relationship Models for Predicting Estrogen Receptor Binding Activity. Environ.
Health Perspectives. August 2004. 112(2): 1249-1254.
46. Cruez, A. F. TB Returns as Number One Infectious Killer Disease. Prime News,
Tuesday, April 5, 2005.
47. Henderson, B., Wilson, M., McNab, R., Lax, A. J. Celular Microbiology. New
York: John Wiley & Sons. 1999.
48. http://www.mckinley.edu/health -info/dis-cond/tb/TB.html (accessed 7th October
2004).
49. Wang, X., Dong. Y., Wang. L. and Han. S. Acute Toxicity of Substituted Phenols
to Rana japonica Tadpoles and Mechanism-Based Quantitative Structure
Activity Relationship (QSAR) study. Chemosphere. 2001. 44: 447-455.
50. European Committee for Antimicrobial Susceptibility Testing (EUCAST) of the
European Society of Clinical Microbiology and Infectious Diseases (ESCMID).
Determination of Minimum Inhibition Concentrations (MICs) of Antibacterial
Agents by Agar Dilution. J. Clin. Microb. Infect. 2000. 6: 509-515.
51. Collins, C. H., Lyne, P. M., and Grange, J. M. Microbial Methods. London:
Butterworth-Heinemann court, Jordan Hill. 1989.
52. Tabatabaei, R. R., Nasirian, A. Isolation, Identification and Antimicrobial
Resistance Patterns of E. coli Isolated from Chicken Flocks. J. Pharmacol. Exp.
Ther. 2003. 2: 39-42.
53. http://cwx.prenhall.com/horton/medialib/media_portfolio/text (accessed 29th July
2005).
54. Schneider, G. Neural Networks are Useful Tools for Drug Design. Neural
Network. 2000. 13: 15-16.
55. Hoffman, B. T., Kopajtic, T., Katz, J. L., Newman, H. 2D QSAR Modeling and
Preliminary Database Searching for Dopamine Transporter Inhibitors Using
Genetic Algorithm Variable Selection of Molconn Z Descriptors. J. Med. Chem.
2000. 43: 4151-4159.
56. Shen, M., Beguin, C., Golbraikh, A., Stables, J. P., Kohn, H., Tropsha, A.
Application of QSAR Models to Database Mining; Identification and
Experimental Validation of Novel Anticonvulsant Compounds. J. Med. Chem.
2004. 47: 2356-2364.
57. Shen, M. Implementation and Application of Machine Learning Algorithm in
Computer-Assisted Drug Design. Ph.D Thesis. University of North California;
2003.
58. Fang, X., Shao, L., Zhang, H., Wang, S. Web- Based Tools for Mining the NCI
Databases for Anticancer Drug Discovery. J. Chem. Inf. Comput. Sci. 2004. 44:
249-257.
59. Cheng, L.L. Kandungan Kimia dan Bioaktiviti daripada Spesies Premna, Vitex,
Lantana dan Macaranga. M.Sc. Tesis. Universiti Teknologi Malaysia; 2002.
60. Ramalu, J.C.D. Kajian Sebatian Semula Jadi daripada Empat Spesies Piper.
M.Sc. Tesis. Universiti Teknologi Malaysia; 1999.
61. Jamil, S. Komponen Semula Jadi Bagi Spesies Curcuma dan Boesen bergia
(Zingiberaceae). M.Sc. Tesis. Universiti Teknologi Malaysia; 1997.
62. Cantrell, C. L., Franzblau, S.G., and Fischer. N.H. Antimycobacterial Plant
Terpenoids. Planta Med. 2001. 67: 685-694.
63. Senese, C.L., Hopfinger, A.J. A Simple Clustering Technique to Improve QSAR
Model Selection and Predictivity Application to a Receptor Independent
4D-QSAR Analysis of Cyclic Urea Derived Inhibitors of HIV-1 Protease. J. Chem.
Inf. Comput. Sci. 2003. 43: 2180-2193.
64. Leardi, R. and Gonzalez, A.L. Genetic Algorithms Applied to Feature Selection
in PLS Regression: How and When to Use Them. Chemom. Intell. Lab. Syst.
1998. 41: 195-20.
65. Bourin, N., Mozziconacci, J.C., Arnoult, E., chavatte,P., Marot, C., Allory, L.M.
2D QSAR Consensus Prediction for High-Throughput Virtual Screening an
Application COX-2 Inhibition Modeling and Screening of the NCI Database.
J. Chem. Inf. Comput. Sci. 2004. 44: 276-285.
66. Braga, S.F., and Galvao, D.S. Benzo [C] Quinolizin-3-ones Theoretical
Investigation: SAR Analysis and Application to Non Tested Compounds. J.
Chem. Inf. Comput. Sci. 2004. 44: 1987-1999.
67. Review Science Amicbase: Database on Antimicrobials.
http://www.reviewscience.com/Compounds.htm (accessed 8th October 2004)
68. Mazzatorta, P., and Benfenati, E. The Importance of Scaling in Data Mining for
Toxicity Prediction. J. Chem. Inf. Comput. Sci. 2002. 42(5): 1250-1255.
69. Zheng, W., and Tropsha, A. Novel Variable Selection Quantitative
Structure-Property Relationship Approach Based on the k-Nearest Neighbors Principle. J.
Chem. Inf. Comput. Sci. 2000. 40: 185-194.
70. Madigan, M. T., Martinko, J. M., Parker, J. Brock Biology of
Microorganism 9th Ed. Upper Saddle River, N. J.: Prentice Hall, 2000.
71. Lalitha, M.K. Manual on Antimicrobial Susceptibility Testing. Department of
Microbiology Christian Medical College, Velore, Tamil. Nadu.
http://www.arches.uga.edu/~lace52/procedure.html (accessed October 2004).
72. Tang, K., Li, T. Comparison of Different Partial Least Squares Methods in
QSAR. Anal. Chim. Acta. 2003. 476: 75-92.
73. Golbraikh, A., Tropsha, A. Predictive QSAR Modeling Based on Diversity Sampling of
Experimental Datasets for the Training and Test Set Selection. J. Comput.-Aided
Mol. Des. 2002. 5: 231-243.
74. Gillet, V.J., Wild, D.J., Willett, P., Bradshaw, J. Similarity and Dissimilarity
Methods for Processing Chemical Structure Databases. The Computer Journal.
1998. 8: 547-558.
75. Bagachi, M.C., Maiti, B.C., Bose, S. QSAR of Anti Tuberculosis Drugs of INH
Type Using Graphical Invariants. J. Mol. Struct.: THEOCHEM. 2004. 679: 179-186.
76. Stanton, D.T. On the Physical Interpretation of QSAR Models. J. Chem. Inf.
Comput. Sci. 2003. 43: 1423-1433.
77. Rogers, D., Hopfinger, A. J. Application of Genetic Function Approximation to
Quantitative Structure Activity Relationship and Quantitative Structure Property
Relationship. J. Chem. Inf. Comput. Sci. 1994. 34: 854-866.
78. Cronin, M. Opportunities for Computer Aided Prediction of Toxicity in Drug
Discovery. A report. Computational Chemistry, School of Pharmacy and
Chemistry. Liverpool: John Moores University. 2002.
79. Martin, Y. C., Kofron, J. L., Traphagen, L. M. Do Structurally Similar Molecules
Have Similar Biological Activity? J. Med. Chem. 2002. 45: 4350-4358.
Appendix A: List of Compounds in the First Data Set

[Table: No, Structure, MIC (µg/mL); compounds 1 to 56. Structure drawings not reproduced. MIC values listed: 0.1, 0.07, 0.06, 0.05, 0.04, 0.03, 0.026, 0.025, 0.02 µg/mL.]
Appendix B: List of Compounds in the Second Data Set

[Table: No, Structure, MIC (µg/mL); compounds 1 to 122. Structure drawings not reproduced. MIC values listed: 128, 100, 96, 64, 62.5, 50, 32, 20, 16, 15, 14.4, 8, 7.3, 6, 5.6, 4, 3.8, 2, 1.2, 0.25 µg/mL.]
Appendix C: Presentation and Publication
Parts of this work have been presented at the following symposia:
1. Mohamed Noor Hasan, Neni Frimayanti. “Development of QSAR Models
for Predicting Anti Bacterial Activity of Compounds in Natural Products”
Proceedings of the 17th Malaysian National Symposium on Analytical
Chemistry, Pahang, Malaysia, 24-26 August 2004, pp 340-342.
2. Mohamed Noor Hasan, Neni Frimayanti. “Development of QSAR Models
for Predicting Anti Tuberculosis Activity of Plant Terpenoids” Proceeding of
the Symposium on Science and Mathematics, Johor, Malaysia, 14-15
December 2004, pp 14-15.
The following article has been based on parts of this thesis:
1. Mohamed Noor Hasan, Neni Frimayanti. “Development of QSAR Models
for Predicting Anti Bacterial Activity of Compounds in Natural Products”
Malaysian Journal of Analytical Sciences, In Press.