DEVELOPMENT OF QSAR MODELS FOR PREDICTING BIOLOGICAL ACTIVITY OF CHEMICAL COMPOUNDS FROM NATURAL PRODUCTS AND ITS APPLICATION IN DATABASE MINING

NENI FRIMAYANTI

A thesis submitted in fulfilment of the requirements for the award of the degree of Master of Science (Chemistry)

Faculty of Science
Universiti Teknologi Malaysia

AUGUST 2005

Specially dedicated to my beloved Mama (Emzalia Farida), Papa (Bahdar Johan, S.Pd) and my sweet sister (Belinda Monalisa, S.Si)

ACKNOWLEDGEMENT

Vision, values and courage are the main gifts of this thesis. I am grateful for the inspiration and wisdom of the many thoughts that have been instrumental in its formulation.

First of all, I acknowledge and thank Allah SWT, the Omnipotent and Omniscient, who created everything and gave me the ability to begin and complete this project. I also wish to express my sincere appreciation to my supervisor, Assoc. Prof. Dr. Mohamed Noor Hasan, for his guidance, advice, motivation, criticism and friendship. Without his help, this thesis would not have been the same as presented here.

I would like to thank Dr. Farediah Ahmad and uni Deni Susanti, M.Sc for their teaching and explanations about bioactive compounds. I would also like to thank Dr. Fahrul Zaman Huyop and his research group for the many useful discussions and for their help with the biological tests. I am also indebted to Universiti Teknologi Malaysia (UTM) for providing the research grant for this project, entitled "Structure-Activity Relationship Studies of Bioactive Compounds in Natural Products" (IRPA Vot 74089).

My sincere appreciation is also extended to abang Ir. Henry Nasution, MT and om Afitra Jaya, SH for their help and kindness, which allowed me to pursue my study here. Many thanks to all my beautiful sisters in S36 UTM and my best friend Kiki; I cannot forget our closeness and friendship.
Last but certainly not least, I want to thank my mama, papa, my sweet sister Mona, my lovely grandmas (mak, tino, and nenek), om Rahman and all of my big family for their affection, prayers and support throughout my study. I love you all.

ABSTRACT

Due to drug resistance problems, there is an urgent need to discover and develop new antibacterial and anti-tuberculosis lead compounds. Quantitative structure-activity relationship (QSAR) methodology has been used to develop models that correlate the biological activity of chemicals derived from natural products with their molecular structure. The approach started with the generation of a series of descriptors from three-dimensional representations of the compounds in the data set. In this study, the first data set consisted of 56 compounds isolated from natural products, with their minimum inhibition concentration (MIC, µg/mL) against Escherichia coli. The second data set consisted of 122 plant terpenoids with moderate to high activity against Mycobacterium tuberculosis. Genetic algorithm-partial least squares (GAPLS) and multiple linear regression analysis (MLRA) techniques were used in the model development. The validated QSAR models were applied in mining chemicals in a large database. The same set of descriptors that appeared in the QSAR models was used in a chemical similarity search (based on Euclidean distance) comparing the active compounds of the training set with those in the database. The selected compounds were short-listed by applying the applicability domain criterion to reduce the number of candidates to be tested. Finally, the biological activity of these compounds was determined experimentally using the disk diffusion method to confirm their predicted MIC values.

ABSTRAK

The problem of drug ineffectiveness has created the need to discover and develop antibacterial and anti-tuberculosis compounds. The quantitative structure-activity relationship (QSAR) method has been used to develop models that relate the biological activity of chemical compounds from natural products to their molecular structure. The method begins by generating descriptors from three-dimensional models of the compounds in the data set. In this study, the first data set consisted of 56 compounds isolated from natural products, with their minimum inhibition concentration (MIC, µg/mL) against Escherichia coli. The second data set consisted of 122 plant terpenoids with moderate to high activity against Mycobacterium tuberculosis. Genetic algorithm-partial least squares (GAPLS) and multiple linear regression analysis (MLRA) techniques were used to build the models. The validated QSAR models were then used to search for chemical compounds in a large database. The same set of descriptors as in the QSAR models was used to search for similar chemical compounds (based on Euclidean distance) between the active compounds of the data set and the compounds in the database. The selected compounds were short-listed by applying the applicability domain criterion to reduce the number of compounds to be tested. Finally, the biological activity of these compounds was tested experimentally, using the disk diffusion technique, to confirm the predicted MIC values.
TABLE OF CONTENTS

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF SYMBOLS
LIST OF ACRONYMS
LIST OF APPENDICES

1 INTRODUCTION
1.1 Introduction
1.2 Quantitative Structure Activity Relationship (QSAR)
1.3 History and Development of QSAR
  1.3.1 Data Set
  1.3.2 Descriptors
    1.3.2.1 Topological Descriptors
    1.3.2.2 Electronic Descriptors
    1.3.2.3 Geometric Descriptors
1.4 Feature Selection
  1.4.1 Genetic Algorithm (GA)
1.5 Tools and Techniques of QSAR
  1.5.1 Multiple Linear Regression Analysis
  1.5.2 Partial Least Squares
1.6 Applications of QSAR
1.7 Overview of Multidrug Resistance Mycobacterium tuberculosis
  1.7.1 Mycobacterium tuberculosis
  1.7.2 How Does Tuberculosis Spread
1.8 Minimum Inhibition Concentration (MIC)
  1.8.1 Escherichia coli
1.9 Database Mining
1.10 Research Scope
1.11 Research Objectives
1.12 Significance of Research
1.13 Layout of the Thesis

2 RESEARCH METHODOLOGY
2.1 Introduction
2.2 Data Set
2.3 Structure Entry and Molecular Modeling
2.4 Descriptor Generation
2.5 Feature Selection
  2.5.1 Objective Feature Selection
  2.5.2 Subjective Feature Selection
2.6 Model Development
  2.6.1 Multiple Linear Regression Analysis
  2.6.2 Partial Least Squares
2.7 Model Validation
2.8 Application of QSAR Models to Database Mining
  2.8.1 Molecular Descriptors and Similarity Calculation
  2.8.2 Applicability Domain of QSAR Models
  2.8.3 Biological Activity Predicted Using QSAR Models
2.9 Laboratory Testing
  2.9.1 Material and Method of Agar Diffusion

3 DEVELOPMENT OF QSAR MODELS AND DATABASE MINING FOR ANTI BACTERIAL AGENTS
3.1 Introduction
3.2 Selection of Descriptors and Feature Selection
3.3 Model Development Using MLRA Method
3.4 Model Development Using PLS Method
3.5 Model Validation
3.6 Application of QSAR Models to Database Mining
  3.6.1 Application of QSAR Models in AmicBase Database Mining (without Scaling)
  3.6.2 Application of QSAR Models in AmicBase Database Mining (with Scaling)
3.7 Experimental Validation
3.8 Effects of Range Scaling and Applicability Domain to Search New Agents

4 DEVELOPMENT OF QSAR MODELS AND DATABASE MINING FOR ANTI TUBERCULOSIS AGENTS
4.1 Introduction
4.2 Descriptors Generation and Objective Feature Selection
4.3 Development of QSAR Models by using MLRA Method
4.4 Development of QSAR Models by using PLS Technique
4.5 Model Validation
4.6 Application of QSAR Models to Database Mining
  4.6.1 Application of QSAR Models in AmicBase Database Mining (without Scaling)
  4.6.2 Application of QSAR Models in AmicBase Database Mining (with Scaling)
  4.6.3 Effects of Applicability Domain to Search New Agents
4.7 Experimental Validation

5 CONCLUSIONS AND RECOMMENDATION
5.1 Introduction
5.2 Conclusion
5.3 Limitation of the Study
5.4 Future Research Recommendation

REFERENCES
APPENDIX A
APPENDIX B
APPENDIX C

LIST OF TABLES
TABLE NO.  TITLE
2.1  Type of descriptors in TSAR
3.1  List of selected descriptors and their statistical analysis
3.2  Correlation matrix of descriptors
3.3  Statistical output of MLRA model
3.4  Descriptors which were included in the QSAR model by using MLRA
3.5  Statistical analysis of MLRA method
3.6  Statistical output of GA-PLS for each dimension
3.7  Statistical output of PLS model
3.8  Descriptors which were included in the QSAR model by using PLS
3.9  Calculated MIC for compounds in the prediction set
3.10  List of probe compounds for database mining
3.11  Selected compounds with predicted MIC value
3.12  Selected compounds with their predicted biological activity
3.13  MIC value of selected compounds (without scaling) using agar diffusion method
3.14  MIC value of selected compounds (with scaling) using agar diffusion
4.1  List of selected descriptors and their statistical analysis
4.2  Correlation matrix of descriptors
4.3  Statistical output of MLRA model
4.4  Descriptors which were included in the MLRA model
4.5  Statistical plot output of GA-PLS for each dimension
4.6  Statistics of the PLS model
4.7  Descriptors which were included in the PLS model
4.8  Calculated MIC for compounds in the prediction set
4.9  Selected compounds with their predicted anti tuberculosis activity
4.10  List of probe compounds for database mining
4.11  Selected compounds with their predicted MIC value
4.12  MIC value of selected compounds (without scaling) using agar diffusion method
4.13  MIC value of selected compounds (with scaling) using agar diffusion method

LIST OF FIGURES

FIGURE NO.  TITLE
1.1  The general QSAR problem
1.2  Flow diagram for the genetic algorithm (GA)
1.3  Illustration of the difference between PCR and PLS
1.4  Structure of E. coli
2.1  General QSAR methodology
2.2  Genetic algorithm process
2.3  Flowchart for the general model building process in QSAR studies
2.4  Flowchart of database mining that employs predictive QSAR models
3.1  Plot of experimental vs. predicted MIC for MLRA model
3.2  Plot of predicted value vs. standard residual for MLRA model
3.3  Plot of PRESS vs. number of components
3.4  Plot of experimental vs. predicted MIC for PLS model
3.5  Plot of predicted value vs. standard residual for PLS model
3.6  Flowchart to select new compounds in AmicBase database
3.7  Flowchart to select new compounds in AmicBase database
3.8  Inhibition zone of E. coli using (a) m-cresol and (b) eugenol methyl ether
3.9  Inhibition zone of E. coli using selective compounds
4.1  Plot of experimental value vs. predicted MIC for MLRA
4.2  Plot of predicted value vs. standard residual for MLRA model
4.3  Plot of PRESS vs. number of components
4.4  Plot of experimental vs. predicted MIC for PLS model
4.5  Plot of predicted value vs. standard residual for PLS model
4.6  Steps to select new compounds against M. tuberculosis
4.7  Inhibition zone of active and inactive agents

LIST OF SYMBOLS

a, b, c, d - Regression coefficients
b̃ - Regression vector
b̂ - The estimate of b̃
ĉ - Activity of unknown compounds
DT - Applicability domain
Es - Steric component
ρ - Proportionality reaction constant
σ - Electronic properties of aromatic compounds; standard deviation of Euclidean distance
π - Hydrophobicity of substituents
PX - Partition coefficient of derivative molecule
PH - Partition coefficient of parent molecule
r2 - How closely the equation fits the data
r2 (CV) - Predictive power of the model
runk - Matrix of the known descriptor
χ - Molecular connectivity indices
X̄ - Mean value
y - Observed activity value
ȳ - Mean value; average Euclidean distance
ŷ - Predicted value
C - Concentration of molecule
D - Distance matrix
F - Degrees of freedom
R - Matrix of descriptors
RT - Pseudo-inverse of the descriptor matrix
S - A diagonal matrix; standard error of the regression model
s.d - Standard deviation
U - Score matrix from PCA
V - Matrix containing the loadings
W - Wiener index
Z - An arbitrary parameter to control the significance level

LIST OF ACRONYMS

BC3 - Benzo[c]quinolizin-3-ones
CADD - Computer assisted drug design
CAMD - Computer assisted molecular design
DAT - Dopamine transporter
EC50 - Effect concentration
ED - Euclidean distance
EDCs - Endocrine disrupting chemicals
EIEC - Enteroinvasive
EPEC - Enteropathogenic
ETEC - Enterotoxigenic
GA - Genetic algorithm
GA-MLRA - Genetic algorithm-multiple linear regression analysis
GAPLS - Genetic algorithm partial least squares
GSA - Genetic simulated annealing
HOMO - Highest occupied molecular orbital
IC50 - Inhibition concentration
KNN - K-nearest neighbor
LDA - Linear discriminant analysis
LFER - Linear free energy relationship
LUMO - Lowest unoccupied molecular orbital
MDR - Multi drug resistant
MIC - Minimum inhibition concentration
MLRA - Multiple linear regression analysis
MLR - Multivariate linear regression
MRA - Multiple regression analysis
NCI - National Cancer Institute
PCA - Principal component analysis
PCR - Principal component regression
PLS - Partial least squares
PRESS - Predictive sum of squares
QSAR - Quantitative structure activity relationship
QSPR - Quantitative structure property relationship
RSS - Residual sum of squares
TCH - Thiophene-2-carboxylic acid hydrazide
SSR - Sum of squares
SST - Total sum of squares
VTEC - Verotoxigenic
VOCs - Volatile organic compounds

CHAPTER 1

INTRODUCTION

1.1 Introduction

Malaysia is rich in chemical diversity through its natural products. An estimated 12,000 species of plants are found in the country, and more than 1000 of them are said to have therapeutic properties [1]. Many of these resources remain untapped, although a number of research groups have been actively and systematically studying their chemical and biological properties. Some of these compounds and their derivatives have been shown to have antibacterial properties [2, 3]. For example, bioactive compounds can be obtained from the families Rubiaceae, Verbenaceae, Zingiberaceae and Piperaceae.

Tuberculosis, mainly caused by Mycobacterium tuberculosis, is the leading killer among all infectious diseases worldwide and is responsible for more than two million deaths annually. The recent increase in the number of multi-drug resistant clinical isolates of M. tuberculosis has created an urgent need for the discovery and development of new anti-tuberculosis lead compounds. The quantitative structure-activity relationship (QSAR) approach, which has been applied successfully to study the factors that determine the chemical properties or biological activities of chemical compounds, is expected to be applicable here [4].
In a typical structure-activity relationship study, the aim is to develop models that correlate the structural features of a series of chemical compounds with their physicochemical properties or biological activities. These correlation models can be used to predict the activity of new compounds as well as to form a basis for understanding the factors affecting their activities [5, 6].

QSAR models are constructed by analyzing known or computed property data together with a series of numerical descriptors representing structural characteristics. Descriptors are quantitative properties that depend on the structure of the molecule. Various physicochemical parameters, including thermodynamic properties (such as system energies), electronic properties (e.g. the energies of the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO)), molecular shape (e.g. surface area, length-to-breadth ratio) and simple structural characteristics (e.g. number of bonds, connectivity indices), have been used to obtain robust models able to predict the biological activity of unknown molecules [6].

In this study, the structure-activity relationship approach described above was used to develop models that correlate structural features of compounds isolated from plants with their antibacterial activity. Good models developed using the method were applied to screen a large chemical database. The results of the screening can be used to select and postulate structures of lead molecules that can be synthesized in the production of new drugs in the pharmaceutical industry.

1.2 Quantitative Structure Activity Relationship (QSAR)

Drugs exert their biological effects by participating in a series of events which include transport, binding with the receptor and metabolism to an inactive species.
Since the interaction mechanism between the molecule and the putative receptor is unknown in most cases (i.e., no bound crystal structure is available), one is reduced to making inferences from properties that can easily be obtained (molecular properties and descriptors) to explain these interactions for unknown molecules.

Pharmaceutical companies need to continuously discover and develop new drugs, particularly in the field of anti-infective agents, in order to fight increasing resistance to older drugs and newly discovered types of infection such as mutated bacteria and viral infections. Both traditional and novel approaches are used in drug discovery, and they can be grouped into three categories [7]:

1. Random screening of a large number of compounds in search of desired biological properties.
2. Structural modification of lead compounds through the substitution, addition or elimination of chemical groups.
3. Rational drug design, which includes different approaches and techniques, most of them with an important computational component.

These approaches are not necessarily incompatible, and most companies try to use new methods to accelerate the discovery of new compounds. QSAR is a technique based on the reasonable premise that the biological activity of a compound is a consequence of its molecular structure, provided we can identify those aspects of molecular structure that are relevant to a particular biological activity.

QSAR is a part of the chemometrics discipline and represents an attempt to correlate structural or property descriptors of compounds with their activities. In other words, it covers a wide range of techniques, procedures and ideas, all of which attempt in some fashion to summarise chemical and biological information in a form that allows one to generate and test hypotheses, and so to facilitate an understanding of interactions between molecules.
QSAR can also be regarded as the statistical analysis of potential relationships between chemical structure and biological activity. The goal of a structure-activity relationship is to analyse and detect the factors that determine the measured activity of a particular system, in order to gain insight into the mechanism and behaviour of the system studied. For this purpose, the strategy is to generate mathematical models that correlate experimental measurements with a set of chemical descriptors determined from the molecular structure of a set of compounds.

The formulation of thousands of equations using QSAR methodology attests to the validity of its concepts and to its utility in elucidating the mechanism of action of drugs at the molecular level and in gaining a more complete understanding of physicochemical phenomena such as hydrophobicity. It is now possible not only to develop models for a system but also to compare models from a biological database and to draw analogies with models from a physical organic database.

1.3 History and Development of QSAR

More than a century ago, Crum-Brown and Fraser expressed the idea that the physiological action of a substance is a function of its chemical composition and constitution [8]. In 1863, Cros at the University of Strasbourg observed that the toxicity of alcohols to mammals increased as the water solubility of the alcohols decreased. In the 1890s, Hans Horst Meyer of the University of Marburg and Charles Ernest Overton of the University of Zurich, working independently, noted that the toxicity of organic compounds depended on their lipophilicity [9].

On the basis of biological experiments, they correlated partition coefficients with anesthetic potencies. Overton also determined the effect of functional groups in increasing or decreasing partition coefficients [10]. Afterwards, Lazarev in St. Petersburg continued where Overton and Meyer left off, applying partition coefficients to the development of industrial hygiene standards.
Lazarev reported correlations on a log scale and developed a system for estimating partition coefficients from chemical structure.

In 1893, Richet showed that the cytotoxicities of a diverse set of simple organic molecules were inversely related to their water solubilities. The earliest mathematical formulation is attributed to Ferguson, who in 1939 announced a principle for toxicity [8]. He observed that anesthetic potency increased on ascending a homologous series of either n-alkanes or alkanols, up to a point where a loss of potency, or at least no further increase, occurred, and he related this behaviour to physical properties such as solubility in water, distribution between phases, capillarity and vapour pressure.

Little additional development of QSAR occurred until the work of Louis Hammett (1937) in the field of organic chemistry. He observed that the addition of substituents to the aromatic ring of benzoic acid had an orderly and quantitative effect on the dissociation constant. He also correlated the electronic properties of organic acids and bases with their equilibrium constants and reactivity. From empirical observation, he derived the following linear relationship, the so-called Hammett equation:

\[ \log \frac{K}{K_0} = \rho\sigma \tag{1.1} \]

where K and K_0 are the equilibrium constants of the substituted and the unsubstituted compound, respectively. The slope ρ is the proportionality (reaction) constant pertaining to a given equilibrium; it relates the effect of substituents on that equilibrium to their effect on the benzoic acid equilibrium. σ is a parameter that describes the electronic properties of aromatic substituents, i.e. their electron-donating or electron-withdrawing power. Based on Hammett's relationship, electronic properties were utilized as descriptors of structure [9].

Taft devised a way of separating polar, steric and resonance effects and introduced the first steric parameter, Es [11]. Working in the same direction, Swain studied the effects of field and resonance. He investigated the variation in reactivity of a given electrophilic substrate towards a series of nucleophilic reagents [10].
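As a small numerical illustration of the Hammett equation (Equation 1.1), the sketch below estimates the pKa of a substituted benzoic acid from the pKa of the parent acid. The function names are hypothetical, and the input values (a parent pKa of 4.20, ρ = 1.0 for the reference benzoic acid ionization, and σ = 0.78 for a para-nitro substituent) are commonly quoted textbook figures used here only as assumptions, not data from this study.

```python
def hammett_log_ratio(rho, sigma):
    """Return log10(K / K0) according to Equation 1.1: log(K/K0) = rho * sigma."""
    return rho * sigma


def predicted_pka(pka_parent, rho, sigma):
    """Estimate the pKa of a substituted acid from the parent pKa.

    Because pKa = -log10(K), log10(K / K0) = pKa_parent - pKa_substituted.
    """
    return pka_parent - hammett_log_ratio(rho, sigma)


# Assumed illustrative values (not taken from the thesis):
# parent pKa 4.20, rho 1.0, sigma 0.78 for a para-nitro group.
print(round(predicted_pka(4.20, 1.0, 0.78), 2))  # 3.42
```

A positive σ (electron-withdrawing substituent) combined with a positive ρ raises K and therefore lowers the predicted pKa, which is the direction the sketch reproduces.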
Free and Wilson partitioned the molecule in a different manner from Hammett. They postulated that the biological activity of a set of molecules can be related to the addition of substituents, taking into account their number, type and position on the parent skeleton [10]. In 1962, Hansch and Muir published their brilliant study on the structure-activity relationship of plant growth regulators and their dependency on Hammett constants and hydrophobicity. The parameter π, which is the relative hydrophobicity of a substituent, was defined in a manner analogous to the definition of σ:

\[ \pi_X = \log P_X - \log P_H \tag{1.2} \]

where P_X and P_H are the partition coefficients of the derivative and the parent molecule, respectively. In 1964, Hansch and Fujita combined these hydrophobic constants with Hammett's electronic constants to yield the linear Hansch equation.

Hansch analysis is a powerful technique for optimizing the activity of lead compounds. All physicochemical factors that relate to transport and receptor interaction can be broken down into hydrophobic, electronic and steric components. The correlation between these components and biological activity can be summarized in the following equation:

\[ \log \frac{1}{C} = a\pi + b\sigma + cE_s + d \tag{1.3} \]

where C is the molar concentration of the compound; π, σ and Es are the hydrophobic, electronic and steric components, respectively; and a, b, c and d are regression coefficients. Combining Hansch and Free-Wilson analysis in a mixed approach widens the applicability of both QSAR methods.

The linear free energy relationship (LFER) approach was the first attempt to predict a property of a compound from an analysis of its structure [12]. LFER methods are widely used for the development of quantitative models for energy-based properties such as partition coefficients, binding constants or reaction rate constants. This is based on the pioneering work of Hammett, who introduced this method for the prediction of chemical reactivity.
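The regression coefficients a, b, c and d of the Hansch equation (Equation 1.3) are usually obtained by least-squares fitting. The following is a minimal sketch of that step, assuming NumPy is available. The function name `fit_hansch` is hypothetical, and the substituent constants and coefficients below are made-up numbers chosen only to show that the fit recovers the coefficients used to generate the data; they are not results from this study.

```python
import numpy as np


def fit_hansch(pi, sigma, es, log_inv_c):
    """Least-squares fit of log(1/C) = a*pi + b*sigma + c*Es + d.

    Returns the array [a, b, c, d].
    """
    pi, sigma, es = (np.asarray(v, dtype=float) for v in (pi, sigma, es))
    # Design matrix: one column per descriptor plus a column of ones for d.
    X = np.column_stack([pi, sigma, es, np.ones(len(pi))])
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(log_inv_c, dtype=float), rcond=None)
    return coeffs


# Synthetic, noise-free example (all values are illustrative assumptions).
pi = np.array([0.0, 0.5, 1.0, -0.5, 0.8, 0.2])
sigma = np.array([0.0, -0.2, 0.3, 0.7, 0.1, -0.4])
es = np.array([0.0, -1.2, -0.5, -2.0, -0.3, -1.0])
true_coeffs = np.array([0.8, 1.2, 0.3, 2.0])  # assumed a, b, c, d
log_inv_c = true_coeffs[0] * pi + true_coeffs[1] * sigma + true_coeffs[2] * es + true_coeffs[3]

fitted = fit_hansch(pi, sigma, es, log_inv_c)
print(np.round(fitted, 3))
```

With real data the activities contain experimental noise, so the fitted coefficients should be reported together with statistics such as r2 and the standard error rather than taken at face value.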
The basic assumption is that the influence of a structural feature on the free energy change of a chemical process is constant for a congeneric series of compounds. The basic LFER approach was later extended to the more general concept of fragmentation. Molecules are dissected into substructures, and each substructure is taken to contribute a constant increment to the free-energy-based property. The premise of strict linearity does not hold in most cases, so corrections have to be applied in the majority of methods based on the fragmentation approach. Correction terms are often related to long-range interactions such as resonance or steric effects.

Computer-assisted drug design (CADD), also called computer-assisted molecular design (CAMD), represents a more recent application of computers as tools in the drug design process [9]. It is important to emphasize that computers cannot substitute for a clear understanding of the system being studied. A computer should therefore be considered an additional tool for gaining better insight into the chemistry and biology of the problem at hand. This tool has enabled the rapid synthesis of a large number of molecules, and massive amounts of data can be generated in a relatively short period of time.

In the middle of the 20th century, two QSAR approaches now considered classical were developed [7]:

1. Techniques based on the recognition of molecular features (fragments, groups or sites) and the calculation, generally by regression analysis, of the contribution that these patterns make to activity, assuming additivity of the effects.
2. Techniques based on physicochemical parameters as structural descriptors. The rationale of this method is that the biological responses of a living organism to drugs are frequently controlled by lipophilicity and by electronic and steric properties.

1.3.1 Data Set

A data set consists of compounds with known molecular structure and biological activity; the compounds are divided between a training set and a test set.
Approximately 40 % of the compounds are selected with a maximum dissimilarity algorithm and assigned to the test set, with the remaining 60 % assigned to the training set [13]. The training set is used for QSAR model development and the test set is used for model validation.

Other techniques for dividing a data set into training and test sets are based on sphere-exclusion algorithms [14]. The procedure starts with the calculation of the distance matrix D between representative points in the descriptor space. Each probe sphere radius corresponds to one division into training set and prediction set. A sphere-exclusion algorithm consists of the following steps:

1. Select the compound with the highest activity.
2. Include this compound in the training set.
3. Construct a probe sphere around this compound.
4. Include the compounds corresponding to representative points within this sphere, except for the sphere center, in the test set.
5. Exclude all points within this sphere from the initial set of compounds.

A data set can also be divided by sorting the list in order of increasing biological activity. The odd-numbered compounds are then assigned to the training set and the even-numbered compounds to the prediction set or, alternatively, the even-numbered compounds are assigned to the training set and the odd-numbered compounds to the prediction set.

1.3.2 Descriptors

QSAR models are constructed by analyzing known or computed property data and a series of descriptors representing the characteristics of the system. An important class of these descriptors belongs to the empirical parameter category derived from physical organic chemistry. These parameters focus on how chemical reaction rates depend on differences in molecular structure. Encoding the molecules numerically allows an indirect link between structure and activity to be established.
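The sphere-exclusion steps listed in Section 1.3.1 can be sketched in a few lines of code. This is an illustrative sketch only, assuming NumPy is available; the function name `sphere_exclusion_split` is hypothetical. Two points are assumptions not stated in the text: the five steps are repeated on the remaining compounds until none are left, and a larger activity value means a more active compound (for MIC data, where lower values indicate higher activity, a transformed value such as the negative logarithm of the MIC could be used instead).

```python
import numpy as np


def sphere_exclusion_split(descriptors, activity, radius):
    """Split compounds into training and test indices by sphere exclusion."""
    X = np.asarray(descriptors, dtype=float)
    act = np.asarray(activity, dtype=float)

    # Distance matrix D between all representative points in descriptor space.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))

    remaining = set(range(len(X)))
    train, test = [], []
    while remaining:  # assumption: repeat the steps until no compound is left
        # Steps 1-2: the most active remaining compound joins the training set.
        center = max(sorted(remaining), key=lambda i: act[i])
        train.append(center)
        # Steps 3-4: points inside the probe sphere (except the center) go to the test set.
        inside = [i for i in sorted(remaining) if i != center and D[center, i] <= radius]
        test.extend(inside)
        # Step 5: remove every point of this sphere from the initial set.
        remaining -= {center, *inside}
    return sorted(train), sorted(test)


# Toy one-descriptor example with made-up values.
points = [[0.0], [0.1], [0.2], [1.0], [1.1], [3.0]]
activities = [5, 4, 3, 2, 6, 1]
train_idx, test_idx = sphere_exclusion_split(points, activities, radius=0.25)
print(train_idx, test_idx)  # [0, 4, 5] [1, 2, 3]
```

A larger radius places more compounds in the test set per sphere, which is how each probe sphere radius corresponds to one division of the data.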
Descriptors are numerical quantities that characterize properties of molecules [11]; they can also be defined as numerical values that encode certain aspects of molecular structure [12, 15]. For each structure in the data set, more than 200 descriptors can be calculated, ranging from atom and bond counts to more detailed combinations of structural information. The relationship between biological activity and descriptors is:

\[ \text{Molecular activity} = f(\text{molecular structure}) = f(\text{descriptors}) \tag{1.4} \]

The QSAR methodology begins with the calculation of numerical descriptors for a set of compounds. Figure 1.1 shows how the generation of structural descriptors establishes the relationship between molecular structures and properties or biological activities.

[Figure 1.1: The general QSAR problem. Molecular structures are linked to properties through structural descriptors, obtained by representation and feature selection.]

A descriptor can be any quantitative property that depends on the structure of the molecule. Various physicochemical parameters such as heat of formation, polarizability, hyperpolarizability and vibrational frequencies have been used jointly with connectivity, topological and geometrical indices in order to obtain good models able to predict antibacterial activity [16].

The development of molecular structure descriptors is the most important part of any structure-activity investigation, because the descriptors must contain enough information to permit the correct classification of the compounds under study. Descriptors fall into three main categories: topological, electronic and geometric [17]. The following sections provide information and examples for each descriptor class to give a clearer understanding of the descriptor routines.

1.3.2.1 Topological Descriptors

The structures of organic compounds can be represented as graphs. The theorems of graph theory can then be applied to generate graph invariants, which in the context of chemistry are called topological descriptors.
The topological description of a molecule contains information on the atom-atom connectivity in the molecule, and encodes size, shape, branching, heteroatoms and the presence of multiple bonds [18, 19]. This graph description of molecules neglects information on bond lengths, bond angles and torsion angles, but it encodes in numerical form the important atom connectivity information that determines a wide range of physical, chemical and biological properties. Topological indices are widely used as structural descriptors in quantitative structure-property relationship (QSPR) and QSAR models.

The Wiener index, W, defined in 1947, is widely used in QSAR and QSPR models as a topological descriptor, and it still represents an important source of inspiration for defining new topological indices. The path number W is defined as the sum of the distances between all pairs of carbon atoms in the molecule, counted in carbon-carbon bonds [17]. Hosoya extended the application of the Wiener index by defining it from the distance matrix, as half the sum of the off-diagonal elements of the distance matrix of the hydrogen-depleted molecular graph [20].

Randić first introduced the concept of molecular connectivity in 1975. The resulting index, also called the connectivity index or branching index, was intended to provide a topological index that could characterize the amount of branching in hydrocarbon molecules. This initial concept was extended by Kier and Hall to develop the well-known χ indices [18]. They found that the branching index provides some basic information concerning the overall composition of the molecule.

Another example of a topological descriptor is the electrotopological state. This descriptor is a numerical value computed for each atom in a molecule, which encodes information about both the topological environment of the atom and the electronic interactions due to all other atoms in the molecule [12].
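To make the Wiener index concrete, the sketch below computes W for a hydrogen-depleted molecular graph. The function names and the adjacency-list representation are assumptions made for illustration. Topological distances are obtained by breadth-first search, and W is taken as half the sum of all entries of the distance matrix (the diagonal entries are zero), which equals the sum of distances over all pairs of atoms.

```python
from collections import deque


def distance_matrix(adjacency):
    """Topological distance matrix of a connected graph given as adjacency lists."""
    n = len(adjacency)
    D = [[0] * n for _ in range(n)]
    for start in range(n):
        dist = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search counts bonds along shortest paths
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, d in dist.items():
            D[start][v] = d
    return D


def wiener_index(adjacency):
    """Half the sum of all distance-matrix entries (each pair is counted twice)."""
    D = distance_matrix(adjacency)
    return sum(sum(row) for row in D) // 2


# Carbon skeletons as adjacency lists (hydrogens omitted).
n_butane = [[1], [0, 2], [1, 3], [2]]      # C-C-C-C chain
isobutane = [[1, 2, 3], [0], [0], [0]]     # central carbon with three methyl groups
print(wiener_index(n_butane), wiener_index(isobutane))  # 10 9
```

The branched isomer has the smaller value, which illustrates how the index responds to branching for molecules with the same number of carbon atoms.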
1.3.2.2 Electronic Descriptors

A large variety of whole-molecule electronic descriptors have been used to encode electronic features in QSAR investigations. The electronic environment of each molecule is estimated with the electronic descriptor routines. Electronic descriptors provide information about the overall charge distribution by calculating values such as the partial charge on each atom.

A number of electronic descriptors may encode the effects or strengths of intermolecular interactions. The more commonly recognized intermolecular forces arise from the following interactions: ion-ion, ion-dipole, dipole-dipole, etc. One example of this type of descriptor is the electric dipole moment, which encodes the strength of polar-type interactions. Molecular polarizability and molar refractivity are closely related properties that measure a molecule’s susceptibility to becoming polarized.

While descriptors related to intermolecular interactions are useful for predicting bulk physical properties and certain types of biological activity, they provide little direct information about the reactivities of compounds. This information is available through molecular orbital calculations [20]. The HOMO energy is roughly related to the ionization potential of a molecule, while the LUMO energy is related to the electron affinity. The magnitudes of these quantities are measures of the overall susceptibility of the molecule to losing a pair of electrons to an electrophile or accepting a pair of electrons from a nucleophile.

1.3.2.3 Geometric Descriptors

Biological activity is often related to the shape and size of the active compounds as well as the degree of complementarity between the compound and a receptor. Given methods for generating three-dimensional molecular models of compounds, these models can be used to develop geometric descriptors. Geometric descriptors capture information about the overall three-dimensional size and shape of molecules.
As the name implies, they require that the molecules reside in accurate three-dimensional geometric conformations before descriptor generation. Examples of geometric descriptors include the solvent-accessible surface area and volume and the moments of inertia. These descriptors are useful in encoding steric effects that can occur between molecules.

Geometric descriptors appear frequently in QSAR of biological activity, especially when solvent-accessible surface area information is used in conjunction with partial charge information to form the polar surface area descriptors. Surface area has a prominent effect on the interactions which occur between a drug molecule and its surroundings [20].

Another descriptor calculated for biological activity investigations is the molecular volume. The total molecular volume is taken as the sum of the contributions of each atom in the structure. The volume contributions of attached hydrogen atoms are also included in the final volume.

1.4 Feature Selection

Each descriptor contains useful information, but not all of these descriptors will be used to develop QSAR models. Feature selection is needed to reduce the number of descriptors. It is a step carried out in many analyses, in which an initial, too-large set of descriptors is reduced to a smaller number that is felt to include the descriptors that matter [21]. The objective of feature selection is to identify the best subset of descriptors and to reduce the descriptor pool to a reasonable number; several stages of statistical testing are performed to remove descriptors that contain redundant information. Feature selection can be achieved by two methods:

i) Objective feature selection uses only the independent variables; the goal is to remove redundancy amongst the descriptors and to deter chance effects during model development. Pairwise correlation coefficients are calculated for all pairs of descriptors; if the $r^2$ value is greater than 0.8, one of the two descriptors is rejected at random.

ii) Subjective feature selection, which also uses the dependent variable, is used to identify the most information-rich descriptor subsets, which best map an accurate link between structure and a property of interest.

The genetic algorithm (GA) method can also be used to select the optimum number of descriptors for use in regression analysis. The GA can be a useful technique for searching a large probability space with a large number of descriptors for a small number of molecules. For example, this technique has been successfully applied to select the descriptors used to correlate and predict effect concentration (EC50) values of fluorovinyloxyacetamide compounds [22].

K-nearest-neighbor (KNN) analysis has also been used as a variable selection procedure. In principle, this technique seeks to optimize simultaneously the selection of variables from the original pool of all molecular descriptors that are used to calculate similarities between compounds. The KNN technique has been applied to select descriptors and establish QSAR models for predicting the anticonvulsant activity of functionalized amino acids [23].

Searching all combinations of descriptors is impractical, so a logical approach is taken by combining the search with an optimization routine. It has been shown to be very efficient in screening the reduced pool to identify optimal models [24]. Generalized simulated annealing (GSA) attempts to find models with the best configuration of descriptors that will produce low error for the training set compounds [25]. Once the initial model is evaluated for fitness, a perturbation is made by randomly replacing one (or more) descriptors with another from the reduced pool.
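The objective feature selection step in (i) above can be sketched as a simple correlation filter. The sketch below is illustrative only: the function names, descriptor names and toy values are hypothetical, constant descriptors are assumed to have been removed beforehand, and for reproducibility the first descriptor of a correlated pair is always kept instead of rejecting one of the two at random.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def objective_selection(descriptors, r2_max=0.8):
    """Keep a descriptor only if its r^2 with every kept descriptor is <= r2_max.

    Only the independent variables are used; the activity values never
    enter this step.
    """
    kept = []
    for name, values in descriptors.items():
        if all(pearson_r(values, descriptors[k]) ** 2 <= r2_max for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor pool for five compounds.
descriptors = {
    "mol_weight": [100.0, 120.0, 140.0, 160.0, 180.0],
    "atom_count": [10.0, 12.0, 14.0, 16.0, 18.5],   # nearly collinear with mol_weight
    "dipole": [1.2, 0.3, 2.1, 0.9, 1.7],
}

print(objective_selection(descriptors))
```

In this toy pool the atom count is almost perfectly correlated with the molecular weight and is dropped, while the dipole moment carries independent information and is retained.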
If the new model is better than the first, the step is accepted and a third model is produced via perturbation of the descriptors in the second model.

Multiple linear regression analysis can only handle data sets where the number of descriptors is smaller than the number of molecules, unless a preselection of descriptors is carried out (e.g. by using GA). The genetic algorithm and multiple linear regression analysis have been combined (GA-MLRA) to make a new classification and regression tool for predicting a compound’s quantitative or categorical biological activity based on a quantitative description of the compound’s molecular structure [26].

1.4.1 Genetic Algorithm (GA)

The GA approach is a general optimization method, first developed by Holland [27], that involves an iterative mutation/scoring/selection procedure on a constant-size population of individuals. The theory behind GA originates from the ‘survival of the fittest’ principle. Darwinian theory states that individuals who possess dominant features will prevail in a population and produce children with even more superior features. In GA, models represent chromosomes, while the descriptors comprising a model represent the genes encoding each chromosome. Mating and mutation allow GA to efficiently scan an error surface and assess the fitness of thousands of models.

The advantages of GA methods are that they search the descriptor space efficiently and that they can find models containing combinations of descriptors or features that predict well as a group but poorly individually [28, 29]. GA methods were used to select the optimum number of descriptors for use in regression analysis. The general GA scheme is shown in Figure 1.2.

1.5 Tools and Techniques of QSAR

QSAR studies involve a mathematical correlation between molecular structure and activity. For quantitative modeling, two methods are primarily used to develop QSAR/QSPR models. Model complexity generally increases during the model development stages.
Simple methods requiring low computational resources are examined first, with the more complex and computationally demanding techniques employed last in an effort to increase model quality.

[Figure 1.2: Flow diagram for the GA. Boxes: Initialization (random), Population, Fitness function, Mating and mutation, Best model.]

The first and most widely used mathematical technique in QSAR analysis is multiple regression analysis (MRA). Regression analysis is a powerful means of establishing a correlation between independent variables, which in this case usually include physicochemical parameters, and a dependent variable such as biological activity [22, 30].

1.5.1 Multiple Linear Regression Analysis (MLRA)

The goal of MLRA is to find the best subset of descriptors which provides accurate predictions for each compound in the training set. For each model, the values of the descriptor coefficients and y-intercept are found that provide the most accurate mapping between the input descriptors and the property of interest. Generally, linear regression is represented by the equation below:

$$ c = Rb \tag{1.5} $$

where $c$ is the matrix of molecular activities (n samples × 1), $R$ is the matrix of descriptors (n samples × n descriptors) and $b$ is the vector of model coefficients (n descriptors × 1). Using the descriptor matrix $R$ and the known activities $c$ of the compounds, the regression coefficients of equation 1.5 can be estimated as:

$$ \hat{b} = (R^{T}R)^{-1}R^{T}c \tag{1.6} $$

where $\hat{b}$ is the estimated regression vector, $(R^{T}R)^{-1}R^{T}$ is the pseudo-inverse of the descriptor matrix $R$, and $c$ is the activity of the compounds. The estimated regression vector $\hat{b}$ can then be used to predict the activity of unknown compounds:

$$ \hat{c} = r_{unk}\hat{b} \tag{1.7} $$

where $r_{unk}$ is the matrix of the known descriptors of the unknown compounds and $\hat{c}$ is their predicted activity.

Multiple regression calculates an equation describing the relationship between a single dependent y variable and several explanatory x variables [31].
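Equations 1.5 to 1.7 can be illustrated with a small numerical sketch. The code below is a hedged example rather than the software used in this study: the helper names (solve, fit_mlr, predict) and the toy data are hypothetical, a leading column of ones is added to R as an assumption so that the first coefficient acts as the y-intercept, and the normal equations are solved by Gaussian elimination instead of forming the inverse explicitly.

```python
def solve(A, y):
    """Solve the square system A x = y by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_mlr(R, c):
    """Estimate b from the normal equations (R^T R) b = R^T c (equation 1.6)."""
    p = len(R[0])
    RtR = [[sum(row[i] * row[j] for row in R) for j in range(p)] for i in range(p)]
    Rtc = [sum(row[i] * ci for row, ci in zip(R, c)) for i in range(p)]
    return solve(RtR, Rtc)


def predict(r_unk, b):
    """Predicted activity of one unknown compound (equation 1.7)."""
    return sum(x * bi for x, bi in zip(r_unk, b))


# Toy training set: [1 (intercept), descriptor 1, descriptor 2] per compound.
R = [[1.0, 1.0, 2.0], [1.0, 2.0, 1.0], [1.0, 3.0, 4.0],
     [1.0, 4.0, 3.0], [1.0, 5.0, 6.0]]
# Activities generated from c = 0.5 + 2*x1 - 1*x2, so the fit should recover these.
c = [0.5, 3.5, 2.5, 5.5, 4.5]

b_hat = fit_mlr(R, c)
print([round(v, 6) for v in b_hat], round(predict([1.0, 6.0, 5.0], b_hat), 6))
```

Solving the normal equations directly is adequate for a small, well-conditioned example such as this one, but it breaks down when the descriptors are strongly correlated and $R^{T}R$ becomes singular or nearly so.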
The independent variables in this case usually include the physicochemical parameters, and the biological data are taken as the dependent variable. The analysis derives an equation of the form [11]:

$$ Y = a_1x_1 + a_2x_2 + a_3x_3 + \dots + a_nx_n + e \tag{1.8} $$

The squared multiple correlation coefficient $r^2$ describes how closely the equation fits the data. If the regression equation describes the data perfectly then $r^2$ will be 1.0 [32, 33].

$$ r^2 = \frac{SSR}{SST} \tag{1.9} $$

where SSR is the explained sum of squares of y and SST is the total sum of squares of the differences between the observed y values and their mean:

$$ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{1.10} $$

SSR is the sum of squares of the differences between the predicted y values ($\hat{y}$) and the mean:

$$ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \tag{1.11} $$

The major drawback of regression analysis is the danger of overfitting. This is the risk that an apparently good regression equation will be found, based on a chance numerical relationship between the y variable and one or more of the x variables, rather than a genuine predictive relationship. When an overfitted model is used predictively, the predicted values for untested compounds will not be accurate predictions of the true values.

1.5.2 Partial Least Squares (PLS)

PLS was developed in the 1960’s by Herman Wold as an econometric technique, but its most avid users are chemical engineers and chemometricians [33]. PLS has been applied to monitoring and controlling industrial processes; a large process can easily have hundreds of controllable variables and dozens of outputs.

PLS analysis calculates equations describing the relationship between one or more dependent variables and a group of explanatory variables [34]. PLS comprises a two-step procedure: principal component analysis (PCA) followed by multivariate linear regression (MLR).

PLS analysis can be used in exactly the same way as regression: a single y (dependent) variable and two or more x (independent) variables are specified. PLS always includes all x variables in the analysis.
As with regression, an equation is derived that allows unknown y values to be predicted from known x values [35]. Therefore, PLS is able to investigate complex structure-activity problems, to analyze data in a more realistic way, and to interpret how molecular structure influences biological activity [10].

An important feature of the method is that usually fewer factors (variables) are required. The precise number of extracted factors is usually chosen by some heuristic technique based on the amount of residual variation. Another approach is to construct the PLS model for a given number of factors on one set of data and then to test it on another, choosing the number of extracted factors for which the total prediction error is minimized.

Recall that the linear regression model has the form $c = Rb$ (equation 1.5). The difficulty often encountered when solving for $b$ is that the matrix $R^{T}R$ is not invertible because of redundancy in the variables. Principal component regression (PCR) eliminates this redundancy by constructing a new matrix $U$ whose columns are linear combinations of the original columns of $R$. Using the $U$ matrix, a new model is written as shown in equation 1.12:

$$ c = U\tilde{b} \tag{1.12} $$

The technique of PLS is similar to PCR, with the crucial difference that the quantities calculated are chosen to explain not only the variation in the independent (X) variables but also the variation in the dependent (Y) variables. PCR produces a weight matrix reflecting only the covariance of the predictor variables, while PLS regression includes the response variable Y in the process of reducing the variables, and so the covariance is between the independent and dependent variables. PCR and PLS use different approaches for choosing the linear combinations of variables for the columns of $U$.
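To see why the transformed model in equation 1.12 is easier to solve than equation 1.5, a short derivation may help. It assumes, which is not stated explicitly above, that the columns of $U$ are orthonormal ($U^{T}U = I$), as is the case when $U$ is obtained from a singular value decomposition of $R$. Applying the least-squares estimate of equation 1.6 to equation 1.12 then gives:

```latex
\hat{\tilde{b}} = (U^{T}U)^{-1}U^{T}c = I^{-1}U^{T}c = U^{T}c
```

Under this assumption no matrix inversion is required, so the redundancy that makes $R^{T}R$ singular no longer enters the calculation.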
PCR uses only the $R$ matrix to determine the linear combinations of variables, whereas in the PLS technique the covariance of the measurements with the dependent variable $c$ is used in addition to the variance in $R$ to generate $U$ [36]. The difference between PCR and PLS is illustrated in Figure 1.3.

[Figure 1.3: Illustration of the difference between PCR and PLS. Step 1: U is generated from R alone (PCR) or from R and c (PLS). Step 2: c = Ub is solved (PCR and PLS).]

$U$ is the score matrix from PCA, which defines the location of the samples relative to one another in row space. The score matrix is related to the original matrix $R$ (the matrix of descriptors) in the following manner:

$$ R = USV^{T} \tag{1.13} $$

where $U$ is the score matrix, $V$ is a matrix containing the loadings and $S$ is a diagonal matrix. The orthonormal property of $V$ (i.e., $V^{T}V = I$) can be used to solve equation 1.13 for $U$ as follows:

$$ U = RVS^{-1} \tag{1.14} $$

Equation 1.12 can then be solved with the following equation, which can be used to predict the activity of unknown compounds:

$$ \hat{\tilde{b}} = U^{T}c \tag{1.15} $$

where $c$ is the matrix of activities, $U^{T}$ is the pseudo-inverse of the score matrix from PCA and $\hat{\tilde{b}}$ is the regression vector.

1.6 Applications of QSAR

The major goal of QSAR in chemical research is to predict the behavior of new molecules, using relationships derived from analysis of the properties of previously tested molecules. QSAR studies represent one of the best methodologies in computer-based drug design, offering valuable information about biological activity and providing a computationally inexpensive way to design potential bioactive drugs.

MRA has been used to generate QSAR models correlating topological descriptors with the anti tumor activity of 20(S)-camptothecin derivatives. Good QSAR models can be used to guide the design of new analogues and to predict their anti tumor activity [37].

The QSAR approach can also be used to search for new agents against Mycobacterium tuberculosis (M. tuberculosis) and other typical mycobacteria.
This is significant due to the lack of effectiveness of known anti tuberculosis agents against opportunistic pathogens as a consequence of rapidly emerging resistance [38].

A QSAR method using hydrophobicity and electrophilicity as parameters was employed to investigate the structural features that affect the toxicity of nitrobenzene derivatives to yeast, and response-surface analysis was performed to develop a robust QSAR for predictive use [39]. A QSAR model has also been developed relating hydrazide potencies against Escherichia coli (E. coli) and Saccharomyces cerevisiae. The study shows that an extrathermodynamic relationship can be established between two different cell systems [40].

The inflammatory process is necessary for survival against pathogens and injury, but sometimes the inflammatory response is aggravated and sustained without benefit. A large number of homoisoflavanones have been isolated from several genera within the Hyacinthaceae family and have anti-inflammatory properties. The biological data were correlated with the physicochemical descriptors of the compounds by applying statistical regression analysis, in order to establish a quantitative structure activity relationship model with reliable predictive ability for the potential degree of anti-inflammatory activity of compounds within this class [41].

QSAR studies are also being applied in environmental assessment; toxicity to aquatic life forms is one of the crucial factors in evaluating the environmental risks of man-made chemicals. Chemicals could jointly cause toxic effects to fish at concentrations as low as 2% of their individual inhibition concentration (IC50). QSAR models derived from single-chemical toxicity assays are used to predict the concentrations of components in mixtures that would jointly cause 50% inhibition of microbial respiration [42].
Chemical and biological transformations and degradations play a role in the transport and mobility of such chemicals in the environment. Volatile organic compounds (VOCs) are a class of organic chemicals largely present in the troposphere because of their vapor pressure. QSAR modeling can be used to predict the rate constant for hydroxyl radical tropospheric degradation of 46 heterogeneous organic compounds [43].

A variety of QSAR paradigms have been presented as possible computational tools to aid the rapid assessment of the endocrine disruption potential of environmentally relevant components [44]. A large number of environmental chemicals known as endocrine-disrupting chemicals (EDCs) are suspected of disrupting endocrine functions by mimicking or antagonizing natural hormones in experimental animals, wildlife, and humans. EDCs may exert adverse effects through a variety of mechanisms, including estrogen receptor (ER)-mediated mechanisms of toxicity. Consequently, more than 58,000 environmental and industrial chemicals have been identified as candidates for possible experimental testing. QSAR could be used as an inexpensive prescreening tool to prioritize the chemicals for further testing and to classify chemicals according to their ability to bind the estrogen receptor [45].

A newer application of QSAR is a drug discovery strategy that employs QSAR models for chemical database mining. The approach classifies lead molecules into active and inactive classes and also predicts their biological activity [12].

1.7 Overview of Multidrug Resistance Mycobacterium tuberculosis

TB, or tuberculosis, is a disease caused by bacteria called Mycobacterium tuberculosis (M. tuberculosis). It can affect several organs of the human body, including the brain, the kidneys and the bones, but most commonly it affects the lungs (pulmonary tuberculosis). It is estimated that one-third of the world’s population is infected with this bacterium.
While only a small percentage of infected individuals will develop clinical tuberculosis, each year there are approximately eight million new cases and two million deaths. M. tuberculosis is thus responsible for more human mortality than any other single bacterial species [46].

Since tuberculosis spreads easily when people are in close contact with an infected person, it was more common in towns than in the countryside. People often came to towns to trade their goods or do other business. Between the sixteenth and nineteenth centuries, many of the new arrivals in the major cities of Europe were consumed by tuberculosis and other infectious diseases. A city’s population was maintained only by a steady supply of healthy young people coming to make their fortunes.

The Industrial Revolution, which began in the late eighteenth century in England and perhaps a hundred years later in the United States, brought more people into the urban areas, and city life became more perilous. People could not escape the risk of tuberculosis infection even in their own homes, away from the factories. All these factors created the perfect breeding ground for tuberculosis, which became an epidemic in Europe and later in the United States.

A number of efficacious anti tuberculosis agents were discovered in the late 1940’s and 1950’s, with the last, rifampin, introduced in the 1960’s [47]. These agents have reasonable efficacy and, when used in combination, should preclude the development of drug resistance. However, in 1962 Eleanor Roosevelt died of tuberculosis, and it was learned that the M. tuberculosis involved was drug resistant. The use (or in most cases misuse) of these drugs has led over the years to an increasing prevalence of multi-drug resistant (MDR) strains, and there is now an urgent need to develop new effective agents.

1.7.1 Mycobacterium tuberculosis

The genus Mycobacterium consists of gram positive bacilli with a distinctive cell wall characterized by the presence of unusual glycolipids.
A number of mycobacteria are pathogenic for man, but the most important is undoubtedly M. tuberculosis, the causative agent of tuberculosis [47].

M. tuberculosis is one of the tubercle bacilli; it grows well (eugonic) on egg media containing glycerol or pyruvate. Colonies resemble breadcrumbs and are cream colored. Films show clumping and cord formation, especially on moist medium. The organism is usually resistant to thiophene-2-carboxylic acid hydrazide (TCH), and is nitratase positive, aerobic and susceptible to pyrazinamide.

M. tuberculosis is an obligate aerobe which grows at different rates within cavities, caseous foci, and macrophages. The doubling time is 12-14 hours, as compared to 0.25-1 hour for most other pathogenic bacteria. Since the efficiency of many anti bacterial agents is directly proportional to the rate of growth, eradication of infection requires prolonged therapy (6-18 months).

1.7.2 How Does Tuberculosis Spread

The TB germ is carried on droplets in the air, and can enter the body through the airway. A person with active pulmonary tuberculosis can spread the disease by coughing or sneezing. To become infected, a person has to come in close contact with another person having active tuberculosis.

The process of catching tuberculosis involves two stages. The first stage of the infection usually lasts for several months. During this period, the body’s natural defenses (immune system) resist the disease, and most or all of the bacteria are walled in by a fibrous capsule that develops around the area. Before the initial attack is over, a few bacteria may escape into the bloodstream and be carried elsewhere in the body, where they are again walled in. In many cases, the disease never develops beyond this stage, and is referred to as TB infection. If the immune system fails to stop the infection and it is left untreated, the disease progresses to the second stage, active disease.
There, the germ multiplies rapidly and destroys the tissues of the lungs (or the other affected organ). Sometimes the latent period lasts many years, and the bacteria become active when the opportunity presents itself, especially when immunity is low.

The second stage of the disease is manifested by destruction or consumption of the tissues of the affected organ. When the lungs are affected, the result is diminished respiratory capacity; when other organs are affected, the disease may leave permanent, disabling scar tissue even if treated adequately [48].

Usually, the initial diagnostic or screening test for tuberculosis is the skin test. A small amount of fluid is injected under the skin of the forearm; the fluid contains a protein derived from the microorganism causing TB, and is absolutely harmless to the body. The area is visually examined by a health professional after 48-72 hours to determine the result of the test.

1.8 Minimum Inhibition Concentration (MIC)

QSAR has been used widely to predict the hazard of untested chemicals from already tested chemicals by developing statistical relationships between molecular structure descriptors and biological activity [49].

The principles of determining the effectiveness of noxious agents against a bacterium were well enumerated at the turn of the century, but the discovery of antibiotics made these tests (or their modifications) too cumbersome for the large number of tests that had to be run as routine analysis. Diffusion tests, widely used to determine the susceptibility of organisms isolated from clinical specimens, have their limitations; when equivocal results are obtained, or in prolonged serious infections such as bacterial endocarditis, the quantitation of antibiotic action needs to be more precise. The way to a precise assessment is to determine the MIC of the antibiotic for the organism concerned. MIC is the lowest concentration of the antibiotic which will inhibit the growth of microbes.
Dilution methods are used to determine the MIC of antimicrobial agents [50]. In a dilution test, microorganisms are tested for their ability to produce visible growth on a series of agar plates (agar dilution) or in microplate wells of broth (broth microdilution) containing dilutions of the antimicrobial agent. The lowest concentration of an antimicrobial agent which will inhibit the visible growth of a microorganism is known as the MIC.

MIC methods are widely used in the comparative testing of new agents. In clinical laboratories they are used to establish the susceptibility of organisms that give equivocal results in disk tests, for tests on organisms where disk tests may be unreliable, and when a more accurate result is required for clinical management.

1.8.1 Escherichia coli

The bacterium E. coli was named after the Austrian doctor Theodor Escherich (1857-1911), who first isolated this genus of bacteria, which belongs to the family Enterobacteriaceae, tribe Escherichieae. This bacterium is a common inhabitant of the intestinal tract of man and other animals; it is needed to break down cellulose and it assists in the absorption of vitamin K, the blood clotting vitamin [51].

E. coli is a motile species. It can produce acid and gas from lactose at 44 °C and at lower temperatures, is indole positive at 37 °C, is MR positive, fails to grow in citrate and is malonate and gluconate negative. It is H2S negative and usually decarboxylates lysine [52]. The structure of E. coli [53] is shown in Figure 1.4.

[Figure 1.4: Structure of E. coli]

E. coli is part of the normal bacterial flora of the gastrointestinal tract of poultry. Ten to fifteen percent of the intestinal coliforms in chickens are of pathogenic serotypes. Colibacillosis is a common systemic infection caused by E. coli in poultry. The disease causes considerable economic damage to poultry production worldwide.

At least four types of E. coli cause gastrointestinal disease in humans: enteropathogenic (EPEC), enterotoxigenic (ETEC), enteroinvasive (EIEC) and verotoxigenic (VTEC). The EPEC strains have been associated with outbreaks of infantile diarrhea and are identified serologically. ETEC strains are thought to cause gastroenteritis in both adults and children. EIEC strains cause diarrhea similar to that in shigellosis; the strains associated with invasive enteric infections are less reactive than typical E. coli. VTEC strains derive their name from their cytotoxicity towards Vero cells in tissue culture. They have been associated with haemolytic uraemic syndrome and haemorrhagic colitis.

1.9 Database Mining

Pharmaceutical lead compounds have traditionally been discovered by isolation of natural products from fermentation broths and plant extracts, and by screening corporate chemical databases. Recently, this process has been assisted by structure-based rational drug design technology. Drug design is one of the most important fields of study for bioinformatics. A major goal of drug design is to discover and optimize novel chemical substances that specifically interact with target molecules and, as a consequence, compensate for or reverse the disease process [54].

Drug abuse remains one of the most difficult and costly issues of modern society, and cocaine is among the most heavily abused and devastating illicit substances. QSAR models have been developed to correlate structural features of dopamine transporter (DAT) ligands with their biological activities. They have also been employed to search for new lead compounds in the National Cancer Institute (NCI) database, yielding five compounds that are suitable for development as novel DAT inhibitors [55].

Another application is in the search for anticonvulsant agents to treat epilepsy. Epilepsy is a chronic disorder, characterized by recurrent unprovoked seizures.
Currently, the main treatment for epileptic disorders is the long-term and consistent administration of anticonvulsant drugs. Existing therapies have failed to adequately control this disorder, documenting the need for new agents with different mechanisms of action. Variable selection KNN QSAR models have been developed and used to mine external chemical databases or virtual libraries for lead identification. This strategy was successfully applied to the discovery of novel anticonvulsant agents [56, 57].

The National Cancer Institute (NCI), USA, has been carrying out in vitro screening of compounds to determine their inhibitory activity against cell growth in the NCI 60 human cancer cell lines for the purpose of anticancer drug discovery. Web-based data mining tools allow robust analysis of the correlation between the in vitro anticancer activity of the drugs in the NCI anticancer database and the protein levels and mRNA levels of molecular targets (genes) in the NCI 60 human cancer cell lines, for the identification of potential lead compounds for specific molecular targets and for the study of the molecular mechanism of action of a drug molecule [58].

1.10 Research Scope

This study focused on developing QSAR models that correlate biological activity (e.g., anti bacterial, anti tuberculosis) with chemical structure. The validated QSAR models were applied to mining chemicals in a large database. Database mining is one of the most important follow-up applications of QSAR model development. The proposed models can be utilized to select compounds that have structural attributes similar to those of the active compounds in the training set and that are therefore expected to demonstrate anti bacterial and anti tuberculosis activity. The compounds used in the first data set were limited to those that have been extracted from natural products, while the second data set consists of compounds derived from plant terpenoids.

1.11 Research Objectives

The main objectives of this research were:

1. To develop computer models that correlate the biological activity of chemical compounds in natural products with their structural characteristics.

2. To apply the QSAR models in screening a large library of compounds (database mining).

1.12 Significance of Research

One potential contribution of this research is in the utilization of our natural products to develop anti bacterial agents. Successful development of new agents will undoubtedly increase the value of our natural resources. Terpenoids are also a class of compounds that have been extracted from natural products; they can be used to combat the growth of M. tuberculosis bacteria. As discussed previously, there is an urgent need to develop new effective agents against M. tuberculosis.

Billions of dollars are spent each year by the drug and chemical companies of the world in the effort to study the relationships between molecular structure and bioactivity for generating new drugs. By using QSAR models, we can correlate structural features of the compounds isolated from plants with their biological activity, and the models can be used to predict the activity of new compounds. Knowledge from this study can be used in the production of new drugs in the pharmaceutical industry.

1.13 Layout of the Thesis

This thesis is organized into five chapters. Chapter 1 describes the background of the research and reviews the literature in order to understand the issues and formulate the research problem. The review covers QSAR, GA, the tools and techniques of QSAR, an overview of multidrug resistant M. tuberculosis, and MIC. It also presents the research scope, research objectives, significance of the research, and layout of the thesis.

Chapter 2 presents the research methodology employed in conducting the study. Two main approaches were adopted: development of QSAR models and application of these models in database mining.
QSAR models will be used to predict the biological activity of unknown compounds not included in QSAR model development (prediction set), and they can be further exploited for the design and discovery of new potent anti bacterial agents. Database mining can be used to search for compounds that have attributes similar to those of the active compounds in the training set.

Chapter 3 presents the results from the data set which consists of compounds isolated from natural products with biological activity against E. coli. It describes the development of QSAR models to predict the anti bacterial activity of unknown compounds, followed by their application to search for new compounds in database mining. Results of biological testing of selected compounds are also presented.

Chapter 4 presents the results of the study in which plant terpenoids active against M. tuberculosis were used as the data set. It explains the QSAR model development and its validation by predicting the anti tuberculosis activity of compounds not included in the model development process. These models were later used to search for new potent anti tuberculosis agents. Finally, the activities of the new lead molecules were tested experimentally using the disk diffusion method.

Chapter 5 presents the conclusions of this study. The report culminates with some suggestions for future research.

CHAPTER 2

RESEARCH METHODOLOGY

2.1 Introduction

One method for gaining insight into the potential biological activity of a molecule is to compare its physical properties with those of a group of similar chemical compounds. QSAR can be used to predict the properties of molecules before they are actually synthesized and tested for biological activity. QSAR models can also be used to screen large libraries of compounds efficiently to identify those with desired characteristics.
The ultimate goal of any QSAR model is to establish a precise structure activity link with a set of training compounds, so that accurate predictions can be made for unknown compounds based on structure alone. The QSAR paradigm is based on the assumption that there is an underlying relationship between molecular structure and biological activity, which arises from systematic variation in structure. It is also assumed that the multivariate physicochemical description of the set of compounds reveals these analogies. All physical, chemical, and biological properties of a chemical substance can be computed from its molecular structure, encoded in numerical form with the aid of various descriptors.

In this research, data on bioactive compounds from natural products were used as samples to construct QSAR models. The important steps involved in the QSAR methodology included structure entry and molecular modeling, descriptor generation, feature selection, model construction and model validation, as illustrated in Figure 2.1.

[Figure 2.1: General QSAR methodology. Flow chart: structure entry and molecular modeling; descriptor generation; feature selection; model construction; model validation.]

2.2 Data Set

The first stage in the process of QSAR model development is the selection of a molecular data set for QSAR studies. In this study, structural and biological activity data were collected from the literature and from previous projects carried out by other researchers in the Department of Chemistry, Universiti Teknologi Malaysia. The compounds used in the data set were limited to those that have been extracted from natural products. The first data set consisted of 56 compounds isolated from Piper advuncum leaves, Piper guineense root, Piper pedicellosum, Piper ungaromense, Premna integrifolia leaves, Vitex pubescens bark, Lantana camara leaves, and Macaranga triloba bark, with MIC values (µg/mL) against E. coli measured as described in the references [59-61].
The compounds in the data set were then divided into a training set (28 compounds) for model development and a prediction set (28 compounds) for model validation, as shown in Appendix A.

The second data set, comprising 122 plant terpenoids (isolated from Salvia multicaulis, Borrichia frutescens, Melia volkensii, Inula helenium and Rudbeckia subtomentosa) with moderate to high activity (MIC, µg/mL) against M. tuberculosis, was taken from the literature [62]. The compounds were divided into a training set (61 compounds) to establish the QSAR model and a prediction set (61 compounds) for testing the accuracy of the constructed model.

The division of each data set into training set and prediction set was performed by first sorting the list in order of increasing biological activity. Next, the even numbered compounds were assigned to the training set and the odd numbered compounds were assigned to the prediction set. This method was chosen to produce more representative samples in the training set.

2.3 Structure Entry and Molecular Modeling

Structure entry and molecular modeling is an important stage in QSAR methodology. A number of software packages were used in this stage. First, ChemDraw Ultra 6.0 (CambridgeSoft) was used to draw the 2D molecular structures of the compounds. The TSAR 3.3 software package (Accelrys), which includes Corina, was used to convert the 2D structures to 3D structures, and COSMIC was used to optimize the structures of the compounds. Generation of molecular descriptors, feature selection and model generation were also carried out using the TSAR 3.3 software package (Accelrys). The software was run on Microsoft Windows XP on a Pentium IV computer.

2.4 Descriptor Generation

A common issue in QSAR is how to describe molecules and their properties. The nature of the descriptors used and the extent to which they encode the structural features related to the biological activity are a crucial part of a QSAR study. A descriptor can be defined as [11, 12]:
1.
Numerical quantities generated to represent the molecular structures.
2. Numerical values that encode certain aspects of molecular structure.
3. Physicochemical properties that describe some aspect of the chemical structure.

The TSAR 3.3 software package (Accelrys) can calculate 316 types of descriptors [33], which are summarized in Table 2.1.

Table 2.1: Types of descriptors in TSAR
Type of descriptor | Explanation
Molecular attributes | Simple descriptors including mass, surface area, volume, Verloop parameters, measures of inertia, dipole, molar refractivity, lipophilicity and lipole.
Molecular indices | Molecular indices that describe connectivity, shape, topology and electrotopology.
Atom counts | Counts the number of each specified atom in selected structures.
Ring counts | The number of rings of each size in selected structures.
Group counts | The number of each specified group in selected structures.
ADME screen | Absorption, distribution, metabolism and excretion behaviors of structures based on selected principles.
ASP similarities | An optional program that calculates similarity indices.
Vamp electrostatics | An optional semi-empirical molecular orbital package used to calculate electrostatic properties and perform structure optimization.

It is unrealistic to think that all of the descriptors contain useful information. Therefore, after numerical descriptors had been calculated for each compound, the number of descriptors was reduced to a set that is information rich but as small as possible.

2.5 Feature Selection

Usually, the number of descriptors that can be generated is very large, and there is most probably a high degree of correlation among them. Therefore, several steps of feature selection were performed on the descriptor pool to reduce it to a more manageable size. To reduce the number of descriptors, several stages of statistical testing were performed to remove descriptors that contain redundant information [17].
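The removal of redundant descriptors can be illustrated with a small sketch. It is not the TSAR procedure: the function name `correlation_filter`, the toy descriptor matrix and the rule of keeping the first descriptor of a correlated pair are assumptions made for illustration only. The cutoff of 0.9 is the value quoted in Section 2.5.1.

```python
import numpy as np

def correlation_filter(X, threshold=0.9):
    """Return indices of descriptor columns kept after pairwise filtering.

    Assumption: when two descriptors are correlated above the threshold,
    the one encountered first is kept. The thesis instead removes one of
    the pair at random or by desirability.
    """
    corr = np.corrcoef(X, rowvar=False)  # descriptor-by-descriptor matrix
    keep = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not highly correlated with a kept column.
        if all(abs(corr[j, k]) <= threshold for k in keep):
            keep.append(j)
    return keep

# Hypothetical descriptor matrix: 5 compounds x 3 descriptors.
# Column 1 is almost exactly twice column 0, so it is redundant.
X = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 8.1, 2.0],
    [5.0, 9.8, 3.0],
])
kept = correlation_filter(X, threshold=0.9)
print(kept)  # -> [0, 2]
```

Keeping the first descriptor of each pair makes the result reproducible, which is convenient for illustration; a random or desirability-based choice would replace the `keep.append(j)` rule.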
The purpose of descriptor selection is to ensure the stability of a model. The feature selection procedure can be broken into two steps. The first step, objective feature selection, involves reduction of the descriptor pool to such a level that the likelihood of finding a chance correlation will be minimal. This reduced pool was then used in the second step, subjective feature selection.

2.5.1 Objective Feature Selection

Objective feature selection examines only the independent variable values (i.e. descriptor values). The goal was to remove redundancy amongst descriptors that were highly correlated (i.e. contain the same information) and to deter chance effects during model development.

A correlation matrix is a table of all possible pairwise correlation coefficients for a set of variables. It can be used to help identify highly correlated pairs of variables, and thus identify redundancy in the data set. The correlation matrix shows the relationship between the variables (here, the descriptors) listed in its columns and rows.

Pairwise correlations were examined between all pairs of descriptors in the reduced pool. If two descriptors were highly correlated, typically above a correlation coefficient of 0.9, one was randomly removed and the descriptors with the highest desirability were retained [33].

2.5.2 Subjective Feature Selection

Subjective feature selection was used to identify the most information rich descriptor subset, the one which best maps an accurate link between structure and the property of interest [17]. Searching all combinations of descriptors is impractical, so a logical approach was taken by combining an optimization routine with a fitness evaluator to find a good model. The GA method was used to select the optimum number of descriptors for use in regression analysis [22]. GA is an optimization method inspired by Darwin's theory of natural selection and evolution. The algorithm consists of the following steps [13, 28]:
1.
Each chromosome is represented by a binary bit string, and an initial population of chromosomes is created in a random way.
2. A value of the fitness function is evaluated for each chromosome.
3. According to the values of the fitness function, the chromosomes of the next generation are produced by selection, crossover and mutation operations.

In GA, QSAR models represent chromosomes, while the descriptors comprising the models represent the genes encoding each chromosome. An initial population of models is randomly generated with descriptors from the reduced pool. The first population is evaluated for fitness, with the best models being noted in the list of top models. An average cost function is computed using all the models in the first population. The models that possess cost functions lower than the average are combined with another set of random models and passed through a mating and mutation stage to form the second population. The second population is evaluated for fitness, and if any of the second generation models are better than those in the first, they take their appropriate place in the ranking of the top models.

Mating and mutation are processes that generate a new population of children strings; each population of strings serves as parents to the subsequent population of children strings. To illustrate the process of mating and mutation, an example is given in Figure 2.2. There are two subsets, called parent 1 and parent 2, each containing five descriptors. The algorithm determines a fixed split point to perform a cross-over mating process, whereby the first two descriptors of parent 1 and the last three descriptors of parent 2 are combined to form child 1. The remaining descriptors from these two subsets are combined to form a second child. Occasionally, a mutation can arise which replaces one of the descriptors in a model with another randomly drawn from the reduced pool.
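The mating and mutation steps can be sketched in a few lines. This is only a minimal illustration using the descriptor indices of Figure 2.2, not the TSAR implementation: the function names `crossover` and `mutate`, the pool of indices 1 to 58 and the random seed are assumptions introduced here.

```python
import random

def crossover(parent1, parent2, split):
    """Single-point crossover at a fixed split position."""
    child1 = parent1[:split] + parent2[split:]
    child2 = parent2[:split] + parent1[split:]
    return child1, child2

def mutate(child, pool, rng):
    """Replace one randomly chosen descriptor with another from the pool."""
    position = rng.randrange(len(child))
    # Only draw descriptors that are not already present in the model.
    candidates = [d for d in pool if d not in child]
    mutated = list(child)
    mutated[position] = rng.choice(candidates)
    return mutated

# Descriptor subsets from Figure 2.2 (values are descriptor indices).
parent1 = [7, 15, 25, 33, 46]
parent2 = [3, 19, 23, 39, 52]

child1, child2 = crossover(parent1, parent2, split=2)
print(child1)  # -> [7, 15, 23, 39, 52]
print(child2)  # -> [3, 19, 25, 33, 46]

# Hypothetical reduced pool of descriptor indices, fixed seed for repeatability.
pool = list(range(1, 59))
mutated = mutate(child2, pool, random.Random(0))
print(mutated)
```

With a split point of 2 the children reproduce those of Figure 2.2. The mutated child differs from child 2 in exactly one position; which descriptor is replaced depends on the random draw.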
MLRA can only handle data sets where the number of descriptors is smaller than the number of molecules, unless a preselection of descriptors is carried out (e.g. by using a genetic algorithm). GA and MLRA have been combined to make a new regression tool for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure [26].

PLS and GA have also been combined into a regression technique. This hybrid approach, which integrates GA as a powerful optimization tool and PLS as a robust statistical method, is applied to variable selection and modeling [14, 63]. Genetic algorithm partial least squares (GAPLS) has been applied to QSPR studies of PCBs, and many good models were generated [28].

2.6 Model Development

After the necessary subsets of descriptors had been selected, statistical models were generated. The descriptors generated from compounds in the training set were used to build the models. For quantitative modeling, two methods, multiple linear regression analysis and partial least squares, were primarily used to develop the QSAR models. The goal was to find the best subsets of descriptors, which would produce a stable QSAR model with the ability to predict the properties of unknown compounds.

Parent 1: 7 15 25 33 46
Parent 2: 3 19 23 39 52
MATING
Child 1: 7 15 23 39 52
Child 2: 3 19 25 33 46
MUTATION
Child 2: 3 19 25 27 46
Figure 2.2: Genetic Algorithm process

2.6.1 Multiple Linear Regression Analysis (MLRA)

The multivariate regression procedure estimates the value of a dependent variable (biological activity) from independent variables represented by the different molecular descriptors. The first part of data analysis consists of using the data to determine the values of the parameters in the models so that the models fit the data well.

Stepwise regression was chosen to construct the QSAR models. In stepwise multiple regression, a selection algorithm is used to choose a subset of the input X variables.
The advantage of estimating a model with stepwise MLRA is that only a few variables are selected to construct a simple QSAR model [36]. The stepwise method combines two approaches, forward and backward stepping. At each step, partial F values were calculated for each variable. In forward stepping, the partial F values of all variables outside the model were calculated. If any variable had a value greater than F to enter, the variable with the highest partial F value was added to the model. The process was continued until no more variables qualified to enter the model. In backward stepping, the partial F values of all variables inside the model were calculated. If any variable had a value lower than F to leave, the variable with the lowest partial F value was removed from the model. The process was continued until no more variables qualified to leave the model. In general, a model can be accepted if it has fewer variables and better predictive power (r²(CV)).

Cross validation provides a rigorous internal check on the models derived using regression or partial least squares analysis. It is used to give an estimate of the true predictive power of the model, i.e. how reliable the predicted values for untested compounds are. The calculation of cross validation is similar for multiple regression and PLS analysis. The TSAR 3.3 software package (Accelrys) leaves out groups of rows in a fixed pattern, using three cross validation groups of rows. One third of the data is deleted and the values for these rows are predicted using the rest of the data. This is repeated for the second and then the third group. The model is judged based on these predictions.

2.6.2 Partial Least Squares (PLS)

PLS is frequently used as a regression method in QSAR model development. PLS has some additional advantages: it is less influenced by noise, is more stable, and gives increased predictability and improved interpretation of the results compared with other methods applied to the data set (e.g.
PCR, LDA); it is also insensitive to collinearity among the predictor variables and can handle data sets where the number of variables is larger than the number of observations [35, 64].

PLS is able to investigate complex structure-activity problems, to analyze data in a more realistic way, and to interpret how molecular structure influences biological activity. The goal of PLS is to seek the direction in the space of X which yields the biggest covariance between X and Y. Cross validation in this technique is similar to the method described for MLRA earlier.

By default, TSAR stops the PLS iteration when the statistical significance of the current vector goes above a fixed value (1.0 by default). This is a sensible criterion, but not the only possible one. Other sensible methods for choosing the number of components include:
1. Stopping at the lowest value of the predictive sum of squares (PRESS).
2. Stopping when the PRESS value first starts to increase.

Finally, a model has to be tested using an independent data set of compounds that are completely unknown to the model: the prediction set. The complete process of building a prediction model is depicted in Figure 2.3.

2.7 Model Validation

The final part of QSAR model development is model validation, in which the predictive power of the model, and hence its ability to reproduce the biological activities (anti bacterial) of untested compounds, is established. While internal statistics such as r² and r²(CV) provide some assessment of the goodness of fit of a model, they do not provide a thorough and independent assessment of how a model may predict new compounds. To assess such predictivity, the use of a prediction set is essential.

Validation of a model involves demonstration of predictive ability by predicting the property of interest for compounds not used during the generation of the models, that is, an external prediction set of compounds. For the most part, the prediction set error should be on par with or slightly above the error of the training set.
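The three-group cross validation of Section 2.6.1 and the PRESS statistic mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the TSAR code: rows are assigned to groups by their index modulo three, an ordinary least squares fit stands in for the regression model, and the cross-validated r² is taken as 1 − PRESS/SS_total, where SS_total is the sum of squared deviations of the activities from their mean. That definition is an assumption, although it appears consistent with the values reported later in Table 3.3. The function name `cross_validated_r2` and the toy data are hypothetical.

```python
import numpy as np

def cross_validated_r2(X, y, n_groups=3):
    """Leave-group-out cross validation with a fixed row pattern.

    Returns (PRESS, r2_cv), with r2_cv = 1 - PRESS / SS_total (assumed
    definition). Ordinary least squares is used as the regression model.
    """
    n = len(y)
    predictions = np.empty(n)
    for g in range(n_groups):
        test = np.arange(n) % n_groups == g  # fixed pattern: every third row
        # Fit on the remaining rows, with an intercept column.
        A_train = np.column_stack([np.ones((~test).sum()), X[~test]])
        coef, *_ = np.linalg.lstsq(A_train, y[~test], rcond=None)
        # Predict the rows that were left out.
        A_test = np.column_stack([np.ones(test.sum()), X[test]])
        predictions[test] = A_test @ coef
    press = float(np.sum((y - predictions) ** 2))
    ss_total = float(np.sum((y - y.mean()) ** 2))
    return press, 1.0 - press / ss_total

# Hypothetical one-descriptor data set with an exactly linear response.
X = np.arange(9, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
press, r2_cv = cross_validated_r2(X, y)
print(press, r2_cv)
```

For an exactly linear response the left-out rows are predicted without error, so PRESS is essentially zero and the cross-validated r² is essentially one. With real data, PRESS exceeds the residual sum of squares of the fitted model, and r²(CV) is lower than r².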
The second method of validation involves performing randomization experiments to test for chance correlations.

[Figure 2.3: Flow chart for the general model building process in QSAR studies. Steps: structures and experimental data; calculation of molecular descriptors; descriptor analysis and optimization; model development; evaluation (predictive quality sufficient: yes/no); test (predictive quality sufficient: yes/no); final model validation.]

2.8 Application of QSAR Models to Database Mining

Various QSAR models have the common purpose of establishing a meaningful correlation between activity and quantitative descriptors of chemical structure; thus, the successful development of an alternative QSAR model confirms the existence of a structure activity relationship intrinsic to a data set. To be effective in the modern drug discovery and development process, QSAR models must satisfy requirements that sometimes conflict: they must be easy to update with the information flow produced by synthesis and biological testing activities, and they must be able to screen large structural databases virtually in a short time [65].

Structure activity relationship studies can also be used, in principle, to identify active and inactive untested compounds and to design new active compounds. Braga and Galvao [66] used structure activity relationship studies to correlate some molecular descriptors directly with the biological activity of benzo[c]quinolizin-3-ones (BC3), and this information can be used to classify untested compounds as active or inactive.

Database mining is an obvious future application of QSAR models, in which compounds are sought that have structural attributes similar to those of the active compounds in the training set. In this study, the Anti microbial database (AmicBase database), which consists of 3339 compounds [67], was used as a source to search for new active compounds against E. coli and M. tuberculosis. The general flowchart of the database mining procedure is summarized in Figure 2.4 and includes the following major steps.
1.
Develop predictive QSAR models for a data set of compounds with known structures and activities.
2. Select probe molecules (i.e. biologically active compounds) and calculate their chemical descriptors.
3. Compute the same chemical descriptors for all compounds in the database.
4. Calculate chemical similarity values (the Euclidean distance was used) between all active probes and every structure in the database.
5. Rank all database structures by their similarity to a probe and select M structures within a certain similarity cutoff value.
6. Predict the biological activity values of these M structures based on pre-constructed QSAR models, using the applicability domain as an additional similarity threshold.
7. Select as hits the structures predicted by all of the QSAR models to have high values of biological activity.

[Figure 2.4: Flowchart of database mining that employs predictive QSAR models. Boxes: a database of N structures (e.g., AmicBase); experimental QSAR data set (e.g., antibacterial); calculate chemical descriptors (for each); develop QSAR model; evaluate similarity; select M (of N) structures with the highest similarity to active compounds; predict bioactivity; select hits.]

2.8.1 Molecular Descriptors and Similarity Calculation

Molecular descriptors were calculated with the TSAR 3.3 software package (Accelrys) for the probe molecules and the compounds in the database. Next, the similarity between the active compounds in the training set and the compounds in the database was calculated using the descriptors that were included in the QSAR model [56].

The perception of structural similarity is relative and should always be considered in the context of a particular biological target. In similarity searching systems, as well as in QSAR, a molecule is represented by a set of M numerical descriptors, denoted as $X_1, X_2, X_3, \ldots, X_M$, where the $X_k$ are the values of individual descriptors [24].
Thus, a molecule can be geometrically represented by a point in M-dimensional descriptor space with coordinates $X_1, X_2, X_3, \ldots, X_M$.

Many QSAR methods require scaling of the original data to extract significant and useful information and to remove unimportant, uninteresting features. Scaling the descriptors is a very delicate procedure, because for most descriptors the underlying relationship with the activity is not known, and therefore the influence of these manipulations cannot be foreseen [68]. In this research, range scaling was used to avoid giving descriptors with significantly higher ranges a disproportionately higher weight in the molecular similarity calculations. It was calculated as follows:

$$y_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \qquad (2.1)$$

where $y_i$ is the scaled value, $x_i$ is the original value, $\min(x)$ is the minimum of the collection of $x$ objects and $\max(x)$ is the maximum of the collection of $x$ objects.

Euclidean distance was used as the measure of similarity in the multidimensional descriptor space between all active probes (i.e., molecules used for QSAR model development) and every structure in the database. The distance $d_{ij}$ between any two compounds i and j in N-dimensional descriptor space was calculated using the following equation:

$$d_{ij} = \sqrt{\sum_{n=1}^{N} \left( X_{in} - X_{jn} \right)^2} \qquad (2.2)$$

where $X_{in}$ and $X_{jn}$ are the values of the nth descriptor for compounds i and j, respectively, and the summation is over all descriptors. Compounds with the smallest distance (highest similarity) from the active probes were considered as hits and subjected to prediction of their biological activity based on the QSAR models.

2.8.2 Applicability Domain of QSAR Model

Formally, a QSAR model can predict the target property for any compound for which chemical descriptors can be calculated. In practice, however, a prediction can be expected to be reliable only for compounds that lie near the training set in descriptor space. The nearness is measured by an appropriate distance metric (e.g. a molecular similarity measure as applied to the classification of molecular structures).
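As an illustration of Equations 2.1 and 2.2 from Section 2.8.1, the range scaling and the probe-to-database Euclidean distances can be computed with NumPy. This is a minimal sketch: the function names `range_scale` and `euclidean_distances` and the toy arrays are hypothetical, and taking the minimum and maximum over an explicitly supplied reference set is an assumption, since the text does not state which collection of compounds defines them.

```python
import numpy as np

def range_scale(X, reference):
    """Equation 2.1 applied column by column.

    Assumption: min(x) and max(x) are taken over the `reference` array.
    Constant columns are left at zero to avoid division by zero.
    """
    lo = reference.min(axis=0)
    hi = reference.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def euclidean_distances(probes, database):
    """Equation 2.2 for every (probe, database compound) pair.

    Returns an array of shape (number of probes, number of database rows).
    """
    diff = probes[:, None, :] - database[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Hypothetical descriptor values: 3 compounds x 2 descriptors.
X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
scaled = range_scale(X, reference=X)
print(scaled)  # each column now runs from 0 to 1

# Distance between a probe at the origin and a compound at (3, 4).
d = euclidean_distances(np.array([[0.0, 0.0]]), np.array([[3.0, 4.0]]))
print(d[0, 0])  # -> 5.0
```

Ranking the database by similarity to a probe then amounts to sorting a row of the distance array, for example with `np.argsort`.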
The structural similarity between the bioactive compounds in the training set and the compounds in the database was calculated, and a special similarity threshold was introduced to avoid making predictions for compounds that differ substantially from the training set molecules. The threshold ($D_T$) was calculated from the training set models as follows:

$$D_T = \bar{y} + Z\sigma \qquad (2.3)$$

where $\bar{y}$ is the average Euclidean distance between each compound and its nearest neighbors in the training set, $\sigma$ is the standard deviation of these Euclidean distances and $Z$ is an arbitrary parameter that controls the significance level. The default value of the parameter $Z$ is 0.5 [56, 69]. If the distance of an external compound from at least one of its nearest neighbors in the training set exceeds this threshold, it is considered impossible to evaluate its activity accurately, and the compound was excluded from consideration [23].

2.8.3 Biological Activity Predicted using QSAR Models

The QSAR technique consists of the construction of a mathematical model relating the structure of molecules from a database to a property or a biological activity by means of statistical tools. Once a correlation has been established, it can be used to predict the property or biological activity of new molecules. The essential characteristic of any QSAR model is its predictive power, defined as the ability of the QSAR model to predict the activity of unknown compounds in the prediction set.

The biological activities (in this case, anti bacterial and anti tuberculosis activity) of the compounds selected by database mining were predicted using the QSAR models. The average of the activities predicted by the individual models was used as the predicted MIC value. The success of the QSAR model development was also measured by laboratory testing of the new compounds found in the database.
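The applicability domain threshold of Equation 2.3 (Section 2.8.2) can be illustrated with a short sketch. Several choices here are assumptions rather than details given in the text: each training compound contributes the distances to its k nearest neighbors (k = 1 by default), the population standard deviation is used, and a candidate is accepted when its distance to the closest training compound does not exceed the threshold. The function names and the toy coordinates are hypothetical.

```python
import numpy as np

def pairwise_distances(A, B):
    """Euclidean distances between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def domain_threshold(train, z=0.5, k=1):
    """D_T = mean + Z * standard deviation of nearest-neighbor distances.

    Assumption: the distances are those from each training compound to its
    k nearest neighbors within the training set.
    """
    d = pairwise_distances(train, train)
    np.fill_diagonal(d, np.inf)  # ignore the distance of a compound to itself
    nearest = np.sort(d, axis=1)[:, :k]
    return float(nearest.mean() + z * nearest.std())

def inside_domain(train, candidates, threshold):
    """True where a candidate lies within the threshold of the training set."""
    d = pairwise_distances(candidates, train)
    return d.min(axis=1) <= threshold

# Hypothetical scaled descriptors: four training compounds on a unit square.
train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
threshold = domain_threshold(train, z=0.5)

# One candidate inside the square and one far away from it.
candidates = np.array([[0.5, 0.5], [5.0, 5.0]])
mask = inside_domain(train, candidates, threshold)
print(threshold, mask)
```

In this toy case every nearest-neighbor distance is 1, so the threshold is 1; the first candidate is retained and the second is excluded. Only the retained candidates would be passed to the QSAR models, whose predictions are then averaged as described in Section 2.8.3.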
2.9 Laboratory Testing

Mining of the database using the best QSAR models yielded some probable active compounds. To confirm the computational results, it was necessary to perform laboratory experiments to test the biological activity of selected compounds against E. coli for the first data set and M. tuberculosis for the second data set.

Some measure of the effectiveness of a chemotherapeutic agent against a pathogen can be obtained from the minimal inhibition concentration (MIC). MIC testing can show which agents are most effective against a pathogen and give an estimate of the proper therapeutic dose. The agar diffusion method was used to determine the response of E. coli and M. tuberculosis to the test agents.

2.9.1 Material and Method of Agar Diffusion

In this study, the activity of the compounds was tested against two types of bacteria. The anti bacterial activity was tested against E. coli BL21 (Promega) as a representative of gram negative bacteria, while the anti tuberculosis activity was tested against M. tuberculosis as a representative of gram positive bacteria.

M. tuberculosis is a pathogenic bacterium and it is highly risky to use it in the laboratory. Therefore, Rhodococcus sp, which is similar to Mycobacterium since they belong to the same class (Actinobacteria) and suborder (Corynebacterineae) [70], was used instead. Furthermore, it was readily available in the Department of Biology, Universiti Teknologi Malaysia.

Some reagents were needed to measure the activity of the bacteria, such as nutrient agar, which was used to grow the bacteria. Agar preparation included the following steps [71]:
1. Nutrient agar (Sigma) was prepared from a commercially available dehydrated form.
2. Immediately after autoclaving, it was allowed to cool in a 45 to 50 °C water bath.
3. The freshly prepared and cooled medium was poured into glass or plastic, flat bottomed Petri dishes on a level, horizontal surface to give a uniform depth of approximately 4 mm.
This corresponds to 60 to 70 mL of medium for plates with a diameter of 150 mm and 25 to 30 mL for plates with a diameter of 100 mm.
4. The agar medium was allowed to cool to room temperature and, unless the plate was used the same day, it was stored in a refrigerator (2 to 8 °C).
5. Plates were used within seven days after preparation unless adequate precautions, such as wrapping in plastic, had been taken to minimize drying of the agar.
6. A representative sample of each batch of plates was examined for sterility by incubating at 25 to 30 °C for 24 hours or longer.

Anti bacterial agents were accurately weighed and dissolved in the appropriate diluents to yield the required concentration, using sterile glassware. Standard strains of stock cultures were used to evaluate the anti bacterial stock solution. If satisfactory, the stock can be aliquoted in 5 mL volumes and frozen at -20 °C or -60 °C.

Normal Whatman filter paper was used to prepare discs approximately 0.5 cm in diameter, which were placed in a Petri dish and sterilized in a hot air oven. The loop used for delivering the anti bacterial agent was made of 20 gauge wire and had a diameter of 2 mm. This delivers 10 µL of anti bacterial agent to each disc.

Treatment discs were placed with flame-sterilized forceps onto the inoculated plate with sufficient space between the discs. No more than 8 or 9 discs (with one in the middle) should be placed on a 100 mm diameter plate, in order to accommodate the resulting zones of inhibition without significant overlap of adjacent zones.

During incubation, the agent diffuses from the filter paper into the agar. The further it gets from the filter paper, the lower the concentration of the agent. At some distance from the disc, the MIC is reached and a zone of inhibition is thus created. The diameter of the zone is proportional to the amount of antimicrobial agent added to the disc.
CHAPTER 3

DEVELOPMENT OF QSAR MODELS AND DATABASE MINING FOR ANTI BACTERIAL AGENTS

3.1 Introduction

This chapter describes the results of the development of QSAR models using GAPLS and MLRA, and the application of the models to mining chemicals in a database. It begins with the selection of descriptors, using a correlation matrix for objective feature selection and a genetic algorithm for subjective feature selection. The statistical results of the QSAR models obtained using the GAPLS and MLRA techniques are described in the next section. This is followed by validation of the QSAR models by predicting the biological activity of unknown compounds that were not included in the model development process. The results of mining chemicals in a database are presented in Section 3.6. The activity of the new active agents, determined using the agar diffusion method, and the related discussion are described in the last section.

3.2 Selection of Descriptors and Feature Selection

The development of molecular structure descriptors is the most important part of any structure-activity investigation, because the descriptors must contain enough information to permit the correct characterization of the compounds. Descriptors are numerical values that characterize properties of molecules. Common descriptors used for QSAR models include topological, geometrical and electronic descriptors.

In this study, 316 descriptors were calculated for the compounds in the data set, but not all descriptors were used to develop the model. Having identified a set of suitably distributed, non-correlated descriptors, it is necessary to decide which should be incorporated into the QSAR equation. Feature selection was applied to reduce the descriptor pool and select the best descriptors to be included in the model development process. Selection of descriptors was done by correlating each descriptor with one another using data reduction techniques [20, 33].
Redundant or highly correlated descriptors were removed from the descriptor pool during objective feature selection. Redundancy lessens the discriminating power of descriptors, thereby reducing their worth in model development. The pairwise correlation analysis produces a correlation matrix for all input variables. A coefficient of 1.0 indicates that two variables are perfectly correlated. A coefficient of 0.0 indicates no correlation. Pairwise correlations were performed on members of the descriptor pool, removing one of the two descriptors randomly if their correlation coefficient exceeded 0.90.

The reduced descriptor pool used to develop the models reported in this work contained 58 descriptors; it is summarized in Table 3.1. The correlation matrix of these descriptors is presented in Table 3.2. This pool of descriptors was held constant throughout the entire model building process.

Table 3.1: List of selected descriptors and their statistical analysis
Descriptor | Mean | S.d. | Descriptor | Mean | S.d.
Inertia moment 1 size | 133.30 | 81.83 | Inertia moment 2 length | 3.74 | 0.86
Inertia moment 3 length | 3.17 | 0.68 | Ellipsoidal volume | 1102.32 | 1432.25
Verloop B1 (sub. 1) | 1.63 | 0.11 | Verloop B1 (sub. 2) | 1.62 | 0.16
Verloop B1 (sub. 3) | 1.34 | 0.28 | Verloop B2 (sub. 1) | 1.87 | 0.41
Verloop B2 (sub. 2) | 1.71 | 0.28 | Verloop B3 (sub. 1) | 2.38 | 0.68
Verloop B3 (sub. 2) | 2.04 | 0.59 | Verloop B5 (sub. 1) | 3.22 | 1.43
Verloop B5 (sub. 2) | 2.37 | 0.85 | Verloop B5 (sub. 2) | 1.57 | 0.65
Total dipole moment | 2.71 | 1.57 | Dipole moment X | -0.42 | 2.29
Dipole moment Y | 0.02 | 1.74 | Dipole moment Z | -0.31 | 1.23
Log P | 3.41 | 2.04 | Total lipole | 4.22 | 3.50
Lipole X component | 0.33 | 4.34 | Lipole Y component | -1.02 | 2.60
Lipole Z component | -0.56 | 1.88 | Kier Chi6 (ring) | 0.08 | 0.06
Kier ChiV3 (ring) | 0.01 | 0.03 | Kier ChiV5 (ring) | 0.02 | 0.03
Kier ChiV6 (ring) | 0.04 | 0.04 | KAlpha 3 index | 4.21 | 3.41
Balaban topological | 1.79 | 0.39 | Vamp heat of formation | -98.32 | 85.45
Vamp LUMO | 0.03 | 0.68 | Vamp HOMO | -9.31 | 0.46
Vamp pol. XY | 0.94 | 2.26 | Vamp pol. XZ | 0.54 | 2.34
Vamp pol. YY | 39.57 | 11.33 | Vamp pol. YZ | -0.39 | 2.59
Vamp pol. ZZ | 31.81 | 13.59 | Vamp quadpole XX | 4.97 | 16.01
Vamp quadpole XY | 1.30 | 16.14 | Vamp quadpole XZ | -2.64 | 10.25
Vamp quadpole YY | -1.82 | 16.23 | Vamp quadpole YZ | -2.97 | 9.61
Vamp quadpole ZZ | -3.15 | 10.96 | Vamp octupole XXX | -58.83 | 177.63
Vamp octupole XXY | -13.19 | 70.52 | Vamp octupole XXZ | -0.88 | 61.7
Vamp octupole YYX | 9.84 | 103.40 | Vamp octupole YYY | -58.39 | 213.74
Vamp octupole YYZ | -12.50 | 50.36 | Vamp octupole ZZX | -6.28 | 74.26
Vamp octupole ZZY | 19.18 | 67.40 | Vamp octupole ZZZ | 3.44 | 109.72
Vamp octupole XYZ | -0.21 | 67.18 | ADME weight | 266.42 | 81.36
ADME H bond acceptors | 2.71 | 2.33 | ADME H bond donors | 0.96 | 1.29
ADME violation | 0.28 | 0.46 | Cosmic total energy | -87.71 | 115.50

3.3 Model Development Using MLRA Method

After descriptor generation, subsets of descriptors were examined to form predictive models using two computational methodologies, i.e. MLRA and PLS. In multiple regression, a selection algorithm is used to choose a subset of the input X variables [32]. It finds a correlation between molecular structures and their corresponding property through a linear combination of structural descriptors, and only the chosen descriptors are included in the model. This can mean that a variable which appears to be highly significant in the final model will be selected.

The final model developed using the MLRA method and some results of statistical tests are presented in Table 3.3. These results are the key to assessing the ability of the model to predict the biological activity of the compounds in the prediction set.
Multiple regression calculates an equation describing the relationship between a single dependent y variable and several explanatory x variables. The dependent variable in this case is MIC. The best model generated using MLRA for the first data set has an r² value of 0.87 and an r²(CV) of 0.74. The equation is:

Y = 0.021 × Verloop B1 (subst. 3) + 0.002 × Lipole Z component + 0.063 × Kier Chi6 (ring) + 0.0007 × Vamp Quadpole YZ + 0.007 × ADME H bond donors + 0.005        (3.1)

Table 3.3: Statistical output of MLRA model

Statistical output                   Value
r²                                   0.874
Cross validation r²(CV)              0.748
Residual sum of squares (RSS)        2.452
Predictive sum of squares (PRESS)    4.902

Five variables were included in the QSAR model developed using the MLRA technique. These descriptors are explained in Table 3.4.

A plot of experimental vs. predicted MIC is shown in Figure 3.1, while a plot of predicted value vs. standard residual is presented in Figure 3.2. Although this is a very good model in terms of r², the cross-validated r² is somewhat smaller, which could indicate an unstable model that might not be very useful for prediction. Furthermore, the five-term equation will be very dependent on the trend of the data in the training set. A brief explanation of the statistical analysis of the MLRA method is given in Table 3.5.

Table 3.4: Descriptors which were included in the QSAR model developed using MLRA

Descriptor class          Symbol                   Explanation
Verloop parameter         Verloop B1 (subst. 3)    The smallest distance from the axis of the attachment bond, measured perpendicularly to the edge of the substituent.
Connectivity indices      Kier Chi6 (ring)         Numeric descriptor derived from molecular topology that reflects the atom identities, bonding environment and number of bonding hydrogens.
Electrostatic parameter   Vamp Quadpole YZ         Property of the molecule arising from the interaction between a charge probe, such as a positive unit point charge representing a proton, and the target molecule.
ADME parameter            ADME H bond donor        Number of H bond donors (ADME: absorption, distribution, metabolism and excretion).

Figure 3.1: Plot of experimental vs. predicted MIC for MLRA model

Figure 3.2: Plot of predicted value vs. standard residual for MLRA model

Table 3.5: Statistical analysis of MLRA method

Component                   Analysis
S value                     Standard error of the regression model. For a model with good predictive power, this is an estimate of how accurately the model predicts unknown y values.
F value                     Derived from the sum of squares values and the degrees of freedom.
r²                          The fraction of the total variance of the y variable that is explained by the regression equation. The closer the value is to 1.0, the better the regression equation explains the y variable.
r²(CV)                      A key measure of the predictive power of the model. The closer the value is to 1.0, the better the predictive power. For a good model, r²(CV) should be fairly close to r² (it will usually be lower).
Residual sum of squares     The variance of the residuals not explained by the regression equation.
Predictive sum of squares   A measure of how well the fitted values of a subset model can predict the observed responses Yi.

3.4 Model Development Using PLS Method

To overcome the limitations of the MLRA model, the PLS technique was also used to develop the QSAR model. PLS is insensitive to collinearity among the predictor variables and can handle data sets where the number of variables is larger than the number of observations [72]. PLS analysis calculates an equation describing the relationship between one or more dependent variables and a group of explanatory variables. PLS may also be used in exactly the same way as MLRA, with a single Y (dependent) variable and two or more X (independent) variables specified.

The PLS method was also aided by the GA technique to select the descriptors to be included in the model [15, 28].
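The statistics of Table 3.5 are connected by two standard identities, r² = 1 - RSS/SS_total and r²(CV) = 1 - PRESS/SS_total, where SS_total is the total sum of squares of the y variable. The short sketch below assumes these usual definitions (the source does not state the formulas used by TSAR) and checks them against the MLRA values of Table 3.3; the function names are hypothetical.

```python
def total_sum_of_squares(rss, r2):
    """Recover SS_total from RSS and r^2, using r^2 = 1 - RSS / SS_total."""
    return rss / (1.0 - r2)


def cross_validated_r2(press, ss_total):
    """r^2(CV) = 1 - PRESS / SS_total."""
    return 1.0 - press / ss_total


# Values reported for the MLRA model in Table 3.3
ss_total = total_sum_of_squares(rss=2.452, r2=0.874)
q2 = cross_validated_r2(press=4.902, ss_total=ss_total)
print(round(q2, 3))  # 0.748, the r2(CV) value listed in Table 3.3
```

Because PRESS is built from predictions of compounds left out of the fit, it is never smaller than RSS for a least-squares model, which is why r²(CV) is usually lower than r².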
The PLS routine in TSAR stops the iteration when a model meets one of the following criteria [30, 33]: the PRESS value is at its lowest, or the PRESS value starts to increase. Table 3.6 shows the statistical output of GA-PLS for each dimension, and a plot of PRESS vs. number of PLS components is shown in Figure 3.3. The data and the plot indicate that the best model generated using PLS for this data set has six components, with r² of 0.96 and r²(CV) of 0.86. The high value of r²(CV) and the lowest value of PRESS indicate a more stable model that is more suitable for predicting compounds not included in the training set.

Table 3.6: Statistical output of GA-PLS for each dimension

Statistical output   PLS dim 1   PLS dim 2   PLS dim 3   PLS dim 4   PLS dim 5   PLS dim 6   PLS dim 7
r²                   0.661       0.864       0.927       0.955       0.967       0.968       0.968
r²(CV)               0.397       0.605       0.824       0.853       0.858       0.861       0.855
RSS                  8.474       3.393       1.819       1.112       0.824       0.781       0.781
PRESS                15.070      9.851       4.417       3.681       3.539       3.471       3.634

Figure 3.3: Plot of PRESS vs. number of components

The statistical output of the PLS model is shown in Table 3.7, the descriptors included in the QSAR model developed using the PLS technique are listed in Table 3.8, and a plot of experimental vs. predicted MIC is shown in Figure 3.4.

Table 3.7: Statistical output of PLS model

Statistical output                   Value
Fraction of variance                 0.9687
Cross validation r²(CV)              0.8611
Residual sum of squares (RSS)        0.7817
Predictive sum of squares (PRESS)    3.4714

Figure 3.4: Plot of experimental vs. predicted MIC for PLS model

This plot displays the activity predicted by a QSAR model against the experimentally measured or observed activity. The data are plotted as a scatter plot, where each point represents one compound of the data set. Ideally, the points of the scatter plot form a straight line.
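The PRESS-based stopping rule can be illustrated with the values of Table 3.6. The sketch below is only an illustration of the rule as described in this section, not the TSAR implementation, and the function name is hypothetical.

```python
def best_dimension(press_by_dim):
    """Return the 1-based PLS dimension at which PRESS stops decreasing.

    Dimensions are scanned in order; the scan stops at the first one whose
    successor has a larger PRESS. If PRESS never increases, the last
    dimension is returned.
    """
    for k in range(len(press_by_dim) - 1):
        if press_by_dim[k + 1] > press_by_dim[k]:
            return k + 1
    return len(press_by_dim)


# PRESS values for PLS dimensions 1 to 7, taken from Table 3.6
press = [15.070, 9.851, 4.417, 3.681, 3.539, 3.471, 3.634]
print(best_dimension(press))  # 6
```

With these values PRESS falls up to the sixth dimension and rises at the seventh, which agrees with the six-component model selected above.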
From the plot (Figure 3.4), it can be concluded that the PLS technique has generated a QSAR model with a high degree of accuracy, as confirmed by the close spread of the points around the ideal line.

Table 3.8: Descriptors which were included in the QSAR model developed using PLS

Descriptor class          Symbol                   Explanation
Verloop parameter         Verloop B1 (subst. 3)    The smallest distance from the axis of the attachment bond, measured perpendicularly to the edge of the substituent.
Molecular attributes      Lipole Z component       Measure of the lipophilic distribution. It is calculated using the substituent's point of attachment as the origin, with this bond placed along the x-axis.
Connectivity indices      Kier Chi6 (ring)         Numeric descriptor derived from molecular topology that reflects the atom identities, bonding environment and number of bonding hydrogens.
Connectivity indices      Kier ChiV3 (ring)        Numeric descriptor whose indices are derived from the number of skeletal neighbours of each atom and include information about atomic identities.
Electrostatic parameter   Vamp polarization XY     Calculated with Vamp, an optional semi-empirical molecular orbital program that performs structure optimization.
ADME parameter            ADME H bond donor        Number of H bond donors (ADME: absorption, distribution, metabolism and excretion).

The plot of predicted value vs. standard residual is presented in Figure 3.5. The residuals are the differences between predicted and observed activities. According to this plot, the residuals were evenly distributed and no observation can be considered an outlier.

Figure 3.5: Plot of predicted value vs. standard residual for PLS model

3.5 Model Validation

It is important to evaluate the robustness and the predictive capacity, or validity, of the model before using it for interpretation and prediction of biological activity. The purpose of model validation is to test how well the model predicts the biological activities of compounds that were not used to build it. The models were validated by predicting MIC for the compounds in the prediction set.
The calculated MIC values are shown in Table 3.9, and the correlation coefficients (r²) between predicted and experimental values were also calculated for both models. The high value of r² (0.88) between calculated and experimental values indicated that both models were stable and capable of predicting the anti bacterial activity of compounds not included in the model development process.

Table 3.9: Calculated MIC for compounds in the prediction set

Compound No.   Experimental MIC   Calculated MIC (PLS)   Calculated MIC (MLRA)
1              0.10               0.091                  0.089
2              0.07               0.064                  0.067
3              0.06               0.061                  0.058
4              0.06               0.047                  0.055
5              0.05               0.059                  0.056
6              0.05               0.049                  0.048
7              0.05               0.054                  0.055
8              0.05               0.051                  0.053
9              0.05               0.044                  0.042
10             0.05               0.043                  0.046
11             0.05               0.042                  0.040
12             0.05               0.057                  0.048
13             0.05               0.052                  0.055
14             0.05               0.051                  0.048
15             0.05               0.049                  0.046
16             0.04               0.041                  0.045
17             0.04               0.032                  0.034
18             0.04               0.047                  0.044
19             0.04               0.039                  0.036
20             0.04               0.036                  0.039
21             0.03               0.027                  0.031
22             0.03               0.033                  0.034
23             0.03               0.035                  0.034
24             0.03               0.025                  0.024
25             0.026              0.027                  0.023
26             0.025              0.035                  0.034
27             0.025              0.025                  0.027
28             0.020              0.020                  0.023

3.6 Application of QSAR Models to Database Mining

One popular computational approach to rational drug discovery is database mining, which relies on the structures of known active molecules as queries. Applications of QSAR can be extended to any molecular design purpose, including environmental sciences, prediction of different kinds of biological activity by correlation of congeneric series of compounds, lead compound optimization, classification, diagnosis and elucidation of mechanisms of drug action, and prediction of novel structural leads in drug discovery. The developed QSAR models were capable of predicting the anti bacterial activity of the 28 excluded compounds in the prediction set with a high degree of accuracy.
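The external validation statistic can be sketched as the squared Pearson correlation between experimental and calculated MIC values. This is an assumption about how r² was obtained, since the source does not give the formula, and the function name is hypothetical. Only the first five rows of Table 3.9 are used here, so the printed value is an illustration and is not expected to equal the 0.88 reported for the full prediction set.

```python
def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mo = sum(observed) / n
    mp = sum(predicted) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    soo = sum((o - mo) ** 2 for o in observed)
    spp = sum((p - mp) ** 2 for p in predicted)
    return sop * sop / (soo * spp)


# First five rows of Table 3.9: experimental MIC and PLS-calculated MIC
experimental = [0.10, 0.07, 0.06, 0.06, 0.05]
pls_calculated = [0.091, 0.064, 0.061, 0.047, 0.059]
print(round(r_squared(experimental, pls_calculated), 2))
```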
In the next stage, the models were applied to search for biologically active compounds in a large database. Potentially active compounds in the database were selected based on their similarity to the active compounds in the training set (Table 3.10). Compounds that demonstrated a minimum inhibition concentration (MIC) of 64 µg/mL or lower [62] were selected as the similarity probes for database mining.

Table 3.10: List of probe compounds for database mining
[Structure drawings of the 28 probe compounds with their MIC values (µg/mL), which range from 0.02 to 0.1]

3.6.1 Application of QSAR Models in AmicBase Database Mining (without scaling)

Mining of databases containing large numbers of compounds has been used to discover new active anti bacterial agents. Similarity searching, which measures the structural similarity between molecules [74], was used to search for new agents in a database. Twenty-eight compounds with anti bacterial activity of less than 64 µg/mL were selected as the similarity probes for database mining; their structures and activities are shown in Table 3.10. The degree of similarity, based on the Euclidean distance between the active compounds in the data set (28 compounds) and those in the database, was calculated using the same set of descriptors used in the QSAR model.
Out of 3339 compounds in the AmicBase database [67], 659 compounds were found to be within the chosen similarity cutoff value (0.5 Euclidean distance units) of any of the 28 probes. These compounds were further subjected to the consensus hits criterion (i.e. selected using descriptors from both models), which left only 16 compounds. Finally, after applying the applicability domain criterion, only three compounds were selected and their anti bacterial activity was predicted. The complete process and the number of selected compounds are shown in Figure 3.6.

Figure 3.6: Flowchart to select new compounds in AmicBase Database
[Flowchart: 3339 compounds -> Euclidean distance (Dij < 0.5) -> 659 compounds -> consensus hits -> 16 compounds -> applicability domain -> 3 compounds]

The final stage of applying QSAR models in database mining is to confirm the ability of the models to predict the biological activity of the compounds selected from the database. Three compounds were selected from the AmicBase database, and their predicted anti bacterial activities are shown in Table 3.11.

Table 3.11: Selected compounds with predicted MIC value

No. of compound in database   Compound                Predicted MIC
1515                          eugenol methyl ether    29.72 µg/mL
1893                          4-methyl guaiacol       37 µg/mL
2488                          m-cresol                37.75 µg/mL

Based on the computed values, the predicted anti bacterial activity (MIC) of these selected compounds was less than 64 µg/mL. This indicates that the QSAR models were able to select compounds that have the same properties as the active compounds in the training set. Their predicted MIC values also suggest that these compounds are able to inhibit the growth of E. coli at very low concentrations. Two of these compounds (eugenol methyl ether and m-cresol) were chosen as test compounds to determine their MIC values by laboratory analysis.
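The first two screening steps of this procedure, the distance cutoff and the consensus hits, can be sketched as below. The data layout (a dictionary of descriptor vectors keyed by compound ID) and the function names are assumptions made for the illustration; the cutoff of 0.5 is the value used in this work.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hits_within_cutoff(probes, database, cutoff=0.5):
    """IDs of database compounds within `cutoff` of at least one probe.

    probes: list of descriptor vectors of the active training compounds.
    database: dict mapping compound ID -> descriptor vector.
    """
    return {
        cid
        for cid, vec in database.items()
        if any(euclidean(vec, p) < cutoff for p in probes)
    }


def consensus_hits(hits_model_a, hits_model_b):
    """Compounds retrieved with the descriptor sets of both models."""
    return hits_model_a & hits_model_b
```

In use, `hits_within_cutoff` would be called once per model, each time with vectors built from that model's descriptors, and the two resulting sets passed to `consensus_hits`.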
3.6.2 Application of QSAR Models in AmicBase Database Mining (With Scaling)

Active compounds in the training set were used as probes to calculate the degree of similarity with compounds in the database. Euclidean distance was employed to measure this similarity, using the same set of descriptors that appeared in the QSAR model. Scaling was done to reduce the risk of overfitting [68]. The Euclidean distance between each probe and every compound in the database [67] was calculated using the descriptors that appeared in the QSAR model. The number of compounds in the database within the chosen similarity cutoff value (0.5 Euclidean distance units in multidimensional descriptor space) was 1138. This initial list was further refined by selecting consensus hits, i.e. those found using descriptors from both models, which reduced the number of candidates to 80 compounds. Subsequently, the 80 compounds were subjected to the applicability domain criterion, i.e. a similarity threshold, and the number of possible candidates was narrowed down to 8 compounds. The applicability domain is specific to each QSAR model. If the distance of a compound in the database from at least one of its neighbours in the training set exceeds this threshold, the prediction is considered unreliable and the compound is rejected. Figure 3.7 summarizes the steps used to find new lead molecules in the AmicBase database.

The rigorously validated QSAR models were used to predict the anti bacterial activity of new molecules, i.e. to screen a large group of molecules with unknown activity. Usually, the prediction model, built using parameters calculated for a well-defined training set, is applied to the unknown test set. If the training set is a sufficiently representative pattern of the system, it can be assumed that the introduction of new elements with an unknown property will not affect its stability, and that confident prediction can be attempted.
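Range scaling, applied here before the Euclidean distances are computed, can be sketched as a min-max transformation of each descriptor column. Which compounds define the minimum and maximum (training set only, or training set and database together) is not stated in the source, so the sketch simply scales whatever rows it is given; the function name is hypothetical.

```python
def range_scale(rows):
    """Scale each descriptor column to [0, 1] using its minimum and maximum.

    rows: list of descriptor vectors, one per compound.
    Constant columns are mapped to 0.0 to avoid division by zero.
    """
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
        for row in rows
    ]


# Two descriptors on very different scales
print(range_scale([[1, 100], [2, 300], [3, 200]]))
# [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```

After scaling, both descriptors contribute on the same footing to the Euclidean distance, whereas on the raw values the second column would dominate.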
Table 3.12 shows the structure and predicted activity of these 8 compounds.

Structure-activity relationship studies can provide information on the activity of new candidate molecules obtained from database mining [74]. Based on the QSAR calculation, the predicted anti bacterial activity (MIC) of these selected compounds was less than 64 µg/mL. This indicated that the QSAR models were capable of finding compounds that have the same properties as the active compounds in the training set. Their predicted MIC values also suggest that these compounds are able to inhibit the growth of E. coli at low concentrations.

Figure 3.7: Flowchart to select new compounds in AmicBase Database
[Flowchart: database (3339 compounds) and active compounds of the training set (28 compounds) -> generate descriptors -> range scaling -> Euclidean distance < 0.5 -> initial hits (1138 compounds) -> appear in both models -> consensus hits (80 compounds) -> applicability domain -> 8 compounds]

Table 3.12: Selected compounds with their predicted biological activity

No.    Compound                                                                            Predicted MIC
145    2-cis-6-cis-Farnesol                                                                0.047
185    (5-Isopropenyl-2-methyl-cyclohex-1-enyl)-acetic acid                                0.057
283    3,5a,9-Trimethyl-3a,5,5a,9b-tetrahydro-3H,4H-naphtho[1,2-b]furan-2,8-dione          0.038
444    Methyl 4-hydroxy-3-(3-methyl-but-2-enyl)-benzoate                                   0.037
675    Benzal acetylacetone                                                                0.040
814    2-Benzylideneglutaraldehyde                                                         0.051
1106   Dec-3-enoic acid                                                                    0.033
1201   Tridecanoic acid                                                                    0.034

3.7 Experimental Validation

The QSAR models predicted the minimum inhibition concentration of the molecules chosen from the database. The agar diffusion technique was used to confirm the predicted activity (MIC value) of these compounds. The concentrations of the test compounds were varied around the predicted range when measuring the activity. Ampicillin was used as the positive control, while distilled water was used as the negative control.
Table 3.13 shows the results of the laboratory analysis for the compounds that were selected without scaling.

Table 3.13: MIC value of selected compounds (without scaling) using agar diffusion method

No   Compound               MIC (µg/mL)   Zone diameter (mm)
1    eugenol methyl ether   >128          0.9
2    m-cresol               38-50         1.0

One of the selected compounds (eugenol methyl ether) was not active against E. coli BL21, but it might be active against other strains of E. coli or other gram negative bacteria, depending on the strain of bacteria that was used to measure the MIC values of the active compounds in the training set. In general, the QSAR model was able to predict the activity of hit compounds from database mining. For m-cresol, the difference between the MIC obtained using the agar diffusion method and the MIC predicted by the QSAR models was not large, and both values were less than 64 µg/mL. m-Cresol can therefore be classified as an active compound, because it was able to inhibit the growth of E. coli BL21 and attack its cell wall at low concentration. Figure 3.8 shows the inhibition zones of the hit compounds from database mining.

Figure 3.8: Inhibition zone of E. coli using: (a) m-cresol and (b) eugenol methyl ether

Basically, the QSAR models were able to choose compounds similar to the active compounds in the training set and to predict the anti bacterial activity of these compounds against E. coli. To verify the predictions by laboratory analysis, however, it would be better to use the same strain of E. coli that was used to determine the MIC values of the compounds in the training set.

Laboratory testing was also done for the hit compounds selected with scaling, to confirm the biological activity predicted using the QSAR models. Table 3.14 shows the biological activity determined using the agar diffusion method.
Table 3.14: MIC value of selected compounds (with scaling) using agar diffusion

No   Compound                                                                     MIC (µg/mL)   Zone diameter (mm)
1    (5-Isopropenyl-2-methyl-cyclohex-1-enyl)-acetic acid or linalyl acetate     47            0.25
2    2-cis-6-cis farnesol                                                         0.348         0.80
3    Methyl 4-hydroxy-3-(3-methyl-but-2-enyl)-benzoate or ethoxycinnamate        >62.5         No inhibition
4    Benzal acetylacetone                                                         0.049         3.40
5    Tridecanoic acid                                                             >62.5         No inhibition

Five of the eight selected compounds (using scaling) were used for laboratory testing. One of them, tridecanoic acid, was not tested because it was not soluble in water. From Table 3.14, three compounds (linalyl acetate, 2-cis-6-cis farnesol and benzal acetylacetone) were able to inhibit the growth of E. coli BL21 at low concentration. The QSAR model was able to predict the MIC values of these compounds accurately, and this was confirmed by laboratory testing. The inhibition zones of these selected agents are shown in Figure 3.9.

Figure 3.9: Inhibition zone of E. coli using selected compounds

3.8 Effects of Range Scaling and Applicability Domain to Search New Agents

The combination of range scaling and applicability domain in QSAR models applied to mining chemicals in a large database was found to be effective in accurately predicting MIC values. The active compounds in the training set were used, through the similarity concept, to search for active agents in the database. The descriptors included in the QSAR models span a wide range of values; scaling was therefore needed to reduce the dominance of descriptors with large values over the others.

The QSAR models developed using the MLRA and GAPLS techniques each have a specific applicability domain. This domain can be thought of as an area in which the model is applicable, i.e. where the prediction of activity will be reliable.
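An applicability domain check of this kind can be sketched as a distance threshold derived from the training set. The source does not give the threshold formula, so the form used below (mean plus z times the standard deviation of nearest-neighbour distances within the training set), the value of z, the use of a single nearest neighbour and the function names are all assumptions made for the illustration.

```python
import math
import statistics


def euclidean(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def domain_threshold(training, z=0.5):
    """Assumed threshold: mean + z * std of nearest-neighbour distances
    between training compounds (z is a hypothetical tuning parameter)."""
    nearest = [
        min(euclidean(a, b) for j, b in enumerate(training) if j != i)
        for i, a in enumerate(training)
    ]
    return statistics.mean(nearest) + z * statistics.pstdev(nearest)


def in_domain(candidate, training, threshold):
    """True if the candidate's nearest training compound lies within the
    threshold; candidates outside it are rejected as unreliable."""
    return min(euclidean(candidate, t) for t in training) <= threshold
```

A candidate from the database would be kept only if `in_domain` returns True for the descriptor set of each model; a stricter variant could require several nearest neighbours, rather than one, to lie within the threshold.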
Screened compounds with distances larger than the applicability domain threshold were rejected because these compounds were expected to be 'different' from the majority of the active compounds in the training set. Mining chemicals in a large database without applying criteria such as this will result in the discovery of new agents that fail the laboratory test.

CHAPTER 4

DEVELOPMENT OF QSAR MODELS AND DATABASE MINING FOR ANTI TUBERCULOSIS AGENTS

4.1 Introduction

This chapter presents the development of QSAR models using the anti tuberculosis data set, followed by their application in database mining. The results of the data analysis are presented and described in the next five sections. Section 4.2 presents the descriptors chosen to generate the QSAR models. Section 4.3 describes the statistical analysis of the QSAR models generated using the MLRA and GAPLS techniques. Validation of both QSAR models by predicting the anti tuberculosis activity of the compounds in the prediction set is presented in the next section. Section 4.5 describes the application of the QSAR models to discover new active agents against M. tuberculosis. Finally, the chapter presents the testing of the predicted activity of the new agents using the agar diffusion method.

4.2 Descriptors Generation and Objective Feature Selection

Numerical descriptors that encode topological, electronic and geometric features of each molecule were calculated using the descriptor generation routines in TSAR. Initially, 316 descriptors were generated for the compounds in the data set. Objective feature selection was carried out to remove descriptors that contain identical information or that are highly correlated with other descriptors. A descriptor was removed if it had the same value for over 90% of the training set compounds [21]. Furthermore, highly correlated descriptors provide nearly identical information, so only one of them is needed for model development.
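The first of these objective filters, the removal of descriptors that take the same value for over 90% of the training set compounds, can be sketched as follows. The function name and the dictionary layout are assumptions made for the illustration.

```python
from collections import Counter


def drop_near_constant(columns, max_fraction=0.90):
    """Keep descriptors whose most frequent value covers at most
    `max_fraction` of the compounds.

    columns: dict mapping descriptor name -> list of values (one per compound).
    """
    kept = []
    for name, values in columns.items():
        most_common_count = Counter(values).most_common(1)[0][1]
        if most_common_count / len(values) <= max_fraction:
            kept.append(name)
    return kept
```

A descriptor that is identical for exactly 90% of the compounds is kept under this reading, since the criterion removes only those above 90%.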
Pair-wise correlation was examined to remove descriptors that were highly correlated. Objective feature selection reduced the number of descriptors available for QSAR model development to 56, which are summarized in Table 4.1. The correlation matrix of these descriptors is presented in Table 4.2.

Table 4.1: List of selected descriptors and their statistical analysis

Descriptor                X         S.d       Descriptor                X         S.d
Molecular volume          253.18    85.93     Inertia moment 1 size     169.17    99.03
Inertia moment 2 size     727.15    596.04    Inertia moment 1 length   19.14     23.36
Inertia moment 2 length   3.83      0.78      Inertia moment 3 length   3.20      .064
Ellipsoidal volume        876.79    735.96    Verloop L (sub. 1)        3.65      1.09
Verloop L (sub. 2)        3.04      0.76      Verloop L (sub. 3)        2.52      0.61
Verloop B1 (sub. 1)       1.61      0.08      Verloop B1 (sub. 2)       1.51      0.25
Verloop B1 (sub. 3)       1.25      0.28      Verloop B2 (sub. 1)       1.79      0.36
Verloop B2 (sub. 2)       1.61      0.43      Verloop B3 (sub. 1)       2.06      0.63
Total dipole moment       3.53      1.90      Dipole moment X           0.67      2.58
Dipole moment Y           0.59      1.87      Dipole moment Z           0.18      2.30
Log P                     3.81      2.41      Total lipole              5.27      3.96
Lipole X component        2.40      4.49      Lipole Y component        -0.99     3.18
Lipole Z component        0.58      2.53      Kier ChiV6 (path)         3.10      2.38
Kier Chi3 (ring)          0.04      0.11      Kier Chi5 (ring)          0.06      0.06
Kier Chi6 (ring)          0.07      0.07      Kappa 2 (index)           6.73      3.21
Balaban topological       1.86      0.71      Vamp heat of formation    -122.60   62.82
Vamp LUMO                 0.73      1.07      Vamp HOMO                 -9.76     0.53
Vamp pol. XY              -1.22     2.09      Vamp pol. XZ              -0.02     2.10
Vamp pol. YZ              0.76      2.11      Vamp quadpole XX          -2.98     16.52
Vamp quadpole XY          -4.97     11.92     Vamp quadpole XZ          0.28      11.24
Vamp quadpole YY          1.55      12.74     Vamp quadpole YZ          -1.79     10.48
Vamp quadpole ZZ          1.43      11.28     Vamp octupole XXX         12.18     254.73
Vamp octupole XXY         -11.51    68.14     Vamp octupole XXZ         1.97      66.53
Vamp octupole YYX         -1.86     86.57     Vamp octupole YYY         24.58     145.93
Vamp octupole YYZ         3.02      50.11     Vamp octupole ZZX         0.32      64.91
Vamp octupole ZZY         1.78      45.03     Vamp octupole ZZZ         10.60     98.94
Vamp octupole XYZ         7.76      36.76     ADME H bond acceptors     2.75      1.16
ADME H bond donors        1.01      1.05      Cosmic total energy       -10.30    95.27

4.3 Development of QSAR Model Using MLRA Method

Mathematical structure-activity relationships quantify the connection between the structures and the properties of molecules. The relationships are expressed as mathematical models that allow properties to be predicted from structural parameters [26]. Regression analysis with multiple parameters has been used in QSAR studies on a series of analogues of the tuberculosis drug isonicotinic acid hydrazide [75]. The best QSAR model developed using the MLRA technique has r² of 0.77 and r²(CV) of 0.72. The equation is:

Y = -0.671 × Inertia moment 1 length + 16.389 × Verloop L (subst. 2) - 144.683 × Verloop B1 (subst. 3) - 10.412 × Dipole moment Y component + 8.853 × ADME H bond donors + 207.345        (4.1)

A summary of the model statistics is provided in Table 4.3. The MLRA method requires at least as many molecules as independent variables. However, to produce reliable results and to minimize collinearities and the possibility of chance correlations, the ratio of compounds to variables should typically be at least five to one [76]. When the number of independent variables is greater than the number of molecules, MLRA cannot be applied. Brief descriptions of the descriptors included in the QSAR model are given in Table 4.4.

QSAR models developed using the MLRA technique can be accepted if they have r²(CV) greater than 0.5 and r² greater than 0.6 [73].
In this case, the models are still capable of accurately predicting the activities of compounds that were not included in the model development process. A plot of experimental vs. predicted MIC is shown in Figure 4.1, while a plot of standard residual vs. predicted value (residual plot) is presented in Figure 4.2.

Table 4.3: Statistical output of MLRA model

Statistical output                   Value
r²                                   0.772
Cross validation r²(CV)              0.729
Residual sum of squares (RSS)        2.409
Predictive sum of squares (PRESS)    2.856

Table 4.4: Descriptors which were included in the MLRA model

Descriptor class       Symbol                        Explanation
Molecular attributes   Inertia moment 1 length       Indicates the strength and orientation behaviour of the molecule in an electrostatic field.
Verloop parameter      Verloop L (subst. 2)          The maximum length of the substituent along the axis of the bond between the first atom of the substituent and the parent molecule.
Verloop parameter      Verloop B1 (subst. 3)         The smallest distance from the axis of the attachment bond, measured perpendicularly to the edge of the substituent.
Molecular attributes   Dipole moment Y component     The moments are calculated using the substituent's point of attachment as the origin, with this bond placed along the x-axis.
ADME parameter         ADME H bond donor             Number of H bond donors (ADME: absorption, distribution, metabolism and excretion).

Figure 4.1: Plot of experimental vs. predicted MIC for MLRA model

Figure 4.2: Plot of predicted value vs. standard residual for MLRA model

4.4 Development of QSAR Model Using PLS Technique

PLS is a model development technique of particular interest in QSAR because, unlike MLRA, it can analyse data that are strongly collinear, noisy, or that have numerous X variables [72]. PLS is therefore able to investigate complex structure-activity problems, to analyse data in a more realistic way, and to interpret how molecular structure influences biological activity.
PLS can also be used quite effectively as a tool for interpreting QSAR models, and the information extracted is much more detailed than that obtained by simply considering the overall model equation. The Genetic Algorithm (GA) technique was used to select the descriptors for the second data set, which consisted of compounds with moderate to high activity against M. tuberculosis. Table 4.5 shows the statistics for each dimension of GA-PLS. PLS with three dimensions was selected because it has the lowest PRESS value and the highest r²(CV) value. The resulting QSAR model was stable and can be used for predicting compounds that were not included in the training set.

Table 4.5: Statistical output of GA-PLS for each dimension

Statistical output   PLS dim1   PLS dim2   PLS dim3   PLS dim4
r²                   0.818      0.819      0.819      0.820
r²(CV)               0.798      0.799      0.801      0.796
RSS                  0.780      0.776      0.777      0.772
PRESS                0.873      0.860      0.857      0.876

A plot of PRESS vs. number of PLS components is shown in Figure 4.3. It shows that a good QSAR model was selected at the lowest PRESS value and the highest r²(CV) value (PLS dim3). The highest r²(CV) value indicates that the PLS model has high predictive power for the activity of compounds not included in the training set. In the last component (PLS dim4), the PRESS value increased, which can mean that the PLS model with four dimensions was no longer stable.

Figure 4.3: Plot of PRESS vs. number of components

The combination of GA and PLS produced a model with an r² value of 0.81 and an r²(CV) value of 0.80 for PLS with three components. The statistical diagnostics of the model are shown in Table 4.6, a plot of experimental vs. predicted MIC is shown in Figure 4.4, and a plot of standard residual vs. predicted value is shown in Figure 4.5. Prior to the acceptance of a final model, PLS analysis was performed to ensure that the model was not overfit.
An overfit model can predict the activities of the training set but may not accurately predict the activity of unknown samples [77].

Table 4.6: Statistics of the PLS model

Parameter | Value
Fraction of variance (r²) | 0.819
Cross-validated r² (r²(CV)) | 0.801
Residual sum of squares (RSS) | 0.776
Predictive sum of squares (PRESS) | 0.857

Based on the summary of statistical tests for the PLS technique (Table 4.6), the high value of r²(CV) indicated a stable model that is able to predict the activity of compounds not included in the training set. Although the PLS model was slightly better, both models can be used as predictive models in database mining. Brief descriptions of the descriptors included in the PLS model are given in Table 4.7.

Figure 4.4: Plot of experimental vs. predicted MIC for the PLS model

Figure 4.5: Plot of predicted value vs. standardized residual for the PLS model

Table 4.7: Descriptors included in the PLS model

Descriptor class | Symbol | Explanation
Molecular attributes | Inertia moment 1 length | Indicates the strength and orientation behaviour of a molecule in an electrostatic field.
Verloop parameter | Verloop B1 (subst. 3) | The smallest distance from the axis of the attachment bond, measured perpendicularly to the edge of the substituent.
Molecular attributes | Dipole moment Y component | The moments are calculated using the substituent's point of attachment as the origin, with this bond placed along the x-axis.

4.5 Model Validation

The most widely used method to determine the stability of a predictive model is to analyse the influence of each of its elements upon the final model. Any model, even one with excellent goodness of fit and satisfactory predictions, may lack a real relationship between structural descriptors and activity. To guard against such chance correlations, a reliable validation procedure must be carried out.
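External validation, as applied in Section 4.5, compares predicted and experimental activities for compounds that were held out of model building. The following is a minimal sketch, assuming that r² is the squared Pearson correlation (the text does not give the formula). It uses only seven rows of Table 4.8 for illustration, so the result is not the r² reported for the full prediction set; the function names are illustrative.

```python
from math import sqrt


def squared_correlation(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cov ** 2 / (var_obs * var_pred)


def rmse(observed, predicted):
    """Root-mean-square error of the predictions."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(squared / len(observed))


# Rows transcribed from Table 4.8 (compounds 22, 23, 32, 42, 49, 56, 61).
experimental_mic = [96.0, 64.0, 32.0, 16.0, 8.0, 2.0, 0.25]
pls_predicted_mic = [98.4, 52.9, 33.1, 11.2, 7.4, 1.2, 0.2]

r2_external = squared_correlation(experimental_mic, pls_predicted_mic)
error = rmse(experimental_mic, pls_predicted_mic)
print(round(r2_external, 3), round(error, 2))
```

Reporting an error measure such as the RMSE alongside r² is a design choice of this sketch: a high correlation alone does not reveal a systematic offset between predicted and experimental values.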
The definitive validity of the model is examined by means of external validation, which evaluates how well the equation generalizes. Both models were validated by predicting the anti tuberculosis activity of 61 compounds that were excluded from the model development process (prediction set) (Table 4.8). The correlation coefficient (r²) between predicted and experimental values was also calculated. The high value of r² (0.93) indicated that both models were capable of predicting the activity of unknown compounds in the prediction set.

Table 4.8: Calculated MIC for compounds in the prediction set

Compound No. | Experimental MIC | Calculated MIC (PLS) | Calculated MIC (MLRA)
1 | 128 | 92.3 | 92.3
2 | 128 | 140.3 | 140.3
3 | 128 | 122.2 | 122.2
4 | 128 | 108.7 | 108.7
5 | 128 | 133.8 | 133.8
6 | 128 | 120.2 | 120.1
7 | 128 | 104.1 | 104.1
8 | 128 | 96.1 | 96.1
9 | 128 | 128.7 | 128.6
10 | 128 | 123.5 | 123.5
11 | 128 | 107.0 | 107.0
12 | 128 | 117.2 | 117.2
13 | 128 | 83.6 | 83.6
14 | 128 | 126.5 | 126.5
15 | 128 | 126.9 | 126.9
16 | 128 | 106.4 | 106.4
17 | 128 | 108.5 | 108.5
18 | 128 | 108.7 | 108.7
19 | 128 | 84.1 | 84.0
20 | 128 | 76.1 | 76.2
21 | 128 | 127.8 | 127.8
22 | 96.0 | 98.4 | 98.4
23 | 64.0 | 52.9 | 52.8
24 | 64.0 | 64.0 | 62.9
25 | 64.0 | 64.7 | 64.7
26 | 64.0 | 70.0 | 69.0
27 | 64.0 | 65.8 | 65.8
28 | 64.0 | 67.9 | 67.9
29 | 64.0 | 61.1 | 61.1
30 | 64.0 | 53.6 | 53.6
31 | 64.0 | 59.2 | 59.2
32 | 32.0 | 33.1 | 33.2
33 | 32.0 | 26.3 | 26.4
34 | 32.0 | 35.0 | 35.0
35 | 32.0 | 26.5 | 26.5
36 | 32.0 | 75.7 | 75.8
37 | 32.0 | 54.3 | 54.3
38 | 32.0 | 38.5 | 38.5
39 | 32.0 | 38.5 | 38.5
40 | 32.0 | 38.8 | 38.8
41 | 20.0 | 26.2 | 26.3
42 | 16.0 | 11.2 | 11.2
43 | 16.0 | 12.1 | 12.2
44 | 16.0 | 13.6 | 13.5
45 | 16.0 | 10.7 | 10.6
46 | 16.0 | 18.6 | 18.6
47 | 16.0 | 17.9 | 17.9
48 | 15.0 | 12.4 | 12.4
49 | 8.0 | 7.4 | 7.4
50 | 8.0 | 7.7 | 7.7
51 | 8.0 | 9.5 | 9.5
52 | 8.0 | 7.4 | 7.3
53 | 7.3 | 12.4 | 12.5
54 | 5.6 | 8.1 | 8.2
55 | 4.0 | 20.2 | 20.2
56 | 2.0 | 1.2 | 1.2
57 | 2.0 | 6.5 | 6.5
58 | 2.0 | 1.8 | 1.2
60 | 1.0 | 1.1 | 1.1
61 | 0.25 | 0.2 | 0.2

Based on the predicted MIC values for each model in Table 4.8, the combination of GA and PLS produced better predictions than MLRA, although the difference between the values predicted by MLRA and GA-PLS was small. The RSS and PRESS values of the MLRA model were higher than those of the GA-PLS model, indicating that the MLRA model has larger residuals (differences between actual and predicted values) and is not as good as the PLS model at predicting the activity of unknown compounds.

4.6 Application of QSAR Models to Database Mining

QSAR models can be used in database mining, i.e. finding molecular structures that are similar to the probe molecules, or even predicting the activities of the compounds in a database [74]. A QSAR model with a high degree of accuracy can be used as a means of screening compounds from existing databases for anti tuberculosis activity. Alternatively, the variables selected by QSAR optimization can be used in similarity searches to improve the performance of database mining methods. In this study, the effect of range scaling (applied before calculation of the Euclidean distances) on the molecular structure and properties of the new lead compounds found by database mining was also examined.

4.6.1 Application of QSAR Models in AmicBase Database Mining (Without Scaling)

The applicability of the QSAR models to mining chemicals in a database was tested. This stage began with the generation of descriptors for all compounds in the database, using the same set of descriptors that appeared in the QSAR model. Euclidean distances between the 32 probe compounds and the 3339 compounds in the database were calculated to measure their similarity. A distance of 0.5 units in the multidimensional descriptor space was chosen as the similarity cutoff value, resulting in 36 compounds. The initial list was further refined by selecting consensus hits, i.e. molecules found by both models, which reduced the list to 18 compounds.
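The similarity search and consensus step described above can be sketched as follows. The descriptor vectors and compound identifiers are hypothetical, and the rule that a compound counts as a hit when it lies within the cutoff of at least one probe is an assumption; the text does not state how the distances to the 32 probes were combined.

```python
from math import dist


def similarity_hits(probes, database, cutoff=0.5):
    """Return IDs of database compounds closer than `cutoff` to any probe."""
    hits = set()
    for compound_id, descriptors in database.items():
        if any(dist(descriptors, probe) < cutoff for probe in probes):
            hits.add(compound_id)
    return hits


# Hypothetical descriptor vectors; each QSAR model has its own descriptor set.
probes_model_a = [(1.0, 2.0, 0.5), (0.8, 1.9, 0.4)]
database_model_a = {
    "c1": (1.1, 2.1, 0.6),
    "c2": (3.0, 0.2, 1.5),
    "c3": (0.9, 1.7, 0.3),
}

probes_model_b = [(1.0, 0.5), (0.8, 0.4)]
database_model_b = {
    "c1": (1.2, 0.6),
    "c2": (0.9, 0.45),
    "c3": (2.5, 1.9),
}

hits_a = similarity_hits(probes_model_a, database_model_a)
hits_b = similarity_hits(probes_model_b, database_model_b)
consensus_hits = hits_a & hits_b  # compounds found by both models
print(sorted(hits_a), sorted(hits_b), sorted(consensus_hits))
# prints: ['c1', 'c3'] ['c1', 'c2'] ['c1']
```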
The anti tuberculosis activity of these 18 consensus hits was predicted using the two best QSAR models, each of which has its own applicability domain criterion. This step produced four compounds (AmicBase entries 579, 1792, 2399 and 2918), which are summarized in Table 4.9; the list of probe compounds is presented in Table 4.10.

Table 4.9: Selected compounds with their predicted anti tuberculosis activity

AmicBase entry No. | Name of compound | Predicted MIC (µg/mL)
579 | 3-Hexanol | 48.8
1792 | 2-Isobutyl-4,5-dimethyl-phenol | 24.2
2399 | Vanillin | 17.0
2918 | Cineole | 16.4

Table 4.10: List of probe compounds for database mining [32 probe compounds, numbered 1 to 32, listed with their structures and MIC values, which range from 62.5 to 0.25 µg/mL]

Based on Table 4.9, the QSAR models were able to search for and predict the biological activity of new lead compounds, all of which can be classified as active agents against M. tuberculosis. To validate these results, it was necessary to measure the biological activity of these agents experimentally. Three of these compounds (3-hexanol, vanillin and cineole) were chosen as test compounds against gram-positive bacteria (e.g. M. tuberculosis, Rhodococcus sp.) [47].

4.6.2 Application of QSAR Models in AmicBase Database Mining (With Scaling)

New compounds with high activity against M. tuberculosis can be found by applying the QSAR models to mining chemicals in a database, i.e. AmicBase [67], which consisted of 3339 chemicals. The similarity search was based on the Euclidean distances between active plant terpenoids and the compounds in the database, using the set of descriptors that appeared in the QSAR models. An active plant terpenoid has an MIC value of less than 64 µg/mL [21, 62], and there were 32 plant terpenoids with such activity in the training set (Table 4.10). The similarity cutoff value was set to 0.5 units and a total of 545 compounds were short-listed. When the descriptors in the QSAR models differ significantly in magnitude, the descriptors with large values dominate the Euclidean distance calculation; scaling was therefore needed to prevent one descriptor from dominating the others. Out of the 545 compounds selected as initial hits, 12 compounds appeared in both models. The anti tuberculosis activity of these 12 consensus hits was predicted using the two best QSAR models, each of which has its own applicability domain criterion. This step produced five compounds. Figure 4.6 summarizes the steps used to discover new lead compounds against M. tuberculosis.

Structures and predicted MIC values of these compounds are shown in Table 4.11. Based on the predicted MIC values, the five compounds can be classified as active compounds that are able to prevent the growth of M. tuberculosis. Laboratory testing (agar diffusion method) of the biological activity of the selected compounds was needed to prove this and to confirm the applicability of the QSAR models.
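The effect of range scaling on the similarity search can be illustrated with a small sketch. It assumes that range scaling means min-max scaling of each descriptor to [0, 1] over the pooled compounds; the exact formula and the set of compounds used to define the ranges are not given in this section, and all descriptor values below are hypothetical.

```python
from math import dist


def range_scale(rows):
    """Scale every descriptor column to [0, 1] via (x - min) / (max - min)."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    spans = [max(column) - min(column) for column in columns]
    scaled = []
    for row in rows:
        # A constant column carries no information, so it is mapped to 0.0.
        scaled.append(tuple(
            (value - low) / span if span else 0.0
            for value, low, span in zip(row, lows, spans)
        ))
    return scaled


# Hypothetical compounds: descriptor 0 is of order 100, descriptor 1 of order 1.
compounds = [
    (150.0, 0.20),  # probe (active compound)
    (152.0, 0.90),  # candidate A: close to the probe in descriptor 0 only
    (158.0, 0.25),  # candidate B: close to the probe in descriptor 1
    (100.0, 0.00),  # remaining compounds fix the descriptor ranges
    (200.0, 1.00),
]
probe, cand_a, cand_b = compounds[:3]
s_probe, s_a, s_b = range_scale(compounds)[:3]

raw_d_a, raw_d_b = dist(probe, cand_a), dist(probe, cand_b)
scaled_d_a, scaled_d_b = dist(s_probe, s_a), dist(s_probe, s_b)
print(round(raw_d_a, 2), round(raw_d_b, 2))        # descriptor 0 dominates
print(round(scaled_d_a, 2), round(scaled_d_b, 2))  # both descriptors count
```

In this example the unscaled distances rank candidate A as the nearer compound, whereas after scaling only candidate B falls within the 0.5 cutoff. This kind of change in the hit list is consistent with the different numbers of initial hits obtained with and without scaling.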
Figure 4.6: Steps to select new compounds against M. tuberculosis [Flowchart boxes, in order: Database (3339 compounds); Training set (61 compounds); 3319 compounds; Active compounds (32 compounds); Generate descriptors; Range scaling; Euclidean distance < 0.5; Initial hits (545 compounds); Appear in both models; Consensus hits (12 compounds); Applicability domain; 5 compounds]

Table 4.11: Selected compounds with their predicted MIC value

No. | Name of compound | Predicted MIC (µg/mL)
1061 | Geranyl geraniol | 5.7774
1437 | 8,9a-Dihydroxy-3,6,9-trimethylene-decahydro-azuleno[4,5-b]furan-2-one | 60.602
2181 | 2-(3,7,11-Trimethyl-dodecyl)-hydroquinone | 8.5275
2393 | 3,6,9a-Trimethyl-3a,4,5,6,6a,7,9a,9b-octahydro-3H-azuleno[4,5-b]furan-2-one | 56.737
3101 | 1-[(2-Hydroxy-ethyl)-methyl-amino]-dodecan-2-ol | 61.707

4.6.3 Effects of Applicability Domain to Search New Agents

QSAR models based on the mechanism-of-action approach tend to rely on expert judgment to define the domain. The applicability domain may be defined in terms of general properties, or on a much more detailed structural basis for specific toxicities. For a prediction to be valid, the compound must fall within the applicability domain of the model [78]. Both QSAR models (GA-PLS and MLRA) consisted of a set of descriptors, which were used to measure similarity. The applicability domain can be used to ensure that the prediction for a new compound is reliable. A compound whose distance is larger than the applicability domain is not similar in its properties to the active compounds in the QSAR model, and must therefore be rejected.

4.7 Experimental Validation

A QSAR model can be regarded as rigorously validated if it is able to predict the biological activity of unknown compounds in the prediction set and can be used to search for, and predict the anti tuberculosis activity of, new molecules in database mining [76]. M. tuberculosis is a pathogenic bacterium that is very dangerous to humans; activity testing was therefore carried out on Rhodococcus sp., which has similar properties to M. tuberculosis. Furthermore, Rhodococcus sp. was much easier to purchase and was readily available in the Department of Biology, Faculty of Science, Universiti Teknologi Malaysia.

Table 4.12 shows the biological activity of the selected compounds (without scaling) measured using the agar diffusion method. Ampicillin was used as the positive control and distilled water as the negative control [70]. Inhibition zones of active and inactive agents are shown in Figure 4.7. The test compounds did not inhibit Rhodococcus bacteria around the active concentration. The MIC values of these molecules were 128 µg/mL or higher, indicating that they cannot be classified as active agents, in contrast to the activity predicted for them by the QSAR models without scaling (Table 4.9).

Figure 4.7: Inhibition zone of active agents and inactive agents

Table 4.12: MIC value of selected compounds (without scaling) using agar diffusion method

No | Compound | MIC (µg/mL) | Diameter zone (mm)
1 | 3-Hexanol | >128 | No inhibition
2 | Vanillin | >128 | No inhibition
3 | Cineole | 128 | No inhibition

The agar diffusion method, with ampicillin as the positive control and distilled water as the negative control, was also used to determine the MIC values of the compounds selected using range-scaled descriptors. Three of the selected compounds, i.e. geranyl geraniol, 8,9a-dihydroxy-3,6,9-trimethylene-decahydro-azuleno[4,5-b]furan-2-one and 2-(3,7,11-trimethyl-dodecyl)-hydroquinone, were chosen as test compounds. Table 4.13 presents the minimum inhibition concentrations of these compounds. 2-(3,7,11-Trimethyl-dodecyl)-hydroquinone was not tested because it was not soluble in water, but two of these compounds, i.e. geranyl geraniol and leucomicine, were confirmed as active agents against M. tuberculosis with MIC values of less than 64 µg/mL.
This shows that applying the QSAR models with range scaling of the descriptors prior to the Euclidean distance calculation made it possible to find similar compounds with accurate biological activity predictions.

Table 4.13: MIC value of selected compounds (with scaling) using agar diffusion method

No | Compound | MIC (µg/mL) | Diameter zone (mm)
1 | Geranyl geraniol | 25 | 3.20
2 | 8,9a-Dihydroxy-3,6,9-trimethylene-decahydro-azuleno[4,5-b]furan-2-one (leucomicine) | 32 | 3.90
3 | 2-(3,7,11-Trimethyl-dodecyl)-hydroquinone (phenantren) | >128 | No inhibition

CHAPTER 5

CONCLUSIONS AND RECOMMENDATION

5.1 Introduction

This chapter presents the conclusions of this study. The first section provides the conclusions of the research findings in an attempt to answer the research objectives. The next section addresses the limitations of the study, and the last section presents potential areas for future research.

5.2 Conclusion

The main objective of this study was to develop QSAR models that correlate the biological activity of chemical compounds found in natural products with their chemical structure. The models were used in searching for new active agents against M. tuberculosis and E. coli by database mining.

The quantitative structure activity relationship (QSAR) approach can be used to develop models with high predictive power for the activity of compounds that are not included in the training set. Very good models that correlate the structural descriptors with anti bacterial and anti tuberculosis activity have been developed using genetic algorithm-partial least square (GA-PLS) and multiple linear regression analysis (MLRA). It was noted that better models (in terms of predictive ability) were produced by using the genetic algorithm (GA) to select the descriptors in the model development process.
QSAR models with a high degree of accuracy were applied to screening and searching for new active agents in a large database using the structural similarity concept, i.e. using Euclidean distance to measure similarity. The variables selected in the QSAR models (descriptors) were used to measure the similarity between the active compounds in the data set and the compounds in the database. The domination of descriptors with significantly larger values was eliminated by range scaling. The applicability domain was used in the last step of database mining to make sure that the predictions for new compounds are reliable. The applicability domain has a specific value for each model, and it can be used to reject compounds that are not similar to the active compounds.

The biological activities of the selected compounds were calculated using the QSAR models. The predicted values were later compared with experimental values. Using the agar diffusion method, it was confirmed that geranyl geraniol and leucomicine are new agents with high potential to inhibit the growth of Rhodococcus sp. (which has similar characteristics to M. tuberculosis) at low concentration. In addition, 2-cis-6-cis farnesol and pentanedione were confirmed as new anti bacterial agents from the database. Based on these results, the concept of QSAR can be used in the production of new drugs in the pharmaceutical industries.

5.3 Limitation of the Study

Some limitations and weaknesses were found during the course of this research. Structure entry and molecular modeling is the first step in the QSAR approach. A long time was required to optimize the energy of the molecular structures in the data set and to generate some of the electrostatic descriptors. The same can be said about the feature selection process, especially the objective feature selection. The main problem is how to select the set of descriptors that should be included in the model development process and how to reject the poor descriptors.
Due to the large number of descriptors that can be generated, this step requires a lot of judgment from the researcher. Obviously, this step cannot simply be automated.

5.4 Future Research Recommendation

Future studies on QSAR models and database mining could emphasize the development of new methodology to improve model accuracy, since the quantitative agreement between actual and predicted biological activity (i.e. anti bacterial, anti tuberculosis) is not excellent for all compounds. In principle, approaches other than MLRA or PLS, such as KNN or any other rigorous model building technique, could also be adopted for this kind of study.

The main concept in the database mining process is that structurally similar compounds have similar biological activities. In this study, the degree of similarity was determined using the Euclidean distance calculated from the descriptors that appeared in the QSAR models. Other techniques of similarity calculation can be applied in future research; one popular similarity measure is the Tanimoto coefficient [79].

For future studies on the specific application to anti bacterial and anti tuberculosis agents, we propose examining derivatives of the chosen compounds identified by database mining and developing QSAR models for these compounds. It is hoped that more active agents can be discovered from these derivatives.

REFERENCES

1. Said, I. M. Sebatian Semula Jadi daripada Tumbuhan: Potensi, Prospek dan Kenyataan. Bangi: Penerbit Universiti Kebangsaan Malaysia. 1995.
2. Kawai, T., Kinoshita, K. and Takahashi, K. Anti-emetic Principles of Magnolia obovata and Zingiber officinale rhizome. Planta Med. 1994. 60: 17-20.
3. Sirat, H. M., Hong, L. F. and Khaw, S. H. Chemical Composition of the Essential Oil of the Fruits of Amomum testaceum Ridl. J. Essent. Oil Res. 2001. 13: 86.
4. Besalu, E., Ponec, R., Vicente, J. Virtual Generation of Agents against Mycobacterium tuberculosis. A QSAR Study. Mol. Diversity. 2003. 6: 107-120.
5. Parvu, L.
QSAR-a Piece of Drug Design. J. Cell. Mol. Med. 2003. 7(3): 333-335.
6. Dunn III, W. J. Quantitative Structure Activity Relationships in Chemical and Biochemical System. Chemom. Intell. Lab. Syst. 1989. 6: 181-190.
7. Gozalbes, R., Doucet, J. P., Derouin, F. Application of Topological Descriptors in QSAR and Drug Design: History and New Trends. Current Drug Targets: Infect. Disord. 2002. 2: 93-102.
8. Selassie, C. D. History of Quantitative Structure Activity Relationship. Burger's Medicinal Chemistry and Drug Discovery. 6th ed. New York: Wiley Interscience. 2003.
9. Bevan, D. R. QSAR and Drug Design. Netsci Home Page, http://www.netsci.org/science/compchem/fetaure12.html (accessed 5th January 2004).
10. http://www.tdx.cesca.es/tesis.UDG/available/tdx-1210104-133736/tags2de4.pdf (accessed 26th November 2004).
11. Stuper, A. J., Brugger, W. E., Jurs, P. C. Computer Assisted Studies of Chemical Structure and Biological Function. New York: Wiley Interscience. 1979.
12. Gasteiger, J., Engel, T. Chemoinformatics. Weinheim: Wiley-VCH GmbH and Co. KGaA. 2003.
13. Kovatcheva, A., Golbraikh, A., Oloff, S., Xiao, Y. D., Zheng, W., Wolschan, P., Buchbauer, G., and Tropsha, A. Combinatorial QSAR of Ambergris Fragrance Compounds. J. Chem. Inf. Comput. Sci. 2004. 44: 582-595.
14. Sutherland, J. J., Weaver, D. F. Development of Quantitative Structure-Activity Relationships and Classification Models for Anticonvulsant Activity of Hydantoin Analogues. J. Chem. Inf. Comput. Sci. 2003. 43: 1028-1036.
15. Mattioni, B. E. The Development of QSAR Model for Physical Property and Biological Activity Prediction of Organic Compounds. Ph.D. Thesis. Pennsylvania State University; 2003.
16. Mishra, R. K. Getting Discriminant Functions of Antibacterial Activity from Physicochemical and Topological Parameter. J. Chem. Inf. Comput. Sci. 2001. 41: 387-393.
17. Gasteiger, J. Handbook of Chemoinformatics. Vol. 3. Weinheim: Wiley-VCH Verlag GmbH and Co. 2003.
18. Kier, L. B., Hall, L. H. The Meaning of Molecular Connectivity: a Biomolecular Accessibility Model. Croat. Chem. Acta. 2003. 75(2): 371-382.
19. Liu, S., Cao, C., Li, Z. Approach to Estimation and Prediction for Normal Boiling Point (NBP) of Alkanes Based on a Novel Molecular Distance-Edge (MDE) Vector, λ. J. Chem. Inf. Comput. Sci. 1998. 38: 387-394.
20. Waterbeemd, H. V. D. Chemometric Methods in Molecular Design. Weinheim: Wiley-VCH Verlag GmbH and Co. 1995.
21. Wessel, M. D. Computer-Assisted Development of Quantitative Structure Property Relationships and Design of Feature Selection Routines. Ph.D. Thesis. Pennsylvania State University; 1997.
22. Cho, D. H., Lee, S. K., Kim, B. T., No, K. T. Quantitative Structure-Activity Relationship (QSAR) Study of New Fluorovinyloxyacetamides. Bull. Korean Chem. Soc. 2001. 22(4): 388-394.
23. Shen, M., LeTiran, A., Xiao, Y., Golbraikh, A., Kohn, H., Tropsha, A. Quantitative Structure Activity Relationship Analysis of Functionalized Amino Acid Anticonvulsant Agents Using k Nearest Neighbor and Simulated Annealing PLS Methods. J. Med. Chem. 2002. 45: 2811-2823.
24. Tropsha, A., Zheng, W. Identification of the Descriptor Pharmacophores Using Variable Selection QSAR: Applications to Database Mining. Curr. Pharm. Des. 2001. 7: 599-612.
25. Sutter, J. M., Kalivas, J. H., Jurs, P. C. Automated Descriptors Selection for Quantitative Structure Activity Relationship Using Generalized Simulated Annealing. J. Chem. Inf. Comput. Sci. 1995. 35: 77-84.
26. Svetnik, V., Liaw, A., Tong, C., Culberson, J. C., Sheridan, R. P., Feuston, B. P. Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003. 43: 1947-1958.
27. Dianati, M., Song, I., and Treiber, M. An Introduction to Genetic Algorithm and Evolution Strategies. http://www.swen.uwaterloo.com/~mdianati/articles.pdf (accessed 2nd May 2004).
28. Daren, Z. QSPR Studies of PCBs by the Combination of Genetic Algorithm and PLS Analysis. J. Comp. Chem. 2001. 25: 197-204.
29. Leardi, R. Genetic Algorithm in Chemometrics and Chemistry: a Review. J. Chemom. 2001. 15: 559-56.
30. Hawkins, D. M., Basak, S. C., and Shi, X. QSAR with Few Compounds and Many Features. J. Chem. Inf. Comput. Sci. 2001. 41: 663-670.
31. Srivastava, M. S. Methods of Multivariate Statistics. New York: John Wiley & Sons, Inc. 2002.
32. Kutner, M. M., Nachtsheim, C. J., Neter, J. Applied Linear Regression Models. New York: McGraw-Hill. 2004.
33. Oxford Molecular. TSAR 3.3 for Windows Reference Guide. UK: Oxford Molecular, Ltd. 2000.
34. Accelrys Home Page, http://www.accelrys.com/tool/QSAR (accessed 6th January 2004).
35. Tobias, R. D. An Introduction to Partial Least Squares Regression. http://support.sas.com/techsup/technote/ts509.pdf (accessed 14th June 2004).
36. Beebe, K. R., Pell, R. J., Seasholtz, M. B. Chemometrics, a Practical Guide. New York: Wiley Interscience. 1998.
37. Li, M. J., Jiang, C., Li, M. Z., You, T. P. QSAR Studies of 20(S)-Camptothecin Analogues as Antitumor Agents. J. Mol. Struct: THEOCHEM. 2005. 723: 165-170.
38. Ragno, R., Marshall, G. R., Santo, R. D., Costu, R., Massa, S., Rompei, R., Artico, M. Antimycobacterial Pyrroles: Synthesis, Anti Mycobacterium tuberculosis Activity and QSAR Studies. J. Bioorg. Med. Chem. 2000. 8: 1423-1432.
39. Montanari, M. L. C., Beezer, A. E., Montanari, C. A. and Verloso, D. P. QSAR Based on Biological Microcalorimetry. J. Med. Chem. 2000. 43: 3448-3452.
40. Wang, X., Yin, C., Wang, L. Structure Activity Relationship and Response Surface Analysis of Nitro Aromatics Toxicity to the Yeast (Saccharomyces cerevisiae). Chemosphere. 2002. 46: 1045-1051.
41. Xu, S., Nirmalakhandan, N. Use of QSAR Models in Predicting Joint Effect in Multi-Component Mixtures of Organic Chemicals. J. Wat. Res. 1998. 32: 2391-2399.
42. Gramatica, P., Pilutti, P., and Papa, E. Validated QSAR Prediction of OH Tropospheric Degradation of VOCs, Splitting Into Training-Test Set and Consensus Modeling. J. Chem. Inf. Comput. Sci. 2004. 44: 1794-1802.
43. Waller, C. H. A Comparative QSAR Study Using CoMFA, HQSAR and FRED/SKEYS Paradigms for Estrogen Receptor Binding Affinities of Structurally Diverse Compounds. J. Chem. Inf. Comput. Sci. 2004. 44: 758-765.
44. Du Toit, K., Elgorashi, E. E., Malan, S. F., Drewes, S. E., Van Staden, J., Crouch, N. R., Mulholland, D. A. Anti-Inflammatory Activity and QSAR Studies of Compounds Isolated from Hyacinthaceae Species and Tachiadenus longiflorus Griseb. (Gentianaceae). J. Bioorg. Med. Chem. 2005. 13: 2561-2568.
45. Tong, W., Xie, Q., Hong, H., Shi, L., Fang, H., Perkins, R. Assessment of Prediction Confidence and Domain Extrapolation of Two Structure Activity Relationship Models for Predicting Estrogen Receptor Binding Activity. Environ. Health Perspectives. August 2004. 112(2): 1249-1254.
46. Cruez, A. F. TB Returns as Number One Infectious Killer Disease. Prime News, Tuesday, April 5, 2005.
47. Henderson, B., Wilson, M., McNab, R., Lax, A. J. Cellular Microbiology. New York: John Wiley & Sons. 1999.
48. http://www.mckinley.edu/health -info/dis-cond/tb/TB.html (accessed 7th October 2004).
49. Wang, X., Dong, Y., Wang, L. and Han, S. Acute Toxicity of Substituted Phenols to Rana japonica Tadpoles and Mechanism-Based Quantitative Structure Activity Relationship (QSAR) Study. Chemosphere. 2001. 44: 447-455.
50. European Committee for Antimicrobial Susceptibility Testing (EUCAST) of the European Society of Clinical Microbiology and Infectious Diseases (ESCMID). Determination of Minimum Inhibition Concentrations (MICs) of Antibacterial Agents by Agar Dilution. J. Clin. Microb. Infect. 2000. 6: 509-515.
51. Collins, C. H., Lyne, P. M., and Grange, J. M. Microbial Methods. London: Butterworth-Heinemann court, Jordan Hill. 1989.
52. Tabatabaei, R. R., Nasirian, A. Isolation, Identification and Antimicrobial Resistance Patterns of E. coli Isolated from Chicken Flocks. J. Pharmacol. Exp. Ther. 2003. 2: 39-42.
53. http://cwx.prenhall.com/horton/medialib/media_portfolio/text (accessed 29th July 2005).
54. Schneider, G. Neural Networks are Useful Tools for Drug Design. Neural Network. 2000. 13: 15-16.
55. Hoffman, B. T., Kopajtic, T., Katz, J. L., Newman, H. 2D QSAR Modeling and Preliminary Database Searching for Dopamine Transporter Inhibitors Using Genetic Algorithm Variable Selection of Molconn Z Descriptors. J. Med. Chem. 2000. 43: 4151-4159.
56. Shen, M., Beguin, C., Golbraikh, A., Stables, J. P., Kohn, H., Tropsha, A. Application of QSAR Models to Database Mining; Identification and Experimental Validation of Novel Anticonvulsant Compounds. J. Med. Chem. 2004. 47: 2356-2364.
57. Shen, M. Implementation and Application of Machine Learning Algorithm in Computer-Assisted Drug Design. Ph.D. Thesis. University of North Carolina; 2003.
58. Fang, X., Shao, L., Zhang, H., Wang, S. Web-Based Tools for Mining the NCI Databases for Anticancer Drug Discovery. J. Chem. Inf. Comput. Sci. 2004. 44: 249-257.
59. Cheng, L. L. Kandungan Kimia dan Bioaktiviti daripada Spesies Premna, Vitex, Lantana dan Macaranga. M.Sc. Tesis. Universiti Teknologi Malaysia; 2002.
60. Ramalu, J. C. D. Kajian Sebatian Semula Jadi daripada Empat Spesies Piper. M.Sc. Tesis. Universiti Teknologi Malaysia; 1999.
61. Jamil, S. Komponen Semula Jadi Bagi Spesies Curcuma dan Boesenbergia (Zingiberaceae). M.Sc. Tesis. Universiti Teknologi Malaysia; 1997.
62. Cantrell, C. L., Franzblau, S. G., and Fischer, N. H. Antimycobacterial Plant Terpenoids. Planta Med. 2001. 67: 685-694.
63. Senese, C. L., Hopfinger, A. J. A Simple Clustering Technique to Improve QSAR Model Selection and Predictivity: Application to a Receptor Independent 4D-QSAR Analysis of Cyclic Urea Derived Inhibitors of HIV-1 Protease. J. Chem. Inf. Comput. Sci. 2003. 43: 2180-2193.
64. Leardi, R. and Gonzalez, A. L. Genetic Algorithms Applied to Feature Selection in PLS Regression: How and When to Use Them. Chemom. Intell. Lab. Syst. 1998. 41: 195-20.
65. Bourin, N., Mozziconacci, J. C., Arnoult, E., Chavatte, P., Marot, C., Allory, L. M. 2D QSAR Consensus Prediction for High-Throughput Virtual Screening: an Application to COX-2 Inhibition Modeling and Screening of the NCI Database. J. Chem. Inf. Comput. Sci. 2004. 44: 276-285.
66. Braga, S. F., and Galvao, D. S. Benzo[c]quinolizin-3-ones Theoretical Investigation: SAR Analysis and Application to Non Tested Compounds. J. Chem. Inf. Comput. Sci. 2004. 44: 1987-1999.
67. Review Science. Amicbase: Database on Antimicrobials. http://www.reviewscience.com/Compounds.htm (accessed 8th October 2004).
68. Mazzatorta, P., and Benfenati, E. The Importance of Scaling in Data Mining for Toxicity Prediction. J. Chem. Inf. Comput. Sci. 2002. 42(5): 1250-1255.
69. Zheng, W., and Tropsha, A. Novel Variable Selection Quantitative Structure-Property Relationship Approach Based on the k-Nearest Neighbors Principle. J. Chem. Inf. Comput. Sci. 2000. 40: 185-194.
70. Madigan, M. T., Martinko, J. M., Parker, J. Brock Biology of Microorganism. 9th ed. Upper Saddle River, N. J.: Prentice Hall. 2000.
71. Lalitha, M. K. Manual on Antimicrobial Susceptibility Testing. Department of Microbiology, Christian Medical College, Vellore, Tamil Nadu. http://www.arches.uga.edu/~lace52/procedure.html (accessed October 2004).
72. Tang, K., Li, T. Comparison of Different Partial Least Squares Methods in QSAR. Anal. Chim. Acta. 2003. 476: 75-92.
73. Golbraikh, A., Tropsha, A. Predictive QSAR Modeling Diversity Sampling of Experimental Datasets for the Training and Test Set Selection. J. Comput-Aided Mol Des. 2002. 5: 231-243.
74. Gillet, V. J., Wild, D. J., Willett, P., Bradshaw, J. Similarity and Dissimilarity Methods for Processing Chemical Structure Databases. The Computer Journal. 1998. 8: 547-558.
75. Bagchi, M. C., Maiti, B. C., Bose, S. QSAR of Anti Tuberculosis Drugs of INH Type Using Graphical Invariants. J. Mol. Struct: THEOCHEM. 2004. 679: 179-186.
76. Stanton, D. T.
On the Physical Interpretation of QSAR Models. J. Chem. Inf. Comput. Sci. 2003. 43: 1423-1433.
77. Rogers, D., Hopfinger, A. J. Application of Genetic Function Approximation to Quantitative Structure Activity Relationship and Quantitative Structure Property Relationship. J. Chem. Inf. Comput. Sci. 1994. 34: 854-866.
78. Cronin, M. Opportunities for Computer Aided Prediction of Toxicity in Drug Discovery. A report. Computational Chemistry, School of Pharmacy and Chemistry. Liverpool: John Moores University. 2002.
79. Martin, Y. C., Kofron, J. L., Traphagen, L. M. Do Structurally Similar Molecules Have Similar Biological Activity? J. Med. Chem. 2002. 45: 4350-4358.

Appendix A: List of Compounds in the First Data Set [Table of 56 compounds, numbered 1 to 56, listed with their structures and MIC values, which range from 0.1 to 0.02 µg/mL]

Appendix B: List of Compounds in the Second Data Set [Table of 122 compounds, numbered 1 to 122, listed with their structures and MIC values, which range from 128 to 0.25 µg/mL]

Appendix C: Presentation and Publication

Parts of this work have been presented at the following symposia:

1. Mohamed Noor Hasan, Neni Frimayanti. "Development of QSAR Models for Predicting Anti Bacterial Activity of Compounds in Natural Products". Proceedings of the 17th Malaysian National Symposium on Analytical Chemistry, Pahang, Malaysia, 24-26 August 2004, pp 340-342.
2. Mohamed Noor Hasan, Neni Frimayanti. "Development of QSAR Models for Predicting Anti Tuberculosis Activity of Plant Terpenoids". Proceedings of the Symposium on Science and Mathematics, Johor, Malaysia, 14-15 December 2004, pp 14-15.

The following article has been based on parts of this thesis:

1. Mohamed Noor Hasan, Neni Frimayanti. "Development of QSAR Models for Predicting Anti Bacterial Activity of Compounds in Natural Products". Malaysian Journal of Analytical Sciences, In Press.