TABLE OF CONTENTS

CHAPTER  TITLE  PAGE

DECLARATION  ii
DEDICATION  iii
ACKNOWLEDGEMENT  iv
ABSTRACT  v
ABSTRAK  vi
TABLE OF CONTENTS  vii
LIST OF TABLES  xi
LIST OF FIGURES  xiii
LIST OF SYMBOLS  xv
LIST OF ACRONYMS  xvii
LIST OF APPENDICES  xix

1  INTRODUCTION
   1.1  Introduction  1
   1.2  Quantitative Structure Activity Relationship (QSAR)  2
   1.3  History and Development of QSAR  4
        1.3.1  Data Set  7
        1.3.2  Descriptors  8
               1.3.2.1  Topological Descriptors  10
               1.3.2.2  Electronic Descriptors  11
               1.3.2.3  Geometric Descriptors  11
   1.4  Feature Selection  12
        1.4.1  Genetic Algorithm (GA)  14
   1.5  Tools and Techniques of QSAR  14
        1.5.1  Multiple Linear Regression Analysis  15
        1.5.2  Partial Least Squares  17
   1.6  Applications of QSAR  20
   1.7  Overview of Multidrug Resistance Mycobacterium tuberculosis  22
        1.7.1  Mycobacterium tuberculosis  23
        1.7.2  How Does Tuberculosis Spread  24
   1.8  Minimum Inhibition Concentration (MIC)  25
        1.8.1  Escherichia coli  26
   1.9  Database Mining  27
   1.10  Research Scope  28
   1.11  Research Objectives  28
   1.12  Significance of Research  29
   1.13  Layout of the Thesis  29

2  RESEARCH METHODOLOGY
   2.1  Introduction  31
   2.2  Data Set  32
   2.3  Structure Entry and Molecular Modeling  33
   2.4  Descriptor Generation  33
   2.5  Feature Selection  35
        2.5.1  Objective Feature Selection  35
        2.5.2  Subjective Feature Selection  36
   2.6  Model Development  37
        2.6.1  Multiple Linear Regression Analysis  38
        2.6.2  Partial Least Squares  39
   2.7  Model Validation  40
   2.8  Application of QSAR Models to Database Mining  42
        2.8.1  Molecular Descriptors and Similarity Calculation  44
        2.8.2  Applicability Domain of QSAR Models  45
        2.8.3  Biological Activity Predicted Using QSAR Models  46
   2.9  Laboratory Testing  46
        2.9.1  Material and Method of Agar Diffusion  47

3  DEVELOPMENT OF QSAR MODELS AND DATABASE MINING FOR ANTI BACTERIAL AGENTS
   3.1  Introduction  49
   3.2  Selection of Descriptors and Feature Selection  49
   3.3  Model Development Using MLRA Method  54
   3.4  Model Development Using PLS Method  57
   3.5  Model Validation  61
   3.6  Application of QSAR Models to Database Mining  63
        3.6.1  Application of QSAR Models in AmicBase Database Mining (without Scaling)  66
        3.6.2  Application of QSAR Models in AmicBase Database Mining (with Scaling)  68
   3.7  Experimental Validation  71
   3.8  Effects of Range Scaling and Applicability Domain to Search New Agents  74

4  DEVELOPMENT OF QSAR MODELS AND DATABASE MINING FOR ANTI TUBERCULOSIS AGENTS
   4.1  Introduction  76
   4.2  Descriptors Generation and Objective Feature Selection  76
   4.3  Development of QSAR Models by Using MLRA Method  80
   4.4  Development of QSAR Models by Using PLS Technique  83
   4.5  Model Validation  86
   4.6  Application of QSAR Models to Database Mining  89
        4.6.1  Application of QSAR Models in AmicBase Database Mining (without Scaling)  90
        4.6.2  Application of QSAR Models in AmicBase Database Mining (with Scaling)  93
        4.6.3  Effects of Applicability Domain to Search New Agents  96
   4.7  Experimental Validation  96

5  CONCLUSIONS AND RECOMMENDATION
   5.1  Introduction  100
   5.2  Conclusion  100
   5.3  Limitation of the Study  101
   5.4  Future Research Recommendation  102

REFERENCES  103
APPENDIX A  110
APPENDIX B  114
APPENDIX C  121


LIST OF TABLES

TABLE NO.  TITLE  PAGE

2.1  Type of descriptors in TSAR  34
3.1  List of selected descriptors and their statistical analysis  50
3.2  Correlation matrix of descriptors  52
3.3  Statistical output of MLRA model  54
3.4  Descriptors which were included in the QSAR model by using MLRA  55
3.5  Statistical analysis of MLRA method  57
3.6  Statistical output of GA-PLS for each dimension  58
3.7  Statistical output of PLS model  59
3.8  Descriptors which were included in the QSAR model by using PLS  60
3.9  Calculated MIC for compounds in the prediction set  62
3.10  List of probe compounds for database mining  64
3.11  Selected compounds with predicted MIC value  67
3.12  Selected compounds with their predicted biological activity  70
3.13  MIC value of selected compounds (without scaling) using agar diffusion method  71
3.14  MIC value of selected compounds (with scaling) using agar diffusion  73
4.1  List of selected descriptors and their statistical analysis  77
4.2  Correlation matrix of descriptors  78
4.3  Statistical output of MLRA model  81
4.4  Descriptors which were included in the MLRA model  81
4.5  Statistical plot output of GA-PLS for each dimension  83
4.6  Statistics of the PLS model  84
4.7  Descriptors which were included in the PLS model  86
4.8  Calculated MIC for compounds in the prediction set  87
4.9  Selected compounds with their predicted anti tuberculosis activity  90
4.10  List of probe compounds for database mining  91
4.11  Selected compounds with their predicted MIC value  95
4.12  MIC value of selected compounds (without scaling) using agar diffusion method  97
4.13  MIC value of selected compounds (with scaling) using agar diffusion method  98


LIST OF FIGURES

FIGURE NO.  TITLE  PAGE

1.1  The general QSAR problem  9
1.2  Flow diagram for the genetic algorithm (GA)  15
1.3  Illustration of the difference between PCR and PLS  19
1.4  Structure of E. coli  26
2.1  General QSAR methodology  32
2.2  Genetic algorithm process  38
2.3  Flowchart for the general model building process in QSAR studies  41
2.4  Flowchart of database mining that employs predictive QSAR models  43
3.1  Plot of experimental vs. predicted MIC for MLRA model  56
3.2  Plot of predicted value vs. standard residual for MLRA model  56
3.3  Plot of PRESS vs. no. of components  58
3.4  Plot of experimental vs. predicted MIC for PLS model  59
3.5  Plot of predicted value vs. standard residual for PLS model  61
3.6  Flowchart to select new compounds in AmicBase database  66
3.7  Flowchart to select new compounds in AmicBase database  69
3.8  Inhibition zone of E. coli using (a) m-cresol and (b) eugenol methyl ether  72
3.9  Inhibition zone of E. coli using selective compounds  74
4.1  Plot of experimental value vs. predicted MIC for MLRA  82
4.2  Plot of predicted value vs. standard residual for MLRA model  82
4.3  Plot of PRESS vs. no. of components  84
4.4  Plot of experimental vs. predicted MIC for PLS model  85
4.5  Plot of predicted value vs. standard residual for PLS model  85
4.6  Step to select new compounds against M. tuberculosis  94
4.7  Inhibition zone of active and inactive agents  97


LIST OF SYMBOLS

a, b, c, d  -  Regression coefficient
b̂, b̃  -  Regression vector
b̃̂  -  The estimate of b̃
ĉ  -  Activity of unknown compounds
DT  -  Applicability domain
Es  -  Steric component
ρ  -  Proportionality reaction constant
σ  -  Electronic properties of aromatic compounds, standard deviation of Euclidean distance
π  -  Hydrophobicity of substituents
px  -  Partition coefficients of derivative molecule
pH  -  Partition coefficients of parent molecule
r²  -  How closely equation fits the data
r² (CV)  -  Predictive power of the model
runk  -  Matrix of the known descriptor
χ  -  Molecular connectivity indices
X̄  -  Mean value
y  -  Activity observed value
ȳ  -  Mean value, average Euclidean distance
ŷ  -  Predicted value
C  -  Concentration of molecule
D  -  Distance matrix
F  -  Degrees of freedom
R  -  Matrix of descriptor
RT  -  Pseudo-inverse of matrix descriptor
S  -  A diagonal matrix, standard error of the regression model
s.d  -  Standard deviation
U  -  Score matrix from PCA
V  -  Matrix containing the loading
W  -  Wiener index
Z  -  An arbitrary parameter to control the significance level


LIST OF ACRONYMS

BC3  -  Benzo[c]quinolizin-3-ones
CADD  -  Computer assisted drug design
CAMD  -  Computer assisted molecular design
DAT  -  Dopamine transporter
EC50  -  Effect concentration
ED  -  Euclidean distance
EDCs  -  Endocrine disrupting chemicals
EIEC  -  Enteroinvasive E. coli
EPEC  -  Enteropathogenic E. coli
ETEC  -  Enterotoxigenic E. coli
GA  -  Genetic algorithm
GA-MLRA  -  Genetic algorithm-multiple linear regression analysis
GAPLS  -  Genetic algorithm partial least squares
GSA  -  Genetic simulated annealing
HOMO  -  Highest occupied molecular orbital
IC50  -  Inhibition concentration
KNN  -  K-nearest neighbor
LDA  -  Linear discriminant analysis
LFER  -  Linear free energy relationship
LUMO  -  Lowest unoccupied molecular orbital
MDR  -  Multi drug resistant
MIC  -  Minimum inhibition concentration
MLRA  -  Multiple linear regression analysis
MLR  -  Multivariate linear regression
MRA  -  Multiple regression analysis
NCI  -  National Cancer Institute
PCA  -  Principal component analysis
PCR  -  Principal component regression
PLS  -  Partial least squares
PRESS  -  Predictive sum of squares
QSAR  -  Quantitative structure activity relationship
QSPR  -  Quantitative structure property relationship
RSS  -  Residual sum of squares
TCH  -  Thiophene-2-carboxylic acid hydrazide
SSR  -  Sum of squares
SST  -  Total sum of squares
VTEC  -  Verotoxigenic E. coli
VOCs  -  Volatile organic compounds