IN SILICO TARGET PREDICTION BY TRAINING NAÏVE BAYESIAN MODELS ON CHEMOGENOMICS DATABASES Nidhi Submitted to the faculty of the Chemical Informatics Graduate Program in partial fulfillment of the requirements for the degree Master of Science in the School of Informatics, Indiana University December 2005 Accepted by the Faculty of Indiana University, in partial fulfillment of the requirements for the degree of Master of Science ________________________ Mahesh Merchant, Ph.D., Chair _______________________ Jeremy Jenkins, Ph.D. _______________________ Douglas Perry, Ph.D. ii ACKNOWLEDGEMENTS I extend my gratitude and appreciation to the people who made this master’s thesis possible. My foremost thanks go to my research advisor, Dr. Jeremy Jenkins. I want to thank him for his insights and suggestions that helped me have a deeper understanding of the subject matter. I am grateful to Dr. Mahesh Merchant for the support and encouragement that I received from him throughout this project. I am indebted to Dr. Douglas Perry for his invaluable guidance and support right from the very beginning of my work. I would also like to acknowledge my manager Dr. John Davies and colleague Dr. Meir Glick at Novartis Institutes for Biomedical Research for their constant feedback on my work which made me strive to do better. Many thanks go to the faculty and staff at School of Informatics. These include Dr. Kelsey Forsythe, Dr. David Wild, Ms. Mary O’Neil, Ms.Vickie Bucker and Mr. Dale Ray. Special thanks are due to MDL for awarding me with the Elsevier MDL Excellence in Informatics fellowship. Last but not the least, I thank my parents and my brother for always being there for me and supporting me in all my endeavors. iii TABLE OF CONTENTS ACKNOWLEDGEMENTS ............................................................................................... iii LIST OF TABLES .............................................................................................................. v LIST OF FIGURES ........................................................................................................... vi INTRODUCTION .............................................................................................................. 1 METHODS ......................................................................................................................... 9 RESULTS ......................................................................................................................... 18 DISCUSSION ................................................................................................................... 32 CONCLUSION ................................................................................................................. 35 REFERENCES ................................................................................................................. 36 APPENDIX ....................................................................................................................... 39 iv LIST OF TABLES Table 1. Datasets used for training and testing the naïve Bayesian models Table 2. MDDR activity classes used for study Table 3. Murcko assemblies for COX-2 inhibitors derived from WOMBAT database Table 4. Biggest target predicted by naïve Bayes model for MDDR generic activity classes v LIST OF FIGURES Figure 1a. Chemogenomics approach to drug discovery Figure 1b. Organization of a Chemogenomics Database Figure 2a. Substructure search vs Extended connectivity fingerprint structure search Figure 2b. Extended Connectivity Fingerprint generation method Figure 3. Pipeline Pilot protocol for building multiclass naïve Bayes model Figure 4. Murcko assembly distribution of MDDR and WOMBAT. Figure 5a-b. Percentage of test compounds with order of prediction for WOMBAT85% model Figure 6a. Percentage of test compounds derived from ten MDDR activity classes with order of prediction for WOMBAT model Figure 7a. Top target distribution for MDDR activity class Antineoplastic Figure 7b. Top target distribution for MDDR Kinases Figure 7c. Top target distribution for MDDR activity class Antiinflammatroy, Intestinal Figure 7d. Top target distribution for MDDR activity class Antihypertensive Figure 7e. Top target distribution for MDDR activity class Antiarithritic Figure 8a. MDDR Extreg: 142057 (N-(2-aminophenyl)-4-(3-hydroxypropanamido) benzamide). A substructure in many compounds with anti-cancer activity and predicted as a HDAC inhibitor. Figure 8b. MDDR Extreg 142055 (N-(2-aminophenyl)-4-isobutyramidobenzmide). Analog of 142057 and predicted as a HDAC inhibitor Figure 8c. N-aryl benzamide class that is patented as a HDAC inhibitor class Figure 9a. Trichostatin D. MDDR Extreg: 286017. Predicted as a HDAC inhibitor vi Figure 9b. Trichostatin A. Known HDAC inhibitor Figure 10. Closest similarity histogram for MDDR and WOMBAT COX-1 inhibitors Figure 11. Tanimoto similarity coefficient vs. the order of prediction for MDDR Kinases vii INTRODUCTION The completion of Human Genome Project is seen as a gateway to the discovery of novel drug targets (Jacoby, Schuffenhauer, & Floersheim, 2003). How much of this information is actually translated into knowledge, e.g., the discovery of novel drug targets, is yet to be seen. The traditional route of drug discovery has been from target to compound. Conventional research techniques are focused around studying animal and cellular models which is followed by the development of a chemical concept. Modern approaches that have evolved as a result of progress in molecular biology and genomics start out with molecular targets which usually originate from the discovery of a new gene .Subsequent target validation to establish suitability as a drug target is followed by high throughput screening assays in order to identify new active chemical entities (Hofbauer, 1997). In contrast, chemogenomics takes the opposite approach to drug discovery (Jacoby, Schuffenhauer, & Floersheim, 2003). It puts to the forefront chemical entities as probes to study their effects on biological targets and then links these effects to the genetic pathways of these targets (Figure 1a). The goal of chemogenomics is to rapidly identify new drug molecules and drug targets by establishing chemical and biological connections. Just as classical genetic experiments are classified into forward and reverse, experimental chemogenomics methods can be distinguished as forward and reverse depending on the direction of investigative process i.e. from phenotype to target or from target to phenotype respectively (Jacoby, Schuffenhauer, & Floersheim, 2003). The identification and characterization of protein targets are critical bottlenecks in forward 1 Homologous Proteins Gene Families Genomics/ Proteomics Drug Chemotypes Chemoproteomics Figure 1a. Homologous proteins are characterized by studying gene families and traditional drug discovery route is from protein targets to drug molecules. Chemogenomics explores the relationship between structure (“chemical genotype”) and biological activity (“chemical phenotype”) and establish genetic connection based on chemical phenotype. Protein1 Compound1 Compound3 S i m i l a r i t y Cheminformatics Bioinformatics Protein2 Compound2 Figure 1b.Cheminformatics techniques are based on structural similarity between molecules while bioinformatics deals with sequence similarity to establish homologous proteins. Chemogenomics can be used to derive the missing links between compounds and targets as well as genetic connections between two targets. 2 chemogenomics experiments. Currently, methods such as affinity matrix purification (Taunton, Hassig, & Schreiber, 1996) and phage display (Sche, McKenzie, White, & Austin, 1999) are used to determine targets for compounds. None of the current techniques used for target identification after the initial screening are efficient. In silico methods can provide complementary and efficient ways to predict targets by using chemogenomics databases to obtain information about chemical structures and target activities of compounds. Annotated chemogenomics databases integrate chemical and biological domains and can provide a powerful tool to predict and validate new targets for compounds with unknown effects (Figure 1b). A chemogenomics database contains both chemical properties and biological activities associated with a compound. The MDL Drug Data Report (MDDR) (Molecular Design Ltd., San Leandro, California) is one of the well known and widely used databases that contains chemical structures and corresponding biological activities of drug like compounds. The relevance and quality of information that can be derived from these databases depends on their annotation schemes as well as the methods that are used for mining this data. In recent years chemists and biologist have used such databases to carry out similarity searches and lookup biological activities for compounds that are similar to the probe molecules for a given assay. With the emergence of new chemogenomics databases that follow a wellstructured and consistent annotation scheme, new automated target prediction methods are possible that can give insights to the biological world based on structural similarity between compounds. The usefulness of such databases lies not only in predicting targets, but also in establishing the genetic connections of the targets discovered, as a consequence of the prediction. 3 The ability to perform automated target prediction relies heavily on a synergy of very recent technologies, which includes: i) Highly structured and consistently annotated chemogenomics databases. Many such databases have surfaced very recently; WOMBAT (Sunset Molecular Discovery LLC, Santa Fe, New Mexico), KinaseChemBioBase (Jubilant Biosys Ltd., Bangalore, India) and StARLITe (Inpharmatica Ltd., London, UK), to name a few. ii) Chemical descriptors (Xue & Bajorath, 2000) that capture the structure-activity relationship of the molecules as well as computational techniques (Kitchen, Stahura, & Bajorath, 2004) that are specifically tailored to extract information from these descriptors. iii) Data pipelining environments that are fast, integrate multiple computational steps, and support large datasets. A combination of all these technologies may be employed to bridge the gap between chemical and biological domains which remains a challenge in the pharmaceutical industry. 4 BACKGROUND The current trend in drug discovery is to focus on gene families, which makes it possible to work with multiple targets simultaneously, because of their genetic interrelationship. Chemogenomics seeks to apply this relationship by starting with chemical compounds to find novel targets. In silico methods can be used to further advance this field. A number of different in silico approaches are possible, depending on the objective of a particular study. The most widely used method to make a selection is to carry out similarity searches on a structural compound database with reference molecules (Hert et al., 2004). Cutoff values derived from similarity metrics like the Tanimoto coefficient (Willett, 2005) can be used to filter uninteresting compounds. Three-dimensional pharmacophorebased searches can also be used when 3-D structural information and active ligands are available (Toledo, Lydon, & Elbaum, 1999). Classification methods are useful to generate focused libraries as well as identify novel compounds from large compound collections. Genetic algorithms (GA) (Gillet, Willett, & Bradshaw, 1998), neural networks (NN) (Manallack et al., 2002), support vector machines (SVM) (Weinstein et al., 1997), recursive partitioning (RP) (P. Blower, Fligner, Verducci, & Bjoraker, 2002; Rusinko, Farmen, Lambert, Brown, & Young, 1999), naïve Bayes classifier (NB) (Klon, Glick, & Davies, 2004; Xia, Maliski, Gallant, & Rogers, 2004), and standard and partial least square methods (Zhu, Hou, Chen, & Xu, 2001) are some of the classification techniques that have gained popularity among researchers to model structure-activity relationships of compounds. Clustering methods are also widely used to group compounds with similar bioactivity based on their structural information (Shemetulskis, Dunbar, Dunbar, 5 Moreland, & Humblet, 1995). A number of hierarchical methods such as Wards and group-average and non hierarchical clustering methods such as Jarvis Patrick were compared in a study conducted by Brown and Martin (Brown & Martin, 1998). The Wards clustering method is reported to perform best in separating active from inactive compounds in this study. K-means is another nonhierarchical clustering method that has been reported to perform well in a study to identify ligand binding sites on proteins (Glick, Robinson, Grant, & Richards, 2002). The performance of a particular similarity search, clustering or classification method in terms of predicting activity for a given compound depends on the dataset as well the choice of descriptors. Numerous descriptors have been developed to describe molecules: 1-D descriptors such as molecular weight, Simplified Molecular Input Line Entry Specification (SMILES), and melting point, 2-D descriptors such as connectivity tables and 2-D fingerprints, and 3-D descriptors such as pharmacophore definition triplets (PDTs) (Abrahamian et al., 2003) have been used to represent chemical structures. 2-D fingerprinting is by far the most commonly used structure abstraction technique. A recent study done by Hert et al. to compare a number of 2-D fingerprints such as BCI, Daylight, Extended Connectivity fingerprints (ECFPs), Unity and Avalon shows the effectiveness of ECFPs, which are based on circular substructures, over other topological descriptors (Hert et al., 2004). Also, Klon et al. have reported considerable enrichment in high throughput docking data by using ECFPs in combination with Naïve Bayesian classifier (Klon, Glick, & Davies, 2004). Modeling biological activities based on chemical structure abstraction for a wide range of compounds has been attempted recently in the form of a computer program, 6 PASS (Lagunin, Stepanchikova, Filimonov, & Poroikov, 2000) that has also been integrated with the National Cancer Institute’s open database browser (Poroikov et al., 2003). PASS makes use of substructure descriptors called multilevel neighborhoods of atoms (MNA) (Fomenko, Sobolev, Filimonov, & Poroikov, 2003). After an initial training with known biologically active compounds, it can be used to predict a wide range of biological activities such as mutagenicity, carcinogenicity, and embryotoxicity for test compounds. The usefulness of annotated chemical libraries, overlaying the ligand-target domains, has been recently recognized. Studies analyzing ligand-target space (Weinstein et al., 1997) and finding correlation between substructures and gene expression patterns (P. E. Blower et al., 2002) have been reported. Kohonen self-organizing maps were recently employed in an attempt to classify 20,000 compounds tested against 80 of NCI’s tumor cell lines (Rabow, Shoemaker, Sausville, & Covell, 2002) with very useful results. In another effort to integrate bio- and cheminformatics domains, annotation schemes for ligands of four target classes (enzymes, G protein-coupled receptors, nuclear receptors and ligand-gated ion channels) have been proposed (Schuffenhauer et al., 2002). The amalgamation of a robust fingerprinting method and an appropriate computational model trained on a well-annotated chemogenomics database can be an extremely useful tool to discover novel targets for compounds with unknown activities. Further, the structure-activity relationship homology principle (Frye, 1999) can be applied to discover leads for homologous targets. 7 Hypothesis It is possible to predict targets for compounds with unknown activities by training multiple naïve Bayes classifiers on compounds with known biological targets. The intent of this research is to train multiple Laplacian-modified naïve Bayes models with extended connectivity fingerprints on compound records derived from the WOMBAT database in order to predict targets for test compounds derived from the MDDR database. 8 METHODS Extended Connectivity Fingerprints Extended connectivity fingerprints, developed by SciTegic (SciTegic, San Diego, California), are structure fingerprints based on the Morgan algorithm (Morgan, 1965). An ECFP feature represents an exact structure with limited and specified attachment points. Therefore, a substructure search using the fragment para-substituted benzene (a) would return compounds (b, c) while a search with ECFPs would return only (c) (Figure 2a). This is because for ECFPs, (b) does not conatin (a) as there is substitution on the benzene ring at locations other than the specified attachment atoms marked as X. a b c Figure 2a Substructure search vs Extended connectivity fingerprint structure search ECFPs are generated in an iterative fashion. Initially, each atom is assigned a code that is derived from the number of connections to the atom, element type, charge and mass of the atom which corresponds to ECFP with a neighborhood size of “0”. In the first iteration, information about each atom’s immediate neighbors is collected and a new code representing the atom and its immediate neighbors is generated. In each iteration, the neighborhood size becomes larger and the updated codes of the atoms from previous iterations are used for assigning new codes. When the desired neighborhood size is reached, the set of all features is returned as a fingerprint (Figure 2b). 9 Figure 2b Extended connectivity fingerprint generation method ECFPs with a neighborhood size of 6 (i.e., ECFP_6) were used as structural descriptors to train the naïve Bayes classifiers as they represent a fairly large set of features or structural units that are crucial for structural comparison of compounds (Klon, Glick, Thoma, Acklin, & Davies, 2004). Multiclass Naïve Bayesian Modeling Naïve Bayes Naïve Bayes, a statistical classification method, can be efficiently used for large datasets as it scales linearly with the number of molecules. It is based on Bayes rule of conditional probability (Weisstein) which states that given two events A and B, the probability of occurrence of event A given that event B has already occurred, P(A|B), is given by: P(A|B) =P(B|A)P(A)/P(B) where P(A) and P(B) are the probability of occurrence of events A and event B, respectively. In this study, naïve Bayes is used to classify compounds into actives and inactives against a particular target based on the frequency of occurrence of a given feature, which is encoded by extended connectivity fingerprint bits. Given an ECFP, the probability of a 10 given compound being active against a particular target that contains ECFP, is given by P(active|ECFP), where event active refers to the event that the compound is active. Thus, P(active|ECFP) = P(ECFP|active) P(active)/P(ECFP) This method is said to be naïve because it naively assumes independence of the features from one another. Making this assumption, it is valid to multiply the individual feature probabilities. Given n features or ECFP bits, P(active|ECFP) = P(ECFP1|active) X P(ECFP2|active) X P(ECFPn|active)P(active)/P(ECFP) Laplacian Correction A Laplacian correction is used as different features are sampled at different frequencies. If N is the number of compounds in a given dataset, of which M are active, the baseline probability of a compound being active is given by: P(active) = M/N The probability of a compound being active given feature F present in B number of compounds of which A are active can be given by P(active|F) = A/B As the number of compounds B diminishes, the probability estimate of a feature starts getting biased. For example if A =1 and B =1 then P(active|F) = 1, which is not true as the feature is undersampled. 11 To correct for the above anomaly, it is assumed that most of the features have no relationship with activity, i.e., for most Fi, P(active|Fi) = P(active), which is the baseline probability. If the feature is sampled X additional times, then P(active) X X of those new samples should be active. Therefore, each feature can be corrected by adding those virtual samples. P(active|F) = (A + P(active) X X) / (B + X) Now, as the number of samples, B, decreases, the feature’s probability contribution converges to P(active). Relative Estimator Naïve Bayes as implemented in Pipeline Pilot is a relative estimator which is obtained by dividing the Laplacian-corrected Bayes classifier by P(active). Thus, P(active|F) = P(active|F)/P(active) which implies: i) for most features, log P(active|F) ~ 0 ii) for features more common in actives, log P(active|F) > 0 iii) for features less common in actives, log P(active|F) < 0. The Learn Molecular Categories component in Scitegic’s Pipeline Pilot software can build multiple naïve Bayes models in one protocol. The user needs to specify a “CategoryProperty” property on each data record, in this case target name, which is subsequently used to create different equations for different values of that property. Once a multiple naïve Bayes model is created, it can be added to another protocol to calculate naïve Bayes scores for multiple targets. A test compound is passed through each of the equations of this model and scores corresponding to each target name is calculated. The 12 highest score for a target is taken as most likely target for that compound. Similarly, the next highest score for a target can be assigned as second most likely target and so on. Databases and Datasets WOMBAT WOMBAT (World of Molecular Bioactivity) is a consistently annotated chemogenomics database. The version used for this study, WOMBAT 2005.1, contains 117,007 compounds and 104,230 unique SMILES. 230,000 biological activities and 1,021 unique targets are reported for these compounds. It contains 4,786 different chemical series from 4,773 papers published in medicinal chemistry journals between 1975 and 2004. The database follows a hierarchical scheme such that it is possible to look up the target family. The target families are based on the functional properties of targets. Targets are usually proteins but can also be DNA or RNA. For the purpose of this study, compounds with activity threshold, IC50/EC50/Ki/Kb/Kd/MIC/ED50 < 30µm were selected. Thus, a reduced dataset of 103,735 compounds with 964 biological targets was generated. Also, WOMBAT is organized in a way in which one compound may have more than one target. The database was preprocessed to associate one target per compound so that there is unique value for the “CategoryProperty” for the Learn molecular component of Pipeline Pilot. As a result, 192,373 compound records were generated because of duplication. This dataset will be referred to henceforth as the WOMBAT database. MDDR MDDR contains information about biologically relevant compounds. Many biological activities reported in MDDR are very generic e.g. antineoplastic, antihypertensive and 13 anti-inflammatory. It covers information available in patent literature, journal articles, and meetings. MDDR is updated annually. The version used for this study contains 159,967 compounds and 761 biological activity classes. There are 2,789 compounds in MDDR with no structural information. Since the descriptors used for statistical analysis in this study are based on structural information, these compounds were removed from the dataset and the remaining 156,873 compounds with 659 biological activity classes were kept for the study. Datasets We built two separate multiclass Bayes models for the purpose of this study. The WOMBAT 85% Model has 85% of WOMBAT compound records as training set and the other 15% is used as the test set. Cross validation of WOMBAT database was carried out because there are large number of records available for the study. There is no duplication of compounds across two sets. The second model, the WOMBAT model, uses WOMBAT as the training set. Test sets are derived from ten MDDR specific biological activity classes and ten generic biological activity classes. Table 1 summarizes the number of compounds and respective databases from which training and test sets are extracted for both models. Table 2 shows the MDDR activity classes with the total number of compound records corresponding to each class. Model Name Wombat85% Model Wombat Model Number of compounds Training Set Test Set 162238 30135 (WOMBAT) (WOMBAT) 192373 53137 (WOMBAT) (MDDR) Table 1. Datasets used for training and testing the naïve Bayesian models 14 MDDR Activity Class Number of Compounds Phosphodiesterase IV Inhibitor 2000 Cyclooxygenase-1 Inhibitor 88 Cyclooxygenase-2 Inhibitor 1055 Acetylcholinesterase Inhibitor 810 Angiotensin II AT1 Antagonist 2185 Angiotensin II AT2 Antagonist 53 ACE Inhibitor 570 Reverse Transcriptase Inhibitor 819 HIV-1 Protease Inhibitor 1027 H+/K+-ATPase Inhibitor 751 Antineoplastic 19621 Kinases* 1858 Anti-inflammatory, Intestinal 29 Antihypertensive 11124 Anti-arthritic 11147 Table 2. MDDR activity classes used for study *Kinases are derived from six MDDR activity classes viz. Adenosine Kinase Inhibitor, Nucleoside Diphosphate Kinase Inhibitor, Phosphatidylinositol Kinase Inhibitor, Protein Kinase C Inhibitor, Thymidine Kinase Inhibitor and Tyrosine-Specific Protein Kinase Inhibitor Software Scitegic’s Pipeline Pilot™ version 4.5.2 was employed to build the multiclass naïve Bayesian models. Hardware The computations were carried out on a Pipeline Pilot server with Intel® Xeon™ 2.7 GHz dual processors and 4 GB of RAM. 15 Protocol for Building Multiclass Naïve Bayesian Models Figure 3 shows the components used by the Pipeline Pilot protocol that was used to build multiclass Bayesian models. The preprocessing steps involved: a. Reading of the compound library into the protocol b. Conversion to Pipeline Pilot molecule format from SMILES string c. Removal of salts if any from the molecule d. Standardization of stereo-atoms and charges e. Reformatting of text string contained in target name to remove any occurrences of single quotes f. Calculation of extended connectivity fingerprints for all molecules Figure 3 Pipeline Pilot protocol for building multiclass naïve Bayes models The protocol generates a new component called Learn Molecular properties. This component can be used in another protocol with test data as input to this component. Diversity Analysis Diversity analysis is carried out to ensure that the model is applicable over a wide range of structural classes. An analysis of structural diversity among the training and test set can thus establish the effectiveness of the computational model. The chemical diversity was assessed by carrying out Murcko assemblies analysis (Bemis & Murcko, 1996) of WOMBAT and MDDR compounds. Murcko assemblies are obtained by removing the side chains from the molecule and keeping the core ring system and associated atom type, 16 bond order and hybridization information. Table 3 shows examples of COX-2 inhibitors derived from the WOMBAT database, and their corresponding Murcko assemblies. After generating Murcko assemblies for all the compounds in WOMBAT and MDDR, unique as well as assemblies common to both can be determined. Based on the number of unique assemblies it can be determined if two datasets differ significantly in terms of chemical diversity. Compound Structure Murcko Assembly Table 3. Murcko assemblies for COX-2 inhibitors derived from WOMBAT database. 17 RESULTS Diversity Analysis Figure 4 shows the results of the diversity analysis based on Murcko scaffold analysis. It shows that there are 65,806 Murcko assemblies unique to MDDR, 24,772 assemblies unique to WOMBAT and 6,395 assemblies common to both datasets. Also, the figure depicts the frequency of assemblies in the datasets, e.g., the bar region corresponding to blue accounts for Murcko assemblies that occur just once in the complete dataset (singletons). The red region corresponds to assemblies that occur twice in the dataset, and so on. The results show that MDDR is more diverse than WOMBAT, as it contains larger number of distinct Murcko fragments, but it cannot be used for effective target prediction simply because it is not highly annotated. There is a significant overlap between both datasets, but in spite of that there are large numbers of Murcko fragments unique to each database. 18 M- MDDR Scaffolds; MC, WC-Common Scaffolds in MDDR & WOMBAT; W-WOMBAT Scaffolds Frequency of scaffold occurrence denoted by color; blue: 1 occurrence, red: 2 occurrences, yellow: 3 occurrences, black: 4 occurrences, green: 5 or more occurrences Figure 4. Murcko assembly distribution of MDDR and WOMBAT Method Validation Wombat 85% Model The top 3 scores or top 3 targets were evaluated and the results were interpreted in terms of percentage of compounds and rank in which the model predicted the right target. Figure 5a shows the results of target prediction for the WOMBAT 85% model. This model predicts the right target on the first guess for 82% of the compounds, on the second guess for 8% of the compounds, and on the third guess for 2% of the compounds. 19 Also, 8% of the compounds were not correctly predicted in the top 3 guesses. A further analysis was done to examine predicted target families. The model predicts the right target family for 89% of the compounds in first guess as shown in, Figure 5b. First Target Prediction right Second Target prediction right Third Target Prediction right Not predicted right in top three predictions Figure 5a Target Prediction 20 First Target Prediction right Second Target prediction right Third Target Prediction right Not predicted right in top three predictions Figure 5b Target Family Prediction Figure 5a-b. Percentage of test compounds with order of prediction for WOMBAT85% model WOMBAT Model Figure 6a shows the percentage of compounds predicted with the correct target in the top three predictions, as well as compounds for which the model failed to predict the correct target. Figure 6b correspondingly depicts the percentage of compounds that were predicted in the correct target family, as well as the unsuccessful predictions for the target family. On average, 77% of the compounds were predicted with correct targets and 79% of the compounds were predicted with the correct family in top three predictions. 21 First Target Prediction right Second Target prediction right Third Target Prediction right Not predicted right in top three predictions Figure 6a Target Prediction 22 First Target Family Prediction right Second Target Family prediction right Third Target Family Prediction right Not predicted right in top three predictions Figure 6b Target Family Prediction Figure 6a-b Percentage of test compounds derived from ten MDDR activity classes with order of prediction for WOMBAT model Qualitative Analysis Top targets predicted for five generic activity classes, viz. Antineoplastic, Kinase, Antiarthritic, Antihypertensive and Anti-inflammatory, Intestinal, derived from MDDR were examined. The results were binned on target frequencies and interpreted in terms of the type of targets assigned to the generic activities by naïve Bayes Models. Figures 7a-e show the distributions of targets for five generic activity classes. Verification with the 23 scientific literature available on corresponding targets revealed that the predicted target with maximum frequency of compounds was associated with the corresponding generic activity. Antineoplastics have tubulin, a well known antineoplastic target, as the biggest target (8%) (Figure 7a) (Hadfield, Ducki, Hirst, & McGown, 2003). Tubulin is essential to many vital processes such as cell division and movement of materials within a cell. Drugs such as Taxol® bind to tubulin and cause the protein to lose its flexibility, which prevents cell division. Figure 7a. Target Distribution: Top targets predicted for generic MDDR activity class Antineoplastic. Unregulated kinase activity is a frequent cause of cancer where kinases regulate aspects that control cell growth, movement and death (Melnikova & Golden, 2004). 24 Therefore many kinases are anticancer targets, which is evident by the target distribution for Antineoplastics. Kinases have epidermal growth factor receptor (EGFR), which is a tyrosine protein kinase (Mendelsohn & Baselga, 2003), as the biggest target (27%) (Figure 7b). These receptors exist on the cell surface and are involved in initiating a signal transduction cascade which ultimately leads to DNA synthesis and cell proliferation. Figure 7b Target Distribution: Top targets predicted for MDDR Kinases* The Anti-inflammatory, Intestinal activity class has NK1 as the biggest target (Figure 7c). NK1 is a tachykinin receptor. Tachykinin receptor antagonists have been strongly implicated for the treatment of at least 6 major disease groups which includes CNS disorders, pain and emesis, airway disease, urinary incontinence and intestinal dysfunction (Khawaja & Rogers, 1996). 25 Figure 7c Target Distribution: Top targets predicted for MDDR activity class Anti-inflammatory, Intestinal Antihypertensives have Angiotensin II type 1 (AT1) receptor as their biggest target (Figure 7d). AT1 receptor has a well established role in maintaining blood pressure, and water and electrolyte homeostasis (Wexler et al., 1992). AT1 antagonists work as antihypertensive agents by causing either arterial or mixed arterial and venous dilation. 26 Figure 7d Target Distribution: Top targets predicted for MDDR activity class Antihypertensive The Antiarthritic class of compounds has COX-2 as the biggest target (14%). COX-2 is an enzyme responsible for mediating pain and inflammation (Figure 7e). inhibitors are often used to treat arthritis (Hochberg, 2005). 27 COX-2 Figure 7e Target Distribution: Top targets predicted for MDDR activity class Antiarthritic Table 4 summarizes the biggest assigned target for each activity class along with binning frequency and respective the target families. MDDR Activity Class Binning Frequency Biggest Target Target Family Antineoplastic 50 Tubulin Tubulin Kinases* 10 EGFR Tyrosine protein Kinase Anti-inflammatory, Intestinal Antihypertensive Not Binned NK1 GPCR 50 AT1 GPCR Antiarthritic 50 COX-2 Prostaglandin G/H synthase, oxidoreductase, peroxidase, dioxygenase Table 4. Biggest target as predicted by naïve Bayes Models for MDDR generic activity classes 28 Examples of compounds with right prediction Histone Deacytlase (HDAC) inhibitors have been recently recognized as anticancer agents due to their role in the regulation of transcription (Vigushin & Coombes, 2002). HDAC, when inhibited, stops differentiation and induces cell cycle arrest. For nearly 2% of MDDR compounds with antineoplastic activity, HDACs are predicted as top targets by naïve Bayesian Models (Figure 7a) though MDDR contains no information on specific targets for these compounds. I here report three such compounds which have high structural similarity to the known HDAC inhibitors. i) N-(2-aminophenyl)-4-(3-hydroxypropanamido) benzamide (Figure 8a) MDDR Extreg 142057 is predicted as a HDAC inhibitor. This compound is a substructure in several other compounds that have antineoplastic activity. Also, one of these compounds belongs to a class of compounds known as N-aryl benzamides which are synthetic HDAC inhibitors (D. F. Schuppan, 2003). Figure 8c shows the Markush representation of the entire class and example of a compound belonging to that class. ii) N-(2-aminophenyl)-4-isobutyramidobenzmide (Figure 8b) MDDR Extreg 142055 is an analogue of N-(2-aminophenyl)-4-(3-hydroxypropanamido) benzamide and occurs as a substructure in various compounds with anticancer activity. HDAC is predicted as the top target for this compound. 29 Figure 8a N-(2-aminophenyl)-4-(3-hydroxypropanamido) benzamide Figure 8b N-(2-aminophenyl)-4-isobutyramidobenzmide Figure 8c N-aryl benzamide class iii) MDDR Extreg 286017: Trichostatin D (Figure 9a) is listed as an Antineoplastic in MDDR with no reported literature on a specific target. The top target for this compound as predicted by naïve Bayes Models was HDAC. Trichostatin A (Figure 9b) is a known natural HDAC inhibitor (Vigushin et al., 30 2001). It can be seen that both compounds have the same chemotypes except for the extra sugar ring in Trichostatin D. A further discussion with a chemist working on HDAC inhibitors in Novartis revealed that the compound can be very likely a prodrug. A prodrug is a pharmacological compound which is administered in an inactive form and is metabolized into an active form in the body. The sugar ring would be cleaved off in vivo resulting in a hydroxamic acid group as present in Trichostatin A. HO OH H N N O O OH O OH O Figure 9a. Trichostatin D 7-(4-dimethylaminophenyl)-4,6-dimethyl-7-oxo-N-3,4,5-trihydroxy-6- (hydroxymethyl) tetrahydro-2H-pyran-2-yloxy) hepta-2,4-dienamide Figure 9b. Trichostatin A 7-(4-dimethylaminophenyl)-4,6-dimethyl-7-oxo-hepta-2,4-dienehydroxamic acid 31 DISCUSSION Naïve Bayesian modeling in conjunction with extended connected fingerprints has been shown to perform well in identifying actives in comparison to other substructure based searching methods (Klon, Glick, & Davies, 2004). I here investigated the application of the Multicategory Naïve Bayes Model in predicting targets for compounds with unknown or little known activity information. The method is applicable over a wide range of activity classes and 77% of the test compound sets used for validation are predicted with correct targets. The method also performs well while correlating targets with the generic activities. The training of the Multicategory Naïve Bayes Model with a dataset of ~190,000 compound records with 964 targets takes nearly six hours. The subsequent target prediction rate is about 100 minutes for 100,000 compounds. Thus, it can be efficiently used as a complementary tool for target prediction for a big library of compounds. Data mining algorithms are largely dependent on the training datasets. First, the structural diversity among underlying compounds is essential to the success of such algorithms. In this study, COX-1 inhibitors have the lowest correct prediction percentage. Only ~27% of the COX-1 inhibitors are successfully predicted by the algorithm. To account for such a low prediction, I carried out a closest Tanimoto similarity analysis between the test set (COX-1 compound sets derived from MDDR) and training set (COX-1 compound sets derived from WOMBAT). Figure 10 shows the percentage of MDDR compounds with a given Tanimoto Similarity coefficient value with WOMBAT compounds. Nearly 95% of the MDDR compounds have a Tanimoto value <0.4 with WOMBAT COX-1 compounds. 32 Closest Similarities COX-1 % of MDDR Samples 30% 25% 20% 15% Frequency 10% 5% 0% 0 0.2 0.4 0.6 0.8 1 Tanimoto Similarities Figure 10. Closest Tanimoto similarity distribution for MDDR COX-1 compounds This implies that WOMBAT COX-1 compounds do not have sufficient structural diversity. The use of a structurally diverse dataset for training would thus result in better predictions. Second, high similarity between compounds in training and test sets can lead to good predictions. Further analysis of the kinases derived from MDDR was conducted to check how well the Bayes classifier works when there is low similarity between test and training sets. Figure 11 is a plot of Tanimoto similarity vs. the order of prediction for MDDR kinases. For a high similarity between the MDDR and WOMBAT, the first prediction is almost always correct. There is a decrease in correct prediction percentage with decrease in similarity but even for a low similarity score of 0.2-0.5, correct predictions are made by naïve Bayes classifier. This shows that Bayes can classify data even at low similarity values. 33 Not Predicted as Kinase in top 10 predictions Figure 11. Tanimoto similarity coefficient and the order of correct prediction for MDDR kinases Third, the consistency and quality of annotation in chemogenomics database is of prime importance when it comes to deriving meaningful and reliable conclusions from the predictions. The information contained in such databases is derived from research data published in medicinal chemistry journals, patents, and other such sources. There is a constant need for updating such databases as information about new targets and compounds becomes available. 34 CONCLUSION Naive Bayes models trained on a well-annotated chemogenomics database can be used to predict targets for compounds with unknown activities and ascertain genetic connections between targets. Structural diversity among compounds used for building models plays an important role in automated target prediction methods. Therefore, using a structurally diverse chemogenomics database as training set would result in better predictions. The use of in silico techniques in abstracting the structure activity relationship is valid as long as underlying information is correct and up-to-date. As new discoveries are made in the biological and chemical world, new information should be incorporated as it becomes available. 35 REFERENCES Abrahamian, E., Fox, P. C., Naerum, L., Christensen, I. T., Thogersen, H., & Clark, R. D. (2003). Efficient generation, storage, and manipulation of fully flexible pharmacophore multiplets and their use in 3-D similarity searching. J Chem Inf Comput Sci, 43(2), 458-468. Bemis, G. W., & Murcko, M. A. (1996). The properties of known drugs. 1. Molecular frameworks. J Med Chem, 39(15), 2887-2893. Blower, P., Fligner, M., Verducci, J., & Bjoraker, J. (2002). On combining recursive partitioning and simulated annealing to detect groups of biologically active compounds. J Chem Inf Comput Sci, 42(2), 393-404. Blower, P. E., Yang, C., Fligner, M. A., Verducci, J. S., Yu, L., Richman, S., et al. (2002). Pharmacogenomic analysis: correlating molecular substructure classes with microarray gene expression data. Pharmacogenomics J, 2(4), 259-271. Brown, R. D., & Martin, Y. C. (1998). An evaluation of structural descriptors and clustering methods for use in diversity selection. SAR QSAR Environ Res, 8(1-2), 23-39. D. F. Schuppan, C. H., M. Gansmayer, M. Ocker, K. Thierauch. (2003). Germany Patent No. WO 2004058234;US 2005054647. Fomenko, A. E., Sobolev, B. N., Filimonov, D. A., & Poroikov, V. V. (2003). [Use of structural MNA descriptors for designing profiles of protein families]. Biofizika, 48(4), 595-605. Frye, S. V. (1999). Structure-activity relationship homology (SARAH): a conceptual framework for drug discovery in the genomic era. Chem Biol, 6(1), R3-7. Gillet, V. J., Willett, P., & Bradshaw, J. (1998). Identification of biological activity profiles using substructural analysis and genetic algorithms. J Chem Inf Comput Sci, 38(2), 165-179. Glick, M., Robinson, D. D., Grant, G. H., & Richards, W. G. (2002). Identification of ligand binding sites on proteins using a multi-scale approach. J Am Chem Soc, 124(10), 2337-2344. Hadfield, J. A., Ducki, S., Hirst, N., & McGown, A. T. (2003). Tubulin and microtubules as targets for anticancer drugs. Prog Cell Cycle Res, 5, 309-325. Hert, J., Willett, P., Wilton, D. J., Acklin, P., Azzaoui, K., Jacoby, E., et al. (2004). Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures. Org Biomol Chem, 2(22), 3256-3266. Hochberg, M. C. (2005). COX-2 selective inhibitors in the treatment of arthritis: a rheumatologist perspective. Curr Top Med Chem, 5(5), 443-448. Hofbauer, K. G. (1997, July 5, 1997). Target Identification and Concept Validation in Drug Discovery. Paper presented at the IUPS Satellite Conference 1997: On Designing the Physiome Project, Petrodvoret, St. Petersburg, Russia. Jacoby, E., Schuffenhauer, A., & Floersheim, P. (2003). Chemogenomics knowledgebased strategies in drug discovery. Drug News Perspect, 16(2), 93-102. Khawaja, A. M., & Rogers, D. F. (1996). Tachykinins: receptor to effector. Int J Biochem Cell Biol, 28(7), 721-738. 36 Kitchen, D. B., Stahura, F. L., & Bajorath, J. (2004). Computational techniques for diversity analysis and compound classification. Mini Rev Med Chem, 4(10), 10291039. Klon, A. E., Glick, M., & Davies, J. W. (2004). Application of machine learning to improve the results of high-throughput docking against the HIV-1 protease. J Chem Inf Comput Sci, 44(6), 2216-2224. Klon, A. E., Glick, M., Thoma, M., Acklin, P., & Davies, J. W. (2004). Finding more needles in the haystack: A simple and efficient method for improving highthroughput docking results. J Med Chem, 47(11), 2743-2749. Lagunin, A., Stepanchikova, A., Filimonov, D., & Poroikov, V. (2000). PASS: prediction of activity spectra for biologically active substances. Bioinformatics, 16(8), 747748. Manallack, D. T., Pitt, W. R., Gancia, E., Montana, J. G., Livingstone, D. J., Ford, M. G., et al. (2002). Selecting screening candidates for kinase and G protein-coupled receptor targets using neural networks. J Chem Inf Comput Sci, 42(5), 1256-1262. Melnikova, I., & Golden, J. (2004). Targeting protein kinases. Nat Rev Drug Discov, 3(12), 993-994. Mendelsohn, J., & Baselga, J. (2003). Status of epidermal growth factor receptor antagonists in the biology and treatment of cancer. J Clin Oncol, 21(14), 27872799. Morgan, H. L. (1965). The generation of a unique machine description for chemical structures - a technique developed at Chemical Abstracts Service. J Chem Doc(5), 107-113. Poroikov, V. V., Filimonov, D. A., Ihlenfeldt, W. D., Gloriozova, T. A., Lagunin, A. A., Borodina, Y. V., et al. (2003). PASS biological activity spectrum predictions in the enhanced open NCI database browser. J Chem Inf Comput Sci, 43(1), 228236. Rabow, A. A., Shoemaker, R. H., Sausville, E. A., & Covell, D. G. (2002). Mining the National Cancer Institute's tumor-screening database: identification of compounds with similar cellular activities. J Med Chem, 45(4), 818-840. Rusinko, A., 3rd, Farmen, M. W., Lambert, C. G., Brown, P. L., & Young, S. S. (1999). Analysis of a large structure/biological activity data set using recursive partitioning. J Chem Inf Comput Sci, 39(6), 1017-1026. Sche, P. P., McKenzie, K. M., White, J. D., & Austin, D. J. (1999). Display cloning: functional identification of natural product receptors using cDNA-phage display. Chem Biol, 6(10), 707-716. Schuffenhauer, A., Zimmermann, J., Stoop, R., van der Vyver, J. J., Lecchini, S., & Jacoby, E. (2002). An ontology for pharmaceutical ligands and its application for in silico screening and library design. J Chem Inf Comput Sci, 42(4), 947-955. Shemetulskis, N. E., Dunbar, J. B., Jr., Dunbar, B. W., Moreland, D. W., & Humblet, C. (1995). Enhancing the diversity of a corporate database using chemical database clustering and analysis. J Comput Aided Mol Des, 9(5), 407-416. Taunton, J., Hassig, C. A., & Schreiber, S. L. (1996). A mammalian histone deacetylase related to the yeast transcriptional regulator Rpd3p. Science, 272(5260), 408-411. Toledo, L. M., Lydon, N. B., & Elbaum, D. (1999). The structure-based design of ATPsite directed protein kinase inhibitors. Curr Med Chem, 6(9), 775-805. 37 Vigushin, D. M., Ali, S., Pace, P. E., Mirsaidi, N., Ito, K., Adcock, I., et al. (2001). Trichostatin A is a histone deacetylase inhibitor with potent antitumor activity against breast cancer in vivo. Clin Cancer Res, 7(4), 971-976. Vigushin, D. M., & Coombes, R. C. (2002). Histone deacetylase inhibitors in cancer treatment. Anticancer Drugs, 13(1), 1-13. Weinstein, J. N., Myers, T. G., O'Connor, P. M., Friend, S. H., Fornace, A. J., Jr., Kohn, K. W., et al. (1997). An information-intensive approach to the molecular pharmacology of cancer. Science, 275(5298), 343-349. Weisstein, E. W. Bayes Theorem [Electronic Version]. MathWorld from http://mathworld.wolfram.com/BayesTheorem.html. Wexler, R. R., Carini, D. J., Duncia, J. V., Johnson, A. L., Wells, G. J., Chiu, A. T., et al. (1992). Rationale for the chemical development of angiotensin II receptor antagonists. Am J Hypertens, 5(12 Pt 2), 209S-220S. Willett, P. (2005). Searching techniques for databases of two- and three-dimensional chemical structures. J Med Chem, 48(13), 4183-4199. Xia, X., Maliski, E. G., Gallant, P., & Rogers, D. (2004). Classification of kinase inhibitors using a Bayesian model. J Med Chem, 47(18), 4463-4470. Xue, L., & Bajorath, J. (2000). Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening. Comb Chem High Throughput Screen, 3(5), 363-372. Zhu, L. L., Hou, T. J., Chen, L. R., & Xu, X. J. (2001). 3D QSAR analyses of novel tyrosine kinase inhibitors based on pharmacophore alignment. J Chem Inf Comput Sci, 41(4), 1032-1040. 38 APPENDIX CURRICULUM VITAE NIDHI B. Sc. (chemistry), M.S. (computer applications) Phone: 317-652-1631 Email: khurananidhi@gmail.com SUMMARY Solid grasp on basic sciences through an undergraduate degree in chemistry Extensive software development skills through a master’s degree in computer applications High level of awareness regarding software industry’s practices gained by internships and full time job as software engineer Pursuing a master’s degree in Chemical Informatics at Indiana University, Indianapolis. Worked at Novartis Institutes for BioMedical Research, Cambridge, MA as Cheminformatics Summer Science Intern during summer 2005. SKILLS Cheminformatics Molecular Modeling Languages Databases Platforms Web Technologies Tools & Utilities Others : Pipeline Pilot, Spotfire, ISIS/Base, ChemDraw, BCI Toolkit : Spartan, SYBYL, Macro Model, Maestro : C, JAVA, Visual Basic 6.0 : Oracle 9i, MS-Access : MS Win NT/9x/2K/XP, UNIX, Linux : HTML, XML, VBScript, ASP3.0, JSP, Servlets : Crystal Reports, Matlab : Labware EDUCATIONAL EXPERIENCE MS - Chemical Informatics, Indiana University Purdue University Indianapolis, Indiana. August 2004 – Till date. Expected graduation December 2005 Masters in Computer Application (M.C.A.), Maharishi Dayanand University, Haryana, India. September 2003 Bachelor of Science (B.Sc.)-Chemistry (Honors), Delhi University, New Delhi, India. July 1999 WORK EXPERIENCE Cheminformatics Summer Science Intern: June 2005 – August 2005 - Novartis Institutes for BioMedical Research, Cambridge, MA Worked with Lead Discovery Informatics division (Davies’ group) on Library and Database Annotation Concept for a project to explore the relationship between chemical structures and associated biological activities. Platform: Pipeline Pilot, Spotfire, ISIS/Base Software Engineer - March 2004 – July 2004 - Satyam Computer & Services Ltd., Hyderabad, India Provided offshore client support to Cisco Systems for development of an application that incorporated both Client/Server and n-tier Web Architecture Worked on J2EE platform using Struts framework Documented Use Cases, Test Cases and Design Elements following Satyam’s ISO 9001 quality procedures Successfully completed corporate training modules in OOAD & UML, Java, XML and Quality Assurance Project Intern - January 2003 – June 2003 - Kirloskar Pneumatic Company Ltd., New Delhi, India Involved in development of an Intranet application which automated the Tender Quotation Process of the organization Performed system analysis & design, development and documentation of the system Programming Platform: ASP3.0, VBScript, MS Access 2000 Summer Intern - June 1, 2001 – July 30, 2001. Claas India Ltd., Faridabad, India Involved in the development of a desktop application to manage fixed assets of the corporation. Designed and developed the application single handedly. Programming Platform: VB6.0, Oracle 9i and Crystal Reports 8.0 RELEVANT TRAINING/WORKSHOPS Pipeline Pilot Basic Training Course by SciTegic - June 2005. NCBI workshop on MapViewer – November 2005 AWARDS/FELLOWSHIPS - MDL Elsevier Excellence in Informatics Fellowship for the year 2004 ACDEMIC PROJECTS & PRESENTATIONS Conformational search on sixteen mer of a polymeric material that inhibits HIV-1 reverse transcriptase - Independent study project Fall 2005. Proteomics Laboratory Study and Implementation of workflow on Labware - Class project for Lab Info Management Systems course done in Spring 2005. Tracking system for unsigned Physician Orders - Platform JSP/Oracle 8i - Class project for Informatics Management course done in Spring 2005. Evaluation of BCI Toolkit - Class project for Chemical Information Technology course done in Fall 2004. Quest for Biomarkers - Class project for Introduction to Bioinformatics course done in Fall 2004. DETAILS OF COURSEWORK TAKEN MS in Chemical Informatics 1) Lab Info Management Systems 2) Introduction to Informatics 3) Chemical Information Technology 4) Introduction to Bioinformatics 5) Molecular Modeling 6) Informatics Management 7) Informatics Research & Design Master’s Program in Computer Application (A three-year graduate program) 1) Mathematical foundation of computer science 2) Probability and statistics 3) Computer based numerical & statistical techniques 4) Computer based optimization techniques 5) Computer programming & problem solving 6) Business data processing 7) Combination & graph theory 8) Data & file structure 9) System software and C programming 10) Database management system 11) System analysis & design 12) Operating systems 13) Interactive computer graphics 14) Distributed computer network application 15) Simulation & modeling 16) Artificial intelligence 17) Software engineering 18) Computer organization & assembly language 19) Microprocessor & application 20) Computer architecture 21) Accounts & financial management 22) Organizational structure & personnel management 23) Information system design & implementation Bachelor of Science (with honors in chemistry) (Not a complete list) 1) Inorganic Chemistry 2) Organic Chemistry 3) Physical Chemistry 4) Environmental Chemistry 5) Mathematics 6) Physics