UCI Chemical Data Bank Technical Document

Table of Contents

Introduction
    Background
    Future Goals
Key Features / Functionality
    Fingerprint Similarity Searches
    Molecular Docking
    Functional Group Annotation / Screening
    Reaction Processing
    Chemical Property Prediction / Machine Learning
Technical Specifications
    Data Sources / Vendors
    Current DB Size
    Data Type Descriptions
    Database Schema
    Technology Dependencies
Current Projects
    Peptide Replacement for Protein Phosphatase Inhibition
    Dynamic in vitro Combinatorial Chemistry Screening Design

Introduction

(Copied from CombiCDB report)

Background

High-throughput screening of small molecules, discovery of important chemical properties and in silico chemical synthesis and design can all be greatly facilitated by large databases of small molecule information. Examples of such exploration of chemical space include screening drug candidate molecules by molecular docking and applying machine-learning techniques to predict chemical toxicity. Such databases enable this work by allowing massive in silico chemical processing that would be impractical or impossible in a traditional in vitro setting.

Many such databases already exist, such as the NCI open database and ACL MDL; however, most are privately owned, with potentially prohibitive usage costs. The databases that are publicly available generally hold on the order of 10^3 to 10^5 compounds (Voigt 2001). The UCI Chemical Data Bank under development holds on the order of 10^7 compounds consolidated from multiple public and private sources. Furthermore, relative to some comparable services such as the Harvard ChemBank (http://chembank.med.harvard.edu/), the UCI Chemical Data Bank will include analysis tools such as reaction processing and fingerprint similarity searches to aid in discovery.

Future Goals

Using the functional group and other chemical annotations for machine learning (property finding and pattern prediction) applications, such as chemical toxicity prediction.

Refining filters to provide more realistic (spatial coordinate viability), practical (source amounts) and meaningful (information storage, reactivity) products.

Applying molecular docking applications to identify combinatorial products as leads for drug design and other purposes.

Devising chemical design techniques using the database. For example, some peptides may be known to act readily and usefully at some receptor site, but peptides make poor drugs. A design technique that uses the database to find or describe a drug-like molecule with chemical properties comparable to the peptide's could thus be extremely useful.

Key Features / Functionality

Fingerprint Similarity Searches

Josh should describe?

Molecular Docking

Josh should describe?

Functional Group Annotation / Screening

Implicit or explicit functional group annotation is applied using the Daylight SMARTS pattern method (James 2004), as implemented in OpenEye Software’s OEChem toolkit (http://www.eyesopen.com/products/toolkits/oechem.html). Once the functional groups have been identified, screening for molecules that possess any particular combination of groups is straightforward, and it is useful for identifying molecules with likely reactivity profiles (see next section).

Reaction Processing

The library’s size, and thus its utility, can be expanded even further, from known available chemicals to theoretical compounds that are easily synthesized from those readily available sources. This is achieved by applying in silico reactions to the current data set. Once functional group annotations are made, combinatorial reactions that specify which groups can react are defined using the Daylight SMIRKS specification (James 2004). For example, amino groups and carboxylic acids react to form an amide bond (Figure 1 and Figure 2).
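
The way functional group annotations drive combinatorial reaction processing can be illustrated with a minimal Python sketch. It assumes the annotations are already available as plain sets of group names; the molecule table, the group names and the candidate_pairs helper are all hypothetical, and they only stand in for the SMARTS matching and SMIRKS processing that the database performs with OEChem.

```python
from itertools import product

# Hypothetical functional group annotations. In the real pipeline these would
# come from SMARTS pattern matching, not from a hand-written table.
ANNOTATIONS = {
    "phenylalanine": {"amine", "carboxylic_acid"},
    "serine": {"amine", "carboxylic_acid", "alcohol"},
    "benzene": set(),
    "acetic_acid": {"carboxylic_acid"},
}

# A reaction is described here only by the group each reactant must carry
# (assumption: a real SMIRKS specification also describes the bond changes).
AMIDE_FORMATION = ("carboxylic_acid", "amine")


def candidate_pairs(annotations, reaction):
    """Return ordered (first, second) reactant names that satisfy the reaction."""
    first_group, second_group = reaction
    firsts = sorted(name for name, groups in annotations.items()
                    if first_group in groups)
    seconds = sorted(name for name, groups in annotations.items()
                     if second_group in groups)
    return list(product(firsts, seconds))


pairs = candidate_pairs(ANNOTATIONS, AMIDE_FORMATION)
for acid, amine in pairs:
    print(f"{acid} + {amine} >> amide")
```

Molecules that carry both groups, such as the two amino acids, appear on both sides of the pairing, including paired with themselves.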

Figure 1: SMIRKS reaction specification (carboxylic acid + amine >> amide):

    [O:1]=[C:2][O:3][H:7].[H:8][N:4][H:5]>>[O:1]=[C:2][N:4][H:8]

Figure 2: SMIRKS for amide reaction applied to phenylalanine + serine peptide bond formation

Besides expanding the data set by predicting reaction products, the reaction processing capabilities can be put to other uses. For example, they can serve as part of a screen for potential polymer components. A simple polymer screen (identifying candidates that can at least self-polymerize) is accomplished, given reaction specifications, by determining which molecules satisfy the requirements for each reactant and yield a product that satisfies them as well. That is, the molecule can react with itself to produce something that can still react with itself.

Figure 3: dATP passes the simple polymer screen because it can react with itself to yield a product with the same properties, forming the initial components of a DNA polymer

Chemical Property Prediction / Machine Learning

Given data sets of chemicals with known abstract properties, such as toxicity, those properties can be predicted for other chemicals in the database for which they are unknown. These methods are based on applying kernel similarity measures to molecules, for example by interpreting the molecular graph or the SMILES string, and feeding the resulting data into a support vector machine (SVM).

Technical Specifications

Data Sources / Vendors

SDF, MOL2 and SMILES are the major file types stored in the database, so we have collected data in these formats from around 50 different vendors. A current list of vendors can be found in table [table number]. In the vendor table, we store the name, contact information and the date of the newest update of each dataset in the database. We periodically check for updates online or contact the vendors to request new data CDs when they are available.

Table: Current list of vendors

ACBBLOCKS, ACROS, ALDRICH, AMBINTER, ANALYTICON, ARRAYBIOPHARMA, ASINEX, ASYMCHEM, AURORA, CGX, CHEMBRIDGE, CHEMICALBLOCK, CHEMSTAR, CHESS, COMBIBLOCKS, DSSTOX, EMC, ENAMINE, FRONTIER, HARVARD, ICBSCREEN, IFLAB, INTERCHIM, KATRITSKY, KEYORGANICS, LABOTEST, MATRIX, MAYBRIDGE, MCL, MDPI, MDSI, MENAI, NANOSYN, NCI, PEAKDALE, PHARMEKS, RYAN, SPECS, SYNCHEM, TIMTEC, TOCRIS, TOSLAB, TRC, TRIPOS, TYGERSCIENTIFIC, VITASMLAB, WORLD, ZELINSKY

Current DB Size

Yimeng, Jocelyne?

Data Type Descriptions

Compound structural data is standardized and stored in SDF format, converted by OpenEye Software’s OEChem toolkit (http://www.eyesopen.com/products/toolkits/oechem.html). Additional curation and normalization steps, such as Corina 3D coordinate generation (Sadowski 1996), are applied to the data as it is inserted.

Compound information is determined and stored at multiple levels, in addition to supplemental annotations. This information ranges from full three-dimensional structural data in SDF files, to SMILES strings (James 2004) that specify only the basic chemical connection table (without spatial coordinates), to simple “fingerprints” (James 2004) that abstractly summarize a compound’s structural information.

Database Schema

Source: Table of vendors and other sources of chemical information.

ChemicalMix: Individual information items available from sources. These may not be individual chemicals, but a mix of chemicals available as a unit.

Source2ChemicalMix: Resolution table between sources and chemical mixes, needed because multiple sources may (and likely will) list the same chemical mixes. This model records all of that information without storing redundant data.

Annotation: High-level chemical annotations from sources, likely extracted from the “SD Tags” of the SDF molecule format the source data came in.

Chemical: Individual chemical components of chemical mixes that cannot be further resolved into components.
Specific, derived annotations are precomputed and stored to facilitate later computations. Examples include summary information about the atom and bond content (num_c, num_sg_bonds, etc.), solvation energy (ZAP_solvation, etc.) and abstract “fingerprint” representations.

MixtureComponent: Similar to Source2ChemicalMix, a resolution table between ChemicalMix and Chemical that avoids storing redundant Chemical information.

Isomer3d: 3D structural information / atom coordinates of chemicals, as generated by an external program (Corina, see Technology Dependencies). Multiple 3D conformations may exist here for any single chemical.

Technology Dependencies

PostgreSQL (http://www.postgresql.org/): The database is built on the PostgreSQL platform. This platform was chosen because it is open source and, among such options, it is well developed, well supported and expected to provide the best performance for large data sets.

Apache Web Server (http://httpd.apache.org/): Web-available interfaces and tools are delivered via this well-established and popular open-source web server.

Python (http://www.python.org): Many of the basic application tools, scripts and web interfaces are written in the Python programming language. This interpreted language is advantageous for rapid prototyping, though specific compute-intensive modules may be better ported to a stricter language like C or Java. In the meantime, one of its primary advantages is that it allows straightforward use of critical toolkits like the OpenEye cheminformatics toolkits (see below).

OpenEye toolkits (http://www.eyesopen.com): The OEChem toolkit implements many of the basic algorithms needed for chemical data processing, SMARTS pattern matching and SMIRKS reaction processing. Other components, such as the Ogham / OEDepict tools for chemical image rendering, are used as well.

Corina (http://www2.chemie.uni-erlangen.de/software/corina/index.html): Most of the chemical information supplied by vendors includes only the basic atom-bond connection table as far as structural data is concerned. Corina is a program used to generate 3D coordinate predictions for any given molecule. Only after this is done can the chemicals be applied to problems such as molecular docking (see below).

Dock: Given the 3D structural data for some kind of receptor molecule, the modified Dock program is used to screen the database for chemicals that can “dock” well with the receptor, that is, chemicals whose shape and properties are complementary to the binding site. This is especially useful for identifying potential drug molecule leads.

Current Projects

Peptide Replacement for Protein Phosphatase Inhibition

Generating combinatorial libraries of peptides is relatively easy and can be used to produce potential ligands for biological targets. However, peptides in general make poor drugs, and are not even good in vivo inhibitors, due to poor bioavailability, membrane impermeability and rapid degradation. Thus, given a peptide known to achieve a desirable effect on some receptor molecule, the ability to predict or construct a functionally comparable small organic molecule that is free of the practical limitations of peptides would be of great value.

The family of protein phosphatases (PP) that regulate the phosphorylation state of protein hydroxyl residues (Ser, Thr or Tyr) can be thought of as such receptor molecules. These phosphatases are ubiquitous enzymes that play a critical role in regulating important processes such as glycogen synthesis, cell division, gene expression, neurotransmission, muscle contraction and other signal transduction pathways. PP1 and PP2A are specifically under investigation; specific inhibitors for them would greatly assist the study of the pathways they regulate.

Many endogenous peptides, such as Inhibitor-1, are known to inhibit PP1 and PP2A specifically, without affecting other Ser-Thr phosphatases. As mentioned previously, however, these are poor choices as in vivo inhibitors or drugs because of their inherent peptide nature. Based on dozens of known PP1 target protein sequences, PP1 appears to be specific for proteins bearing the consensus sequence R/K-R/K-V/I-X-F/W. Finding or constructing an organic molecule replacement for this short peptide sequence is thus the broad goal.

Encouraging for this approach is the existence of many non-peptide organic molecules that have naturally evolved to inhibit these phosphatases, for example as plant or insect toxins. Known natural inhibitors include cyclic peptides like microcystin, terpenoids like cantharidin and polyketides like okadaic acid.

Structural data for the PP1c protein bound to target peptides bearing the consensus sequence are available (refer to PDB code 1FJM?). This will offer direction and a means of validating any predictions made. Strategies to replace the target peptide include molecular docking of candidate molecules into the “inverted surface” of the peptide as a mock receptor-binding site. However, this could be a time-intensive process, so broader screening approaches could apply machine-learning methods that identify molecules similar to the peptide by volume displacement, atom distribution in space or other coarse features that approximate the “shape” of the peptide.

Dynamic in vitro Combinatorial Chemistry Screening Design

The reaction processing tools can assist in the design of in vitro combinatorial chemistry experiments. The in vitro experiment involves a “dynamic screening” approach in which multiple synthesis reactants are added to a common pool containing some target receptor or other assayable attribute. These reactants are then allowed to undergo a specific “click chemistry” combination reaction, conducive to producing drug-like molecules.
The dynamic nature of the screen comes from the different reactant combinations in a common pool generating products that may either bind the receptor or revert to the source reactants and recombine into other products.

The reaction processing tools allow rapid computation of the possible products that can be generated from reactants in the chemical database, based on the “click chemistry” reaction specification. This matters in the design of the experiment because, with all of the potential products synthesized in a common pool, isolating individual products from the pool will be difficult. Instead, the products are first predicted in silico and then distributed into multiple pools by their calculated molecular weight, such that the products in each pool can be easily separated by mass spectrometry methods (see Table 1 and Table 2).

Line Number  SMILES                         Molecular Weight  MW Diff
0            c1(c[nH]nn1)C(=O)O             113.07494
1            c1(cn(nn1)[Li])C(=O)O          119.00800         5.93306
2            c1(c([nH]nn1)C)C(=O)O          127.10152         8.09352
3            c1(c(n(nn1)[Li])C)C(=O)O       133.03458         5.93306
4            c1(cn(nn1)[Na])C(=O)O          135.05677         2.02219
5            C1(=CN([NH]=N1)[Na])C(=O)O     136.06471         1.00794
6            C(C(=O)O)Cc1c[nH]nn1           141.12810         5.06339
7            C(C(=O)O)Cc1cn(nn1)[Li]        147.06116         5.93306
8            c1(c(n(nn1)[Na])C)C(=O)O       149.08335         2.02219
9            C1(=C(N([NH]=N1)[Na])C)C(=O)O  150.09129         1.00794

Table 1: Sample of SMILES strings representing 10 compounds from the “click chemistry” predicted product set, including (and ordered by) their calculated molecular weight. The final column is the difference in molecular weight between a compound and the one immediately preceding it. Note that as the number of compounds grows, the smallest molecular weight difference between adjacent compounds will shrink, in some cases even reaching zero for compounds with identical chemical formulas. Such compounds cannot be resolved by mass spectrometry.

Line Number  SMILES                         Molecular Weight  Pool Number  MW Diff
0            c1(c[nH]nn1)C(=O)O             113.07494         0
3            c1(c(n(nn1)[Li])C)C(=O)O       133.03458         0            19.95964
6            C(C(=O)O)Cc1c[nH]nn1           141.12810         0            8.09352
9            C1(=C(N([NH]=N1)[Na])C)C(=O)O  150.09129         0            8.96319
1            c1(cn(nn1)[Li])C(=O)O          119.00800         1
4            c1(cn(nn1)[Na])C(=O)O          135.05677         1            16.04877
7            C(C(=O)O)Cc1cn(nn1)[Li]        147.06116         1            12.00439
2            c1(c([nH]nn1)C)C(=O)O          127.10152         2
5            C1(=CN([NH]=N1)[Na])C(=O)O     136.06471         2            8.96319
8            c1(c(n(nn1)[Na])C)C(=O)O       149.08335         2            13.01864

Table 2: The same compounds as in Table 1, but reordered and assigned a “pool” number indicating that they will be synthesized in separate groups, in this case for pools up to a size of 4. With this approach, the smallest molecular weight difference between two compounds in the same pool increases significantly, from ~1 to ~8 daltons. This is accomplished by simply ordering the compounds by molecular weight and then assigning successive compounds to each pool in turn, up to the total number of desired pools, and repeating from the first pool.

In work already done to design a click chemistry product screen, ~10,000 possible products were predicted using the in silico synthesis method. This is still a large number to manage in an in vitro laboratory setting, so the products were distributed into pools of 100 products each, with molecular weights sufficiently dispersed to allow mass spectrometry to separate the products within each pool.
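
The pooling procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the actual screen: the function names assign_pools and smallest_gap are hypothetical, and the input is the ten molecular weights from Table 1, dealt into three pools as in Table 2.

```python
# Molecular weights of the ten compounds listed in Table 1.
WEIGHTS = [113.07494, 119.00800, 127.10152, 133.03458, 135.05677,
           136.06471, 141.12810, 147.06116, 149.08335, 150.09129]


def assign_pools(weights, num_pools):
    """Sort by weight, then deal compounds into pools in round-robin order."""
    pools = [[] for _ in range(num_pools)]
    for rank, weight in enumerate(sorted(weights)):
        pools[rank % num_pools].append(weight)
    return pools


def smallest_gap(sorted_weights):
    """Smallest difference between adjacent weights (inf if fewer than two)."""
    gaps = [b - a for a, b in zip(sorted_weights, sorted_weights[1:])]
    return min(gaps, default=float("inf"))


pools = assign_pools(WEIGHTS, num_pools=3)
unpooled_gap = smallest_gap(sorted(WEIGHTS))
pooled_gap = min(smallest_gap(pool) for pool in pools)

print(f"smallest gap without pooling: {unpooled_gap:.5f}")
print(f"smallest gap within any pool: {pooled_gap:.5f}")
```

Because the compounds are sorted before being dealt out, neighbours within a pool are separated by num_pools positions in the sorted list, which is what widens the gaps. Compounds with identical weights would still collide if they happened to land in the same pool, so a production version would need to check for that case.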