Jonathan Chen 2/12/2016

CombiCDB: Combinatorial Reaction Processing Extends the Utility of the UCI Chemical Data Bank

1. Introduction

High-throughput screening of small molecules, discovery of important chemical properties, and in silico chemical synthesis and design can all be greatly facilitated by large databases of small-molecule information. Examples of such exploration of chemical space include screening drug candidate molecules by molecular docking and applying machine-learning techniques to predict chemical toxicity. Such databases enable this work by allowing massive in silico chemical processing that would be impractical or impossible in a traditional in vitro setting. Many already exist, such as the NCI open database and ACL MDL; however, most are privately owned with potentially prohibitive usage costs. The databases that are publicly available generally have on the order of 10^3 to 10^5 compounds (Voigt 2001). The UCI Chemical Data Bank under development has on the order of 10^7 compounds consolidated from multiple public and private sources. Furthermore, relative to some comparable services such as the Harvard ChemBank (http://chembank.med.harvard.edu/), the UCI Chemical Data Bank will include analysis tools such as reaction processing and fingerprint similarity searches to aid in discovery.

We hypothesize that the library’s size, and thus its utility, can be expanded even further, from known available chemicals to theoretical compounds that are easily synthesized from those readily available sources. This can be achieved by applying in silico reactions to the current data set. Here we consider the application of combinatorial expansion to a specific “click chemistry” reaction (Kolb 2001) that is used to design an in vitro drug-binding experiment in collaboration with Dr. Gregory Weiss of the UCI Dept. of Chemistry.
Additionally, novel screening criteria are explored that may yield useful (and not necessarily drug-related) information about chemical space, such as the identification of information-containing polymer components. That is, we seek a screen that can search molecule space and find DNA, RNA and protein components, as well as other potentially similar but as yet unknown constructs that may lend insight into the very origins of life.

2. Materials / Methods

The database is built on the open source PostgreSQL platform (http://www.postgresql.org/) using data sources from internet files and vendor CDs such as the Sigma-Aldrich (http://www.sigmaaldrich.com) and Maybridge (http://www.maybridge.com) catalogs. Compound structural data is standardized and stored in SDF format, converted by OpenEye Software’s OEChem toolkit (http://www.eyesopen.com/products/toolkits/oechem.html). Additional curation and normalization steps such as Corina 3D coordinate generation (Sadowski 1996) are applied to the data as it is inserted.

Compound information is determined and stored at multiple levels, in addition to supplemental annotations. This information ranges from full three-dimensional structural data, to SMILES (James 2004) that specify only the basic chemical connection table (without spatial coordinates), to simple “fingerprints” (James 2004) that abstractly summarize a compound’s structural information. Implicit or explicit functional group annotation is applied using the Daylight SMARTS pattern method (James 2004) as implemented in OpenEye Software’s OEChem toolkit (http://www.eyesopen.com/products/toolkits/oechem.html). Combinatorial reactions that specify which functional groups can react are defined by the Daylight SMIRKS specification (James 2004). For example, amino groups and carboxylic acids react to form an amide bond (Figure 1 and Figure 2).
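As a rough illustration of how a SMIRKS string encodes a transformation, the sketch below takes the amide-formation SMIRKS of Figure 1, splits it into reactant and product patterns, and lists the mapped atoms that appear among the reactants but not among the products (the leaving group). It uses plain Python string handling only. It is not OEChem or Daylight toolkit code, the helper names (`mapped_atoms`, `split_smirks`, `leaving_atoms`) are hypothetical, and the naive parsing assumes every atom of interest is written as a mapped bracket atom such as `[O:3]`.

```python
import re

# Amide-formation SMIRKS as given in Figure 1.
AMIDE_SMIRKS = "[O:1]=[C:2][O:3][H:7].[H:8][N:4][H:5]>>[O:1]=[C:2][N:4][H:8]"

# Matches a mapped bracket atom, e.g. "[O:3]" -> ("O", "3").
# Assumption: simple element symbols only, no charges or SMARTS primitives.
MAPPED_ATOM = re.compile(r"\[([A-Za-z]+):(\d+)\]")


def mapped_atoms(pattern):
    """Return {map_number: element_symbol} for each mapped atom in a pattern."""
    return {int(num): sym for sym, num in MAPPED_ATOM.findall(pattern)}


def split_smirks(smirks):
    """Split 'reactants>>products' into two lists of component patterns.

    Splitting on '.' is naive but sufficient for the simple patterns here.
    """
    reactant_side, product_side = smirks.split(">>")
    return reactant_side.split("."), product_side.split(".")


def leaving_atoms(smirks):
    """Mapped atoms present in the reactants but absent from the products."""
    reactants, products = split_smirks(smirks)
    before, after = {}, {}
    for pattern in reactants:
        before.update(mapped_atoms(pattern))
    for pattern in products:
        after.update(mapped_atoms(pattern))
    return {num: sym for num, sym in before.items() if num not in after}


reactants, products = split_smirks(AMIDE_SMIRKS)
lost = leaving_atoms(AMIDE_SMIRKS)
print("reactant patterns:", reactants)
print("product patterns: ", products)
print("atoms lost:       ", sorted(lost.items()))
```

For the amide reaction the lost atoms are one oxygen and two hydrogens, consistent with the loss of water on amide bond formation. A real reaction engine would additionally match these patterns against molecular graphs, which this sketch does not attempt.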
Figure 1: SMIRKS reaction specification (carboxylic acid + amine >> amide): [O:1]=[C:2][O:3][H:7].[H:8][N:4][H:5]>>[O:1]=[C:2][N:4][H:8]

Figure 2: SMIRKS for amide reaction applied to phenylalanine + serine peptide bond formation

Conceptually, these theoretical products can be reacted again with the original reactants, resulting in first polynomial then exponential growth of possible products. A redundancy check of products and source molecules is thus important, and is accomplished by comparing their canonical SMILES representations (James 2004).

Given the above, a combinatorial expansion against any existing library set can be easily generated. This technique is applied to a single “click chemistry” reaction (Figure 3) that is well suited to creating drug-like molecules. The possible products of the reaction from the chemical database can then be rapidly computed.

Figure 3: SMIRKS reaction specification for "click chemistry" reaction (alkyne + azide): [C:2]#[C:3].[N:5]=[N:6]=[N:7][H:8]>>[C:2]1=[C:3][N:5][N:6]=[N:7]1

The in vitro experiment involves a “dynamic screening” approach that calls for multiple products of the above synthesis to be dynamically synthesized (and degraded) in a common pool. However, with all of the predicted products synthesized in a single common pool, isolating individual products from the pool would be difficult. Instead, the predicted products are distributed into multiple pools by their molecular weight such that products can be easily separated by mass spectrometry methods (see Table 1 and Table 2).

A simple polymer screen (identifying candidates that can at least self-polymerize) is used as a filtering and property isolation example. Given a reaction specification, a simple form of the screen determines which molecules satisfy the requirements for each reactant and yield a product that does as well. That is, the molecule can react with itself to produce something that can still react with itself.

3. Results

In design preparation for the “click chemistry” reaction product screening, the reaction was first applied in silico against a commercially available subset of chemicals from the database to yield the following results.

- 144,701 unique and valid compounds read from the Aldrich catalog subset.
- 825 compounds fit the reactant group profiles.
  o A (33 Alkyne + Carboxylic Acid).
  o B (662 Alkyne + Not Carboxylic Acid).
  o C (7 Azide + Carboxylic Acid).
  o D (123 Azide + Not Carboxylic Acid).
- 5934 unique products generated by reacting group A with group D.
- 5334 unique products generated by reacting group B with group C.

The products predicted above are assigned into pools of 100 and distributed by molecular weight (see Table 1 and Table 2) such that there is still at least ~1 dalton difference between the closest products in any pool.

Line Number  SMILES                          Molecular Weight  MW Diff
0            c1(c[nH]nn1)C(=O)O              113.07494
1            c1(cn(nn1)[Li])C(=O)O           119.00800         5.93306
2            c1(c([nH]nn1)C)C(=O)O           127.10152         8.09352
3            c1(c(n(nn1)[Li])C)C(=O)O        133.03458         5.93306
4            c1(cn(nn1)[Na])C(=O)O           135.05677         2.02219
5            C1(=CN([NH]=N1)[Na])C(=O)O      136.06471         1.00794
6            C(C(=O)O)Cc1c[nH]nn1            141.12810         5.06339
7            C(C(=O)O)Cc1cn(nn1)[Li]         147.06116         5.93306
8            c1(c(n(nn1)[Na])C)C(=O)O        149.08335         2.02219
9            C1(=C(N([NH]=N1)[Na])C)C(=O)O   150.09129         1.00794

Table 1: Sample of SMILES strings representing 10 compounds from the “click chemistry” predicted product set, including (and ordered by) their calculated molecular weight. The final column is the difference in molecular weight between a compound and the one immediately preceding it. Note that as the number of compounds grows, the smallest molecular weight difference between adjacent compounds will shrink, in some cases even reaching zero for compounds with identical chemical formulas. Such compounds cannot be resolved by mass spectrometry.
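The pooling step can be sketched in a few lines. The example below uses the ten molecular weights listed in Table 1, orders them, deals them into pools in turn (round-robin), and compares the smallest adjacent molecular weight gap before and after pooling. The choice of 3 pools mirrors the small example of Table 2 rather than the pools of 100 used for the full product set, and the function names (`adjacent_gaps`, `assign_pools`, `min_gap_within_pools`) are hypothetical rather than taken from the project code.

```python
# (SMILES, molecular weight) pairs copied from Table 1.
TABLE_1 = [
    ("c1(c[nH]nn1)C(=O)O", 113.07494),
    ("c1(cn(nn1)[Li])C(=O)O", 119.00800),
    ("c1(c([nH]nn1)C)C(=O)O", 127.10152),
    ("c1(c(n(nn1)[Li])C)C(=O)O", 133.03458),
    ("c1(cn(nn1)[Na])C(=O)O", 135.05677),
    ("C1(=CN([NH]=N1)[Na])C(=O)O", 136.06471),
    ("C(C(=O)O)Cc1c[nH]nn1", 141.12810),
    ("C(C(=O)O)Cc1cn(nn1)[Li]", 147.06116),
    ("c1(c(n(nn1)[Na])C)C(=O)O", 149.08335),
    ("C1(=C(N([NH]=N1)[Na])C)C(=O)O", 150.09129),
]


def adjacent_gaps(weights):
    """Differences between consecutive values of an ascending list."""
    return [round(b - a, 5) for a, b in zip(weights, weights[1:])]


def assign_pools(compounds, n_pools):
    """Order compounds by molecular weight, then deal them into pools in turn."""
    ordered = sorted(compounds, key=lambda c: c[1])
    pools = [[] for _ in range(n_pools)]
    for i, compound in enumerate(ordered):
        pools[i % n_pools].append(compound)
    return pools


def min_gap_within_pools(pools):
    """Smallest molecular weight gap between neighbours in the same pool."""
    gaps = [g for pool in pools for g in adjacent_gaps([mw for _, mw in pool])]
    return min(gaps, default=float("inf"))


pools = assign_pools(TABLE_1, n_pools=3)
unpooled_min = min(adjacent_gaps(sorted(mw for _, mw in TABLE_1)))
pooled_min = min_gap_within_pools(pools)

print("smallest gap, single pool:", unpooled_min)
print("smallest gap, 3 pools:    ", pooled_min)
for number, pool in enumerate(pools):
    print(number, [mw for _, mw in pool])
```

With these ten compounds the smallest gap rises from about 1 dalton in a single pool to about 8 daltons within any of the three pools. Because neighbours in a pool are always a fixed number of positions apart in the sorted list, increasing the number of pools tends to widen the within-pool gaps, although compounds with identical formulas would still collide if they landed in the same pool.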
Line Number  SMILES                          Molecular Weight  Pool Number  MW Diff
0            c1(c[nH]nn1)C(=O)O              113.07494         0
3            c1(c(n(nn1)[Li])C)C(=O)O        133.03458         0            19.95964
6            C(C(=O)O)Cc1c[nH]nn1            141.12810         0            8.09352
9            C1(=C(N([NH]=N1)[Na])C)C(=O)O   150.09129         0            8.96319
1            c1(cn(nn1)[Li])C(=O)O           119.00800         1
4            c1(cn(nn1)[Na])C(=O)O           135.05677         1            16.04877
7            C(C(=O)O)Cc1cn(nn1)[Li]         147.06116         1            12.00439
2            c1(c([nH]nn1)C)C(=O)O           127.10152         2
5            C1(=CN([NH]=N1)[Na])C(=O)O      136.06471         2            8.96319
8            c1(c(n(nn1)[Na])C)C(=O)O        149.08335         2            13.01864

Table 2: The same compounds as in Table 1 but reordered and assigned a “pool” number indicating that they will be synthesized in separate groups, in this case for pools up to a size of 4. The MW Diff column is now the difference from the preceding compound in the same pool. With this approach, the nearest molecular weight difference for two compounds in the same pool is significantly increased, from ~1 to ~8 daltons. This is accomplished by simply ordering the compounds by molecular weight and then successively assigning compounds to each pool, up to the total number of desired pools, and repeating from the first pool.

The simple polymer screen was run on ~1.5 million compounds in the database, using 5 sample reaction specifications, including amide, ester and phosphodiester bond formation, as filters. This yielded ~250,000 polymer candidates. As a verification control, an amino acid (phenylalanine) and a nucleotide (dATP) (Figure 4) were “planted” in the dataset and indeed passed the screen based on amide and phosphodiester bond formation reactions, respectively.

Figure 4: dATP passes the simple polymer screen because it can react with itself to yield a product with the same properties, forming the initial components of a DNA polymer

4. Discussion

In designing a click chemistry product screen, ~10,000 possible products were predicted using the in silico synthesis method.
This is still a large number to manage in an in vitro laboratory setting; thus, they were distributed into pools of 100 products each, with molecular weights sufficiently dispersed to allow mass spectrometry to separate the products within a mixed pool.

The simple polymer screen is indeed sensitive enough to identify known components of DNA and peptides. However, as currently described, it has very little specificity for identifying other meaningful polymers that may perhaps store information and be structurally stable.

The project and database are still under design and development, and input is welcome. Key among the future directions is finding additional useful filtering / scoring / screening methods. Parallel uses of the system could then range from applying in silico chemical synthesis to chemical design. Some immediate and long-term goals include:

- Using the functional group and other chemical annotations for machine learning (property finding and pattern prediction) applications, such as chemical toxicity prediction.
- Refining filters to provide more realistic (spatial coordinate viability), practical (source amounts) and meaningful (information storage, reactivity) products.
- Applying molecular docking applications to identify combinatorial products as leads for drug design and other purposes.
- Devising chemical design techniques using the database. For example, some peptides may be known to readily act at some receptor site in a useful manner, but peptides make poor drugs. A design technique that uses the database to find or describe a drug-like molecule with comparable chemical properties to the peptide could thus be extremely useful.

5. Bibliography

http://chembank.med.harvard.edu/ ChemBank: Initiative for Chemical Genetics.
http://www.eyesopen.com/products/toolkits/oechem.html OEChem Toolkit.
http://www.maybridge.com Maybridge.
http://www.postgresql.org/ PostgreSQL.
http://www.sigmaaldrich.com Sigma-Aldrich.
James, C. A. (2004). Daylight Theory Manual.
http://www.daylight.com/dayhtml/doc/theory/theory.toc.html
Kolb, H. C. (2001). "Click Chemistry: Diverse Chemical Function from a Few Good Reactions." Angew. Chem. Int. Ed. 40: 2004-2021.
Sadowski, J. (1996). "Evaluation of 3D Structure Generators Revisited." http://www2.chemie.uni-erlangen.de/software/corina/xrayeval.html
Voigt, J. H. (2001). "Comparison of the NCI Open Database with Seven Large Chemical Structural Databases." J. Chem. Inf. Comput. Sci. 41: 702-712.