Summer2004

advertisement
Jonathan Chen
2/12/2016
CombiCDB:
Combinatorial Reaction Processing
Extends the Utility of the UCI Chemical Data Bank
1. Introduction
High-throughput screening of small molecules, discovery of important chemical
properties and in silico chemical synthesis and design can be greatly facilitated
with large databases of small molecule information. Examples of this exploration
to understand chemical space include screening drug candidate molecules by
molecular docking or applying machine-learning techniques to predict chemical
toxicity. Such databases enable this by allowing massive in silico chemical
processing that would be impractical or impossible in a traditional in vitro setting.
Many already exist, such as the NCI open database and ACL MDL, however,
most are privately owned with potentially prohibitive usage costs. The databases
that are publicly available generally have on the order of 10 3 to 105 compounds
(Voigt 2001). The UCI Chemical Data Bank under development has on the order
of 107 compounds consolidated from multiple public and private sources.
Furthermore, relative to some comparable services available such as the
Harvard ChemBank (http://chembank.med.harvard.edu/), the UCI Chemical Data
Bank will include analysis tools such as reaction processing and fingerprint
similarity searches to aid in discovery.
We hypothesize that the library’s size, and thus its utility, can be expanded even
further from known available chemicals to theoretical compounds that are easily
synthesized given those readily available sources. This can be achieved by
applying in silico reactions to the current data set. Here we will consider the
application of combinatorial expansion to a specific “click chemistry” reaction
(Kolb 2001) that is used to design an in vitro drug-binding experiment in
collaboration with Dr. Gregory Weiss of the UCI Dept. of Chemistry.
Additionally, novel, screening criteria are explored that may yield useful (and not
necessarily drug-related) information about chemical space, such as the
identification of information-containing polymer components. That is, a screen
which can search molecule space and find DNA, RNA and protein components,
as well as other potentially similar but as yet unknown constructs that may lend
insight into the very origins of life.
2. Materials / Methods


The database is built on the open source PostgreSQL platform
(http://www.postgresql.org/) using data sources from internet files and vendor
CDs such as the Sigma-Aldrich (http://www.sigmaaldrich.com) and Maybridge
(http://www.maybridge.com) catalogs.
Compound structural data is standardized and stored in SDF format,
converted
by
OpenEye
Software’s
OEChem
toolkit
(http://www.eyesopen.com/products/toolkits/oechem.html).
Additional
Jonathan Chen
2/12/2016



curation and normalization steps such as Corina 3D coordinate generation
(Sadowski 1996) are applied to the data as it is inserted.
Compound information is determined and stored at multiple levels, in addition
to supplemental annotations.
This information ranges from full three
dimensional structural data to SMILES (James 2004) that only specify the
basic chemical connection table (without spatial coordinates) to simple
“fingerprints” (James 2004) that abstractly summarize a compound’s
structural information.
Implicit or explicit functional group annotation is applied using the Daylight
SMARTS pattern method (James 2004) using OpenEye Software’s OEChem
implementation (http://www.eyesopen.com/products/toolkits/oechem.html).
Combinatorial reactions that specify what functional groups can react are
defined by the Daylight SMIRKS specification (James 2004). For example,
amino groups and carboxylic acids react to form an amide bond. (Figure 1
and Figure 2).
Figure 1: SMIRKS reaction specification (Carboxylic
[O:1]=[C:2][O:3][H:7].[H:8][N:4][H:5]>>[O:1]=[C:2][N:4][H:8]
acid
+
amine
>>
amide):
Figure 2: SMIRKS for amide reaction applied to phenylalanine + serine peptide bond formation
Jonathan Chen
2/12/2016


Conceptually, these theoretical products can be reacted again with the
original reactants, resulting in first polynomial then exponential growth of
possible products. A redundancy check of products and source molecules is
thus important, and is accomplished by evaluating their canonical SMILES
representations (James 2004).
Given the above, a combinatorial expansion against any existing library set
can be easily generated. This technique is applied on a single “click
chemistry” reaction (Figure 3) that is good for creating drug-like molecules.
The possible products of the reaction from the chemical database can then be
rapidly computed.
Figure 3: SMIRKS reaction specification for "click chemistry" reaction (alkyne + azide):
[C:2]#[C:3].[N:5]=[N:6]=[N:7][H:8]>>[C:2]1=[C:3][N:5][N:6]=[N:7]1


The in vitro experiment involves a “dynamic screening” approach that calls for
multiple synthesis products above to be dynamically synthesized (and
degraded) in a common pool. However, with all of the predicted products
synthesized in a common pool, isolating individual products from the pool will
be difficult. Instead, the predicted products are distributed into multiple pools
by their molecular weight such that products can be easily separated by mass
spectrometry methods (See Table 1 and Table 2).
A simple polymer screen (identifying candidates that can at least selfpolymerize) is used as a filtering and property isolation example. A simple
form is accomplished, given reaction specifications, by determining which
molecules satisfy the requirements for each reactant and whose product does
as well. That is, the molecule can react with itself to produce something that
can still react with itself.
3. Results
In design preparation for the “click chemistry” reaction product screening, the
reaction was first applied in silico against a commercially available subset of
chemicals from the database to yield the following results.
 144,701 unique and valid compounds read from Aldrich catalog subset.
 825 compounds fit the reactant group profiles.
Jonathan Chen
2/12/2016


o A (33 Alkyne + Carboxylic Acid).
o B (662 Alkyne + Not Carboxylic Acid).
o C (7 Azide + Carboxylic Acid).
o D (123 Azide + Not Carboxylic Acid).
5934 unique products generated by reacting group A with group D.
5334 unique products generated by reacting group B with group C.
The products predicted above are assigned into pools of 100 and distributed by
molecular weight (See Table 1 and Table 2) such that there is still at least ~1
dalton difference between the closest products for any pool.
Line
Number
0
1
2
3
4
5
6
7
8
9
SMILES
Molecular
Weight MW Diff
c1(c[nH]nn1)C(=O)O
113.07494
c1(cn(nn1)[Li])C(=O)O
119.00800
c1(c([nH]nn1)C)C(=O)O
127.10152
c1(c(n(nn1)[Li])C)C(=O)O
133.03458
c1(cn(nn1)[Na])C(=O)O
135.05677
C1(=CN([NH]=N1)[Na])C(=O)O
136.06471
C(C(=O)O)Cc1c[nH]nn1
141.12810
C(C(=O)O)Cc1cn(nn1)[Li]
147.06116
c1(c(n(nn1)[Na])C)C(=O)O
149.08335
C1(=C(N([NH]=N1)[Na])C)C(=O)O 150.09129
5.93306
8.09352
5.93306
2.02219
1.00794
5.06339
5.93306
2.02219
1.00794
Table 1: Sample of SMILES strings representing 10 compounds from the “click chemistry”
predicted product set including (and ordered by) their calculated molecular weight. The final
column is the difference in molecular weight between a compound and the one immediately
preceding it. Note that as the number of compounds grows, the smallest molecular weight
difference between adjacent compounds will shrink, in some cases even reaching zero for
compounds with identical chemical formulas. Obviously these cannot be resolved by mass
spectrometry.
Line
Number
0
3
6
9
1
4
7
2
5
8
SMILES
Molecular Pool
Weight Number MW Diff
c1(c[nH]nn1)C(=O)O
113.07494
c1(c(n(nn1)[Li])C)C(=O)O
133.03458
C(C(=O)O)Cc1c[nH]nn1
141.12810
C1(=C(N([NH]=N1)[Na])C)C(=O)O 150.09129
c1(cn(nn1)[Li])C(=O)O
119.00800
c1(cn(nn1)[Na])C(=O)O
135.05677
C(C(=O)O)Cc1cn(nn1)[Li]
147.06116
c1(c([nH]nn1)C)C(=O)O
127.10152
C1(=CN([NH]=N1)[Na])C(=O)O 136.06471
c1(c(n(nn1)[Na])C)C(=O)O
149.08335
0
0
0
0
1
1
1
2
2
2
19.95964
8.09352
8.96319
16.04877
12.00439
8.96319
13.01864
Table 2: The same compounds as in Table 1 but reordered and assigned a “pool” number
indicating that they will be synthesized in separate groups, in this case for pools up to a size of 4.
With this approach, the nearest molecular weight difference for two compounds in the same pool
is significantly increased, from ~1 to 8 daltons. This is accomplished by simply ordering the
compounds by molecular weight and then successively assigning compounds to each pool, up to
the total number of desired pools, and repeating from the first pool.
Jonathan Chen
2/12/2016
The simple polymer screen was run on ~1.5 million compounds in the database,
using 5 sample reaction specifications including amide, ester and phosphodiester
bond formation as filters. This yielded ~250,000 polymer candidates. As a
verification control, an amino acid (phenylalanine) and a nucleotide (dATP)
(Figure 4) were “planted” in the dataset and indeed passed the screen based on
amide and phosphodiester bond formation reactions, respectively.
Figure 4: dATP passes the simple polymer screen because it can react with itself to yield a
product with the same properties, forming the initial components of a DNA polymer
4. Discussion
In designing a click chemistry product screen, ~10,000 possible products were
predicted using the in silico synthesis method. This is still a large number to
manage in an in vitro laboratory setting, thus they were distributed into pools of
100 products each that are sufficiently dispersed to allow mass spectrometry to
separate out mixed product pools.
The simple polymer screen is indeed sensitive enough to identify known
components of DNA and peptides. However, as currently described, it has very
little specificity for identifying other meaningful polymers that may perhaps store
information and be structurally stable.
The project and database are still under design and development and welcome
input. Key among the future directions are finding additional useful filtering /
scoring / screening methods. Parallel uses of the system could then range from
applying in silico chemical synthesis to chemical design.
Some immediate and long-term goals include:
 Using the functional group and other chemical annotations for machine
learning (property finding and pattern prediction) applications, such as
chemical toxicity prediction.
 Refining filters to provide more realistic (spatial coordinate viability), practical
(source amounts) and meaningful (information storage, reactivity) products.
 Applying molecular docking applications to identify combinatorial products as
leads for drug design and other purposes.
Jonathan Chen
2/12/2016

Devising chemical design techniques using the database. For example,
some peptides may be known to readily act at some receptor site in a useful
manner, but peptides make poor drugs. A design technique that uses the
database to find or describe a drug-like molecule with comparable chemical
properties to the peptide could thus be extremely useful.
5. Bibliography
http://chembank.med.harvard.edu/ ChemBank: Initiative for Chemical Genetics.
http://www.eyesopen.com/products/toolkits/oechem.html OEChem Toolkit.
http://www.maybridge.com Maybridge.
http://www.postgresql.org/ PostgreSQL.
http://www.sigmaaldrich.com Sigma-Aldrich.
James, C. A. (2004). Daylight Theory Manual.
http://www.daylight.com/dayhtml/doc/theory/theory.toc.html
Kolb, H. C. (2001). "Click Chemistry: Diverse Chemical Function from a Few
Good Reactions." Angew. Chem. Int. Ed. 40: 2004-2021.
Sadowski, J. (1996). "Evaluation of 3D Structure Generators Revisited."
http://www2.chemie.uni-erlangen.de/software/corina/xrayeval.html
Voigt, J. H. (2001). "Comparison of the NCI Open Database with Seven
Large Chemical Structural Databases." J. Chem. Inf. Comput. Sci. 41: 702712.
Download