UCI Chemical Data Bank Technical Document
Table of Contents
Introduction
  Background
  Future Goals
Key Features / Functionality
  Fingerprint Similarity Searches
  Molecular Docking
  Functional Group Annotation / Screening
  Reaction Processing
  Chemical Property Prediction / Machine Learning
Technical Specifications
  Data Sources / Vendors
  Current DB Size
  Data Type Descriptions
  Database Schema
  Technology Dependencies
Current Projects
  Peptide Replacement for Protein Phosphatase Inhibition
  Dynamic in vitro Combinatorial Chemistry Screening Design
Introduction
(Copied from CombiCDB report)
Background
High-throughput screening of small molecules, discovery of important chemical
properties and in silico chemical synthesis and design can be greatly facilitated
with large databases of small molecule information. Examples of this exploration
to understand chemical space include screening drug candidate molecules by
molecular docking or applying machine-learning techniques to predict chemical
toxicity. Such databases enable this by allowing massive in silico chemical
processing that would be impractical or impossible in a traditional in vitro setting.
Many already exist, such as the NCI open database and ACL MDL; however,
most are privately owned, with potentially prohibitive usage costs. The databases
that are publicly available generally have on the order of 10^3 to 10^5 compounds
(Voigt 2001). The UCI Chemical Data Bank under development has on the order
of 10^7 compounds consolidated from multiple public and private sources.
Furthermore, relative to some comparable services available such as the
Harvard ChemBank (http://chembank.med.harvard.edu/), the UCI Chemical Data
Bank will include analysis tools such as reaction processing and fingerprint
similarity searches to aid in discovery.
Future Goals

Using the functional group and other chemical annotations for machine
learning (property finding and pattern prediction) applications, such as
chemical toxicity prediction.

Refining filters to provide more realistic (spatial coordinate viability), practical
(source amounts) and meaningful (information storage, reactivity) products.

Applying molecular docking applications to identify combinatorial products as
leads for drug design and other purposes.

Devising chemical design techniques using the database. For example,
some peptides may be known to readily act at some receptor site in a useful
manner, but peptides make poor drugs. A design technique that uses the
database to find or describe a drug-like molecule with comparable chemical
properties to the peptide could thus be extremely useful.
Key Features / Functionality
Fingerprint Similarity Searches
Josh should describe?
Molecular Docking
Josh should describe?
Functional Group Annotation / Screening
Functional groups are annotated, implicitly or explicitly, with the Daylight
SMARTS pattern method (James 2004) as implemented in OpenEye Software’s OEChem
toolkit (http://www.eyesopen.com/products/toolkits/oechem.html). Once
the functional groups have been identified, screening for molecules that possess
a particular combination of groups is straightforward, and is useful for identifying
molecules with likely reactivity profiles (see next section).
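As a minimal sketch of such a screen, assume the annotation step has already produced, for each molecule, a set of functional group labels. The molecule names, the group labels and the `screen` helper below are hypothetical and serve only as illustration; the SMARTS matching itself (done with OEChem in this project) is not shown.

```python
# Hypothetical annotations: molecule name -> set of functional group labels.
# In the real pipeline these would come from SMARTS matching; here they are
# hard-coded for illustration.
annotations = {
    "phenylalanine": {"amine", "carboxylic_acid", "aromatic_ring"},
    "serine": {"amine", "carboxylic_acid", "alcohol"},
    "ethanol": {"alcohol"},
    "benzoic_acid": {"carboxylic_acid", "aromatic_ring"},
}


def screen(annotations, required, forbidden=frozenset()):
    """Return sorted names of molecules that carry every required group
    and none of the forbidden groups."""
    required = set(required)
    forbidden = set(forbidden)
    return sorted(
        name
        for name, groups in annotations.items()
        if required <= groups and not (forbidden & groups)
    )


# Molecules that carry both an amine and a carboxylic acid.
amino_acids = screen(annotations, {"amine", "carboxylic_acid"})
# Carboxylic acids that do not also carry an amine.
acids_without_amine = screen(annotations, {"carboxylic_acid"}, forbidden={"amine"})

print(amino_acids)
print(acids_without_amine)
```

Storing the annotations as sets makes any combination of required and excluded groups a simple subset test.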
Reaction Processing
The library’s size, and thus its utility, can be expanded even further from known
available chemicals to theoretical compounds that are easily synthesized given
those readily available sources. This can be achieved by applying in silico
reactions to the current data set.
Once functional group annotations are made, combinatorial reactions that specify
which groups can react are defined using the Daylight SMIRKS specification (James
2004). For example, amino groups and carboxylic acids react to form an amide
bond (Figure 1 and Figure 2).
Figure 1: SMIRKS reaction specification (Carboxylic acid + amine >> amide):
[O:1]=[C:2][O:3][H:7].[H:8][N:4][H:5]>>[O:1]=[C:2][N:4][H:8]
Figure 2: SMIRKS for amide reaction applied to phenylalanine + serine peptide
bond formation
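The combinatorial step can be sketched without a cheminformatics toolkit if a reaction is reduced to the functional group each reactant slot requires. This is an assumption made for illustration only: the real system applies the SMIRKS transformation through OEChem to build product structures, whereas the sketch below only enumerates which reactant pairs qualify. The library contents and the `candidate_pairs` helper are hypothetical.

```python
from itertools import product

# Hypothetical functional group annotations (molecule name -> group labels).
library = {
    "acetic_acid": {"carboxylic_acid"},
    "methylamine": {"amine"},
    "glycine": {"amine", "carboxylic_acid"},
    "ethanol": {"alcohol"},
}

# A reaction is described here only by the group that each reactant slot
# requires: (group needed by reactant 1, group needed by reactant 2).
amide_formation = ("carboxylic_acid", "amine")


def candidate_pairs(library, reaction):
    """Enumerate every (reactant 1, reactant 2) pair that satisfies the
    functional group requirement of each reaction slot."""
    group_1, group_2 = reaction
    slot_1 = sorted(name for name, groups in library.items() if group_1 in groups)
    slot_2 = sorted(name for name, groups in library.items() if group_2 in groups)
    return list(product(slot_1, slot_2))


pairs = candidate_pairs(library, amide_formation)
for acid, amine in pairs:
    print(acid, "+", amine)
```

Each pair would then be handed to the reaction engine to generate the actual product structure.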
Besides expanding the data set by predicting reaction products, the reaction
processing capabilities have other uses. For example, they can be
used as part of a screen for potential polymer components. A simple polymer
screen (identifying candidates that can at least self-polymerize) is accomplished,
given reaction specifications, by determining which molecules satisfy the
requirements for each reactant and whose product does as well. That is, the
molecule can react with itself to produce something that can still react with itself.
Figure 3: dATP passes the simple polymer screen because it can react with itself
to yield a product with the same properties, forming the initial components of a
DNA polymer
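The simple polymer screen described above can be sketched by abstracting each molecule to a count of its functional groups. This is a deliberate simplification and an assumption of the sketch: the real screen works on full structures through the reaction specification. The group labels, the example molecules and both helper functions are hypothetical.

```python
from collections import Counter


def react(mol_a, mol_b, consumed_a, consumed_b, formed):
    """Combine two molecules (as functional group counts), consuming one
    group from each reactant and forming one new linking group."""
    result = mol_a + mol_b
    result[consumed_a] -= 1
    result[consumed_b] -= 1
    result[formed] += 1
    return +result  # unary plus drops groups whose count fell to zero


def passes_simple_polymer_screen(mol, consumed_a, consumed_b, formed):
    """True if the molecule can react with itself and the resulting dimer
    still satisfies the requirements of both reactant slots."""
    if mol[consumed_a] < 1 or mol[consumed_b] < 1:
        return False
    dimer = react(mol, mol, consumed_a, consumed_b, formed)
    return dimer[consumed_a] >= 1 and dimer[consumed_b] >= 1


# Amide formation: one carboxylic acid and one amine are consumed.
amide = ("carboxylic_acid", "amine", "amide")

glycine = Counter(amine=1, carboxylic_acid=1)
acetic_acid = Counter(carboxylic_acid=1)

print(passes_simple_polymer_screen(glycine, *amide))
print(passes_simple_polymer_screen(acetic_acid, *amide))
```

In this abstraction an amino acid passes because its dimer keeps one free amine and one free acid, while a plain carboxylic acid fails at the first check.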
Chemical Property Prediction / Machine Learning
Given data sets of chemicals with known abstract properties, such as toxicity,
those properties can be predicted for other chemicals in the database for which
they are unknown. These methods apply kernel similarity measures to
molecules, computed for example from the molecular graph or the SMILES string,
and feed the resulting similarities into a support vector machine (SVM).
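The document does not say which kernels are used, so the following is only one possible illustration: a substring ("spectrum") kernel that compares two SMILES strings by the k-character substrings they share, normalized so that a string has similarity 1 with itself. The choice of kernel, the value of k and the example SMILES are all assumptions.

```python
from collections import Counter
from math import sqrt


def kmer_counts(smiles, k=3):
    """Count every length-k substring of a SMILES string."""
    return Counter(smiles[i:i + k] for i in range(len(smiles) - k + 1))


def spectrum_kernel(a, b, k=3):
    """Dot product of the k-mer count vectors of two SMILES strings."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


def normalized_kernel(a, b, k=3):
    """Spectrum kernel scaled to the range [0, 1]."""
    denom = sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# Illustrative molecules: ethanol, propanol, benzene.
smiles = ["CCO", "CCCO", "c1ccccc1"]

# Gram matrix of pairwise similarities.
gram = [[normalized_kernel(a, b) for b in smiles] for a in smiles]
for row in gram:
    print(["%.3f" % value for value in row])
```

A Gram matrix of this kind is the form of input that SVM implementations accepting a precomputed kernel expect, together with the known property labels for the training molecules.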
Technical Specifications
Data Sources / Vendors
SDF, MOL2 and SMILES are the major file types stored in the database, so we
have collected data in these formats from around 50 different
vendors. A current list of vendors can be found in table [table number].
In the vendor table, we store the name, contact information and the date of the
most recent update of each dataset in the database. We periodically check for
updates online or contact the vendors to request new data CDs when they are
available.
Table: Current list of vendors
ACBBLOCKS
ACROS
ALDRICH
AMBINTER
ANALYTICON
ARRAYBIOPHARMA
ASINEX
ASYMCHEM
AURORA
CGX
CHEMBRIDGE
CHEMICALBLOCK
CHEMSTAR
CHESS
COMBIBLOCKS
DSSTOX
EMC
ENAMINE
FRONTIER
HARVARD
ICBSCREEN
IFLAB
INTERCHIM
KATRITSKY
KEYORGANICS
LABOTEST
MATRIX
MAYBRIDGE
MCL
MDPI
MDSI
MENAI
NANOSYN
NCI
PEAKDALE
PHARMEKS
RYAN
SPECS
SYNCHEM
TIMTEC
TOCRIS
TOSLAB
TRC
TRIPOS
TYGERSCIENTIFIC
VITASMLAB
WORLD
ZELINSKY
Current DB Size
Yimeng, Jocelyne?
Data Type Descriptions
Compound structural data is standardized and stored in SDF format, converted
by OpenEye Software’s OEChem toolkit
(http://www.eyesopen.com/products/toolkits/oechem.html). Additional curation
and normalization steps such as Corina 3D coordinate generation (Sadowski
1996) are applied to the data as it is inserted.
Compound information is determined and stored at multiple levels, in addition to
supplemental annotations. This information ranges from full three-dimensional
structural data in SDF files, to SMILES strings (James 2004) that specify only the
basic chemical connection table (without spatial coordinates), to simple
“fingerprints” (James 2004) that abstractly summarize a compound’s structural
information.
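To illustrate how fingerprints support similarity searching, the sketch below treats a fingerprint as a bit string held in a Python integer and ranks database entries against a query with the Tanimoto coefficient (shared set bits divided by total set bits). The similarity measure is an assumption, chosen because it is widely used for bit-string fingerprints; the document does not state which measure the database uses. The 8-bit fingerprints and molecule names are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two bit-string fingerprints stored as ints."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0
    return bin(fp_a & fp_b).count("1") / union


def similarity_search(query, database, threshold=0.5):
    """Return (score, name) pairs at or above the threshold, best first."""
    scored = ((tanimoto(query, fp), name) for name, fp in database.items())
    return sorted((hit for hit in scored if hit[0] >= threshold), reverse=True)


# Made-up 8-bit fingerprints; real fingerprints are much longer.
database = {
    "mol_a": 0b10110010,
    "mol_b": 0b10110000,
    "mol_c": 0b01001101,
}
query = 0b10110010

hits = similarity_search(query, database)
for score, name in hits:
    print(name, score)
```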
Database Schema

Source: Table of vendors and other sources of chemical information

ChemicalMix: Individual information items available from sources. May not be
individual chemicals, but a mix of chemicals available as a unit.

Source2ChemicalMix: Resolution table between sources and chemical
mixes, because multiple sources may (and likely will) list the same chemical
mixes. This model records all of that information, without storing redundant
data.

Annotation: High-level chemical annotations from sources, likely extracted
from “SD Tags” of the SDF molecule format the source data came from.

Chemical: Individual chemical components of chemical mixes that cannot be
further resolved into components. Specific, derived annotations are precomputed and stored to facilitate later computations. For example, summary
information about the atom and bond content (num_c, num_sg_bonds, etc.),
solvation energy (ZAP_solvation, etc.) and abstract “fingerprint”
representations.

MixtureComponent: Similar to Source2ChemicalMix, resolution table
between ChemicalMix and Chemical, to avoid storage of redundant Chemical
information.

Isomer3d: 3D structural information / atom coordinates of chemicals, as
generated by an external program (Corina, see Technology Dependencies).
Multiple 3D conformations may exist here for any single chemical.
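The role of the two resolution tables can be sketched with Python's built-in sqlite3 module. This is an illustration only: the production database runs on PostgreSQL, and every column name below is hypothetical, since the document gives table names but not their columns. The example stores one chemical mix (a salt with two components) listed by two sources, then asks which sources can supply a given component.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (source_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE chemical_mix (mix_id INTEGER PRIMARY KEY, smiles TEXT UNIQUE);
-- Resolution table: many sources may list the same chemical mix.
CREATE TABLE source2chemical_mix (
    source_id INTEGER REFERENCES source(source_id),
    mix_id INTEGER REFERENCES chemical_mix(mix_id),
    PRIMARY KEY (source_id, mix_id)
);
CREATE TABLE chemical (chemical_id INTEGER PRIMARY KEY, smiles TEXT UNIQUE);
-- Resolution table: a mix has one or more individual chemical components.
CREATE TABLE mixture_component (
    mix_id INTEGER REFERENCES chemical_mix(mix_id),
    chemical_id INTEGER REFERENCES chemical(chemical_id),
    PRIMARY KEY (mix_id, chemical_id)
);
""")

conn.executemany("INSERT INTO source VALUES (?, ?)", [(1, "ALDRICH"), (2, "ACROS")])
# The mix is stored once even though two sources list it.
conn.execute("INSERT INTO chemical_mix VALUES (1, 'CC(=O)[O-].[Na+]')")
conn.executemany("INSERT INTO source2chemical_mix VALUES (?, ?)", [(1, 1), (2, 1)])
conn.executemany(
    "INSERT INTO chemical VALUES (?, ?)", [(1, "CC(=O)[O-]"), (2, "[Na+]")]
)
conn.executemany("INSERT INTO mixture_component VALUES (?, ?)", [(1, 1), (1, 2)])

# Which sources supply a mix containing the given chemical?
rows = conn.execute(
    """
    SELECT s.name
    FROM source s
    JOIN source2chemical_mix sm ON sm.source_id = s.source_id
    JOIN mixture_component mc ON mc.mix_id = sm.mix_id
    JOIN chemical c ON c.chemical_id = mc.chemical_id
    WHERE c.smiles = ?
    ORDER BY s.name
    """,
    ("[Na+]",),
).fetchall()
vendors = [row[0] for row in rows]
print(vendors)
```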
Technology Dependencies

PostgreSQL (http://www.postgresql.org/): The database is built on the
PostgreSQL platform. This platform was chosen because it is open source and,
among such options, is well developed, well supported and
expected to provide the best performance for large data sets.

Apache Web Server (http://httpd.apache.org/): Web available interfaces and
tools are delivered via this well-established and popular open-source web
server.

Python (http://www.python.org): Many of the basic application tools, scripts
and the web interface are written in the Python programming language. This
interpreted language is advantageous for rapid prototyping, though specific
intensive modules may be better ported to a stricter language like C or Java.
In the meantime, one of its primary advantages is that it allows straightforward
use of critical toolkits such as the OpenEye cheminformatics toolkits (see
below).

OpenEye toolkits (http://www.eyesopen.com): The OEChem toolkit
implements many of the basic algorithms needed for chemical data
processing, SMARTS pattern matching and SMIRKS reaction processing.
Other components, such as the Ogham / OEDepict tools for chemical image
rendering, are used as well.

Corina (http://www2.chemie.uni-erlangen.de/software/corina/index.html):
Most of the chemical information supplied from vendors only includes the
basic atom-bond connection table as far as structural data is concerned.
Corina is a program used to generate 3D coordinate predictions for any given
molecule. Only after this is done can the chemicals be applied to such
problems as molecular docking (see below).

Dock: Given the 3D structural data for some kind of receptor molecule, the
modified Dock program is used to screen the database for chemicals that can
“dock” well with the receptor, that is, chemicals whose shape and properties
are complementary to the binding site. This is especially useful for identifying
potential drug molecule leads.
Current Projects
Peptide Replacement for Protein Phosphatase Inhibition
Generating combinatorial libraries of peptides is relatively easy and can be used
to produce potential ligands for biological targets. However, peptides in general
make poor drugs, and are not even good in vivo inhibitors, because of their poor
bioavailability, membrane impermeability and rapid degradation. Thus, given a
peptide known to achieve a desirable effect on some receptor molecule, the
ability to predict or construct a functionally comparable, small, organic molecule
naturally impervious to the practical limitations of peptides would be of great
value.
The family of protein phosphatases (PP) that regulate the phosphorylation state
of protein hydroxyl residues (Ser, Thr or Tyr) can be thought of as such receptor
molecules. Such phosphatases are ubiquitous enzymes that play a critical role in
regulating important metabolic pathways such as glycogen synthesis, cell
division, gene expression, neurotransmission, muscle contraction and other
signal transduction pathways. Specifically under investigation are PP1 and
PP2A, for which specific inhibitors would greatly assist the study of their
regulated pathways.
Many endogenous peptides such as Inhibitor-1 are known to inhibit PP1 and
PP2A specifically, without affecting other Ser-Thr phosphatases. As mentioned
previously however, these are poor choices as in vivo inhibitors or drugs due to
their inherent peptide nature. Based on dozens of known PP1 target protein
sequences, PP1 appears to be specific for proteins bearing the consensus
sequence of R/K-R/K-V/I-X-F/W. Finding or constructing an organic molecule
replacement for this short peptide sequence is thus the broad goal.
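The consensus sequence translates directly into a pattern over one-letter amino acid codes: R or K, R or K, V or I, any residue, then F or W. The sketch below scans sequences for it with a regular expression. The example sequences are invented for illustration and are not known PP1 targets.

```python
import re

# R/K-R/K-V/I-X-F/W, with X meaning any residue. The lookahead lets
# overlapping occurrences be reported as well.
PP1_CONSENSUS = re.compile(r"(?=([RK][RK][VI][A-Z][FW]))")


def find_consensus(sequence):
    """Return (0-based position, matched residues) for each occurrence."""
    return [(m.start(), m.group(1)) for m in PP1_CONSENSUS.finditer(sequence)]


# Invented example sequences.
examples = ["GAKRVAFTS", "MRRIQWLKKVEF", "AAAAGGG"]
for seq in examples:
    print(seq, find_consensus(seq))
```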
Encouraging for this approach is the existence of many non-peptide organic
molecules that have naturally evolved to inhibit these phosphatases, for example
as plant or insect toxins. These include cyclic peptides like microcystin,
terpenoids like cantharidin and polyketides like okadaic acid.
Structural data for the PP1c protein bound to target peptides bearing the
consensus sequence are available (refer to PDB code 1FJM?). This will offer
direction and a means for validating any predictions made. Strategies to replace
the target peptide include molecular docking of candidate molecules into the
“inverted surface” of the peptide as a mock receptor-binding site. However, this
could be a time-intensive process, so broader screening approaches could apply
machine-learning methods that identify molecules similar to the peptide by
volume displacement, atom distribution in space or other coarse features that
approximate the “shape” of the peptide.
Dynamic in vitro Combinatorial Chemistry Screening Design
The reaction processing tools can assist in the design of in vitro combinatorial
chemistry experiments. The in vitro experiment involves a “dynamic screening”
approach that calls for multiple synthesis reactants to be added to a common
pool containing some target receptor or other assayable attribute. These
reactants will then be allowed to undergo a specific “click chemistry” combination
reaction, conducive to producing drug-like molecules. The dynamic nature of the
screen relates to the different reactant combinations in a common pool
generating products that may bind the receptor or revert back to source reactants
to recombine into other product combinations.
The reaction processing tools allow for rapid computation of the possible
products that can be generated from reactants in the chemical database, based
on the “click chemistry” reaction specification. This comes into play in the design
of the experiment when one considers that, given all of the potential products
synthesized in a common pool, isolating individual products from a pool will be
difficult. Instead, the products are predicted in silico first and then are distributed
into multiple pools by their calculated molecular weight such that products can be
easily separated by mass spectrometry methods (see Table 1 and Table 2).
Line Number  SMILES                         Molecular Weight  MW Diff
0            c1(c[nH]nn1)C(=O)O             113.07494
1            c1(cn(nn1)[Li])C(=O)O          119.00800         5.93306
2            c1(c([nH]nn1)C)C(=O)O          127.10152         8.09352
3            c1(c(n(nn1)[Li])C)C(=O)O       133.03458         5.93306
4            c1(cn(nn1)[Na])C(=O)O          135.05677         2.02219
5            C1(=CN([NH]=N1)[Na])C(=O)O     136.06471         1.00794
6            C(C(=O)O)Cc1c[nH]nn1           141.12810         5.06339
7            C(C(=O)O)Cc1cn(nn1)[Li]        147.06116         5.93306
8            c1(c(n(nn1)[Na])C)C(=O)O       149.08335         2.02219
9            C1(=C(N([NH]=N1)[Na])C)C(=O)O  150.09129         1.00794
Table 1: Sample of SMILES strings representing 10 compounds from the “click chemistry”
predicted product set including (and ordered by) their calculated molecular weight. The final
column is the difference in molecular weight between a compound and the one immediately
preceding it. Note that as the number of compounds grows, the smallest molecular weight
difference between adjacent compounds will shrink, in some cases even reaching zero for
compounds with identical chemical formulas. Obviously these cannot be resolved by mass
spectrometry.
Line Number  SMILES                         Molecular Weight  Pool Number  MW Diff
0            c1(c[nH]nn1)C(=O)O             113.07494         0
3            c1(c(n(nn1)[Li])C)C(=O)O       133.03458         0            19.95964
6            C(C(=O)O)Cc1c[nH]nn1           141.12810         0            8.09352
9            C1(=C(N([NH]=N1)[Na])C)C(=O)O  150.09129         0            8.96319
1            c1(cn(nn1)[Li])C(=O)O          119.00800         1
4            c1(cn(nn1)[Na])C(=O)O          135.05677         1            16.04877
7            C(C(=O)O)Cc1cn(nn1)[Li]        147.06116         1            12.00439
2            c1(c([nH]nn1)C)C(=O)O          127.10152         2
5            C1(=CN([NH]=N1)[Na])C(=O)O     136.06471         2            8.96319
8            c1(c(n(nn1)[Na])C)C(=O)O       149.08335         2            13.01864
Table 2: The same compounds as in Table 1 but reordered and assigned a “pool” number
indicating that they will be synthesized in separate groups, in this case for pools up to a size of 4.
With this approach, the nearest molecular weight difference for two compounds in the same pool
is significantly increased, from ~1 to 8 daltons. This is accomplished by simply ordering the
compounds by molecular weight and then successively assigning compounds to each pool, up to
the total number of desired pools, and repeating from the first pool.
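The pooling procedure described in the Table 2 caption can be sketched in a few lines: sort by molecular weight, then deal compounds into pools in turn. The data are the ten compounds from Table 1; the function names are illustrative and not taken from the project's code.

```python
# (SMILES, molecular weight) pairs from Table 1.
products = [
    ("c1(c[nH]nn1)C(=O)O", 113.07494),
    ("c1(cn(nn1)[Li])C(=O)O", 119.00800),
    ("c1(c([nH]nn1)C)C(=O)O", 127.10152),
    ("c1(c(n(nn1)[Li])C)C(=O)O", 133.03458),
    ("c1(cn(nn1)[Na])C(=O)O", 135.05677),
    ("C1(=CN([NH]=N1)[Na])C(=O)O", 136.06471),
    ("C(C(=O)O)Cc1c[nH]nn1", 141.12810),
    ("C(C(=O)O)Cc1cn(nn1)[Li]", 147.06116),
    ("c1(c(n(nn1)[Na])C)C(=O)O", 149.08335),
    ("C1(=C(N([NH]=N1)[Na])C)C(=O)O", 150.09129),
]


def assign_pools(compounds, n_pools):
    """Order compounds by molecular weight, then assign them to pools in
    turn, returning to the first pool after the last."""
    ordered = sorted(compounds, key=lambda compound: compound[1])
    pools = [[] for _ in range(n_pools)]
    for index, compound in enumerate(ordered):
        pools[index % n_pools].append(compound)
    return pools


def smallest_gap(pool):
    """Smallest molecular weight difference between adjacent compounds
    in a pool that is already ordered by weight."""
    weights = [mw for _, mw in pool]
    return min((b - a for a, b in zip(weights, weights[1:])), default=float("inf"))


pools = assign_pools(products, n_pools=3)
single_pool_gap = smallest_gap(sorted(products, key=lambda c: c[1]))
pooled_gap = min(smallest_gap(pool) for pool in pools)
print("single pool: %.5f" % single_pool_gap)
print("three pools: %.5f" % pooled_gap)
```

With three pools, the smallest within-pool difference for this sample rises from about 1 dalton to about 8 daltons, matching Table 2.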
In work already done to design a click chemistry product screen, ~10,000
possible products were predicted using the in silico synthesis method. This is
still a large number to manage in an in vitro laboratory setting, thus they were
distributed into pools of 100 products each that are sufficiently dispersed to allow
mass spectrometry to separate out mixed product pools.