Presentation

Validation and Standardization of
Molecular Structures in General
and Sugars in Particular: a Case
Study
Colin Batchelor,
Ken Karapetyan, Valery Tkachenko, Antony Williams
6th Joint Sheffield Conference
on Chemoinformatics
2013-07-24
Overview
Open PHACTS and chemical validation and
standardization
RDF for chemoinformatics calculations
General case study: ChEMBL and DrugBank
Sugar case study: Perspective perception
Who is involved?
28 Consortium Members
>45 Associated Partners
3-year European project funded by:
• European Pharmaceutical Industry
• Innovative Medicines Initiative
Applications using the Open PHACTS API
Explorer
Open PHACTS API
dev.openphacts.org
www.openphacts.org
Twitter: @open_phacts
How do we fit in?
We integrate and standardize the chemical
compound collection underpinning Open
PHACTS and provide regular updates and ongoing data curation.
The validation and standardization rules have
been derived from the FDA structure guidelines
and modified for consistency and in response to
input from members of EFPIA.
Open PHACTS provides an integrated platform of publicly
available pharmacological and physicochemical data
Data accessible via:
• Free application programming interface (API)
dev.openphacts.org
• Third-party applications built to use the API
Open PHACTS app ecosystem
How does Open PHACTS work?
Currently integrated databases
Database                 Millions of triples
ACD Labs / ChemSpider    161.3
ChEBI                      0.9
ChEMBL                   146.1
ConceptWiki                3.7
DrugBank                   0.5
Enzyme                     0.1
Gene Ontology              0.9
SwissProt                156.6
WikiPathways               0.1
TOTAL                    470.2
CVSP and the OPS CRS
Standardization workflows (CVSP, FDA,
OPS, custom) using modules such as:
• SMIRKS transformations
• layout (GGA)
• canonical tautomers (ChemAxon)
• sugar interpretation (RSC)
RDF and Open PHACTS
The underlying language of Open PHACTS is RDF.
There are few constraints as such, only guidelines
for which classes of identifier to use and accounts of
best practice.
This RDF goes into the data cache and we access
the results through user interfaces built on RESTful
JSON web services.
What does RDF look like?
In the Turtle format below, each line is a triple, in
which a binary predicate links a subject and an
object.
:CSID1execution obo:OBO_0000299 :CSID1prop11 .
:CSID1prop11 obo:IAO_0000136 ops:OPS1 .
:CSID1prop11 rdf:type cheminf:CHEMINF_000349 .
:CSID1prop11 qudt:numericValue "1.049E-17"^^xsd:double .
:CSID1prop11 qudt:unit obo:UO_0000324 .
There is also RDF/XML, which is less human-readable.
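As a rough illustration of the subject, predicate, object structure, the sketch below splits the five lines shown above into tuples using only the Python standard library. It is not a real Turtle parser: it assumes one triple per line, no prefix declarations, no ';' or ',' shorthand and no literals containing spaces. The function names are hypothetical.

```python
# Minimal sketch: one-triple-per-line Turtle split into (subject, predicate, object).
TURTLE = """\
:CSID1execution obo:OBO_0000299 :CSID1prop11 .
:CSID1prop11 obo:IAO_0000136 ops:OPS1 .
:CSID1prop11 rdf:type cheminf:CHEMINF_000349 .
:CSID1prop11 qudt:numericValue "1.049E-17"^^xsd:double .
:CSID1prop11 qudt:unit obo:UO_0000324 .
"""


def parse_simple_triples(text):
    """Split each non-empty line into a (subject, predicate, object) tuple."""
    triples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not line.endswith(" ."):
            raise ValueError(f"not a simple triple: {line!r}")
        # Drop the trailing " ." and split on the first two runs of whitespace.
        subject, predicate, obj = line[:-2].split(None, 2)
        triples.append((subject, predicate, obj))
    return triples


def numeric_value(obj):
    """Return the float behind a typed literal such as "1.0"^^xsd:double."""
    lexical, _, _datatype = obj.partition("^^")
    return float(lexical.strip('"'))


triples = parse_simple_triples(TURTLE)
values = [numeric_value(o) for _, p, o in triples if p == "qudt:numericValue"]
```

For real data a full RDF library would be the better choice; the point here is only that each line carries exactly three terms.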
Royal Society of Chemistry
data in Open PHACTS
1. Molecule synonyms and identifiers
2. Linksets between ChEBI, ChEMBL,
DrugBank and OPS identifiers
3. Molecule–molecule relations (“parent–
child”) of interest for drug discovery
4. Calculated physicochemical properties
for compounds (both molecular and
macroscopic)
Calculated physicochemical
properties (ACD 12.0)
• log P
• log D (at pH 5.5, at pH 7.4)
• bioconcentration factor
• KOC (at pH 5.5, at pH 7.4)
• index of refraction
• polar surface area
• molar refractivity
• molar volume
• polarizability
• surface tension
• density at STP
• boiling point at 1 atm
• flash point at 1 atm
• enthalpy of vaporization at STP
• vapour pressure at STP
RDF for calculated properties:
vocabularies
Two dozen calculated properties for each of
>10^6 molecules.
CHEMINF ontology for kinds of calculation and
chemical data
QUDT for results
OPS IDs for molecules
OBI and IAO to connect calculations to results
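RDF for calculated properties: schema
(Diagram. The nodes and edges recoverable from its labels are listed here, using calculated log P for benzene as the example.)
• Process node: rdf:type CHEMINF "execution of ACD/Labs PhysChem software library version 12.01"
• Process, OBI "has specified input": benzene's connection table (rdf:type CHEMINF "connection table")
• Process, OBI "has specified output": calculation result (rdf:type CHEMINF "calculated log P")
• Calculation result, IAO "is about": benzene (OPS)
• Calculation result, QUDT "has value": "2.17"^^xsd:float
• Calculation result, QUDT "has unit": QUDT "dimensionless quantity"
• Calculation result, QUDT "has standard uncertainty": "0.234"^^xsd:float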
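To make the shape of this schema concrete, here is a small sketch that emits the triples for one calculated property as Turtle-like lines. Only qudt:numericValue, qudt:unit and rdf:type appear in the slides; every other predicate and class name below (obi:has_specified_output, iao:is_about, qudt:standardUncertainty, cheminf:calculated_log_P, qudt:dimensionless_quantity) is a readable placeholder standing in for the real ontology identifier, and the node names are made up.

```python
def calculated_property_triples(run, result, molecule, value, uncertainty):
    """Build the triples linking a calculation run, its result and a molecule.

    Predicate and class names are illustrative placeholders, not real IRIs.
    """
    return [
        (run, "obi:has_specified_output", result),
        (result, "iao:is_about", molecule),
        (result, "rdf:type", "cheminf:calculated_log_P"),
        (result, "qudt:numericValue", f'"{value}"^^xsd:float'),
        (result, "qudt:standardUncertainty", f'"{uncertainty}"^^xsd:float'),
        (result, "qudt:unit", "qudt:dimensionless_quantity"),
    ]


def to_turtle(triples):
    """Serialize tuples as one 'subject predicate object .' line each."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical node names for the benzene log P example.
benzene_triples = calculated_property_triples(
    ":logPexecution", ":logPresult", "ops:benzene", 2.17, 0.234
)
turtle_text = to_turtle(benzene_triples)
```

Keeping the value, unit and uncertainty on the result node rather than on the molecule is what lets several calculations about the same molecule coexist.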
ChEMBL and DrugBank
analysed
ChEMBL 16 (http://www.ebi.ac.uk/chembl/) contains
1 295 510 distinct molecules; CVSP found
something to say about 456 250 of them (35%).
DrugBank 3.0 (http://www.drugbank.ca/) contains 6510
distinct molecules; CVSP found something to
say about 662 of them (10%).
(We haven't run CVSP over all of ChemSpider yet; we will.)
Potentially serious things
ChEMBL            DrugBank        Issue
14218   1.09%     202   3.10%     Not an overall neutral system
  485   0.04%      21   0.32%     Forbidden-valence atoms
   44   —           0   —         Has adjacent atoms with like charges
    4   —           0   —         Has more than one radical centre
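Two of the checks in this table are simple enough to sketch. The code below is not CVSP's implementation; it assumes a hypothetical minimal molecule representation (a list of formal charges indexed by atom, plus a list of bonded index pairs) and shows what "not an overall neutral system" and "adjacent atoms with like charges" might test.

```python
def is_neutral(charges):
    """True if the formal charges of the whole record sum to zero."""
    return sum(charges) == 0


def adjacent_like_charges(charges, bonds):
    """Return the bonds whose two atoms carry nonzero charges of the same sign."""
    flagged = []
    for a, b in bonds:
        # A positive product means both charges are nonzero with equal sign.
        if charges[a] * charges[b] > 0:
            flagged.append((a, b))
    return flagged


# Charge-separated nitro group on a carbon: C, N+, O-, O
nitro_charges = [0, 1, -1, 0]
nitro_bonds = [(0, 1), (1, 2), (1, 3)]

# Made-up problem record: two bonded cations and no counterion.
bad_charges = [0, 1, 1]
bad_bonds = [(0, 1), (1, 2)]
```

Here `is_neutral(nitro_charges)` holds and no nitro bond is flagged, while the made-up record fails both checks.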
Aesthetics
ChEMBL            DrugBank        Issue
57275   4.42%     70   1.08%      Uneven-length bonds
25736   1.99%     78   1.20%      Congested layout
23622   1.82%     24   0.37%      Containing not-quite-linear cyano groups
  167   0.01%      1   —          Zero-dimensional structures
   70   0.01%      0   —          Containing not-quite-linear isocyano groups
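Layout checks of this kind come down to elementary geometry on the 2D coordinates. The sketch below shows one way the cyano and bond-length tests could be expressed; the tolerances (1 degree, 5% length spread) are assumptions chosen for illustration, since the slides do not state the thresholds CVSP uses.

```python
import math


def angle_degrees(a, b, c):
    """Angle at point b, in degrees, between the directions b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to guard against rounding just outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def is_not_quite_linear(r_atom, carbon, nitrogen, tolerance_degrees=1.0):
    """Flag an R-C#N group drawn with an angle noticeably short of 180 degrees."""
    return 180.0 - angle_degrees(r_atom, carbon, nitrogen) > tolerance_degrees


def has_uneven_bonds(coords, bonds, relative_spread=0.05):
    """Flag a drawing whose longest bond exceeds the shortest by more than the spread."""
    lengths = [math.dist(coords[a], coords[b]) for a, b in bonds]
    return max(lengths) > (1.0 + relative_spread) * min(lengths)


straight = is_not_quite_linear((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))
bent = is_not_quite_linear((0.0, 0.0), (1.0, 0.0), (2.0, 0.2))
```

The collinear group passes; the one whose nitrogen is drawn 0.2 units off the axis is flagged.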
Artwork molecules
ChEMBL   DrugBank   Issue
0        0          Cyclobutane
8        0          Ethane molecules in the structure
6        0          Sulfur atoms with no explicit bonds
4        0          Boron atoms with no explicit bonds
1        0          Ethyne molecule (in the ChEMBL case it actually is acetylene)
3        0          Stray methane molecules
FDA tautomer and metal rules
ChEMBL            DrugBank        Issue
17508   1.35%     80   1.29%      In enol form (or chalcogenoenol form)
 9526   0.74%      4   0.07%      N=C–OH tautomer of a carbonyl compound
    2   —          —              Nitroso-form oximes
 1104   0.09%      6   0.09%      Metal–nitrogen bond
  845   0.06%     10   0.15%      Non-metal–transition-metal bond
  432   0.03%     10   0.15%      Metal–oxygen bond
    3   —          2   —          Aluminium–non-metal bond
    2   —          0   —          Metal–fluorine bond
1
Stereochemistry
ChEMBL             DrugBank       Issue
185742   14.3%     39   0.60%     G2-4: Has a single unknown stereocentre and no defined stereocentres: probably a racemate
 68572    5.3%     13   0.20%     G2-42: Has more than one unknown stereocentre and no defined stereocentres: probably problematic. Could indicate relative stereochemistry?
 36572    2.8%     27   0.44%     G2-44: At least one defined stereocentre, and one stereocentre is undefined or unknown: probably an epimer or mixture of anomers
 26076    2.0%     11   0.17%     G2-46: Has more than one unknown stereocentre and more than one defined stereocentre: probably problematic again
 23113    1.8%     13   0.20%     Unknown double bond arrangement
   883    0.1%      1   —         At least one ring containing stereobonds
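The four G2 rules partition records by how many stereocentres are defined and how many are not. A minimal sketch of that decision follows, taking the rule labels exactly as they appear in the table. It makes two assumptions the slides do not confirm: "undefined" and "unknown" centres are counted together, and G2-46 takes precedence over G2-44 where both descriptions would apply.

```python
def classify_stereo(defined, unknown):
    """Map stereocentre counts to a rule label, or None if nothing is flagged.

    defined: number of stereocentres with specified configuration
    unknown: number of stereocentres that are undefined or marked unknown
    """
    if unknown == 0:
        return None  # fully specified, or no stereocentres at all
    if defined == 0:
        return "G2-4" if unknown == 1 else "G2-42"
    if defined > 1 and unknown > 1:
        return "G2-46"
    return "G2-44"


examples = {
    "one unknown centre": classify_stereo(0, 1),
    "three unknown centres": classify_stereo(0, 3),
    "two defined, one unknown": classify_stereo(2, 1),
    "three defined, two unknown": classify_stereo(3, 2),
}
```

With these counts the four examples fall under G2-4, G2-42, G2-44 and G2-46 respectively.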
Sugar depiction challenges
Stereochemistry not stored in V2000
format (though present in .cdx).
Consequences
Sugar questions
ChEMBL (19275)    DrugBank (153)   Issue
5359   27.8%      138   90.2%      At least one L-pyranose ring (often antibiotics contain these)
4748   24.6%        0   —          At least one perspective chair
 416   2.16%        0   —          At least one Haworth ring
  52   0.03%        0   —          At least one perspective boat or twist boat
Sugar ring redepiction
algorithm
1. Identify perspective conformation (boat,
chair, Haworth)
2. Determine perspective stereo
3. Assign wedge or hash to bonds
accordingly
4. Reconstruct sugar ring so as to minimize
disruption to the rest of molecule
5. Tidy
Take the x-axis as parallel to
the line through the top two
chair atoms or through the
bottom two chair atoms.
Δy positive: wedge
Δy negative: hash
Then remap chair to
homotropous hexagon.
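The chair rule above can be sketched as a signed perpendicular offset from the chosen axis. The code assumes 2D coordinates with y increasing up the page, and it orients the axis left to right so that the sign of Δy does not depend on which of the two chair atoms is listed first; both choices are assumptions, as is the function name.

```python
import math


def wedge_or_hash(axis_start, axis_end, ring_atom, substituent):
    """Return 'wedge', 'hash' or None from the substituent's offset across the axis.

    axis_start, axis_end: the top two (or bottom two) chair atoms
    ring_atom, substituent: the bond whose stereo mark is being assigned
    """
    ux = axis_end[0] - axis_start[0]
    uy = axis_end[1] - axis_start[1]
    length = math.hypot(ux, uy)
    if length == 0:
        raise ValueError("axis atoms coincide")
    ux, uy = ux / length, uy / length
    # Orient the axis left to right so the result is independent of atom order.
    if ux < 0 or (ux == 0 and uy < 0):
        ux, uy = -ux, -uy
    dx = substituent[0] - ring_atom[0]
    dy = substituent[1] - ring_atom[1]
    # Component of the bond vector perpendicular to the axis (the rotated y).
    delta_y = -uy * dx + ux * dy
    if delta_y > 0:
        return "wedge"
    if delta_y < 0:
        return "hash"
    return None


up = wedge_or_hash((0.0, 0.0), (1.0, 0.0), (0.5, -0.5), (0.5, 0.5))
down = wedge_or_hash((0.0, 0.0), (1.0, 0.0), (0.5, -0.5), (0.5, -1.5))
```

A substituent lying exactly along the axis direction gets no mark here; how such cases are handled in practice is not stated in the slides.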
In the boat case, the
substituent further up the
page is the wedge, while
the one further down the
page is the hash,
regardless of whether
bridgehead or not.
Depiction
1. Identify mean bond
length and chair centroid.
2. Snap ring atoms to a
regular-hexagonal grid.
3. Remove superfluous
hydrogen atoms.
4. Only mark stereo on a
single substituent if they
are paired (cf. Grice).
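Steps 1 and 2 can be illustrated with a short sketch that replaces a drawn chair by a regular hexagon with the same centroid and mean bond length. This is a simplification of snapping to a hexagonal grid: it keeps the direction from the centroid to the first ring atom and the ring's winding direction, both of which are assumptions about what should be preserved. Hydrogen removal and stereo marking (steps 3 and 4) are not covered.

```python
import math


def snap_ring_to_hexagon(ring):
    """Return six regular-hexagon positions for six ring atoms given in ring order."""
    n = len(ring)
    if n != 6:
        raise ValueError("expected a six-membered ring")
    # Step 1: chair centroid and mean bond length around the ring.
    cx = sum(x for x, _ in ring) / n
    cy = sum(y for _, y in ring) / n
    mean_bond = sum(math.dist(ring[i], ring[(i + 1) % n]) for i in range(n)) / n
    # Winding direction from the sign of the shoelace sum (twice the signed area).
    twice_area = sum(
        ring[i][0] * ring[(i + 1) % n][1] - ring[(i + 1) % n][0] * ring[i][1]
        for i in range(n)
    )
    direction = 1.0 if twice_area >= 0 else -1.0
    # Step 2: a regular hexagon's side equals its circumradius, so use the
    # mean bond length as the radius and step 60 degrees per atom.
    theta0 = math.atan2(ring[0][1] - cy, ring[0][0] - cx)
    step = direction * math.pi / 3
    return [
        (cx + mean_bond * math.cos(theta0 + i * step),
         cy + mean_bond * math.sin(theta0 + i * step))
        for i in range(n)
    ]


# Made-up chair-like coordinates, listed in ring order.
chair = [(0.0, 0.0), (1.0, -0.3), (2.0, 0.0), (2.5, 0.8), (1.5, 1.1), (0.5, 0.8)]
snapped = snap_ring_to_hexagon(chair)
```

After snapping, every ring bond has the same length and the centroid is unchanged, so the ring stays where the rest of the molecule expects it.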
Tidying: desiderata
Different problem from structure layout in
general.
The structure we end up with is, in many
important respects, fine.
Preserve drawing conventions—aglycones
being on the top right hand side.
Next steps
Stable user-facing URI for CVSP
(currently http://cvsp.beta.rsc-us.org/,
but subject to change)
Apply CVSP to all of ChemSpider.
Investigate fused rings.
Acknowledgements
In particular,
Jon Steele (RSC)
David Sharpe (RSC)
John Blunt (Canterbury, NZ)
Any questions?
batchelorc@rsc.org
@documentvector