unichem_webinar_may13_2015

UniChem
An Introduction to UniChem: EMBL-EBI’s mapping tool
for small molecule database identifiers.
Webinar: Wed 13th May 2015
Jon Chambers and Anne Hersey,
ChEMBL group,
The European Bioinformatics Institute, part of the
European Molecular Biology Laboratory (EMBL-EBI).
UniChem Webinar: 13th May 2015
• What is UniChem ?
• Basic Use of UniChem (web service and web page).
• Background …
• Why was UniChem developed ? What problem does it solve ?
• Requirements and Features…
• Schema, Data Normalization, Loading Rules, etc
• Current Content …
• Sources, Downloads, Stats, Analyses.
• Connectivity Search.
• Q and A
A ChEMBL Compound Report Card
https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL12
Compound Cross-references on a Compound Report Card…
Cross-references to the same molecule in other resources.
Automatically maintained via UniChem web services.
Other resources can make use of this same functionality.
REST web services
https://www.ebi.ac.uk/unichem/rest/src_compound_id/CHEMBL12/1
https://www.ebi.ac.uk/unichem/
UniChem query results.
LR = Last Release when Assignment was current.
UCI = UniChem Identifier
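The REST addresses quoted in this deck share one shape: a base address, an endpoint name, and a run of path segments. The following is a minimal Python sketch of composing such an address. The helper name `unichem_rest_url` is hypothetical, and the code only assembles the string; it makes no network call and assumes nothing about the response format.

```python
from urllib.parse import quote

# Base address as it appears on the slides.
BASE = "https://www.ebi.ac.uk/unichem/rest"


def unichem_rest_url(endpoint, *segments):
    """Join an endpoint name and its path segments into a request URL.

    Hypothetical helper: it only builds the string.
    """
    parts = [endpoint] + [quote(str(s), safe="") for s in segments]
    return BASE + "/" + "/".join(parts)


# Cross-references for CHEMBL12; the trailing 1 is the source id (ChEMBL).
xref_url = unichem_rest_url("src_compound_id", "CHEMBL12", 1)
print(xref_url)
```

The same helper reproduces the connectivity search address shown later in the deck with `unichem_rest_url("cpd_search", "CHEMBL15245", 1, 0, 0, 4)`.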
EBI Resources containing small molecule data.
Many resources, each with very different user-bases.
- Links between resources allow each resource to evolve independently.
- New resources predicted to be developed/adopted in future.
- But, maintenance is manual/time consuming, and a duplication of effort.
How can chemistry-centric users make use of all these data ?
‘49575’
‘CHEMBL12’
‘DZP’
‘diazepam’
‘ECBD..??’
‘SCHEMBL21442’
Advantages of the UniChem model.
UniChem
- All EBI DBs share the maintenance overhead of creating links to each other.
- All EBI DBs share the benefits of maintained links to external resources.
- The ‘mapping service’ could be opened for use by external users.
Essential requirements for UniChem.
• Create cross-referencing of chemical structures and their
identifiers between databases.
• Fast (ie: capable of producing mappings ‘on the fly’ during a
web page load, via a web service call.)
• Low maintenance.
• Up to date.
• Archive and track changes to ‘id-to-structure’ assignments
over time.
Standard InChI used as the normalizing mechanism.
InChIs (International Chemical Identifier).
• Non-proprietary, free.
• Not a registry system.
• Designed for printed and electronic data sources.
• Hashed representation aids ‘private’ querying.
InChI (International Chemical Identifier)
InChIKey…
27 characters long…
MGDTEJBDJOHWYU-UHTGSUKQAC-N
[‘connectivity block’ aka ‘First InChIKey Hash Block’ (FIKHB) shown in blue]
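The block structure of the InChIKey can be illustrated with a short Python sketch. The function name `split_inchikey` is hypothetical; the block lengths of 14, 10 and 1 characters are taken from the hyphen positions of the key shown on the slide.

```python
def split_inchikey(key):
    """Split a 27-character InChIKey into its three hyphen-separated blocks."""
    if len(key) != 27:
        raise ValueError("an InChIKey is 27 characters long")
    first, second, last = key.split("-")
    if (len(first), len(second), len(last)) != (14, 10, 1):
        raise ValueError("expected blocks of 14, 10 and 1 characters")
    return first, second, last


# The key shown on the slide; the first block is the FIKHB.
fikhb, second_block, final_char = split_inchikey("MGDTEJBDJOHWYU-UHTGSUKQAC-N")
print(fikhb)  # MGDTEJBDJOHWYU
```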
UniChem Schema
UC_STRUCTURE (entries here are immutable)
- UCI (PK)
- STANDARDINCHI
- STANDARDINCHIKEY
UC_XREF
- UCI (FK, PK)
- SRC_ID (FK, PK)
- SRC_COMPOUND_ID (PK), eg: CHEMBL12
- ASSIGNMENT (1 or 0)
- LAST_REL_CURRENT
UC_SOURCE
- SRC_ID (PK)
- NAME
- DESCRIPTION
- CURRENT_RELEASE_U
- etc
UC_RELEASE
- SRC_ID (PK)
- RELEASE_U (PK)
- SRC_RELEASE_NUMBER
- SRC_RELEASE_DATE
- etc
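As a rough illustration of the four tables, the sketch below creates them in an in-memory SQLite database from Python. Table and column names follow the slide. The column types, the key choices as read from the slide layout, and the use of SQLite as a stand-in for the production database are all assumptions made for this example.

```python
import sqlite3

# Column names follow the slide; types and key choices are assumptions.
DDL = """
CREATE TABLE uc_source (
    src_id            INTEGER PRIMARY KEY,
    name              TEXT,
    description       TEXT,
    current_release_u INTEGER
);
CREATE TABLE uc_release (
    src_id             INTEGER REFERENCES uc_source(src_id),
    release_u          INTEGER,
    src_release_number TEXT,
    src_release_date   TEXT,
    PRIMARY KEY (src_id, release_u)
);
CREATE TABLE uc_structure (
    uci              INTEGER PRIMARY KEY,
    standardinchi    TEXT NOT NULL,
    standardinchikey TEXT NOT NULL
);
CREATE TABLE uc_xref (
    uci              INTEGER REFERENCES uc_structure(uci),
    src_id           INTEGER REFERENCES uc_source(src_id),
    src_compound_id  TEXT,
    assignment       INTEGER CHECK (assignment IN (0, 1)),
    last_rel_current INTEGER,
    PRIMARY KEY (uci, src_id, src_compound_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```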
UniChem Tracks Historical Assignments…
Data Release No1 from Source ‘S’: cpd123 -> InChiX
Data Release No2 from Source ‘S’: cpd123 -> InChiY
Data Release No3 from Source ‘S’ (latest): cpd123 -> InChiZ
UniChem will record that in this particular source, the id ‘cpd123’…
• … was last assigned to InChiX on Release No.1, but is not currently assigned to this structure.
• … was last assigned to InChiY on Release No.2, but is not currently assigned to this structure.
• … is currently assigned to InChiZ.
ie: UniChem keeps a record of current AND obsolete assignments.
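The bookkeeping described above can be sketched in a few lines of Python. This is a simplified model for a single source, not UniChem's loader: the function name `load_release` and the dictionary layout are hypothetical, with `assignment` and `last_rel_current` borrowed from the UC_XREF column names.

```python
def load_release(xrefs, release, pairs):
    """Apply one data release from a single source.

    xrefs maps (src_compound_id, inchi) to a dict holding 'assignment'
    (1 = current, 0 = obsolete) and 'last_rel_current'.
    pairs is the full set of (src_compound_id, inchi) in this release.
    """
    pairs = set(pairs)
    for key, row in xrefs.items():
        if key not in pairs:
            # No longer current; last_rel_current keeps its old value.
            row["assignment"] = 0
    for key in pairs:
        xrefs[key] = {"assignment": 1, "last_rel_current": release}
    return xrefs


xrefs = {}
load_release(xrefs, 1, [("cpd123", "InChiX")])
load_release(xrefs, 2, [("cpd123", "InChiY")])
load_release(xrefs, 3, [("cpd123", "InChiZ")])
for key, row in sorted(xrefs.items()):
    print(key, row)
```

After the third release, the InChiX and InChiY rows are marked obsolete with releases 1 and 2 recorded as the last time they were current, and only the InChiZ row is current.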
UniChem deals with ‘Multiple Assignments’…
Multiple ids from a particular source assigned to a single InChI…
cpd123, cpd456, cpd789 -> InChiX
…and…
Single id from a particular source assigned to multiple InChIs…
cpd123 -> InChiX, InChiY, InChiZ
Loading Rules
Records are not loaded if…
• There is a mismatch between the InChI and the InChIKey, ie: where the InChIKey calculated by UniChem from the InChI provided by the source does not exactly match the InChIKey provided by the source.
• The Standard InChI supplied is longer than 2000 characters.
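The two loading rules can be expressed as a small predicate. The Python sketch below is illustrative only: `should_load` is a hypothetical name, and the key calculation is passed in as a function because deriving an InChIKey from an InChI needs an InChI toolkit, which is outside the scope of this example.

```python
MAX_INCHI_LENGTH = 2000  # limit stated on the slide


def should_load(inchi, supplied_key, compute_key):
    """Return True if a source record passes both loading rules.

    compute_key is a caller-supplied function that derives an InChIKey
    from an InChI (hypothetical here; in practice an InChI toolkit).
    """
    if len(inchi) > MAX_INCHI_LENGTH:
        return False
    return compute_key(inchi) == supplied_key


# Stand-in lookup for demonstration only; it is not a real key generator.
known_keys = {"InChiX": "KEY-X"}
fake_compute = known_keys.get

print(should_load("InChiX", "KEY-X", fake_compute))  # True
print(should_load("InChiX", "KEY-Y", fake_compute))  # False
```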
Automated Loading and Release.
[Flow diagram] Source specific downloaders and parsers (… etc …) -> Common Format -> Single loader -> Production -> Weekly release process -> Release (incl. Downloads + Mapping files)
Overall process controlled by crontab (timings optimized for each DB to capture latest releases asap).
Top Level stats.
https://www.ebi.ac.uk/unichem/ucquery/stats
Sources.
Downloads.
ftp://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/
Downloads on the UniChem ftp site …
Oracle Dumps on the UniChem ftp site …
Release number == UDRI
Contents of a single Release directory…
Whole Source Mapping Downloads – Files
containing all id mappings between two sources.
An Example of a Whole source mapping file.
eg: src3src15.txt
[PDBe and SureChEMBL]
From src:'3' To src:'15'
SX2 SCHEMBL3396223
0DU SCHEMBL6234813
FM9 SCHEMBL12263874
HHH SCHEMBL1957930
2DC SCHEMBL1746175
28Y SCHEMBL232090
0X5 SCHEMBL3515230
PU7 SCHEMBL1964201
1LP SCHEMBL111850
ACK SCHEMBL4066485
...
(8719 records)
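A whole source mapping file like the excerpt above can be read with a few lines of Python. This sketch assumes one header line followed by two whitespace-separated ids per line, as the excerpt suggests; the function name `read_mapping` is hypothetical, and the sample below reuses three records from the slide rather than a downloaded file.

```python
import io


def read_mapping(lines):
    """Parse a whole source mapping file into (header, list of id pairs).

    Assumes a header line, then two whitespace-separated ids per line.
    Lines that do not have exactly two fields are skipped.
    """
    it = iter(lines)
    header = next(it).strip()
    pairs = []
    for line in it:
        fields = line.split()
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return header, pairs


sample = io.StringIO(
    "From src:'3' To src:'15'\n"
    "SX2 SCHEMBL3396223\n"
    "0DU SCHEMBL6234813\n"
    "FM9 SCHEMBL12263874\n"
)
header, pairs = read_mapping(sample)
print(header)
print(pairs[0])
```

For a downloaded file, the same function accepts an open file handle, for example `read_mapping(open("src3src15.txt"))`.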
Analyses.
Various analyses run on the current UniChem
content, using ‘Structural Identity’ defined in one of
3 ways…
FULIK = The Full InChIKey.
FIKHB = First InChIKey Hash Block (commonly
called 'the connectivity layer' of the InChIKey).
SCFIB = Separated Single Components of FIKHB.
Structures by Source
Numbers of ‘structures’ contributed by each source, and
of these, how many are unique to the source…
Overlaps between Sources
Numbers of ‘structures’ which ‘overlap’ between pairs of
sources…
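How the choice of ‘Structural Identity’ changes an overlap count can be shown with a small Python sketch. The function names are hypothetical, and only FULIK and FIKHB are covered, since SCFIB needs the InChI to be parsed into components first. The two keys are the ones quoted for CP-99994 elsewhere in this deck.

```python
def identity(key, mode):
    """Reduce an InChIKey to the form used as 'structural identity'."""
    if mode == "FULIK":
        return key
    if mode == "FIKHB":
        return key.split("-")[0]
    raise ValueError("mode must be 'FULIK' or 'FIKHB'")


def overlap(keys_a, keys_b, mode):
    """Count structures shared by two sources under the chosen identity."""
    set_a = {identity(k, mode) for k in keys_a}
    set_b = {identity(k, mode) for k in keys_b}
    return len(set_a & set_b)


# CP-99994 as recorded in ChEMBL and in PubChem.
source_a = ["DTQNEFOKTXXQKV-HKUYNNGSSA-N"]
source_b = ["DTQNEFOKTXXQKV-XRLBDJASSA-N"]
print(overlap(source_a, source_b, "FULIK"))  # 0
print(overlap(source_a, source_b, "FIKHB"))  # 1
```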
UniChem Connectivity Search
An advanced use of UniChem which permits searching across UniChem data sources for molecules with the same molecular skeleton as the query, but which may exist in …
• Different stereochemical and isotopic forms
• Different salt forms or mixtures
Funded by FP7 Capacities Specific Programme, grant agreement no. 284209
Connectivity Based Searching in UniChem
• Standard UniChem links created only on the basis of identical InChIKeys.
• Aim: Create links on the basis of common connectivity (but differing elsewhere; stereochemistry, isotopic composition, etc).
• Requirements…
- Fast (has to be created dynamically).
- Identify ‘relationships’ between molecules (eg: “has same connectivity …and is isotopic variant of”)
- Link between cpds with common connectivity within mixtures/salts.
- Generic / Flexible / Customizable.
Alternative views of molecular equivalence.
Molecules that many scientists would consider equivalent in the context of their particular field (e.g. pharmacology, docking, etc.) are often depicted differently across different resources.
Frequently, these depictions have different Standard InChIs and so cannot be integrated by simply matching on Standard InChIKey.
Examples…
Isotopic Differences
CP-99994, an NK1 antagonist…
CHEMBL441225: DTQNEFOKTXXQKV-HKUYNNGSSA-N
PubChem CID 71450958: DTQNEFOKTXXQKV-XRLBDJASSA-N
NB: First InChIKey Hash Block (FIKHB) in blue.
Example of Stereochemical differences
Paroxetine in two different sources …
AHOUBRCZNHFOSL-YOEHRIQHSA-N
AHOUBRCZNHFOSL-WMLDXEAASA-N
Incorrectly drawn, or valid stereoisomeric forms ?
NB: First InChIKey Hash Block (FIKHB) in blue.
Links between mixtures / salts ?
Yohimbine (CHEMBL15245 in ChEMBL): BLGXFZZNTVWLAY-SCYLSFHTSA-N
Yohimbine HCl (Antagonil in ‘Selleck’): PIPZGJSEDRMUAW-VJDCAHTMSA-N
Co-Amoxiclav (Amoxicillin + Clavulanic acid): QJVHTELASVOWBE-AGNWQMPPSA-N
InChI=1S/C16H19N3O5S.C8H9NO5/c1-16(2)11(15(23)24)19-13(22)10(14(19)25-16)18-12(21)9(17)7-3-5-8(20)6-4-7;10-2-1-4-7(8(12)13)9-5(11)3-6(9)14-4/h3-6,9-11,14,20H,17H2,1-2H3,(H,18,21)(H,23,24);1,6-7,10H,2-3H2,(H,12,13)/b;4-1-/t9-,10-,11+,14-;6-,7-/m11/s1
…Yes, but parsing of the InChI required first...
Yohimbine: BLGXFZZNTVWLAY-SCYLSFHTSA-N
Yohimbine HCl: PIPZGJSEDRMUAW-VJDCAHTMSA-N, which parses into the components…
- Yohimbine: BLGXFZZNTVWLAY-SCYLSFHTSA-N
- Hydrochloride: VEXZGXHMUGYJMC-UHFFFAOYSA-N
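The first step of that parsing, finding how many components a mixture or salt contains, can be sketched in Python from the formula layer alone. This is a simplification under stated assumptions: the function name `component_formulas` is hypothetical, only the formula layer is read, and a leading count such as `2ClH` is expanded. Producing an InChIKey (and hence a FIKHB) for each component would need InChI software and is not attempted here.

```python
import re


def component_formulas(inchi):
    """Return the per-component formulas from an InChI's formula layer.

    Simplified sketch: components are separated by '.', and a leading
    integer on a component is treated as a repeat count.
    """
    layers = inchi.split("/")
    if len(layers) < 2 or not layers[0].startswith("InChI="):
        raise ValueError("not an InChI string")
    formulas = []
    for part in layers[1].split("."):
        if not part:
            continue
        count, formula = re.match(r"(\d*)(.+)", part).groups()
        formulas.extend([formula] * int(count or 1))
    return formulas


# Version and formula layers of the Co-Amoxiclav InChI quoted in the deck.
parts = component_formulas("InChI=1S/C16H19N3O5S.C8H9NO5")
print(parts)  # ['C16H19N3O5S', 'C8H9NO5']
```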
UniChem Schema
Additions to schema for ‘Connectivity Search’ are marked (new)
UC_STRUCTURE
- UCI (PK)
- STANDARDINCHI
- STANDARDINCHIKEY
- FIKHB (new)
UC_FIKHB_HIERARCHY (new)
- PARENT
- CHILD
UC_XREF
- UCI (FK, PK)
- SRC_ID (FK, PK)
- SRC_COMPOUND_ID (PK), eg: CHEMBL12
- ASSIGNMENT (1 or 0)
- LAST_REL_CURRENT
UC_SOURCE
- SRC_ID (PK)
- NAME
- DESCRIPTION
- CURRENT_RELEASE_U
- etc
UC_RELEASE
- SRC_ID (PK)
- RELEASE_U (PK)
- SRC_RELEASE_NUMBER
- SRC_RELEASE_DATE
- etc
Links between combinations of stereoisomers, isotopic variants, in mixtures / salts …
Yohimbine (CHEMBL15245 in ChEMBL), BLGXFZZNTVWLAY-SCYLSFHTSA-N …
- …is a component of… Yohimbine HCl, PIPZGJSEDRMUAW-VJDCAHTMSA-N
- …is isotopic variant of… AND …is stereoisomer of… tritiated Rauwolscine, BLGXFZZNTVWLAY-XDGRAVGFSA-N
- …is a component of… AND …is stereoisomer of… Rauwolscine Oxalate, XIIDGINYXKOJGX-ZKKXXTDSSA-N
- …is a component of… AND …is stereoisomer of… Rauwolscine HCl, PIPZGJSEDRMUAW-ZKKXXTDSSA-N
Refining ‘Connectivity Search’ to show salts and mixtures.
Select radio button ‘4’ of Option C.
Connectivity Search Results Page.
Connectivity Search Web Services
Connectivity Search Web service
query results
https://www.ebi.ac.uk/unichem/rest/cpd_search/CHEMBL15245/1/0/0/4
Connectivity Search in ChEMBL
Train Online
http://www.ebi.ac.uk/training/online/course/unichem-quick-tour-0
Acknowledgements
ChEMBL: John Overington, Anne Hersey, Anna Gaulton, Mark Davies, Louisa Bellis, George Papadatos, Shaun McGlinchey, Jon Chambers
ChEBI: Chris Steinbeck, Janna Hastings
Atlas: Robert Petryszak
PDBe: Sameer Velankar
Training: Tom Hancocks, Richard Grandison
Future webinars:
• 20th May - ChEMBL walkthrough
• 27th May - Sequence searching (*3pm UK time)
• 3rd June - UniProt - accessing protein data programmatically
• 10th June - MyChEMBL walkthrough
• 17th June - ChEMBL Web Services
All webinars @ 4:00pm UK time unless stated
For details see: http://www.ebi.ac.uk/training/online/emblebi-training-webinar-series-2015
__END__
Mapping imprecision
Example of multiple ids from a source assigned to a single Standard InChI…
ChEBI 37325 (alloxazine) and ChEBI 37327 (isoalloxazine) share: InChI=1S/C10H6N4O2/c15-9-7-8(13-10 …
mappings generated…
ChEMBL -> ChEBI
CHEMBL68500 -> 37325
CHEMBL68500 -> 37327
ChEBI -> ChEMBL
37325 -> CHEMBL68500
37327 -> CHEMBL68500
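The way a shared Standard InChI produces one-to-many mappings can be reproduced with a short Python sketch. The function name `cross_map` is hypothetical, and a placeholder string stands in for the InChI, which is truncated on the slide.

```python
from collections import defaultdict


def cross_map(source_a, source_b):
    """Pair every id in source_a with every id in source_b sharing an InChI.

    Each argument maps a source compound id to its Standard InChI.
    """
    by_inchi = defaultdict(list)
    for cpd, inchi in source_b.items():
        by_inchi[inchi].append(cpd)
    return [
        (a, b)
        for a, inchi in source_a.items()
        for b in by_inchi.get(inchi, [])
    ]


# Placeholder for the truncated InChI shown on the slide.
SHARED = "shared-inchi-placeholder"
chembl = {"CHEMBL68500": SHARED}
chebi = {"37325": SHARED, "37327": SHARED}

print(cross_map(chembl, chebi))
print(cross_map(chebi, chembl))
```

Both directions yield two mappings, so a user starting from CHEMBL68500 cannot tell from the mapping alone whether alloxazine or isoalloxazine is meant.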