UniChem
An Introduction to UniChem: EMBL-EBI’s mapping tool for small molecule database identifiers.
Webinar: Wed 13th May 2015
Jon Chambers and Anne Hersey, ChEMBL group, The European Bioinformatics Institute, part of the European Molecular Biology Laboratory (EMBL-EBI).

UniChem Webinar: 13th May 2015
• What is UniChem?
• Basic Use of UniChem (web service and web page).
• Background …
  • Why was UniChem developed? What problem does it solve?
• Requirements and Features …
  • Schema, Data Normalization, Loading Rules, etc.
• Current Content …
  • Sources, Downloads, Stats, Analyses.
• Connectivity Search.
• Q and A

A ChEMBL Compound Report Card
https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL12

Compound Cross-references on a Compound Report Card …
• Cross-references to the same molecule in other resources.
• Automatically maintained via UniChem web services.
• Other resources can make use of this same functionality.

REST web services
https://www.ebi.ac.uk/unichem/rest/src_compound_id/CHEMBL12/1

https://www.ebi.ac.uk/unichem/
UniChem query results.
LR = Last Release when Assignment was current.
UCI = UniChem Identifier

EBI Resources containing small molecule data.
- Many resources, each with very different user-bases.
- Links between resources allow each resource to evolve independently.
- New resources predicted to be developed/adopted in future.
- But, maintenance is manual/time consuming, and a duplication of effort.
- How can chemistry-centric users make use of all these data?
‘49575’ ‘CHEMBL12’ ‘DZP’ ‘diazepam’ ‘ECBD..??’ ‘SCHEMBL21442’

Advantages of the UniChem model.
- All EBI DBs share the maintenance overhead of creating links to each other.
- All EBI DBs share the benefits of maintained links to external resources.
- The ‘mapping service’ could be opened for use by external users.

Essential requirements for UniChem.
• Create cross-referencing of chemical structures and their identifiers between databases.
• Fast (ie: capable of producing mappings ‘on the fly’ during a web page load, via a web service call).
• Low maintenance.
• Up to date.
• Archive and track changes to ‘id-to-structure’ assignments over time.

Standard InChI used as the normalizing mechanism.

InChI (International Chemical Identifier)
• Non-proprietary, free.
• Not a registry system.
• Designed for printed and electronic data sources.
• Hashed representation aids ‘private’ querying.
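
The normalizing role of the Standard InChI can be sketched as a small in-memory index: every source record carries its own identifier plus the Standard InChIKey of the structure, and identifiers are cross-referenced whenever the keys are identical. This is only an illustration, not UniChem's implementation. The class and method names are hypothetical, the pairing of the example identifiers with source names is an assumption, and the InChIKey used for diazepam is supplied here for illustration rather than taken from the slides.

```python
from collections import defaultdict


class XrefIndex:
    """Toy cross-reference index keyed on Standard InChIKey (hypothetical)."""

    def __init__(self):
        self._by_key = defaultdict(set)    # InChIKey -> {(source, compound_id)}
        self._keys_of = defaultdict(set)   # (source, compound_id) -> {InChIKey}

    def load(self, source, compound_id, inchikey):
        """Register one 'id-to-structure' assignment from a source."""
        self._by_key[inchikey].add((source, compound_id))
        self._keys_of[(source, compound_id)].add(inchikey)

    def lookup(self, source, compound_id):
        """Return identifiers from other records that share an InChIKey."""
        query = (source, compound_id)
        hits = set()
        for key in self._keys_of.get(query, ()):
            hits |= self._by_key[key]
        hits.discard(query)
        return sorted(hits)


# Assumed example data: ids from the slide, source names and key are assumptions.
DIAZEPAM_KEY = "AAOVKJBEBIDNHE-UHFFFAOYSA-N"
index = XrefIndex()
index.load("ChEMBL", "CHEMBL12", DIAZEPAM_KEY)
index.load("ChEBI", "49575", DIAZEPAM_KEY)
index.load("PDBe", "DZP", DIAZEPAM_KEY)
index.load("SureChEMBL", "SCHEMBL21442", DIAZEPAM_KEY)

print(index.lookup("ChEMBL", "CHEMBL12"))
```

Because the lookup is a pair of dictionary accesses, it is cheap enough to run during a page load, which is the spirit of the "fast" requirement above.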
InChI (International Chemical Identifier)
InChIKey … 27 characters long …
MGDTEJBDJOHWYU-UHTGSUKQAC-N
[‘connectivity block’, aka ‘First InChIKey Hash Block’ (FIKHB), shown in blue]

UniChem Schema
UC_STRUCTURE (entries here are immutable)
    UCI (PK)
    STANDARDINCHI
    STANDARDINCHIKEY
UC_XREF
    UCI (FK, PK)
    SRC_ID (FK, PK)
    SRC_COMPOUND_ID (PK)   eg: CHEMBL12
    ASSIGNMENT   1 or 0
    LAST_REL_CURRENT
UC_SOURCE
    SRC_ID (PK)
    NAME
    DESCRIPTION
    CURRENT_RELEASE_U
    etc
UC_RELEASE
    SRC_ID (PK)
    RELEASE_U (PK)
    SRC_RELEASE_NUMBER
    SRC_RELEASE_DATE
    etc

UniChem Tracks Historical Assignments …
Data Release No1 from Source ‘S’: cpd123 -> InChiX
Data Release No2 from Source ‘S’: cpd123 -> InChiY
Data Release No3 from Source ‘S’ (latest): cpd123 -> InChiZ
UniChem will record that in this particular source, the id ‘cpd123’ …
• … was last assigned to InChiX on Release No.1, but is not currently assigned to this structure.
• … was last assigned to InChiY on Release No.2, but is not currently assigned to this structure.
• … is currently assigned to InChiZ.
ie: UniChem keeps a record of current AND obsolete assignments.

UniChem deals with ‘Multiple Assignments’ …
Multiple ids from a particular source assigned to a single InChI …
cpd123, cpd456, cpd789 -> InChiX
… and …
Single id from a particular source assigned to multiple InChIs …
cpd123 -> InChiX, InChiY, InChiZ

Loading Rules
Records are not loaded if …
• There is a mis-match between the InChI and the InChIKey, ie: where the InChIKey calculated by UniChem from the InChI provided by the source does not exactly match the InChIKey provided by the source.
• The Standard InChI supplied is greater than 2000 characters long.

Automated Loading and Release.
[Diagram labels: Source specific downloaders and parsers; Common Format; Single loader; Production; Release (incl. Downloads + Mapping files); Weekly release process]
Overall process controlled by crontab (timings optimized for each DB to capture latest releases asap).

Top Level stats.

Stats.
https://www.ebi.ac.uk/unichem/ucquery/stats

Sources.

Downloads.
ftp://ftp.ebi.ac.uk/pub/databases/chembl/UniChem/
Downloads on the UniChem ftp site …
Oracle Dumps on the UniChem ftp site … Release number == UDRI
Contents of a single Release directory …

Whole Source Mapping Downloads
Files containing all id mappings between two sources.

An example of a whole source mapping file.
eg: src3src15.txt [PDBe and SureChEMBL]
From src:'3'    To src:'15'
SX2             SCHEMBL3396223
0DU             SCHEMBL6234813
FM9             SCHEMBL12263874
HHH             SCHEMBL1957930
2DC             SCHEMBL1746175
28Y             SCHEMBL232090
0X5             SCHEMBL3515230
PU7             SCHEMBL1964201
1LP             SCHEMBL111850
ACK             SCHEMBL4066485
... (8719 records)

Analyses.
Various analyses run on the current UniChem content, using ‘Structural Identity’ defined in one of 3 ways …
• FULIK = The Full InChIKey.
• FIKHB = First InChIKey Hash Block (commonly called 'the connectivity layer' of the InChIKey).
• SCFIB = Separated Single Components of FIKHB.

Structures by Source
Numbers of ‘structures’ contributed by each source, and of these, how many are unique to the source …

Overlaps between Sources
Numbers of ‘structures’ which ‘overlap’ between pairs of sources …

UniChem Connectivity Search
An advanced use of UniChem which permits searching across UniChem data sources for molecules with the same molecular skeleton as the query, but which may exist in …
• Different stereochemical and isotopic forms
• Different salt forms or mixtures
Funded by FP7 Capacities Specific Programme, grant agreement no. 284209

Connectivity Based Searching in UniChem
Standard UniChem links created only on the basis of identical InChIKeys.
Aim: Create links on the basis of common connectivity (but differing elsewhere; stereochemistry, isotopic composition, etc).
Requirements …
• Fast (has to be created dynamically).
• Identify ‘relationships’ between molecules (eg: “has same connectivity … and is isotopic variant of”).
• Link between cpds with common connectivity within mixtures/salts.
• Generic / Flexible / Customizable.

Alternative views of molecular equivalence.
Molecules that many scientists would consider equivalent in the context of their particular field (e.g. pharmacology, docking, etc.) are quite often depicted differently across different resources. Frequently, these depictions have different Standard InChIs and so cannot be integrated by simply matching on Standard InChIKey.
Examples …

Isotopic Differences
CP-99994, an NK1 antagonist …
CHEMBL441225: DTQNEFOKTXXQKV-HKUYNNGSSA-N
PubChem CID 71450958: DTQNEFOKTXXQKV-XRLBDJASSA-N
NB: First InChIKey Hash Block (FIKHB) in blue.

Example of Stereochemical differences
Paroxetine in two different sources …
AHOUBRCZNHFOSL-YOEHRIQHSA-N
AHOUBRCZNHFOSL-WMLDXEAASA-N
Incorrectly drawn, or valid stereoisomeric forms?
NB: First InChIKey Hash Block (FIKHB) in blue.

Links between mixtures / salts ?
Yohimbine (CHEMBL15245 in ChEMBL): BLGXFZZNTVWLAY-SCYLSFHTSA-N
Yohimbine HCl (Antagonil in ‘Selleck’): PIPZGJSEDRMUAW-VJDCAHTMSA-N
Co-Amoxiclav (Amoxicillin + Clavulanic acid): QJVHTELASVOWBE-AGNWQMPPSA-N
InChI=1S/C16H19N3O5S.C8H9NO5/c1-16(2)11(15(23)24)19-13(22)10(14(19)25-16)18-12(21)9(17)7-3-5-8(20)6-4-7;10-2-1-4-7(8(12)13)9-5(11)3-6(9)14-4/h3-6,9-11,14,20H,17H2,1-2H3,(H,18,21)(H,23,24);1,6-7,10H,2-3H2,(H,12,13)/b;4-1-/t9-,10-,11+,14-;6-,7-/m11/s1

Links between mixtures / salts ?
… Yes, but parsing of the InChI required first …
Yohimbine: BLGXFZZNTVWLAY-SCYLSFHTSA-N
Yohimbine HCl: PIPZGJSEDRMUAW-VJDCAHTMSA-N
Hydrochloride: VEXZGXHMUGYJMC-UHFFFAOYSA-N

UniChem Schema
Additions to schema for ‘Connectivity Search’ shown in green (marked [added] here).
UC_STRUCTURE
    UCI (PK)
    STANDARDINCHI
    STANDARDINCHIKEY
    FIKHB   [added]
UC_XREF
    UCI (FK, PK)
    SRC_ID (FK, PK)
    SRC_COMPOUND_ID (PK)   eg: CHEMBL12
    ASSIGNMENT   1 or 0
    LAST_REL_CURRENT
UC_FIKHB_HIERARCHY   [added]
    PARENT
    CHILD
UC_SOURCE
    SRC_ID (PK)
    NAME
    DESCRIPTION
    CURRENT_RELEASE_U
    etc
UC_RELEASE
    SRC_ID (PK)
    RELEASE_U (PK)
    SRC_RELEASE_NUMBER
    SRC_RELEASE_DATE
    etc

Links between combinations of stereoisomers, isotopic variants, in mixtures / salts …
Yohimbine (CHEMBL15245 in ChEMBL): BLGXFZZNTVWLAY-SCYLSFHTSA-N
• … is a component of … Yohimbine HCl: PIPZGJSEDRMUAW-VJDCAHTMSA-N
• … is isotopic variant of … AND … is stereoisomer of … tritiated Rauwolscine: BLGXFZZNTVWLAY-XDGRAVGFSA-N
• … is a component of … AND … is stereoisomer of … Rauwolscine Oxalate: XIIDGINYXKOJGX-ZKKXXTDSSA-N
• … is a component of … AND … is stereoisomer of … Rauwolscine HCl: PIPZGJSEDRMUAW-ZKKXXTDSSA-N

Refining ‘Connectivity Search’ to show salts and mixtures.
Select radio button ‘4’ of Option C.

Connectivity Search Results Page.
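
The idea of linking on common connectivity can be sketched by comparing only the First InChIKey Hash Block (FIKHB) of two keys. The snippet below is a minimal illustration, not UniChem's code: the function names are hypothetical, and the keys are the ones quoted on the slides, written in the usual hyphenated 14-10-1 layout.

```python
def split_inchikey(inchikey):
    """Split a 27-character InChIKey into its three hyphen-separated blocks."""
    parts = inchikey.split("-")
    if len(inchikey) != 27 or [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return tuple(parts)


def relation(key_a, key_b):
    """Classify two InChIKeys by comparing their first hash block (FIKHB)."""
    if key_a == key_b:
        return "identical"
    if split_inchikey(key_a)[0] == split_inchikey(key_b)[0]:
        # Same skeleton; the keys alone do not say whether the difference
        # is stereochemical, isotopic, or something else.
        return "same connectivity"
    return "different connectivity"


YOHIMBINE = "BLGXFZZNTVWLAY-SCYLSFHTSA-N"
TRITIATED_RAUWOLSCINE = "BLGXFZZNTVWLAY-XDGRAVGFSA-N"
YOHIMBINE_HCL = "PIPZGJSEDRMUAW-VJDCAHTMSA-N"
RAUWOLSCINE_HCL = "PIPZGJSEDRMUAW-ZKKXXTDSSA-N"

print(relation(YOHIMBINE, TRITIATED_RAUWOLSCINE))
print(relation(YOHIMBINE_HCL, RAUWOLSCINE_HCL))
print(relation(YOHIMBINE, YOHIMBINE_HCL))
```

The last comparison reports different connectivity even though the salt contains yohimbine: the key of a salt or mixture is computed from the whole multi-component InChI, which is why the slides note that the InChI has to be parsed into its components before such links can be found.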
Connectivity Search Web Services
Connectivity Search web service query results
https://www.ebi.ac.uk/unichem/rest/cpd_search/CHEMBL15245/1/0/0/4

Connectivity Search in ChEMBL

Train Online
http://www.ebi.ac.uk/training/online/course/unichem-quick-tour-0

Acknowledgements
ChEMBL: John Overington, Anne Hersey, Anna Gaulton, Mark Davies, Louisa Bellis, George Papadatos, Shaun McGlinchey, Jon Chambers
ChEBI: Chris Steinbeck, Janna Hastings
Atlas: Robert Petryszak
PDBe: Sameer Velankar
Training: Tom Hancocks, Richard Grandison

Future webinars:
• 20th May - ChEMBL walkthrough
• 27th May - Sequence searching (*3pm UK time)
• 3rd June - UniProt: accessing protein data programmatically
• 10th June - MyChEMBL walkthrough
• 17th June - ChEMBL Web Services
All webinars @ 4:00pm UK time unless stated.
For details see: http://www.ebi.ac.uk/training/online/emblebi-training-webinar-series-2015

__END__

Mapping imprecision
Example of multiple ids from a source assigned to a single Standard InChI …
alloxazine (37325) and isoalloxazine (37327): InChI=1S/C10H6N4O2/c15-9-7-8(13-10 …
Mappings generated …
ChEMBL -> ChEBI
CHEMBL68500 -> 37325
CHEMBL68500 -> 37327
ChEBI -> ChEMBL
37325 -> CHEMBL68500
37327 -> CHEMBL68500
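
The mappings in this example can be reproduced with a short sketch that pairs up every identifier from one source with every identifier from another source sharing the same structure. It is an illustration under stated assumptions, not UniChem's loader: the function name is hypothetical, and the UCI value is a made-up placeholder standing in for the single shared Standard InChI.

```python
from itertools import product

# (source, compound_id, uci); the UCI value 1 is a made-up placeholder.
XREFS = [
    ("ChEMBL", "CHEMBL68500", 1),
    ("ChEBI", "37325", 1),
    ("ChEBI", "37327", 1),
]


def source_mapping(xrefs, from_src, to_src):
    """Return all (from_id, to_id) pairs that share a structure (UCI)."""
    by_uci = {}
    for src, compound_id, uci in xrefs:
        by_uci.setdefault(uci, {}).setdefault(src, []).append(compound_id)
    pairs = []
    for ids_by_src in by_uci.values():
        # Every id on one side is paired with every id on the other side.
        pairs.extend(product(ids_by_src.get(from_src, []),
                             ids_by_src.get(to_src, [])))
    return sorted(pairs)


for a, b in source_mapping(XREFS, "ChEMBL", "ChEBI"):
    print(a, "->", b)
for a, b in source_mapping(XREFS, "ChEBI", "ChEMBL"):
    print(a, "->", b)
```

Because both ChEBI ids resolve to the same Standard InChI, the single ChEMBL id fans out to two ChEBI ids, and a mapping on structure alone cannot tell alloxazine from isoalloxazine.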