Project Prospect and the Semantic Web
Colin Batchelor
Royal Society of Chemistry, Cambridge, UK
batchelorc@rsc.org

Project Prospect and the Semantic Web
  Who we are
  What we’ve done
  Motivation
  Means
  The InChI and the Semantic Web
  Ontology development for chemistry
  RXNO and MOP

Who we are

Royal Society of Chemistry
  Advancing the Chemical Sciences
  Learned and professional society
  Scientific publisher
  25 journals, 8 databases and a growing book program
  8000 articles yearly
  Covering a broad spectrum of chemical sciences, from systems biology (Molecular BioSystems) to physical and theoretical chemistry (PCCP)

What we’ve done

The motivation
  Scientific papers are formulaic and consistently structured (but not necessarily IMRD: see later)
  There may be infinitely many possible chemical compounds
  BUT
  Nomenclature is productive and susceptible to machine parsing

The means
  How publishing really works

  Data capture
  Editing and proof-reading

  Enhanced HTML
  Database
  Text mining (Oscar)
  Manual QA
  Enhanced RSS

Regular polysemy
  … where words stand for multiple things in a consistent way.
  Examples:
    Brand names
    Grinding
    Figure–ground
    Exact–class–part polysemy in chemistry
  Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.

Regular polysemy
  Brand names
    “Learning to buy a Renault and talk to BMW”
  Grinding
    “The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind”
    vs.
    “[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”

Regular polysemy
  Figure–ground
    Audrey Hepburn painted the door (figure)
    Audrey Hepburn walked through the door (ground)
    The Incredible Hulk walked through the door (ambiguous)

Imidazole

An imidazole

The imidazole side-chain/group/ring/etc.

Can ChEBI handle this?
  Imidazoles (!)      CHEBI:24780
  Imidazole           CHEBI:16069
  Imidazole ring      not yet
  Imidazolyl group    not yet (but methyl, benzyl, … etc.)
  and there are no disambiguation cues

Disambiguation
  One Sense per Discourse (Gale et al. 1992)
    … this doesn’t hold at all
  One Sense per Collocation (Yarowsky 1993)
    … matches our intuitions

Disambiguation: toy model
  CLASS:
    w(–1) = a, an, the, this
    w(0) plural (bit of a cheat, as not a collocation)
  PART:
    w(–1) = bridging, terminal
    w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
    w(+1)w(+2) = “building block”, “protecting group”, “side chain”

Why is this hard?
  Coordination resolution
  Part of speech ambiguity: tosylates; noun or verb?

Why is this hard?
  How many numbered compounds actually are named in a given paper?
    iloprost (1)
    tributyl-1-hexynylstannane (2)
    the desired 2-heptyne (3)
    methyl–Pd(II) iodide 4 or 4′
    alkynylstannane 5
    the hypervalent stannate 6
    (alkynyl)(methyl)Pd(II) complex 7
    the desired methylalkyne 8
    compounds 9–14
    the stannyl precursors 15 and 16
    methylated compounds 17 and 18
    stannyl precursor 19
    iloprost methyl ester 20
  “iloprost methyl ester” is the real name, but you need to know that iloprost is a monocarboxylic acid!

Why is this hard?
  For compound names:
    ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
    ~20% PubChem
    ~20% ChemDraw
  For compound numbers:
    ~70% author ChemDraw
    ~30% editors

What are we marking up?
  Chemical compounds (InChI, ChEBI)
  Chemical classes and parts (ChEBI)
  Nanoparticles (in ChEBI from end of October)
  Chemical terms from the IUPAC Gold Book
  Name reactions (RXNO)
  Gene products: function, process, location (GO)
  Nucleotide and polypeptide sequence terms (SO)
  Cell types (CL)

InChI and the Semantic Web

What InChI is for
  Can represent complete molecules (may be ions or radicals) of less than 1024 heavy (non-H) atoms.
  (however)
  Cannot yet represent metal atom geometry.
  Cannot yet represent polymers.
  Cannot yet represent diradicals etc.
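
The size limit above can be checked from the formula layer of an InChI alone. The following is a minimal sketch in Python, not part of the talk: the function names heavy_atom_count and within_inchi_size_limit are hypothetical, and it assumes the formula layer is the first layer after the version prefix, with components separated by "." and optionally preceded by a multiplier. It checks only the atom count and says nothing about the other restrictions (metal geometry, polymers, diradicals).

```python
import re

# Limit quoted on the slide: fewer than 1024 heavy (non-hydrogen) atoms.
HEAVY_ATOM_LIMIT = 1024


def heavy_atom_count(inchi: str) -> int:
    """Count non-hydrogen atoms using only the formula layer of an InChI."""
    layers = inchi.split("/")
    if len(layers) < 2 or not layers[0].startswith("InChI="):
        raise ValueError("not an InChI string")
    total = 0
    # Assumption: components are joined by "." and may carry a leading
    # multiplier, as in "2ClH".
    for component in layers[1].split("."):
        match = re.match(r"(\d*)(.*)", component)
        multiplier = int(match.group(1)) if match.group(1) else 1
        for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", match.group(2)):
            if element != "H":
                total += multiplier * (int(count) if count else 1)
    return total


def within_inchi_size_limit(inchi: str) -> bool:
    """True if the molecule is under the heavy-atom limit stated on the slide."""
    return heavy_atom_count(inchi) < HEAVY_ATOM_LIMIT


# Ethanol has two carbons and one oxygen.
print(heavy_atom_count("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
```

Working from the formula layer keeps the check independent of any cheminformatics toolkit, at the cost of ignoring everything except composition.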

What InChI is not for
  Classes of molecule
  Parts of molecule
  (these have been done in ChemBlast)

InChI in RDF
  (We don’t like this.)
  We use the RSS content module.
  (As if articles contained molecules.)
  And we use info:inchi URIs.
  Look…

Some RDF

  <content:items>
    <rdf:Bag>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C15H22O9/c1-8(16)19-6-15(7-20-9(2)17)12(21-10(3)18)11-13(24-15)23-14(4,5)22-11/h11-13H,6-7H2,1-5H3/t11?,12-,13+/m1/s1"/>
      </rdf:li>
      <rdf:li>
        <content:item rdf:about="info:inchi/InChI=1/C21H34O9/c1-6-9-14(22)25-12-21(13-26-15(23)10-7-2)18(27-16(24)11-8-3)17-19(30-21)29-20(4,5)28-17/h17-19H,6-13H2,1-5H3/t17?,18-,19+/m1/s1"/>
      </rdf:li>
    </rdf:Bag>
  </content:items>

  <content:items>
    <content:item>
      <owl:Class rdf:ID="GO_0016298">
        <rdfs:label>lipase activity</rdfs:label>
      </owl:Class>
    </content:item>
  </content:items>

RXNO
David Barden
Colin Batchelor
Celia Gitterman

RXNO: the name reaction ontology (1)
  Every chemist knows about famous chemists like Wittig, Cannizzaro, Diels, Alder, benzoin
  They’re pretty unambiguous and well-suited to logical definitions
  But what organizing principle do we use?

RXNO: the name reaction ontology (2)
  Sort reactions by what they do to the ‘skeleton’ of the molecule.
  Skeleton-changing reactions:
    Joinings, cleavings, rearrangements, ring formation, ring expansion
  Skeleton-preserving reactions:
    Additions, eliminations, substitutions, protections, deprotections

RXNO: the name reaction ontology (3)
  Quality? Subjectivity?
  Get our curators to assign reactions to categories without conferring, check percentage agreement, discuss disagreements, improve guidelines, iterate to convergence.

RXNO: the name reaction ontology (4)

What do people say?

The spectroscopist’s tale
  The enriched html version came as something of a revelation and the current emphasis on links to, and through biomolecular terminology was very much a plus for us, since my colleagues and I are a mix of physical and biological chemists who are dabbling in inter-disciplinary waters. Given the steadily increasing burden of keeping up with the current literature and accessing earlier publications - a fortiori when conventional disciplinary boundaries are being crossed - the ability to 'grow a tree' from current articles (including one's own) is going to make 'targeted sleuthing' a great deal easier.
  John Simons, Oxford

The high-throughput screener’s tale
  An interesting opportunity particularly for managers, students and beginners that are not that deeply immersed in the detail and the terminology. It further opens access to those who want to explore areas they are not specialists in. Great idea!
  Eberhard Krausz, MPI-CBG Dresden

Lastly…
  “My only criticism would be the need for a time warning… I spent 4 hours digging about which generated at least six new research ideas printed half a ream of paper and I missed my bus home. At least it was a new excuse my wife had not heard, so another first.”
  An analytical chemist, The North.

Acknowledgements
  Royal Society of Chemistry
    Richard Kidd, Jeff White, David Barden, Celia Gitterman, Hilary Burch, the Informatics team
  University of Cambridge
    Peter Corbett, Simone Teufel, Ann Copestake, Peter Murray-Rust
  OBO
    Karen Eilbeck, Midori Harris, Jen Deegan, Jane Lomax, Chris Mungall, Barry Smith, the ChEBI team
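
Supplementary sketch for the "Disambiguation: toy model" slide. The collocation cues listed there can be written as a few lookups on neighbouring words. This is an illustrative Python sketch, not the system used in Project Prospect. Three things are assumptions not stated in the talk: PART cues are tried before CLASS cues, a mention with no cue defaults to EXACT (the specific compound), and plurals are detected by a crude trailing "s". The function name classify is hypothetical, and only the cue words actually listed on the slide are included (the "many more" are left out).

```python
# Cue words copied from the toy model slide.
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"),
                     ("side", "chain")}


def classify(tokens, i):
    """Label the chemical name at tokens[i] as 'PART', 'CLASS' or 'EXACT'."""
    words = [t.lower() for t in tokens]
    prev = words[i - 1] if i > 0 else None
    nxt = words[i + 1] if i + 1 < len(words) else None
    nxt2 = words[i + 2] if i + 2 < len(words) else None

    # Assumption: PART cues win over CLASS cues, so that
    # "the imidazole side chain" is a part rather than a class.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # w(-1) determiner, or w(0) plural (crude check: trailing "s").
    if prev in CLASS_PREV or words[i].endswith("s"):
        return "CLASS"
    # Assumption: with no cue, read the name as the exact compound.
    return "EXACT"


print(classify("the imidazole side chain".split(), 1))
print(classify("an imidazole was added".split(), 1))
print(classify("imidazole was added".split(), 0))
```

Each rule looks only at a fixed window around the name, which is what One Sense per Collocation suggests; nothing is carried over from earlier mentions in the same paper.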