HCLSIG$$F2F$$2008-10_F2F$1_Prospect

Project Prospect
and the Semantic Web
Colin Batchelor
Royal Society of Chemistry, Cambridge, UK
batchelorc@rsc.org
Who we are
What we’ve done
Motivation
Means
The InChI and the Semantic Web
Ontology development for chemistry
RXNO and MOP
Who we are
Royal Society of Chemistry
Advancing the Chemical Sciences
• Learned and professional society
• Scientific publisher
• 25 journals, 8 databases and a growing book program
• 8000 articles yearly
• Covering a broad spectrum of chemical sciences from systems biology (Molecular BioSystems) to physical and theoretical chemistry (PCCP)
What we’ve done
The motivation
• Scientific papers are formulaic and consistently structured (but not necessarily IMRD: see later)
• There may be infinitely many possible chemical compounds
BUT
• Nomenclature is productive and susceptible to machine parsing
The means
how publishing really works
Data capture
Editing and proof-reading
Enhanced HTML
Database
Text mining (Oscar)
Manual QA
Enhanced RSS
Regular polysemy
… where words stand for multiple things in a
consistent way.
Examples:
• Brand names
• Grinding
• Figure–ground
• Exact–class–part polysemy in chemistry
Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc.
BERBMTM08 at LREC 2008, Marrakech, Morocco.
Regular polysemy
Brand names
“Learning to buy a Renault and talk to BMW”
Grinding
“The squirrel scampered down the path and kept
stopping and looking at the officers to check they
were behind”
vs.
“[…] the trick was to serve squirrel fresh and not to
leave it hanging like other game”
Regular polysemy
Figure–ground
• Audrey Hepburn painted the door (figure)
• Audrey Hepburn walked through the door (ground)
• The Incredible Hulk walked through the door (ambiguous)
Imidazole
An imidazole
The imidazole side-chain/group/ring/etc.
Can ChEBI handle this?
• Imidazoles (!) (CHEBI:24780)
• Imidazole (CHEBI:16069)
• Imidazole ring: not yet
• Imidazolyl group: not yet (but methyl, benzyl, etc.)
… and there are no disambiguation cues
Disambiguation
One Sense per Discourse (Gale et al. 1992)
… this doesn’t hold at all
One Sense per Collocation (Yarowsky 1993)
… matches our intuitions
Disambiguation: toy model
CLASS:
• w(–1) = a, an, the, this
• w(0) plural (bit of a cheat, as not a collocation)
PART:
• w(–1) = bridging, terminal
• w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
• w(+1)w(+2) = “building block”, “protecting group”, “side chain”
Why is this hard?
Coordination resolution
Part-of-speech ambiguity: is “tosylates” a noun or a verb?
Why is this hard?
How many numbered compounds are actually named in a given paper?
iloprost (1)
tributyl-1-hexynylstannane (2)
the desired 2-heptyne (3)
methyl–Pd(II) iodide 4 or 4′
alkynylstannane 5
the hypervalent stannate 6
(alkynyl)(methyl)Pd(II) complex 7
the desired methylalkyne 8
compounds 9–14
the stannyl precursors 15 and 16
methylated compounds 17 and 18
stannyl precursor 19
iloprost methyl ester 20
“iloprost methyl ester” is the real name, but you need to know that iloprost is a monocarboxylic acid!
Why is this hard?
For compound names:
~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
~20% PubChem
~20% ChemDraw
For compound numbers:
~70% author ChemDraw
~30% editors
What are we marking up?
• Chemical compounds (InChI, ChEBI)
• Chemical classes and parts (ChEBI)
• Nanoparticles (in ChEBI from end of October)
• Chemical terms from the IUPAC Gold Book
• Name reactions (RXNO)
• Gene products: function, process, location (GO)
• Nucleotide and polypeptide sequence terms (SO)
• Cell types (CL)
InChI and the Semantic Web
What InChI is for
• Can represent complete molecules (may be ions or radicals) of less than 1024 heavy (non-H) atoms.
(however)
• Cannot yet represent metal atom geometry.
• Cannot yet represent polymers.
• Cannot yet represent diradicals etc.
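The size limit can be checked from the formula layer alone. The sketch below is an illustration, not part of any InChI toolkit: `heavy_atom_count` and `MAX_HEAVY_ATOMS` are hypothetical names, and the code assumes the formula layer is the second "/"-separated field, with "." separating components and an optional leading multiplier on each component.

```python
import re

MAX_HEAVY_ATOMS = 1023  # "less than 1024", as stated on the slide


def heavy_atom_count(inchi):
    """Count non-hydrogen atoms from the formula layer of an InChI string."""
    # Assumption: the formula layer is the second "/"-separated field.
    formula = inchi.split("/")[1]
    total = 0
    # Assumption: "." separates components, each with an optional multiplier.
    for component in formula.split("."):
        match = re.fullmatch(r"(\d*)((?:[A-Z][a-z]?\d*)+)", component)
        if match is None:
            raise ValueError(f"unrecognised formula component: {component!r}")
        multiplier = int(match.group(1) or 1)
        atoms = 0
        for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", match.group(2)):
            if symbol != "H":  # hydrogens are not heavy atoms
                atoms += int(count or 1)
        total += multiplier * atoms
    return total


def within_inchi_limit(inchi):
    """True if the molecule is small enough for the limit quoted above."""
    return heavy_atom_count(inchi) <= MAX_HEAVY_ATOMS
```

A formula layer of C15H22O9 gives 24 heavy atoms, well inside the limit. Strings without a formula layer are not handled.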
What InChI is not for
• Classes of molecule
• Parts of molecule
(these have been done in ChemBlast)
InChI in RDF
(We don’t like this.)
We use the RSS content module. (As if
articles contained molecules.)
And we use info:inchi URIs.
Look…
Some RDF
<content:items>
<rdf:Bag>
<rdf:li>
<content:item rdf:about="info:inchi/InChI=1/C15H22O9/c1-8(16)19-6-15(7-20-9(2)17)12(21-10(3)18)11-13(24-15)23-14(4,5)22-11/h11-13H,6-7H2,1-5H3/t11?,12-,13+/m1/s1"/>
</rdf:li>
<rdf:li>
<content:item rdf:about="info:inchi/InChI=1/C21H34O9/c1-6-9-14(22)25-12-21(13-26-15(23)10-7-2)18(27-16(24)11-8-3)17-19(30-21)29-20(4,5)28-17/h17-19H,6-13H2,1-5H3/t17?,18-,19+/m1/s1"/>
</rdf:li>
</rdf:Bag>
</content:items>
<content:items>
<content:item>
<owl:Class rdf:ID="GO_0016298">
<rdfs:label>lipase activity</rdfs:label>
</owl:Class>
</content:item>
</content:items>
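A fragment like the first one above could be generated with Python's standard library along these lines. This is a sketch, not the RSC's feed code: `inchi_items` is a hypothetical helper, the two short InChIs (methane and water) are stand-ins for the compounds on the slide, and the InChI is appended to `info:inchi/` without any escaping, which is an assumption taken from how the slide writes it.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for the RSS 1.0 content module and RDF.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

ET.register_namespace("content", CONTENT_NS)
ET.register_namespace("rdf", RDF_NS)


def inchi_items(inchis):
    """Build a content:items element holding one content:item per InChI."""
    items = ET.Element(f"{{{CONTENT_NS}}}items")
    bag = ET.SubElement(items, f"{{{RDF_NS}}}Bag")
    for inchi in inchis:
        li = ET.SubElement(bag, f"{{{RDF_NS}}}li")
        # Assumption: the InChI is appended unescaped, as on the slide.
        ET.SubElement(li, f"{{{CONTENT_NS}}}item",
                      {f"{{{RDF_NS}}}about": "info:inchi/" + inchi})
    return items


# Stand-in compounds: methane and water.
xml_text = ET.tostring(
    inchi_items(["InChI=1/CH4/h1H4", "InChI=1/H2O/h1H2"]),
    encoding="unicode",
)
```

Unlike the slide fragment, the serialized result carries its own namespace declarations on the root element, so it can be parsed on its own.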
RXNO
David Barden
Colin Batchelor
Celia Gitterman
RXNO
the name reaction ontology (1)
• Every chemist knows about famous chemists like Wittig, Cannizzaro, Diels, Alder, benzoin
• They’re pretty unambiguous and well-suited to logical definitions
• But what organizing principle do we use?
RXNO
the name reaction ontology (2)
• Sort reactions by what they do to the ‘skeleton’ of the molecule.
• Skeleton-changing reactions:
  • Joinings, cleavings, rearrangements, ring formation, ring expansion
• Skeleton-preserving reactions:
  • Additions, eliminations, substitutions, protections, deprotections
RXNO
the name reaction ontology (3)
• Quality? Subjectivity?
• Get our curators to assign reactions to categories without conferring, check percentage agreement, discuss disagreements, improve guidelines, iterate to convergence.
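The percentage-agreement step might look like the sketch below. It assumes two curators and plain percentage agreement with no chance correction. The function name is hypothetical, and the reaction names and assignments are invented purely to exercise it; they are not real RXNO curation data.

```python
def percentage_agreement(first, second):
    """Percentage of shared reactions that two curators categorised alike.

    Each argument maps a reaction name to the category one curator chose.
    Returns the percentage and the list of reactions left to discuss.
    """
    shared = sorted(set(first) & set(second))
    if not shared:
        raise ValueError("no reactions were categorised by both curators")
    disagreements = [name for name in shared if first[name] != second[name]]
    agreement = 100.0 * (len(shared) - len(disagreements)) / len(shared)
    return agreement, disagreements


# Invented assignments, purely to exercise the function.
curator_a = {"reaction 1": "joining", "reaction 2": "ring formation",
             "reaction 3": "substitution", "reaction 4": "protection"}
curator_b = {"reaction 1": "joining", "reaction 2": "ring formation",
             "reaction 3": "elimination", "reaction 4": "protection"}

agreement, to_discuss = percentage_agreement(curator_a, curator_b)
```

With these invented inputs the curators agree on three of four reactions, so `agreement` is 75.0 and `to_discuss` holds the one reaction to talk through before the guidelines are revised.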
RXNO
the name reaction ontology (4)
What do people say?
The spectroscopist’s tale
The enriched html version came as
something of a revelation and the
current emphasis on links to, and
through biomolecular terminology was
very much a plus for us, since my
colleagues and I are a mix of physical
and biological chemists who are
dabbling in inter-disciplinary waters.
Given the steadily increasing burden of
keeping up with the current literature
and accessing earlier publications - a
fortiori when conventional disciplinary
boundaries are being crossed - the
ability to 'grow a tree' from current
articles (including one's own) is going to
make 'targeted sleuthing' a great deal
easier.
John Simons, Oxford
The high-throughput screener’s
tale
An interesting opportunity
particularly for managers,
students and beginners
that are not that deeply
immersed in the detail and
the terminology. It further
opens access to those
who want to explore areas
they are not specialists in.
Great idea!
Eberhard Krausz, MPI-CBG
Dresden
Lastly…
“My only criticism would be the need for a
time warning… I spent 4 hours digging
about which generated at least six new
research ideas printed half a ream of paper
and I missed my bus home. At least it was a
new excuse my wife had not heard, so
another first.”
An analytical chemist, The North.
Acknowledgements
Royal Society of Chemistry
Richard Kidd, Jeff White, David Barden, Celia
Gitterman, Hilary Burch, the Informatics team
University of Cambridge
Peter Corbett, Simone Teufel, Ann Copestake, Peter
Murray-Rust
OBO
Karen Eilbeck, Midori Harris, Jen Deegan, Jane
Lomax, Chris Mungall, Barry Smith, the ChEBI
team