- ChemAxon

Chemical Entity extraction using the
chemicalize.org-technology
Josef Scheiber
Novartis Pharma AG – NITAS/TMS
Where the story of this project started ...
A day in October 2008
Some time around 7:45
in the morning ...
Novartis Campus
Dreirosenbrücke
Vision for text mining
Integration of chemical and biological knowledge
Mining for Chemical Knowledge - Rationale
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research based
on Scientific Papers or Patents
- Link Chemical Information with further annotation in an
automated way for e.g. Chemogenomics applications
- Patent analysis for MedChem projects
Connection table
Mining for chemical Knowledge - Rationale
Information on compounds
targeting GPCRs
HELP
Information
explosion
Source: Banville, Debra L. “Mining chemical structural information from the drug
literature.” Drug Discovery Today, Number 1/2 Jan. 2006, p.35-42
Example:
Project Prospect – Royal Society of Chemistry
 Enhancing Journal Articles with Chemical Features
This helps you identify other articles
talking about the same molecule
Mining for Chemical Knowledge – Focus for today
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research based
on Scientific Papers or Patents
- Link Chemical Information with further annotation in an
automated way for e.g. Chemogenomics applications
- Patent analysis for MedChem projects
Connection table
A use case for successful patent mining
(molecules you sometimes find in your inbox ;-) )
Sildenafil (1998, Pfizer): € 11.7 billion (USD 15.1 billion)
Vardenafil (2003, Bayer): € 1.24 billion (USD 1.6 billion)
Slide inspired by an example from Steve Boyer/IBM;
sales data from the Prous Integrity database
Conventional Database Building
Facts – current standard
... (ACS) owes most of its wealth to its two 'information
services' divisions — the publications arm and the
Chemical Abstracts Service (CAS), a rich database of
chemical information and literature. Together, in 2004,
these divisions made about $340 million — 82% of the
society's revenue — and accounted for $300 million (74%)
of its expenditure. Over the past five years, the society has
seen its revenue and expenditure grow steadily ...
Source: ACS homepage
Facts
Established application
Straightforward use
De-facto Gold standard
Unique data source
Very costly
No structure export for reasonable price
Very limited in large-scale follow-up analysis
Most recent patents not available
Not data (search), but integration, analysis and
insight, leading to decisions and discovery
Now – What would be the perfect solution?
All patent offices require applicants to
provide all claimed structures
in a machine-readable version,
available for one-click download
Text extraction
Definition:
Extract all molecules that
are mentioned in a patent
text of interest, convert
them to structures and
make them available in
machine-readable format
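As a minimal sketch of this definition (not the chemicalize.org implementation), the two steps can be illustrated with a tiny hand-made dictionary: find known chemical names in a text, then map each one to a machine-readable structure, here a SMILES string. The dictionary `NAME_TO_SMILES`, the function `extract_structures` and the sample sentence are hypothetical; real tools parse systematic names rather than relying on a lookup table.

```python
import re

# Hypothetical mini-dictionary: chemical name -> SMILES string.
NAME_TO_SMILES = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
}

def extract_structures(text):
    """Return (name, SMILES) pairs for each dictionary name found, in order of appearance."""
    # Try longer names first so multi-word names are preferred over shorter ones.
    names = sorted(NAME_TO_SMILES, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in pattern.finditer(text):
        name = match.group(1).lower()
        hits.append((name, NAME_TO_SMILES[name]))
    return hits

sample = "The residue was dissolved in ethanol and treated with acetic acid."
for name, smiles in extract_structures(sample):
    print(name, smiles)
```

A lookup table only covers names it already knows, which is why the name-to-structure converters listed in this talk matter for patents full of novel compounds.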
Mining for Chemical Knowledge
Technologies from providers
Text entity recognition
(a) Extractors (IUPAC names)
- TEMIS Chemical Entity Relationships Skill Cartridge
- Accelrys Pipeline Pilot extractor (Notiora)
- Fraunhofer (ProMiner Chemistry)
- ChemAxon (chemicalize.org)
- Oscar (Corbett, Murray-Rust et al.)
- SureChem
- IBM ChemFrag Annotator
(b) Converter (names → connection table)
- CambridgeSoft name=struct
- OpenEye Lexichem
- ChemAxon
Image recognition
- OSRA (NIH)
- CLiDE Pro (Keymodule Ltd.)
- Fraunhofer chemoCR
- ChemReader
The objective
To provide NIBR scientists with a tool that offers
sophisticated text analysis methods and
thereby leverages the methods of TMS
Mining for Chemical Knowledge – Novartis Tools – the
chemicalize-technology is working under the hood!
Clipboard Analysis
Identified structures
Patent text
View structure onMouseOver
Export to other applications
Mining for Knowledge – Novartis Tools
Input example: J Med Chem Paper
Mining for Chemical Knowledge – Use Case
Medicinal Chemist wants to synthesize competitor
compound as tool compound for own project
This enables the identification of compounds most
representative of a competitor patent
- Identification of core scaffold
- Analysis of substitution patterns
Example – A text-based patent
A patent example
Automated text extraction: 452 compounds
Reference: 636 compounds
71%
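The 71% figure is consistent with dividing the extracted count by the reference count. A minimal sketch of that arithmetic follows; the function name `extraction_recall` is hypothetical, and it assumes that every extracted compound also appears in the reference set, which the slide does not state.

```python
def extraction_recall(extracted, reference):
    """Fraction of reference compounds recovered by automated extraction.

    Assumes every extracted compound is also present in the reference set.
    """
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return extracted / reference

recall = extraction_recall(452, 636)
print(f"{recall:.0%}")
```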
Example – An image-based patent
- Text extraction is not suitable for this case: it finds only a
meager 40 molecules against 1129 in the reference. Why?
An entirely image-based patent example
Language issues – e.g. Japanese patents
Encountered problems
- OCR (Optical Character Recognition)!!
- USPTO and WIPO patents are now available as full text in most cases
- Typos!
- Name2Struct problems (less of an issue here)
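One common way to soften OCR errors and typos is fuzzy matching against a list of known names. The sketch below uses Python's standard `difflib` module and is not the approach of any tool named in this talk; the name list `KNOWN_NAMES`, the function `correct_name` and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical list of names the system already knows.
KNOWN_NAMES = ["sildenafil", "vardenafil", "piperazine", "pyrimidine"]

def correct_name(token, cutoff=0.8):
    """Return the closest known name for a possibly garbled token, or None."""
    matches = difflib.get_close_matches(token.lower(), KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(correct_name("si1denafil"))  # OCR read the letter 'l' as the digit '1'
print(correct_name("vardenafll"))  # typo in the last syllable
print(correct_name("patent"))      # not a chemical name, so no match is expected
```

This only helps for names that are already in the list; garbled systematic names of novel compounds would need correction at the name-to-structure stage instead.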
IBM initiative
Patent Mining / ChemVerse database (Steve Boyer)
- The objective is to automatically extract all molecules from
all available patents and make them searchable in a
database
- They leverage cloud computing and have access to all full-text patents
- This is going in absolutely the right direction
- They annotate the molecules with information from freely
available databases
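To make the idea of a searchable molecule database concrete, here is a toy sketch using an in-memory SQLite table. It is not a description of the IBM system; the table layout, the placeholder patent identifiers and the function `patents_mentioning` are all hypothetical. It matches SMILES strings exactly, so it assumes structures were canonicalized before storage; a production system would use proper structure and substructure search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecule (patent_id TEXT, name TEXT, smiles TEXT)")

# Placeholder rows; the patent identifiers are made up for illustration.
rows = [
    ("PATENT-A", "ethanol", "CCO"),
    ("PATENT-A", "benzene", "c1ccccc1"),
    ("PATENT-B", "ethanol", "CCO"),
]
conn.executemany("INSERT INTO molecule VALUES (?, ?, ?)", rows)

def patents_mentioning(smiles):
    """Return the sorted list of patents containing an exact SMILES match."""
    cur = conn.execute(
        "SELECT DISTINCT patent_id FROM molecule WHERE smiles = ? ORDER BY patent_id",
        (smiles,),
    )
    return [row[0] for row in cur]

print(patents_mentioning("CCO"))
```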
Future ideas: Patent Analysis
- Markush translation, Image+Target
- Ranking capabilities of outcome for user
- “Blurred” dictionaries for translating terms such as aryl, cycloalkyl, etc.
- Select → annotate as entity → on-the-fly error correction
- Result goes into a database → crowdsourcing efforts to
improve and store results
- Suggest functionality
To enable true Patinformatics analyses ...
Definition by Tony Trippe:
Acknowledgements
NITAS/TMS
Therese Vachon
Daniel Cronenberger
Pierre Parisot
Martin Romacker
Nicolas Grandjean
Alex Fromm
Katia Vella
Olivier Kreim
Clayton Springer
Naeem Yusuff
Bharat Lagu
And many other people in different divisions of NIBR for their support