- ChemAxon

Chemical Entity extraction using the chemicalize.org-technology
Josef Scheiber
Novartis Pharma AG – NITAS/TMS

Where the story of this project started ...
- A day in October 2008
- Some time around 7:45 in the morning ...
- Novartis Campus
- Dreirosenbrücke

Vision for textmining
- Integration of chemical and biological knowledge

Mining for Chemical Knowledge - Rationale
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research, based on scientific papers or patents
- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
- Patent analysis for MedChem projects
- Connection table

Mining for Chemical Knowledge - Rationale
- Information on compounds targeting GPCRs
- HELP
- Information explosion
Source: Banville, Debra L. “Mining chemical structural information from the drug literature.” Drug Discovery Today, Number 1/2 Jan. 2006, p.35-42

Example: Project Prospect – Royal Society of Chemistry
- Enhancing journal articles with chemical features
- This helps you identify other articles talking about the same molecule

Mining for Chemical Knowledge – Focus for today
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research, based on scientific papers or patents
- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
- Patent analysis for MedChem projects
- Connection table

A use case for successful patent mining (molecules you sometimes find in your inbox ;-) )
- Sildenafil (1998, Pfizer) – € 11.7 billion (USD 15.1 billion)
- Vardenafil (2003, Bayer) – € 1.24 billion (USD 1.6 billion)
Slide inspired by an example from Steve Boyer/IBM; sales data from Prous Integrity database

Conventional Database Building

Facts – current standard
... (ACS) owes most of its wealth to its two 'information services' divisions: the publications arm and the Chemical Abstracts Service (CAS), a rich database of chemical information and literature.
Together, in 2004, these divisions made about $340 million (82% of the society's revenue) and accounted for $300 million (74%) of its expenditure. Over the past five years, the society has seen its revenue and expenditure grow steadily ...
Source: ACS homepage

Facts
- Established application
- Straightforward use
- De-facto gold standard
- Unique data source
- Very costly
- No structure export for a reasonable price
- Very limited in large-scale follow-up analysis
- Most recent patents not available
Not data (search), but integration, analysis and insight, leading to decisions and discovery

Now – What would be the perfect solution?
- All patent offices require that all claimed structures be provided in a machine-readable version, available for one-click download
- → Text extraction

Definition: Extract all molecules that are mentioned in a patent text of interest, convert them to structures and make them available in machine-readable format

Mining for Chemical Knowledge: Technologies from providers

Text entity recognition
(a) Extractors (IUPAC names)
- TEMIS Chemical Entity Relationships Skill Cartridge
- Accelrys Pipeline Pilot extractor (Notiora)
- Fraunhofer (ProMiner Chemistry)
- ChemAxon (chemicalize.org)
- Oscar (Corbett, Murray-Rust et al.)
- SureChem
- IBM ChemFrag Annotator
(b) Converters (names → connection table)
- CambridgeSoft name=struct
- Openeye Lexichem
- ChemAxon

Image recognition
- OSRA (NIH)
- Clide Pro (Keymodule Ltd.)
- Fraunhofer chemoCR
- ChemReader

The objective
To provide a tool that offers sophisticated text analysis methods to NIBR scientists and thereby leverages the methods of TMS

Mining for Chemical Knowledge – Novartis Tools
- The chemicalize technology is working under the hood!
- Clipboard analysis
- Identified structures
- Patent text
- View structure onMouseOver
- Export to other applications

Mining for Knowledge – Novartis Tools
- Input example: J Med Chem paper

Mining for Chemical Knowledge – Use Case
- Medicinal chemist wants to synthesize a competitor compound as a tool compound for their own project
- This enables the identification of the compounds most representative of a competitor patent
- Identification of core scaffold
- Analysis of substitution patterns

Example – A text-based patent
- A patent example
- Automated text extraction: 452 compounds
- Reference: 636 compounds
- 71%

Example – An image-based patent
- Text extraction is not suitable for this case: it finds only a meager 40 molecules, against 1129 in the reference. Why?
- An entirely image-based patent example
- Language issues, e.g. Japanese patents

Encountered problems
- OCR (Optical Character Recognition)!!
- USPTO and WIPO patents are now available as full text in most cases
- Typos!
- Name2Struct problems (less of an issue here)

IBM initiative: Patent Mining / ChemVerse database (Steve Boyer)
- The objective is to automatically extract all molecules from all available patents and make them searchable in a database
- They leverage cloud computing and have access to all full-text patents
- This is going in absolutely the right direction
- They annotate the molecules with information from freely available databases

Future ideas: Patent Analysis
- Markush translation, Image+Target
- Ranking capabilities of outcome for the user
- "Blurred" dictionaries for translating terms like aryl, cycloalkyl etc.
- Select → annotate as entity → on-the-fly error correction
- Result goes in a database
- Crowdsourcing efforts to improve and store results
- Suggest functionality

To enable true Patinformatics analyses ...
Definition by Tony Trippe:

Acknowledgements

NITAS/TMS
- Therese Vachon
- Daniel Cronenberger
- Pierre Parisot
- Martin Romacker
- Nicolas Grandjean
- Alex Fromm
- Katia Vella
- Olivier Kreim

- Clayton Springer
- Naeem Yusuff
- Bharat Lagu

And many other people in different divisions of NIBR for their support
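
Backup: a minimal sketch of text extraction and its evaluation

The text extraction definition in this deck (find the molecules named in a text, convert them to structures, deliver them in machine-readable form) and the comparison against a reference list can be illustrated with a few lines of Python. This is only a toy sketch under stated assumptions, not the chemicalize.org technology or the Novartis tool: the name-to-SMILES dictionary, the function names and the sample sentence are all hypothetical, and real extractors handle IUPAC names, typos and OCR errors that a plain dictionary lookup cannot. The last two lines assume that the 71% on the text-based patent slide is simply 452 divided by 636.

```python
import re

# Hypothetical toy dictionary (an assumption for illustration only):
# compound name -> SMILES string, standing in for a real name-to-structure converter.
NAME_TO_SMILES = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
}


def extract_structures(text, dictionary=NAME_TO_SMILES):
    """Return {name: SMILES} for every dictionary name that occurs in the text."""
    found = {}
    for name, smiles in dictionary.items():
        # Whole-word, case-insensitive match, so "ethanol" does not hit "methanol".
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found[name] = smiles
    return found


def recall(extracted, reference):
    """Fraction of the reference compounds that were also extracted."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(set(extracted) & reference) / len(reference)


def coverage(n_extracted, n_reference):
    """Ratio of extracted to reference compound counts."""
    return n_extracted / n_reference


if __name__ == "__main__":
    sample = "Aspirin and caffeine were dissolved in ethanol."
    found = extract_structures(sample)
    for name, smiles in found.items():
        print(name, smiles)

    # Hypothetical reference list; "paracetamol" is not in the toy dictionary.
    reference = {"aspirin", "caffeine", "ethanol", "paracetamol"}
    print(recall(found, reference))       # 0.75

    # Counts taken from the two patent examples in this deck.
    print(f"{coverage(452, 636):.0%}")    # 71%
    print(f"{coverage(40, 1129):.1%}")    # 3.5%
```

The same ratio applied to the image-based patent (40 of 1129) gives about 3.5%, which puts a number on why text extraction alone is not suitable for that case.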
