Chemisches Zentralblatt

Digitalization and Chemical Entity Recognition of Chemisches Zentralblatt:
Unrivaled Historical Information Meets Modern Technology
M. Brändle (ETH Zürich), V. Eigner-Pitto (InfoChem GmbH)
InfoChem / ETH Zürich Copyright © 2009
Fraunhofer Symposium on Text Mining, Bonn, October 5-6, 2009
Brändle, Eigner Pitto
Historical Importance of Chemisches Zentralblatt
1830-1969 Chemisches Zentralblatt
• First and oldest abstracts journal in chemistry
• Covers chemical literature from 1830 to 1969
• Describes the "birth" of chemistry as a science (vs. alchemy)
1907-… Chemical Abstracts (coverage extended back to 1840)
• Largest and sole abstracts source in chemistry
• Currently >31 million papers and patents
• Content 1840-1906 added retrospectively
1817-… Gmelin Handbook (literature covered from 1772)
1881-… Beilstein Handbook (literature covered from 1771)
Chemisches Zentralblatt: Content
• Covers 140 years of chemistry
• About 3.6 million abstracts
  • journal articles
  • patents
• 900,000 pages (115,000 for time period 1830-1906)
  • 700,000 pages with abstracts
  • 200,000 pages of indexes ("Register")
    • Author (from 1830)
    • Subject
      • alphabetic (from 1830)
      • systematic (from 1863)
    • Patent (from 1897)
    • Formula (from 1925)
    • General indexes (from 1883)
History of Chemisches Zentralblatt: Rise
1830 "Pharmaceutisches Central-Blatt": 403 abstracts, 544 pages, 10 journals; weekly after 8 months.
1850 Title changes to "Chemisch-Pharmaceutisches Central-Blatt"
1856 Title changes to "Chemisches Central-Blatt"
1864 Introduction of a systematic table of contents → classification of chemistry
1879 First patent abstracts in "kleine Mittheilungen" (short communications)
1883 1st edition of General Index
1884 In-text images
1888 273 journals excerpted
History of Chemisches Zentralblatt: Prosperity
1897 Ownership passes to Deutsche Chemische Gesellschaft for DM 15,000. Introduction of patent index.
1901 Editorial office moves from Leipzig to Berlin.
1919 Takes over abstracts from Angew. Chem. Split into scientific (I/III) and technical part (II/IV).
1921 Begins to cover foreign patents.
1924 CZ is reunified into one journal of abstracts.
1925 Introduction of formula index.
1929 Centennial: Richard Willstätter emphasizes "timeliness, exactness, completeness" as attributes of and requirements for the quality of CZ.
History of Chemisches Zentralblatt: Decline
1940-1945 WW II: Difficulties in collecting information. 1944 bombing of editorial office.
1947-1949 Editorial offices in East Berlin and West Berlin.
1950 Double production of CZ in East and West Germany.
1954 Reunification of CZ under East and West German organisations. Trying to fill gap by supplement volumes.
1961 Berlin Wall does not hinder production.
1967 Introduction of SRD (Schnellreferatedienst, quick abstract service) for organic chemistry.
1969 GDR office declares itself unable to afford production of SRD and of the journal. CZ ceases publication. SRD is continued as "Chemischer Informationsdienst" (ChemInform).
Chemisches Zentralblatt vs. CA: Quantity
[Figure: number of abstracts and number of pages over time for Chemisches Zentralblatt and CA, with WW I, WW II and a CA format change marked]
Chemisches Zentralblatt vs. CA: Quality
• Many textbooks on chemical literature claim that Chemisches Zentralblatt was of better quality than CA before WW II
  • H. Skolnik, The literature matrix of chemistry, 1982: "outstanding A/I service"
  • R.E. Maizell, How to find chemical information, 3rd ed. 1998, citing E.J. Crane: "[..] has value because of [..] good abstracts"
  • M. Mücke, Die chemische Literatur, 1982 (translated from German): "Although CA was numerically [..] superior to Chemisches Zentralblatt, the reverse was true as far as the quality of the abstracts was concerned."
  • R.T. Bottle, J.F. Rowland, Information Sources in Chemistry, 4th ed. 1993: "Before WW II, many chemists regarded CZ as superior in coverage to CA; its abstracts were longer and more informative [...]"
  • A.S.K. Atsu, Comparative coverage of chemical abstracting services in the period 1906-1940, M. Sc. Thesis, City University, London (1976)
Chemisches Zentralblatt vs. CA: Quality
Example: Hans Fischer, Georg Stangler, Synthese des Mesoporphyrins, Mesohämins und über die Konstitution des Hämins, Justus Liebigs Ann. Chem. 459 (1927), 53-98.
                      CZ I (1928), 528    CA 22:11339 (1928), 1363
Length (pages)        7.5                 1
Length (words)        3,882               690
Length (chars)        24,308              4,695
Compounds             ~ 120               ~ 70
Structure formulas    ✔                   ✕
Chemisches Zentralblatt: Digitalization
• Relevant for documentation of prior art
• Continuous and growing demand for the information
• FIZ Chemie Berlin has scanned the whole work and offers a full-text searchable database for the web and the dataset for integration in intranets
• ETH Zurich has bought the digitalized raw material (PDFs with OCRed text in the background) from FIZ and is creating a database offering full-text search
  • 900,000 PDF pages, 1.3 TB
  • Raw text content incl. search index about 10 GB
• CAS has performed automatic translation (German → English) of the 1897-1907 volumes and included them in CAplus
Reasons for buying digitalized Chem. Zentralblatt
www.infochembio.ethz.ch/en/holdings.html
• Space
  • Loss of compact shelving space in basement (432 m → 194 m, -55%)
  • Disposal of printed Beilstein, CA, Chem. Zentralblatt
• Access
  • e-books, e-journals, end-user databases at workbench of chemist
  • Chemists are trained on electronic sources; print and microfilm are cumbersome
• Restoration costs due to deterioration of acid-containing paper
  • 17K€/t for deacidification: Chem. Zentralblatt 1.6 t → 27K€
  • Digitalization and operation costs much higher (10x), but can be split
• Ease of use: Search / Browse / Print
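The space and cost figures above can be checked with a few lines of arithmetic. The following is a minimal sketch; the variable names are illustrative, and the digitalization estimate simply applies the slide's rough "10x" factor, so it is an order-of-magnitude assumption rather than a quoted price.

```python
# Back-of-the-envelope check of the figures on this slide.
DEACID_COST_PER_TONNE_EUR = 17_000  # 17K€/t, from the slide
CZ_MASS_TONNES = 1.6                # printed Chem. Zentralblatt, from the slide
DIGITALIZATION_FACTOR = 10          # "much higher (10x)": rough factor, assumption

deacidification_eur = DEACID_COST_PER_TONNE_EUR * CZ_MASS_TONNES
digitalization_eur = deacidification_eur * DIGITALIZATION_FACTOR

# Shelving loss: 432 m before, 194 m after.
shelf_loss_pct = (432 - 194) / 432 * 100

print(f"Deacidification:        {deacidification_eur:,.0f} EUR")
print(f"Digitalization (rough): {digitalization_eur:,.0f} EUR")
print(f"Shelving lost:          {shelf_loss_pct:.0f} %")
```

The deacidification result rounds to the 27K€ stated on the slide, and the shelving loss rounds to the stated 55%.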
Quality of Obtained Raw Data
• Errors upon conversion
• Visual inspection of pages: Cover Flow / Quick Look technology
Quality of Raw Data Observed: Page Errors
• File errors (conversion)
  • Unreadable directories (missing content)
  • Defective PDF files (missing content)
• Errors during scanning (visual inspection)
  • Duplicate pages (shifting page index)
  • Missing pages (shifting page index, missing content)
  • Issues scanned in wrong order (minor)
  • Two pages on one (shifting page index)
  • Wrong volume (missing content)
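Several of the scanning errors above show up as irregularities in the sequence of printed page numbers. The sketch below assumes that the printed page number of each scan is already known (for example from the OCR text of the page header); the function name and the input format are hypothetical and not part of the ETH or FIZ tooling.

```python
from collections import Counter

def find_page_index_errors(printed_numbers):
    """Report duplicate, missing and out-of-order pages.

    printed_numbers: the page number printed on each scanned page,
    listed in the order the scans appear in the file (assumed input).
    """
    if not printed_numbers:
        return {"duplicates": [], "missing": [], "out_of_order": []}
    counts = Counter(printed_numbers)
    # Pages scanned more than once shift the page index forward.
    duplicates = sorted(n for n, c in counts.items() if c > 1)
    # Pages absent from the expected range shift it backward.
    expected = range(min(printed_numbers), max(printed_numbers) + 1)
    missing = [n for n in expected if n not in counts]
    # Neighbouring scans whose printed numbers decrease.
    out_of_order = [
        (a, b) for a, b in zip(printed_numbers, printed_numbers[1:]) if b < a
    ]
    return {"duplicates": duplicates, "missing": missing, "out_of_order": out_of_order}

report = find_page_index_errors([101, 102, 102, 103, 105, 104, 107])
print(report)
```

A check like this only flags candidates; the slide's visual inspection is still needed to decide what actually happened to a page.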
Quality of Raw Data Observed: OCR
• ETH works with OCR from FIZ Chemie
• Page → word index, 346 million "words"
• 8.8% with only 1 character
  • slightly expanded fonts, e.g. for author names, molecular formulas
  • abbreviations (journal names, Zentralblatt = C), numbers
  • element symbols in structure formulas
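To put the 8.8% figure in absolute terms, and to show how such a share can be measured on a token list, here is a small sketch. The helper name and the sample string are made up for illustration; only the two constants come from the slide.

```python
def single_char_share(words):
    """Fraction of tokens that consist of exactly one character."""
    words = list(words)
    if not words:
        return 0.0
    return sum(1 for w in words if len(w) == 1) / len(words)

# Figures from the slide.
TOTAL_WORDS = 346_000_000
SINGLE_CHAR_SHARE = 0.088
single_char_words = TOTAL_WORDS * SINGLE_CHAR_SHARE  # roughly 30 million tokens

# Invented sample mimicking a formula split by OCR into single characters.
sample = "C 6 H 5 . Benzol wird in A. gelöst".split()
print(f"{single_char_words:,.0f} single-character tokens overall")
print(f"sample share: {single_char_share(sample):.2f}")
```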
Planned Tasks ETH Zürich
• Adding navigation structure, providing DB search and browse for ETH members (Q4/09)
• Mining and markup (Q1/10)
  • Bibliographic references
  • Authors
  • General subject headings
• Reference linking to journal articles and patents (Q1/10)
Chemisches Zentralblatt: Conclusion
• Covers chemical literature from 1830 to 1969
• Very good abstract quality
• Better quality (length, details) than CA for pre-WW II period 1907-1940
• Also contains important patent information
• Invaluable information in indexes (e.g. synonyms of ancient chemical names)
• Only comprehensive abstract journal on the market up to 1907
• More comprehensive than CA for 19th century literature
• Complements Beilstein and Gmelin handbooks for 19th century literature
Importance of Chemisches Zentralblatt: Example
Org. Lett., 2006, 8 (19), pp 4279–4281
The authors retracted this paper on November 15, 2007 (Org. Lett. 2007, 24, 5139)
Chemisches Zentralblatt, 1904, 2, 1145
InfoChem Motivation
• Text search in Chemisches Zentralblatt:
  • Abstracts in German language
  • High number of old German chemical names
• Chemists think in structures!!!
• Language-independent structure search would help ALL scientists to access this historical source and to use the relevant information it contains
• Required technology for structure search projects
  • Optimized German-English dictionaries
  • 30 million SPRESI names
Overview of Approach and Applied Technology
[Diagram: processing workflow. .tiff documents → OCR → PDF documents (text under image) → NER (ICANNOTATOR, SPRESI dictionaries) → N2S (name to structure) → database with link to original literature → combined search on federated search system (ICFEDSEARCH). A manually abstracted sample set is used for quantitative comparison and evaluation.]
Challenges OCR (1)
[Figure: sample page scans from 1830, 1870, 1910, 1930 and 1969]
Challenges OCR (2)
• Poor quality of original source:
  • dirty (blotted, stained) pages
  • print showing through from the back of the page
Challenges OCR (3)
• Tables: extremely small fonts; beginning and end of columns not recognizable
Challenges OCR (4)
• Ambiguous old fonts (h=b; c=e; ligatures)
• Spaced text
Specific rules, large German dictionaries and extensive training are applied to correct systematic mistakes of the standard OCR process
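The combination of rules and dictionaries can be illustrated with a toy post-correction step. This is a minimal sketch and not the InfoChem implementation: the confusion table only contains the two glyph pairs named above, the dictionary is a four-word stand-in, and both function names are hypothetical.

```python
import re
from itertools import product

# Glyph pairs named on the slide as easily confused in old fonts.
CONFUSABLE = {"h": "hb", "b": "bh", "c": "ce", "e": "ec"}

# Tiny stand-in for the "large German dictionaries" (hypothetical content).
DICTIONARY = {"Säure", "Kohlenstoff", "Schwefel", "Fischer"}

def correct_word(word, dictionary=DICTIONARY):
    """Return a dictionary word reachable by swapping confusable glyphs.

    Tries every combination of swaps, so the cost grows as 2**k for k
    confusable letters; acceptable for a sketch, not for production.
    """
    if word in dictionary:
        return word
    options = [CONFUSABLE.get(ch, ch) for ch in word]
    for candidate in map("".join, product(*options)):
        if candidate in dictionary:
            return candidate
    return word  # no dictionary match: leave the OCR output untouched

# Three or more single letters separated by single spaces.
SPACED = re.compile(r"(?<!\S)(?:[^\W\d_] ){2,}[^\W\d_](?!\S)")

def collapse_spaced(text):
    """Join letterspaced runs such as 'F i s c h e r' into one word."""
    return SPACED.sub(lambda m: m.group(0).replace(" ", ""), text)

print(correct_word("Säurc"))
print(correct_word("Koblenstoff"))
print(collapse_spaced("von F i s c h e r gefunden"))
```

One known limitation of the spacing rule: two letterspaced words separated by only a single space would be merged into one, so real data would need a wider-gap heuristic or a dictionary-based split.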
Challenges Annotation (1)
• Names lack position, valence or stoichiometric information
  • Pimarsäure (pimaric acid) → is it the R or L form?
  • Platinchlorid (platinum chloride) → in which oxidation state II, III, IV?
• Chemical names that indicate a chemical class
  • Nitrolsäure (nitrolic acid) → R-C(=N-OH)-NO2
  • Lactonsäure (lactonic acid) → any of several acids with a lactone ring bearing the carboxylic group
• Mixed compounds
  • Eunole → Naphthole + Eucalyptusöl (naphthols + eucalyptus oil)
  • Pikrotoxin (picrotoxin) → Pikrotoxinin + Pikrotin (picrotoxinin + picrotin)
NO solution: correct structure information is not available in the original source
Challenges Annotation (2)
• Obsolete German language
  • Schwefelsaures Natrium, Chlorür, Bromür
• Historical names
  • Pelopeum → Columbium → Niobium
• Different spellings for the same name:
  • Dibrom… / Bibrom…
  • Ätzkali / Aetzkali
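Spelling variants and historical names of this kind can be folded onto one lookup key before a dictionary search. The sketch below uses only the examples from this slide; the rule list, the mapping table and the function name are illustrative assumptions, and real data would need far more rules with exceptions.

```python
import re

# Illustrative entries only, taken from the examples on the slide.
HISTORICAL_NAMES = {"pelopeum": "niobium", "columbium": "niobium"}

# Naive spelling rules, applied to the lowercased name.
SPELLING_RULES = [
    (re.compile(r"^ae"), "ä"),            # Aetzkali -> Ätzkali
    (re.compile(r"^bi(?=brom)"), "di"),   # Bibrom... -> Dibrom...
]

def normalize_name(name):
    """Map a name to a lowercase key shared by its known variants."""
    key = name.strip().lower()
    for pattern, replacement in SPELLING_RULES:
        key = pattern.sub(replacement, key)
    return HISTORICAL_NAMES.get(key, key)

for variant in ["Aetzkali", "Ätzkali", "Bibrombenzol", "Dibrombenzol", "Columbium"]:
    print(variant, "->", normalize_name(variant))
```

Normalizing to a key rather than rewriting the text keeps the original spelling available for display and for linking back to the scanned page.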
Solutions in Annotation Process
• Correction of German-specific grammar
• Translation into English of chemical names not yet available in the dictionaries
• Research in old sources:
  • Beilstein
  • Brockhaus Encyclopedia
  • German-English dictionaries of chemistry
  • Meyers Encyclopedia
  • Pierer Encyclopedia
• References to very old books, journals, articles
  • "Naturwissenschaftliche Exzerpte und Notizen Mitte 1877 bis Anfang 1883" by Karl Marx
Results Annotation Chemisches Zentralblatt
• 120,000 pages covering time period 1830-1907
• 2.4 million chemical names with associated structure
  • 98,000 unique names
  • 47,000 unique structures
Quantitative comparison with manually abstracted sample set
• Recall 51%
• Precision 87%
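For readers less familiar with the two measures, the sketch below shows how recall and precision are computed when automatically found names are compared with a manually abstracted reference set. The toy sets are invented for illustration; the F1 score is a derived figure that does not appear in the original slides and is added here only as a single-number summary.

```python
def precision_recall(found, reference):
    """Compare automatically found items with a manual reference set."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented toy example: 4 names found, 6 names in the manual reference.
found = {"Benzol", "Essigsäure", "Ätzkali", "Wasser"}
reference = {"Benzol", "Essigsäure", "Ätzkali", "Anilin", "Phenol", "Toluol"}
p, r = precision_recall(found, reference)
print(f"toy example: precision {p:.2f}, recall {r:.2f}")

# Reported values from the slide.
f1_reported = f1_score(0.87, 0.51)
print(f"F1 for the reported values: {f1_reported:.2f}")
```

High precision with moderate recall means that the names which are annotated are mostly correct, while roughly half of the manually identified compounds are not yet found.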
Federated Search Prototype
[Screenshots of the federated search prototype (three slides)]
Summary
• Described the history, content and present-day importance of Chemisches Zentralblatt
• Illustrated how the challenges of OCR and the annotation process have been addressed
• Time period 1830-1907 contains 98,000 unique names and 47,000 unique structures
• Quantitative comparison shows 51% recall and 87% precision
• The generated structure-searchable Chemisches Zentralblatt database is integrated in ICFEDSEARCH
Outlook
Chemisches Zentralblatt:    Phase 1, Q2 2009    Phase 2, Q4 2009
Pages:                      120,000             900,000
Time period:                1830-1907           1830-1969
Unique names:               98,000              ca. 1 million
Unique structures:          47,000              ca. 500,000
Recall:                     50%                 ?
Acknowledgements
• Prof. Dr. Deplanque, Mr. Heineke and FIZ Chemie Team Berlin
• Ms. Langanke
• InfoChem Team
• Chemistry Biology Pharmacy Information Center (ETH Zürich)
Thank you!
ETH Zürich: www.infochembio.ethz.ch, braendle@chem.ethz.ch
InfoChem GmbH: www.infochem.de, www.spresi.com, info@infochem.de