Maximizing the Use of the Lawson Number in Beilstein Searching

Gary Wiggins and Usha Coca
School of Informatics
Indiana University
ACS CERM, June 4, 2004
Abstract
In the Beilstein database, the Lawson Number is based on the
Beilstein System for classifying organic substances. Every
substance in the Beilstein file has at least one Lawson
Number, and the smaller the Lawson Number, the more
common is the fragment. While the Lawson Number is a
searchable field, searching with Lawson Numbers is not
equivalent to substructure or Markush searching. Since the
Lawson Numbers represent certain structural fragments, they
can be used for structural similarity searches. Searches that
include the Lawson Number are effective when used in
combination with other search keys, such as molecular
formula, element ranges, etc. The Lawson Number is also useful when
combined with NOT in substructure searches. Thus, the Lawson Number
could serve as an effective index search key if its meaning
were known. We have developed a prototype system that
could be interfaced with the CrossFire system for effective use
of the Lawson Numbers in searching. The system will be
described and demonstrated.
Friedrich Konrad Beilstein
(1838-1906)
Beilstein Handbook of Organic
Chemistry in 27 “volumes”
Series                            Abbrev.     Coverage
Basic                             H           up to 1910
Sup Ser I                         E I         1910-1919
Sup Ser II                        E II        1920-1929
Sup Ser III (v. 1-16 only)        E III       1930-1949
Sup Ser III/IV (v. 17-27 only)    E III/IV    1930-1959
Sup Ser IV (v. 1-16 only)         E IV        1950-1959
Sup Ser V                         E V         1960-1979 (English)
Psst! Want a good, cheap set
of Beilstein?
We have finally decided that our cramped Chemistry Library can no longer
afford the luxury of retaining our Beilstein print collection (which has
probably not been touched for several years now, since we acquired the
online version). We hope we can find a new home for the collection (all
437 volumes, plus a handful of how-to-use-it texts), otherwise it must
be discarded. Any organization willing to pay the shipping costs is
welcome to this collection. If interested, please contact me directly.
Howard M. Dess
Chemistry and Physics Librarian
Library of Science and Medicine
Rutgers University
Piscataway, NJ 08854-8009
Source: CHMINF-L, June 1, 2004
Beilstein Handbook:
Arrangement of Compounds
 Beilstein: a collection of critically evaluated
data on organic compounds arranged in a
classified manner
 Arrangement:
– Acyclic Compounds, Volumes 1-4
– Isocyclic Compounds, Volumes 5-16
– Heterocyclic Compounds, Volumes 17-27
– Divided into System Numbers 1-4720
 Each Supplementary Series (E) volume
contains the same classes of compounds as
the corresponding Basic (H) volume
System Number Meaning
 Beilstein Institute never published the
meanings of the System Numbers
 System Number 3691 means
"heterocyclic carbon frameworks with
exactly 2 N ring atoms with a
combination of exactly 2 hydroxy groups
and 1 carboxylic acid group"
Placement of Info in Beilstein:
Registry (Index) Compounds
 Stem nuclei: Hydrocarbons, saturated followed by unsaturated
 Oxy = Hydroxy compounds: alcohols (OH)
 Oxo = Carbonyl compounds: aldehydes and ketones (C=O)
 Carboxylic Acids (COOH)
 Sulfinic Acids (SO2H)
 Sulfonic Acids (SO3H)
 Chalcogen Oxoacids (XO2H, XO2OH); X = S, Se, Te
 Amines (NH2)
 Hydroxylamines (NHOH) & Dihydroxylamines (N(OH)2)
 Hydrazines (NHNH2)
 Azo compounds (N=NH)
 More complex N functionalities
 Groups containing other elements (P, As, Si, Mg, etc.)
Beilstein System Algorithm 1
 Beilstein “hydrolysis” scheme based on
an instinctive chemical classification as
perceived by an organic chemist
 Carbons with more than one (non-ring)
heteroatom attached are always
regarded as derived from carbonyl
groups, if:
– at least one of the heteroatoms is other
than the attachment atom of a substituent
(halogen, nitro, nitroso, azide)
Beilstein System Algorithm 2
 Splits any molecule into a set of fragments
 Splitting points are C-Q bonds, where Q is a
heteroatom that does not belong to a ring in
common with the C in question
 Fragments then classified and coded using
– skeletal features
– type and multiplicity of chemical functional groups
(including masked groups)
– degree of unsaturation
– carbon number
(See "Notes for Users" at the start of each Beilstein
volume published from about 1992 onwards.)
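The splitting step can be illustrated with a small graph routine. This is only a sketch of the rule as worded on the slide, not the Beilstein implementation: the atom/bond representation and the function names are assumptions, hydrogens are left implicit, and the substituent exceptions (halogen, nitro, nitroso, azide) and the later classification and coding steps are not modeled. A C-Q bond is treated as a splitting point when its two atoms share no ring, tested by checking that removing the bond disconnects them.

```python
from collections import deque


def _reachable(start, adjacency, skip_bond):
    """Return the atoms reachable from `start` without crossing `skip_bond`."""
    seen = {start}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if frozenset((atom, nbr)) == skip_bond or nbr in seen:
                continue
            seen.add(nbr)
            queue.append(nbr)
    return seen


def _adjacency(n_atoms, bonds):
    adjacency = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def split_points(elements, bonds):
    """C-Q bonds (Q = any non-carbon atom) whose two atoms share no ring."""
    adjacency = _adjacency(len(elements), bonds)
    cuts = []
    for a, b in bonds:
        is_c_q = (elements[a] == "C") != (elements[b] == "C")
        # The bond lies in a ring if b is still reachable from a without it.
        in_ring = b in _reachable(a, adjacency, frozenset((a, b)))
        if is_c_q and not in_ring:
            cuts.append((a, b))
    return cuts


def carbon_fragments(elements, bonds):
    """Split at every splitting point; keep the fragments that contain carbon."""
    cuts = {frozenset(c) for c in split_points(elements, bonds)}
    kept = [bond for bond in bonds if frozenset(bond) not in cuts]
    adjacency = _adjacency(len(elements), kept)
    fragments, assigned = [], set()
    for atom in range(len(elements)):
        if atom in assigned:
            continue
        component = _reachable(atom, adjacency, frozenset())
        assigned |= component
        if any(elements[i] == "C" for i in component):
            fragments.append(sorted(component))
    return fragments


# Ethyl acetate, CH3-C(=O)-O-CH2-CH3: atoms 0-5, no rings.
ester = (["C", "C", "O", "O", "C", "C"],
         [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)])
# 2-Methoxytetrahydrofuran: ring atoms 0-4 (atom 0 is the ring O), OCH3 on atom 1.
acetal = (["O", "C", "C", "C", "C", "O", "C"],
          [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (1, 5), (5, 6)])

print(carbon_fragments(*ester))
print(carbon_fragments(*acetal))
```

In the ester both carbon chains come apart at the C-O bonds, while in the second example the ring oxygen stays with its ring because it shares that ring with its carbon neighbors.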
Source of Ambiguity
 In the physical Beilstein Handbook, the
end of one system number and the
beginning of another sometimes occur
on the same physical page.
 Leads to bleed-over from the previous
section (e.g., alkyl hydrocarbons linked
to the simplest alcohol, methanol)
Lawson Number
 Originally used in the program SANDRA
 Algorithmic expression of the System
Numbers in the printed work
– System Numbers: 1-4720
– Lawson Numbers: 8-32759
– System Number = Lawson Number
divided by 8 (roughly)
 Inherited the ambiguity of the page
number placement
Lawson Number: Purpose
 To divide the total virtual structure
universe of published and unpublished
compounds into approximately equal
sections (virtual pages) of related
compounds
Lawson Number Occurrence 1
 Any compound may have several LNs; most
have 2 to 3.
 In 1991 (1.8 million compounds in the file at
that time):
– 25.1% had 1
– 39.4% had 2
– 24.0% had 3
– 8.5% had 4
– 3.0% had > 4
 Average LN occurred in about 70 compounds
in 1991
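As a quick check on "most have 2 to 3", the 1991 percentages give a lower bound on the mean number of LNs per compound. The sketch below assumes that every compound in the "> 4" group has exactly 5 LNs, which is the smallest value consistent with the slide, so the true mean can only be higher. The variable and function names are illustrative.

```python
# Percent of compounds carrying 1, 2, 3 or 4 LNs (1991 figures from the slide).
share_by_count = {1: 25.1, 2: 39.4, 3: 24.0, 4: 8.5}
# Percent of compounds carrying more than 4 LNs.
share_more_than_four = 3.0


def mean_lns_lower_bound(shares, tail_share, tail_min=5):
    """Weighted mean, counting every '> 4' compound as having tail_min LNs."""
    total = sum(shares.values()) + tail_share
    weighted = sum(n * pct for n, pct in shares.items()) + tail_min * tail_share
    return weighted / total


lower_bound = mean_lns_lower_bound(share_by_count, share_more_than_four)
print(f"at least {lower_bound:.2f} LNs per compound on average")
```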
Lawson Number Occurrence 2
 Occasionally an LN will represent a
unique structure, e.g., LN 12 retrieves
only BRN 4736629:
What governs the value of the
LN? In order of influence:
 Cyclic class (number and type of heteroatoms)
 Chemical functions (amine, hydroxy, etc.)
 Degree of unsaturation of the carbon framework with respect to
multiple bonds at carbon + ring closures
 Carbon count of the carbon-complete fragment
framework
 Degree of carbon branching
 Degree of halogen and nitro substitution
 Chalcogen exchange
 Ring sizes
Beilstein Handbook of Organic
Chemistry: SANDRA
 SANDRA, Structure AND Reference Analyzer
– Program that interpreted a graphical structure of a
compound and predicted where it should be found
in printed Beilstein
– Developed in 1987 by Alexander Lawson for use
on a local microcomputer
 SANDRA fragment screens had a heavy
chemical bias: classified according to
chemical structure
 12-digit code linked information to page
ranges
 This compound belongs in v. 13 Syst. 1823 H p. 348
 Hashcodes:
• Ethylamine: 000500010002
• Phenol: 800100010906
• Non-localized amino-cyclohexanol: 800510010306
 Each 12-digit hash code had a corresponding
4-digit code, e.g., the number 1849
linked 800510010306 to System no.
1823, H-page 348.
 The four-digit number retains the
sortability of the 12-digit code, but gives
a hashcode for each fragment that can
be stored in 2 bytes: 7392-28C1-1610
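The 2-byte storage can be illustrated as follows. The slides do not state the byte layout, so the big-endian signed 16-bit format used here is an assumption. Under that assumption the largest storable value is 32767, and a value such as 32776 no longer fits.

```python
import struct


def pack_code(code: int) -> bytes:
    """Pack one code into 2 bytes (assumed layout: big-endian signed 16-bit)."""
    return struct.pack(">h", code)


def fits_in_two_bytes(code: int) -> bool:
    """True if the code can be stored in the assumed signed 16-bit layout."""
    try:
        pack_code(code)
    except struct.error:
        return False
    return True


packed = pack_code(1849)                  # the four-digit code from the slide
restored = struct.unpack(">h", packed)[0]
print(len(packed), restored)
print(fits_in_two_bytes(32759), fits_in_two_bytes(32776))
```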
Lawson Number Planned
Enhancements (around 1990)
 A second phase of the LN implementation, covering LNs greater than
32767, never materialized
– was to include 8000 shape discriminators to help avoid false drops,
with LN values in the range 32776-40951
– Each ring skeletal shape for mono- and bicyclic systems (including
fused, bridged, and spiro rings) of 3-10 ring atoms, containing 0, 1,
or 2 heteroatoms of the set (O, N, S) in any combination or any ring
position, would get a unique LN
– Rings with 11-17 atoms, including O, N, S ring atoms, would get an
LN
– Another LN for those with heteroatoms other than N, O, S
– All mono- and bicyclic systems with 18 or more ring atoms were to
get one LN
– A single LN for tricyclic and larger ring systems (further
discrimination could be based on the presence or absence of particular
skeletons, such as steroid, morphane, adamantane, etc.)
Lawson Number Uses
 Most effectively used when combined
with other search elements, e.g.:
– Molecular Formula
– Element Ranges
– Boolean operator NOT in combination with
substructures
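The way an LN narrows a result set when combined with other keys can be sketched with plain set logic. Everything below is hypothetical: the three records are invented for illustration, the `search` function is not part of CrossFire, and substructure matching is not modeled. Only the AND/NOT combination with a molecular formula is shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substance:
    brn: int                    # made-up registry numbers, not real BRNs
    formula: str
    lawson_numbers: frozenset


# Hypothetical records, invented for illustration only.
FILE = [
    Substance(1, "C6H10O3", frozenset({31459, 289})),
    Substance(2, "C6H10O3", frozenset({31459})),
    Substance(3, "C5H9NO2", frozenset({24059, 289})),
]


def search(substances, has_all=(), has_none=(), formula=None):
    """Return the BRNs that carry every LN in has_all and no LN in has_none."""
    hits = []
    for s in substances:
        if formula is not None and s.formula != formula:
            continue
        if not set(has_all) <= s.lawson_numbers:
            continue
        if set(has_none) & s.lawson_numbers:
            continue
        hits.append(s.brn)
    return hits


print(search(FILE, has_all=[31459], has_none=[289]))       # LN 31459 NOT LN 289
print(search(FILE, has_all=[289], formula="C6H10O3"))      # LN 289 AND formula
```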
Lawson Number Search Tool
http://mypage.iu.edu/~ucoca/begperl/formFetch.html
Lawson Number Search in
Usha’s DB for COOH/O-R/(O4)
 Retrieves (among seven LN ranges):
LN Range       Function
31456-31471    COOH/O-R/(O4)
Beilstein CrossFire Search for
LN Range 31456-31471
 Yielded 10,467 hits on 4/15/2004
 One of those was BRN 18833 with LNs
31459 and 289:
Lawson Number Search in
Usha’s DB
 Revealed that LN 289 is O-R(*1)
 Combining the previous Beilstein
CrossFire search with LN 289 yielded
4910 hits on 4/15/2004.
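The lookups shown on these slides (LN to meaning) amount to a range table. A minimal sketch follows, using only the two meanings quoted above. The table layout and function name are assumptions, not a description of the actual Web tool, and treating LN 289 as a one-number range is also an assumption.

```python
from bisect import bisect_right

# (first LN, last LN, meaning), sorted by first LN.
# Only these two entries appear on the slides; the range for LN 289 is assumed.
LN_MEANINGS = [
    (289, 289, "O-R(*1)"),
    (31456, 31471, "COOH/O-R/(O4)"),
]
_STARTS = [first for first, _, _ in LN_MEANINGS]


def meaning_of(ln):
    """Return the meaning of the range containing ln, or None if unknown."""
    i = bisect_right(_STARTS, ln) - 1
    if i < 0:
        return None
    first, last, meaning = LN_MEANINGS[i]
    return meaning if first <= ln <= last else None


for ln in (31459, 289, 12):
    print(ln, meaning_of(ln))
```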
Lawson Number Search for
LN 289 in Usha’s Database
Lawson Number Search in
Beilstein CrossFire
 Find a compound with a cyclopentane
ring with three free sites (over 440,000
substances) and with both LN 31459
and LN 289
 Result: 10 substances on 4/15/2004
CrossFire LN Search Yields
Very Diverse Results
Lawson Number Range
Search # 2 on CrossFire
 23369 –25200
 Yielded 668,065 substances on 6/3/04
 When combined with the chemical
name segment Aziridin* in proximity to
Propion*, the search yielded 142
substances.
CrossFire 2 Search Results:
All have in common LN 24059
Lawson Number 24059
 Parent Heterocycles N(1)
Possible to Link CrossFire to
Usha’s Web Tool
 Hop in feature
– Allows users to jump into CrossFire
Commander and run a search from a link
on the Web (or from an external package)
Conclusion
While the Lawson Number was originally
developed as a tool to aid in finding the
correct place for a given compound in
the printed Beilstein, it clearly has utility
in online searches of the Beilstein
database. Having a Web supplement
that defines the meaning of the Lawson
Numbers will enhance the usefulness of
the search field.
Bibliography and
Acknowledgement
The generous input from Dr. Alexander Lawson is much
appreciated!
 Lawson, Alexander J. “Structure graphics in: pointers to
Beilstein out.” in: Warr, Wendy A., ed. Graphics for chemical
structures: integration with text and data. (ACS Symposium
Series; 341) American Chemical Society: Washington, 1987,
80-87.
 Lawson, Alexander J. “Chemical structure browsing.” in: Warr,
Wendy A., ed. Chemical structure information systems:
Interfaces, communication, and standards. (ACS Symposium
Series; 400) American Chemical Society: Washington, 1989, 41-49.
 Lawson, Alexander J. “The Lawson similarity number (LN).
Offline generation and online use.” in: Heller, Stephen R., ed.
The Beilstein online database: implementation, content, and
retrieval. (ACS Symposium Series; 436) American Chemical
Society: Washington, 1990, 143-155.
Bibliography
 Sunkel, J.; Hoffman, E.; Luckenbach, R.
“Straightforward procedure for locating chemical
compounds in the Beilstein Handbook.” Journal of
Chemical Education 1981, 58(12), 982-986.
 “A powerful tool for chemists: The Lawson-Number.”
[brochure] Springer-Verlag, Berlin: 1989?.
 Lawson, Alexander. Personal communication. 22
June 2001.
 Meehan, Paul; Schofield, Helen. “CrossFire; a
structural revolution for chemists.” Online Information
Review 2001, 25(4), 241-249.
MIMAS (Manchester Information
& Associated Services)
 JISC-supported UK national data
center
 Run by Manchester Computing at
the University of Manchester
 Provides access to ISI Web of
Knowledge, JSTOR, CrossFire, etc.
 http://www.mimas.ac.uk/
MIMAS CrossFire Services
 Very useful documentation
– http://www.mimas.ac.uk/crossfire/docs.html
 Introductory guides
 Training materials
 Manuals
UW-Madison CrossFire Site
 Links to a locally-produced help file
 http://chemistry.library.wisc.edu/beilstein/home.htm
 Quick Guide
 http://chemistry.library.wisc.edu/beilstein/quickguide.htm
Beilstein on STN
 Beilstein on STN (Workshop Manual).
FIZ Karlsruhe: Eggenstein-Leopoldshafen, 2003.
 http://www.stn-international.com/training_center/chemistry/beilstein/beilstein_wsm.pdf
MDL Web Site
 Replaces the former Beilstein site
 MDL Knowledge Base
– http://www.mdl.com/support/knowledgebase/index.jsp