Lecture 1: Topics to be Covered

Computer Structure Codes
(after lectures by Dr. J.M. Barnard)
• How do you store chemical structures
on computer?
• What can you do with them there?
• How do the computer systems used in
chemical informatics work?
Representing a chemical structure
• How much information do you want to include?
– atoms present
– connections between atoms
• bond types
(aromatic ring identification)
– stereochemical configuration
– charges
– isotopes
– 3D-coordinates for atoms
[Figure: structure diagram of tyrosine]
2D structure diagram
• chemists’ “natural language”
• used by most computer systems for display
• shows topology, optionally stereochemistry
• several commonly-used computer programs allow input/editing of structure diagrams
– ISIS/Draw (MDL)
http://www.mdl.com
– ChemDraw (CambridgeSoft)
http://www.cambridgesoft.com/products/
– GRINS/JavaGRINS (Daylight)
http://www.daylight.com/products/javatools.html
2D structure diagram
• provides 2D pictorial representation of
chemical structure
– display on screen
– cut/paste/embed in Word document etc.
• inter-convert with other forms for further
processing
– database searching
– structure analysis
– property prediction
– database analysis
Registry Numbers
• unique identifiers for compounds or
substances
– catalog number
• most chemical databases have them
– Chemical Abstracts
– Beilstein
– private compound registries in pharmaceutical companies
• usually just “idiot numbers”
– no chemical information
• may have hierarchical structure
parent compound → stereoisomer → salt → batch
• need to decide what is a separate compound
Line Notations
• represent structures as compact linear string
of alphanumeric symbols
• easily handled by computer
– compact storage
– easily transmitted over a network
• allow rapid manual coding/decoding by
trained users
– much faster for input than using a structure
drawing program
Line Notations: SMILES
Simplified Molecular Input Line Entry System
• developed by Dave Weininger (Daylight)
[Figure: structure diagram of tyrosine]
OC(=O)C(N)CC1=CC=C(O)C=C1
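As a rough illustration of how easily a computer can handle such a string, the sketch below pulls the atom symbols out of the SMILES above. The helper name `smiles_atoms` is hypothetical, and the pattern assumes only bracket atoms and the common organic-subset symbols; it is not a full SMILES parser.

```python
import re

# A bracket atom, a two-letter symbol, or a one-letter symbol
# (upper case = aliphatic, lower case = aromatic). Assumed subset only.
ATOM = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")

def smiles_atoms(smiles):
    """Return the atom symbols of a SMILES string in the order written."""
    return ATOM.findall(smiles)

tyrosine = "OC(=O)C(N)CC1=CC=C(O)C=C1"
atoms = smiles_atoms(tyrosine)
print(len(atoms), atoms)
```

Hydrogens are not listed because this SMILES leaves them implicit; bonds, branches and ring closures are ignored by this sketch.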
Other line notations
[Figure: structure diagram of tyrosine with atoms numbered 1 to 13]
• ROSDAL (Beilstein)
Representation Of Structure Diagram Arranged Linearly
1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O
• Sybyl Line Notation (Tripos)
OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1
• Wiswesser Line Notation (WLN) (obsolete)
QVYZ1R DQ
Connection Tables (CTs)
• main form of structure representation in
computer systems
– list atoms and bonds (and other data) as a table
• many different formats
– “internal” CTs (in memory)
• algorithmic processing
– “external” CTs (disk files)
• archival storage
• data exchange between programs
Internal Connection Table
• usually “redundant”
– every bond shown twice, once for each atom
• implemented as array of records
• record for each atom might store
– atomic type
– hydrogen count
– formal charge
– 2D display co-ordinates
– bonds to neighboring atoms
– etc.
“Redundant” Connection Table
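[Figure: structure diagram of tyrosine with atoms numbered 1 to 13]
Atom  Element  H count  Neighbour (bond order)
 1    O        1        2 (1)
 2    C        0        1 (1)   3 (2)   4 (1)
 3    O        0        2 (2)
 4    C        1        2 (1)   5 (1)   6 (1)
 5    N        2        4 (1)
 6    C        2        4 (1)   7 (1)
 7    C        0        6 (1)   8 (2)   12 (1)
 8    C        1        7 (2)   9 (1)
 9    C        1        8 (1)   10 (2)
10    C        0        9 (2)   11 (1)  13 (1)
11    C        1        10 (1)  12 (2)
12    C        1        11 (2)  7 (1)
13    O        1        10 (1)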
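A minimal sketch of how a redundant table like this might be built from a plain bond list, using the atom numbering on this slide. The function name `redundant_table` and the record layout are assumptions for illustration, not any particular system's internal format.

```python
# Tyrosine bonds as (atom, atom, bond order), numbered as on the slide.
BONDS = [(1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1), (4, 6, 1), (6, 7, 1),
         (7, 8, 2), (8, 9, 1), (9, 10, 2), (10, 11, 1), (11, 12, 2),
         (12, 7, 1), (10, 13, 1)]
ELEMENTS = dict(enumerate(["O", "C", "O", "C", "N"] + ["C"] * 7 + ["O"], start=1))

def redundant_table(elements, bonds):
    """One record per atom; every bond is stored twice, once at each end."""
    table = {i: {"element": el, "bonds": []} for i, el in elements.items()}
    for a, b, order in bonds:
        table[a]["bonds"].append((b, order))
        table[b]["bonds"].append((a, order))
    return table

table = redundant_table(ELEMENTS, BONDS)
for i, rec in table.items():
    print(i, rec["element"], rec["bonds"])
```

Storing each bond at both ends costs memory but lets an algorithm find all neighbours of an atom without scanning the whole bond list.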
MDL Connection Table
• proprietary file format developed by MDL
– http://www.mdl.com/downloads/latest_releases/index.jsp
• de facto standard for exchange of datasets
• several different flavours and versions
– Molfile (single molecule)
– SDfile (set of molecules and data)
– RGfile (Markush structure)
– Rxnfile (single reaction)
– RDfile (set of reactions with data)
• separates atoms, bonds into separate blocks
Standard Connection Table Formats
• different vendors have proprietary CT formats
• many attempts to establish agreed “standard”
formats
– no real general success
– different user communities have failed to
coordinate efforts
– some standards exist in restricted areas
• SMILES and MDL CT formats widely used
• most popular programs read/write several
different formats
Standard Connection Table Formats
• Standard Molecular Data (SMD) format
– never gained wide acceptance
• Protein Data Bank (PDB) format
• Crystallographic Information File (CIF)
• Molecular Information File (MIF)
– developed from SMD and compatible with CIF
• Chemical Exchange Format (CXF)
– Chemical Abstracts Service
• Chemical Markup Language (CML)
– for data exchange using the Internet
• INChI (IUPAC/NIST Chemical Identifier)
Conclusions
• There are lots of ways of storing a
chemical structure in a computer
– including different amounts of information
• Most important ones are
– line notations (e.g. SMILES)
– connection tables (e.g. MDL Molfile)
– nomenclature
• Structure diagrams used for input/output
Topological Graph Theory
• branch of mathematics
– particularly useful in chemical informatics
and in computer science generally
• study of “graphs” which
consist of
– a set of “nodes”
– a set of “edges” joining
pairs of nodes
Properties of graphs
• graphs are only about connectivity
– spatial position of nodes is irrelevant
– length of edges is irrelevant
– crossing edges are irrelevant
Structure Diagrams as Graphs
• 2D structure diagrams very like topological
graphs
– atoms <=> nodes
– bonds <=> edges
• terminal hydrogen atoms are not normally
shown as separate nodes (“implicit” H)
– reduces number of nodes by ~50%
– “hydrogen count” information used to colour the
neighbouring “heavy atom” node
– separate nodes sometimes used for “special”
hydrogens
• deuterium, tritium
• hydrogen bonded to more than one other atom
• hydrogens attached to stereocentres
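The hydrogen count that "colours" each heavy-atom node can often be derived rather than stored. The sketch below assumes neutral atoms with the usual valences (C 4, N 3, O 2), which is a simplification that breaks down for charged or hypervalent atoms; the names `VALENCE` and `hydrogen_counts` are hypothetical. The example molecule is tyrosine, numbered 1 to 13.

```python
# Assumed "normal" valences for neutral atoms (illustrative subset only).
VALENCE = {"C": 4, "N": 3, "O": 2}

elements = dict(enumerate(["O", "C", "O", "C", "N"] + ["C"] * 7 + ["O"], start=1))
bonds = [(1, 2, 1), (2, 3, 2), (2, 4, 1), (4, 5, 1), (4, 6, 1), (6, 7, 1),
         (7, 8, 2), (8, 9, 1), (9, 10, 2), (10, 11, 1), (11, 12, 2),
         (12, 7, 1), (10, 13, 1)]

def hydrogen_counts(elements, bonds):
    """Implicit H on each atom = valence minus the sum of its bond orders."""
    used = {i: 0 for i in elements}
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return {i: VALENCE[el] - used[i] for i, el in elements.items()}

h = hydrogen_counts(elements, bonds)
print(h, "total H:", sum(h.values()))
```

The total of 11 hydrogens agrees with the molecular formula of tyrosine, C9H11NO3.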
Advantages of using graphs
• mathematical theory is well understood
• graphs can be easily represented in
computers
– many useful algorithms are known
• identical graphs <=> identical molecules
• different graphs <=> different molecules
Disadvantages of graphs
• analogy between chemical structures and
graphs is not perfect
– identical graphs <=/=> identical molecules
– different graphs <=/=> different molecules
• realities of chemical structures cause
problems
– aromaticity
– stereochemistry
– tautomerism
– coordination compounds
– multi-centre bonds
– inorganic compounds
– macromolecules
– polymers
– incompletely-defined substances
• many graph algorithms are inherently slow
Aromaticity
• electronic property of certain ring systems,
giving enhanced chemical stability
• bonds in aromatic rings have properties that
are distinct from single and double bonds
• generally accepted definition is Hückel rule
– 4n+2 pi-electrons (n is a small integer)
• there are borderline cases
• aromaticity causes problems for computer
representation
– different systems deal with it in different ways
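The electron-count part of the Hückel rule is simple arithmetic, sketched below. The function name is hypothetical, and the check covers only the 4n+2 count; real aromaticity perception also has to decide which electrons count and whether the ring is planar and fully conjugated.

```python
def satisfies_huckel(pi_electrons):
    """True if pi_electrons equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0

for name, count in [("benzene", 6), ("cyclobutadiene", 4),
                    ("cyclooctatetraene", 8), ("naphthalene", 10)]:
    print(name, count, satisfies_huckel(count))
```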
Aromaticity problems
• using single and double bonds can give
different topological graphs for the same
compound
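[Figure: alternative Kekulé drawings of a bromine-substituted aromatic ring]
• one solution is to use an aromatic bond type
[Figure: the same ring drawn with aromatic bonds]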
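The problem and the fix can be sketched with a six-membered ring whose atoms are numbered 0 to 5. The assumption here is that atoms 0 and 1 carry the substituents, so the numbering is fixed and the two Kekulé forms really are different labelled graphs. The function names are hypothetical.

```python
def ring_bonds(first_double):
    """Alternating bond orders around a ring 0..5.
    first_double=True puts a double bond between atoms 0 and 1."""
    bonds = {}
    for i in range(6):
        is_double = (i % 2 == 0) == first_double
        bonds[frozenset((i, (i + 1) % 6))] = 2 if is_double else 1
    return bonds

def aromatise(bonds):
    """Relabel every bond with one 'aromatic' type.
    Assumes all bonds passed in belong to the aromatic ring."""
    return {pair: "aromatic" for pair in bonds}

kekule_a = ring_bonds(True)
kekule_b = ring_bonds(False)
print(kekule_a == kekule_b)                        # the two drawings differ
print(aromatise(kekule_a) == aromatise(kekule_b))  # one graph after relabelling
```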
Alternating bonds and
aromaticity
• Chemical Abstracts Registry System
uses a “normalised” bond type for all
rings with alternating single and double
bonds
– this includes some systems that are not aromatic
– and omits some that are
[Figure: example ring systems, one containing S]
Representing aromaticity
• some systems represent aromaticity as an
atom property
– SMILES allows use of lower-case atomic symbols
for aromatic atoms (adjacent aromatic atoms are
assumed to be joined by aromatic bonds)
[Figure: thiophene and 1,2-dibromobenzene]
s1cccc1 or S1C=CC=C1
Brc1c(Br)cccc1 or BrC1=C(Br)C=CC=C1
• problem: aromaticity is really a ring property
Tautomerism
• dynamic equilibrium between positional isomers (labile H)
• are they different compounds?
– answer depends on what you want to do with them
• can use normalised bonds to represent them by a single graph
– gets mixed up with ring alternating bonds
– some tautomers may be aromatic, when others are not
[Figure: tautomeric forms of a ring, interconverting OH/N and O/NH groups]
Tautomerism
• tautomerism is a matter of degree
• tautomers can be defined in different
ways
HQ–X=R <=> Q=X–RH
only certain elements can be Q, X or R
• keto-enol tautomers are not recognised by Chemical Abstracts
• mono-unsaturated carbon chains are not distinguished by Daylight
[Figure: example structures containing OH and O groups]
Structure conventions
sometimes called “business rules”
– some chemical groups can be shown in different but equally
valid ways
[Figure: the same nitrogen-oxygen group drawn without and with a formal positive charge on N]
– conventions are needed to determine which is preferred
– software may be needed to convert to preferred form
Stereochemistry
• different compounds with identical
connectivity
• same topology, different topography
S-tyrosine
R-tyrosine
Stereochemistry
• configuration is often unknown
– or partially known (relative stereochemistry)
– or you may have a mixture of stereoisomers
• in which one isomer may occur in enantiomeric excess
• many different descriptors used by chemists
– wedge (up) and hatched (down) bonds in structure
diagrams
– Cahn, Ingold, Prelog (CIP) designators (R, S, E, Z)
– text-based descriptors (stereoparent, or optical
rotation)
Stereochemistry: up/down bonds
• can be used as additional “colours” for graph edges
– many connection table formats have special codes for up and down bonds
– need to know which end of bond is which
[Figure: two structure diagrams of tyrosine drawn with up/down bonds]
• useful for re-generating diagrams for display
• can be used to calculate other stereo descriptors
Up/down bond problems
• different patterns of up/down bonds can show the same stereoisomer
– different graphs, same molecule
[Figure: the same stereoisomer drawn with two different up/down bond patterns]
• some patterns of up and down bonds actually convey no useful information about configuration
[Figure: example structures whose up/down bonds do not define a configuration]
Stereochemistry: CIP designators
• R.S. Cahn, C. Ingold, and V. Prelog,
– Angewandte Chemie Intl. Ed. in English 1966, 5, 385-551
• one-letter designator for stereocenters
– based on rules assigning priorities to groups
around it
– tetrahedral carbons (R, S)
– double bonds (E, Z)
• additional colors for graph nodes or edges
– useful for distinguishing stereoisomers when
absolute configuration is known
– less useful for matching parts of structures
(substructure search) as priority rules can cause
designator to change when remote part of
structure is changed
Double bond stereo in SMILES
/ and \ used as “directional” single bonds
– only meaningful when used on both atoms of a double bond
– several ways of showing same configuration
[Figure: first stereoisomer of ClC(F)=C(Br)I]
Cl/C(F)=C(Br)/I
Cl\C(F)=C(Br)\I
[Figure: second stereoisomer of ClC(F)=C(Br)I]
Cl\C(F)=C(Br)/I
Cl/C(F)=C(Br)\I
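The pairing of the four strings above can be checked mechanically. The sketch below handles only strings of exactly this shape (a directional bond before the first double-bond carbon and one after the second) and reports whether the first and last atoms are on the same side. The pattern and the name `end_groups_relation` are hypothetical; a directional bond written inside a branch would need extra handling that is not attempted here.

```python
import re

# Only matches  A<dir>C(X)=C(Y)<dir>B , the form used on this slide.
PATTERN = re.compile(
    r"^[A-Za-z]+([/\\])C\([A-Za-z]+\)=C\([A-Za-z]+\)([/\\])[A-Za-z]+$")

def end_groups_relation(smiles):
    """'trans' if the first and last atoms are on opposite sides, else 'cis'."""
    match = PATTERN.match(smiles)
    if match is None:
        raise ValueError("unsupported SMILES form")
    first, second = match.groups()
    # Matching marks (// or \\) mean opposite sides; mixed marks mean same side.
    return "trans" if first == second else "cis"

for s in [r"Cl/C(F)=C(Br)/I", r"Cl\C(F)=C(Br)\I",
          r"Cl\C(F)=C(Br)/I", r"Cl/C(F)=C(Br)\I"]:
    print(s, end_groups_relation(s))
```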
Other complications
• Organometallic and co-ordination compounds
– complex stereochemistry
– special bond types may be needed (dative bonds
etc.)
– ambiguity over covalent/ionic character of bonds
• “business rules” usually needed
• Inorganic compounds
– topological representation often not possible
– composition may not involve integral ratios
between elements
Macromolecules
• in principle can represent all atoms, as
for small molecules
• some systems use “shortcuts” or
“superatoms” for subunits (e.g. amino
acids)
[Figure: peptide chains drawn with amino-acid shortcuts such as Ala, Gly, Asp, Tyr, Pro, Arg, Val, Cys, Trp and His]
Macromolecules
• Each shortcut is defined with appropriate attachment points
• ordinary atoms can be mixed with shortcuts
• system can expand shortcuts when needed
[Figure: definition of the Tyr shortcut, with its two attachment points marked * and *"]
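Expanding shortcuts can be sketched as a dictionary lookup. The dictionary below is hypothetical: each residue is written as a SMILES fragment running from its N attachment point to its C(=O) attachment point, stereochemistry is left out, and ring-containing residues such as Pro or Trp are omitted because their ring-closure digits would need renumbering.

```python
# Hypothetical shortcut dictionary, N-terminus first, no stereochemistry.
SHORTCUTS = {
    "Gly": "NCC(=O)",
    "Ala": "NC(C)C(=O)",
    "Tyr": "NC(Cc1ccc(O)cc1)C(=O)",
}

def expand(sequence):
    """Join the residue fragments and cap the final C(=O) with OH."""
    return "".join(SHORTCUTS[name] for name in sequence) + "O"

print(expand(["Ala", "Gly"]))
print(expand(["Tyr"]))
```

Because each fragment ends at the atom that bonds to the next fragment's nitrogen, simple concatenation forms the amide bonds.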
Polymers
• special problems are presented because
properties of polymer can be affected by
polymerisation conditions
– average number of subunits
– extent of cross-linking
– ratio between different subunits
– random / block sequences of subunits
– etc.
• Two main approaches
– monomer representation
– structural repeating unit (SRU) representation
Incompletely-defined
substances
• unknown stereochemistry
• unknown attachment position
• unknown repetition
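[Figure: example structures with NH2, Cl and OH substituents, showing an unspecified attachment position and an unspecified repeat count n]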
Markush (“Generic”)
structures
– structures with R-groups
– shorthand for describing sets of structures
with common features
[Figure: generic structure bearing OH, R1 and R2; * marks the attachment point of each R-group]
R1 = *–Cl, *–Br or *–I
R2 = *–CH2CH3, *–CH2CH2CH3 or *–CH2CH2CH2CH3
Markush structures
– also called “generic” structures
– very important in chemical patents
• inventor claims whole class of related
compounds
– can be used to describe combinatorial
libraries
– can be used as queries in database
searches
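Enumerating the specific structures covered by a small Markush structure is a Cartesian product over the R-group lists, sketched below. The scaffold SMILES is a made-up phenol, not the one drawn on the slide, and the R-group values (three halogens, three short alkyl chains) are an assumed reading of the slide's figure.

```python
from itertools import product

# Hypothetical scaffold with two variable positions; not taken from the slide.
SCAFFOLD = "Oc1c({R2})cc({R1})cc1"
R_GROUPS = {
    "R1": ["Cl", "Br", "I"],
    "R2": ["CC", "CCC", "CCCC"],   # ethyl, propyl, butyl
}

def enumerate_markush(scaffold, r_groups):
    """Yield one SMILES per combination of R-group choices."""
    names = sorted(r_groups)
    for choice in product(*(r_groups[n] for n in names)):
        yield scaffold.format(**dict(zip(names, choice)))

library = list(enumerate_markush(SCAFFOLD, R_GROUPS))
print(len(library))
print(library[0])
```

The count grows multiplicatively with each extra R-group, which is why patent Markush structures are usually searched as generic structures rather than enumerated.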
Canonicalization
• a given chemical structure (or graph) can
have many valid and unambiguous
representations
– different order of rows in connection table
– different order of atoms in SMILES
• for comparison purposes it would be useful to
have a single unique or “canonical”
representation
• process of converting input representation to
canonical form is called “canonicalization” or
“canonization”
– process of applying “rules” (i.e. an algorithm)
Canonicalization
• an obvious approach:
– generate all possible valid SMILES
– choose the one that comes first
alphabetically
• this would be very slow, but effective,
and there is a danger of missing one
– principle was used for canonicalizing
Wiswesser Line Notation
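The "try everything, keep the first alphabetically" idea can be sketched on a small graph. For simplicity this version orders a connection-table-like description (atom labels plus renumbered bonds) rather than SMILES strings, so it illustrates the principle only; the function name is hypothetical. The cost grows as n! in the number of atoms.

```python
from itertools import permutations

def canonical_form(labels, bonds):
    """Try every atom ordering and keep the smallest description.
    labels: list of element symbols; bonds: list of (i, j) index pairs."""
    best = None
    for order in permutations(range(len(labels))):
        new_index = {old: new for new, old in enumerate(order)}
        atoms = tuple(labels[old] for old in order)
        edges = tuple(sorted(tuple(sorted((new_index[a], new_index[b])))
                             for a, b in bonds))
        candidate = (atoms, edges)
        if best is None or candidate < best:
            best = candidate
    return best

# Ethanol entered with two different atom orders, and dimethyl ether.
ethanol_a = canonical_form(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = canonical_form(["O", "C", "C"], [(0, 1), (1, 2)])
ether = canonical_form(["C", "O", "C"], [(0, 1), (1, 2)])
print(ethanol_a == ethanol_b, ethanol_a == ether)
```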
Canonicalization
• most methods in use today involve
renumbering the atoms in some unique
and reproducible way
– can be used to number rows in connection
table
– can determine order of atoms in SMILES
• normally involve a node labelling
technique called “relaxation”
– example is Morgan’s algorithm (1965)
Symmetry perception
• if ties between label values cannot be
resolved on basis of atom/bond types,
the atoms are symmetrically equivalent,
and it doesn’t matter which is chosen next
• Morgan’s algorithm is thus also useful
for identifying symmetry in molecules
Morgan’s algorithm
• Works by taking more of the graph into
account at each iteration
– essence of “relaxation” technique is iteratively
updating a value by looking at its immediate
neighbours
• It is not infallible
– graphs (“isospectral” graphs) are known where the
algorithm cannot distinguish nodes that are not
symmetrically equivalent
• There are many variations on it
– and several theoretical papers analysing it
mathematically
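A minimal sketch of the relaxation step, in the style of Morgan's extended-connectivity values: start from each atom's number of neighbours, then repeatedly replace each value by the sum of its neighbours' values. The stopping rule used here (stop as soon as the number of distinct values fails to increase) is one common description; published variants differ, and the function names are hypothetical. The example graph is tyrosine, numbered 1 to 13.

```python
BONDS = [(1, 2), (2, 3), (2, 4), (4, 5), (4, 6), (6, 7), (7, 8), (8, 9),
         (9, 10), (10, 11), (11, 12), (12, 7), (10, 13)]
NEIGHBOURS = {i: [] for i in range(1, 14)}
for a, b in BONDS:
    NEIGHBOURS[a].append(b)
    NEIGHBOURS[b].append(a)

def relax(values, neighbours):
    """One relaxation step: new value = sum of the neighbours' values."""
    return {atom: sum(values[n] for n in nbrs)
            for atom, nbrs in neighbours.items()}

def morgan_values(neighbours):
    """Iterate until the number of distinct values stops increasing."""
    values = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    while True:
        updated = relax(values, neighbours)
        if len(set(updated.values())) <= len(set(values.values())):
            return values
        values = updated

degrees = {atom: len(nbrs) for atom, nbrs in NEIGHBOURS.items()}
step1 = relax(degrees, NEIGHBOURS)
step2 = relax(step1, NEIGHBOURS)
print(step2)
```

However many steps are taken, the symmetry-equivalent ring atoms (8 and 12, 9 and 11) keep equal values, which is the basis of the symmetry perception mentioned earlier.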
Ring perception
• How many rings are there in these structures
and which ones are they?
• rings are important features of chemical
structures
– nomenclature generation
– aromaticity perception
– synthetic significance
– fragment descriptor generation
Rings and ring systems
• A ring system is a subgraph in which
every edge is part of a cycle
Which rings to perceive?
• Usually the smallest set of smallest rings (SSSR)
– two 6-membered rather than one 6- and one 10-membered
– two 5-membered rather than one 5- and one 6-membered
• But there may be more than one SSSR
– C-S-C-C-C-C
– C-C-C-C-O-C
– C-S-C-C-O-C
[Figure: bridged ring system containing S and O]
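How many rings an SSSR contains follows from simple counting: bonds minus atoms plus the number of connected components. The sketch below computes that number with a small union-find; it does not say which rings to choose, which is the harder part. The function name is hypothetical.

```python
def ring_count(n_atoms, bonds):
    """Number of rings in an SSSR = bonds - atoms + connected components."""
    parent = list(range(n_atoms))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    components = n_atoms
    for a, b in bonds:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            components -= 1
    return len(bonds) - n_atoms + components

benzene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
naphthalene = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
               (4, 6), (6, 7), (7, 8), (8, 9), (9, 5)]
print(ring_count(6, benzene), ring_count(10, naphthalene))
```

For naphthalene the answer is 2, matching the choice of two 6-membered rings over one 6- and one 10-membered ring.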
Substructure Fragments
• Subgraphs can be identified in a structure graph corresponding to functional groups, rings etc.
– –OH
– –NH2
– –COOH
– phenyl
• this can be done by tracing appropriate paths in the graph
• subgraphs may overlap
[Figure: structure diagram of tyrosine]
Fragment codes
– many early chemical information systems
were based on identifying fragments of this
sort
• originally the fragments were identified
manually
• and represented on punched cards
– special fragment codes (dictionaries of
fragments) were devised for different
systems
• some of these are still in use, though with
automated encoding of structures
• particularly important are the systems for
“Markush” structures in patents (e.g. Derwent
WPI code)
Fingerprints
• the fragments present in a structure can be
represented as a sequence of 0s and 1s
00010100010101000101010011110100
– 0 means fragment is not present in structure
– 1 means fragment is present in structure (perhaps
multiple times)
• each 0 or 1 can be represented as a single bit
in the computer (a “bitstring”)
• for chemical structures often called structure
“fingerprints”
Fingerprints
• fingerprints are typically 150-2500 bits long
• where a fixed dictionary of fragments is used
there can be a 1:1 relationship between
fragment and bit position in fingerprint
– sometimes several related fragments will “set” the
same bit
• disadvantage is that if structure contains few
fragments from the dictionary, few or no bits are set
– can be avoided if “generalised” fragments are
used
(involving e.g. “any atom”, “any ring bond” types)
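A fixed-dictionary fingerprint can be sketched as below. The four-bit dictionary is hypothetical and far smaller than a real one; two related fragments are deliberately mapped to the same bit to show the sharing mentioned above. Fragment detection itself is assumed to have been done already, so the input is just a list of fragment names.

```python
# Hypothetical fragment dictionary: fragment name -> bit position.
# "COOH" and "COOR" share bit 2 on purpose.
DICTIONARY = {"OH": 0, "NH2": 1, "COOH": 2, "COOR": 2, "phenyl": 3}
N_BITS = 4

def fingerprint(fragments):
    """Bitstring with a 1 for every dictionary fragment that is present."""
    bits = 0
    for frag in fragments:
        if frag in DICTIONARY:
            bits |= 1 << DICTIONARY[frag]   # repeats set the same bit again
    return "".join("1" if (bits >> i) & 1 else "0" for i in range(N_BITS))

print(fingerprint(["OH", "NH2", "COOH", "phenyl", "OH"]))
print(fingerprint(["Cl"]))
```

The second call shows the disadvantage noted above: a structure whose fragments are not in the dictionary sets no bits at all.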
2D structure depiction
• if structures are stored without 2D display
coordinates, we need to generate them
– e.g. structures held as SMILES
• “depiction” algorithms are used for this
• identify and lay out ring systems first
– complications over orientation of some systems
– Chemical Abstracts stores “standard depictions” of
all ring systems it has encountered
• then add side chains, avoiding collisions
– many features can be added to improve
appearance
3D structure depiction
• much more complicated than 2D
• need to store standard bond lengths and
angles
• need to distinguish atoms in different
hybridisation states (sp2 vs sp3 carbon)
• need to rotate single bonds to avoid “bumps”
• sophisticated “conformation generation”
programs identify low-energy conformers
– very useful for identifying molecules with the
correct shape to fit into biological receptor sites
Nomenclature generation
• most systematic nomenclature is based
on ring systems
– need to identify/prioritise ring systems first
– identify standard numbering for system
• frequently need to store this
– add side chains and substituents with
appropriate locants