SMILES
• Simplified Molecular Input Line Entry
System (SMILES)
• Widely used AND computationally
efficient
• Uses atomic symbols and a set of
intuitive rules
• Uses hydrogen-suppressed molecular
graphs (HSMG)
SMILES Bonds
SINGLE*     -
DOUBLE      =
TRIPLE      #
AROMATIC*   :
* can be omitted
Butanols
[Structure drawings: 2-Butanol, iso-Butanol, tert-Butanol]
SMILES Branches
• Represented by enclosure in
parentheses
• Can be nested or stacked
• Examples:
CC(O)CC is 2-Butanol
OCC(C)C is iso-Butanol
OC(C)(C)C is tert-Butanol
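The nesting and stacking of branches can be checked mechanically. The following is a minimal sketch in Python, not part of the original slides; the function name branch_summary is hypothetical. It only counts parentheses and does not validate the chemistry.

```python
def branch_summary(smiles):
    """Return (number of branches, maximum nesting depth) for a SMILES string."""
    depth = 0
    max_depth = 0
    branches = 0
    for ch in smiles:
        if ch == "(":
            # A left parenthesis opens a new branch, one level deeper.
            depth += 1
            branches += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("')' without a matching '('")
    if depth != 0:
        raise ValueError("'(' without a matching ')'")
    return branches, max_depth
```

For tert-Butanol, OC(C)(C)C, this reports two stacked branches at depth 1; a nested form such as CC(C(C)O)C reports two branches with depth 2.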
SMILES Bonds
Ethene                   C=C
Chloroethene             ClC=C
1,1-Dichloroethene       ClC(Cl)=C
cis-1,2-Dichloroethene   ClC=CCl
Trichloroethene          ClC(Cl)=CCl
Perchloroethene          ClC(Cl)=C(Cl)Cl
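Because SMILES uses hydrogen-suppressed graphs, the hydrogens in the examples above are implied by the bonds that are written. The sketch below shows one way to recover them, assuming the commonly described rule that each atom is filled up to its lowest normal valence that accommodates its bonds. The valence table and the function names are assumptions added for illustration, and the sketch handles only uppercase organic-subset atoms, the bond symbols - = #, and branches (no rings, brackets, or aromatic atoms).

```python
import re
from collections import Counter

# Assumed normal valences for the organic subset (illustrative table).
VALENCES = {"B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
            "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,)}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#()]")


def implied_hydrogens(element, used):
    """Hydrogens needed to reach the lowest normal valence >= bonds used."""
    for valence in VALENCES[element]:
        if valence >= used:
            return valence - used
    return 0


def implicit_formula(smiles):
    """Count atoms, including implied hydrogens, for an acyclic SMILES."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("character not supported by this sketch")
    elements = []   # element symbol of each atom, in order of appearance
    used = []       # sum of bond orders written for each atom
    stack = []      # atoms to return to when a branch closes
    prev = None     # atom that the next atom will bond to
    order = 1       # pending bond order; single when no symbol is given
    for tok in tokens:
        if tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif tok in BOND_ORDER:
            order = BOND_ORDER[tok]
        else:
            elements.append(tok)
            used.append(0)
            cur = len(elements) - 1
            if prev is not None:
                used[prev] += order
                used[cur] += order
            prev, order = cur, 1
    counts = Counter(elements)
    hydrogens = sum(implied_hydrogens(e, u) for e, u in zip(elements, used))
    if hydrogens:
        counts["H"] = hydrogens
    return dict(counts)
```

Under these assumptions, C=C yields two carbons and four hydrogens, while ClC(Cl)=C(Cl)Cl yields two carbons, four chlorines, and no hydrogens.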
SMILES Atoms
• Use normal chemical symbols
• Add punctuation symbols if necessary
• No super- or subscripts
SMILES Symbols
• String of alphanumeric characters and
certain punctuation symbols
• Terminates at the first space
encountered when read left to right
• The ORGANIC SUBSET:
B, C, N, O, P, S, F, Cl, Br, I
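As a rough illustration of the two rules above (the notation ends at the first space, and organic-subset atoms are written without brackets), the following Python sketch lists the organic-subset atoms in a string. The function name organic_atoms is hypothetical. Bracketed atoms are skipped rather than parsed, and every other character is ignored.

```python
# Two-letter symbols come first so that "Cl" is not read as "C" followed by "l".
ORGANIC_SUBSET = ("Cl", "Br", "B", "C", "N", "O", "P", "S", "F", "I")


def organic_atoms(text):
    """List the organic-subset atoms of the SMILES at the start of `text`."""
    smiles = text.split(" ", 1)[0]  # the notation terminates at the first space
    atoms = []
    i = 0
    while i < len(smiles):
        if smiles[i] == "[":
            # Skip a bracketed atom entirely; it is outside the organic subset
            # shorthand even if the element itself is organic.
            end = smiles.find("]", i)
            if end == -1:
                raise ValueError("'[' without a matching ']'")
            i = end + 1
            continue
        for symbol in ORGANIC_SUBSET:
            if smiles.startswith(symbol, i):
                atoms.append(symbol)
                i += len(symbol)
                break
        else:
            i += 1  # bond symbol, parenthesis, digit, or other character
    return atoms
```

Anything after the first space, such as a compound name on the same line, is dropped before the atoms are read.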
Other SMILES Atoms
• Aliphatic or nonaromatic carbon: C
• Atom in aromatic ring: lowercase letter
• Designate ring closure with pairs of
matching digits, e.g.
c1ccccc1 (or C1=CC=CC=C1) is Benzene,
whereas
C1CCCCC1 is Cyclohexane
SMILES Charges
• Specify attached hydrogens and
charges in square brackets
• Number of attached hydrogens is the
symbol H followed by optional digit
SMILES Charges
[H+]      proton
[OH-]     hydroxyl anion
[OH3+]    hydronium cation
[Fe++]    iron(II) cation
[NH4+]    ammonium cation
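The bracket notation in the examples above can be taken apart with a short regular expression. This is a minimal sketch under the assumption that a bracket atom consists only of an element symbol, an optional H count, and an optional charge written either as repeated signs or as a sign plus a digit. The name parse_bracket_atom is hypothetical, and isotopes and chirality marks are not handled.

```python
import re

# element, optional H with optional count, optional charge signs, optional digits
BRACKET_ATOM = re.compile(r"\[([A-Z][a-z]?)(?:H(\d*))?(\++|-+)?(\d*)\]")


def parse_bracket_atom(atom):
    """Return (element, attached hydrogens, charge) for a bracket atom."""
    match = BRACKET_ATOM.fullmatch(atom)
    if match is None:
        raise ValueError("not a bracket atom this sketch understands: " + atom)
    element, h_digits, signs, magnitude = match.groups()
    # "H" with no digit means one hydrogen; no "H" at all means none.
    if h_digits is None:
        hydrogens = 0
    else:
        hydrogens = int(h_digits) if h_digits else 1
    if not signs:
        if magnitude:
            raise ValueError("digits without a charge sign: " + atom)
        charge = 0
    else:
        size = int(magnitude) if magnitude else len(signs)
        charge = size if signs[0] == "+" else -size
    return element, hydrogens, charge
```

With this sketch, [NH4+] parses as nitrogen with four hydrogens and charge +1, and [Fe++] as iron with no hydrogens and charge +2.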
SMILES Cyclic Structures
• Break one single or one aromatic bond
in each ring
• Number in any order
– Designate ring-breaking atoms by the
same digit following the atomic symbol
Cyclic Structures
• Matching numbers mark the start and end of a
ring; each is entered immediately after its
start or end atom
• Only numbers 1 – 9 are used
• A number should appear only twice
• An atom can be associated with two consecutive
numbers, e.g., naphthalene: c12ccccc1cccc2
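The ring-numbering rules above amount to pairing each digit with its earlier twin. The Python sketch below illustrates this, assuming single-digit ring numbers 1 to 9 and a simple atom tokenizer (bracket atoms, Cl, Br, and single-letter aliphatic or aromatic atoms). The function name ring_closures is hypothetical. Atoms are indexed from 0 in the order they are written.

```python
import re

# One token per atom (bracket atom, two-letter halogen, single letter) or
# per ring digit; bond symbols and parentheses are skipped by findall.
TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|[1-9]")


def ring_closures(smiles):
    """Return sorted (atom_a, atom_b) index pairs joined by ring digits."""
    open_digits = {}   # digit -> index of the atom that opened it
    pairs = []
    atom_index = -1
    for token in TOKEN.findall(smiles):
        if token.isdigit():
            if atom_index < 0:
                raise ValueError("ring digit before any atom")
            if token in open_digits:
                # Second appearance of the digit closes the ring bond.
                pairs.append((open_digits.pop(token), atom_index))
            else:
                open_digits[token] = atom_index
        else:
            atom_index += 1
    if open_digits:
        raise ValueError(f"unclosed ring digit(s): {sorted(open_digits)}")
    return sorted(pairs)
```

For naphthalene, c12ccccc1cccc2, the first atom opens both rings: digit 1 closes at the sixth atom and digit 2 at the tenth, giving the pairs (0, 5) and (0, 9).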
SMILES Conventions
• Avoid two consecutive left parentheses
if possible
• Strive for the fewest possible
branches
• Tautomeric bonds are not designated;
enter the appropriate form
Further Restrictions
• A branch cannot begin a SMILES
notation
• A branch cannot immediately follow a
double- or triple-bond symbol
• Example: C=(CC)C is invalid, but
C(=CC)C and C(CC)=C are valid SMILES
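The two restrictions above are purely positional, so they can be checked by scanning the string. A minimal sketch in Python follows; the function name check_branch_rules and the wording of the messages are hypothetical, and no other validity checks are attempted.

```python
def check_branch_rules(smiles):
    """Return a list of violations of the two branch restrictions."""
    problems = []
    # Rule 1: a branch cannot begin a SMILES notation.
    if smiles.startswith("("):
        problems.append("branch begins the notation")
    # Rule 2: a branch cannot immediately follow a double- or triple-bond symbol.
    for i in range(1, len(smiles)):
        if smiles[i] == "(" and smiles[i - 1] in "=#":
            problems.append(f"branch follows '{smiles[i - 1]}' at position {i}")
    return problems
```

An empty list means neither restriction is violated; C=(CC)C produces one message, while C(=CC)C and C(CC)=C produce none.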
SMILES Fragments
• Nitro             N(=O)(=O)
• Nitrate           ON(=O)(=O)
• Nitrite           ON(=O)
• Sulfonic acid     S(=O)(=O)O
• Cyanide/Nitrile   C#N
• Azide             N=N#N
• Azido             N+=N-
SMILES Metals
[Al] [As] [Au] [Be]
[Bi] [Cd] [Ca] [Fe]
[Hg] [K] [Li] [Mg]
[Na] [Ni] [Pt] [Sb]
[Sn] [Zn] [Zr]
Disconnected Structures
• Indicated by a dot
• Tetramethylammonium bromide
C[N+](C)(C)C.[Br-]
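Since the dot separates disconnected parts, a salt can be split into its ions and the charges tallied. The sketch below is illustrative only. The function names are hypothetical, and it assumes charges appear at the end of a bracket atom as repeated signs or as a sign plus a digit. Tetramethylammonium bromide is written here with all four methyl groups attached to nitrogen, C[N+](C)(C)C.[Br-].

```python
import re

# A charge at the end of a bracket atom: "+", "++", "-", "+2", and so on.
CHARGE = re.compile(r"\[[^\]]*?(\++|-+)(\d*)\]")


def components(smiles):
    """Split a SMILES string into its disconnected parts at each dot."""
    return smiles.split(".")


def component_charge(component):
    """Sum the formal charges written in the bracket atoms of one part."""
    total = 0
    for signs, digits in CHARGE.findall(component):
        size = int(digits) if digits else len(signs)
        total += size if signs[0] == "+" else -size
    return total


salt = "C[N+](C)(C)C.[Br-]"
parts = components(salt)
charges = [component_charge(part) for part in parts]
```

Here parts holds the cation and the anion as separate strings, and their charges, +1 and -1, sum to zero as expected for a neutral salt.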
Isomeric and Chiral SMILES
• Isomeric configuration indicated by
forward and backward slashes: / \
• Examples:
– trans-1,2-dibromoethene: Br/C=C/Br
(the direction of the slash continues)
– cis-1,2-dibromoethene: Br/C=C\Br
(the direction of the slash reverses)
• Chirality indicated by the “@” symbol
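The slash rule for the dibromoethene examples can be expressed as a comparison of the two slash directions. The following Python sketch is deliberately narrow: it assumes the simple linear form X/C=C/Y (or with backslashes) around a carbon-carbon double bond and does not handle branches or chirality. The function name double_bond_geometry is hypothetical.

```python
import re

# Slash, first carbon, double bond, second carbon, slash.
SLASHED_DOUBLE_BOND = re.compile(r"([/\\])C=C([/\\])")


def double_bond_geometry(smiles):
    """Return 'trans', 'cis', or None for a linear X/C=C/Y style SMILES."""
    match = SLASHED_DOUBLE_BOND.search(smiles)
    if match is None:
        return None  # no geometry written around a C=C bond
    before, after = match.groups()
    # Same direction on both sides: trans. Reversed direction: cis.
    return "trans" if before == after else "cis"
```

A string without slashes, such as BrC=CBr, returns None because no geometry is specified.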
Some Applications
• JMDraw/SMILESViewer (Christoph
Steinbeck)
• JME Molecular Editor (Peter Ertl)
• STN Express (SMILES as output)
• Tripos (dbtranslate: SMILES to MOL)
• Marvin (Ferenc Csizmadia)
http://chemaxon.com/marvin/
• CACTVS http://www2.ccc.uni-erlangen.de/cactvs/
Another Application
• SMILESCAS Database
http://www.syrres.com/esc/smilecas.htm
Over 103,000 SMILES notations
• Input CAS Registry Number
• Leads to SMILES and thence to a
structure search