BCI_Frag - Predictive Toxicology

BCI fingerprints
Submitted by:
George S. Cowan, Ph.D.
an Idea Integration consultant at
Pfizer Global Research and Development
2800 Plymouth Road
Ann Arbor, Michigan, USA 48105
email:
  Pfizer Business: George.Cowan@pfizer.com
  Academic Research: cowan@acm.org
Phone: 734-622-3364
Molecular Substructures are represented with fingerprints
A fingerprint is a string of bits that shows whether certain molecular fragments are present in a
molecule. BCI fingerprints are interesting because they are keyed; that is, a dictionary gives the
relationship between substructures and bits. This aids in interpreting and, ultimately, displaying the
results. Barnard Chemical Information Ltd. has developed categories of useful fragments and ways to
generate all such fragments that exist in a library of compounds. They can be reached at:
Barnard Chemical Information Ltd
Tel: +44 (0)114 233 3170
E-mail: barnard@BCI1.demon.co.uk
Fax: +44 (0)114 234 3415
Web: http://www.bci1.demon.co.uk
46 Uppergate Road, Stannington, Sheffield S6 6BX, UK

The Description of Molecular Substructures
A BCI program is used to analyze a set of compounds and construct a dictionary of all the fragments
that are found. The BCI fragments used are
- augmented atoms
- atom pairs with a distance between atoms of any length desired
- atom sequences of any length desired
- ring composition sequences for any size rings desired
- ring fusions: sequences of ring connectivities
The fingerprint of a molecule tells which of these fragments it contains. Because some readers may
be interested in understanding exactly which chemical concepts can and cannot be constructed by
starting with these fragments, we next give considerable detail about the information that they
contain. The simplest way to convey this level of detail is to provide actual examples of the
dictionary entries that BCI programs construct. The example dictionary entries have spaces added for
readability. For additional detail, contact BCI, Ltd.
Figure 1. A Beta-adrenergic agonist.
Augmented-atom fragments consist of a central atom with the non-hydrogen atoms that it is bonded
to and a characterization of the bonds. One example of an augmented-atom fragment is a hydroxyl
substitution on a benzene ring, with the role of the central atom played by the ring carbon to which
the oxygen is attached. Here is its BCI dictionary entry:
AA C cs O rn C rn C
meaning that this Augmented Atom fragment represents Carbon with a chain-single bond to an
Oxygen, a ring-normalized bond to a Carbon, and a ring-normalized bond to another Carbon.
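As a minimal sketch of how such an entry can be read, the following Python splits the spaced form shown above into its central atom and its bond/neighbor pairs. It assumes the readable layout used in this document (type code, central atom, then alternating bond code and neighbor atom). The function name `parse_augmented_atom` and the `BOND_NAMES` table are illustrative assumptions and are not part of any BCI software.

```python
# Bond codes explained in this document; the table is illustrative
# and is not a complete list of BCI bond codes.
BOND_NAMES = {
    "cs": "chain-single",
    "rn": "ring-normalized",
}

def parse_augmented_atom(entry):
    """Split a spaced entry such as 'AA C cs O rn C rn C' into its parts."""
    tokens = entry.split()
    if not tokens or tokens[0] != "AA":
        raise ValueError("not an augmented-atom entry: " + entry)
    center = tokens[1]
    rest = tokens[2:]
    if len(rest) % 2 != 0:
        raise ValueError("expected bond/atom pairs after the central atom")
    # Pair each bond code with the neighbor atom that follows it.
    neighbors = [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
    return center, neighbors

center, neighbors = parse_augmented_atom("AA C cs O rn C rn C")
for bond, atom in neighbors:
    # Unknown bond codes are printed unchanged.
    print(center, BOND_NAMES.get(bond, bond), atom)
```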
Atom-pair fragments consist of two atoms, a count of their pi bonds and non-hydrogen bonds, and a
count of all the atoms in the sequence from one to the other. A count of 2 pi-bonds on a carbon atom
indicates an aromatic carbon. Here is the BCI dictionary entry for the adjacent substitutions on the
aromatic ring in Figure 1:
AP C23 2 C23
which is interpreted as meaning: an Atom Pair consisting of a Carbon atom with 2 pi-bonds and 3
non-hydrogen bonds at one end of a sequence of 2 atoms, with the other end of the sequence being
another Carbon with 2 pi-bonds and 3 non-hydrogen bonds.
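The reading of an atom-pair entry can be sketched in Python as follows. This is only an illustration under stated assumptions: each atom descriptor is taken to be an element symbol followed by exactly two single digits (pi-bond count, then non-hydrogen bond count), as in the example above. The names `parse_atom_descriptor` and `parse_atom_pair` are hypothetical.

```python
import re

# Assumed descriptor layout: element symbol, pi-bond digit, non-hydrogen-bond digit.
ATOM_RE = re.compile(r"^([A-Za-z]+)(\d)(\d)$")

def parse_atom_descriptor(token):
    """Turn a descriptor such as 'C23' into its three components."""
    match = ATOM_RE.match(token)
    if match is None:
        raise ValueError("unrecognized atom descriptor: " + token)
    element, pi_bonds, non_h_bonds = match.groups()
    return {
        "element": element,
        "pi_bonds": int(pi_bonds),
        "non_h_bonds": int(non_h_bonds),
    }

def parse_atom_pair(entry):
    """Split a spaced entry such as 'AP C23 2 C23'."""
    kind, first, length, second = entry.split()
    if kind != "AP":
        raise ValueError("not an atom-pair entry: " + entry)
    # 'length' counts all atoms in the sequence from one end to the other.
    return parse_atom_descriptor(first), int(length), parse_atom_descriptor(second)

first, length, second = parse_atom_pair("AP C23 2 C23")
print(first, length, second)
```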
Atom-sequence fragments show a sequence of atoms and the bonds connecting them. As an example,
one of the fragments that would result from the molecule shown in Figure 1 would be an atom
sequence that traced its way from the hydroxy substitution at position 4 to the carbon attached on the
opposite side of the ring. Here is the fragment’s BCI dictionary representation:
AS O cs C rn C rn C rn C cs C
meaning that this Atom Sequence fragment represents an Oxygen with a chain-single bond to a
Carbon, followed by a ring-normalized bond to a Carbon, followed by a ring-normalized bond to a
Carbon, followed by a ring-normalized bond to a Carbon, followed by a chain-single bond to a
Carbon.
Figure 2. Another Beta-adrenergic Ligand.
Ring-composition-sequence fragments are similar to atom-sequence fragments except that they trace
their way around a single ring. Here is the BCI dictionary representation for the five-membered ring
in Figure 2:
RC O rs C rs O rs C rn C rs
with the elements meaning the same as in an atom sequence. Note that a ring composition sequence
always ends with the bond that closes the ring.
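The difference between the two sequence types can be sketched in Python. The sketch assumes the spaced layout of the examples above, in which atoms and bond codes alternate. An atom sequence then has one more atom than bonds, while a ring composition sequence has equally many because of the closing bond. The function name `parse_sequence` is hypothetical.

```python
def parse_sequence(entry):
    """Split a spaced 'AS' or 'RC' entry into its atoms and its bond codes."""
    tokens = entry.split()
    kind, body = tokens[0], tokens[1:]
    if kind not in ("AS", "RC"):
        raise ValueError("not a sequence entry: " + entry)
    atoms = body[0::2]   # tokens at even positions are atoms
    bonds = body[1::2]   # tokens at odd positions are bond codes
    if kind == "AS" and len(atoms) != len(bonds) + 1:
        raise ValueError("an atom sequence must start and end with an atom")
    if kind == "RC" and len(atoms) != len(bonds):
        raise ValueError("a ring composition must end with the closing bond")
    return kind, atoms, bonds

as_kind, as_atoms, as_bonds = parse_sequence("AS O cs C rn C rn C rn C cs C")
rc_kind, rc_atoms, rc_bonds = parse_sequence("RC O rs C rs O rs C rn C rs")
print(len(as_atoms), len(as_bonds))
print(len(rc_atoms), len(rc_bonds))
```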
Ring-fusion fragments show the sequence of ring connectivities around each ring in the Extended Set
of Smallest Rings (ESSR). The benzene rings in the molecule of Figure 2 are represented with the
fragment
RF XX3 XX3 XX3 XX3 XX3 XX3
indicating 6 atoms, each connected to 3 other atoms (hydrogens included).
The five-membered ring in Figure 2 is represented with the ring-fusion fragment:
RF XX4 XX2 XX3 XX3 XX2
indicating an atom with 4 attached atoms (the leftmost carbon), an atom with 2 attached atoms (an
oxygen), two atoms with 3 attached atoms each (the aromatic carbons), and finally an atom with 2
attached atoms (the other oxygen).
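A small Python sketch shows how the connectivities can be pulled out of a ring-fusion entry. It assumes that every ring position is written as a generic atom code followed by a single connectivity digit, as in the two examples above. The function name `ring_connectivities` is hypothetical.

```python
def ring_connectivities(entry):
    """Return the connectivity of each ring atom in a spaced 'RF' entry."""
    tokens = entry.split()
    if not tokens or tokens[0] != "RF":
        raise ValueError("not a ring-fusion entry: " + entry)
    # Assumption: the last character of each token is the connectivity digit.
    return [int(token[-1]) for token in tokens[1:]]

benzene = ring_connectivities("RF XX3 XX3 XX3 XX3 XX3 XX3")
five_ring = ring_connectivities("RF XX4 XX2 XX3 XX3 XX2")
print(len(benzene), benzene)
print(len(five_ring), five_ring)
```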
All of the fragment types that name atoms have generalized forms that refer to columns in the
periodic table or to Any Atom (AA). There are also generalized forms for bonds: Any Ring (ar), Any
Chain (ac), and Any Bond (aa). BCI provides other fragments and capabilities, but the ones listed
above are the ones that are used in the dataset provided.
Any particular molecule will have many BCI fragments. For instance, the molecule in Figure 1 has 8
augmented atoms, 58 atom pairs, 102 atom sequences, 1 ring composition, and 1 ring fusion, for a
total of 170 unique fragments. As we add additional molecular structures to a dataset, the list of
fragments that occur somewhere in the dataset grows. For a set of 38 beta-adrenergic compounds
from the literature there were 30 augmented atoms, 464 atom pairs, 1922 atom sequences, 4 ring
compositions, and 4 ring fusions, for a total of 2424 unique fragments. For the set of 417 compounds
in the Predictive Toxicology Challenge there were 1085 augmented atoms, 6620 atom pairs, 49,177
atom sequences, 324 ring compositions, and 34 ring fusions, for a total of 57,240 unique fragments.
BCI routines build a dictionary for these fragments. After choosing the fragments and constructing
the dictionary, each compound is re-represented as a molecular fingerprint, a string of true(1)/false(0)
values with each value indicating whether the corresponding dictionary fragment is present. BCI
actually uses a representation that takes advantage of the fact that most compounds use only a small
portion of the available fragments, but we have provided the format that meets the guidelines.
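The step from a dictionary to a fingerprint can be sketched in Python. The toy dictionary and the fragment list below are hypothetical. The sparse form (a list of the positions that are set) is only one plausible way to exploit the fact that most bits are 0, and it is not a description of BCI's internal format.

```python
def dense_fingerprint(present_bits, n_bits):
    """Return a list of 0/1 values, one per dictionary bit (bits are 1-based)."""
    bits = [0] * n_bits
    for position in present_bits:
        bits[position - 1] = 1
    return bits

def sparse_fingerprint(dense):
    """Return the sorted 1-based positions of the bits that are set."""
    return [index + 1 for index, value in enumerate(dense) if value]

# Hypothetical toy dictionary: fragment entry -> 1-based bit position.
dictionary = {
    "AA C cs O rn C rn C": 1,
    "AP C23 2 C23": 2,
    "RF XX3 XX3 XX3 XX3 XX3 XX3": 3,
    "AS O cs C rn C": 4,
}

# Hypothetical fragments of one molecule; the last is not in the dictionary.
fragments_in_molecule = {"AP C23 2 C23", "AS O cs C rn C", "AA N cs C"}

# Fragments that the dictionary does not know are simply ignored.
present = {dictionary[f] for f in fragments_in_molecule if f in dictionary}
dense = dense_fingerprint(present, n_bits=len(dictionary))
print(dense)                      # [0, 1, 0, 1]
print(sparse_fingerprint(dense))  # [2, 4]
```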
One issue that the large number of fragments raises is that a smaller subset of the fragments may
capture all the information in the rest. Each set of fragments that had exactly the same occurrence
pattern over the data set of training compounds was combined into a single fingerprint bit for the set
(with OR semantics assumed for any test compounds that have only a portion of the set). The 57,240
training set fragments were reduced in this way to 6150 fragment sets. The dictionary entries for all
the fragments in a set have the same bit position specified. The dictionary entries for bit 2 are
shown below.
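The merging of fragments with identical occurrence patterns can be sketched in Python. The fragment names (`f1` to `f5`) and the occurrence data are hypothetical, and the order in which bits are numbered here (first appearance) is an assumption, not the numbering used in the provided dataset.

```python
from collections import defaultdict

def group_by_occurrence(fragment_to_compounds):
    """Give one bit to each distinct occurrence pattern over the training set."""
    groups = defaultdict(list)
    for fragment, compounds in fragment_to_compounds.items():
        # Fragments found in exactly the same compounds share a key.
        groups[frozenset(compounds)].append(fragment)
    # Number the fragment sets 1, 2, 3, ... in order of first appearance.
    return {bit: sorted(frags) for bit, frags in enumerate(groups.values(), start=1)}

def fingerprint_bits(compound_fragments, bit_to_fragments):
    """OR semantics: a bit is set if the compound has any fragment of the set."""
    return [
        int(any(f in compound_fragments for f in frags))
        for _, frags in sorted(bit_to_fragments.items())
    ]

# Hypothetical training data: fragment -> compounds that contain it.
training = {
    "f1": {"TR000"},
    "f2": {"TR000"},
    "f3": {"TR001", "TR002"},
    "f4": {"TR000"},
    "f5": {"TR001"},
}
bit_to_fragments = group_by_occurrence(training)
print(bit_to_fragments)

# A test compound holding only part of the first set still sets that bit.
test_bits = fingerprint_bits({"f2", "f5"}, bit_to_fragments)
print(test_bits)  # [1, 0, 1]
```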
The molecular fingerprints for a set of molecules are a compact representation that is ready for use by
the computerized routines. Here we use Unix commands to look at several bits of the first few
compounds. The first line of the result is a header showing which bits we retrieved.
% cut -f1-4,6151 train.bci.fingerprints | head -5
TRNUM   1   2   3   6150
TR000   0   1   0   0
TR001   0   0   1   0
TR002   0   0   0   0
TR003   0   0   0   0
We see that compound TR000 has the fragment for bit 2. Here are the dictionary listings for the
fragments for that bit:
% awk '$1==2' train.bci.dictionary
2 AAC aaCLaaCLaaCL
2 AAC acCLacCLacCL
2 AAC csCLcsCLcsCL
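The same two lookups can be sketched in Python. The sketch assumes that the fingerprint file is tab-separated with a header row (the default input of `cut -f`) and that each dictionary line starts with its bit position. To stay self-contained it reads only the excerpts shown above from in-memory text rather than from the files. The function names are hypothetical.

```python
import csv
import io

def bits_set(fingerprint_lines):
    """Map each compound to the header labels of the bits that are 1."""
    reader = csv.reader(fingerprint_lines, delimiter="\t")
    header = next(reader)
    result = {}
    for row in reader:
        result[row[0]] = [
            int(label) for label, value in zip(header[1:], row[1:]) if value == "1"
        ]
    return result

def fragments_for_bit(dictionary_lines, bit):
    """Return the dictionary entries whose first field is the given bit."""
    found = []
    for line in dictionary_lines:
        fields = line.split()
        if fields and fields[0] == str(bit):
            found.append(" ".join(fields[1:]))
    return found

# Excerpts copied from the command output shown above.
fingerprints = io.StringIO(
    "TRNUM\t1\t2\t3\t6150\n"
    "TR000\t0\t1\t0\t0\n"
    "TR001\t0\t0\t1\t0\n"
    "TR002\t0\t0\t0\t0\n"
    "TR003\t0\t0\t0\t0\n"
)
dictionary = io.StringIO(
    "2 AAC aaCLaaCLaaCL\n"
    "2 AAC acCLacCLacCL\n"
    "2 AAC csCLcsCLcsCL\n"
)

compound_bits = bits_set(fingerprints)
bit2_fragments = fragments_for_bit(dictionary, 2)
print(compound_bits["TR000"])
print(bit2_fragments)
```

With the real files, the two `io.StringIO` objects would be replaced by opened file handles, provided the files follow the assumed layout.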
As we take apart the molecular graph of a molecule to create the fragments, we lose information. In
exchange, we gain a representation (lists of fragments) that lets us easily compare and manipulate
descriptions of molecules. One clear example of losing information is that stereo-isomers have
exactly the same set of fragments. It is also possible, but not easy, to find other pairs of molecules
that have exactly the same set of fragments because we only look for the presence of a fragment in a
molecule and do not count the number of times that each fragment occurs. However, much of the
information in the molecular graph representation is conserved, in the sense that, with patience, most
molecules could be reconstructed from the interlocking information contained in the fragments. We
are left with an experimental question, not a question of information theory: Will the information
that is left in the molecular fragments allow us to recognize the class of compounds with a particular
biological function? The experiments to answer the question are worth performing because
molecular fragments retain detailed information about molecular substructures, but in a form that is
convenient and efficient in computer programs.
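The effect of recording presence rather than counts can be illustrated with a deliberately simplified toy model in Python. Here a "molecule" is just a string of atom symbols and a "fragment" is any run of up to three consecutive atoms. This is not how BCI fragments are generated; it only shows that two different structures can share the same set of fragments while differing in how often each one occurs.

```python
from collections import Counter

def chain_fragments(chain, max_len=3):
    """Count every run of 1..max_len consecutive atoms in a linear chain."""
    counts = Counter()
    for start in range(len(chain)):
        for length in range(1, max_len + 1):
            piece = chain[start:start + length]
            if len(piece) == length:
                # Read each piece in one canonical direction so 'OC' equals 'CO'.
                counts[min(piece, piece[::-1])] += 1
    return counts

shorter = chain_fragments("CCCCO")
longer = chain_fragments("CCCCCO")

# Presence only: the two chains cannot be told apart.
print(set(shorter) == set(longer))  # True
# With occurrence counts they differ.
print(shorter == longer)            # False
```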