BCI fingerprints

Submitted by:
George S. Cowan, Ph.D.
an Idea Integration consultant at
Pfizer Global Research and Development
2800 Plymouth Road
Ann Arbor, Michigan, USA 48105
email: Pfizer Business: George.Cowan@pfizer.com
       Academic Research: cowan@acm.org
Phone: 734-622-3364

Molecular Substructures are represented with fingerprints

A fingerprint is a string of bits that shows whether certain molecular fragments are present in a molecule. BCI fingerprints are interesting because they are keyed; that is, a dictionary gives the relationship between substructures and bits. This aids in interpreting and, ultimately, displaying the results.

Barnard Chemical Information Ltd. has developed categories of useful fragments and ways to generate all such fragments that exist in a library of compounds. They can be reached at:

    Barnard Chemical Information Ltd
    46 Uppergate Road, Stannington, Sheffield S6 6BX, UK
    Tel: +44 (0)114 233 3170
    Fax: +44 (0)114 234 3415
    E-mail: barnard@BCI1.demon.co.uk
    Web: http://www.bci1.demon.co.uk

The Description of Molecular Substructures

A BCI program is used to analyze a set of compounds and construct a dictionary of all the fragments that are found. The BCI fragments used are

  - augmented atoms
  - atom pairs with a distance between atoms of any length desired
  - atom sequences of any length desired
  - ring composition sequences for any size rings desired
  - ring fusions (sequences of ring connectivities)

The fingerprint of a molecule tells which of these fragments it contains.

Because some readers may be interested in understanding exactly which chemical concepts can and cannot be constructed by starting with these fragments, we next give considerable detail about the information that they contain. The simplest way to convey this level of detail is to provide actual examples of the dictionary entries that BCI programs construct. The example dictionary entries have spaces added for readability. For additional detail, contact BCI, Ltd.

Figure 1. A Beta-adrenergic agonist.

Augmented-atom fragments consist of a central atom with the non-hydrogen atoms that it is bonded to and a characterization of the bonds. One example of an augmented-atom fragment is a hydroxyl substitution on a benzene ring, with the role of the central atom played by the ring carbon to which the oxygen is attached. Here is its BCI dictionary entry:

    AA C cs O rn C rn C

meaning that this Augmented Atom fragment represents Carbon with a chain-single bond to an Oxygen, a ring-normalized bond to a Carbon, and a ring-normalized bond to another Carbon.

Atom-pair fragments consist of two atoms, a count of their pi bonds and non-hydrogen bonds, and a count of all the atoms in the sequence from one to the other. A count of 2 pi-bonds on a carbon atom indicates an aromatic carbon. Here is the BCI dictionary entry for the adjacent substitutions on the aromatic ring in Figure 1:

    AP C23 2 C23

which is interpreted as meaning: an Atom Pair consisting of a Carbon atom with 2 pi-bonds and 3 non-hydrogen bonds at one end of a sequence of 2 atoms, with the other end of the sequence being another Carbon with 2 pi-bonds and 3 non-hydrogen bonds.

Atom-sequence fragments show a sequence of atoms and the bonds connecting them. As an example, one of the fragments that would result from the molecule shown in Figure 1 would be an atom sequence that traced its way from the hydroxy substitution at position 4 to the carbon attached on the opposite side of the ring. Here is the fragment's BCI dictionary representation:

    AS O cs C rn C rn C rn C cs C

meaning that this Atom Sequence fragment represents an Oxygen with a chain-single bond to a Carbon, followed by a ring-normalized bond to a Carbon, followed by a ring-normalized bond to a Carbon, followed by a ring-normalized bond to a Carbon, followed by a chain-single bond to a Carbon.

Figure 2. Another Beta-adrenergic Ligand.
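
The spaced example entries for augmented atoms, atom pairs, and atom sequences follow a regular token pattern, so they can be split mechanically. The following Python fragment is only a sketch of that idea, not BCI's own software: the function and class names are hypothetical, the token layout is inferred from the three examples in this section, and it assumes that each end of an atom pair is an element symbol followed by one digit for pi bonds and one digit for non-hydrogen bonds. Only the bond codes explained in the text (cs and rn) are given names; any other code is passed through unchanged.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Bond codes explained in the text; other codes are passed through as-is.
BOND_NAMES = {"cs": "chain-single", "rn": "ring-normalized"}


@dataclass
class BondedFragment:
    kind: str                     # "AA" (augmented atom) or "AS" (atom sequence)
    first_atom: str               # central atom (AA) or starting atom (AS)
    steps: List[Tuple[str, str]]  # (bond code, atom) pairs, in entry order


@dataclass
class AtomPair:
    # Each end is (element, pi-bond count, non-hydrogen bond count).
    ends: Tuple[Tuple[str, int, int], Tuple[str, int, int]]
    atoms_in_sequence: int


def parse_bonded(entry: str) -> BondedFragment:
    """Parse a spaced AA or AS entry such as 'AA C cs O rn C rn C'."""
    tokens = entry.split()
    if len(tokens) < 2 or tokens[0] not in ("AA", "AS"):
        raise ValueError(f"not an AA/AS entry: {entry!r}")
    rest = tokens[2:]
    if len(rest) % 2 != 0:
        raise ValueError(f"expected bond/atom pairs in: {entry!r}")
    steps = list(zip(rest[0::2], rest[1::2]))
    return BondedFragment(tokens[0], tokens[1], steps)


def _parse_end(token: str) -> Tuple[str, int, int]:
    # Assumed layout: element symbol, one digit, one digit (e.g. 'C23').
    match = re.fullmatch(r"([A-Za-z]+)(\d)(\d)", token)
    if match is None:
        raise ValueError(f"unexpected atom-pair end: {token!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def parse_atom_pair(entry: str) -> AtomPair:
    """Parse a spaced AP entry such as 'AP C23 2 C23'."""
    tokens = entry.split()
    if len(tokens) != 4 or tokens[0] != "AP":
        raise ValueError(f"not an AP entry: {entry!r}")
    return AtomPair((_parse_end(tokens[1]), _parse_end(tokens[3])), int(tokens[2]))


def describe(fragment: BondedFragment) -> str:
    """Render an AA/AS fragment in words, using the bond names from the text."""
    parts = [f"{BOND_NAMES.get(bond, bond)} bond to {atom}"
             for bond, atom in fragment.steps]
    return f"{fragment.kind} {fragment.first_atom}: " + ", ".join(parts)


if __name__ == "__main__":
    print(describe(parse_bonded("AA C cs O rn C rn C")))
    print(parse_atom_pair("AP C23 2 C23"))
```

The entries in the actual dictionary file are written without the added spaces, so a parser for the real files would need BCI's format documentation rather than this whitespace-based split.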

Ring-composition-sequence fragments are similar to atom-sequence fragments except that they trace their way around a single ring. Here is the BCI dictionary representation for the five-membered ring in Figure 2:

    RC O rs C rs O rs C rn C rs

with the elements meaning the same as in an atom sequence. Note that a ring composition sequence always ends with the bond that closes the ring.

Ring-fusion fragments show the sequence of ring connectivities around each ring in the Extended Set of Smallest Rings (ESSR). The benzene rings in the molecule of Figure 2 are represented with the fragment

    RF XX3 XX3 XX3 XX3 XX3 XX3

indicating 6 atoms that each are connected to 3 other atoms including hydrogen. The five-membered ring in Figure 2 is represented with the ring-fusion fragment:

    RF XX4 XX2 XX3 XX3 XX2

indicating an atom with 4 attached atoms (the leftmost carbon), an atom with 2 attached atoms (an oxygen), two atoms with 3 attached atoms each (the aromatic carbons), and finally an atom with 2 attached atoms (the other oxygen).

All of the fragment types that name atoms have generalized forms that refer to columns in the periodic chart or to Any Atom (AA). There are also generalized forms for bonds: Any Ring (ar), Any Chain (ac), and Any Bond (aa). BCI provides other fragments and capabilities, but the ones listed above are the ones that are used in the dataset provided.

Any particular molecule will have many BCI fragments. For instance, the molecule in Figure 1 has 8 augmented atoms, 58 atom pairs, 102 atom sequences, 1 ring composition, and 1 ring fusion, for a total of 170 unique fragments. As we add additional molecular structures to a dataset, the list of fragments that occur somewhere in the dataset grows. For a set of 38 beta-adrenergic compounds from the literature there were 30 augmented atoms, 464 atom pairs, 1922 atom sequences, 4 ring compositions, and 4 ring fusions, for a total of 2424 unique fragments.
For the set of 417 compounds in the Predictive Toxicology Challenge there were 1085 augmented atoms, 6620 atom pairs, 49,177 atom sequences, 324 ring compositions, and 34 ring fusions, for a total of 57,240 unique fragments. BCI routines build a dictionary for these fragments.

After choosing the fragments and constructing the dictionary, each compound is re-represented as a molecular fingerprint, a string of true(1)/false(0) values with each value indicating whether the corresponding dictionary fragment is present. BCI actually uses a representation that takes advantage of the fact that most compounds use only a small portion of the available fragments, but we have provided the format that meets the guidelines.

One issue that the large number of fragments raises is that a smaller subset of the fragments may capture all the information in the rest. Each set of fragments that had exactly the same occurrence pattern over the data set of training compounds was combined into a single fingerprint bit for the set (with an OR semantics assumed for any test compounds that have a portion of the set). The 57,240 training set fragments were reduced in this way to 6150 fragment sets. The dictionary entries for all the fragments in a set have the same bit position specified. The dictionary entries for bit 2 are shown below.

The molecular fingerprints for a set of molecules are a compact representation that is ready for use by the computerized routines. Here we use Unix commands to look at several bits of the first few compounds. The first line of the result is a header showing which bits we retrieved.

    % cut -f1-4,6151 train.bci.fingerprints | head -5
    TRNUM   1   2   3   6150
    TR000   0   1   0   0
    TR001   0   0   1   0
    TR002   0   0   0   0
    TR003   0   0   0   0

We see that compound TR000 has the fragment for bit 2.
Here are the dictionary listings for the fragments for that bit:

    % awk '$1==2' train.bci.dictionary
    2 AAC aaCLaaCLaaCL
    2 AAC acCLacCLacCL
    2 AAC csCLcsCLcsCL

As we take apart the molecular graph of a molecule to create the fragments, we lose information. In exchange, we gain a representation (lists of fragments) that lets us easily compare and manipulate descriptions of molecules. One clear example of losing information is that stereo-isomers have exactly the same set of fragments. It is also possible, but not easy, to find other pairs of molecules that have exactly the same set of fragments because we only look for the presence of a fragment in a molecule and do not count the number of times that each fragment occurs. However, much of the information in the molecular graph representation is conserved, in the sense that, with patience, most molecules could be reconstructed from the interlocking information contained in the fragments.

We are left with an experimental question, not a question of information theory: Will the information that is left in the molecular fragments allow us to recognize the class of compounds with a particular biological function? The experiments to answer the question are worth performing because molecular fragments retain detailed information about molecular substructures, but in a form that is convenient and efficient in computer programs.
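
As a supplement to the fingerprint construction described earlier, the merging of fragments with identical occurrence patterns can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the routine that produced the dataset: the function names, the compound identifiers (cmpd1, cmpd2, cmpd3), and the placeholder fragment OTHER are hypothetical, and the bit numbering here simply follows the sorted order of the fragments. The three chlorine entries are the ones listed for bit 2 in the dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def merge_fragments(training: Dict[str, Set[str]]) -> Tuple[Dict[str, int], int]:
    """Assign one bit to each set of fragments with the same occurrence pattern.

    training maps a compound id to the set of fragments found in it.
    Returns (fragment -> 1-based bit position, number of bits).
    """
    compounds = sorted(training)
    all_fragments = sorted(set().union(*training.values()))

    # Group fragments by the tuple of present/absent flags over all compounds.
    groups: Dict[Tuple[bool, ...], List[str]] = defaultdict(list)
    for fragment in all_fragments:
        pattern = tuple(fragment in training[c] for c in compounds)
        groups[pattern].append(fragment)

    bit_of: Dict[str, int] = {}
    for bit, fragments in enumerate(groups.values(), start=1):
        for fragment in fragments:
            bit_of[fragment] = bit
    return bit_of, len(groups)


def fingerprint(fragments: Set[str], bit_of: Dict[str, int], n_bits: int) -> List[int]:
    """Build a 0/1 fingerprint; a bit is set if ANY fragment of its set is present."""
    bits = [0] * n_bits
    for fragment in fragments:
        if fragment in bit_of:          # fragments unseen in training are ignored
            bits[bit_of[fragment] - 1] = 1
    return bits


if __name__ == "__main__":
    chlorines = {"AAC aaCLaaCLaaCL", "AAC acCLacCLacCL", "AAC csCLcsCLcsCL"}
    training = {
        "cmpd1": set(chlorines),
        "cmpd2": {"OTHER"},
        "cmpd3": chlorines | {"OTHER"},
    }
    bit_of, n_bits = merge_fragments(training)
    print(n_bits)
    # A test compound holding only part of a merged set still sets that bit.
    print(fingerprint({"AAC acCLacCLacCL"}, bit_of, n_bits))
```

In this toy data the three chlorine entries always occur together, so they share one bit, and a test compound that contains only one of them still sets that bit, which is the OR semantics mentioned in the text.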