Molecular Similarity and Molecular Structure
N. Sukumar
ISPC, San Francisco, Aug. 2007

Why should molecules have Structure?
“The idea that molecules are microscopic, material bodies with more or less well-defined shapes has been fundamental to the development of our understanding of the physicochemical properties of matter, and it is now so familiar and deeply ingrained in our thinking that it is usually taken for granted - it is the central dogma of chemistry.”
- R. G. Woolley (Woolley 1980)

What do we mean by Molecular Structure?
• The notion that a molecule has structure is fundamental to much of chemistry as practiced today. But what do we really mean by this term?
• There are several ways to envision molecular structure, some more general and others fairly concrete but rather restrictive.
• As an example of the latter, we can think of molecular structure in terms of familiar ball-and-stick molecular models. Such models are simple to visualize and are intuitively appealing. But by confining our conception to such models, we risk imposing a classical, mechanical vision upon an intrinsically microscopic quantum world.
• From a philosophical perspective, we can define structure as that property of a molecule by virtue of which it occupies space in the real world.
• From a statistical perspective, we can define structure as that which distinguishes an object from a heap of its parts, in this case, a molecule from a collection of its constituent atoms.

Molecular Structure
• This statistical definition generalizes the concept of molecular structure to situations where the relative spatial locations of the constituent atoms may not be known and makes the link to the fundamental statistics of the constituent particles.
• Most modern molecular structure determinations are indirect, utilizing a transformation from momentum space or frequency domain.
• Mathematically, structure is measured by the interparticle distribution function.
Thus an ideal gas of atoms has minimal structure, a hydrogen-bonded liquid is more structured and a crystal or molecular solid even more so.
• The familiar ball-and-stick molecular models are thus the rigid limit of a hierarchy of structures.

Hierarchy of Molecular Structure Representations

Molecular Structure and Shannon information entropy
• The Shannon information entropy is a maximum for a uniform distribution.
• Deviations from this uniformity may be attributed to structure.
• Electron-nuclear forces add structure to an electron distribution, thereby lowering the entropy.
• Electron repulsion forces broaden the distribution and hence raise the entropy.
• A decrease of Shannon information entropy is due to the dominant role of the attractive forces exerted by the nuclei in imparting structure to the electron distribution in a molecular system.

Molecular structure in the Born-Oppenheimer approximation
• The BO separation of electronic and nuclear motions in molecules shows that there must exist molecular states which can be approximately represented as products of electronic and nuclear functions.
• The electronic structure problem then involves solving for the eigenfunctions of an electronic Hamiltonian, while the nuclear function satisfies an equation of motion, with the eigenvalues of the electronic Hamiltonian forming an effective potential energy surface upon which the nuclei may be envisioned to move.
• The distinct concepts of electronic structure and molecular structure are thus intimately related.
• This is, of course, not accidental: as Hohenberg and Kohn showed in 1964, there exists a unique mapping between the potential v(r) due to the nuclei and the distribution of electron density ρ(r).
• Since ρ(r) determines the number of electrons N = ∫ρ(r) dr, ρ(r) also uniquely determines the ground state wave function ψ, the ground state electronic energy and the molecular structure.
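
Two of the quantitative statements above, N = ∫ρ(r) dr and the lowering of the Shannon entropy by nuclear attraction, can be checked numerically on a simple case. The following is a minimal sketch and is not taken from the talk: it assumes a hydrogen-like 1s density ρ(r) = (Z³/π) exp(-2Zr) in atomic units, with the nuclear charge Z standing in for the strength of the electron-nuclear attraction. The function names and the radial grid are illustrative choices.

```python
import numpy as np

def density_1s(r, z):
    """Hydrogen-like 1s electron density in atomic units (assumed model)."""
    return (z ** 3 / np.pi) * np.exp(-2.0 * z * r)

def radial_integral(values, r):
    """Integrate a spherically symmetric function over all space (trapezoid rule)."""
    integrand = 4.0 * np.pi * r ** 2 * values
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(r)))

def electron_count(z, r):
    """N = integral of rho(r) over space."""
    return radial_integral(density_1s(r, z), r)

def shannon_entropy(z, r):
    """S = -integral of rho(r) ln rho(r) over space."""
    rho = density_1s(r, z)
    return radial_integral(-rho * np.log(rho), r)

# Illustrative radial grid in Bohr; the density is negligible beyond 40 Bohr.
R_GRID = np.linspace(1e-6, 40.0, 200_001)

for z in (1.0, 2.0, 3.0):
    print(f"Z={z:.0f}  N={electron_count(z, R_GRID):.6f}  "
          f"S={shannon_entropy(z, R_GRID):.4f}")
```

For this one-electron model the entropy has the closed form S = 3 + ln π - 3 ln Z, so a stronger nuclear attraction (larger Z) gives a more compact density and a lower entropy while N stays at 1. The model contains no electron repulsion, so it illustrates only the electron-nuclear half of the argument.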
Electron density envelopes for Ethylene
ρ = 0.002 e/Bohr³, ρ = 0.20 e/Bohr³, ρ = 0.36 e/Bohr³

Electron density profiles of ethylene

Molecular structure and bond paths
“Will you reflect for a moment on some of the things that I have been saying? I described a bond, a normal simple chemical bond; and I gave many details of its character (and could have given many more). Sometimes it seems to me that a bond between two atoms has become so real, so tangible, so friendly that I can almost see it. And then I awake with a little shock; for a chemical bond is not a real thing: it does not exist: no one has ever seen it, no one ever can. It is a figment of our own imagination.”
- C. A. Coulson (Coulson 1951; Coulson 1955)

Molecular structure in the Quantum Theory of Atoms in Molecules
• The virial partitioning of molecular systems into roughly neutral subsystems forms the basis of the Quantum Theory of Atoms in Molecules, providing a rigorous and unambiguous recipe for partitioning a molecule into atomic subsystems.
• In this formulation, the nuclei function as attractors of the electron density field ρ(r), the atom being defined as the union of an attractor and its basin of attraction.
• Each atom thus contains one and only one nucleus, with the gradient paths of the electron density (∇ρ) being employed to define the bonds between atoms as well as the interatomic boundaries: the bond path between any two atoms is defined as the unique gradient path ∇ρ connecting the respective nuclei, while the interatomic surface is defined through the zero-flux criterion ∇ρ · n̂ = 0, where n̂ is the normal to the surface.

Chemical topology & Molecular graphs
• This partitioning scheme has a sound theoretical underpinning: the zero-flux criterion ensures that each atomic subsystem satisfies the virial theorem and thereby ensures the spatial additivity of the action W = ∫L(t) dt (where L is the Lagrangian), and of its variation, in accordance with Schwinger’s principle of stationary action.
• It is through this principle that we are able to extend the formulation of quantum mechanics to an open quantum subsystem, such as an atom in a molecule.
• Through bond paths, we also recover the concept of chemical bonds: the topology of the bond paths completely specifies the molecular graph.
• This molecular graph is commonly referred to as the 2-D structure of the molecule.

Electron density contours, gradient paths and bond paths of ethylene
http://www.chemistry.mcmaster.ca/faculty/bader/aim/aim_1.html

Bond paths and non-nuclear attractors
Figure panels: Li2, Li4, Li6, with non-nuclear attractors marked. No direct Li-Li bonds.

Quantum Topology of Molecular Structure and Change
Figure labels: Water; Umbilic catastrophe.

Structure and Conformation
• Conformational flexibility is a critical link between structure, stability and function.
• Enzymes must be flexible enough to mediate a reaction pathway, yet rigid enough to achieve molecular recognition.

Transition-state theory involves a rate-limiting step, shown as an obligatory thermodynamic barrier.

Protein folding landscape
Theory and simulations show that energy landscapes for protein folding are funnel-shaped and have no apparent microscopic energetic or entropic barriers.
Schonbrun, Jack and Dill, Ken A. (2003) Proc. Natl. Acad. Sci. USA 100, 12678-12682

Encoding Structure: Descriptors
Figure inputs: a chemical structure drawing (atom labels O, N, N, Cl) and a DNA sequence (AAACCTCATAGGAAGCATACCA GGAATTACATCA…).
• Structural Descriptors
• Physicochemical Descriptors
• Topological Descriptors
• Geometrical Descriptors
Molecular Structures → Descriptors → Model → Property

Molecular Representations
Figure: chemical structure drawing (atom labels O, N, H3C, CH3).

Chemistry space and Molecular Similarity
The figure depicts a cartoon representation of the relationship between the continuum of chemical space (light blue) and the discrete areas of chemical space that are occupied by compounds with specific affinity for biological molecules.
Examples of such molecules are those from major gene families (shown in brown, with specific gene families colour-coded as proteases (purple), lipophilic GPCRs (blue) and kinases (red)). The independent intersection of compounds with drug-like properties, that is, those in a region of chemical space defined by the possession of absorption, distribution, metabolism and excretion properties consistent with orally administered drugs (ADME space), is shown in green.
Christopher Lipinski & Andrew Hopkins, Nature, Vol. 432, 16 December 2004, pp. 855-861

Molecular Similarity Assessment: Motivation…

The Drug Discovery Pipeline
Distribution of drug potencies

The Interface of NIH and Drug Development
Figure, two panels sharing one timeline: "Current Public Sector Science" and "Proposed Public Sector Science". Curves: Cumulative Cost; Probability of success. Markers: Dedicated MedChem begins; Compound accepted into Development.
Timeline stages (duration):
• Target identification (indefinite)
• Assay development (1 yr)
• Screening, HTS or otherwise (1 yr)
• Hit-to-Probe (1 yr)
• Lead Optimization, Toxicology (~3 yrs)
• Ph I: Safety (1 yr)
• Ph II: Dose finding, initial efficacy in patient pop. (2 yrs)
• Ph III: Efficacy and safety in large populations (~3 yrs)
• Regulatory review (1.5 yrs)
• Ph IV-V: Additional indications, Safety monitoring (indefinite)

Model Applicability Domain Analysis
Figure panels: Poor Model Applicability; Good Model Applicability.

Macrocycles – musky odor or not? (C. Davidson and B. Lavine)
Figure classes: musk; non-musk.

Nitroaromatics – musk or non-musk? (C. Davidson and B. Lavine)
Figure classes: musk; non-musk.

Descriptor Selection
• What features of a molecule are related to the property of interest?
• What descriptors can capture that information?
Molecular Structures → Descriptors → Model → Property

GA/PCA Results with TAE descriptors (C. Davidson and B. Lavine)
7 selected features. Classes: 1 = Nonmusk, 2 = Musk.

Results with Wavelet and PEST Descriptors (C. Davidson and B. Lavine)
Figure: 3D PC Plot, Dim(9); axes PC1 and PC2. Classes: 1 = Nonmusk, 2 = Musk.

Nitroaromatics and macrocycles (B. Lavine)
Figure: 3D PC Plot, Dim(30); axes PC1 and PC2. Classes: 1 = Macro Non-Musk, 2 = Macro Musk, 1 = Nitro Non-Musk, 2 = Nitro Musk.

Assessment of Molecular Similarity

Assessment of Similarity
It was six men of Indostan
To learning much inclined,
Who went to see the Elephant
(Though all of them were blind),
That each by observation
Might satisfy his mind.

The First approached the Elephant,
And happening to fall
Against his broad and sturdy side,
At once began to bawl:
“God bless me! but the Elephant
Is very like a wall!”

The Second, feeling of the tusk,
Cried, “Ho! what have we here
So very round and smooth and sharp?
To me ’tis mighty clear
This wonder of an Elephant
Is very like a spear!”

The Third approached the animal,
And happening to take
The squirming trunk within his hands,
Thus boldly up and spake:
“I see,” quoth he, “the Elephant
Is very like a snake!”

The Fourth reached out an eager hand,
And felt about the knee.
“What most this wondrous beast is like
Is mighty plain,” quoth he;
“’Tis clear enough the Elephant
Is very like a tree!”

The Fifth, who chanced to touch the ear,
Said: “E’en the blindest man
Can tell what this resembles most;
Deny the fact who can,
This marvel of an Elephant
Is very like a fan!”

The Sixth no sooner had begun
About the beast to grope,
Than, seizing on the swinging tail
That fell within his scope,
“I see,” quoth he, “the Elephant
Is very like a rope!”

And so these men of Indostan
Disputed loud and long,
Each in his own opinion
Exceeding stiff and strong,
Though each was partly in the right,
And all were in the wrong!
- John Godfrey Saxe (1816-1887)

Why there is No Salt in the Sea
Joseph E. Earley, Foundations of Chemistry (Springer Netherlands), Volume 7, Number 1, Pages 85-102, January 2005
What, precisely, is 'salt'? It is a certain white, solid, crystalline material, also called sodium chloride. Does any of that solid white stuff exist in the sea? – Clearly not. One can make salt from sea water easily enough, but that fact does not establish that salt, as such, is present in brine. (Paper and ink can be made into a novel – but no novel actually exists in a stack of blank paper with a vial of ink close by.) When salt dissolves in water, what is present is no longer 'salt' but rather a collection of hydrated sodium cations and chloride anions, neither of which is precisely salt, nor is the collection. The aqueous material in brine is also significantly different from pure water. Salt may be considered to be present in seawater, but only in a more or less vague 'potential' way. Actually, there is no salt in the sea.

What about water in proteins?
• Our bodies are an aqueous environment. Liquid water constitutes one of the essential components of biological systems and it is difficult to overstate the role of water in biological structure and function.
• Proteins crystallize with several units of H2O weakly bound to the rest of the protein.
• H2O provides the thermodynamic driving force for proteins to fold and self-assemble.
• It mediates not only tertiary and quaternary interactions, but also interactions between different biomolecules, and between biomolecules and ligands or surfaces.
• H2O molecules are also known to take part in specific enzymatic reactions.
• Protein conformational dynamics appear to be linked (or slaved) to the dynamics of vicinal H2O, thereby affecting protein function.
• H2O in the vicinity of proteins and other biomolecules critically influences protein structure, dynamics, function and other thermodynamic and kinetic properties.

pH-Sensitive Protein Surface Electrostatic Potential Maps
Figure panels: 1POC EP at pH 3.0, 4.0, 5.0, 6.0, 7.0 and 8.0.

DNA Binding Complex with 1CGP

Representations of DNA Structure
Can we improve on ATCG?
• Most bioinformatic methods represent DNA by a sequence of letters.
• DNA bases are assumed to act independently.
• This representation of DNA has little to do with the energetics of binding of protein to DNA.
Dixel approach
• Characterization of DNA through features of electron densities on the surfaces of the major and minor grooves of the DNA.
• The central base pair resides in the specific electronic environment generated by the flanking base pairs.

DNA Nucleotide Triplets as DIXELS
• A “basis set” of all possible nucleotide base pairs with all possible neighbors results in a set of base pair “triplets”.
• Ab initio properties of the base pair and its two flanking base pairs (end capped) are computed.
• The central base pair is encoded and stored as a “DIXEL” object.
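
The size of this triplet “basis set” can be made concrete with a short enumeration. The sketch below is illustrative only and is not the DIXEL implementation: it assumes Watson-Crick pairing (A-T, G-C), labels each base pair by the base on one strand, and treats a triplet and its reverse complement as the same double-stranded object. All function names are hypothetical.

```python
from itertools import product

# Watson-Crick partners (assumption: standard A-T and G-C pairing only).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def canonical(triplet):
    """Choose one label for a triplet and its reverse complement."""
    return min(triplet, reverse_complement(triplet))

def all_triplets():
    """Every central base with every possible pair of neighbors."""
    return ["".join(p) for p in product("ACGT", repeat=3)]

def unique_duplex_triplets():
    """Triplets that remain distinct once both strands are considered."""
    return sorted({canonical(t) for t in all_triplets()})

def triplet_encoding(sequence):
    """Slide a 3-base window so each interior base is seen with its neighbors."""
    return [sequence[i - 1:i + 2] for i in range(1, len(sequence) - 1)]

print(len(all_triplets()), len(unique_duplex_triplets()))
print(triplet_encoding("AAACCT"))
```

Under these assumptions there are 64 ordered triplets, which reduce to 32 distinct double-stranded triplets because no three-base sequence is its own reverse complement. The source does not say which of the two counts the DIXEL library uses.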
Base pair properties perturbed by flanking base pairs

Challenges in Molecular Similarity Assessment
“First there are the known knowns”
- These are the things that we know we know
“Then there are the known unknowns”
- These are the things that we now know we do not know
“Finally there are also the unknown unknowns”
- These are the things that we do not yet know we do not know
“And each day brings us a few more unknown unknowns”
- Donald Rumsfeld, 2003