Milano Chemometrics and QSAR Research Group
Roberto Todeschini, Viviana Consonni, Manuela Pavan, Andrea Mauri, Davide Ballabio, Alberto Manganaro
chemometrics, molecular descriptors, QSAR, multicriteria decision making, environmetrics, experimental design, artificial neural networks, statistical process control
Department of Environmental Sciences, University of Milano - Bicocca
P.za della Scienza, 1 - 20126 Milano (Italy)
Website: michem.unimib.it/chm/

Roberto Todeschini
Milano Chemometrics and QSAR Research Group
Molecular descriptors
Constitutional descriptors and graph invariants
Iran - February 2009

Content
- Counting descriptors
- Empirical descriptors
- Fragment descriptors
- Molecular graphs
- Topological descriptors

Counting descriptors
Each descriptor represents the number of elements of some defined chemical quantity. For example:
- the number of atoms or bonds
- the number of carbon or chlorine atoms
- the number of OH or C=O functional groups
- the number of benzene rings
- the number of defined molecular fragments

Counting descriptors
The sum of some atomic or bond property is also considered a count descriptor, as is its average:
\[ P = \sum_{i=1}^{A} w_i \qquad MW = \sum_{i=1}^{A} m_i \qquad AMW = MW / A \]
For example:
- molecular weight and average molecular weight
- sum of the atomic electronegativities
- sum of the atomic polarizabilities
- sum of the bond orders

Counting descriptors
A counting descriptor n is a semi-positive variable, i.e. n ≥ 0.
Its statistical distribution is usually a Poisson distribution.
Main characteristics
• simple
• the most used
• local information
• high degeneracy
• discriminant modelling power

Empirical descriptors
Descriptors based on specific structural aspects present in sets of congeneric compounds and usually not applicable (or giving a single default value) to compounds of different classes.
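The counting descriptors introduced earlier (element counts, molecular weight MW, average molecular weight AMW) can be sketched in a few lines. This is only an illustration: the helper name `count_descriptors`, the rounded atomic masses, and the choice of ethanol as the example molecule are assumptions, not part of the slides.

```python
from collections import Counter

# Standard atomic masses (g/mol), rounded; only a few elements are listed.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "Cl": 35.45}

def count_descriptors(atoms):
    """Return element counts, MW and AMW for a list of element symbols.

    `count_descriptors` is a hypothetical helper written for this sketch.
    """
    counts = Counter(atoms)                    # e.g. number of C or O atoms
    mw = sum(ATOMIC_MASS[a] for a in atoms)    # MW = sum of atomic masses m_i
    amw = mw / len(atoms)                      # AMW = MW / A
    return counts, mw, amw

# Ethanol, C2H6O, written out atom by atom (hydrogens included).
ethanol = ["C", "C", "O"] + ["H"] * 6
counts, mw, amw = count_descriptors(ethanol)
print(counts["C"], counts["O"], counts["H"])   # 2 1 6
print(round(mw, 3), round(amw, 3))
```

Because every quantity is a plain count or sum, many different molecules map to the same values, which is the high degeneracy noted above.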
Empirical descriptors
Index of Taillander (Taillander et al., 1983)
It is a descriptor dedicated to the modelling of benzene rings and is defined as the sum of the six lengths joining the adjacent substituent groups.
[Figure: benzene ring with substituents H, H, Cl, CH3, H, H]

Empirical descriptors
Hydrophilicity index (Hy) (Todeschini et al., 1999)
It is a descriptor dedicated to the modelling of hydrophilicity and is based on a function of the counts of hydrophilic groups (OH-, SH-, NH-, ...) and carbon atoms.
\[ Hy = \frac{(1 + n_{Hy}) \log(1 + n_{Hy}) + n_C \left( \frac{1}{n} \log \frac{1}{n} \right) + \sqrt{\frac{n_{Hy}}{n^2}}}{\log(1 + n)} \qquad -1 \le Hy \le 3.64 \]
n_Hy: number of hydrophilic groups
n_C: number of carbon atoms
n: total number of non-hydrogen atoms

Empirical descriptors
Compound                nHy    nC      n      Hy
hydrogen peroxide         2     0      2    3.64
carbonic acid             2     1      3    3.48
water                     1     0      1    3.44
butanetetraol             4     4      8    3.30
propanetriol              3     3      6    2.54
ethanediol                2     2      4    1.84
methanol                  1     1      2    1.40
ethanol                   1     2      3    0.71
decanediol                2    10     12    0.52
propanol                  1     3      4    0.37
butanol                   1     4      5    0.17
pentanol                  1     5      6    0.03
methane                   0     1      1    0.00
nHy = 0 and nC = 0        0     0      N    0.00
decanol                   1    10     11   -0.28
ethane                    0     2      2   -0.63
pentane                   0     5      5   -0.90
decane                    0    10     10   -0.96
alkane with nC = 1000     0  1000   1000   -1.00

Fragment approach
- Parametric approach (Hammett-Hansch, 1964)
- Substituent approach (Free-Wilson, Fujita-Ban, 1976)
- DARC-PELCO approach (Dubois, 1966)
- Sterimol approach (Verloop, 1976)

Fragment approach
The biological activity of a molecule is the sum of its fragment properties.
Congenericity principle: QSAR strategies can be applied ONLY to classes of similar compounds.
- common reference skeleton
- molecule properties gradually modified by substituents

Hansch approach (Corwin Hansch, 1964)
Biological response = f1(L) + f2(E) + f3(S) + f4(M)
1 L: lipophilic properties
2 E: electronic properties
3 S: steric properties
4 M: other molecular properties

Hansch approach
1 Congenericity approach
2 Linear additive scheme
3 Limited representation of global molecular properties
4 No 3D and conformational information

Free-Wilson approach
[Figure: five molecules sharing a methyl-substituted (Me) skeleton with two substitution positions]
Molecule     1    2    3    4    5
Position 1   H    I    H    F    Br
Position 2   H    H    I    F    F
Free-Wilson approach
Indicator variables for the substituents F, Br, and I at positions 1 and 2 of the five molecules (H at a position gives all zeros):
          Pos. 1         Pos. 2
          F   Br  I      F   Br  I
mol.1     0   0   0      0   0   0
mol.2     0   0   1      0   0   0
mol.3     0   0   0      0   0   1
mol.4     1   0   0      1   0   0
mol.5     0   1   0      1   0   0

Free-Wilson approach (Free-Wilson, 1964)
\[ y_i = b_0 + \sum_{s=1}^{S} \sum_{k=1}^{N_s} b_{ks} \, I_{i,ks} \]
I_ks: absence/presence of the k-th substituent in the s-th site
\[ y_i = b_0 + b_{11} I_{i,11} + b_{21} I_{i,21} + b_{31} I_{i,31} + b_{12} I_{i,12} + b_{22} I_{i,22} + b_{32} I_{i,32} \]
The first three terms refer to F, Br, I at Pos. 1; the last three to F, Br, I at Pos. 2.

Fragment approach
Fingerprints
binary vector: 1000101000000010000000
1 = presence of a fragment, 0 = absence of a fragment
Used for similarity searching.

Molecular graph
[Figure: example molecular graph with vertices numbered 1 to 7]

Molecular graph
Mathematical object defined as G = (V, E)
set V: vertices (atoms)
set E: edges (bonds)

Molecular graph
Usually hydrogen atoms are not considered in the molecular graph: H-depleted molecular graph.

Molecular graph
A walk in G is a sequence of vertices w = (v1, v2, v3, ..., vk) such that {vj, vj+1} ∈ E. The length of a walk is the number of edges traversed by the walk.
A path in G is a walk without any repeated vertices. The length of a path (v1, v2, v3, ..., vk+1) is k.
[Figure: example graph with vertices numbered 1 to 6]
v1 v2 v3 v2 v5: walk of length 4
v1 v2 v3 v4 v5: path of length 4

Molecular graph
The topological distance dij is the length of the shortest path between the vertices vi and vj.
The detour distance dij is the length of the longest path between the vertices vi and vj.
[Figure: example graph with vertices numbered 1 to 6; topological distance d15 = 2, detour distance d15 = 4]

Molecular graph
A self-returning walk is a walk closed in itself, i.e. a walk starting and ending on the same vertex.
A cycle is a walk with no repeated vertices other than its first and last ones (v1 = vk).
[Figure: example graph with vertices numbered 1 to 6]
v1 v2 v3 v2 v1: self-returning walk of length 4
v2 v3 v4 v5 v2: cycle

Molecular graph
The molecular walk (path) count MWCk (MPCk) of order k is the total number of walks (paths) of k-th length in the molecular graph.
MWC0 = nSK (no. of atoms)
MWC1 = nBO (no. of bonds)
DRAGON: MWC1, MWC2, …, MWC10
Related to: molecular size, branching, graph complexity

Molecular graph
The self-returning walk count SRWk of order k is the total number of self-returning walks of length k in the graph.
SRW1 = nSK
SRW2 = nBO
DRAGON: SRW1, SRW2, …, SRW10
They are spectral moments of the adjacency matrix, i.e. linear combinations of counts of certain fragments contained in the molecular graph (embedding frequencies).

Molecular graph
Local vertex invariants (LOVIs) are quantities associated to each vertex of a molecular graph.
Graph invariants are molecular descriptors representing graph properties that are preserved by isomorphism.
- characteristic polynomial
- derived from local vertex invariants

Molecular graph
[Diagram: molecular graph, topological matrix, algebraic operator, local vertex invariants, graph invariants, molecular descriptors]

Graph invariants (from the molecular graph, or from the molecular geometry: x, y, z coordinates)
topostructural descriptors:
- Wiener index, Hosoya Z index
- Zagreb indices, Mohar indices
- Randic connectivity index
- Balaban distance connectivity index
- Schultz molecular topological index
- Kier shape descriptors
- eigenvalues of the adjacency matrix
- eigenvalues of the distance matrix
- Kirchhoff number
- detour index
- topological charge indices
- ...
topographic descriptors:
- 3D-Wiener index
- 3D-Balaban index
- D/D index
- ...
topochemical descriptors:
- Kier-Hall valence connectivity indices
- Burden eigenvalues
- BCUT descriptors
- Kier alpha-modified shape descriptors
- 2D autocorrelation descriptors
- ...
topological information indices:
- total information content on ...
- mean information content on ...

Molecule graph invariants
Numerical chemical information extracted from molecular graphs.
The mathematical representation of a molecular graph is made by the topological matrices:
• adjacency matrix
• atom connectivity matrix
• distance matrix
• edge distance matrix
• incidence matrix ...
more than 60 matrix representations of the molecular structure

Local vertex invariants
Local vertex invariants (LOVIs) are quantities associated to each vertex of a molecular graph.
Examples:
• atom vertex degree
• valence vertex degree
• sum of the vertex distance degree
• maximum vertex distance degree

Topological matrices
Adjacency matrix
Derived from a molecular graph, it represents the whole set of connections between adjacent pairs of atoms.
\[ a_{ij} = \begin{cases} 1 & \text{if atoms } i \text{ and } j \text{ are bonded} \\ 0 & \text{otherwise} \end{cases} \]

Topological matrices
Bond number B
It is the simplest graph invariant obtained from the adjacency matrix. It is the number of bonds in the molecular graph, calculated as:
\[ B = \frac{1}{2} \sum_{i=1}^{A} \sum_{j=1}^{A} a_{ij} \]
where a_ij is the entry of the adjacency matrix.

Local vertex invariants
atom vertex degree δi
It is the row sum of the vertex adjacency matrix.
[Figure: example molecular graph with vertices numbered 1 to 7]
      1   2   3   4   5   6   7    δi
1     0   1   0   0   0   0   0    1
2     1   0   1   0   1   0   1    4
3     0   1   0   1   0   1   0    3
4     0   0   1   0   0   0   0    1
5     0   1   0   0   0   0   0    1
6     0   0   1   0   0   0   0    1
7     0   1   0   0   0   0   0    1

Local vertex invariants
valence vertex degree for atoms of the 2nd principal quantum number (C, N, O, F)
\[ \delta_i^v = Z_i^v - h_i \]
Z_i^v: number of valence electrons of the i-th atom
h_i: number of hydrogens bonded to the i-th atom

Local vertex invariants
valence vertex degree
The vertex degree of the i-th atom is the count of edges incident with the i-th atom, i.e. the count of bonds or electrons.

Local vertex invariants
valence vertex degree for atoms with principal quantum number > 2
\[ \delta_i^v = \frac{Z_i^v - h_i}{Z_i - Z_i^v - 1} \]
Z_i: total number of electrons of the i-th atom (atomic number)

Topological descriptors
Zagreb indices (Gutman, 1975)
\[ M_1 = \sum_{a=1}^{A} \delta_a^2 \qquad M_2 = \sum_{b} (\delta_i \, \delta_j)_b \]
δi: vertex degree of the i-th atom

Topological descriptors
Kier-Hall connectivity indices (1986)
They are based on the decomposition of the molecular graph into fragments (subgraphs) of different size and complexity, and use atom vertex degrees as subgraph weights.
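The adjacency-based quantities defined above (bond number B, atom vertex degrees, Zagreb indices M1 and M2) can be sketched directly from the adjacency matrix of the 7-vertex example graph. This is a minimal illustration in plain Python; the variable names are chosen here and are not taken from the slides or from any descriptor software.

```python
# Adjacency matrix of the 7-vertex example graph, transcribed from the table above.
A = [
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0],
]
n = len(A)

# Atom vertex degree: row sum of the adjacency matrix.
degrees = [sum(row) for row in A]

# Bond number: half the sum of all adjacency entries.
B = sum(degrees) // 2

# First Zagreb index: sum of squared vertex degrees over all atoms.
M1 = sum(d * d for d in degrees)

# Second Zagreb index: sum over bonds (i < j, a_ij = 1) of the degree products.
M2 = sum(
    degrees[i] * degrees[j]
    for i in range(n)
    for j in range(i + 1, n)
    if A[i][j]
)

print(degrees)      # [1, 4, 3, 1, 1, 1, 1]
print(B, M1, M2)    # 6 30 30
```

Iterating only over i < j counts each bond once, which is why no factor of 1/2 appears in the M2 sum.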
Randic branching index (1975)
\[ \chi_R = {}^{1}\chi = \sum_{b} (\delta_i \, \delta_j)_b^{-1/2} \]
(δi δj)^(-1/2) is called edge connectivity.

Topological descriptors
mean Randic branching index
\[ \bar{\chi}_R = \frac{\chi_R}{B} \]

Topological descriptors
atom connectivity indices of m-th order
\[ {}^{0}\chi = \sum_{a} \delta_a^{-1/2} \]
\[ {}^{1}\chi = \sum_{b} (\delta_i \, \delta_j)_b^{-1/2} \]
\[ {}^{2}\chi = \sum_{k=1}^{{}^{2}P} (\delta_i \, \delta_l \, \delta_j)_k^{-1/2} \]
\[ {}^{m}\chi_q = \sum_{k=1}^{K} \left( \prod_{a=1}^{n} \delta_a \right)_k^{-1/2} \]
K = mP: number of m-th order paths
q: subgraph type (Path, Cluster, Path/Cluster, Chain)
n = m for the Chain (Ring) subgraph type
n = m + 1 otherwise
- The immediate bonding environment of each atom is encoded by the subgraph weight.
- The number of terms in the sum depends on the molecular structure.
- The connectivity indices show a good capability of isomer discrimination and reflect some features of molecular branching.

Topological descriptors
valence connectivity indices of m-th order
\[ {}^{0}\chi^v = \sum_{a} (\delta_a^v)^{-1/2} \]
\[ {}^{1}\chi^v = \sum_{b} (\delta_i^v \, \delta_j^v)_b^{-1/2} \]
\[ {}^{2}\chi^v = \sum_{k=1}^{{}^{2}P} (\delta_i^v \, \delta_l^v \, \delta_j^v)_k^{-1/2} \]
\[ {}^{m}\chi_q^v = \sum_{k=1}^{K} \left( \prod_{a=1}^{n} \delta_a^v \right)_k^{-1/2} \]
They encode atom identities as well as the connectivities in the molecular graph.

Topological descriptors
Kier-Hall electronegativity
\[ X_{KH} = \delta_i^v - \delta_i \]
Kier-Hall relative electronegativity
\[ X_{KH} = \frac{\delta_i^v - \delta_i}{N^2} \]
- electronegativity of carbon sp3 taken as zero
- N: principal quantum number
Correlation with the Mulliken-Jaffe electronegativity:
\[ X_{MJ} = 1.99 \, (\delta_i^v - \delta_i) + 6.99 \]
\[ X_{MJ} = 7.99 \, \frac{\delta_i^v - \delta_i}{N^2} + 7.07 \]

Distance matrix
vertex distance matrix degree si
It is the row sum of the vertex distance matrix.
[Figure: example molecular graph with vertices numbered 1 to 7]
The distance dij between two vertices is the smallest number of edges between them.
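The distance definition above can be sketched with a breadth-first search from every vertex, followed by row sums to obtain the vertex distance degrees si. The edge list is an assumption: it transcribes the 7-vertex example graph used for the adjacency matrix earlier, and `distance_matrix` is a hypothetical helper name.

```python
from collections import deque

# Bonds of the 7-vertex example graph (assumed to be the same graph as in the
# adjacency-matrix example; vertices are numbered from 1).
EDGES = [(1, 2), (2, 3), (2, 5), (2, 7), (3, 4), (3, 6)]

def distance_matrix(n, edges):
    """Topological distance matrix of a connected graph with vertices 1..n."""
    neighbours = {v: [] for v in range(1, n + 1)}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)

    matrix = []
    for start in range(1, n + 1):
        dist = {start: 0}
        queue = deque([start])
        while queue:                      # breadth-first search from `start`
            v = queue.popleft()
            for w in neighbours[v]:
                if w not in dist:         # first visit gives the shortest path
                    dist[w] = dist[v] + 1
                    queue.append(w)
        # Assumes the graph is connected, so every vertex has been reached.
        matrix.append([dist[t] for t in range(1, n + 1)])
    return matrix

D = distance_matrix(7, EDGES)
s = [sum(row) for row in D]               # vertex distance degrees s_i
print(s)                                  # [13, 8, 9, 14, 13, 14, 13]
```

Breadth-first search is sufficient here because every edge has unit length; a weighted graph would need a different shortest-path method.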
      1   2   3   4   5   6   7    si   ηi
1     0   1   2   3   2   3   2    13    3
2     1   0   1   2   1   2   1     8    2
3     2   1   0   1   2   1   2     9    2
4     3   2   1   0   3   2   3    14    3
5     2   1   2   3   0   3   2    13    3
6     3   2   1   2   3   0   3    14    3
7     2   1   2   3   2   3   0    13    3
si is high for terminal vertices and low for central vertices.

Local vertex invariants
The eccentricity ηi of the i-th atom is the upper bound of the distance dij between the atom i and the other atoms j:
\[ \eta_i = \max_j d_{ij} \]

Topological descriptors
Petitjean shape index (1992)
A simple shape descriptor:
\[ I_{PJ} = \frac{D - R}{R} \]
where D is the topological diameter (maximum eccentricity) and R is the topological radius (minimum eccentricity).
IPJ = 0 for a strictly cyclic structure
IPJ = 1 for a strictly acyclic structure with an even diameter

Topological descriptors
Wiener index (1947)
\[ W = \frac{1}{2} \sum_{i=1}^{A} \sum_{j=1}^{A} d_{ij} \qquad \bar{W} = \frac{2W}{A(A - 1)} \]
dij: topological distances
- high values for big molecules and for linear molecules
- low values for small molecules and for branched or cyclic molecules
The average Wiener index is independent of the molecular size.

Topological descriptors
Balaban distance connectivity index (1982)
\[ J = \frac{B}{C + 1} \sum_{b} (s_i \, s_j)_b^{-0.5} \qquad C = B - A + 1 \]
\[ \bar{J} = \frac{B}{C + 1} \sum_{b} (\bar{s}_i \, \bar{s}_j)_b^{-0.5} \qquad \bar{s}_i = s_i / B \]
A: number of atoms
B: number of bonds
C: number of cycles
s_i: sum of the i-th row distances
\bar{s}_i: average sum of the i-th row distances
One of the most discriminant indices.

Edge descriptors
[Figure: example molecular graph with atoms numbered 1 to 7 and bonds labelled a to f]
Edge distance matrix, with row sums and row maxima:
      a   b   c   d   e   f    row sum   row max
a     0   1   2   1   2   1       7         2
b     1   0   1   1   1   1       5         1
c     2   1   0   2   1   2       8         2
d     1   1   2   0   2   1       7         2
e     2   1   1   2   0   2       8         2
f     1   1   2   1   2   0       7         2

Topographic descriptors
Some geometrical descriptors are derived from the corresponding topological descriptors by substituting the topological distances dst with the geometrical distances rst. They are called topographic descriptors.
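As a rough sketch of this substitution, the snippet below applies the same Wiener-type half-sum first to the topological distance matrix tabulated earlier and then to a matrix of Euclidean distances. The three coordinates are invented for illustration and do not describe a real molecule; `wiener` and `geometry_matrix` are hypothetical helper names.

```python
import math

# Topological distance matrix of the 7-vertex example graph (from the table above).
D = [
    [0, 1, 2, 3, 2, 3, 2],
    [1, 0, 1, 2, 1, 2, 1],
    [2, 1, 0, 1, 2, 1, 2],
    [3, 2, 1, 0, 3, 2, 3],
    [2, 1, 2, 3, 0, 3, 2],
    [3, 2, 1, 2, 3, 0, 3],
    [2, 1, 2, 3, 2, 3, 0],
]

def wiener(dist):
    """Half the sum of all entries of a symmetric distance matrix."""
    return sum(sum(row) for row in dist) / 2

def geometry_matrix(coords):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in coords] for p in coords]

A = len(D)
W = wiener(D)
W_avg = 2 * W / (A * (A - 1))        # average Wiener index
print(W, W_avg)                      # 42.0 2.0

# Hypothetical 3-atom fragment: made-up coordinates in arbitrary length units.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
G = geometry_matrix(coords)
W_3d = wiener(G)                     # same half-sum, geometric distances
print(W_3d)                          # 12.0 (pair distances 3, 4 and 5)
```

Only the distance matrix changes between the two calls; the index definition itself is untouched, which is the point of the topographic construction.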
For example, the 3D-Wiener index:
\[ {}^{3D}W = \frac{1}{2} \sum_{i=1}^{A} \sum_{j=1}^{A} r_{ij} \]

Molecular geometry
The geometry matrix G (or geometric distance matrix) is a square symmetric matrix whose entry rst is the geometric distance, calculated as the Euclidean distance between the atoms s and t:
\[ G = \begin{pmatrix} 0 & r_{12} & \cdots & r_{1A} \\ r_{21} & 0 & \cdots & r_{2A} \\ \vdots & \vdots & \ddots & \vdots \\ r_{A1} & r_{A2} & \cdots & 0 \end{pmatrix} \]

Hansch approach
Hansch molecular descriptors cover lipophilic properties, electronic properties, and steric properties. Descriptors listed: partition coefficients (logP, logKow), Hammett constants, molecular weight, molar refraction, VDW volume, chromatographic parameters (Rf, RT), dipole moment, molar volume, solubility, HOMO, LUMO, surface area, ionization potential, ...

Website: michem.disat.unimib.it/chm/
THANK YOU