An Introduction to QSAR
Dr. Bahram Hemmateenejad
Chemistry Department, Shiraz University

Computer-Aided Molecular Design (CAMD)
Computer-Aided Ligand Design (CALD)
Computer-Aided Drug Design (CADD)

Approaches in CAMD
- Receptor-based design
  - Known receptor
  - Protein binding site
  - Supramolecular host
- Ligand-based design
  - Known set of ligands
  - Unknown receptor

Receptor-Based Design
- Build or find the key that fits the lock

Receptor-Based Design (continued)
- Docking
- Interaction energy
- Molecular alignment
- Pharmacophore modeling

Ligand-Based Design
- Quantitative structure-activity relationship (QSAR)
- Quantitative structure-property relationship (QSPR)
- Quantitative structure-toxicity relationship (QSTR)
- Quantitative structure-retention relationship (QSRR)
- Quantitative structure-migration relationship (QSMR)
- Quantitative structure-electrochemistry relationship (QSER)
- Quantitative structure-function relationship (QSFR)

QSAR/QSPR Definition
Prediction of the biological activities or chemical properties of organic compounds from their molecular structures using mathematical equations:
Y = f(Xi)
where Y is the observed biological activity and Xi are the molecular descriptors.

Ligand-Based Molecular Design
- Infer the binding pocket
- Prediction

QSAR
- What to achieve: estimate the value of unknown physical, chemical or biological properties of compounds from known or computationally accessible properties
- How to achieve: determine the value of the response variable y as a function of the descriptor variables xi
  yik = F(Ai, xk) + Bik

What do we need for QSAR models?
- A dataset of compounds with known biological activity
- Descriptors that are accessible for all members of the dataset
- Algorithms for the development of a QSAR model
- A validation protocol for the evaluation of the model

Requirements for QSAR datasets
Compounds should
- belong to a congeneric series (more important in 2D)
- have the same mechanism of action
- have a comparable binding mode
- have a biological activity that correlates with binding affinity

Requirements for QSAR datasets (continued)
Compounds should have a quantified (numerical) biological response
- measured in the same organism/tissue/cell/protein
- using the same type of measurement (binding/functional/IC50/Ki etc.)
- using the same protocol (radioligand, activator, cofactor, pH, buffer etc.)

QSAR Origin
Linear Free-Energy Relationships (LFER)
Hammett equation: log (K/K0) = ρ σ

Free-Wilson analysis
log 1/C = Σ ai + μ
ai = contributions of the substituents (R1, R2, etc.)
μ = activity contribution of the reference compound
[Figure: parent scaffold with substituent positions R1, R2 and R3]

Free-Wilson analysis (continued)
[Table: Free-Wilson indicator matrix (0/1) for compounds 1-5, with substituents Cl, OH and Me at positions R1, R2 and R3]

Hansch Analysis (official birth of QSAR)
log 1/C = a (log P)² + b log P + c σ + ... + k
C = concentration producing the biological effect
P = partition coefficient
σ = electronic Hammett constant

Linear Hansch model
log 1/C = a log P + b σ + c MR + ... + k

Nonlinear Hansch models
log 1/C = a (log P)² + b log P + c σ + ... + k
log 1/C = a log P - b log (βP + 1) + c σ + ...
+ k

Mixed Hansch/Free-Wilson model
log 1/C = a (log P)² + b log P + c σ + ... + Σ ai + k

Complementarity principles in binding molecules to macromolecular targets

Interaction      Property           Descriptor
Steric           Topology           Distance, volume, surface
Electrostatic    Electron density   σ, partial charges, quantum chemical
Hydrophobic      Lipophilicity      log P, π
van der Waals    Polarizability     MR, parachor

Descriptor types for QSAR
- Substituent variables: property of the substituents only
- Molecule variables: property of the whole molecule
- Interaction variables: property of a given interaction

Descriptors for QSAR
- Constitutional: MW, N(heteroatoms), N(atoms)
- Topological: connectivity, Wiener index, E-state indices
- Electrostatic: polarity, dipole moments, partial charges
- Geometrical: distances, molecular volume, PSA
- Quantum chemical:
  - HOMO and LUMO energies
  - Vibrational frequencies
  - Bond orders
  - Energies, enthalpies, entropy

Descriptors for QSAR: 3D descriptors
- MEP: Molecular Electrostatic Potential
- MLP: Molecular Lipophilicity Potential
- GRID: total energy of interaction, the sum of steric (Lennard-Jones), H-bonding and electrostatic terms
- CoMFA: standard fields are steric and electrostatic; additional fields include H-bonding, indicator, parabolic and others

Conditions for applicability of QSAR
- Selection of compounds
  - The same mechanism of action
  - Homogeneity
  - Representativity
  - Experimental design
- Biological data
  - High quality and reliable
  - Same protocol and same laboratory
  - Known level of experimental error

Conditions for applicability of QSAR (continued)
- Type of data
  - Continuous
  - Discrete
- Data scaling and transformation
  - Logarithmic transformation
  - Normalization

Conditions for applicability of QSAR (continued)
- Descriptors
  - As meaningful as possible
  - Interpretable
  - Calculation simplicity
  - Calculation uncertainty
  - Software reliability

QSAR Steps 1.
Formulation of classes of similar compounds
- Ideal situation: classes of chemically and biologically similar compounds
- All the compounds should be structurally similar and act by the same mode of action
- Compounds must be dissimilar enough to cause some systematic change in biological activity

QSAR Steps 2.
Quantitative description of structural variations (descriptor calculation)
- Usually several descriptors are required
- It is difficult to predict which descriptors will be useful
- It is convenient to have a set of independent descriptors

QSAR Steps 3.
Selection of training set compounds
- Training set: used to optimize and develop the model
  - Calibration set: used to calculate the model coefficients
  - Validation set: used to validate the constructed model
- External test set
  - Makes no contribution to the model development step
  - Measures the overall prediction ability of the proposed model
- Selection criteria
  - Random
  - Experimental design
  - Classification methods
    - PCA
    - SIMCA
    - Classification and regression tree (CART)
    - …

Example of a PCA-based selection method

QSAR Steps 4.
QSAR model development (data analysis)
- Regression method
- Variable selection method

Regression methods
- Linear regression
  - Preferred for simplicity and ease of calculation
  - More descriptive
- Nonlinear regression
  - Usually complex
  - Higher prediction ability
  - High risk of over-fitting

Linear regression
- Multiple linear regression (MLR)
  - The simplest and most widely used method
  - More interpretable
  - Limitations: collinearity, and the number of variables considered in the model
- Factor analysis based methods
  - Principal component regression (PCR)
  - Partial least squares (PLS)

PCR and PLS
- Both overcome collinearity by producing orthogonal variables
- PLS lies on a continuum between MLR and PCR
- PLS is more predictive
- PCR is more descriptive
- PLS generates latent variables
- Two-step model building
  - Variable selection
  - Factor selection
- Higher risk of over-fitting than MLR

Nonlinear regression
- Artificial neural networks (ANN)
  - Feed-forward
  - Counter propagation
  - Kohonen networks
  - Wavelet neural network
  - Neuro-fuzzy
- Nonlinear PCR and PLS
  - Quadratic PCR or PLS
  - PC-ANN
  - PLS-ANN
- Support vector machine (SVM)
- …

Variable selection
- Search strategy: searching different subsets of descriptors
- Scoring function: evaluating the performance of each variable combination
- Regression methods are used for scoring
- Variable selection is always coupled with a regression method

Variable selection (continued)
- Feature selection (different variable selection methods)
  - Stepwise
  - Genetic algorithm
  - Ant colony optimization
  - …
- Feature extraction
  - PCA scores
  - Kohonen scores
  - SVM scores
  - Wavelet coefficients
- Combined feature selection-feature extraction

QSAR Steps 5.
Model validation
- The essential part of a QSAR study
- Internal validation
  - Cross-validation
- External validation

Some advice
QSAR models should be
- Simple
- Transparent
- Mechanistically comprehensible
- Not over-fitted
Use as few variables as possible

Some advice (continued)
A QSAR model should
- Be associated with a biological end point
- Take the form of an unambiguous and easily applicable algorithm
- Ideally have a clear mechanistic basis
- Be accompanied by a definition of its applicability domain
- Be associated with a measure of goodness of fit
- Be assessed in terms of its predictive capacity

The last advice
Use experimental design and QSAR to increase the rate of proposing new compounds (for medicinal chemists or drug designers)
[Figure: good versus poor structural diversity of example compounds in a descriptor space with axes molecular volume, rotatable bonds and dipole]
[Figure: workflow Synthesis -> Biological testing -> QSPR model, with a plot of predicted value versus actual value]

Models are not real, but sometimes they are helpful
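
Worked sketch of steps 4 and 5
Model development and validation can be illustrated in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used in the lecture. It assumes NumPy is available and uses synthetic, made-up log P and σ values. It fits the parabolic Hansch model log 1/C = a (log P)² + b log P + c σ + k by multiple linear regression, then reports the fit R², a leave-one-out cross-validated q² (internal validation) and R² on a held-out external test set. The helper names (fit_mlr, predict_mlr, r_squared, loo_q2), the "true" coefficients and the noise level are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [k, coef_1, coef_2, ...]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Predicted response for descriptor matrix X."""
    return coef[0] + X @ coef[1:]

def r_squared(y, y_hat):
    """Coefficient of determination relative to the mean of y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def loo_q2(X, y):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SS_total."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i              # leave compound i out
        coef = fit_mlr(X[keep], y[keep])      # refit on the remaining compounds
        press += (y[i] - predict_mlr(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic data (made up for illustration, NOT measured activities).
rng = np.random.default_rng(0)
n = 30
log_p = rng.uniform(0.0, 5.0, n)              # hypothetical log P values
sigma = rng.uniform(-0.5, 0.8, n)             # hypothetical Hammett sigma values
X = np.column_stack([log_p ** 2, log_p, sigma])
true_coef = np.array([1.0, -0.3, 1.5, 0.8])   # assumed k, a, b, c
y = predict_mlr(true_coef, X) + rng.normal(0.0, 0.1, n)  # simulated log 1/C

# Every fifth compound is held out as the external test set.
train = np.arange(n) % 5 != 0
coef = fit_mlr(X[train], y[train])

r2_fit = r_squared(y[train], predict_mlr(coef, X[train]))
q2 = loo_q2(X[train], y[train])
r2_ext = r_squared(y[~train], predict_mlr(coef, X[~train]))

print("k, a, b, c =", np.round(coef, 3))
print("R2 (fit) =", round(r2_fit, 3))
print("q2 (LOO) =", round(q2, 3))
print("R2 (external) =", round(r2_ext, 3))
```

For a least-squares model the leave-one-out q² can never exceed the fit R², so a large gap between the two is a simple warning sign of over-fitting. The external test set plays no part in fitting and is used only for the final check.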