Master Chemoinfo • Virtual screening
Alexandre Varnek
Faculté de Chimie, ULP, Strasbourg, FRANCE

Virtual screening and high-throughput screening
[Diagram: for a target protein, large libraries of molecules pass through computational virtual screening (filtering, QSAR, docking) to give a small library of selected hits, which goes on to experimental high-throughput screening to yield a hit]

Chemical universe:
• 10^200 molecules
• 10^60 druglike molecules
Virtual screening must be fast and reliable.

Molecules are considered as vectors in a multidimensional chemical space defined by the descriptors.

High-throughput screening
Genomics → Target → HTS (high-throughput screening) → Hits → Data analysis → Lead → Optimisation → Development candidate

Drug discovery and ADME/Tox studies should be performed in parallel
idea → target → combichem/HTS → hit → lead → candidate → drug
[Diagram: ADME/Tox studies shown alongside this pipeline]

Methodologies of a virtual screening
[Figure from A.R. Leach, V.J. Gillet, “An Introduction to Chemoinformatics”, Kluwer Academic Publishers, 2003]

Platform for Ligand Based Virtual Screening
~10^6 to 10^9 molecules → filters, similarity search → ~10^3 to 10^4 molecules → QSAR models → candidates for docking or experimental tests

High-throughput screening (HTS)
Keywords:
- Combinatorial chemistry
- High-throughput screening (HTS)
- Virtual screening
- Drug-likeness
- Training sets of up to 1,000,000 compounds

Virtual Screening: molecules available for screening
(1) Real molecules
• 1 to 2 million in the in-house archives of large pharma and agrochemical companies
• 3 to 4 million samples available commercially
(2) Hypothetical molecules
• Virtual combinatorial libraries (up to 10^60 molecules)

Methods of virtual High-Throughput Screening
• Filters
• Similarity search
• Classification and regression structure-property models
• Docking

Filters to estimate “drug-likeness”
Lipinski rules for intestinal absorption (« Rules of 5 »)
• H-bond donors < 5 (the sum of OH and NH groups)
• MWT < 500
• LogP < 5
• H-bond acceptors < 10 (the sum of N and O atoms without H attached)
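The Rule of 5 filter above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the descriptor values are taken as precomputed inputs (in practice they would come from a cheminformatics toolkit), and the names `Descriptors`, `rule_of_five_violations`, `passes_filter` and the two "hypothetical" molecules are invented for the example. The strict inequalities follow the slide. Lipinski's original formulation counts a violation only when a value exceeds the limit, so `>=` may be swapped for `>` if that convention is preferred.

```python
from dataclasses import dataclass

# Upper bounds as written on the slide (a property passes if value < limit).
RULE_OF_FIVE_LIMITS = {
    "h_donors": 5,       # sum of OH and NH groups
    "mol_weight": 500.0,
    "log_p": 5.0,
    "h_acceptors": 10,   # sum of N and O atoms
}


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    name: str
    h_donors: int
    mol_weight: float
    log_p: float
    h_acceptors: int


def rule_of_five_violations(mol):
    """Return the names of the properties that fail the slide's thresholds."""
    return [
        prop
        for prop, limit in RULE_OF_FIVE_LIMITS.items()
        if getattr(mol, prop) >= limit  # assumption: strict "<" as on the slide
    ]


def passes_filter(mol, max_violations=0):
    """Keep a molecule if it has at most `max_violations` violations."""
    return len(rule_of_five_violations(mol)) <= max_violations


# Two made-up library entries, plus the paclitaxel values quoted in this deck.
library = [
    Descriptors("hypothetical_A", 2, 320.4, 2.1, 5),
    Descriptors("hypothetical_B", 6, 612.7, 5.8, 12),
]
hits = [m.name for m in library if passes_filter(m)]
print(hits)  # ['hypothetical_A']

paclitaxel = Descriptors("paclitaxel", 3, 837.0, 4.49, 15)
print(rule_of_five_violations(paclitaxel))  # ['mol_weight', 'h_acceptors']
```

Setting `max_violations=1` gives a more permissive filter. Whether to tolerate one violation is a design choice that the slides do not specify.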
Lipinski rules for drug-like molecules (« Rules of 5 »)

Example of different filters: rules for absorbable compounds

Property     Lipinski   Veber    AB/HIA
Mol. W.      < 500      < 770    < 1,000
Log P        < 5        < 9      < 10
H-Don.       < 5        ---      < 6
H-Acc.       < 10       ---      < 19
H-D + H-A    ---        < 12     < 22
Rot-Bonds    ---        < 10     < 19
tPSA         ---        < 140    < 291

Other filters
• Remove compounds containing too many rings
• Remove compounds with toxic groups
• Remove compounds with reactive groups
• Remove false-positive hits
• Remove poorly soluble compounds
• Filter on inorganic and heteroatom compounds
• Remove compounds with multiple chiral centers

Paclitaxel (Taxol): violation of 2 rules
• MW = 837
• logP = 4.49
• HD = 3
• HA = 15
[Structure: paclitaxel]

logD vs logP
• 95% of all drugs are ionizable: 75% are bases and 20% are acids.
• Using the pH-dependent log D in place of log P as the lipophilicity descriptor significantly increases the number of compounds correctly identified as drug-like by the drug-likeness filter: log D5.5 < 5
S. K. Bhal, K. Kassam, I. G. Peirson, and G. M. Pearl, “The Rule of Five Revisited: Applying Log D in Place of Log P in Drug-Likeness Filters”, Molecular Pharmaceutics, v. 4, 556-560 (2007)

Synthetic Accessibility
Synthetic accessibility is estimated from how often a molecule's fragments occur in the PubChem database.
Ertl and Schuffenhauer, Journal of Cheminformatics 2009, 1:8

Synthetic Accessibility: frequency distribution of fragments
Altogether 605,864 different fragment types were obtained by fragmenting the PubChem structures. Most of them (51%), however, are singletons (present only once in the whole set). Only a relatively small number of fragments, namely 3759 (0.62%), are frequent (i.e. present more than 1000 times in the database).

Synthetic Accessibility
The most common fragments present in the million PubChem molecules.
In the fragment drawings, “A” represents any non-hydrogen atom, a dashed double bond indicates an aromatic bond, and the yellow circle marks the central atom of the fragment.

Synthetic Accessibility
• Distribution of (-SAscore) for natural products, bioactive molecules and molecules from catalogues.
• Correlation of calculated (-SAscore) and average chemist estimation for 40 molecules (r2 = 0.890).
Ertl and Schuffenhauer, Journal of Cheminformatics 2009, 1:8

Similarity Search: unsupervised and supervised approaches

2D (unsupervised) Similarity Search
[Structures: the two molecules being compared]
Molecular fingerprints:
1010001001110110101
0010001001110110101
Tanimoto coefficient: $T = \frac{N_{A\&B}}{N_A + N_B - N_{A\&B}}$; for the pair of molecules shown, T = 0.80

Continuous and Discontinuous SAR
Structural Spectrum of Thrombin Inhibitors
[Figure: structural similarity “fading away” from the reference compounds; similarity values shown: 0.56, 0.72, 0.53, 0.84, 0.67, 0.52, 0.82, 0.64, 0.39]

Structure-Activity Landscape Index
• Continuous SARs: gradual changes in structure result in moderate changes in activity; “rolling hills” (G. Maggiora)
• Discontinuous SARs: small changes in structure have dramatic effects on activity; “cliffs” in activity landscapes
$\mathrm{SALI}_{ij} = \Delta A_{ij} / \Delta S_{ij}$
where $\Delta A_{ij}$ ($\Delta S_{ij}$) is the difference between the activities (similarities) of molecules i and j.
R. Guha et al., J. Chem. Inf. Model., 2008, 48, 646

VEGFR-2 tyrosine kinase inhibitors: discontinuous SARs
• Reference compound: 6 nM
• Analog: 2390 nM
• MACCS Tc: 1.00
• Bad news for molecular similarity analysis, lead optimization and QSAR.

Example of a “Classical” Discontinuous SAR
Any similarity method must recognize these compounds as being “similar” ...
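As a minimal sketch of the two quantities introduced on the preceding slides, the code below computes the Tanimoto coefficient of two bit-string fingerprints and a SALI value for a pair of molecules. It makes two assumptions. First, fingerprints are plain strings of "0" and "1" of equal length. Second, the similarity difference is taken as ΔS = 1 - similarity, the form used in the cited paper by Guha et al. The function names are made up for the example.

```python
import math


def tanimoto(fp_a: str, fp_b: str) -> float:
    """Tanimoto coefficient T = N_AB / (N_A + N_B - N_AB) for two bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    n_a = fp_a.count("1")                       # bits set in A
    n_b = fp_b.count("1")                       # bits set in B
    n_ab = sum(1 for a, b in zip(fp_a, fp_b) if a == b == "1")  # bits set in both
    denominator = n_a + n_b - n_ab
    # Convention chosen here: two all-zero fingerprints get similarity 0.
    return n_ab / denominator if denominator else 0.0


def sali(activity_i: float, activity_j: float, similarity: float) -> float:
    """SALI = |A_i - A_j| / (1 - sim_ij); infinite for identical fingerprints."""
    if similarity >= 1.0:
        return math.inf
    return abs(activity_i - activity_j) / (1.0 - similarity)


# The two example bit strings shown on the fingerprint slide.
fp_1 = "1010001001110110101"
fp_2 = "0010001001110110101"
print(tanimoto(fp_1, fp_2))  # 0.9

# Hypothetical pair: activities on a log scale (e.g. pIC50) and similarity 0.5.
print(sali(8.0, 6.0, 0.5))   # 4.0
```

The two bit strings printed on the slide give 0.9 rather than the quoted 0.80, which presumably refers to the full fingerprints of the molecules drawn. With a MACCS Tanimoto similarity of 1.00, as in the VEGFR-2 example, this form of SALI diverges, the limiting case of an activity cliff.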
(MACCS Tanimoto similarity)
[Figure: adenosine deaminase inhibitors]

Supervised Molecular Similarity Analysis

Dynamic Mapping of Consensus Positions
• Prototypic “mapping algorithm” for simplified binary-transformed* descriptor spaces
• Uses known active compounds to create activity-dependent consensus positions in chemical space
• Operates in descriptor spaces of step-wise increasing dimensionality (“dimension extension”)
• Selects preferred descriptors from large pools
* median-based, i.e. assign “1” to a descriptor if its value is greater than or equal to its screening database median; assign “0” if it is smaller
Godden et al. & Bajorath, J Chem Inf Comput Sci 44, 21 (2004)

DMC Algorithm
• Calculate and binary-transform descriptors
• Compare descriptor bit strings of reference molecules and determine consensus bits
• Calculate consensus bit string: = 1.0 or = 0.0 (no variability)
• Select DB compounds matching consensus bits
• Re-generate bit strings permitting bit variability
  1. Dimension extension: 0.9 or 0.1 (10% variability)
  2. Dimension extension: 0.8 or 0.2 (20% variability)
• Select DB compounds matching extended bit strings
• Repeat until a small selection set is obtained
[Figure: descriptor bit strings for reference molecules at dimension-extension levels 0, 1, 2; white “0”, black “1”, gray variably set bits]
e.g.
0%, 10%, 20% permitted bit variability: longer bit strings, fewer matching DB compounds

QSAR/QSPR models: screening and hits selection
[Diagram: Database → Virtual screening with a QSPR model → Hits → Experimental tests; useless compounds are discarded]

Libraries profiling: indexing a database by simultaneous assessment of various activities
Example: PASS software (Prediction of Activity Spectra for Substances)
• PASS uses a Naïve Bayes estimator
• [Formula: for each fragment i, a weight w_i expressed through act_i and inact_i]
• Calculation of « P(act) » and « P(inact) »
• A molecule is considered active if P(act) > P(inact) and/or P(act) > 0.7

Quantitative Structure-Property Relationships (QSPR)
Y = f(Structure) = f(descriptors)
• QSPR models give reliable predictions only for compounds similar to those used to build the models.
• Similarity / pharmacophore search approaches therefore remain necessary as complementary tools.

Combinatorial Library Design

Virtual Screening ... when target structure is unknown
[Diagram labels: Virtual library, Screening library, Diverse subset, HTS, Hits, Design of focused library, Parallel synthesis or synthesis of single compounds, Screening]

Generation of Virtual Combinatorial Libraries: fragment marking approach
[Figure: a Markush structure with variable groups R1, R2, R3 and the structures enumerated from it for given R1, R2, R3]

The types of variation in Markush structures:
1. Substituent variation (R1): R1 = Me, Et, Pr
2. Position variation (R2): R2 = NH2
3. Frequency variation: (CH2)n, n = 1 to 3
4. Homology variation (R3): R3 = alkyl or heterocycle (only for patent search)

Generation of Virtual Combinatorial Libraries: reaction transform approach
from A.R. Leach, V.J.
Gillet, “An Introduction to Chemoinformatics”, Kluwer Academic Publishers, 2003

Issues and Concepts in Combinatorial Library Design
• Size of the library
• Coverage of properties (“chemical space”)
• Diversity, similarity, redundancy
• Descriptor validation
• Subset selection from virtual libraries

Hot topics in chemoinformatics
• Predictions vs interpretation
• New approaches in structure-property modeling:
  - descriptors
  - applicability domain
  - machine-learning methods (inductive learning transfer, semi-supervised learning, ...)
• New techniques to mine chemical reactions
• QSAR of complex systems:
  - multi-component synergistic mixtures, new materials, metabolic pathways, ...
• Public availability of chemoinformatics tools

Predictions vs interpretation
[Figure from Nathan Brown, “Chemoinformatics: An Introduction for Computer Scientists”, ACM Computing Surveys, Vol. 41, No. 2, Article 8, February 2009]

Predictions vs interpretation
Problems:
• Ensemble modeling
• Non-linear machine-learning methods (SVM, NN, …)
• Descriptor correlations
What do end users expect from QSAR models?
• Reliable estimation (prediction) of the given property.

Public accessibility of models: WEB-based platform for virtual screening
[Screenshots: welcome page and further pages of the web server]

ISIDA property prediction WEB server and ISIDA ScreenDB tools
http://infochim.u-strasbg.fr/webserv/VSEngine.html
- Only an Internet browser is required
- Different descriptors (ISIDA fragments, FPT, ChemAxon)
- Similarity search with different metrics (Tanimoto, Dice, …)
- Ensemble modeling approach (simultaneous application of several models)
- Models' applicability domain (automatic detection of useless models)

“The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties.”
George S. Hammond, Norris Award Lecture, 1968