Molecular Similarity and Chemical Families: The Homogeneity Approach
C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck
11th April, 2000
2nd Sheffield Chemoinformatics Conference, Sheffield, UK
Bioreason, Inc

Presentation Outline
- Introduction
- Molecular similarity
- Observations on chemical data
- Analyzing screening data
- Using a traditional approach
- The Homogeneity Approach
- Definitions
- Implementation and experimental results
- Conclusions

Molecular Similarity
- Widely used throughout the drug discovery process
- Sample applications:
  • Assessing the diversity of a chemical dataset
  • Picking a representative dataset from a compound library
  • Given a compound and a compound library, identifying the subset of similar compounds
  • Analyzing screening data
    Major step: organizing screening data into chemical families

Typical Drug Discovery Process
[Figure: flowchart with the labels Library, Assay, *Screening*, Data, *Data Analysis*, Further exploration, Start Chemistry, Drug Candidates]

Technology Employed
- Compound representation methods
  • Fingerprints/bit vectors, graph-based, ...
  • 2D keys vs. 3D keys, fragment vs. distance based, ...
- Similarity and distance measures
  • Tanimoto, Euclidean, …, graph-based, ...
- Clustering methods
- Classification methods
- Substructure searching/(sub)graph matching
- ...

Analyzing Chemical Compounds (1)
Dictionary of Keys
[Figure: an example molecule matched against a dictionary of keys (N-N, Q-QH, Q-C(-N)-C, CH3-A-CH3, Q-N, N-A-A-O, N-C-O, O not % A % A, N-A-O, Q-Q, QH > 1, CH3 > 1, N > 1, NH, ...), giving the bit vector 10111000001...]

Analyzing Chemical Compounds (2)
- Compounds are multi-domain:
  • multiple occurrences of a key/substructure
  • members of more than one chemical family

Analyzing Chemical Compounds (3)
- Information loss!
  • E.g. “How” a key hits?

Dataset Used
- Derived from the NCI anti-HIV program
- Latest release, Oct.
  99, 43,382 compounds
- Cell based, EC50 (effective concentration at which the test compound protects the cells by 50%)
- Pre-processing:
  • Molecular weight <= 500
  • Multiple EC50 values for compounds; kept highest concentration
  • 33,245 compounds left
- Activities: converted from molar concentrations to -log units
- Activity threshold used: 5.5
- Training set size (actives): 503

Analyzing Screening Data
Typical Approach
- Goal: Data Reduction
  • To manageable size
  • Organized fashion
  • With minimal information loss
- Represent molecules as vectors, often binary
- Similarity/distance measure
- Clustering algorithm
- Metacluster selection method (e.g. cluster level selection methods for hierarchical clustering)

Hierarchical Agglomerative Clustering Method
- NCI - HIV dataset
  • 503-compound subset based on activity
- Clustered using Ward's method, Euclidean distance, and bit vectors obtained by applying MACCS-like keys
- Cluster level selection using the Kelley method
- Results:
  • 70 (meta)clusters
  • Complete coverage of the dataset, no singletons!
  • Average metacluster size: 7.2 compounds

Method Evaluation - Chemists
- Results validation by comparing to known truth:
  • Some known chemical families were detected, e.g. AZTs, pyrimidine nucleosides, ...
  • Smaller, less well-represented families not always detected, e.g. stilbenes, ...
- Results validation by assessing their quality:
  • On average chemists approved only 20-30 of the 70 clusters as chemical families of related compounds
  • The remaining clusters (~2/3) were difficult to interpret:
      - Compounds that shouldn’t be in some clusters
      - Compounds that should have been in some clusters (misclassified or not)
      - Clusters that were made of dissimilar/diverse compounds
  • Experts were puzzled by the absence of singletons

Method Evaluation - Computational
- Analyzed 70 groups of compounds:
  • Simple method: average nearest neighbor distance within a set of compounds
  • Distance computed using the bit vectors of the compounds
- 43/70: pretty low average nearest neighbor distance
- 22/70: moderate average nearest neighbor distance
- 5/70: quite high average nearest neighbor distance
- Overall, most of the groups had low diversity; expected, since the metaclusters were built using bit vectors

The problem
- Confusing?
  • Method functioned just right from a computational perspective
  • But the results were not as satisfying to the human expert
- Clustering results often don’t:
  • match expectations
  • make chemical sense
- Why?
  • Clustering is performed on molecular representations, often based on small keys, not on the molecules themselves
  • No chemical “common sense” influence on the clustering process

The road ahead… (1)
- What is the end goal of screening data analysis?
  • Finding the chemical families of interest, i.e. those that exhibit favorable biological characteristics
- How are we attempting to do it?
  • Clustering and classification methods using vector encoding representations of molecules
- But clustering only gives groups of compounds that have similar vector representations, and a successful classification session requires that one knows the chemical families of interest a priori.

The road ahead… (2)
- So, what do we do now that we are aware of the loose coupling between clusters obtained traditionally and human experts’ expectations?
  • Discover what the experts want
  • Adapt our process to match results and expectations

Definitions
- Chemical family:
  • A set of highly similar compounds sharing a common scaffold; else a set of compounds with high homogeneity
- Homogeneity:
  • High structural similarity
  • Based not only on similarity of molecular vectors but also on the presence of a significant common scaffold
- Scaffold:
  • A substructure defined as a specific configuration of atom types and bond types

Processing traditional method results
- Processing the results of traditional methods:
  • Easier to do than a complete re-design/reimplementation
  • Will “remove” results not chemically sensible
  • Will make life easier for human analysts by allowing them to focus on easily recognizable and interpretable pieces of knowledge
- Approach:
  • Compute and use structural homogeneity on results of traditional methods.
  • Basically construct “chemically sensible” methods for selecting the important compound groups

Identifying Scaffolds
- Maximum Common Substructure (MCS) extraction:
  • Using extremely fast and efficient own implementations
- Highlights of analysis:
  • 7 out of 70 compound sets: common scaffold size < 2!
  • 5 MCSs appeared multiple times
      - Range: 2-6, mostly benzene rings
  • A total of 53 different scaffolds
  • MCS size: Ranged from less than 2 atoms to greater than 14 atoms

Introducing Homogeneity
Clusters Homogeneity:
- Fingerprint Homogeneity:
  • Overall quite good average nearest neighbor distance
- Structural Homogeneity:
  • Used: # of atoms in MCS / avg.
    # of atoms in set molecules
- Structural Homogeneity Threshold: 1/3
  • MCS covering at least a third of the average molecule size
- Results:
  • 23/70 clusters below threshold
  • 47 above threshold

Method Assessment (1)
- Results were used to assign priority to clusters:
  • Low Priority - low likelihood of chemical sense:
      - clusters with small scaffolds, low structural homogeneity
      - clusters with insignificant scaffolds, low-to-moderate structural homogeneity
  • High Priority - high likelihood of chemical sense:
      - well defined clusters, with high structural homogeneity and big, significant scaffolds
- Approach did make life easier for human analysts
  • Ability to find important information faster

Method Assessment (2)
- Prioritization assessment:
  • The 23 non-structurally homogeneous clusters were uninteresting to chemists.
  • The 47 structurally homogeneous clusters included all those (20-30) approved before by chemists as chemical families.
- However, experts complained about:
  • Low information content of the clustering process results
      - Too many clusters, too little knowledge
  • The amount of information never found!
      - High priority clusters contained only 2/3 of compounds analyzed!
      - Clusters approved as chemical families from which knowledge could be derived easily contained only 1/3 of the compounds!!!
      - Known knowledge never found.

The road ahead… (3)
- Do traditionally obtained clusters relate to chemical families?
- Do we need a different approach?
  • Introduce chemically “aware” methods
      - No simple clustering methods
      - Take into account structural homogeneity
      - Accommodate multi-domain nature of molecules
  • Present results in a format that facilitates interpretation and knowledge discovery by chemists

A different approach: Can it work?
- Have been working on “chemically aware” screening data analysis methods
Same dataset results with a typical Bioreason analysis:
- 102 classes, all with high structural homogeneity
  • All classes were easy to interpret
  • Only 10% of classes not interesting to chemists (~50 compounds)
- 47 singletons (~10% of dataset)
- Information content much higher than traditional approach
  • 90% of compounds placed in homogeneous clusters (vs. 66% in traditional method)
  • 80% of compounds placed in clusters approved as structural families (vs. 34% in traditional method)
- Multi-domain nature is accommodated

Conclusions
- Molecular fingerprint similarity does not supply a certain indication of high structural molecular similarity
- Most traditional chemical data analysis methods make heavy use of molecular fingerprint similarity
- As a consequence, relations, including clusters, obtained via traditional methods often don’t make chemical sense
- Structural Homogeneity may be employed to enable formation of clusters and identification of chemical relations closer to chemists’ expectations

Acknowledgements
- Patricia Bacha
- Bobi Den Hartog

Info: nicolaou@bioreason.com
www.bioreason.com
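
Appendix sketch: the two homogeneity measures
The slides "Method Evaluation - Computational" and "Introducing Homogeneity" describe two per-cluster measures: the average nearest neighbor distance over bit vectors, and the ratio of MCS atoms to the average number of atoms per molecule, compared against a threshold of 1/3. The Python below is only a minimal sketch of those two calculations, not the authors' implementation. Assumptions: the function names are hypothetical, Euclidean distance is used on the bit vectors because it is the distance named for the clustering step (the slides do not say which distance was used for the evaluation), and the MCS atom count is taken as a given input rather than computed.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[int], b: Sequence[int]) -> float:
    """Euclidean distance between two equal-length bit vectors."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def avg_nearest_neighbor_distance(fps: Sequence[Sequence[int]]) -> float:
    """Fingerprint homogeneity: mean distance from each compound to its
    closest other compound in the same cluster (lower = more homogeneous)."""
    if len(fps) < 2:
        raise ValueError("need at least two compounds")
    total = 0.0
    for i, fp in enumerate(fps):
        total += min(euclidean(fp, other)
                     for j, other in enumerate(fps) if j != i)
    return total / len(fps)


def structural_homogeneity(mcs_atoms: int, atom_counts: Sequence[int]) -> float:
    """Structural homogeneity: # of atoms in the MCS divided by the
    average # of atoms of the molecules in the set."""
    if not atom_counts:
        raise ValueError("need at least one molecule")
    return mcs_atoms / (sum(atom_counts) / len(atom_counts))


# Threshold from the slides: the MCS must cover at least a third
# of the average molecule size.
STRUCTURAL_THRESHOLD = 1 / 3


def is_structurally_homogeneous(mcs_atoms: int,
                                atom_counts: Sequence[int],
                                threshold: float = STRUCTURAL_THRESHOLD) -> bool:
    """True when the cluster meets the structural homogeneity threshold."""
    return structural_homogeneity(mcs_atoms, atom_counts) >= threshold


if __name__ == "__main__":
    # Toy cluster: three made-up 4-bit fingerprints and atom counts.
    cluster_fps = [[1, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 0]]
    print(avg_nearest_neighbor_distance(cluster_fps))
    print(is_structurally_homogeneous(10, [18, 20, 22]))
```

A cluster can score well on the first measure and still fail the second, which is the gap the talk points to: fingerprint closeness alone does not guarantee a shared scaffold of meaningful size.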