Molecular Similarity and Chemical Families: The Homogeneity Approach
C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck
11th April, 2000
2nd Sheffield Chemoinformatics Conference, Sheffield, UK
Bioreason, Inc

Presentation Outline
• Introduction
  • Molecular similarity
  • Observations on chemical data
• Analyzing screening data
  • Using a traditional approach
• The Homogeneity Approach
  • Definitions
  • Implementation and experimental results
• Conclusions

Molecular Similarity
• Widely used throughout the drug discovery process
• Sample applications:
  • Assessing the diversity of a chemical dataset
  • Picking a representative dataset from a compound library
  • Given a compound and a compound library, identifying the subset of similar compounds
  • Analyzing screening data
    Major step: organizing screening data into chemical families

Typical Drug Discovery Process
[Flow diagram; labels: Start, Library, Assay, Screening, Data, Data Analysis, Further exploration, Chemistry, Drug Candidates]

Technology Employed
• Compound representation methods
  • Fingerprints/bit vectors, graph-based, ...
  • 2D keys vs. 3D keys, fragment-based vs. distance-based, ...
• Similarity and distance measures
  • Tanimoto, Euclidean, ..., graph-based, ...
• Clustering methods
• Classification methods
• Substructure searching/(sub)graph matching
• ...

Analyzing Chemical Compounds (1)
[Figure: an example structure is run through a dictionary of keys (N-N, Q-QH, Q-C(-N)-C, CH3-A-CH3, Q-N, N-A-A-O, N-C-O, O not % A % A, N-A-O, Q-Q, QH > 1, CH3 > 1, N > 1, NH, ...) to give the bit vector 10111000001...]

Analyzing Chemical Compounds (2)
Compounds are multi-domain:
• multiple occurrences of a key/substructure
• members of more than one chemical family

Analyzing Chemical Compounds (3)
Information loss! E.g. “how” does a key hit?

Dataset Used
• Derived from the NCI anti-HIV program
• Latest release (Oct. 1999): 43,382 compounds
• Cell-based assay; EC50 (effective concentration at which the test compound protects the cells by 50%)
• Pre-processing:
  • Molecular weight <= 500
  • Multiple EC50 values for some compounds; kept the highest concentration
  • 33,245 compounds left
• Activities: converted from molar concentrations to -log values
• Activity threshold used: 5.5
• Training set size (actives): 503

Analyzing Screening Data: Typical Approach
• Goal: data reduction
  • To a manageable size
  • In an organized fashion
  • With minimal information loss
• Represent molecules as vectors, often binary
• Similarity/distance measure
• Clustering algorithm
• Metacluster selection method (e.g. cluster level selection methods for hierarchical clustering)

Hierarchical Agglomerative Clustering Method
• NCI HIV dataset: 503-compound subset selected on activity
• Clustered using Ward's method with Euclidean distance, on bit vectors obtained by applying MACCS-like keys
• Cluster level selection using the Kelley method
• Results:
  • 70 (meta)clusters
  • Complete coverage of the dataset, no singletons!
  • Average metacluster size: 7.2 compounds

Method Evaluation - Chemists
• Results validation by comparing to known truth:
  • Some known chemical families were detected, e.g. AZTs, pyrimidine nucleosides, ...
  • Smaller, less well-represented families were not always detected, e.g. stilbenes, ...
• Results validation by assessing their quality:
  • On average chemists approved only 20-30 of the 70 clusters as chemical families of related compounds
  • The remaining clusters (~2/3) were difficult to interpret:
    • Compounds that shouldn’t be in some clusters
    • Compounds that should have been in some clusters (misclassified or not)
    • Clusters that were made of dissimilar/diverse compounds
  • Experts were puzzled by the absence of singletons

Method Evaluation - Computational
• Analyzed 70 groups of compounds
• Simple method:
  • average nearest neighbor distance within a set of compounds
  • distance computed using the bit vectors of the compounds
• 43/70: pretty low average nearest neighbor distance
• 22/70: moderate average nearest neighbor distance
• 5/70: quite high average nearest neighbor distance
• Overall most of the groups had a low diversity; expected, since the metaclusters were built using bit vectors

The problem
• Confusing?
  • Method functioned just right from a computational perspective
  • But the results were not as satisfying to the human expert
• Clustering results often don’t:
  • match expectations
  • make chemical sense
• Why?
  • Clustering is performed on molecular representations, often based on small keys, not on the molecules themselves
  • No chemical “common sense” influence on the clustering process

The road ahead… (1)
• What is the end goal of screening data analysis?
  • Finding the chemical families of interest, i.e. those that exhibit favorable biological characteristics
• How are we attempting to do it?
  • Clustering and classification methods using vector encoding representations of molecules
• But clustering only gives groups of compounds that have similar vector representations, and a successful classification session requires that one knows the chemical families of interest a priori.
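
The computational evaluation above scores each group by its average nearest neighbor distance over compound bit vectors. The slides do not give the exact formula or distance used for that step, so the following is only a minimal sketch under stated assumptions: Euclidean distance (the measure named for the clustering itself), and toy 6-bit fingerprints invented purely for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length bit vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def avg_nearest_neighbor_distance(vectors):
    """Mean, over all compounds, of the distance to the closest OTHER compound."""
    if len(vectors) < 2:
        raise ValueError("need at least two compounds")
    total = 0.0
    for i, v in enumerate(vectors):
        total += min(euclidean(v, w) for j, w in enumerate(vectors) if j != i)
    return total / len(vectors)


# Hypothetical fingerprints (not from the NCI data): one tight group, one diverse group.
tight_group = [[1, 0, 1, 1, 0, 0], [1, 0, 1, 1, 0, 1], [1, 0, 1, 0, 0, 0]]
loose_group = [[1, 0, 0, 0, 0, 0], [0, 1, 1, 0, 1, 1], [0, 0, 0, 1, 1, 0]]

tight_score = avg_nearest_neighbor_distance(tight_group)
loose_score = avg_nearest_neighbor_distance(loose_group)
print(tight_score, loose_score)
```

A low score only says the fingerprints are close to each other; as the slides argue, it does not by itself show that the compounds share a meaningful scaffold.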

The road ahead… (2)
• So, what do we do now that we are aware of the loose coupling between clusters obtained traditionally and human experts’ expectations?
  • Discover what the experts want
  • Adapt our process to match results and expectations

Definitions
• Chemical family: a set of highly similar compounds sharing a common scaffold; in other words, a set of compounds with high homogeneity
• Homogeneity: high structural similarity
  • Based not only on similarity of molecular vectors but also on the presence of a significant common scaffold
• Scaffold: a substructure defined as a specific configuration of atom types and bond types

Processing traditional method results
• Processing the results of traditional methods:
  • Easier to do than a complete re-design/reimplementation
  • Will “remove” results not chemically sensible
  • Will make life easier for human analysts by allowing them to focus on easily recognizable and interpretable pieces of knowledge
• Approach: compute and use structural homogeneity on results of traditional methods. Basically, construct “chemically sensible” methods for selecting the important compound groups

Identifying Scaffolds
• Maximum Common Substructure (MCS) extraction:
  • Using our own extremely fast and efficient implementations
• Highlights of analysis:
  • 7 out of 70 compound sets: common scaffold size < 2!
  • 5 MCSs appeared multiple times
    • Range: 2-6, mostly benzene rings
  • A total of 53 different scaffolds
  • MCS size: ranged from less than 2 atoms to greater than 14 atoms

Introducing Homogeneity
• Cluster homogeneity:
  • Fingerprint homogeneity: overall quite good average nearest neighbor distance
  • Structural homogeneity: used (# of atoms in MCS) / (avg. # of atoms in set molecules)
• Structural homogeneity threshold: 1/3
  • MCS covering at least a third of the average molecule size
• Results:
  • 23/70 clusters below threshold
  • 47 above threshold

Method Assessment (1)
• Results were used to assign priority to clusters:
  • Low priority (low likelihood of chemical sense):
    • clusters with small scaffolds, low structural homogeneity
    • clusters with insignificant scaffolds, low-to-moderate structural homogeneity
  • High priority (high likelihood of chemical sense):
    • well defined clusters, with high structural homogeneity and big, significant scaffolds
• Approach did make life easier for human analysts
  • Ability to find important information faster

Method Assessment (2)
• Prioritization assessment:
  • the 23 non-structurally homogeneous clusters were uninteresting to chemists
  • the 47 structurally homogeneous clusters included all those (20-30) approved before by chemists as chemical families
• However, experts complained about:
  • low information content of the clustering process results
    • Too many clusters, too little knowledge
  • the amount of information never found!
    • High priority clusters contained only 2/3 of compounds analyzed!
    • Clusters approved as chemical families from which knowledge could be derived easily contained only 1/3 of the compounds!!!
    • Known knowledge never found.

The road ahead… (3)
• Do traditionally obtained clusters relate to chemical families?
• Do we need a different approach?
  • Introduce chemically “aware” methods
    • No simple clustering methods
    • Take into account structural homogeneity
    • Accommodate multi-domain nature of molecules
  • Present results in a format that facilitates interpretation and knowledge discovery by chemists

A different approach: Can it work?
• We have been working on “chemically aware” screening data analysis methods
• Results on the same dataset with a typical Bioreason analysis:
  • 102 classes, all with high structural homogeneity
    • All classes were easy to interpret
    • Only 10% of classes not interesting to chemists (~50 compounds)
  • 47 singletons (~10% of dataset)
  • Information content much higher than with the traditional approach
    • 90% of compounds placed in homogeneous clusters (vs. 66% in the traditional method)
    • 80% of compounds placed in clusters approved as structural families (vs. 34% in the traditional method)
  • Multi-domain nature is accommodated

Conclusions
• Molecular fingerprint similarity does not supply a certain indication of high structural molecular similarity
• Most traditional chemical data analysis methods make heavy use of molecular fingerprint similarity
• As a consequence, relations (including clusters) obtained via traditional methods often don’t make chemical sense
• Structural homogeneity may be employed to enable formation of clusters and identification of chemical relations closer to chemists’ expectations

Acknowledgements
• Patricia Bacha
• Bobi Den Hartog

Info: nicolaou@bioreason.com, www.bioreason.com
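
The structural homogeneity measure defined in the slides (atoms in the MCS divided by the average number of atoms per molecule in the set, with a threshold of 1/3) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the MCS atom count is assumed to come from a separate MCS extraction step, and the `Cluster` class, the function names, and the atom counts are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Threshold from the slides: MCS covers at least a third of the average molecule size.
STRUCTURAL_HOMOGENEITY_THRESHOLD = 1 / 3


@dataclass
class Cluster:
    """Hypothetical container; mcs_atoms is assumed to be computed elsewhere."""
    name: str
    mcs_atoms: int             # atom count of the maximum common substructure
    molecule_atoms: List[int]  # atom count of each member compound


def structural_homogeneity(cluster):
    """(# of atoms in MCS) / (average # of atoms in the set's molecules)."""
    if not cluster.molecule_atoms:
        raise ValueError("cluster has no molecules")
    avg_atoms = sum(cluster.molecule_atoms) / len(cluster.molecule_atoms)
    return cluster.mcs_atoms / avg_atoms


def is_structurally_homogeneous(cluster, threshold=STRUCTURAL_HOMOGENEITY_THRESHOLD):
    """True when the MCS covers at least `threshold` of the average molecule."""
    return structural_homogeneity(cluster) >= threshold


# Invented atom counts, for illustration only.
high = Cluster("large shared scaffold", mcs_atoms=12, molecule_atoms=[20, 24, 22])
low = Cluster("benzene ring only", mcs_atoms=6, molecule_atoms=[18, 20, 22])

for c in (high, low):
    label = "high priority" if is_structurally_homogeneous(c) else "low priority"
    print(f"{c.name}: {structural_homogeneity(c):.2f} -> {label}")
```

Treating the threshold as inclusive (`>=`) follows the slide's wording "at least a third"; whether the original analysis also weighed scaffold significance numerically is not stated, so that part of the prioritization is left out here.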
