Molecular Similarity and Chemical Families:
The Homogeneity Approach
C.A. Nicolaou, B.P. Kelley, D.W. Miller, T.K. Brunck
11th April, 2000
2nd Sheffield Chemoinformatics Conference,
Sheffield, UK
Bioreason, Inc
Presentation Outline
Introduction
Molecular similarity
Observations on chemical data
Analyzing screening data
Using a traditional approach
The Homogeneity Approach
Definitions
Implementation and experimental results
Conclusions
Molecular Similarity
Widely used throughout the drug discovery process
Sample applications:
Assessing diversity of a chemical dataset
Picking a representative dataset from a compound library
Given a compound and a compound library, identifying the subset of similar compounds
Analyzing screening data
Major step:
• Organizing screening data into chemical families
Typical Drug Discovery Process
[Figure: flow diagram with the stages Library, Assay, Screening, Data, Data Analysis, Further exploration, Start Chemistry, Drug Candidates; Screening and Data Analysis are emphasized]
Technology Employed
Compound representation methods
Fingerprints/bit vectors, graph-based, ...
2D keys vs. 3D keys, fragment-based vs. distance-based, ...
Similarity and distance measures
Tanimoto, Euclidean, …, graph-based, ...
Clustering methods
Classification methods
Substructure searching/(sub)graph matching
...
Analyzing Chemical Compounds (1)
Dictionary of Keys
[Figure: example compound structure]
N-N
Q-QH
Q-C(-N)-C
CH3-A-CH3
Q-N
N-A-A-O
N-C-O
O not % A % A
N-A-O
Q-Q
QH > 1
CH3 > 1
N>1
NH
...
Resulting bit vector: 10111000001...
Analyzing Chemical Compounds (2)
Compounds are multi-domain:
multiple occurrences of a key/substructure
members of more than one chemical family
Analyzing Chemical Compounds (3)
Information loss!
E.g. “How” a key hits?
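The information loss can be illustrated with a toy encoding. In this sketch the key names come from the dictionary slide, but the hit counts for the two compounds are hypothetical; the point is only that presence/absence bits discard how often (and where) a key matches.

```python
# A small subset of the key dictionary from the slide.
KEYS = ["N-N", "Q-QH", "N-C-O", "N-A-O", "NH"]

def to_bits(hit_counts, keys=KEYS):
    """Collapse per-key hit counts into a presence/absence bit string."""
    return "".join("1" if hit_counts.get(k, 0) > 0 else "0" for k in keys)

# Hypothetical hit counts: compound_b matches the same keys, but more often.
compound_a = {"N-C-O": 1, "NH": 1}
compound_b = {"N-C-O": 3, "NH": 2}

print(to_bits(compound_a))  # 00101
print(to_bits(compound_b))  # 00101 as well: the counts are lost
```

Two compounds with different substitution patterns can therefore end up with identical fingerprints.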
Dataset Used
Derived from the NCI anti-HIV program
Latest release (Oct. 1999): 43,382 compounds
Cell-based assay; EC50 is the effective concentration at which the test compound protects the cells by 50%
Pre-processing:
Molecular weight <= 500
Multiple EC50 values for some compounds; kept the highest concentration
33,245 compounds left
Activities: converted from molar concentrations to -log values
Activity threshold used: 5.5
Training set size (actives): 503
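The pre-processing steps above can be sketched as follows. This is an illustration under stated assumptions: the logarithm is taken as base 10, a compound counts as active when its -log activity is at least 5.5, and the record layout and sample values are invented for the example.

```python
from math import log10

def p_activity(ec50_molar):
    """Convert a molar EC50 to a -log activity (base 10 is an assumption)."""
    return -log10(ec50_molar)

def preprocess(records, max_mw=500.0, threshold=5.5):
    """records: iterable of (compound_id, molecular_weight, ec50_molar)."""
    kept = {}
    for cid, mw, ec50 in records:
        if mw > max_mw:
            continue  # molecular weight filter (<= 500)
        # Several EC50 values per compound: keep the highest concentration,
        # i.e. the least potent measurement.
        if cid not in kept or ec50 > kept[cid]:
            kept[cid] = ec50
    activities = {cid: p_activity(ec50) for cid, ec50 in kept.items()}
    # Assumption: "active" means activity >= threshold.
    actives = sorted(cid for cid, act in activities.items() if act >= threshold)
    return activities, actives

# Made-up sample records, not NCI data.
records = [
    ("A", 320.0, 1e-6), ("A", 320.0, 1e-5),  # duplicate: 1e-5 M is kept
    ("B", 410.5, 1e-7),
    ("C", 650.0, 1e-8),                      # removed by the weight filter
    ("D", 250.0, 2e-6),
]
activities, actives = preprocess(records)
print(actives)  # compounds B and D pass the threshold
```

Keeping the highest concentration is the conservative choice: a compound is only labeled active if even its weakest measurement clears the threshold.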
Analyzing Screening Data
Typical Approach
Goal: Data Reduction
To manageable size
Organized fashion
With minimal information loss
Represent molecules as vectors, often binary
Similarity/distance measure
Clustering Algorithm
Metacluster selection method (e.g. cluster level
selection methods for hierarchical clustering)
Hierarchical Agglomerative Clustering
Method
NCI - HIV dataset
503-compound subset selected by activity
Clustered using Ward's method with Euclidean distance on bit vectors obtained from MACCS-like keys
Cluster level selection using the Kelley method
Results:
70 (meta)clusters
Complete coverage of the dataset, no singletons!
Average metacluster size: 7.2 compounds
Method Evaluation - Chemists
• Results validated by comparison to known truth:
Some known chemical families were detected, e.g. AZTs, pyrimidine nucleosides, ...
Smaller, less well-represented families were not always detected, e.g. stilbenes, ...
• Results validated by assessing their quality:
On average, chemists approved only 20-30 of the 70 clusters as chemical families of related compounds
The remaining clusters (~2/3) were difficult to interpret
Some clusters contained compounds that did not belong in them
Some clusters lacked compounds that should have been in them (misclassified or not)
Some clusters were made up of dissimilar/diverse compounds
Experts were puzzled by the absence of singletons
Method Evaluation - Computational
Analyzed 70 groups of compounds:
Simple method:
• average nearest neighbor distance within a set of compounds
• distance computed using the bit vectors of the compounds
43/70: pretty low average nearest neighbor distance
22/70: moderate average nearest neighbor distance
5/70: quite high average nearest neighbor distance.
Overall most of the groups had a low diversity;
expected since the metaclusters were built using bitvectors
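A minimal sketch of the simple measure described above: for each compound in a group, take the distance to its closest other member, then average over the group. Euclidean distance and the toy fingerprints are assumptions for illustration; the talk does not state which distance was used for this evaluation.

```python
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length bit vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def avg_nearest_neighbor_distance(fingerprints, dist=euclidean):
    """Mean, over all members, of the distance to the closest other member."""
    n = len(fingerprints)
    if n < 2:
        raise ValueError("need at least two compounds")
    total = 0.0
    for i, fp in enumerate(fingerprints):
        total += min(dist(fp, other)
                     for j, other in enumerate(fingerprints) if j != i)
    return total / n

# Toy groups (made up): one tight, one diverse.
tight = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 1, 0, 0]]
loose = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]

print(avg_nearest_neighbor_distance(tight))  # low: members nearly identical
print(avg_nearest_neighbor_distance(loose))  # higher: members share few bits
```

Because this score is computed on the same bit vectors that drove the clustering, a low value confirms internal consistency of the method rather than chemical coherence of the group.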
The problem
Confusing?
The method worked as intended from a computational perspective
But the results were not as satisfying to the human experts
Clustering results often don’t:
match expectations
make chemical sense
Why?
Clustering is performed on molecular representations, often
based on small keys, not on the molecules themselves
No chemical “common sense” influence on the clustering
process
The road ahead… (1)
What is the end goal of screening data analysis?
Finding the chemical families of interest, i.e. those
that exhibit favorable biological characteristics
How are we attempting to do it?
Clustering and classification methods using vector
encoding representations of molecules
But,
clustering only gives groups of compounds that have similar
vector representations and,
a successful classification session requires that one knows
the chemical families of interest a priori.
The road ahead… (2)
So, what do we do now that we are aware of the
loose coupling between clusters obtained
traditionally and human experts’ expectations?
Discover what the experts want
Adapt our process to match results and expectations
Definitions
Chemical family:
A set of highly similar compounds sharing a common scaffold; in other words, a set of compounds with high homogeneity
Homogeneity:
High structural similarity
Based not only on similarity of molecular vectors but also
on the presence of a significant common scaffold
Scaffold:
A substructure defined as a specific configuration of
atom types and bond types
Processing traditional method results
Processing the results of traditional methods:
Easier to do than a complete re-design/reimplementation
Will “remove” results that are not chemically sensible
Will make life easier for human analysts by allowing
them to focus on easily recognizable and interpretable
pieces of knowledge
Approach:
Compute and use structural homogeneity on the results of traditional methods. In essence, construct “chemically sensible” methods for selecting the important compound groups
Identifying Scaffolds
Maximum Common Substructure (MCS) extraction:
Using our own fast, efficient implementations
Highlights of analysis:
7 out of 70 compound sets: common scaffold size < 2 atoms!
5 MCSs appeared multiple times
Range: 2-6, mostly benzene rings
A total of 53 different scaffolds
MCS size:
Ranged from less than 2 atoms to greater than 14 atoms
Introducing Homogeneity
Cluster homogeneity:
Fingerprint Homogeneity:
Overall low average nearest neighbor distance
Structural Homogeneity:
Defined as: (# of atoms in the MCS) / (average # of atoms per molecule in the set)
Structural Homogeneity Threshold: 1/3
• MCS covering at least a third of the average molecule size
Results:
• 23/70 clusters below threshold
• 47 above threshold
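The structural homogeneity ratio and its 1/3 threshold can be sketched as below. MCS extraction itself is not shown; the MCS size is assumed to be supplied by a separate tool, atoms are assumed to be counted the same way for the MCS and the molecules, and the example numbers are invented.

```python
THRESHOLD = 1 / 3  # threshold stated on the slide

def structural_homogeneity(mcs_atoms, molecule_atom_counts):
    """(# atoms in the MCS) / (average # atoms per molecule in the set)."""
    if not molecule_atom_counts:
        raise ValueError("empty compound set")
    avg_atoms = sum(molecule_atom_counts) / len(molecule_atom_counts)
    return mcs_atoms / avg_atoms

def is_structurally_homogeneous(mcs_atoms, molecule_atom_counts,
                                threshold=THRESHOLD):
    """True when the MCS covers at least a third of the average molecule."""
    return structural_homogeneity(mcs_atoms, molecule_atom_counts) >= threshold

# Hypothetical clusters: a shared benzene ring (6 atoms) in large molecules,
# and a 14-atom scaffold in medium-sized molecules.
print(structural_homogeneity(6, [24, 30, 27]))        # 6 / 27
print(is_structurally_homogeneous(6, [24, 30, 27]))   # below threshold
print(is_structurally_homogeneous(14, [20, 22, 24]))  # above threshold
```

A benzene ring shared by large molecules scores low on this ratio, which matches the observation that small, common scaffolds are weak evidence of a chemical family.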
Method Assessment (1)
Results were used to assign priority to clusters:
Low Priority - low likelihood of chemical sense:
clusters with small scaffolds, low structural homogeneity
clusters with insignificant scaffolds, low-to-moderate
structural homogeneity
High Priority - high likelihood of chemical sense:
well defined clusters, with high structural homogeneity and
big, significant scaffolds
The approach did make life easier for human analysts
Ability to find important information faster
Method Assessment (2)
• Prioritization assessment:
The 23 structurally non-homogeneous clusters were uninteresting to chemists.
The 47 structurally homogeneous clusters included all those (20-30) previously approved by chemists as chemical families.
• However, experts complained about:
The low information content of the clustering results
Too many clusters, too little knowledge
The amount of information never found!
High-priority clusters contained only 2/3 of the compounds analyzed!
Clusters approved as chemical families, from which knowledge could be derived easily, contained only 1/3 of the compounds!
Known knowledge was never found.
The road ahead… (3)
Do traditionally obtained clusters relate to
chemical families?
Do we need a different approach?
Introduce chemically “aware” methods
No simple clustering methods
Take into account structural homogeneity
Accommodate multi-domain nature of molecules
Present results in a format that facilitates interpretation
and knowledge discovery by chemists
A different approach: Can it work?
Have been working on “chemically aware”
screening data analysis methods
Results on the same dataset with a typical Bioreason analysis:
102 classes, all with high structural homogeneity
• All classes were easy to interpret
• Only 10% of classes not interesting to chemists (~50 compounds)
47 singletons (~10% of dataset)
Information content much higher than traditional approach
• 90% of compounds placed in homogeneous clusters (vs. 66% in the traditional method)
• 80% of compounds placed in clusters approved as structural families (vs. 34% in the traditional method)
Multi-domain nature is accommodated
Conclusions
• Molecular fingerprint similarity is not a reliable indication of high structural similarity between molecules
• Most traditional chemical data analysis methods make heavy use of molecular fingerprint similarity
• As a consequence, relations (including clusters) obtained via traditional methods often don’t make chemical sense
• Structural homogeneity may be employed to enable formation of clusters and identification of chemical relations closer to chemists’ expectations
Acknowledgements
Patricia Bacha
Bobi Den Hartog
Info:
nicolaou@bioreason.com
www.bioreason.com