Similarity and Diversity
Alexandre Varnek,
University of Strasbourg, France
What is similar?
Different „spaces“, classified by:
Shape
Size
Colour
Pattern
16 diverse aldehydes...
[Figure: structures of the 16 aldehydes]
...sorted by common scaffold
[Figure: the same aldehydes grouped by common scaffold]
...sorted by functional groups
[Figure: the same aldehydes grouped by functional groups]
The „Similarity Principle“ :
Structurally similar molecules are assumed to have
similar biological properties
Compounds active at opioid receptors
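Structural Spectrum of Thrombin Inhibitors
[Figure: reference compounds and a series of thrombin inhibitors, with structural similarity “fading away” along the series; similarity values shown: 0.56, 0.72, 0.53, 0.84, 0.67, 0.52, 0.82, 0.64, 0.39]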
Key features in similarity/diversity
calculations:
• Properties to describe elements
(descriptors, fingerprints)
• Distance measure („metrics“)
N-Dimensional Descriptor Space
• Each chosen descriptor adds a dimension to the reference space
molecule $M_i = (\mathrm{descriptor}_1(i), \mathrm{descriptor}_2(i), \dots, \mathrm{descriptor}_n(i))$
• Calculation of n descriptor values produces an n-dimensional coordinate vector in descriptor space that determines the position of a molecule
[Figure: descriptor space with axes descriptor1, descriptor2, descriptor3, …, descriptorn]
Chemical Reference Space
• Distance in chemical space is used as a measure of molecular “similarity“ and “dissimilarity“
• “Molecular similarity“ covers not only chemical similarity but also property similarity, including biological activity
[Figure: molecules A and B in descriptor space (axes descriptor1, descriptor2, descriptor3, …, descriptorn), separated by the distance $D_{AB}$]
Distance Metrics in n-D Space
• If two molecules have comparable values in all
the n descriptors in the space, they are
located close to each other in the n-D space.
– how to define “closeness“ in space as a measure
of molecular similarity?
– distance metrics
Descriptor-based Similarity
• When two molecules A and B are projected into an n-D space, two vectors, A and B, represent their descriptor values, respectively.
– A = (a1, a2, ..., an)
– B = (b1, b2, ..., bn)
• The similarity between A and B, $S_{AB}$, is negatively correlated with the distance $D_{AB}$
– shorter distance ~ more similar molecules
– in the case of a normalized distance (within value range [0,1]), similarity = 1 – distance
– $D_{AB} > D_{BC} \Rightarrow S_{AB} < S_{BC}$
[Figure: molecules A, B and C in descriptor space (axes descriptor1, descriptor2, descriptor3, …, descriptorn) with the distances $D_{AB}$ and $D_{BC}$]
Metrics Properties
(1) The distance values are non-negative: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
(2) Symmetry: $d_{AB} = d_{BA}$
(3) Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$
[Figure: molecules A, B and C in descriptor space with the distances $D_{AB}$ and $D_{BC}$]
Euclidean Distance in n-D Space
• Given two n-dimensional vectors, A and B
– A = (a1, a2, ..., an)
– B = (b1, b2, ..., bn)
• Euclidean distance $D_{AB}$ is defined as:
$$D_{AB} = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
• Example:
– A = (3,0,1); B = (5,2,0)
– $D_{AB} = \sqrt{(3-5)^2 + (0-2)^2 + (1-0)^2} = 3$
[Figure: points A and B in descriptor space joined by the straight-line distance $D_{AB}$]
Manhattan Distance in n-D Space
• Given two n-dimensional vectors, A and B
– A = (a1, a2, ..., an)
– B = (b1, b2, ..., bn)
• Manhattan distance $D_{AB}$ is defined as:
$$D_{AB} = \sum_{i=1}^{n} |a_i - b_i|$$
• Example:
– A = (3,0,1); B = (5,2,0)
– $D_{AB} = |3-5| + |0-2| + |1-0| = 5$
[Figure: points A and B in descriptor space joined by an axis-parallel path of length $D_{AB}$]
Distance Measures („Metrics“):
Euclidean distance:
$[(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2]^{1/2} = (4^2 + 2^2)^{1/2} = 4.472$
Manhattan (Hamming) distance:
$|x_{11} - x_{21}| + |x_{12} - x_{22}| = 4 + 2 = 6$
Sup distance:
$\max(|x_{11} - x_{21}|, |x_{12} - x_{22}|) = \max(4, 2) = 4$
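The three distance measures above can be sketched in a few lines of Python. This is only an illustration; the function names are chosen here and do not come from the slides. The vectors A = (3,0,1) and B = (5,2,0) are the ones used in the Euclidean and Manhattan examples.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def sup_distance(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


A = (3, 0, 1)
B = (5, 2, 0)
print(euclidean(A, B))     # 3.0
print(manhattan(A, B))     # 5
print(sup_distance(A, B))  # 2
```

For the two-dimensional example with coordinate differences 4 and 2, the same functions give 4.472, 6 and 4.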
Binary Fingerprint
Popular Similarity/Distance Coefficients
• Similarity metrics:
– Tanimoto coefficient
– Dice coefficient
– Cosine coefficient
• Distance metrics:
– Euclidean distance
– Hamming distance
– Soergel distance
Tanimoto Coefficient (Tc)
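• Definition:
$$s(A,B) = T_c(A,B) = \frac{c}{a + b - c}$$
where a and b are the numbers of bits set in the fingerprints of A and B, and c is the number of bits set in both
– value range: [0,1]
– Tc is also known as the Jaccard coefficient
– Tc is the most popular similarity coefficient
[Figure: sets A and B with their common part C]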
Example Tc Calculation
[Figure: binary fingerprints of A and B]
a = 4, b = 4, c = 2
$$T_c(A,B) = \frac{2}{4 + 4 - 2} = \frac{2}{6} = \frac{1}{3}$$
Dice Coefficient
• Definition:
$$s(A,B) = \frac{2c}{a + b}$$
– value range: [0,1]
– monotonic with the Tanimoto coefficient
Cosine Coefficient
• Definition:
$$s(A,B) = \frac{c}{\sqrt{ab}}$$
• Properties:
– value range: [0,1]
– correlated with the Tanimoto coefficient but not strictly monotonic with it
Hamming Distance
• Definition:
$$d(A,B) = a + b - 2c$$
– value range: [0,N] (N, length of the fingerprint)
– also called Manhattan/City Block distance
Soergel Distance
• Definition:
$$d(A,B) = \frac{a + b - 2c}{a + b - c}$$
• Properties:
– value range: [0,1]
– equivalent to (1 – Tc) for binary fingerprints
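As a minimal sketch, the five coefficients can be computed from binary fingerprints stored as Python sets of "on" bit positions. The set representation, the function names and the bit positions in the example are assumptions made for illustration; only the counts a = 4, b = 4, c = 2 come from the Tc example. Both fingerprints are assumed to have at least one bit set.

```python
import math


def bit_counts(fp_a, fp_b):
    """Return (a, b, c): bits set in A, bits set in B, bits set in both."""
    return len(fp_a), len(fp_b), len(fp_a & fp_b)


def tanimoto(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return c / (a + b - c)


def dice(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return 2 * c / (a + b)


def cosine(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return c / math.sqrt(a * b)


def hamming(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return a + b - 2 * c


def soergel(fp_a, fp_b):
    a, b, c = bit_counts(fp_a, fp_b)
    return (a + b - 2 * c) / (a + b - c)


# Hypothetical bit positions chosen so that a = 4, b = 4, c = 2.
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
print(tanimoto(A, B))  # 2/6
print(dice(A, B))      # 0.5
print(cosine(A, B))    # 0.5
print(hamming(A, B))   # 4
print(soergel(A, B))   # 4/6, i.e. 1 - Tc
```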
Similarity coefficients
Metric Properties
(1) The distance values are non-negative: $d_{AB} \ge 0$; $d_{AA} = d_{BB} = 0$
(2) Symmetry: $d_{AB} = d_{BA}$
(3) Triangle inequality: $d_{AB} \le d_{AC} + d_{BC}$
Properties of Similarity and Distance Coefficients
The Euclidean and Hamming distances, and the complement of the Tanimoto coefficient for dichotomous (binary) variables, obey all properties.
In general, the complements of the Tanimoto, Dice and Cosine coefficients do not obey the triangle inequality (3).
Coefficients are monotonic if they produce the same similarity ranking
Similarity search
Using bit strings to encode molecular size. A biphenyl query is compared to a series of
analogues of increasing size. The Tanimoto coefficient, which is shown next to the
corresponding structure, decreases with increasing size, until a limiting value is
reached.
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386
Similarity search
Molecular similarity at a range of Tanimoto coefficient values
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386
Similarity search
The distribution of Tanimoto coefficient values found in database searches with a range of
query molecules of increasing size and complexity
D.R. Flower, J. Chem. Inf. Comput. Sci., Vol. 38, No. 3, 1998, pp. 379-386
Molecular Similarity
A comparison of the Soergel and Hamming distance values for two pairs
of structures to illustrate the effect of molecular size
A. R. Leach and V. J. Gillet, "An Introduction to Chemoinformatics", Kluwer Academic Publishers, 2003
Molecular Similarity
The maximum common subgraph (MCS) between the two molecules is in bold
Similarity = Nbonds(MCS) / Nbonds(query)
A. R. Leach and V. J. Gillet, "An Introduction to Chemoinformatics", Kluwer Academic Publishers, 2003
Activity landscape
How important is the choice of descriptors?
Inhibitors of acyl-CoA:cholesterol acyltransferase represented with
MACCS (a), TGT (b), and Molprint2D (c) fingerprints.
continuous SARs
gradual changes in structure result in moderate changes in activity
“rolling hills” (G. Maggiora)
discontinuous SARs
small changes in structure have dramatic effects on activity
“cliffs” in activity landscapes
Structure-Activity Landscape Index:
$$\mathrm{SALI}_{ij} = \Delta A_{ij} / \Delta S_{ij}$$
$\Delta A_{ij}$ ($\Delta S_{ij}$) is the difference between activities (similarities) of molecules i and j
R. Guha et al., J. Chem. Inf. Model., 2008, 48, 646
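A small sketch of the index is given below. It assumes that the structural term in the denominator is the distance 1 – similarity, as in the cited paper, and that activities are on a logarithmic scale such as pIC50. The function name and the handling of identical fingerprints are choices made for this illustration.

```python
import math


def sali(activity_i, activity_j, similarity):
    """Structure-Activity Landscape Index for one pair of molecules.

    Assumption: the denominator is the structural distance (1 - similarity).
    """
    delta_a = abs(activity_i - activity_j)
    delta_s = 1.0 - similarity
    if delta_s == 0.0:
        # Identical fingerprints: any activity difference is an infinite "cliff".
        # Returning 0.0 for equal activities is a convention of this sketch.
        return math.inf if delta_a > 0 else 0.0
    return delta_a / delta_s


# Hypothetical pair: two log units apart in activity, similarity 0.8.
print(sali(8.0, 6.0, 0.8))
# A pair with different activities but similarity 1.00 is an extreme cliff.
print(sali(8.2, 5.6, 1.0))  # inf
```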
VEGFR-2 tyrosine kinase inhibitors
discontinuous SARs
small changes in structure have dramatic effects on activity
“cliffs” in activity landscapes
[Figure: inhibitor (6 nM) and its analog (2390 nM); MACCS Tc: 1.00]
bad news for molecular similarity analysis...
lead optimization, QSAR
Example of a “Classical” Discontinuous SAR
Any similarity method
must recognize these
compounds as being
“similar“ ...
(MACCS Tanimoto similarity)
Adenosine deaminase inhibitors
Library design
Goal: to select a representative subset from a large database
Chemical Space
Overlapping similarity radii → Redundancy
„Void“ regions → Lack of information
No redundancy, no „voids“ → Optimally diverse compound library
Subset selection from the libraries
• Clustering
• Dissimilarity-based methods
• Cell-based methods
• Optimisation techniques
Clustering in chemistry
What is clustering?

• Clustering is the separation of a set of objects into groups such that items in one group are more like each other than items in a different group
• A technique to understand, simplify and interpret large amounts of multidimensional data
• Classification without labels (“unsupervised learning”)
Where is clustering used?
General:
data mining, statistical data analysis, data compression, image segmentation, document classification (information retrieval)
Chemical:
• representative sample,
• subset selection,
• classification of new compounds
Overall strategy
• Select descriptors
• Generate descriptors for all items
• Scale descriptors
• Define the similarity measure (“metric”)
• Apply an appropriate clustering method to group the items on the basis of the chosen descriptors and similarity measure
• Analyse results
Data Presentation
Pattern matrix (molecules × descriptors)
Library contains n molecules, each molecule is described by p descriptors
Proximity matrix (molecules × molecules)
$d_{ii} = 0$; $d_{ij} = d_{ji}$
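The step from a pattern matrix to a proximity matrix might look like the sketch below. The autoscaling (zero mean, unit variance) and the Euclidean distance are assumptions made here; the slides do not prescribe a particular scaling or metric, and the descriptor values are invented.

```python
import math


def autoscale(pattern):
    """Scale each descriptor column to zero mean and unit variance."""
    n, p = len(pattern), len(pattern[0])
    means = [sum(row[k] for row in pattern) / n for k in range(p)]
    stds = []
    for k in range(p):
        var = sum((row[k] - means[k]) ** 2 for row in pattern) / n
        stds.append(math.sqrt(var) if var > 0 else 1.0)  # constant column: leave unscaled
    return [[(row[k] - means[k]) / stds[k] for k in range(p)] for row in pattern]


def proximity_matrix(pattern):
    """Symmetric n x n matrix of Euclidean distances with a zero diagonal."""
    n = len(pattern)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(pattern[i], pattern[j])))
            d[i][j] = d[j][i] = dist
    return d


# Hypothetical pattern matrix: n = 3 molecules, p = 2 descriptors.
pattern = [[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]]
for row in proximity_matrix(autoscale(pattern)):
    print(row)
```

Without scaling, the second descriptor would dominate the distances simply because its numerical range is larger.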
Clustering methods
Hierarchical
– Agglomerative: Single Link, Complete Link, Group Average, Weighted Group Average, Centroid, Median, Ward
– Divisive: Monothetic, Polythetic
Non-hierarchical
– Single Pass
– Relocation
– Nearest Neighbour: Jarvis-Patrick
– Mixture Model
– Topographic
– Others
Hierarchical Clustering
A dendrogram representing a hierarchical clustering of 7 compounds
Sequential Agglomerative Hierarchical Non-overlapping (SAHN) methods
Single link
Complete link
Group average
In the Single Link method, the intercluster distance is equal to the minimum distance between any two compounds, one from each cluster.
In the Complete Link method, the intercluster distance is equal to the maximum distance between any two compounds, one from each cluster.
The Group Average method measures intercluster distance as the average of the distances between all pairs of compounds, one from each cluster.
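The three intercluster distances can be written directly from these definitions. In this sketch the clusters are plain lists of objects and `dist` is any pairwise distance function; both are assumptions of the illustration, as are the one-dimensional example values.

```python
def single_link(cluster_a, cluster_b, dist):
    """Minimum distance between two compounds, one from each cluster."""
    return min(dist(x, y) for x in cluster_a for y in cluster_b)


def complete_link(cluster_a, cluster_b, dist):
    """Maximum distance between two compounds, one from each cluster."""
    return max(dist(x, y) for x in cluster_a for y in cluster_b)


def group_average(cluster_a, cluster_b, dist):
    """Average distance over all pairs, one compound from each cluster."""
    total = sum(dist(x, y) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def abs_diff(x, y):
    return abs(x - y)


# Hypothetical one-dimensional "compounds".
first, second = [0.0, 1.0], [3.0, 5.0]
print(single_link(first, second, abs_diff))    # 2.0
print(complete_link(first, second, abs_diff))  # 5.0
print(group_average(first, second, abs_diff))  # 3.5
```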
Hierarchical Clustering:
Johnson’s method
The algorithm is an agglomerative scheme that erases rows and columns in the proximity matrix as old clusters are merged into new ones.
Step 1. Group the closest pair of objects, (r) and (s), into a cluster:
$$d[(r),(s)] = \min_{i,j} \{ d[(i),(j)] \}$$
Step 2. Update the proximity matrix:
Single-link:
$$d[(k),(r,s)] = \min \{ d[(k),(r)],\ d[(k),(s)] \}$$
Complete-link:
$$d[(k),(r,s)] = \max \{ d[(k),(r)],\ d[(k),(s)] \}$$
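The two steps can be sketched as a loop that repeats until a single cluster remains. The dictionary-based proximity table, the cluster labels and the returned merge list are implementation choices made for this illustration, not part of the slide; the example points are invented.

```python
def johnson_clustering(dist, linkage="single"):
    """Agglomerative clustering on a symmetric distance matrix (list of lists).

    Returns the merges as (members_r, members_s, distance) tuples.
    """
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    combine = min if linkage == "single" else max
    n = len(dist)
    clusters = {i: (i,) for i in range(n)}  # label -> member indices
    # Proximity between current clusters, keyed by (lower label, higher label).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges, next_label = [], n
    while len(clusters) > 1:
        # Step 1: find the closest pair of clusters (r, s).
        (r, s), d_rs = min(d.items(), key=lambda item: item[1])
        merges.append((clusters[r], clusters[s], d_rs))
        # Step 2: erase the entries of r and s, add those of the merged cluster.
        for k in clusters:
            if k in (r, s):
                continue
            d_kr = d.pop((min(k, r), max(k, r)))
            d_ks = d.pop((min(k, s), max(k, s)))
            d[(k, next_label)] = combine(d_kr, d_ks)
        del d[(r, s)]
        clusters[next_label] = clusters.pop(r) + clusters.pop(s)
        next_label += 1
    return merges


# Hypothetical compounds placed on a line at 0, 1, 3 and 7.
points = [0, 1, 3, 7]
matrix = [[abs(a - b) for b in points] for a in points]
print([m[2] for m in johnson_clustering(matrix, "single")])    # [1, 2, 4]
print([m[2] for m in johnson_clustering(matrix, "complete")])  # [1, 3, 7]
```

The merge distances are the heights at which the corresponding branches join in the dendrogram.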
Hierarchical Clustering:
single link
Hierarchical Clustering:
complete link
Hierarchical Clustering:
single vs complete link
Non-Hierarchical Clustering:
the Jarvis-Patrick method
At the first step, the nearest neighbours of each compound are found by calculating all pairwise similarities and sorting according to this similarity.
Then, two compounds are placed into the same cluster if:
1. They are in each other’s list of m nearest neighbours.
2. They have p (where p < m) nearest neighbours in common. Typical values: m = 14; p = 8.
Problem: too many singletons.
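A compact sketch of the method follows. It works on a distance matrix, reads "p nearest neighbours in common" as "at least p", and merges linked pairs transitively with a union-find structure; these three points are assumptions of this illustration. The tiny data set uses m = 2 and p = 1 instead of the typical values.

```python
def jarvis_patrick(dist, m, p):
    """Cluster indices 0..n-1 from a symmetric distance matrix."""
    n = len(dist)
    # Step 1: list of the m nearest neighbours of every compound.
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(set(order[:m]))

    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step 2: link pairs that satisfy both conditions.
    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            shared = len(neighbours[i] & neighbours[j])
            if mutual and shared >= p:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical compounds on a line; the last one is an outlier.
points = [0, 1, 2, 10, 11, 12, 50]
matrix = [[abs(a - b) for b in points] for a in points]
print(jarvis_patrick(matrix, m=2, p=1))  # [[0, 1, 2], [3, 4, 5], [6]]
```

The outlier ends up as a singleton because it is in nobody's neighbour list, which illustrates the problem noted above.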
Non-Hierarchical Clustering:
the relocation methods
Relocation algorithms involve an initial assignment of compounds to clusters, which is then iteratively refined by moving (relocating) certain compounds from one cluster to another.
Example: the K-means method
1. Random choice of c “seed” compounds. Other compounds are assigned to the nearest seed, resulting in an initial set of c clusters.
2. The centroids of the clusters are calculated. The objects are reassigned to the nearest cluster centroid. This step is repeated until the assignments no longer change.
Problem: the method is dependent upon the initial set of cluster centroids.
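The K-means procedure might be sketched as follows, with compounds given as tuples of descriptor values. The Euclidean distance, the iteration limit, the fixed random seed and the example points are assumptions of this illustration.

```python
import math
import random


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, c, max_iter=100, seed=0):
    """Return (assignment, centroids) for c clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(points, c)  # step 1: random seed compounds
    assignment = None
    for _ in range(max_iter):
        # Assign every compound to the nearest seed or centroid.
        new_assignment = [
            min(range(c), key=lambda k: euclid(pt, centroids[k])) for pt in points
        ]
        if new_assignment == assignment:
            break  # no compound was relocated
        assignment = new_assignment
        # Step 2: recompute the centroid of every non-empty cluster.
        for k in range(c):
            members = [pt for pt, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical two-descriptor data with two well-separated groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centres = k_means(data, c=2)
print(labels)
print(centres)
```

Running the function with different `seed` values shows the dependence on the initial seeds; on less clearly separated data the final clusters can differ.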
Efficiency of Clustering Methods
Method                                      Storage space   Time
Hierarchical (general)                      O(N²)           O(N³)
Hierarchical (Ward’s method + RNN)          O(N)            O(N²)
Non-hierarchical (general)                  O(N)            O(MN)
Non-hierarchical (Jarvis-Patrick method)    O(N²)           O(MN)
N is the number of compounds and M is the number of clusters
Validity of clustering
• How many clusters are in the data?
• Does the partitioning match the categories?
• Where should the dendrogram be cut?
• Which of two partitions fits the data better?
Dissimilarity-Based Compound
Selection (DBCS)
4-step basic algorithm for DBCS:
1. Select a compound and place it in the subset.
2. Calculate the dissimilarity between each compound remaining in the data set and the compounds in the subset.
3. Choose the next compound as that which is most dissimilar to the compounds in the subset.
4. If n < n0 (n being the current number of compounds in the subset and n0 the desired number of compounds in the final subset), return to step 2.
Dissimilarity-Based Compound
Selection (DBCS)
Basic algorithm for DBCS:
1st step – selection of the initial compound
1. Select it at random;
2. Choose the molecule which is “most representative” (e.g., has the largest sum of similarities to other molecules);
3. Choose the molecule which is “most dissimilar” (e.g., has the smallest sum of similarities to other molecules).
Dissimilarity-Based Compound
Selection (DBCS)
Basic algorithm for DBCS:
2nd step – calculation of dissimilarity
• Dissimilarity is the opposite of similarity
$\mathrm{Dissimilarity}_{i,j} = 1 - \mathrm{Similarity}_{i,j}$
(where “Similarity” is the Tanimoto, Dice, Cosine, … coefficient)
Diversity
Diversity characterises a set of molecules
$$\mathrm{Diversity} = \frac{\sum_{I} \sum_{J \ne I} \mathrm{Dissimilarity}(I,J)}{N(N-1)}$$
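Read as the mean pairwise dissimilarity over the N molecules of a set, the diversity can be computed as in this short sketch. The input is assumed to be a full symmetric dissimilarity matrix, and the numerical values are invented for illustration.

```python
def diversity(dissimilarity):
    """Mean of Dissimilarity(I, J) over all ordered pairs with I != J."""
    n = len(dissimilarity)
    total = sum(dissimilarity[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


# Hypothetical dissimilarities (1 - Tanimoto) for three molecules.
matrix = [
    [0.0, 0.2, 0.8],
    [0.2, 0.0, 0.6],
    [0.8, 0.6, 0.0],
]
print(diversity(matrix))  # (0.2 + 0.8 + 0.6) / 3
```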
Dissimilarity-Based Compound
Selection (DBCS)
Basic algorithm for DBCS:
3rd step – selection of the most dissimilar compound
Given a subset that already contains m compounds, there are several methods to select the next one:
1) The MaxSum method selects the compound i that has the maximum sum of distances to all molecules in the subset:
$$\mathrm{score}_i = \sum_{j=1}^{m} D_{i,j}$$
2) The MaxMin method selects the compound i with the maximum distance to its closest neighbour in the subset:
$$\mathrm{score}_i = \min(D_{i,j};\ j = 1, \dots, m)$$
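The basic DBCS loop with either score can be sketched as below. The function signature, the choice of the first compound by index and the tie-breaking rule (lowest index wins) are assumptions made for this illustration; the example points are invented.

```python
def dbcs(dist, n0, method="maxmin", first=0):
    """Select n0 compound indices from a symmetric distance matrix."""
    score = {"maxsum": sum, "maxmin": min}[method]
    subset = [first]                              # step 1: initial compound
    remaining = set(range(len(dist))) - {first}
    while len(subset) < n0 and remaining:         # step 4: stop at n0 compounds
        # Steps 2 and 3: score every remaining compound against the subset
        # and take the one with the highest score.
        best = max(sorted(remaining), key=lambda i: score(dist[i][j] for j in subset))
        subset.append(best)
        remaining.remove(best)
    return subset


# Hypothetical compounds on a line at 0, 1, 5, 9 and 10.
points = [0, 1, 5, 9, 10]
matrix = [[abs(a - b) for b in points] for a in points]
print(dbcs(matrix, 3, "maxmin"))  # [0, 4, 2]
print(dbcs(matrix, 3, "maxsum"))  # [0, 4, 1]
```

In this toy case MaxMin fills the middle of the range, whereas MaxSum scores all interior points equally and therefore picks the first of them, which lies close to an already selected compound.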
Basic algorithm for DBCS:
3rd step – selection of the most dissimilar compound
3) The Sphere Exclusion Algorithm
1. Define a threshold dissimilarity, t
2. Select a compound and place it in the subset.
3. Remove all molecules from the data set that have a dissimilarity to the selected molecule of less than t
4. Return to step 2 if there are molecules remaining in the data set.
The next compound can be selected
• randomly;
• using a MaxMin-like method
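A minimal sketch of sphere exclusion is shown below. For reproducibility it always takes the first remaining compound as the next selection; this deterministic rule is an assumption of the sketch and stands in for the random or score-based choice. The example points are invented.

```python
def sphere_exclusion(dist, t):
    """Select compound indices so that no two are closer than the threshold t."""
    remaining = list(range(len(dist)))
    subset = []
    while remaining:                      # step 4: loop while molecules remain
        chosen = remaining[0]             # step 2: select a compound
        subset.append(chosen)
        # Step 3: drop the chosen compound and everything within its sphere.
        remaining = [i for i in remaining if i != chosen and dist[i][chosen] >= t]
    return subset


# Hypothetical compounds on a line at 0, 1, 5, 9 and 10.
points = [0, 1, 5, 9, 10]
matrix = [[abs(a - b) for b in points] for a in points]
print(sphere_exclusion(matrix, t=2))  # [0, 2, 3]
```

Unlike the MaxSum and MaxMin loops, the size of the subset is not fixed in advance: it follows from the threshold t.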
DBCS : Subset selection from the libraries
Cell-based methods
Cell-based or partitioning methods operate within a predefined low-dimensional chemical space.
If there are K axes (properties) and each is divided into $b_i$ bins, then the number of cells $N_{cells}$ in the multidimensional space is
$$N_{cells} = \prod_{i=1}^{K} b_i$$
Cell-based methods
The construction of a 2-dimensional chemical space.
LogP bins: <0, 0-3, 3-7 and >7
MW bins: 0-250, 250-500, 500-750, >750.
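The partitioning can be sketched with two lists of bin edges. The sketch assumes LogP edges at 0, 3 and 7 and MW edges at 250, 500 and 750, with a value equal to an edge falling into the upper bin; the compounds and their property values are invented.

```python
from bisect import bisect_right
from collections import Counter

LOGP_EDGES = [0.0, 3.0, 7.0]        # 4 bins: <0, 0-3, 3-7, >7
MW_EDGES = [250.0, 500.0, 750.0]    # 4 bins: 0-250, 250-500, 500-750, >750


def cell_index(logp, mw):
    """Return the (LogP bin, MW bin) cell of one compound."""
    return bisect_right(LOGP_EDGES, logp), bisect_right(MW_EDGES, mw)


def occupancy(compounds):
    """Count how many compounds fall into each occupied cell."""
    return Counter(cell_index(logp, mw) for logp, mw in compounds)


n_cells = (len(LOGP_EDGES) + 1) * (len(MW_EDGES) + 1)
print(n_cells)  # 16

# Hypothetical (LogP, MW) values.
library = [(2.5, 300.0), (2.9, 420.0), (-1.0, 100.0), (8.0, 800.0)]
cells = occupancy(library)
print(cells[(1, 1)])         # 2
print(n_cells - len(cells))  # 13 empty cells
```

No pairwise distance is computed at any point: each compound is placed in its cell independently of all the others.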
Cell-based methods
A key feature of cell-based methods is that they do not require the calculation of pairwise distances $D_{i,j}$ between compounds; the chemical space is defined independently of the molecules that are positioned within it.
• Advantages of cell-based methods
1. Empty cells (voids) or cells with low occupancy can be easily identified.
2. The diversity of different subsets can be easily compared by examining the overlap in the cells occupied by each subset.
Main problem: cell-based methods are restricted to relatively low-dimensional spaces
Optimisation techniques
• DBCS methods prepare a diverse subset by iteratively selecting ONE molecule at a time.
• Optimisation techniques provide an efficient way of sampling large search spaces and selecting diverse subsets
Optimisation techniques
• Example: Monte-Carlo search
1. Random selection of an initial subset and calculation of its diversity D.
2. A new subset is generated from the first by replacing some of its compounds with other randomly selected ones.
3. The diversity of the new subset, $D_{i+1}$, is compared with $D_i$:
if $\Delta D = D_{i+1} - D_i > 0$, the new set is accepted;
if $\Delta D < 0$, the new set is accepted with a probability given by the Metropolis condition, $\exp(-|\Delta D| / kT)$.
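A Monte-Carlo search of this kind might be sketched as follows. The sketch measures diversity as the mean pairwise dissimilarity of the subset, replaces one compound per step, and accepts a worse subset with probability exp(ΔD / kT), which for ΔD < 0 equals exp(−|ΔD| / kT) and is therefore below 1. The step count, kT value, seed and example points are assumptions; the subset size must be at least 2 and smaller than the library.

```python
import math
import random


def mean_pairwise(subset, dist):
    """Mean dissimilarity over all pairs of compounds in the subset."""
    pairs = [(i, j) for k, i in enumerate(subset) for j in subset[k + 1:]]
    return sum(dist[i][j] for i, j in pairs) / len(pairs)


def monte_carlo_subset(dist, size, steps=1000, kT=0.05, seed=0):
    """Return (best subset found, its diversity)."""
    rng = random.Random(seed)
    n = len(dist)
    current = rng.sample(range(n), size)            # 1. random initial subset
    d_current = mean_pairwise(current, dist)
    best, d_best = list(current), d_current
    for _ in range(steps):
        # 2. replace one randomly chosen member with an outside compound
        candidate = list(current)
        outside = [i for i in range(n) if i not in current]
        candidate[rng.randrange(size)] = rng.choice(outside)
        d_new = mean_pairwise(candidate, dist)
        # 3. Metropolis acceptance
        delta = d_new - d_current
        if delta > 0 or rng.random() < math.exp(delta / kT):
            current, d_current = candidate, d_new
            if d_current > d_best:
                best, d_best = list(current), d_current
    return best, d_best


# Hypothetical compounds on a line at 0, 1, 5, 9 and 10.
points = [0, 1, 5, 9, 10]
matrix = [[abs(a - b) for b in points] for a in points]
subset, value = monte_carlo_subset(matrix, size=2, steps=2000, seed=1)
print(sorted(subset), value)
```

A larger kT makes the search accept worse subsets more often, which helps it escape local optima at the cost of slower convergence.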
Scaffolds and Frameworks
Frameworks
Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893
Frameworks
Dissection of a molecule according to Bemis and Murcko. Diazepam contains three
sidechains and one framework with two ring systems and a zero-atom linker.
G. Schneider, P. Schneider, S. Renner, QSAR Comb.Sci. 25, 2006, No.12, 1162 – 1171
Graph Frameworks for Compounds in the CMC Database
(Numbers Indicate Frequency of Occurrence)
Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893
Scaffolds and Frameworks
The Bemis and Murcko algorithm for framework generation:
1) hydrogens are removed,
2) atoms with only one bond are removed successively,
3) the scaffold is obtained,
4) all atom types are set to C and all bond types are set to single bonds, which gives the framework.
Bemis, G.W.; Murcko, M.A. J.Med.Chem 1996, 39, 2887-2893
Unlike the method of Bemis and Murcko, A. Monge proposed distinguishing aromatic and non-aromatic bonds (PhD thesis, Univ. Orléans, 2007)
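The four steps can be sketched on a plain molecular graph, without a cheminformatics toolkit. The input format (a dictionary of atom symbols and a list of (atom, atom, bond order) tuples) and the function name are assumptions made for this illustration; the example molecule is 2-methylpyridine with one explicit hydrogen.

```python
def murcko_framework(atoms, bonds):
    """Return ((scaffold_atoms, scaffold_bonds), (framework_atoms, framework_bonds))."""
    # 1) remove hydrogens
    heavy = {i for i, element in atoms.items() if element != "H"}
    neighbours = {i: set() for i in heavy}
    for i, j, _ in bonds:
        if i in heavy and j in heavy:
            neighbours[i].add(j)
            neighbours[j].add(i)
    # 2) successively remove atoms that have at most one bond left
    terminal = [i for i, nb in neighbours.items() if len(nb) <= 1]
    while terminal:
        i = terminal.pop()
        if i not in neighbours:
            continue  # already removed
        for j in neighbours.pop(i):
            neighbours[j].discard(i)
            if len(neighbours[j]) <= 1:
                terminal.append(j)
    # 3) scaffold: remaining atoms with their original elements and bond orders
    scaffold_atoms = {i: atoms[i] for i in neighbours}
    scaffold_bonds = [(i, j, o) for i, j, o in bonds if i in neighbours and j in neighbours]
    # 4) framework: every atom becomes C, every bond becomes single
    framework_atoms = {i: "C" for i in neighbours}
    framework_bonds = [(i, j, 1) for i, j, _ in scaffold_bonds]
    return (scaffold_atoms, scaffold_bonds), (framework_atoms, framework_bonds)


# 2-methylpyridine: ring atoms 0-5 (atom 0 is N), methyl carbon 6, one hydrogen 7.
atoms = {0: "N", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "C", 7: "H"}
bonds = [(0, 1, 2), (1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 0, 1),
         (1, 6, 1), (6, 7, 1)]
scaffold, framework = murcko_framework(atoms, bonds)
print(scaffold[0])   # the pyridine ring, N retained
print(framework[0])  # six-membered ring with all atoms set to C
```

A purely acyclic molecule is pruned away completely by step 2, so it has an empty scaffold in this sketch.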
Scaffold-Hopping: How Far Can You Jump?
G. Schneider, P. Schneider, S. Renner, QSAR Comb.Sci. 25, 2006, No.12, 1162 – 1171
The Scaffold Tree − Visualization of the Scaffold Universe by
Hierarchical Scaffold Classification
A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H.Waldmann
J. Chem. Inf. Model., 2007, 47 (1), 47-58
Scaffold tree for the results of pyruvate kinase assay. Color intensity represents
the ratio of active and inactive molecules with these scaffolds.
A. Schuffenhauer, P. Ertl, S. Roggo, S. Wetzel, M. A. Koch, and H.Waldmann J. Chem. Inf. Model., 2007, 47 (1), 47-58