Clustering Ambiguity: An Overview
John D. MacCuish
Norah E. MacCuish
3rd Joint Sheffield Conference on Chemoinformatics
April 23, 2004
Outline
• The Problem: Clustering Ambiguity and Chemoinformatics
• Preliminaries:
  – bit strings
  – measures
  – similarity distributions
  – ties in proximities and, more generally, decision ties
• Clustering Algorithms and Decision Ties
• Examples:
  – Taylor-Butina (leader algorithm)
  – K-means and K-modes
  – Jarvis-Patrick
  – RNN hierarchical (Ward's, Complete Link, Group Average)
• Remarks
Clustering Ambiguity Problem
• Where: clustering algorithms that find distinct groups in data.
• However, a quantitative decision process (“idiot proof”) may lead to ambiguous results.
• Symptom: permute the input data → different results.
  – Namely, not “stable” with respect to input order.
• Ambiguity → it is not clear what belongs to what group.
• Distinct from:
  – fingerprint collisions (different compounds → same fingerprints)
  – precision
Clustering Applications and Binary Fingerprints
• Lead selection in HTS data
• Diversity analysis
• Lead hopping
• Compound acquisition decisions
• Etc.
  – Downs, G. M.; Barnard, J. M. Clustering Methods and Their Uses in Computational Chemistry. Reviews in Computational Chemistry, Vol. 18; Lipkowitz, K. B., Boyd, D. B., Eds.; Wiley-VCH: New York, 2002; pp 1-40.
Binary Fingerprints Descriptor
[Figure: a small molecule is encoded as a fixed-length bit string, e.g., 1 0 0 0 1 0 0 0 1 ... 1]
Fixed-length bit strings such as:
• Daylight
• MDL
• BCI
• etc.
Common (Dis)Similarity Coefficients
• Tanimoto
• Euclidean
• Cosine
• Hamman
• Tversky
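These coefficients are easy to state concretely. The following is a minimal Python sketch of a few of them on binary fingerprints, using the standard set-based definitions (a and b are the counts of bits set in each string, c the count of bits in common); the function names and the Tversky weights here are illustrative choices, not a fixed convention:

```python
# Sketch: common (dis)similarity coefficients on binary fingerprints.
# a = bits on in A, b = bits on in B, c = bits on in both.
from math import sqrt

def counts(fp_a, fp_b):
    a = sum(fp_a)
    b = sum(fp_b)
    c = sum(x & y for x, y in zip(fp_a, fp_b))
    return a, b, c

def tanimoto(fp_a, fp_b):
    a, b, c = counts(fp_a, fp_b)
    return c / (a + b - c) if a + b - c else 1.0

def cosine(fp_a, fp_b):            # a.k.a. Ochiai
    a, b, c = counts(fp_a, fp_b)
    return c / sqrt(a * b) if a and b else 0.0

def euclidean(fp_a, fp_b):         # normalized Euclidean distance
    d = sum((x - y) ** 2 for x, y in zip(fp_a, fp_b))
    return sqrt(d / len(fp_a))

def tversky(fp_a, fp_b, alpha=0.9, beta=0.1):
    # asymmetric when alpha != beta: tversky(A, B) != tversky(B, A)
    a, b, c = counts(fp_a, fp_b)
    denom = c + alpha * (a - c) + beta * (b - c)
    return c / denom if denom else 1.0

A = [1, 0, 0, 0, 1]
B = [1, 1, 0, 0, 1]
print(tanimoto(A, B))                # 2/3
print(tversky(A, B), tversky(B, A))  # asymmetric: two different values
```

Note that Tversky with unequal weights is asymmetric, which is exactly the symmetric/asymmetric distinction drawn on the next slide.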
Simple Bit String Similarity Measure Properties
• Symmetric (e.g., Tanimoto): the similarity from A to B is the same as the similarity from B to A.
• Asymmetric (e.g., Tversky): the similarity from A to B is not necessarily the same as the similarity from B to A. Asymmetric Clustering of Compound Data, MacCuish and MacCuish, Chemometrics and Chemoinformatics, ACS Symposium Series, in press.
• Metric (e.g., Euclidean): satisfies the triangle inequality.
• Non-metric (e.g., Soergel): does not satisfy the triangle inequality.
  – Note: the square root of the Soergel does satisfy the triangle inequality for binary bit strings. Gower and Legendre, Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 1986, 3, 5-48.
Tie in Proximity
[Figure: three chemical structures; one structure (or cluster!) is equidistant from the other two, with Euclidean dist = 0.16 to each.]
Are Proximity Ties Common? Example: Binary Fingerprints with the Tanimoto
Here are all 32 bit strings of length 5:
00000, 00001, 00010, 00100, 01000, …, 11111
Here are all possible Tanimoto similarities for distinct bit strings of length 5:
0, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5
• All reduced fractions with denominators of 5 or less
• This is the Farey sequence of order N, where N is 5
• There are just 10 such distinct similarities
• And 496 all-pairs similarities between these strings, given 32 distinct strings
• And the distribution is…
[Figure: frequency histogram of all possible Tanimoto similarities for bit strings of length 5 (values 0 through 4/5); average frequency 49.6.]
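The counts above are small enough to verify by brute force. A short Python sketch, enumerating every pair of distinct length-5 bit strings with exact fractions:

```python
# Brute-force check of the slide's counts: all 32 bit strings of
# length 5 and every pairwise Tanimoto similarity between distinct
# strings, kept as exact fractions.
from collections import Counter
from fractions import Fraction
from itertools import combinations, product

def tanimoto(u, v):
    c = sum(x & y for x, y in zip(u, v))
    denom = sum(u) + sum(v) - c
    return Fraction(c, denom) if denom else Fraction(1)

strings = list(product([0, 1], repeat=5))
sims = Counter(tanimoto(u, v) for u, v in combinations(strings, 2))

print(len(strings))                    # 32 strings
print(sum(sims.values()))              # 496 pairs = C(32, 2)
print(len(sims))                       # 10 distinct similarities
print(sum(sims.values()) / len(sims))  # average frequency 49.6
```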
Finite Number of Proximities
• How many possible Tanimoto similarities are there given N bits in a fixed-length fingerprint?
  ≅ (3/π²) N² + O(N log N)
  – Namely, the sum of the number of reduced fractions with denominators up to N. (Proof of the above expected bound: 1883.)
• How many possible Euclidean similarities? N + 1
• How many possible Cosine similarities? No known closed form in terms of N.
Any number theorists in the house?
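The Farey-sequence count is easy to compute directly for small N. A Python sketch (the function name is ours), counting reduced fractions in [0, 1) with denominator at most N and comparing against the (3/π²)N² asymptotic:

```python
# Sketch: the number of possible Tanimoto values for N-bit fingerprints
# is the number of reduced fractions in [0, 1) with denominator <= N
# (the Farey sequence of order N, excluding 1), ~ (3/pi^2) N^2.
from math import gcd, pi

def farey_count(n):
    # 0/1 plus every reduced fraction p/q with 1 <= p < q <= n
    return 1 + sum(1 for q in range(2, n + 1)
                     for p in range(1, q) if gcd(p, q) == 1)

print(farey_count(5))                          # 10, as on the length-5 slide
print(farey_count(64), 3 / pi ** 2 * 64 ** 2)  # exact count vs. asymptotic
```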
…For Fingerprints of Size 1024
• How many possible Tanimoto similarities? ~329,000
• How many possible Euclidean similarities? 1,025
• How many possible Cosine similarities? In the low millions (empirical estimate)
Exact Discrete Distributions vs. Probabilistic Discrete Distributions
[Figure: four histograms – all possible Tanimoto similarities for bit strings of length 5; Tanimoto similarities for 380 NCI actives with 512-bit Daylight fingerprints; all possible Ochiai (1-Cosine) similarities for bit strings of length 5; and Ochiai similarities for the same 380 NCI actives.]
Clustering and Ties in Proximity
• Measures with small numbers of possible similarities (e.g., Euclidean), or distributions that lead to the same effect (e.g., Tanimoto, Ochiai), are prone to the problem of ties in proximity in clustering. This can affect derived measures as well, such as the square error of Ward's merging criterion. Algorithms for Clustering Data, Jain and Dubes; Godden et al., JCICS, 2001, 40, 163-166; MacCuish et al., JCICS, 2001, 41 (1), 134-146.
• Namely, we are clustering in a space that is a rigid lattice of proximities and/or derived measures rather than a continuum. (Note: for the typical lengths of the binary descriptors of the vendors mentioned, this lattice is far coarser than the lattice that would be created by the typical floating-point machine representation of real numbers.)
In the literature beware of:
“We resolve ties arbitrarily…”
Decision Ties in Clustering Algorithms
• A simple decision tie → a tie in proximity.
• Other decision ties may be algorithm dependent (they can occur even with continuous data).
• In practice most decision ties lead to cluster ambiguity – an inability to discriminate non-disjoint (overlapping) clusters.
• Namely, disjoint clusters don't reflect the amount of ambiguity identified by decision ties, as the resulting non-disjoint clustering suggests.
Algorithms
Taylor-Butina (TB) Leader or Exclusion Cluster Sampling Algorithm
1. Create a thresholded nearest neighbor table.
2. Find true singletons: all those compounds with an empty nearest neighbor list.
3. Find the compound with the largest nearest neighbor list (the representative compound or centrotype). This becomes a group and is excluded from consideration – these compounds are removed from all nearest neighbor lists.
4. Repeat 3 until no compounds exist with a non-empty nearest neighbor list.
   – Taylor, JCICS, 1995, 35, 59-67.
   – Butina, JCICS, 1999, 39, 747-750.
5. Optional:
   1. Assign remaining compounds, false singletons, to the group that contains their nearest neighbor;
   2. Use other criteria to break exclusion region ties;
   3. Use asymmetric measures;
   4. Can be made to return overlapping clusters.
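Steps 1-4 above can be sketched in a few lines of Python. This is an illustrative toy, not the published implementation: the measure (Tanimoto), the threshold, and the helper names are all assumptions, and note the arbitrary tie break in step 3 – exactly the kind of decision the talk is about:

```python
# Toy sketch of the Taylor-Butina leader algorithm (steps 1-4).
def tanimoto(u, v):
    c = sum(x & y for x, y in zip(u, v))
    denom = sum(u) + sum(v) - c
    return c / denom if denom else 1.0

def taylor_butina(fps, threshold=0.8):
    n = len(fps)
    # 1. thresholded nearest neighbor table
    nn = {i: {j for j in range(n)
              if j != i and tanimoto(fps[i], fps[j]) >= threshold}
          for i in range(n)}
    # 2. true singletons: empty nearest neighbor lists
    singletons = [i for i in nn if not nn[i]]
    clusters = []
    active = {i for i in nn if nn[i]}
    # 3.-4. repeatedly take the largest neighbor list as a cluster
    while any(nn[i] for i in active):
        # NOTE: max() silently breaks ties -- an arbitrary decision
        rep = max(active, key=lambda i: len(nn[i]))
        members = {rep} | nn[rep]
        clusters.append(members)
        active -= members
        for i in active:
            nn[i] -= members
    # leftovers are false singletons (optional step 5.1 would assign them)
    return clusters, singletons, sorted(active)

fps = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
print(taylor_butina(fps, threshold=0.6))   # ([{0, 1, 2}], [3], [])
```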
Tie Cases in TB Algorithm
[Figure: exclusion regions whose diameter is set by the threshold value, showing a representative compound, a true singleton, an exclusion region tie, and false singleton ties ("which region?").]
• An exclusion region tie may form an ambiguous clustering if the sum of minimum distances is also tied.
• A false singleton tie leaves the regions unambiguous; there is no need to sum minimum distances.
K-Means and K-Modes (and overlapping versions)
• Continuous K-means with fingerprints (convert binary to real → 0.0s, 1.0s):
  1. Choose k seed centroids from the data set (e.g., quasi-randomly via a 1D Halton sequence).
  2. Find nearest neighbors to the centroids -- TIES HERE -- overlapping.
  3. Recompute new centroids.
  4. Repeat 2 until no neighbors change groups or some iteration threshold is reached.
• K-modes with fingerprints (fingerprints remain binary):
  1. Choose k seed modes from the data set (or the "frequency" of categories method, etc.).
  2. Find nearest neighbors to the modes (Euclidean, Tanimoto, etc.) -- TIES HERE.
  3. Recompute new modes (simple matching coefficient).
  4. Same as 4 in continuous K-means.
"Continuous K-means", Los Alamos Science, Faber, Kelly, White, 1994
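The "TIES HERE" point in step 2 is easy to make concrete. A minimal Python sketch of one K-means assignment pass over fingerprints cast to floats, flagging points that are equidistant from two or more centroids (the names and the tolerance are illustrative):

```python
# Sketch: one K-means assignment pass that records assignment ties.
from math import sqrt, isclose

def euclidean(u, v):
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def assign_with_ties(points, centroids):
    assignments, ties = [], []
    for i, p in enumerate(points):
        dists = [euclidean(p, c) for c in centroids]
        best = min(dists)
        nearest = [k for k, d in enumerate(dists) if isclose(d, best)]
        assignments.append(nearest[0])      # arbitrary tie break!
        if len(nearest) > 1:
            ties.append((i, nearest))       # point i is ambiguous
    return assignments, ties

points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # last point is equidistant
centroids = [[1.0, 0.0], [0.0, 1.0]]
assignments, ties = assign_with_ties(points, centroids)
print(ties)   # [(2, [0, 1])]
```

An overlapping variant would assign a tied point to every centroid in `nearest` instead of only the first.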
Jarvis-Patrick
• Two common versions:
  1. Kmin:
     • Fixed-length, k, NN list -- TIES HERE -- kth neighbor tied.
     • If two compounds' NN lists have j neighbors in common → those compounds are in the same group.
  2. Pmin:
     • Fixed-length, k, NN list -- TIES HERE -- kth neighbor tied.
     • If two compounds' NN lists have a percentage, p, of neighbors in common → those compounds are in the same group.
"Improvements to Daylight Clustering", Delaney, Bradshaw, MUG'04
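A toy Python sketch of the Kmin form (the helper names are ours; ties at the kth neighbor are broken by index, which is the arbitrary decision at issue; whether the two compounds must also appear in each other's lists varies by implementation and is omitted here):

```python
# Sketch of Jarvis-Patrick (Kmin form): fixed-length k-NN lists, then
# connected components over pairs whose lists share >= j neighbors.
def jarvis_patrick(dist, n, k, j):
    # k-NN lists; ties at the kth neighbor broken by index (arbitrary!)
    nn = {i: set(sorted((x for x in range(n) if x != i),
                        key=lambda x: (dist(i, x), x))[:k])
          for i in range(n)}
    # union-find over pairs sharing at least j of their k neighbors
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a in range(n):
        for b in range(a + 1, n):
            if len(nn[a] & nn[b]) >= j:
                parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), set()).add(i)
    return sorted(groups.values(), key=min)

pts = [0, 1, 2, 10, 11, 12]
print(jarvis_patrick(lambda a, b: abs(pts[a] - pts[b]), len(pts), k=2, j=1))
# [{0, 1, 2}, {3, 4, 5}]
```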
Reciprocal Nearest Neighbors (RNN) Hierarchical Clustering
• Ward's, Group Average, and Complete Link clustering algorithms can use RNN as a fast method of obtaining the hierarchy: O(N²) vs. O(N³). Murtagh, A survey of recent advances in hierarchical clustering algorithms, Computer Journal, 1983, 26, 354-359.
• The RNN form of these algorithms contains specific decision ties unique to this method.
• The resulting ambiguity can be quantified by enumerating decision tie events.
RNN Algorithm Decision Ties
1. Form a nearest neighbor (NN) chain until an RNN pair (each is the NN of the other) is found.
   • What if there is more than one NN? -- A ties-in-proximity (or merging criterion) problem, increasing the ambiguity.
   • What if in turn there is more than one RNN? Ties in proximity and an algorithm decision tie problem: more ambiguity.
2. Use a merge criterion other than the criterion used in the algorithm to choose the RNN in this case -- decreasing the ambiguity.
   • What if the result of this new criterion is also tied? Another algorithm decision tie, increasing the ambiguity.
For hierarchical algorithms that return overlapping clusterings based solely on ties in proximity, see Nicolaou, MacCuish, and Tamura, "A new multi-domain clustering algorithm for lead discovery that exploits ties in proximity", Proceedings of the 13th Euro-QSAR, Prous Science, 2000, pp. 486-495.
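Step 1 can be sketched as follows: a toy Python version of the NN-chain search that counts ties in proximity as it goes (it stops at the first RNN pair rather than merging and continuing the chain, and it breaks ties by index):

```python
# Sketch: follow a nearest neighbor chain from the first item until a
# reciprocal pair is found, counting tie events along the way.
def rnn_chain(items, dist):
    ties = 0
    chain = [items[0]]
    while True:
        cur = chain[-1]
        cands = [x for x in items if x != cur]
        best = min(dist(cur, x) for x in cands)
        nns = [x for x in cands if dist(cur, x) == best]
        if len(nns) > 1:
            ties += 1                 # tie in proximity: extra ambiguity
        nn = min(nns)                 # arbitrary tie break, by index
        if len(chain) >= 2 and nn == chain[-2]:
            return (chain[-2], cur), ties   # reciprocal nearest neighbors
        chain.append(nn)

pts = [0.0, 1.0, 2.0]   # the middle point is equidistant from both ends
print(rnn_chain([0, 1, 2], lambda a, b: abs(pts[a] - pts[b])))  # ((0, 1), 1)
```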
How can we address this problem?
Levels of Ambiguity
[Figure: two groups with considerable overlap; or smaller, more distinct groups; or the difficulty of making sense of large numbers of overlapping clusters where the intersections are large. Three panels show clusterings obtained by combining all decision ties:
1. distinct (disjoint) clusters, just one clustering of many possible;
2. overlapping clusters, but understandable;
3. overlapping clusters, but difficult to understand;
running from fewer decision ties (less ambiguity) to more decision ties (more ambiguity).]
An “Ambiguity” Index Defined with the TB Algorithm
• The difference between the disjoint and non-disjoint results of Taylor's algorithm can give us a sense of the ambiguity inherent in clustering fingerprints at a given Tanimoto or Tversky threshold.
• Many simple indices can be defined. We use an index that reflects the number of shared compounds in the non-disjoint clustering.
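One such index can be sketched in Python. The talk does not give its exact formula, so this is an illustrative stand-in: the fraction of compounds that appear in more than one cluster of the non-disjoint (overlapping) clustering:

```python
# Sketch: an illustrative ambiguity index -- the fraction of compounds
# that are shared between clusters in an overlapping clustering.
def ambiguity_index(clusters):
    seen, shared = set(), set()
    for cluster in clusters:
        shared |= cluster & seen     # compounds seen in an earlier cluster
        seen |= cluster
    return len(shared) / len(seen) if seen else 0.0

# 1 of 5 compounds sits in two clusters -> index 0.2
print(ambiguity_index([{0, 1, 2}, {2, 3, 4}]))   # 0.2
```

A disjoint clustering scores 0.0; the index rises as more compounds are shared among clusters.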
[Figure: Clustering Ambiguity with Taylor's Algorithm, 380 NCI-HIV actives. Ambiguity index (0.0-0.30) vs. Tanimoto threshold (0.70-0.95) for MACCS 166-bit, MACCS 320-bit, and Daylight 512-bit fingerprints; approximately 10% of compounds are shared among clusters.]
Jarvis-Patrick Results Summary
• The number of proximity ties is significant in both versions when reasonable values for k, j, and p are chosen -- on par with that of Taylor's and the RNN algorithms.
• Kmin typically has more ties in general, though it is hard to make a one-to-one comparison with Pmin.
K-means, K-modes Results Summary
• K-means:
  – Typically a small number of ties, depending on K, on just the first iteration.
  – Rarely are there ties after the first iteration.
  – Very little overlap when the algorithm converges.
  – Ambiguity confounded by the local optima problem.
• K-modes with the frequency method:
  – Fewer ties overall than even K-means.
  – But ties occur more frequently in subsequent iterations.
  – Again, very little overlap when the algorithm converges.
  – Ambiguity confounded by the local optima problem.
Level Selection and Ambiguity in Hierarchical Clustering
[Figure: Kelley level-selection values (the level-selection heuristic, suggesting a "best" level) and total ambiguity in the form of ties (the ambiguity index) plotted against the number of clusters, with several candidate levels marked "Ambiguity?".]
Ambiguity Index for Hierarchical RNN-class Algorithms
• Count the number of decision ties as a rough estimate of the ambiguity.
• Use this in conjunction with level selection techniques (e.g., Kelley's), where the objective is to find the best non-trivial level selection value with the lowest ambiguity index.
  – Kelley, L. A.; Gardner, S. P.; Sutcliffe, M. J. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally-related subfamilies. Protein Eng. 1996, 9, 1063-1065.
  – Wild, D. J.; Blankley, C. J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping Using Ward's Clustering. J. Chem. Inf. Comput. Sci. 2000, 40, 155-162.
Two Ward's Clusterings with Euclidean Distance
[Figure: two dendrograms of the same data (482 NCI-HIV actives) entered in a different order; the "best" Kelley cuts give 63 and 136 clusters. Similar groups at the top of the dendrogram mask very different groups below.]
Complete Link -- Direct Understanding of Level
• Using Tanimoto as the measure, we can inspect ambiguity and the similarity level or threshold directly.
• Namely, check ambiguity at the various Tanimoto similarity thresholds (levels) common in the field: 0.7, 0.85.
[Figure: two complete link dendrograms with the Soergel measure, same data (482 NCI-HIV actives) entered in a different order; the "best" Kelley cuts give 13 and 24 clusters. Cutting each at 0.7 similarity gives 255 clusters (not all the same) -- the same problem.]
Conclusions
• DON'T PANIC – clustering is good!
  – Ambiguity is important in terms of choosing K clusters.
  – Combining level selection information with ambiguity information can help to make sense of results.
  – Modifying algorithms to use a secondary grouping criterion when faced with decision ties can help reduce the ambiguity, often providing tighter, more useful clusterings -- data, measure, and algorithm dependent, however!
• BUT BE CAREFUL!
  – Is it important for your application?
  – Determining ambiguity or adding secondary grouping criteria can have significant computational cost.
  – In general, the choice of bit string length, measures, and algorithms can all lead to differing amounts of ambiguity.
Future Work
• Further work on ambiguity indices
• Ideally: (FPLength) × (Measures) × (Algorithms) × (FindK) × (DataSetSize) × (DataSetDiversity)
• Explore other algorithms
Acknowledgements
• John Bradshaw, Daylight, CIS
• John Blankley, Pfizer (retired)
• John Barnard, BCI
• David Wild, Wild Ideas
• OpenEye Scientific Software, Inc.
This talk can be found at http://www.mesaac.com