Clustering Ambiguity: An Overview
John D. MacCuish
Norah E. MacCuish
3rd Joint Sheffield Conference on Chemoinformatics
April 23, 2004

Outline
• The Problem: Clustering Ambiguity and Chemoinformatics
• Preliminaries:
  – Bit strings
  – Measures
  – Similarity distributions
  – Ties in Proximities and, more generally, Decision Ties
• Clustering Algorithms and Decision Ties
• Examples:
  – Taylor-Butina (leader algorithm)
  – K-means and K-modes
  – Jarvis-Patrick, RNN
  – Hierarchical (Ward's, Complete Link, Group Average)
• Remarks

Clustering Ambiguity Problem
• Where: clustering algorithms that find distinct groups in data.
• However, a quantitative decision process (“Idiot Proof”) may lead to ambiguous results.
• Symptom: permute input data → different results.
  – Namely, not “stable” with respect to input order.
• Ambiguity → it is not clear what belongs to what group
• Distinct from:
  – Fingerprint collisions (different compounds → same fingerprints)
  – Precision

Clustering Applications and Binary Fingerprints
• Lead selection in HTS data
• Diversity analysis
• Lead hopping
• Compound acquisition decisions
• Etc.
  – Downs, G. M.; Barnard, J. M. Clustering Methods and Their Uses in Computational Chemistry, Reviews in Computational Chemistry; Vol. 18, Lipkowitz, K. B. and Boyd, D. B., Eds; Wiley-VCH: New York, 2002, 1-40.

Binary Fingerprints
[Figure: structural descriptors (CH3, Cl, NH2, ...) each encoded as a bit of the fingerprint]
Fixed-length bit strings such as Daylight, MDL, BCI, etc.

Common (Dis)Similarity Coefficients
• Tanimoto
• Euclidean
• Cosine
• Hamann
• Tversky

Simple Bit String Similarity Measure Properties
• Symmetric (e.g., Tanimoto): similarity from A to B is the same as the similarity from B to A.
• Asymmetric (e.g., Tversky): similarity from A to B is not necessarily the same as the similarity from B to A.
  – Clustering Compound Data: Asymmetric Clustering of Compound Data, MacCuish and MacCuish, Chemometrics and Chemoinformatics, ACS Symposium Series, in press
• Metric (e.g., Euclidean): satisfies the triangle inequality
• Non-metric (e.g., Soergel): does not satisfy the triangle inequality
  – Note, the square root of the Soergel does satisfy the triangle inequality for binary bit strings. Gower and Legendre, Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 1986, 3, 5-48.

Tie in Proximity
[Figure: three chemical structures; one structure lies at Euclidean distance 0.16 from each of the other two]
One structure (or cluster!) equidistant from two others.

Are Proximity Ties Common?
Example: Binary Fingerprints with the Tanimoto
Here are all bit strings of length 5: 00000, 00001, 00010, 00100, 01000, ..., 11111 (32 strings)
Here are all possible Tanimoto similarities for distinct bit strings of length 5: 0, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5
• All reduced fractions, denominators of 5 or less
• This is the Farey sequence of order N, where N is 5 (without the endpoint 1, since the strings are distinct)
• There are just 10 such distinct similarities
• And 496 all-pairs similarities between these strings, given 32 distinct strings.
And the distribution is…
[Figure: histogram "All possible Tanimoto Similarities for Bit Strings of Length 5"; x-axis: Tanimoto Similarity; y-axis: Frequency of Similarities; bars labeled by similarity value; average frequency 49.6]

Finite Number of Proximities
• How many possible Tanimoto similarities are there given N bits in a fixed-length fingerprint? $\approx \frac{3}{\pi^2} N^2 + O(N \log N)$
  – Namely, the sum of the number of reduced fractions with denominators up to N. (Proof of above expected bound, 1883)
• How many possible Euclidean similarities? $= N + 1$
• How many possible Cosine similarities? No known closed form in terms of N
Any Number Theorists in the house?
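
The length-5 counts above can be checked by brute force. The following is a minimal sketch, not the authors' code: the function names `tanimoto` and `tanimoto_spectrum` are hypothetical, bit strings are stored as Python integers, and exact fractions are used so that equal similarities compare equal.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def tanimoto(a: int, b: int) -> Fraction:
    """Exact Tanimoto similarity of two distinct bit strings stored as ints."""
    common = bin(a & b).count("1")   # bits set in both strings
    union = bin(a | b).count("1")    # bits set in either string
    return Fraction(common, union)   # union >= 1 whenever a != b


def tanimoto_spectrum(n_bits: int) -> Counter:
    """Frequency of every similarity value over all pairs of distinct strings."""
    strings = range(2 ** n_bits)
    return Counter(tanimoto(a, b) for a, b in combinations(strings, 2))


spectrum = tanimoto_spectrum(5)
print(len(spectrum), "distinct similarities over", sum(spectrum.values()), "pairs")
for value, count in sorted(spectrum.items()):
    print(value, count)
```

Exhaustive enumeration is only practical for very short strings (the number of pairs grows as 4^N), so for realistic fingerprint lengths the asymptotic count is the useful quantity.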
…For Fingerprints of Size 1024
• How many possible Tanimoto similarities? ~329,000
• How many Euclidean similarities? 1,025
• How many Cosine similarities? In the low millions (empirical estimate)

Exact Discrete Distributions vs Probabilistic Discrete Distributions
[Figure: four histograms of Frequency of Similarities. Top row, x-axis Tanimoto Similarity: "All possible Tanimoto Similarities for Bit Strings of Length 5" and "380 NCI actives: Daylight Fingerprints w/ 512 bits". Bottom row, x-axis Ochiai (1-Cosine) Similarity: "All possible Ochiai Similarities for Bit Strings of Length 5" and "380 NCI actives: Daylight Fingerprints w/ 512 bits".]

Clustering and Ties in Proximity
• Measures with small numbers of possible similarities (e.g., Euclidean), or distributions that lead to this same effect (e.g., Tanimoto, Ochiai), are prone to the problem of ties in proximity in clustering. This can affect derived measures as well, such as the square error of Ward's merging criterion.
  – Algorithms for Clustering Data, Jain and Dubes. Godden, et al., JCICS, 2000, 40, 163-166, and MacCuish, et al., JCICS, 2001, 41 (1), 134-146.
• …Namely, we are clustering in a space that is a rigid lattice of proximities and/or derived measures rather than a continuum. (Note: typically, for the lengths of the binary descriptors of the vendors mentioned, this lattice is far coarser than the lattice that would be created by the typical floating point machine representation of real numbers.)
In the literature beware of: “We resolve ties arbitrarily…”

Decision Ties in Clustering Algorithms
• A simple decision tie → tie in proximity
• Other decision ties may be algorithm dependent (can occur even with continuous data).
• In practice most decision ties lead to cluster ambiguity: an inability to discriminate non-disjoint (overlapping) clusters.
• Namely, a disjoint clustering does not reflect the amount of ambiguity that the decision ties identify and that the corresponding non-disjoint clustering makes visible.

Algorithms

Taylor-Butina (TB) Leader or Exclusion Cluster Sampling Algorithm
1. Create thresholded nearest neighbor table.
2. Find true singletons: all those compounds with an empty nearest neighbor list.
3. Find the compound with the largest nearest neighbor list (representative compound or centrotype). This compound and its neighbors become a group and are excluded from consideration; these compounds are removed from all nearest neighbor lists.
4. Repeat 3 until no compounds exist with a non-empty nearest neighbor list.
5. Optional:
   1. Assign remaining compounds, false singletons, to the group that contains their nearest neighbor;
   2. Use other criterion to break exclusion region ties;
   3. Use asymmetric measures;
   4. Can be made to return overlapping clusters.
  – Taylor, JCICS, 1995, 35, 59-67
  – Butina, JCICS, 1999, 39, 747-750.

Tie Cases in TB Algorithm
[Figure: exclusion regions, with diameter set by the threshold value, illustrating the tie cases. Labels: Representative Compound; Exclusion Region Tie; False Singleton Tie; False Singleton, Which Region?; True Singleton. Annotations: "May form ambiguous clustering if sum of minimum distances is also tied"; "False singleton tie, but regions not ambiguous, no need to sum minimum distances".]

K-Means and K-Modes and overlapping versions
• Continuous K-means with fingerprints (convert binary to real → 0.0s, 1.0s)
  1. Choose k seed centroids from data set (e.g., quasi-randomly via 1D Halton sequence)
  2. Find nearest neighbors to the centroids -- TIES HERE -- Overlapping
  3. Recompute new centroids
  4. Repeat 2 until no neighbors change groups or an iteration threshold is reached.
• K-modes with fingerprints (fingerprints remain binary)
  1. Choose k seed modes from data set (or “frequency” of categories method, etc.)
  2. Find nearest neighbors to the modes (Euclidean, Tanimoto, etc.) -- TIES HERE --
  3. Recompute new modes (simple matching coefficient)
  4. Same as 4 in continuous K-means
  – “Continuous K-means”, Los Alamos Science, Faber, Kelly, White, 1994

Jarvis-Patrick
• Two common versions:
  1. Kmin: Fixed length, k, NN list -- TIES HERE -- kth neighbor tied. If two compound NN lists have j neighbors in common → those compounds are in the same group.
  2. Pmin: Fixed length, k, NN list -- TIES HERE -- kth neighbor tied. If two compound NN lists have a percentage, p, of neighbors in common → those compounds are in the same group.
  – “Improvements to Daylight Clustering”, Delaney, Bradshaw, MUG’04

Reciprocal Nearest Neighbors (RNN) Hierarchical Clustering
• Ward's, Group Average, and Complete Link clustering algorithms can use RNN as a fast method of obtaining the hierarchy: $O(N^2)$ vs. $O(N^3)$.
  – Murtagh, A survey on recent advances in hierarchical clustering algorithms, Computer Journal, 26, 354-359, 1983
• The RNN form of these algorithms contains specific decision ties unique to this method.
• The resulting ambiguity can be quantified by enumerating decision tie events.

RNN Algorithm Decision Ties
1. Form a nearest neighbor (NN) chain until an RNN (each is the NN of the other) is found.
   – What if there is more than one NN? Ties in Proximity (or merging criterion) problem, increasing the ambiguity.
   – What if in turn there is more than one RNN? Ties in Proximity and Algorithm Decision Tie problem, more ambiguity.
2. Use another merge criterion than the criterion used in the algorithm to choose the RNN in this case -- decrease the ambiguity.
   – What if the result of this new criterion is also tied? Another Algorithm Decision Tie, increasing the ambiguity.
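
The NN-chain step and its tie events can be illustrated with a small sketch. This is not the authors' implementation; it rests on several assumptions chosen for brevity: complete link as the merging criterion, Hamming distance on bit strings stored as integers (equal to the squared Euclidean distance on binary vectors), ties broken arbitrarily by lowest cluster id, and a tie counted whenever the cluster at the tip of the chain has more than one nearest neighbor. The names `hamming` and `rnn_complete_link` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Hamming distance between two bit strings stored as ints."""
    return bin(a ^ b).count("1")


def rnn_complete_link(fingerprints):
    """Complete-link clustering via an NN chain.

    Returns (merges, ties): merges is a list of (sorted members, height);
    ties counts the steps at which the nearest neighbor was not unique.
    """
    clusters = {i: [fp] for i, fp in enumerate(fingerprints)}
    next_id = len(fingerprints)
    merges, ties, chain = [], 0, []

    def dist(i, j):
        # Complete link: largest pairwise distance between the two clusters.
        return max(hamming(a, b) for a in clusters[i] for b in clusters[j])

    while len(clusters) > 1:
        if not chain:
            chain.append(min(clusters))  # restart the chain at the lowest id
        tip = chain[-1]
        d = {c: dist(tip, c) for c in clusters if c != tip}
        best = min(d.values())
        candidates = [c for c in d if d[c] == best]
        if len(candidates) > 1:
            ties += 1  # decision tie: more than one nearest neighbor
        prev = chain[-2] if len(chain) > 1 else None
        if prev in candidates:
            # tip and prev are reciprocal nearest neighbors: merge them.
            chain.pop()
            chain.pop()
            members = sorted(clusters.pop(prev) + clusters.pop(tip))
            clusters[next_id] = members
            merges.append((members, best))
            next_id += 1
        else:
            chain.append(min(candidates))  # arbitrary tie-break: lowest id
    return merges, ties


data = [0b0000, 0b0001, 0b0011]
merges_a, ties_a = rnn_complete_link(data)
merges_b, ties_b = rnn_complete_link(list(reversed(data)))
print(merges_a[0], ties_a)
print(merges_b[0], ties_b)
```

In this three-string example, 0001 is equidistant from 0000 and 0011, so the first merge differs between the two input orders even though the data are identical. A secondary merge criterion, as in step 2, would replace the lowest-id tie-break.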
For hierarchical algorithms that return overlapping clusterings based solely on ties in proximity, see Nicolaou, MacCuish, and Tamura, “A new multi-domain clustering algorithm for lead discovery that exploits ties in proximity”, Proceedings from the 13th Euro-QSAR, Prous Science, 2000, pp. 486-495.

How can we address this problem?

Levels of Ambiguity
• Two Groups with Considerable Overlap
• ...Or, Smaller, More Distinct Groups
• …Or, the difficulty of making sense of large numbers of overlapping clusters where the intersections are large.

[Figure: distinct clusters compared with overlapping clusters, the latter a result of combining all decision ties]
1. Distinct (disjoint) clusters: just one clustering of many possible
2. Overlapping clusters, but understandable: fewer decision ties, less ambiguity
3. Overlapping clusters, but difficult to understand: more decision ties, more ambiguity

An “Ambiguity” Index Defined with TB Algorithm
• The difference between the disjoint and non-disjoint results of Taylor’s algorithm can give us a sense of the ambiguity inherent in clustering fingerprints at a given Tanimoto or Tversky threshold.
• Many simple indices can be defined. We use an index that reflects the number of shared compounds in the non-disjoint clustering.
[Figure: "Clustering Ambiguity with Taylor's Algorithm, 380 NCI-HIV Actives"; x-axis: Tanimoto Threshold (0.70 to 0.95); y-axis: Increasing Ambiguity Index (0.0 to 0.30); series: MACCS 166 Bits, MACCS 320 Bits, Daylight 512 Bits; annotation: approx. 10% of compounds shared among clusters]

Jarvis-Patrick Results Summary
• The number of proximity ties is significant in both algorithms when reasonable values for k, j, and p are chosen -- on par with that of Taylor’s and RNN algorithms.
• Kmin typically has more ties in general, though it is hard to make a one-to-one comparison with Pmin.

K-means, K-modes Results Summary
• K-means:
  – Typically a small number of ties, depending on K, on just the first iteration.
  – Rarely are there ties after the first iteration.
  – Very little overlap when the algorithm converges.
  – Ambiguity confounded by local optima problem.
• K-modes with frequency method:
  – Fewer ties overall than even K-means
  – But -- ties occur more frequently in subsequent iterations
  – Again, very little overlap when the algorithm converges.
  – Ambiguity confounded by local optima problem.

Level Selection and Ambiguity in Hierarchical Clustering
[Figure: two panels plotted against Number of clusters. "Level Selection Heuristic": Kelley Level Selection Values, annotated "Best Level?". "Ambiguity Index": Total Ambiguity in the Form of Ties, annotated "Ambiguity?" at three candidate levels.]

Ambiguity Index for Hierarchical RNN class Algorithms
• Count the number of decision ties as a rough estimate of the ambiguity.
• Use this in conjunction with level selection techniques (e.g., Kelley’s), where the objective is to find the best non-trivial level selection value with the lowest ambiguity index.
  – Kelley, L. A.; Gardner, S. P.; Sutcliffe, M. J. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally-related subfamilies. Protein Eng. 1996, 9, 1063-1065.
  – Wild, D. J.; Blankley, C. J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping Using Ward’s Clustering, J. Chem. Inf. Comput. Sci. 2000, 40, 155-162.

Two Ward's Clusterings with Euclidean distance: same data (482 NCI-HIV) -- entered in a different order
[Figure: two dendrograms; the “Best” Kelley cuts give 63 clusters and 136 clusters]
Similar groups at the top of the dendrogram mask very different groups below

Complete Link -- Direct Understanding of Level
• Using Tanimoto as the measure we can inspect ambiguity and the similarity level or threshold directly.
• Namely, check ambiguity at various Tanimoto similarity thresholds (levels) common in the field: 0.7, 0.85

Two Complete Link Clusterings with Soergel measure: same data (482 NCI-HIV) -- entered in a different order
[Figure: two dendrograms with the Soergel measure on the vertical axis (0.0 to 1.0); the “Best” Kelley cuts give 13 clusters and 24 clusters; at 0.7 similarity each gives 255 clusters (not all the same)]
Same problem …

Conclusions
• DON’T PANIC
  – Clustering is Good!
  – Ambiguity is important in terms of choosing K clusters
  – Combining Level Selection information with ambiguity information can help to make sense of results.
  – Modifying algorithms to use a secondary grouping criterion when faced with decision ties can help reduce the ambiguity, often providing tighter, more useful clusterings -- data, measure, and algorithm dependent however!
• BUT BE CAREFUL!
  – Is it important for your application??
  – Determining ambiguity or adding secondary grouping criteria can have significant computational cost.
  – In general, the choice of bit string length, measures, and algorithms can all lead to differing amounts of ambiguity.

Future Work
• Further work on Ambiguity Indices
• Ideally (FPLength)X(Measures)X(Algorithms)X(FindK)X(DataSetSize)X(DataSetDiversity)
• Explore other algorithms

Acknowledgements
• John Bradshaw, Daylight CIS
• John Blankley, Pfizer (Retired)
• John Barnard, BCI
• David Wild, Wild Ideas
• OpenEye Scientific Software, Inc.
• This talk can be found at http://www.mesaac.com