JKlustor

clustering chemical libraries
presented by …
maintained by Miklós Vargyas
Last update: 25 March 2010
Chemical clustering by similarity and structure
Description of the product
JKlustor performs similarity and structure based clustering of
compound libraries and focused sets in both hierarchical and
non-hierarchical fashion.
Availability
• part of JChem
• IJC (parts)
• server version (accessible via API)
• batch application programs
• HTML user interface
• one desktop application with GUI
• GUI is available as an applet
Summary of key features
Wide range of methods
• Unsupervised, agglomerative clustering
• Hierarchical and non-hierarchical methods
• Similarity based and structure based techniques
Flexible search options
• Tanimoto and Euclidean metrics, weighting
• Maximum common substructure identification
• chemical property matching including atom type, bond type,
hybridization, charge
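To make the two metrics concrete, here is a minimal sketch in plain Python. It is not JKlustor's API: the function names, the representation of a fingerprint as a set of on-bit indices, and the optional weight vector are assumptions made for illustration.

```python
import math

def tanimoto(a, b):
    """Tanimoto coefficient of two binary fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 1.0  # convention assumed here: two empty fingerprints count as identical
    return len(a & b) / union

def weighted_euclidean(x, y, weights=None):
    """Euclidean distance between two numeric descriptor vectors, with optional per-dimension weights."""
    if weights is None:
        weights = [1.0] * len(x)
    return math.sqrt(sum(w * (p - q) ** 2 for w, p, q in zip(weights, x, y)))

fp1 = {1, 4, 7, 9}
fp2 = {1, 4, 8}
print(tanimoto(fp1, fp2))                  # 2 shared bits / 5 distinct bits = 0.4
print(weighted_euclidean([0, 3], [4, 0]))  # 5.0
```

Tanimoto is a similarity (1 means identical), Euclidean is a distance (0 means identical); a clustering method needs to know which one it is given.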
Interactive display
• interactive hierarchy browser (dendrogram viewer)
• SAR-table
• R-table
Efficient
• running time of the tools scales between linear and quadratic (with respect to number of input structures)
Benefits
Versatile
• Choose the most appropriate method for the clustering problem
• Combine methods to achieve best results
• Use your trusted molecular descriptors in similarity calculation
• Easy integration in corporate discovery pipelines
• Cluster chemical files directly, with no need to import structures into a database
Intuitive
• Cluster formation is self-explanatory
Similarity based clustering
Hierarchical
• Ward
Non-hierarchical
• Sphere exclusion
• k-means
• Jarvis-Patrick
Ward Clustering Features
• Ward's minimum variance method results in tight,
well separated clusters
• Murtagh's reciprocal nearest neighbor (RNN)
algorithm to speed it up
• quadratic scaling of running time (with respect to
number of input structures)
• memory consumption scales linearly
• best used with smaller sets (like focused libraries),
copes with < 100K structures
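The minimum variance criterion can be sketched as follows. This is an illustrative, naive version in plain Python: it searches all cluster pairs at every step instead of using Murtagh's RNN algorithm, so it is far slower than the tool described here. The function names and the sample points are hypothetical.

```python
def ward_cost(size_a, centroid_a, size_b, centroid_b):
    """Increase in total within-cluster sum of squares caused by merging clusters a and b."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * sq_dist

def ward_cluster(points, n_clusters):
    """Agglomerative clustering: repeatedly merge the pair of clusters with the lowest Ward cost."""
    # Each cluster is stored as (member indices, centroid).
    clusters = [([i], list(p)) for i, p in enumerate(points)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost(len(clusters[i][0]), clusters[i][1],
                                 len(clusters[j][0]), clusters[j][1])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        members_i, cen_i = clusters[i]
        members_j, cen_j = clusters[j]
        n_i, n_j = len(members_i), len(members_j)
        # The centroid of the merged cluster is the size-weighted mean of the two centroids.
        merged = [(n_i * p + n_j * q) / (n_i + n_j) for p, q in zip(cen_i, cen_j)]
        clusters[i] = (members_i + members_j, merged)
        del clusters[j]
    return [sorted(members) for members, _ in clusters]

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
print(ward_cluster(points, 2))  # [[0, 1, 4], [2, 3]]
```

Recording the sequence of merges instead of stopping at a fixed cluster count would give the hierarchy that a dendrogram viewer displays.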
Sphere Exclusion Clustering Features
• based on fingerprints and/or other numerical data
• running time linear with respect to number of input structures
• memory scales sub-linearly
• can easily cope with 1Ms of structures
• suitable for diverse subset selection
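A minimal sketch of the sphere exclusion idea, assuming binary fingerprints stored as sets of on-bit indices and a Tanimoto similarity threshold. The function names and the rule for picking the next centre (simply the first unassigned structure) are assumptions for illustration; the actual tool may choose centres differently.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def sphere_exclusion(fingerprints, threshold):
    """Return a list of (centre index, member indices) pairs.

    The first unassigned structure becomes a centre; every unassigned structure
    at least `threshold` similar to it joins its sphere and is excluded from
    further consideration.
    """
    unassigned = list(range(len(fingerprints)))
    clusters = []
    while unassigned:
        centre = unassigned[0]
        members = [centre] + [
            i for i in unassigned[1:]
            if tanimoto(fingerprints[centre], fingerprints[i]) >= threshold
        ]
        clusters.append((centre, members))
        taken = set(members)
        unassigned = [i for i in unassigned if i not in taken]
    return clusters

fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {1, 2, 3, 4, 5}, {10, 11, 13}]
print(sphere_exclusion(fps, 0.5))  # [(0, [0, 1, 3]), (2, [2, 4])]
```

The centres alone (here structures 0 and 2) form a diverse subset: no two of them are more similar than the threshold.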
k-means Clustering Features
• based on fingerprints and/or other numerical data
• minimises variance within each cluster
• number of clusters can directly be controlled
• finds the centre of natural clusters in the input data
• running time scales exponentially with respect to number of
input structures
• can cope with <100Ks of structures
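For readers unfamiliar with k-means, a minimal sketch of the standard Lloyd iteration is shown here. It is a textbook version in plain Python, not the tool's implementation; the deterministic initialisation (the first k points become the initial centres) and the sample data are assumptions chosen to keep the example reproducible.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm: alternate between assigning points to the nearest
    centre and moving each centre to the mean of its members."""
    centres = [list(p) for p in points[:k]]  # assumption: first k points seed the centres
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centres[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centres

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
assignment, centres = kmeans(points, 2)
print(assignment)  # [0, 1, 0, 1, 0, 1]
```

The number of clusters is fixed by k, which is what makes it directly controllable.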
Jarp Clustering Features
• variable-length Jarvis-Patrick clustering
• based on fingerprints and/or other numerical data
• takes structures/fingerprints and data values from either files or
from database tables
• running time scales better than quadratic but worse than linear
(with respect to number of input structures)
• memory scales linearly
• Jarp can cope with 100Ks of structures
• depending on data and parameters, may create a large number
of singletons
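The sketch below illustrates the classic fixed-length Jarvis-Patrick rule, not the variable-length variant that Jarp implements: two structures end up in the same cluster if each is among the other's k nearest neighbours and they share at least k_min of those neighbours. The function name, the parameter names and the one-dimensional sample data are hypothetical.

```python
def jarvis_patrick(points, k, k_min):
    """Classic Jarvis-Patrick clustering on numeric vectors using squared Euclidean distance."""
    n = len(points)

    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(points[i], points[j]))

    # k nearest neighbours of every point (the point itself is excluded).
    neighbours = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist(i, j))
        neighbours.append(set(others[:k]))

    # Union-find to merge points that satisfy the Jarvis-Patrick condition.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            if mutual and len(neighbours[i] & neighbours[j]) >= k_min:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,), (20.0,)]
print(jarvis_patrick(points, k=2, k_min=1))  # [[0, 1, 2], [3, 4, 5], [6]]
```

The outlier at 20.0 has nearest neighbours, but none of them lists it in return, so it stays a singleton. This is the effect behind the last bullet above.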
Ward Clustering Example
• 8 different sets of known active compounds mixed together
  • 5-HT3-antagonists
  • ACE inhibitors
  • angiotensin 2 antagonists
  • D2 antagonists
  • delta antagonists
  • FTP antagonists
  • mGluR1 antagonists
  • thrombin inhibitors
• ChemAxon’s 2D Pharmacophore fingerprint was generated
• Fingerprints of the mixture were clustered by Ward
• 9 clusters were formed
  • 8 centroids (cluster representative elements) corresponded to the 8 activity classes
  • 1 was a singleton
• All 8 real clusters contained structures only from the activity class of the
centroid (over 95% true positive classification)
Ward Clustering Example: Centroids
Ward Clustering Example: Cluster of the D2 antagonists
Structure based clustering
Non-hierarchical
• Bemis-Murcko frameworks
Hierarchical
• LibraryMCS
Bemis-Murcko frameworks
Bemis-Murcko frameworks features
• based on structure of molecules
• cluster formation is apparent, visual, meets human
expectations
• running time linear with respect to number of input structures
• memory scales sub-linearly
• can easily cope with 1Ms of structures
• suitable for quick overview of very large sets
• spots scaffold hops
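The framework idea can be sketched on a bare molecular graph: strip terminal atoms repeatedly until only ring systems and the linkers between them remain. The example below is a simplified illustration in plain Python that ignores atom types and bond orders; the function name and the bond-list input format are assumptions, and it is not how JKlustor represents molecules.

```python
def murcko_framework(bonds):
    """Return the bonds (as frozensets of two atom ids) that survive after
    repeatedly removing atoms with at most one neighbour."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    terminal = [atom for atom, nbrs in adjacency.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in adjacency:
            continue  # already removed via another path
        for nbr in adjacency.pop(atom):
            adjacency[nbr].discard(atom)
            if len(adjacency[nbr]) <= 1:
                terminal.append(nbr)  # the neighbour has become terminal itself

    return {frozenset((a, b)) for a, nbrs in adjacency.items() for b in nbrs}

ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]
toluene_like = ring + [(1, 7)]               # six-membered ring with a one-atom side chain
ethylbenzene_like = ring + [(1, 7), (7, 8)]  # same ring with a two-atom side chain
print(murcko_framework(toluene_like) == murcko_framework(ethylbenzene_like))  # True
```

Both sample graphs reduce to the same six-membered ring, so they would fall into the same cluster. Comparing frameworks across molecules with different atom numbering additionally needs a canonical form, which this sketch leaves out.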
LibraryMCS
Identifies the largest subgraph shared by several molecular
structures
LibraryMCS: Hierarchical MCS
SAR table view
R-group decomposition
LibraryMCS features
• based on structure of molecules
• cluster formation is apparent, visual, meets human
expectations
• running time near-linear with respect to number of input
structures
• can cope with 100K-200K structures
• suitable for very thorough analysis
• spots scaffold hops
• substituent-activity (property) analysis
LibraryMCS integration at Abbott
“Clustering for the masses…”,
presented by Derek Debe at ChemAxon’s US UGM, Boston, 2008
Clustering performance comparison
(chart: running time in minutes, 0 to 90, plotted against structure count, 0 to 120000, for LibraryMCS, Jarvis-Patrick and Ward-Murtagh)
JKlustor roadmap
In the development pipeline
• Bemis-Murcko generalisations
• IJC integration
• KNIME integration
• New GUI
• Manual clustering
• Multiple class membership
• Disconnected MCS (MOS)
Planned
• PipelinePilot integration
• Spotfire integration
• JChemBase, JChemCartridge integration
• JC4XLS integration
Blue sky
• Multitouch gestures
• LibraryMCS for 1M compound libraries