Fusing database rankings in similarity-based virtual

Fusing database rankings in similarity-based virtual screening Peter Willett, University of Sheffield Overview • Similarity-based virtual screening • Combination of similarity rankings • Similarity fusion • Group fusion • Comparison of fusion rules Drug discovery • The pharmaceutical industry has been one of the great success stories of scientific research, discovering a range of novel drugs for important therapeutic areas • The computer has revolutionised how the industry uses chemical (and increasingly biological) information • Many of these developments are within the discipline we now know as chemoinformatics • “Chem(o)informatics is a generic term that encompasses the design, creation, organization, management, retrieval, analysis, dissemination, visualization and use of chemical information” (G. Paris at a 1999 ACS meeting, quoted at http://www.warr.com/warrzone.htm) • Focus on structural information (2D or 3D) cf bioinformatics Virtual screening • Chemoinformatics covers a wide range of techniques • Here, focus on virtual screening of existing public and in-house databases • Tools to rank compounds in order of decreasing probability of activity • The top-ranked molecules are then prioritised for biological screening • A range of virtual screening methods available, with similarity searching being one of the best established and most widely used Similarity searching • Use of a similarity measure to quantify the resemblance between an active reference (or target) structure and each database structure • Given a reference structure find molecules in a database that are most similar to it (“give me ten more like this”) • Compare the reference structure with each database structure and measure the similarity • Sort the database in order of decreasing similarity • Display the top-ranked structures (“nearest neighbours”) to the searcher 2D similarity searching H N N H N H N OH N O NH 2 H N N Reference structure N N OH H N HO N N H N N N H 2N H 2N Rationale for similarity searching • The similar property principle states that structurally similar molecules tend to have similar properties N N N O HO O Morphine OH O O Codeine OH O O O O Heroin • Given a known active reference structure, a similarity search of a database can be used to identify further molecules for testing • NB many exceptions to the similar property principle Similarity measures • A similarity measure has two principal components • A structure representation Characterise reference and database structures to enable rapid comparison • A similarity coefficient to compare two representations Quantitative measure of the resemblance of these characterisations • The most common measure is based on the use of 2D fingerprints and the Tanimoto coefficient (as in previous example) C C C C O C C C C Fingerprints • A simple, but approximate, representation that encodes the presence of fragment substructures in a bit-string or fingerprint • Cf keywords indexing textual documents • Each bit in the bit-string (binary vector) records the presence (“1”) or absence (“0”) of a particular fragment in the molecule. • Typical length is a few hundred or few thousand bits • Two fingerprints are regarded as similar if they have many common bits set Tanimoto coefficient • Tanimoto coefficient for binary bit strings • C bits set in common between Reference and Database structures • R bits set in Reference structure • D bits set in Database structure • More complex form for use with non-binary data, e.g., physicochemical property vectors • Many other similarity coefficients exist Data fusion: I • Many comparisons of effectiveness using different screening methods (e.g., different coefficients, different fingerprints, 2D or 3D methods) • Sheridan and Kearsley, Drug Discov. Today, 7, 2002, 903 • “We have come to regard looking for ‘the best’ way of searching chemical databases as a futile exercise. In both retrospective and prospective studies, different methods select different subsets of actives for the same biological activity and the same method might work better on some activities than others” • Different types of coefficient and different types of representation reflect different molecular characteristics, so may enhance search performance by using more than one similarity measure Data fusion: II • Use of ideas from textual information retrieval (IR) given analogies between the two domains • Documents, keywords with highly skewed frequency distributions, and relevance to a query • Molecules, fragments with highly skewed frequency distributions, and activity against a specific biological target • IR-like fusion first studied in the late Nineties • Generate multiple rankings from the same reference structure using different similarity measures (similarity fusion) • Found to give improved performance over use of a single similarity measure (more consistent, or even better than best individual) • Later work in chemoinformatics • Generate multiple rankings from different reference structures using the same similarity measure (group fusion) Similarity fusion • Conventional similarity searching yields a single database ranking • Work in IR on the “Authority Effect” • Experiments in TREC show that documents retrieved by multiple search engines more likely to be relevant to a query than if retrieved by a single search engine • Does the Effect also apply in chemoinformatics? • Extensive virtual screening experiments to investigate whether structures retrieved by multiple virtual screening methods more likely to be active than if retrieved by a single method Experimental details: I • Test collection methodology analogous to that used in IR • Use of MDDR (ca. 102K structures) and WOMBAT (ca. 130K structures) databases • Sets of molecules with known biological activities (several hundred known actives in each class) • Simulated virtual screening using an active as the reference structure • How many of the top-ranked molecules from a search are also active? Experimental details: II • Sets of 25 searches for a reference structure: • 5 different similarity coefficients (Tanimoto, cosine, Euclidean distance, Forbes, Russell-Rao) • 5 different fingerprints (MDL, BCI, Daylight, Unity and ECFP_4) • Apply cut-off to take, e.g., top-1% of a ranking • Numbers of molecules, and numbers of active molecules, retrieved by 1, 2….24, 25 searches • Average over different reference structures for each activity class, and over different activity classes Retrieval of molecules: WOMBAT top-1% searches 5HT1A 4500 5HT3 AChE 3500 AT1 COX 3000 D2 2500 fXa 2000 HIVP MMP1 1500 PDE4 PKC 1000 Renin 500 SubP 0 Thrombin 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Average number of overlap 4000 Average over all classes Num ber of sim ilarity searches Average number of overlap (log) Retrieval of molecules: WOMBAT top-1% searches (average over classes) 4.5 4 3.5 y = -1.9788x + 3.9568 R2 = 0.9666 3 2.5 2 1.5 1 0.5 0 0 0.2 0.4 0.6 0.8 1 1.2 Number of similarity searches (log) Zipf-like distribution 1.4 1.6 Retrieval of active molecules: WOMBAT top-1% searches 5HT 1A 100 5HT 3 Average percentage of active 90 AChE 80 AT 1 70 COX 60 D2 50 fXa HIVP 40 MMP1 30 PDE4 20 PKC 10 Renin SubP 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 Number of similarity searches T hrombin Average over all classes Average percentage of actives Retrieval of active molecules: WOMBAT top-1% searches (average over classes) 100 90 80 70 60 50 40 30 20 10 0 y = 1.1104e0.1736x R² = 0.9948 0 5 10 15 Number of similarity searches 20 25 Similarity fusion: conclusions • Using multiple searches hence results in: • Rapid decrease in the numbers of molecules retrieved • Rapid increase in the percentage of those retrieved molecules that are active • Multiple searches could hence increase the effectiveness of similarity-based virtual screening • Provides empirical basis for similarity fusion (but very simple fusion rule). What about group fusion? Use of group fusion: I Reference 2 Reference 1 Reference 3 After truncation to required rank Reference 2 Reference 1 Reference 3 Group fusion • Use of MDDR database (ca. 102K structures) • Measured numbers of actives retrieved in top-5% of ranking? • Group fusion searches where pick ten actives at random • Comparison with the average of all the individual actives for each activity class • Comparison with the best single active for each activity class • Use of Unity and ECFP4 fingerprints) • Group fusion markedly out-performs the use of individual reference structures • Best results obtained using combination of scores and the MAX rule (see later) • Hert et al., J. Chem. Inf. Comput. Sci., 44, 2004, 1177 Recall at 5% (%ReReccall Recall (%)) Group fusion: average over 11 activity classes 70 65 60 55 50 45 40 35 30 25 Unity ECFP_4 Single Similarity Average Single Similarity Maximum Data Fusion (Scores - Max) Fusion rules • Given multiple input rankings, a fusion rule outputs a single, combined ranking • The rankings can be either the computed similarity values or the resulting rank positions • Work in IR and chemoinformatics has used simple arithmetical operations to combine rankings (though many other, more complex types of rule available): • CombMAX for similarity data • CombSUM for rank data • Detailed comparison of a range of rules Fusion rules for the x-th database structure • CombMax = max{S1(x), S2(x)..Si(x)..Sn(x)} • Also CombMIN • CombSum = ΣS (x) i • Also CombMED and other averages, using all or just some of the rankings • CombRKP = Σ(1/R (x)) i • Used only with rank data Very simple rules! • Other studies use supervised rules (logistic regression, belief theory etc) • But normally very limited training data (i.e., structures and bioactivity information) at the stage you want to use data fusion • If such data are available, other chemoinformatics approaches preferable Experimental details • Searches carried out using • Similarity fusion and group fusion • Various percentages of the ranked database • 15 different fusion rules • Results show conclusively that best results (for both similarity fusion and group fusion) obtained when: • Use just the top 1-5% of each ranked list in the fusion • Use the CombRKP fusion rule on the ranked lists Use of CombRKP: I Virtual screening seeks to rank molecules in decreasing order of probability of activity: MDDR searches (J. Med. Chem., 48, 2005, 7049) show a hyperbola-like plot Use of CombRKP: II Fusion scores for CombRKP best approximate probability of activity, and hence CombRKP likely to perform well, Results averaged over 200 MDDR searches Conclusions • Similarity-based virtual screening using fingerprints well-established • Can enhance screening effectiveness by use of data fusion: • Combining the rankings from different similarity measures • Combining the rankings from different reference structures • Range of simple fusion rules available for this purpose Acknowledgments • Organisations • Accelrys, Daylight Chemical Information Systems, Digital Chemistry, EPSRC, Government of Malaysia, Sunset Molecular, Royal Society, Tripos, Wolfson Foundation • People • Claire Ginn, Jerome Hert, John Holliday, Evangelos Kanoulas, Nurul Malim, Christoph Mueller, Naomie Salim

Fusing database rankings in similarity-based virtual

Related documents

Products

Support

Fusing database rankings in similarity-based virtual

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib