Fusing database rankings in similarity-based virtual

advertisement
Fusing database rankings in
similarity-based virtual
screening
Peter Willett, University of Sheffield
Overview
• Similarity-based virtual screening
• Combination of similarity rankings
• Similarity fusion
• Group fusion
• Comparison of fusion rules
Drug discovery
• The pharmaceutical industry has been one of the great
success stories of scientific research, discovering a
range of novel drugs for important therapeutic areas
• The computer has revolutionised how the industry uses
chemical (and increasingly biological) information
• Many of these developments are within the discipline we
now know as chemoinformatics
• “Chem(o)informatics is a generic term that encompasses the
design, creation, organization, management, retrieval, analysis,
dissemination, visualization and use of chemical information” (G.
Paris at a 1999 ACS meeting, quoted at
http://www.warr.com/warrzone.htm)
• Focus on structural information (2D or 3D) cf bioinformatics
Virtual screening
• Chemoinformatics covers a wide range of
techniques
• Here, focus on virtual screening of existing
public and in-house databases
• Tools to rank compounds in order of decreasing
probability of activity
• The top-ranked molecules are then prioritised for
biological screening
• A range of virtual screening methods available,
with similarity searching being one of the best
established and most widely used
Similarity searching
• Use of a similarity measure to quantify the
resemblance between an active reference (or
target) structure and each database structure
• Given a reference structure find molecules in a
database that are most similar to it (“give me ten
more like this”)
• Compare the reference structure with each database
structure and measure the similarity
• Sort the database in order of decreasing similarity
• Display the top-ranked structures (“nearest
neighbours”) to the searcher
2D similarity searching
H
N
N
H
N
H
N
OH
N
O
NH 2
H
N
N
Reference structure
N
N
OH
H
N
HO
N
N
H
N
N
N
H 2N
H 2N
Rationale for
similarity searching
• The similar property principle states that structurally
similar molecules tend to have similar properties
N
N
N
O
HO
O
Morphine
OH
O
O
Codeine
OH
O
O
O
O
Heroin
• Given a known active reference structure, a similarity
search of a database can be used to identify further
molecules for testing
• NB many exceptions to the similar property principle
Similarity measures
• A similarity measure has two principal
components
• A structure representation
Characterise reference and database structures
to enable rapid comparison
• A similarity coefficient to compare two
representations
Quantitative measure of the resemblance of
these characterisations
• The most common measure is based on the
use of 2D fingerprints and the Tanimoto
coefficient (as in previous example)
C
C C C
O
C C C
C
Fingerprints
• A simple, but approximate, representation that encodes
the presence of fragment substructures in a bit-string or
fingerprint
• Cf keywords indexing textual documents
• Each bit in the bit-string (binary vector) records the
presence (“1”) or absence (“0”) of a particular fragment
in the molecule.
• Typical length is a few hundred or few thousand bits
• Two fingerprints are regarded as similar if they have
many common bits set
Tanimoto coefficient
• Tanimoto coefficient for binary bit strings
• C bits set in common between Reference and
Database structures
• R bits set in Reference structure
• D bits set in Database structure
• More complex form for use with non-binary data, e.g.,
physicochemical property vectors
• Many other similarity coefficients exist
Data fusion: I
• Many comparisons of effectiveness using different
screening methods (e.g., different coefficients, different
fingerprints, 2D or 3D methods)
• Sheridan and Kearsley, Drug Discov. Today, 7, 2002,
903
• “We have come to regard looking for ‘the best’ way of searching
chemical databases as a futile exercise. In both retrospective
and prospective studies, different methods select different
subsets of actives for the same biological activity and the same
method might work better on some activities than others”
• Different types of coefficient and different types of
representation reflect different molecular characteristics,
so may enhance search performance by using more
than one similarity measure
Data fusion: II
• Use of ideas from textual information retrieval (IR) given
analogies between the two domains
• Documents, keywords with highly skewed frequency
distributions, and relevance to a query
• Molecules, fragments with highly skewed frequency distributions,
and activity against a specific biological target
• IR-like fusion first studied in the late Nineties
• Generate multiple rankings from the same reference structure
using different similarity measures (similarity fusion)
• Found to give improved performance over use of a single
similarity measure (more consistent, or even better than best
individual)
• Later work in chemoinformatics
• Generate multiple rankings from different reference structures
using the same similarity measure (group fusion)
Similarity fusion
• Conventional similarity searching yields a single
database ranking
• Work in IR on the “Authority Effect”
• Experiments in TREC show that documents retrieved
by multiple search engines more likely to be relevant
to a query than if retrieved by a single search engine
• Does the Effect also apply in chemoinformatics?
• Extensive virtual screening experiments to investigate
whether structures retrieved by multiple virtual
screening methods more likely to be active than if
retrieved by a single method
Experimental details: I
• Test collection methodology analogous to that
used in IR
• Use of MDDR (ca. 102K structures) and WOMBAT
(ca. 130K structures) databases
• Sets of molecules with known biological activities
(several hundred known actives in each class)
• Simulated virtual screening using an active as the
reference structure
• How many of the top-ranked molecules from a search
are also active?
Experimental details: II
• Sets of 25 searches for a reference structure:
• 5 different similarity coefficients (Tanimoto, cosine,
Euclidean distance, Forbes, Russell-Rao)
• 5 different fingerprints (MDL, BCI, Daylight, Unity and
ECFP_4)
• Apply cut-off to take, e.g., top-1% of a ranking
• Numbers of molecules, and numbers of active
molecules, retrieved by 1, 2….24, 25 searches
• Average over different reference structures for
each activity class, and over different activity
classes
Retrieval of molecules:
WOMBAT top-1% searches
5HT1A
4500
5HT3
AChE
3500
AT1
COX
3000
D2
2500
fXa
2000
HIVP
MMP1
1500
PDE4
PKC
1000
Renin
500
SubP
0
Thrombin
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
Average number of overlap
4000
Average over all classes
Num ber of sim ilarity searches
Average number of overlap (log)
Retrieval of molecules:
WOMBAT top-1% searches
(average over classes)
4.5
4
3.5
y = -1.9788x + 3.9568
R2 = 0.9666
3
2.5
2
1.5
1
0.5
0
0
0.2
0.4
0.6
0.8
1
1.2
Number of similarity searches (log)
Zipf-like distribution
1.4
1.6
Retrieval of active molecules:
WOMBAT top-1% searches
5HT 1A
100
5HT 3
Average percentage of active
90
AChE
80
AT 1
70
COX
60
D2
50
fXa
HIVP
40
MMP1
30
PDE4
20
PKC
10
Renin
SubP
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Number of similarity searches
T hrombin
Average over all classes
Average percentage of actives
Retrieval of active molecules:
WOMBAT top-1% searches
(average over classes)
100
90
80
70
60
50
40
30
20
10
0
y = 1.1104e0.1736x
R² = 0.9948
0
5
10
15
Number of similarity searches
20
25
Similarity fusion:
conclusions
• Using multiple searches hence results in:
• Rapid decrease in the numbers of molecules
retrieved
• Rapid increase in the percentage of those retrieved
molecules that are active
• Multiple searches could hence increase the
effectiveness of similarity-based virtual
screening
• Provides empirical basis for similarity fusion (but
very simple fusion rule). What about group
fusion?
Use of group fusion: I
Reference 2
Reference 1
Reference 3
After truncation to required rank
Reference 2
Reference 1
Reference 3
Group fusion
• Use of MDDR database (ca. 102K structures)
• Measured numbers of actives retrieved in top-5% of
ranking?
• Group fusion searches where pick ten actives at random
• Comparison with the average of all the individual actives for
each activity class
• Comparison with the best single active for each activity class
• Use of Unity and ECFP4 fingerprints)
• Group fusion markedly out-performs the use of
individual reference structures
• Best results obtained using combination of scores and the
MAX rule (see later)
• Hert et al., J. Chem. Inf. Comput. Sci., 44, 2004, 1177
Recall at 5%
(%ReReccall
Recall (%))
Group fusion: average
over 11 activity classes
70
65
60
55
50
45
40
35
30
25
Unity
ECFP_4
Single
Similarity Average
Single
Similarity Maximum
Data Fusion
(Scores - Max)
Fusion rules
• Given multiple input rankings, a fusion rule
outputs a single, combined ranking
• The rankings can be either the computed similarity
values or the resulting rank positions
• Work in IR and chemoinformatics has used
simple arithmetical operations to combine
rankings (though many other, more complex
types of rule available):
• CombMAX for similarity data
• CombSUM for rank data
• Detailed comparison of a range of rules
Fusion rules for the x-th
database structure
• CombMax = max{S1(x), S2(x)..Si(x)..Sn(x)}
• Also CombMIN
• CombSum =
ΣS (x)
i
• Also CombMED and other averages, using all
or just some of the rankings
• CombRKP =
Σ(1/R (x))
i
• Used only with rank data
Very simple rules!
• Other studies use supervised rules
(logistic regression, belief theory etc)
• But normally very limited training data (i.e.,
structures and bioactivity information) at
the stage you want to use data fusion
• If such data are available, other
chemoinformatics approaches preferable
Experimental details
• Searches carried out using
• Similarity fusion and group fusion
• Various percentages of the ranked database
• 15 different fusion rules
• Results show conclusively that best results (for
both similarity fusion and group fusion) obtained
when:
• Use just the top 1-5% of each ranked list in the fusion
• Use the CombRKP fusion rule on the ranked lists
Use of CombRKP: I
Virtual screening seeks to rank molecules in decreasing
order of probability of activity: MDDR searches (J. Med.
Chem., 48, 2005, 7049) show a hyperbola-like plot
Use of CombRKP: II
Fusion scores for CombRKP best approximate probability of activity,
and hence CombRKP likely to perform well, Results averaged over
200 MDDR searches
Conclusions
• Similarity-based virtual screening using
fingerprints well-established
• Can enhance screening effectiveness by
use of data fusion:
• Combining the rankings from different
similarity measures
• Combining the rankings from different
reference structures
• Range of simple fusion rules available for this
purpose
Acknowledgments
• Organisations
• Accelrys, Daylight Chemical Information Systems,
Digital Chemistry, EPSRC, Government of Malaysia,
Sunset Molecular, Royal Society, Tripos, Wolfson
Foundation
• People
• Claire Ginn, Jerome Hert, John Holliday, Evangelos
Kanoulas, Nurul Malim, Christoph Mueller, Naomie
Salim
Download