RF-Score: A new scoring function for Protein

John Mitchell; James McDonagh; Neetika Nath
Rob Lowe; Richard Marchese Robinson
RF-Score:
a Machine Learning Scoring Function
for Protein-Ligand Binding Affinities
• Ballester, P.J. & Mitchell, J.B.O. (2010)
Bioinformatics 26, 1169-1175
Calculating the affinities of protein-ligand complexes:
• For docking
• For post-processing docking hits
• For virtual screening
• For lead optimisation
• For 3D QSAR
• Within series of related complexes
• For any general complex
• Absolute (hard!)
• Relative
A difficult, unsolved problem.
Three existing approaches …
1. Force fields
2. Empirical functions
3. Knowledge based
How knowledge-based scoring functions have worked …
1. P-L complexes from PDB
2. Assign atoms to types
3. Find histograms of type-type distances
4. Convert to an ‘energy’
5. Add up the energies from all P-L atom pairs
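The histogram step above can be sketched in a few lines of Python. This is only an illustration: the atom representation `(atom_type, (x, y, z))`, the function name `distance_histograms`, and the 2 to 8 Å range with 0.5 Å bins are assumptions made for the example, not details taken from any particular scoring function.

```python
import math
from collections import defaultdict

def distance_histograms(complexes, r_min=2.0, r_max=8.0, bin_width=0.5):
    """Count protein-ligand atom-pair distances per (protein type, ligand type).

    `complexes` is a list of (protein_atoms, ligand_atoms) pairs; each atom
    is (atom_type, (x, y, z)). Range and bin width are illustrative defaults.
    Returns {(protein_type, ligand_type): [count per distance bin]}.
    """
    n_bins = int(round((r_max - r_min) / bin_width))
    hists = defaultdict(lambda: [0] * n_bins)
    for protein_atoms, ligand_atoms in complexes:
        for p_type, p_xyz in protein_atoms:
            for l_type, l_xyz in ligand_atoms:
                r = math.dist(p_xyz, l_xyz)
                if r_min <= r < r_max:
                    # Clamp guards against floating-point edge cases at r_max.
                    b = min(int((r - r_min) / bin_width), n_bins - 1)
                    hists[(p_type, l_type)][b] += 1
    return dict(hists)
```

Each key of the returned dictionary corresponds to one type-type histogram of the kind plotted on the next slide.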
[Figure: nitrogen-oxygen distance distribution; number observed (0 to 1200) versus distance in Ångstroms (2 to 8 Å)]
• This conversion of the histogram into an energy function uses a “reverse Boltzmann” methodology.
• Thus it “assumes” that the atoms of protein and ligand are independent particles in equilibrium at temperature T.
• For a variety of reasons, these are poor assumptions …
  • Molecular connectivity: atom-atom distances are miles from being independent.
  • Excluded volume effects.
  • No physical basis for assuming such an equilibrium.
  • Changes in structure with T are small and not like those implied by the Boltzmann distribution.
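For concreteness, the reverse Boltzmann conversion can be written as E_i = -kT ln(p_obs,i / p_ref,i) for each distance bin i. The sketch below assumes the reference histogram is supplied by the caller (the choice of reference state differs between published potentials), and the names `reverse_boltzmann`, `score_complex`, and the `pseudocount` guard against empty bins are hypothetical.

```python
import math

def reverse_boltzmann(observed, reference, kT=1.0, pseudocount=1e-6):
    """Turn per-bin counts into per-bin pseudo-energies.

    E_i = -kT * ln(p_obs_i / p_ref_i), with both histograms normalised to
    probabilities. `reference` is an assumed reference-state histogram.
    """
    obs_total = sum(observed)
    ref_total = sum(reference)
    energies = []
    for n_obs, n_ref in zip(observed, reference):
        p_obs = n_obs / obs_total + pseudocount   # avoid log(0) for empty bins
        p_ref = n_ref / ref_total + pseudocount
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies

def score_complex(pair_bins, energy_tables):
    """Add up tabulated energies over all protein-ligand atom pairs.

    `pair_bins` is an iterable of (type_pair, bin_index);
    `energy_tables` maps type_pair -> list of per-bin energies.
    """
    return sum(energy_tables[type_pair][b] for type_pair, b in pair_bins)
```

Bins that are more populated than the reference come out with negative (favourable) energies, and under-populated bins come out positive, which is exactly the step the bullets above call into question.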
We thought about this …
… and wrote a paper saying
“It’s not true, but it sort of works”
Then we had a better idea – could we dispense with the
reverse Boltzmann formalism?
• Instead of assuming a formula that relates the distance distribution to the binding free energy …
… use machine learning to learn the relationship from known structures and binding affinities.
• And persuade someone to pay for it!
[Figure: nitrogen-oxygen distance distribution (number observed versus distance in Ångstroms) → Random Forest → predicted binding affinity]
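To feed a learner, each complex has to become a fixed-length descriptor vector. One simple possibility, sketched here, is to count protein-ligand atom pairs of each type combination within a distance cutoff. The element lists, the 12 Å cutoff, and the name `pair_count_features` are illustrative assumptions for this example; the slides do not specify the descriptors.

```python
import math
from itertools import product

# Illustrative (assumed) atom types; other typing schemes are possible.
PROTEIN_TYPES = ("C", "N", "O", "S")
LIGAND_TYPES = ("C", "N", "O", "S", "P", "F", "Cl", "Br", "I")

def pair_count_features(protein_atoms, ligand_atoms, cutoff=12.0):
    """Return one count per (protein type, ligand type) pair within `cutoff`.

    Atoms are (atom_type, (x, y, z)). Atoms of types not listed above
    are ignored. The feature order follows product(PROTEIN_TYPES, LIGAND_TYPES).
    """
    index = {pair: i for i, pair in enumerate(product(PROTEIN_TYPES, LIGAND_TYPES))}
    features = [0] * len(index)
    for p_type, p_xyz in protein_atoms:
        for l_type, l_xyz in ligand_atoms:
            i = index.get((p_type, l_type))
            if i is not None and math.dist(p_xyz, l_xyz) < cutoff:
                features[i] += 1
    return features
```

Vectors like this, paired with measured affinities, are the kind of training data a regression model can learn from without any assumed energy formula.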
Random Forest
● Introduced by Breiman and Cutler (2001)
● Development of Decision Trees (Recursive Partitioning):
● Dataset is partitioned into consecutively smaller subsets
● Each partition is based upon the value of one descriptor
● The descriptor used at each split is selected so as to optimise splitting
● Bootstrap sample of N objects chosen from the N available objects with replacement
● The Random Forest is just a forest of randomly generated decision trees …
… whose outputs are averaged to give the final prediction
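The ingredients listed above (recursive partitioning on one descriptor at a time, bootstrap samples, averaged outputs) fit in a short pure-Python sketch. It is a toy for illustration, not the implementation behind RF-Score: the function names, the squared-error split criterion, and the defaults (`n_trees`, `n_try`, `min_size`, `max_depth`) are assumptions. Considering only a random subset of `n_try` descriptors at each split is the usual Random Forest recipe.

```python
import random
from statistics import mean

def build_tree(rows, targets, n_try, rng, min_size=2, depth=0, max_depth=8):
    """Recursive partitioning: each split tests one descriptor against a threshold."""
    if len(rows) < 2 * min_size or depth >= max_depth or len(set(targets)) == 1:
        return mean(targets)                      # leaf: mean of training targets
    best = None
    for j in rng.sample(range(len(rows[0])), n_try):       # random descriptor subset
        for t in sorted(set(r[j] for r in rows))[1:]:      # candidate thresholds
            left = [y for r, y in zip(rows, targets) if r[j] < t]
            right = [y for r, y in zip(rows, targets) if r[j] >= t]
            if len(left) < min_size or len(right) < min_size:
                continue
            ml, mr = mean(left), mean(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:               # keep the best split
                best = (sse, j, t)
    if best is None:
        return mean(targets)
    _, j, t = best
    lo = [(r, y) for r, y in zip(rows, targets) if r[j] < t]
    hi = [(r, y) for r, y in zip(rows, targets) if r[j] >= t]
    return (j, t,
            build_tree([r for r, _ in lo], [y for _, y in lo],
                       n_try, rng, min_size, depth + 1, max_depth),
            build_tree([r for r, _ in hi], [y for _, y in hi],
                       n_try, rng, min_size, depth + 1, max_depth))

def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are (descriptor, threshold, left, right)."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if row[j] < t else right
    return node

def fit_forest(rows, targets, n_trees=50, n_try=None, seed=0):
    """Grow each tree on a bootstrap sample: N objects drawn from N with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    n_try = n_try or max(1, len(rows[0]) // 3)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    """Average the individual tree outputs to give the final prediction."""
    return mean(predict_tree(tree, row) for tree in forest)
```

In practice one would reach for an established Random Forest implementation rather than this toy, but the structure is the same.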
Building RF-Score
PDBbind 2007
Validation results: PDBbind set
• Following the method of Cheng et al., JCIM 49, 1079 (2009)
• Independent test set: PDBbind core 2007, 195 complexes from 65 clusters
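The slide does not list which statistics were reported. As a hedged illustration, agreement between predicted and measured affinities on a held-out set is often summarised with a correlation coefficient and an error measure. The helper names `pearson_r` and `rmse` are hypothetical.

```python
import math

def pearson_r(predicted, measured):
    """Pearson correlation between predicted and measured values."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in predicted))
    sd_m = math.sqrt(sum((m - mean_m) ** 2 for m in measured))
    return cov / (sd_p * sd_m)

def rmse(predicted, measured):
    """Root-mean-square error between predicted and measured values."""
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)
```

Both functions should be applied only to complexes that were kept out of training, such as the independent test set named above.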
RF-Score outperforms competitor scoring functions, at least on our test
RF-Score is available for free from our group website