John Mitchell; James McDonagh; Neetika Nath; Rob Lowe; Richard Marchese Robinson

RF-Score: a Machine Learning Scoring Function for Protein-Ligand Binding Affinities
● Ballester, P.J. & Mitchell, J.B.O. (2010) Bioinformatics 26, 1169-1175

Calculating the affinities of protein-ligand complexes:
● For docking
● For post-processing docking hits
● For virtual screening
● For lead optimisation
● For 3D QSAR
● Within series of related complexes
● For any general complex
● Absolute (hard!)
● Relative
A difficult, unsolved problem.

Three existing approaches …
1. Force fields
2. Empirical Functions
3. Knowledge based

How knowledge-based scoring functions have worked …
● P-L complexes from PDB
● Assign atoms to types
● Find histograms of type-type distances
● Convert to an ‘energy’
● Add up the energies from all P-L atom pairs

[Figure: Nitrogen-Oxygen Distance Distribution. Number observed (0 to 1200) against Distance / Angstroms (2 to 8).]

This conversion of the histogram into an energy function uses a “reverse Boltzmann” methodology. Thus it “assumes” that the atoms of protein and ligand are independent particles in equilibrium at temperature T.

For a variety of reasons, these are poor assumptions …
● Molecular connectivity: atom-atom distances are miles from being independent.
● Excluded volume effects.
● No physical basis for assuming such an equilibrium.
● Changes in structure with T are small and not like those implied by the Boltzmann distribution.

We thought about this …
… and wrote a paper saying “It’s not true, but it sort of works”

Then we had a better idea: could we dispense with the reverse Boltzmann formalism?
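
For reference, the knowledge-based recipe described above (histogram of type-type distances, conversion to an ‘energy’, sum over all P-L atom pairs) can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind any published scoring function: the uniform reference distribution, the pseudocount, the temperature, the toy counts, and the function names `reverse_boltzmann` and `score_complex` are all hypothetical choices made for this sketch.

```python
import math

# Gas constant in kcal/(mol*K), i.e. the Boltzmann constant per mole.
R_KCAL = 0.0019872041


def reverse_boltzmann(counts, temperature=300.0, pseudocount=1.0):
    """Convert a distance histogram into one pseudo-energy per bin.

    E(bin) = -RT * ln(p_observed(bin) / p_reference(bin)).
    The reference is a uniform distribution over bins (an assumption of
    this sketch; real scoring functions use more careful reference states).
    """
    smoothed = [c + pseudocount for c in counts]  # avoid log(0) for empty bins
    total = sum(smoothed)
    p_ref = 1.0 / len(counts)
    rt = R_KCAL * temperature
    return [-rt * math.log((c / total) / p_ref) for c in smoothed]


def score_complex(pair_distances, energies, r_min=2.0, bin_width=0.5):
    """Add up the bin energies over all protein-ligand atom-pair distances."""
    total = 0.0
    for r in pair_distances:
        idx = math.floor((r - r_min) / bin_width)
        if 0 <= idx < len(energies):  # distances outside the histogram are ignored
            total += energies[idx]
    return total


# Made-up N-O histogram: 12 bins of 0.5 Angstrom covering 2.0 to 8.0 Angstrom.
toy_counts = [5, 400, 900, 600, 500, 550, 700, 800, 900, 1000, 1100, 1200]
energies = reverse_boltzmann(toy_counts)

# Made-up N-O distances for one hypothetical complex.
score = score_complex([2.9, 3.1, 6.4], energies)
print(f"pseudo-energy: {score:.3f} kcal/mol")
```

Bins that are observed more often than the reference get negative (favourable) pseudo-energies, which is exactly where the independence and equilibrium assumptions criticised above enter.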
Instead of assuming a formula that relates the distance distribution to the binding free energy …
… use machine learning to learn the relationship from known structures and binding affinities.
And persuade someone to pay for it!

[Figure: Nitrogen-Oxygen Distance Distribution (Number observed against Distance / Angstroms); Random Forest; Predicted binding affinity]

Random Forest
● Introduced by Breiman and Cutler (2001)
● Development of Decision Trees (Recursive Partitioning):
● Dataset is partitioned into consecutively smaller subsets
● Each partition is based upon the value of one descriptor
● The descriptor used at each split is selected so as to optimise splitting
● Bootstrap sample of N objects chosen from the N available objects with replacement

The Random Forest is just a forest of randomly generated decision trees …
… whose outputs are averaged to give the final prediction

Building RF-Score
● PDBbind 2007

Validation results: PDBbind set
● Following method of Cheng et al. JCIM 49, 1079 (2009)
● Independent test set: PDBbind core 2007, 195 complexes from 65 clusters

Validation results: PDBbind set
● RF-Score outperforms competitor scoring functions, at least on our test

RF-Score is available for free from our group website
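
The Random Forest procedure outlined above (bootstrap sample of N objects drawn with replacement, recursive partitioning on one descriptor per split, averaging over trees) can be sketched in plain Python as below. This is a toy illustration, not the RF-Score implementation: the squared-error split criterion, the depth limit, the random descriptor subset of size `n_try`, the synthetic data, and every function name are assumptions made for this sketch.

```python
import random
import statistics


def sse(values):
    """Sum of squared deviations from the mean (split criterion assumed here)."""
    mean = statistics.fmean(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(rows, targets, n_try, rng, depth=0, max_depth=8, min_size=2):
    """Recursive partitioning: each split tests the value of one descriptor."""
    if len(rows) <= min_size or depth >= max_depth or len(set(targets)) == 1:
        return statistics.fmean(targets)  # leaf: mean target of this subset
    best = None
    # Only a random subset of descriptors is considered at each split.
    for f in rng.sample(range(len(rows[0])), n_try):
        for threshold in sorted({row[f] for row in rows})[1:]:
            left = [t for row, t in zip(rows, targets) if row[f] < threshold]
            right = [t for row, t in zip(rows, targets) if row[f] >= threshold]
            cost = sse(left) + sse(right)
            if best is None or cost < best[0]:
                best = (cost, f, threshold)
    if best is None:  # every sampled descriptor was constant in this subset
        return statistics.fmean(targets)
    _, f, threshold = best
    lo = [i for i, row in enumerate(rows) if row[f] < threshold]
    hi = [i for i, row in enumerate(rows) if row[f] >= threshold]
    grow = lambda idx: build_tree([rows[i] for i in idx], [targets[i] for i in idx],
                                  n_try, rng, depth + 1, max_depth, min_size)
    return (f, threshold, grow(lo), grow(hi))


def predict_tree(node, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are floats."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] < threshold else right
    return node


def train_forest(rows, targets, n_trees=30, n_try=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample: N objects drawn from the N available, with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(build_tree([rows[i] for i in idx],
                                 [targets[i] for i in idx], n_try, rng))
    return forest


def predict_forest(forest, row):
    """Average the outputs of all trees to give the final prediction."""
    return statistics.fmean([predict_tree(tree, row) for tree in forest])


# Synthetic data: two made-up descriptors and a made-up "affinity".
rows = [[x, y] for x in range(10) for y in range(10)]
targets = [2.0 * x + y for x, y in rows]
forest = train_forest(rows, targets)
pred_low = predict_forest(forest, [1, 1])
pred_high = predict_forest(forest, [8, 8])
print(pred_low, pred_high)
```

In practice one would use an established Random Forest library rather than hand-written trees; the sketch only makes the bootstrap, split, and averaging steps explicit.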