Quantitative Structure-Activity Relationships
Quantitative Structure-Property-Relationships
SAR/QSAR/QSPR modeling
Alexandre Varnek
Faculté de Chimie, ULP, Strasbourg, FRANCE
SAR/QSAR/QSPR models
• Development
• Validation
• Application
Classification and Regression models
• Development
• Validation
• Application
Development of the models
• Selection and curation of experimental data
• Preparation of training and test sets (optionally)
• Selection of an initial set of descriptors and their normalisation
• Variables selection (optionally)
• Selection of a machine-learning method
Validation of models
• Training/test set
• Cross-validation: internal, external
Application of the Models
• Models Applicability Domain
Development of the models
• Experimental Data: selection and cleaning
• Descriptors
• Mathematical techniques
• Statistical criteria
Data selection: Congenericity problem
• The congenericity principle is the assumption that "similar
compounds give similar responses". This was the basic
requirement of QSAR. It concerns structurally
homogeneous data sets.
• Nowadays, experimentalists mostly produce structurally
diverse (non-congeneric) data sets.
Data cleaning:
• Similar experimental conditions
• Duplicates
• Structures standardization
• Removal of mixtures
• …
The importance of Chemical Data Curation
Dataset curation is crucial for any cheminformatics analysis (QSAR
modeling, clustering, similarity search, etc.).
Currently, it is uncommon to describe procedures used for curation
in research papers; procedures are implemented or employed
differently in different groups.
We wish to emphasize the need to create and popularize
standardized curation strategy, applicable for any ensemble of
compounds.
What about these structures? (real examples)
Why are duplicates unsafe for QSAR?
Duplicates are identical compounds present in a given dataset.
[Figure: structure drawings of three dataset entries, ID = 256, ID = 879 and ID = 2346]
Manual identification of duplicates is practically impossible, especially when the dataset is large.
Activity analysis of duplicates is also highly important to identify cases where one occurrence is
identified as ‘active’ and another one as ‘weakly active’ or ‘inactive’.
[Figure: two structure drawings, one labeled ACTIVE and the other INACTIVE]
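The duplicate check described above can be sketched in a few lines. This is only an illustration under stated assumptions: each record is assumed to carry a canonical structure key computed beforehand (for example a canonical SMILES or an InChIKey), and the keys and activity labels below are invented. The function names `find_duplicates` and `conflicting_duplicates` are hypothetical.

```python
from collections import defaultdict


def find_duplicates(records):
    """Group records by structure key and return the keys seen more than once.

    `records` is a list of (compound_id, structure_key, activity) tuples.
    The structure key is ASSUMED to be a canonical identifier computed
    beforehand; this sketch does not canonicalize structures itself.
    """
    groups = defaultdict(list)
    for compound_id, key, activity in records:
        groups[key].append((compound_id, activity))
    return {key: members for key, members in groups.items() if len(members) > 1}


def conflicting_duplicates(records):
    """Return the duplicate groups whose members carry different activity labels."""
    return {
        key: members
        for key, members in find_duplicates(records).items()
        if len({activity for _, activity in members}) > 1
    }


# Hypothetical records: the IDs echo the slide, keys and labels are invented.
records = [
    (256, "KEY-A", "active"),
    (879, "KEY-A", "inactive"),
    (2346, "KEY-A", "active"),
    (17, "KEY-B", "inactive"),
]
dups = find_duplicates(records)
conflicts = conflicting_duplicates(records)
print(dups)
print(conflicts)
```

Groups returned by `conflicting_duplicates` are the ones that deserve manual inspection before modeling.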
Structural standardization
For a given dataset, chemical groups have to be written in a standardized way, taking
into account critical properties (like pH) of the modeled system.
Aromatic compounds
[Figure: two drawings of the same aromatic compound]
These two different representations of the same
compound will lead to different descriptors, especially
with certain fingerprint or fragmental approaches.
Carboxylic acids, nitro groups etc.
[Figure: alternative drawings of carboxylic acid and nitro groups attached to a substituent X]
For a given dataset, these functional groups have to be written in a consistent way to
avoid different descriptor values for the same chemical group.
Normalization of carboxylic acid groups, nitro groups, etc.
Removal of inorganics
All inorganic compounds must be removed since our QSAR
modeling strategy includes the calculation of molecular
descriptors for organic compounds only.
This is an obvious limitation of the approach. However, the total fraction of
inorganics in most available datasets is relatively small.
To detect inorganics, several solutions are available:
- Automatic identification, combining JChem (ChemAxon, cxcalc program) to output
the empirical formula of all compounds with simple scripts to remove compounds
with no carbon;
- Manual inspection of compounds possessing no carbon atom using Notepad++ tools.
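A "simple script" of the kind mentioned above might look like the following sketch. It assumes the empirical formulas have already been exported as plain strings such as `C6H6O` (the export step itself is not shown), and it ignores charges, brackets and isotope labels. The helper names are hypothetical.

```python
import re

# One element symbol (capital letter plus optional lowercase letter)
# followed by an optional count, e.g. "C6", "H", "Cl2".
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Parse a plain empirical formula into a {symbol: count} dictionary."""
    counts = {}
    for symbol, number in ELEMENT.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def has_carbon(formula):
    """True if the formula contains at least one carbon atom."""
    return element_counts(formula).get("C", 0) > 0


# Hypothetical input: compound ID mapped to its empirical formula.
formulas = {"1": "C6H6O", "2": "CaCl2", "3": "NaCl", "4": "C2H3NaO2"}
organic_ids = [cid for cid, formula in formulas.items() if has_carbon(formula)]
print(organic_ids)
```

Parsing element symbols explicitly matters here: a plain substring test for "C" would wrongly keep CaCl2 and NaCl.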
Removal of mixtures
Fragments can be removed according to the number of
constitutive atoms or the molecular weight.
However, some cases are particularly difficult to treat.
Examples from DILI - BIOWISDOM dataset:
[Figure: ID=172, INITIAL FORM and CLEANED FORM BY CHEMAXON. The two eliminated compounds could be active!]
MANUAL INSPECTION/VALIDATION IS STILL CRUCIAL
[Figure: ID=1700, INITIAL FORM and CLEANED FORM BY CHEMAXON. Ok.]
Removal of salts
The Remove Fragments, Neutralize and Transform options of ChemAxon
Standardizer have to be used simultaneously for best results.
Aromatization and 2D cleaning
ChemAxon Standardizer offers two ways to aromatize benzene rings,
both of them based on Hückel’s rule.
[Figure: the same structure aromatized in “General Style” and in “Basic Style”]
Most descriptor calculation packages recognize the “basic style” only.
http://www.chemaxon.com/jchem/marvin/help/sci/aromatization-doc.html
Preparation of training and test sets
[Diagram: Initial data set → splitting of the initial data set into training and test sets (test: 10 – 15 %) → building of structure property models on the training set → selection of the best models according to statistical criteria → “prediction” calculations using the best structure property models]
Recommendations to prepare a test set
• (i) experimental methods for determination of activities in the training
and test sets should be similar;
• (ii) the activity values should span several orders of magnitude, but
should not exceed activity values in the training set by more than 10%;
• (iii) the balance between active and inactive compounds should be
respected for uniform sampling of the data.
References: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37, 2206-2215
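Recommendation (iii) can be illustrated with a stratified random split, which samples each activity class at the same rate. This is a minimal sketch, not the procedure of the cited paper: the 15 % test fraction, the fixed seed and the function name `stratified_split` are assumptions chosen for the example.

```python
import random


def stratified_split(labels, test_fraction=0.15, seed=0):
    """Return (train_idx, test_idx), sampling every class at the same rate."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # At least one test compound per class (an assumption of this sketch).
        n_test = max(1, round(len(shuffled) * test_fraction))
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return sorted(train), sorted(test)


# Hypothetical dataset: 20 actives and 60 inactives.
labels = ["active"] * 20 + ["inactive"] * 60
train_idx, test_idx = stratified_split(labels)
n_active_test = sum(1 for i in test_idx if labels[i] == "active")
print(len(train_idx), len(test_idx), n_active_test)
```

The active/inactive ratio of the test set then matches that of the initial data set.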
Descriptors
• Variables selection
• Normalization
[Figure: pattern matrix with molecules as rows and descriptors as columns]
Selection of descriptors for QSAR model
QSAR models should be reduced to a set of descriptors which is
as information-rich but as small as possible.
Objective selection (independent variable only)
Statistical criteria of correlations
Pairwise selection (Forward or Backward Stepwise selection)
Principal Component Analysis
Partial Least Squares analysis
Genetic Algorithm
…
Subjective selection
Descriptors selection based on mechanistic studies
Preprocessing strategy for the derivation of models
for use in structure-activity relationships (QSARs)
1. identify a subset of columns (variables) with significant
correlation to the response;
2. remove columns (variables) with zero (small) variance;
3. remove columns (variables) with no unique information;
4. identify a subset of variables on which to construct a model;
5. address the problem of chance correlation.
D. C. Whitley, M. G. Ford, D. J. Livingstone
J. Chem. Inf. Comput. Sci. 2000, 40, 1160-1168
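Steps 2 and 3 of the list above (zero-variance columns and columns with no unique information) can be sketched as follows. This is not the algorithm of the cited paper: the variance threshold, the correlation cutoff of 0.95 and the function names are assumptions made for illustration.

```python
from statistics import mean, pvariance


def pearson(u, v):
    """Pearson correlation coefficient of two equally long columns."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


def filter_descriptors(columns, min_variance=1e-8, max_abs_r=0.95):
    """Drop near-constant columns, then columns highly correlated with a kept one."""
    kept = {}
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # step 2: zero (small) variance
        if any(abs(pearson(values, other)) >= max_abs_r for other in kept.values()):
            continue  # step 3: no unique information
        kept[name] = values
    return kept


# Hypothetical descriptor columns for four molecules.
columns = {
    "d1": [1.0, 2.0, 3.0, 4.0],
    "d2": [5.0, 5.0, 5.0, 5.0],    # constant
    "d3": [2.0, 4.0, 6.0, 8.0],    # exactly 2 * d1
    "d4": [1.0, -1.0, 1.0, -1.0],
}
kept = filter_descriptors(columns)
print(list(kept))
```

The order of the columns decides which member of a correlated pair survives; here the first one encountered is kept.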
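Descriptors Normalisation
[Figure: pattern matrix of n molecules (rows i) by descriptors (columns j), with raw values $x_{ij}^{*}$]
$m_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}^{*}$
$s_j^2 = \frac{1}{n} \sum_{i=1}^{n} (x_{ij}^{*} - m_j)^2$
Normalisation 1 (Unit Variance scaling): $x_{ij} = \frac{x_{ij}^{*} - m_j}{s_j}$
Normalisation 2 (Mean Centring Scaling): $x_{ij} = x_{ij}^{*} - m_j$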
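The two normalisations can be checked numerically on a single descriptor column. The sketch below uses the population variance (division by n), as in the definition of the column variance on this slide; the function names and the sample values are assumptions.

```python
def column_stats(values):
    """Mean m_j and standard deviation s_j (population form, divided by n)."""
    n = len(values)
    m = sum(values) / n
    s = (sum((x - m) ** 2 for x in values) / n) ** 0.5
    return m, s


def mean_centring(values):
    """Subtract the column mean from every raw value."""
    m, _ = column_stats(values)
    return [x - m for x in values]


def unit_variance_scaling(values):
    """Subtract the column mean, then divide by the column standard deviation.

    Assumes s_j > 0, i.e. constant columns were removed beforehand.
    """
    m, s = column_stats(values)
    return [(x - m) / s for x in values]


raw = [2.0, 4.0, 6.0, 8.0]  # hypothetical raw descriptor values
centred = mean_centring(raw)
scaled = unit_variance_scaling(raw)
print(centred)
print(scaled)
```

After unit variance scaling the column has zero mean and unit variance, so descriptors measured on very different scales contribute comparably to distance-based methods.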
Data Normalisation
[Figure: the initial descriptors compared with the descriptors after Norm. 1 and after Norm. 2]
Machine-Learning Methods
Fitting models’ parameters
Y = F(ai , Xi )
Xi - descriptors (independent variables)
ai - fitted parameters
The goal is to minimize the Residual Sum of Squares (RSS):
$RSS = \sum_{i=1}^{N} (y_{exp,i} - y_{calc,i})^2$
Multiple Linear Regression
Activity   Descriptor
Y1         X1
Y2         X2
…          …
Yn         Xn
Yi = a0 + a1 Xi1
[Figure: plot of Y against X]
Multiple Linear Regression
y = ax + b
Residual Sum of Squares (RSS):
$RSS = \sum_{i=1}^{N} (y_i - y_{calc,i})^2$
[Figure: plot annotated with the parameters a and b]
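For the one-descriptor case y = ax + b, the parameters that minimize the RSS have a closed form. The sketch below uses the standard least-squares formulas; the data points and function names are invented for illustration.

```python
def fit_line(x, y):
    """Least-squares slope a and intercept b for y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx
    b = my - a * mx
    return a, b


def rss(x, y, a, b):
    """Residual sum of squares of the line y = a*x + b."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))


# Hypothetical descriptor values and activities.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.1, 4.9, 7.0]
a, b = fit_line(x, y)
print(a, b, rss(x, y, a, b))
```

Any other choice of a and b gives a larger RSS on the same data, which is what "fitting" means here.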
Multiple Linear Regression
Activity   Descr 1   Descr 2   …   Descr m
Y1         X11       X12       …   X1m
Y2         X21       X22       …   X2m
…          …         …         …   …
Yn         Xn1       Xn2       …   Xnm
Yi = a0 + a1 Xi1 + a2 Xi2 + … + am Xim
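For m descriptors, the coefficients that minimize the RSS follow from the normal equations. The derivation below is the standard textbook one and is added here as a worked equation; the matrix notation (X with a leading column of ones for a0, vectors y and a) is an assumption, since the slides use only indexed symbols.

```latex
% X: n x (m+1) matrix whose i-th row is (1, X_{i1}, ..., X_{im})
% y: vector of activities (Y_1, ..., Y_n), a: vector (a_0, a_1, ..., a_m)
\begin{align}
RSS(a) &= \sum_{i=1}^{n} \Bigl( Y_i - a_0 - \sum_{j=1}^{m} a_j X_{ij} \Bigr)^2
        = \lVert y - X a \rVert^2 \\
\frac{\partial RSS}{\partial a} &= -2\, X^{T} (y - X a) = 0
  \quad\Longrightarrow\quad X^{T} X\, a = X^{T} y \\
a &= (X^{T} X)^{-1} X^{T} y \qquad \text{(when } X^{T} X \text{ is invertible)}
\end{align}
```

The invertibility condition fails when descriptors are linearly dependent or when there are more coefficients than molecules, which is one practical reason for descriptor selection.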
kNN (k Nearest Neighbors)
The activity Y of a compound is assessed as a weighted mean of the
activities Yi of its k nearest neighbors in the chemical space.
[Figure: TRAINING SET compounds plotted in the Descriptor 1 / Descriptor 2 plane]
A.Tropsha, A.Golbraikh, 2003
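A weighted-mean kNN prediction can be sketched as below. The slide does not specify the distance or the weighting scheme, so Euclidean distance and inverse-distance weights are assumptions of this example, as are the function name and the data.

```python
import math


def knn_predict(query, training, k=3):
    """Predict an activity as the inverse-distance weighted mean of k neighbours.

    `training` is a list of (descriptor_vector, activity) pairs.
    Euclidean distance and 1/d weights are ASSUMED, not taken from the slides.
    """
    ranked = sorted(
        ((math.dist(query, x), y) for x, y in training),
        key=lambda pair: pair[0],
    )
    nearest = ranked[:k]

    # A neighbour at zero distance would give an infinite weight:
    # return the mean activity of the exact matches instead.
    exact = [y for d, y in nearest if d == 0.0]
    if exact:
        return sum(exact) / len(exact)

    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Hypothetical training set in a two-descriptor space.
training = [
    ((0.0, 0.0), 1.0),
    ((1.0, 0.0), 2.0),
    ((0.0, 2.0), 4.0),
    ((5.0, 5.0), 10.0),
]
prediction = knn_predict((0.0, 1.0), training, k=2)
print(prediction)
```

Because distances depend on descriptor scales, kNN is normally applied to normalised descriptors.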
Biological and Artificial Neuron
Multilayer Neural Network
Neurons in the input layer correspond to descriptors, neurons in the output
layer – to properties being predicted, neurons in the hidden layer – to nonlinear
latent variables
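SVM: Support Vector Machine
[Figure: Support Vector Classification (SVC). The separating hyperplane $\langle w, x \rangle + b = 0$ lies between the margin hyperplanes $\langle w, x \rangle + b = 1$ and $\langle w, x \rangle + b = -1$, which are a distance $2 / \lVert w \rVert$ apart.]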
SVM: Margins
The margin is the minimal distance of any training point to
the separating hyperplane:
Margin = $1 / \lVert w \rVert$
Support Vector Regression
ε-Insensitive Loss Function
Only the points outside the ε-tube are penalized in a linear fashion:
$|\xi|_{\varepsilon} := \begin{cases} 0 & \text{if } |\xi| \le \varepsilon \\ |\xi| - \varepsilon & \text{otherwise} \end{cases}$
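The ε-insensitive loss is a one-line function. The sketch below only evaluates the loss for a given residual; it does not train a support vector regression model, and the function name is an assumption.

```python
def eps_insensitive_loss(residual, epsilon):
    """Zero inside the epsilon-tube, linear in |residual| outside it."""
    return max(0.0, abs(residual) - epsilon)


inside = eps_insensitive_loss(0.05, epsilon=0.1)    # within the tube
outside = eps_insensitive_loss(-0.30, epsilon=0.1)  # outside the tube
print(inside, outside)
```

Residuals smaller than ε cost nothing, so only the points on or outside the tube influence the fitted function.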
Kernel Trick
$K(x, x') = \langle \Phi(x), \Phi(x') \rangle$
[Figure: the data in the low-dimensional input space and in the high-dimensional feature space]
Any non-linear problem (classification, regression) in the original input space can be
converted into a linear one by a non-linear mapping Φ into a feature space of
higher dimension.
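The identity behind the kernel trick can be verified on a small case. The example uses the homogeneous quadratic kernel K(x, z) = (x·z)² in two dimensions, whose explicit feature map is Φ(x) = (x1², √2·x1·x2, x2²). This kernel is chosen for illustration; the slides do not name a specific one.

```python
import math


def dot(u, v):
    """Ordinary inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit feature map of the quadratic kernel for 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def quadratic_kernel(x, z):
    """K(x, z) = (x . z)^2, computed in the input space only."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
k_input_space = quadratic_kernel(x, z)
k_feature_space = dot(phi(x), phi(z))
print(k_input_space, k_feature_space)
```

Both values agree, yet the kernel never constructs Φ(x) explicitly. That is what makes very high-dimensional feature spaces affordable.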
QSAR/QSPR models
• Development
• Validation
• Application
Validation
Estimation of the models predictive performance
5-Fold Cross Validation
[Figure: the dataset is divided into Fold1, Fold2, Fold3, Fold4 and Fold5]
All compounds of the dataset are predicted.
Leave-One-Out Cross-Validation
N-Fold Internal Cross Validation
• Cross-validation is performed AFTER variables selection on the entire dataset.
• On each fold, the “test” set contains only 1 molecule.
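Both 5-fold and leave-one-out cross-validation come from the same index-splitting scheme, with leave-one-out being the case where the number of folds equals the number of molecules. The sketch below only generates the index sets; model fitting is left out, and in practice the indices would usually be shuffled first. The function name is an assumption.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th sample
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


five_fold = list(k_fold_indices(10, 5))  # 5-fold CV on 10 molecules
loo = list(k_fold_indices(4, 4))         # leave-one-out on 4 molecules
for train, test in five_fold:
    print(train, test)
```

Collecting the predictions of all test folds gives one predicted value per compound, from which the validation statistics are computed.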
Statistical parameters for Regression
Fitting vs validation
Stabilities (logK) of Sr2+L complexes in water
[Figure: three scatter plots of calculated or predicted logK against LogKexp]
Fit (LogKcalc): R2 = 0.886, RMSE = 0.97. All molecules were used for the model preparation.
LOO (LogKpred): R2 = 0.826, RMSE = 1.20. Each molecule was “predicted” in internal CV.
5-CV (LogKpred): R2 = 0.682, RMSE = 1.62. Each molecule was predicted in external CV.
Regression Error Characteristic (REC)
REC curves are widely used to compare the performance of different models.
The gray line corresponds to the average value model (AM). For a given model, the
area between AM and the corresponding calculated curve reflects its quality.
Statistical parameters for Classification
Confusion Matrix
Classification Evaluation
sensitivity = true positive rate (TPR) = hit rate = recall
TPR = TP / P = TP / (TP + FN)
false positive rate (FPR)
FPR = FP / N = FP / (FP + TN)
specificity (SPC) = True Negative Rate
SPC = TN / N = TN / (FP + TN) = 1 − FPR
positive predictive value (PPV) = precision
PPV = TP / (TP + FP)
negative predictive value (NPV)
NPV = TN / (TN + FN)
accuracy (ACC)
ACC = (TP + TN) / (P + N)
balanced accuracy (BAC)
BAC = (sensitivity + specificity) / 2 = (TP / (TP + FN) + TN / (FP + TN)) / 2
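The definitions above translate directly into code. In this sketch the balanced accuracy is computed as the mean of sensitivity and specificity, consistent with the expanded form (TP / (TP + FN) + TN / (FP + TN)) / 2. The confusion-matrix counts are invented, and the function assumes that no denominator is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics listed above from the four confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # TPR, recall
    specificity = tn / (tn + fp)   # SPC, true negative rate
    return {
        "TPR": sensitivity,
        "FPR": fp / (fp + tn),
        "SPC": specificity,
        "PPV": tp / (tp + fp),     # precision
        "NPV": tn / (tn + fn),
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "BAC": (sensitivity + specificity) / 2,
    }


# Hypothetical confusion matrix: 50 positives, 100 negatives.
metrics = classification_metrics(tp=40, fp=10, tn=90, fn=10)
for name, value in metrics.items():
    print(name, round(value, 3))
```

On this imbalanced example ACC (0.867) is higher than BAC (0.85), because accuracy is dominated by the larger negative class.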
Receiver Operating Characteristic (ROC)
Plot of the sensitivity vs (1 − specificity) for a binary classifier
system as its discrimination threshold is varied.
The ROC can also be represented equivalently by plotting the fraction of
true positives (TPR = true positive rate) vs the fraction of false positives
(FPR = false positive rate).
[Figure: ROC curve, TPR against FPR]
Ideally, Area Under Curve (AUC) → 1
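A ROC curve can be built by sweeping the discrimination threshold over the ranked scores, and the AUC then follows from the trapezoidal rule. This is a minimal sketch that assumes binary labels (1 = positive, 0 = negative) with at least one compound of each class; the scores and function names are invented.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points obtained by lowering the threshold stepwise."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Compounds with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical classifier scores; one negative is ranked above a positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
print(area)
```

The AUC equals the fraction of (positive, negative) pairs that the scores rank correctly, here 8 of 9.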
ROC (Receiver Operating Characteristics)
[Figure: ROC curve built from a ranked list of compounds labeled 0-9 and a-j, plotting TP% against FP% from 0% to 100%; the example curve has AUC=0.84]
Ideal model: AUC=1.00
Useless model: AUC=0.50
When is a model accepted?
Regression Models: determination coefficient R2 > R02. Here, R02 = 0.5.
Classification Models: BA > 1/q for q classes (example shown: 3 classes).
“Chance correlation” problem
[Figure: two quantities plotted against year, 1965 to 1980]
A model MUST be validated on new independent
data to avoid a chance correlation.
Y-Scrambling
(for methods without descriptor selection)
[Figure: three successive random permutations of the Y column against the fixed descriptors X1 to X7; for each permutation the resulting R2 is marked on a scale from 0.0 to 1.0]
X     Y     Scrambled Y (1)   Scrambled Y (2)   Scrambled Y (3)
X1    Y1    Y2                Y4                Y7
X2    Y2    Y5                Y1                Y6
X3    Y3    Y4                Y5                Y3
X4    Y4    Y6                Y2                Y5
X5    Y5    Y1                Y6                Y4
X6    Y6    Y7                Y3                Y1
X7    Y7    Y3                Y7                Y2
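Y-scrambling can be sketched for the simplest possible model, a one-descriptor linear fit, whose R2 equals the squared Pearson correlation. The data, the number of rounds and the function names are assumptions. A real study would refit the full model on every scrambled Y column.

```python
import random


def r_squared(x, y):
    """R2 of the least-squares line through (x, y): squared Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def y_scrambling(x, y, n_rounds=100, seed=0):
    """Refit after each random permutation of y and collect the R2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = y[:]
        rng.shuffle(shuffled)   # descriptors stay fixed, activities are permuted
        scores.append(r_squared(x, shuffled))
    return scores


# Hypothetical data for seven compounds with a nearly linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 7.2]
true_r2 = r_squared(x, y)
scrambled_r2 = y_scrambling(x, y)
mean_scrambled = sum(scrambled_r2) / len(scrambled_r2)
print(true_r2, mean_scrambled)
```

If the scrambled models reached R2 values comparable to the real one, the original correlation would be suspected of being due to chance.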
QSAR/QSPR models
• Development
• Validation
• Application
Test compound
QSPR Models
Prediction Performance
Robustness of QSPR models
- Descriptors type;
- Descriptors selection;
- Machine-learning methods;
- Validation of models.
Applicability domain of models
Is a test compound similar to the training set compounds?
Applicability domain of QSAR models
The new compound will be predicted by the model only if:
Di ≤ <Dk> + Z × sk
with Z, an empirical parameter (0.5 by default)
[Figure: TRAINING SET compounds and TEST COMPOUNDs in the Descriptor 1 / Descriptor 2 plane. INSIDE THE DOMAIN: will be predicted. OUTSIDE THE DOMAIN: will not be predicted.]
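The distance criterion Di ≤ <Dk> + Z × sk can be sketched as follows. The slide does not define the symbols, so this example makes explicit assumptions: <Dk> and sk are taken as the mean and standard deviation of the nearest-neighbour Euclidean distances within the training set, and Di as the distance from the new compound to its nearest training compound. Other definitions (for example k > 1 neighbours) are equally plausible.

```python
import math
from statistics import mean, pstdev


def nearest_distance(point, others):
    """Euclidean distance from `point` to its closest member of `others`."""
    return min(math.dist(point, other) for other in others)


def domain_threshold(training, z=0.5):
    """<Dk> + Z * sk over nearest-neighbour distances in the training set (ASSUMED)."""
    distances = [
        nearest_distance(p, training[:i] + training[i + 1:])
        for i, p in enumerate(training)
    ]
    return mean(distances) + z * pstdev(distances)


def in_domain(candidate, training, z=0.5):
    """True if the candidate satisfies Di <= <Dk> + Z * sk."""
    return nearest_distance(candidate, training) <= domain_threshold(training, z)


# Hypothetical training set in a two-descriptor space.
training = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
inside = in_domain((0.5, 0.5), training)
outside = in_domain((3.0, 3.0), training)
print(inside, outside)
```

A compound flagged as outside the domain is simply not predicted, rather than being given an unreliable value.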
Applicability domain of QSAR models
Applicability Domain Approaches
Fragment-based methods
• Fragment Control (FC)
• Model’s Fragment Control (MFC)
Density-based methods
• 1-SVM
Distance-based methods
• zkNN
Range-based methods
• Bounding Box (BB)
Ensemble modeling
[Figure: “Hunting season …”, a single hunter compared with many hunters]
Ensemble modeling
[Figure: predictions Y1, Y2 and Y3 of individual models combined into a consensus]
$\text{Consensus} = \frac{1}{n} \sum_{i=1}^{n} Y_i$
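The consensus prediction is the arithmetic mean of the individual model predictions for one compound. The sketch below assumes that all n models are weighted equally; the predicted values are invented.

```python
def consensus(predictions):
    """Arithmetic mean of the predictions Y_1 ... Y_n of n individual models."""
    return sum(predictions) / len(predictions)


# Hypothetical predictions of three models for the same compound.
y_models = [5.0, 6.0, 7.0]
y_consensus = consensus(y_models)
print(y_consensus)
```

In practice the average is often restricted to models for which the compound lies inside the applicability domain.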
Screening and hits selection
[Figure: a Database of structures passes through Virtual Screening with a QSPR model, which separates Useless compounds from Hits; the Hits go on to Experimental Tests]