An Approach for determining applicability domain

REVIEW OF METHODS FOR ASSESSING THE APPLICABILITY
DOMAINS OF SARS AND QSARS
PAPER 2: An approach to determining applicability domain for
QSAR group contribution models: an analysis of SRC KOWWIN
Authors:
Dr Nina Nikolova (Bulgarian Academy of Sciences, Sofia, Bulgaria)
E-mail: nina@acad.bg
Dr Joanna Jaworska (Procter & Gamble, Strombeek – Bever, Belgium)
E-mail: jaworska.j@pg.com
Sponsor:
The European Commission - Joint Research Centre
Institute for Health & Consumer Protection - ECVAM
21020 Ispra (VA)
Italy
Contact: Dr Andrew Worth
E-mail: andrew.worth@jrc.it
http://ecb.jrc.it/QSAR
JRC Contract ECVA-CCR.496575-Z
VERSION OF 28 JANUARY 2005
An approach to determining applicability domain for QSAR group
contribution models: an analysis of SRC KOWWIN
Nina Nikolova¹ and Joanna Jaworska²*
¹ IPP - Bulgarian Academy of Sciences, 25A “acad. G.Bonchev” str., 1113 Sofia, Bulgaria,
nina@acad.bg
² Procter and Gamble, Eurocor, Central Product Safety, 100 Temselaan, B-1853 Strombeek-Bever,
Belgium, Fax 32 2 5683098, Tel 32 2 456 2076, Jaworska.j@pg.com
* To receive all correspondence and reprints
Summary
The Setubal Workshop report [1] provided conceptual guidance on defining a (Q)SAR applicability
domain. However, applying the guidance in practice requires an operational definition that
allows an automatic (computerized) procedure for determining the applicability domain of a
model to be designed. This paper attempts to address that need for models characterized by a
large number of descriptors, such as group contribution based models. The high dimensionality
of these models imposes special practical computational restrictions on estimation of the
interpolation region. As an example we analyse the KOWWIN model for n-octanol/water partition
coefficient prediction from Syracuse Research Corporation (SRC), which uses 508 descriptors.
We conclude that the ranges approach combined with Principal Component rotation is an
acceptable compromise between a method suited to the data distribution of a given training set
and one suited to the number of points available in that set.
Key words: QSAR, applicability domain, KOWWIN, group contribution method
Introduction
Predictions made within the applicability domain of a QSAR model should be reliable. The Setubal
workshop report [1] offered the following guidance on applicability domain assessment:
“The applicability domain of a (Q)SAR is the physico-chemical, structural, or biological
space, knowledge or information on which the training set of the model has been developed,
and for which it is applicable to make predictions for new compounds. The applicability
domain of a (Q)SAR should be described in terms of the most relevant parameters i.e. usually
those that are descriptors of the model. Ideally the (Q)SAR should only be used to make
predictions within that domain by interpolation not extrapolation”. This description is helpful
in explaining the intuitive meaning of the “applicability domain” concept. However, practical
assessment of a domain needs guidance on the method and the boundary criteria.
Two approaches to defining the applicability domain have been developed to date. The first
estimates the training set coverage in the model’s descriptor space and has recently been
reviewed for regression and classification models [2]. The second is based on similarity
analysis, on the premise that a QSAR prediction is reliable if the compound is “similar” to the
compounds in the training set. This approach is more difficult to apply than the first, because
“similarity” is a subjective term and different notions of similarity are relevant to different
endpoints [3, 4].
This paper attempts to develop practical guidance for applicability domain assessment of
high-dimensional models such as group contribution based models. The underlying premise of
these methods is that the property of a compound is the sum of the contributions associated
with its atoms or fragments (additivity), assuming that the contributions of identical atoms or
fragments are the same as in the original compounds used to derive them (transferability). The
group contribution method is a very robust approach to developing QSAR models for broad
chemical classes. Group contribution models use hundreds of fragments as model descriptors. We
examine which interpolation methods are suitable for high-dimensional models and analyse the
n-octanol/water partition coefficient model KOWWIN from SRC [5, 6], with 508 descriptors, as a
working case study.
Methods
Interpolation in the multivariate space
Calculating the interpolation region in a multivariate space is equivalent to estimating the
convex hull [2]. Convex hull calculation in a high-dimensional space is computationally very
intensive. In this paper we compare the following convex hull approximations:
1. Ranges in descriptor space
2. Euclidean distance
3. City-block distance
4. Mahalanobis distance
5. Leverage/Hotelling T²
Table 1 gives an overview of the formulas and underlying assumptions for each method. For more
detail, see a recent review [2]. We do not consider the probability density approach, because
parametric methods based on the normal distribution yield the same results as approaches 3, 4
and 5, and there are not enough data in the training set to use a nonparametric probability
density method [7]. We compare the results obtained for the raw data with those for data that
have been scaled, centered and rotated to Principal Components, as advocated in [8, 9].
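The distance measures listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `distances_to_centre`, the use of NumPy, and the pseudo-inverse of the covariance matrix (chosen here because covariance matrices of sparse fragment counts are often singular) are all introduced for this example only. The Euclidean and city-block values are true distances; the Mahalanobis value is returned in squared form, to which leverage and Hotelling T² are proportional according to Table 1.

```python
import numpy as np

def distances_to_centre(X_train, X_query):
    """Distances from each query row to the centre (mean) of the training set."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    mu = X_train.mean(axis=0)            # centre of the training data
    diff = X_query - mu

    euclidean = np.sqrt((diff ** 2).sum(axis=1))
    city_block = np.abs(diff).sum(axis=1)

    # Assumption: pseudo-inverse instead of a plain inverse, because the
    # covariance matrix of fragment counts may be singular.
    cov = np.cov(X_train, rowvar=False)
    cov_inv = np.linalg.pinv(cov)
    maha_sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    maha_sq = np.maximum(maha_sq, 0.0)   # guard against tiny negative round-off

    return {
        "euclidean": euclidean,
        "city_block": city_block,
        "mahalanobis_sq": maha_sq,
    }

# Hypothetical toy data: four training compounds described by two fragment counts.
X_train = np.array([[0, 0], [2, 0], [0, 2], [2, 2]])
result = distances_to_centre(X_train, [[1, 1], [4, 1]])
print(result)
```

Because the domain criterion only compares a distance with the largest training-set distance, using squared or unsquared values leads to the same in/out decision.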
Table 1. Formulas and assumptions for different interpolation methods.
Applicability domain criteria
A compound is labelled as out of the domain if
1. At least one fragment count or correction factor count is outside the training set range
(ranges approach);
2. The distance between the compound and the center of the training data set exceeds a
threshold (distance approaches). The threshold for all distance measures and for Hotelling T²
is the largest distance from a training set data point to the center of the training data set
(i.e. the distance to the most distant point).
Although the criteria may appear quite different, they serve the same end: estimating the
smallest space that encompasses the whole training set.
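The two criteria above can be written as short membership tests. The sketch below is illustrative only: the function names, the NumPy dependency and the toy data are assumptions made for this example, and only the Euclidean and city-block metrics are included for brevity.

```python
import numpy as np

def in_domain_by_ranges(X_train, X_query):
    """Criterion 1: every descriptor must lie within the training-set range."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    n_out = ((X_query < lo) | (X_query > hi)).sum(axis=1)  # dimensions out of range
    return n_out == 0, n_out

def in_domain_by_distance(X_train, X_query, metric="euclidean"):
    """Criterion 2: distance to the training centre must not exceed the
    largest distance observed within the training set itself."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    mu = X_train.mean(axis=0)

    def dist(X):
        diff = X - mu
        if metric == "euclidean":
            return np.sqrt((diff ** 2).sum(axis=1))
        if metric == "city_block":
            return np.abs(diff).sum(axis=1)
        raise ValueError("unknown metric: " + metric)

    threshold = dist(X_train).max()      # distance of the most distant training point
    return dist(X_query) <= threshold

# Hypothetical toy data: two fragment counts per compound.
X_train = [[0, 0], [2, 0], [0, 2], [2, 2]]
X_query = [[1, 1], [2, 2], [1, 2.3], [3, 1]]
by_range, n_out = in_domain_by_ranges(X_train, X_query)
by_dist = in_domain_by_distance(X_train, X_query)
print(by_range, n_out, by_dist)
```

The third toy compound is out of range in one dimension yet closer to the centre than the most distant training point, which shows that the two criteria need not agree on every compound.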
Results of SRC KOWWIN case study
KOWWIN descriptors determination
The description of the AFC method [5] provides only a partial list of fragments and correction
factors. Fragments are described in textual form, and in many cases no explicit structure is
given. Several listed fragments have ambiguous descriptions, which makes it difficult to reuse
the AFC method directly. Furthermore, the list of fragments and correction factors differs
slightly between versions of the SRC KOWWIN software, because more compounds and fragments are
added. We therefore decided to use the most reliable source for the KOWWIN descriptor space:
the full text output of the SRC KOWWIN v1.66 software (Figure 1). This is a text file listing
all fragments and factors, together with their frequencies and weights, applicable to each
compound in the training set.
Figure 1 SRC KOWWIN text output for a compound.
A software tool was developed to parse the text output and produce a table in which the columns
are all possible fragments and correction factors and the rows are compounds. Each cell in the
table gives the number of times a fragment occurs in a compound. The descriptor space of the
KOWWIN model was obtained by running all 2434 compounds from the training set through the
software. This revealed 186 different fragments and 322 different correction factors,
resulting in a 508-dimensional descriptor space (Table 2). The log Kow values vary between 4.57 and 8.19.
All 10910 compounds from the validation set were also run through the software. The validation
set makes use of 172 (out of 186) fragments and 316 (out of 322) correction factors (Table 2).
The log Kow values in the validation set vary between -4.99 and 11.71. The quality of the very
high log Kow values (above ca. 8) may need to be reviewed, but this is outside the scope of
this paper.
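The parsing step described above might look roughly like the following Python sketch. It is not the authors' tool: the pipe-delimited row layout is assumed from the example output in Figure 1, and the names `parse_kowwin_block`, `build_table` and `SAMPLE` are hypothetical. Real KOWWIN output may need additional handling.

```python
import re

# Assumed layout, taken from Figure 1: TYPE | NUM | DESCRIPTION | COEFF | VALUE
SAMPLE = """\
-------+-----+--------------------------------------------+---------+--------
 TYPE  | NUM |        LOGKOW FRAGMENT DESCRIPTION         |  COEFF  |  VALUE
-------+-----+--------------------------------------------+---------+--------
 Frag  | 12  |  -CH3    [aliphatic carbon]                | 0.5473  |  6.5676
 Frag  |  2  |  -OH     [hydroxy, aromatic attach]        |-0.4802  | -0.9604
 Factor|  2  |  Ring rx: -OH / di-ortho;sec- or t- carbon |-0.8500  | -1.7000
 Const |     |  Equation Constant                         |         |  0.2290
"""

def parse_kowwin_block(text):
    """Return {(type, description): count} for one compound's output block."""
    counts = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # Keep only fragment and correction-factor rows; skip rulers, header, constant.
        if len(fields) < 5 or fields[0] not in ("Frag", "Factor"):
            continue
        name = re.sub(r"\s+", " ", fields[2])      # collapse column padding
        key = (fields[0], name)
        counts[key] = counts.get(key, 0) + int(fields[1])
    return counts

def build_table(per_compound):
    """Rows are compounds, columns are all fragments/factors seen; cells are counts."""
    columns = sorted({key for counts in per_compound for key in counts})
    rows = [[counts.get(col, 0) for col in columns] for counts in per_compound]
    return columns, rows

first = parse_kowwin_block(SAMPLE)
# Second compound is a made-up example with a single fragment type.
columns, rows = build_table([first, {("Frag", "-CH3 [aliphatic carbon]"): 3}])
print(columns)
print(rows)
```

Keeping the descriptor key as a (type, description) pair avoids merging a fragment and a correction factor that happen to share a description.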
Table 2. Fragments’ list for the KOWWIN’s training and validation sets.
The full list of fragments and correction factors used in the KOWWIN model, together with the
range of each fragment and correction factor, is not presented here but is available from the
authors.
The descriptors were tested for uniform distribution with the Kolmogorov-Smirnov test in
MATLAB at the default rejection level of 5%, and all of them failed. The distributions of
individual descriptors were also tested for normality with the Jarque-Bera test in MATLAB at
the default rejection level of 5%. The Jarque-Bera test evaluates the hypothesis that X has a
normal distribution with unspecified mean and variance against the alternative that X does
not have a normal distribution. According to this test, none of the descriptors is normally
distributed. This is a hint that the ranges and distance-based approaches may not reflect the
data distribution well, and that determining the interpolation region needs a more
sophisticated technique such as nonparametric probability density estimation. However, we
lack a sufficient amount of data to use this method.
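For readers without MATLAB, the Jarque-Bera statistic is simple enough to compute directly. The sketch below is an assumption-laden illustration rather than a reproduction of the analysis above: it uses the asymptotic chi-square approximation with two degrees of freedom for the p-value, which may differ from what a statistics package reports for small samples, and the function name and toy data are hypothetical.

```python
import math

def jarque_bera(x):
    """Jarque-Bera statistic and asymptotic p-value (chi-square, 2 d.o.f.)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n   # biased central moments
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    if m2 == 0:
        raise ValueError("constant descriptor: the test is undefined")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # The survival function of a chi-square variable with 2 d.o.f. is exp(-x/2).
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

# Hypothetical sparse fragment-count descriptor: absent in most compounds.
counts = [0] * 90 + [1] * 8 + [5] * 2
jb, p = jarque_bera(counts)
print(jb, p, "reject normality" if p < 0.05 else "cannot reject normality")
```

Sparse count descriptors of this kind are strongly right-skewed, so rejection of normality is the expected outcome for such data.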
Finally, we scaled and centered the data and rotated the axes to the orthogonal principal
component (PC) axes. This step is important for KOWWIN because the descriptors in the KOWWIN
model are highly correlated. PCA on the original data reveals that the first 16 principal
components (PCs) explain 90% of the variance and the first 36 PCs explain 95% of the
variance. PCA on the scaled, centered data is more balanced: the first 197 PCs explain 90% of
the variance and the first 282 PCs explain 95% of the variance.
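The effect of scaling on the number of PCs needed can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the function names `fit_pca` and `n_components_for`, the use of the singular value decomposition, and the two-descriptor toy data. The returned `rotate` function projects new compounds onto the training-set PC axes, which is what a ranges check in PC space would operate on.

```python
import numpy as np

def fit_pca(X, scale=True):
    """Center (and optionally scale) X, then rotate to principal components."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1) if scale else np.ones(X.shape[1])
    sd = np.where(sd == 0, 1.0, sd)            # leave constant columns unscaled
    Z = (X - mu) / sd
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    explained = s ** 2 / (s ** 2).sum()        # fraction of variance per PC

    def rotate(X_new):
        """Project compounds onto the training-set PC axes."""
        return ((np.asarray(X_new, dtype=float) - mu) / sd) @ vt.T

    return rotate, explained

def n_components_for(explained, target):
    """Smallest number of leading PCs whose cumulative variance reaches target."""
    cumulative = np.cumsum(explained)
    return int(min(np.searchsorted(cumulative, target) + 1, len(cumulative)))

# Hypothetical data: one descriptor with a much larger spread than the other.
X = [[0, 0], [100, 1], [200, 0], [300, 1]]
_, raw = fit_pca(X, scale=False)
rotate, scaled = fit_pca(X, scale=True)
print(n_components_for(raw, 0.90), n_components_for(scaled, 0.90))
```

On the toy data a single PC dominates before scaling, while both PCs are needed afterwards, mirroring in miniature the shift from 16 to 197 PCs reported above.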
Comparison between KOWWIN training and validation set predictions
In order to assess the quality of the applicability domain assessment we compared observed and
predicted values for the chemicals in the validation set. The validation set only partially
overlaps the training set, so it splits into compounds in and out of the domain. Figure 2
shows different projections of both sets. Experimental and estimated log Kow values, together
with absolute and relative prediction errors, are shown in Table 3.
Figure 2 Projections of training set (¼) and validation set (…) coverage. a) web plot of 7 of
the individual descriptors (b) fragment C and fragment F, (c) fragment –O- and fragment CH2.
Table 3 Selected KOWWIN validation set compounds out of the training set ranges and
corresponding experimental and calculated Log Kow values.
Comparison of different methods to approximate training set coverage by interpolation
Table 4 provides a summary of the statistics for different applicability domain estimation
methods applied to the validation set: the number of compounds in and out of the domain and the
root mean squared error (RMSE). The RMSE for the training set is 0.22.
Table 4. Summary of the statistics for different applicability domain estimation methods
applied to the validation set.
The developers of the KOWWIN model did not perform the scaling and PCA preprocessing steps.
This may affect the quality and stability of the model. To compensate for it and to obtain a
correct applicability domain, such pretreatment is necessary. Pretreating the data for
applicability domain assessment when no pretreatment was applied during model development
complicates interpretation of the domain. Ideally, pretreatment should be performed during
model development so that the domain can be assessed in the model space.
Pretreatment, specifically PC rotation, had an effect on the ranges approach and little effect
on the distance-based approaches (Table 4). The lack of a large difference for the
distance-based approaches in the case of the KOWWIN data set arises because the scale factors
are almost the same along all dimensions and because the distance approaches considered assume
a normal distribution of the data, which is symmetric. Because principal component rotation is
not invariant to scaling, i.e. the principal components extracted from the original data are
not the same as those extracted from the scaled data [9], we carried out the scaling step
before PC rotation.
The numbers of validation compounds in the domain are similar for the methods examined, except
for the ranges approach after PC rotation of the axes (Table 4). All the approaches result in a
lower RMSE for the validation compounds in the domain (0.43 to 0.60) than for the compounds out
of the domain (0.57 to 1.10). The ranges approach after PC rotation of the axes had the lowest
RMSE, 0.43, for the in-domain chemicals.
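The in-domain and out-of-domain RMSE values compared above follow from splitting the residuals by the domain flag. A minimal sketch, with hypothetical function names and made-up numbers that are not taken from Table 4:

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between paired observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def rmse_by_domain(observed, predicted, in_domain):
    """Return (RMSE in domain, RMSE out of domain); None for an empty group."""
    groups = {True: ([], []), False: ([], [])}
    for o, p, flag in zip(observed, predicted, in_domain):
        groups[bool(flag)][0].append(o)
        groups[bool(flag)][1].append(p)
    return tuple(
        rmse(*groups[flag]) if groups[flag][0] else None
        for flag in (True, False)
    )

# Made-up log Kow values for four validation compounds (illustration only).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.5, 3.0]
in_domain = [True, True, False, False]
rmse_in, rmse_out = rmse_by_domain(observed, predicted, in_domain)
print(round(rmse_in, 3), round(rmse_out, 3))
```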
Figures 3 and 4 illustrate the correspondence between domain assessment and prediction error
for the approaches examined. The left plots are scatter plots of calculated vs. experimental
values. The right plots show the distance or range of each point plotted against the residual
(prediction error) for that compound. In the case of ranges, the number on the abscissa is the
number of dimensions in which the point is out of the training set range (i.e. zero means in
range). No clear correlation exists between distance and prediction error, but Table 4 shows
that, on average, validation compounds outside the training set coverage have much larger
prediction errors than compounds inside it. For example, using the ranges approach, the
relative prediction error for compounds inside the training set coverage spans from 0.65% to
33%, whereas for compounds outside the coverage it spans from 8% to 600%.
Figure 3. The correspondence between domain assessment and prediction error for ranges
approach.
Figure 4. The correspondence between domain assessment and prediction error for Euclidean
distance approach.
Discussion
In this paper, we examined the definition of the applicability domain as training data set
coverage in the multivariate space of the model parameters and assessed it with several
interpolation methods. We conclude that for high-dimensional models the ranges approach is the
simplest practical one; however, it is a compromise, because the data distribution in the
training set does not meet the assumption of uniformity. This means that a lot of empty space
not covered by the training set is deemed to be within the domain of the model. The ranges
approach is a refinement of applicability domain assessment compared with the check currently
implemented in KOWWIN, which verifies only whether a given fragment exists in the training set.
PC rotation was necessary because the fragments are highly correlated.
The training space as defined by the fragment and correction factor ranges consists of
5.44E+41 unique points. Out of this enormous space, the training set uses only 2113 unique
points (some of the 2434 points coincide). This means that only 3.88E-37 % of the training
space is covered by the training set points. Good practical experience with the KOWWIN model
suggests that additivity and transferability of fragments work reasonably well within the
training set space. The AFC method has problems with additivity of fragments for rigid
aromatic molecules and for compounds in which the same fragment occurs many times, such as a
long aliphatic chain. The method also fails for molecules with “uncommon” functional groups:
transferability of these fragments is difficult to establish because of large uncertainties in
their estimated contributions. Fragment ranges provide a rough estimate of the additivity
boundaries.
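The coverage figure quoted above is a one-line calculation once the size of the training space is known. The sketch below assumes, as one plausible reading, that the training space is the integer lattice spanned by the per-descriptor ranges, so that its size is the product of (max - min + 1) over all descriptors. The function names and the three-descriptor example are hypothetical; only 2113 and 5.44E+41 are taken from the text.

```python
import math

def lattice_points(ranges):
    """Number of integer points in the box spanned by inclusive (min, max) ranges."""
    return math.prod(hi - lo + 1 for lo, hi in ranges)

def coverage_percent(n_unique_points, n_lattice_points):
    """Share of the lattice occupied by distinct training compounds, in percent."""
    return 100.0 * n_unique_points / n_lattice_points

# Hypothetical three-descriptor space (counts allowed from 0 up to a maximum).
toy_ranges = [(0, 24), (0, 13), (0, 18)]
print(lattice_points(toy_ranges))

# Figures quoted in the text: 2113 unique points in a space of 5.44E+41 points.
pct = coverage_percent(2113, 5.44e41)
print(f"{pct:.2e} %")
```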
More precise assessment of the applicability domain for high-dimensional models requires the
development of approaches for which high dimensionality is not a limiting factor. One
possible approach is to define where the model assumptions are valid. Let us examine the
possibility of verifying the additivity and transferability assumptions of a group
contribution method. Additivity [10, 11] implies that each of the structural components of a
compound makes a separate and additive contribution to the property of interest.
Transferability assumes that these contributions are the same across a wide variety of
compounds [10].
Additivity is a widely accepted hypothesis, supported by evidence from both empirical studies
[11] and contemporary quantum theories [12]. While quantum mechanics predicts the properties
of open systems to be additive, this “additivity” can be observed experimentally only when the
contribution of the atom or fragment is also transferable without apparent change from one
compound to another. Defining additivity and transferability boundaries has so far been
difficult to formalize. This is partly because, until recently, fragments were determined
empirically, as was done in KOWWIN. Advances in the understanding of additivity, and
especially of the transferability of fragmental contributions, may lead the way to redefining
fragments on the basis of theoretical considerations, which are far easier to verify
[11, 12, 13].
Even if progress is made in estimating the domain by better characterizing the training set
coverage and verifying the model’s assumptions, the assessment will still provide a warning
rather than an ultimate reason to reject or accept a prediction. The representation of
chemical compounds by their properties may not always be unique (i.e. two different compounds
may have the same representation by the subset of selected properties), and a non-unique
representation carries the risk of obtaining a correct result for one compound and a wrong
result for another. The lack of uniqueness could be avoided only if the set of descriptors
contained all the information about a chemical compound, which is practically impossible.
Models using a small number of descriptors are therefore especially prone to this problem,
while models using a large number of parameters, like AFC, are less prone because the chance
of missing a parameter relevant to explaining the activity is smaller.
Conclusions
A key component of evaluating the quality of a QSAR prediction is determining whether the
prediction comes from the applicability domain. The training data set coverage provides a
basis for estimating the model’s applicability domain. For high-dimensional models the choice
of estimation method is not trivial. One has to find a compromise between a method suited to
the given data distribution and one suited to the number of data points available. We
recommend the simplest approach, ranges, as a practical and acceptable compromise for group
contribution models. At the same time, we recognize the need for more research towards methods
for which dimensionality is not a limiting factor. One possible approach is to develop a
theoretical understanding of the two key assumptions of the group contribution method,
additivity and transferability, which can be used to verify applicability domain boundaries.
Acknowledgements: The training and validation sets of the KOWWIN model were kindly
provided by Syracuse Research Corp. (P. Howard). Nina Nikolova's work was funded by a
Procter & Gamble postdoctoral fellowship. We also acknowledge partial funding by ECVAM
project CCR.496575-Z.
References
[1] Jaworska J., Comber M., Van Leeuwen C., Auer C. (2003) Summary of the workshop on
regulatory acceptance of QSARs. Environmental Health Perspectives 111(10), 1358-1360.
[2] Jaworska J., Nikolova-Jeliazkova N., Aldenberg T. (2005) Review of methods for QSAR
applicability domain estimation by the training set. ATLA.
[3] Nikolova N., Jaworska J. (2003) Approaches to measure chemical similarity - a review.
QSAR & Combinatorial Science 22, 1006-1026.
[4] Bender A., Glen R.C. (2004) Molecular similarity: a key technique in molecular
informatics. Journal of Organic and Biomolecular Chemistry 2, 3204-3218.
[5] Meylan W.M., Howard P.H. (1995) Atom/fragment contribution method for estimating
octanol-water partition coefficients. Journal of Pharmaceutical Sciences 84, 83-92.
[6] Meylan W.M., Howard P.H., Boethling R.S. (1996) Improved Method for Estimating
Bioconcentration / Bioaccumulation Factor from Octanol/Water Partition Coefficient.
Environmental Toxicology and Chemistry 18(4), 664-672.
[7] Silverman B.W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and
Hall, Monographs on Statistics and Applied Probability 26, London. 9, p. 170.
[8] Eriksson L., Jaworska J., Worth A., Cronin M.T.D., McDowell R.M., Gramatica P. (2003)
Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of
Classification- and Regression-Based QSARs. Environmental Health Perspectives 111(10),
1351-1375.
[9] Seber G.A.F. (1984) Multivariate Observations. Wiley and Sons Inc., New York, pp. 671.
[10] McNaught A.D., Wilkinson A., eds. (1997) Compendium of Chemical Terminology. Blackwell
Science, London, pp. 103.
[11] Benson S.W., Cruickshank F.R., Golden D.M., Haugen G.R., O'Neal H.E., Rodgers A.S.,
Shaw R., Walsh R. (1969) Additivity rules for the estimation of thermochemical properties.
Chemical Reviews 69, 279-324.
[12] Bader R., Bayles D. (2000) Properties of Atoms in Molecules: Group Additivity. Journal of
Physical Chemistry A 104(23), 5579-5589.
[13] Curutchet C., Salichs A., Barril X., Orozco M., Luque F.J. (2003) Transferability of
Fragmental Contributions to the octanol/water partition coefficient: an NDDO based MST study.
Journal of Computational Chemistry 24, 32-45.
Table 1. Formulas and assumptions for different interpolation methods.

| Method | Formula | Assumptions on data distribution |
|---|---|---|
| Ranges | $d(x, y) = \lvert x - y \rvert$ | Uniform |
| Euclidean distance | $D_E(x, \mu) = (x - \mu)^T (x - \mu)$ | Normal, equal variances, uncorrelated variables |
| City block | $d(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$ | Uniform |
| Mahalanobis distance (leverage and Hotelling T² are proportional to M.D.) | $d(x, y) = (x - y)^T \Sigma^{-1} (x - y)$, where $\Sigma^{-1}$ is the inverse of the covariance matrix | Normal, arbitrary variances, arbitrary correlation |
Table 2. Fragment list for the KOWWIN training and validation sets (1).

| Fragment | Training set: Frequency (2) | Training set: Min | Training set: Max | Validation set: Frequency (2) | Validation set: Min | Validation set: Max |
|---|---|---|---|---|---|---|
| Aromatic Carbon | 1786 (73%) | 2 | 24 | 8725 (80%) | 1 | 30 |
| CH3[aliphatic carbon] | 1388 (57%) | 1 | 13 | 7353 (67%) | 1 | 20 |
| CH2[aliphatic carbon] | 1076 (44%) | 1 | 18 | 7016 (64%) | 1 | 28 |
| CH[aliphatic carbon] | 457 (18%) | 1 | 16 | 3839 (35%) | 1 | 23 |
| C[aliphatic carbon-No H not tert] | 229 (9%) | 1 | 3 | 1343 (12%) | 1 | 11 |
| O[oxygen aliphatic attach] | 108 (4%) | 1 | 5 | 1231 (11%) | 1 | 12 |
| F[fluorine aliphatic attach] | 103 (4%) | 1 | 6 | 540 (5%) | 1 | 23 |
| Cl[chlorine aliphatic attach] | 100 (4%) | 1 | 6 | 354 (3%) | 1 | 12 |
| Si-[silicon aromatic or oxygen attach] | 15 (0.6%) | 1 | 4 | 14 (0.1%) | 1 | 9 |

(1) Full list available from the authors
(2) Absolute (relative)
Table 3. Selected KOWWIN validation set compounds out of the training set ranges and
corresponding experimental and calculated Log KOW values *.

| CAS | Name | SRC KOWWIN estimated value | Experimental value | Absolute error | Relative error % |
|---|---|---|---|---|---|
| 678262 | Pentane,dodecafluoro | 5.05 | 4.4 | 0.65 | 15 |
| 355680 | Perfluorocyclohexane | 3.33 | 2.91 | 0.42 | 14 |
| 47071114 | 4,6-NH2 2,2-DiMe1(4-CF3)Ph s-triazene | 1.28 | 1.22 | 0.06 | 4.92 |
| 80616597 | Butanamide,N(5amino1H1,2,4triazol3yl)2, tafluo | 1.53 | 1.54 | -0.01 | 0.65 |
| 77963509 | B30C10 Benzocrownether | -0.15 | 0.03 | -0.18 | 600 |
| 104946625 | B33C11 Benzocrownether | -0.43 | -0.09 | -0.34 | 378 |
| 63144763 | B27C9 Benzocrownether | 0.12 | 0.23 | -0.11 | 48 |
| 88116590 | Iohexolderivative | 1.94 | -2.80 | 4.74 | 169 |
| 2915 | 3AZAGLUTARAMIDEANALOGA37 | 4 | 3.6 | 0.4 | 11 |
| 93414552 | Benzoic acid, 3,4,5-trimethoxy-, 2-[4-[[(2-oxoethoxy)imino]methyl]-2-methoxypheno | 3.26 | 2.94 | 0.32 | 11 |
| 121284206 | [8,8]DB48C16Dibenzocrownether | 0.56 | 0.52 | 0.04 | 8 |
| 3298 | PERFLUOROPMETHYLCYCLOHEXYL | 4.79 | 7.1 | -2.31 | 33 |

* Full list available from the authors; CAS numbers and names are taken from KOWWIN output
Table 4. Summary of the statistics for different applicability domain estimation methods
applied to validation set.

| Nr | Domain defined by | PC space | Validation (in): No. compounds | Validation (in): RMSE | Validation (out): No. compounds | Validation (out): RMSE |
|---|---|---|---|---|---|---|
| 1 | Ranges | | 10247 | 0.46 | 597 | 0.74 |
| 3 | Euclidean distance | | 10796 | 0.47 | 48 | 0.94 |
| 5 | City block distance | | 10797 | 0.47 | 47 | 0.96 |
| 7 | Hotelling T2 | | 10685 | 0.59 | 160 | 0.73 |
| 11 | Ranges (scaled data) | yes | 7460 | 0.43 | 3384 | 0.57 |
| 13 | Euclidean distance - (Mahalanobis) distance (scaled) | yes | 10187 | 0.47 | 27 | 1.10 |
| 15 | Hotelling T2/leverage (scaled) | yes | 10749 | 0.60 | 96 | 0.67 |
| 17 | City block distance (scaled data) | yes | 10708 | 0.46 | 136 | 0.97 |
Figure 1. SRC KOWWIN output for a compound
SMILES : Oc(c(cc(c1)Cc(cc(c(O)c2C(C)(C)C)C(C)(C)C)c2)C(C)(C)C)c1C(C)(C)C
CHEM   : Phenol, 4,4'-methylenebis 2,6-bis(1,1-dimethylethyl)-
MOL FOR: C29 H44 O2
MOL WT : 424.67
-------+-----+--------------------------------------------+---------+--------
 TYPE  | NUM |        LOGKOW FRAGMENT DESCRIPTION         |  COEFF  |  VALUE
-------+-----+--------------------------------------------+---------+--------
 Frag  | 12  |  -CH3    [aliphatic carbon]                | 0.5473  |  6.5676
 Frag  |  1  |  -CH2-   [aliphatic carbon]                | 0.4911  |  0.4911
 Frag  | 12  |  Aromatic Carbon                           | 0.2940  |  3.5280
 Frag  |  2  |  -OH     [hydroxy, aromatic attach]        |-0.4802  | -0.9604
 Frag  |  4  |  -tert Carbon [3 or more carbon attach]    | 0.2676  |  1.0704
 Factor|  1  |  -CH2- (aliphatic), 2 phenyl attach correc |-0.2326  | -0.2326
 Factor|  2  |  Ring rx: -OH / di-ortho;sec- or t- carbon |-0.8500  | -1.7000
 Const |     |  Equation Constant                         |         |  0.2290
-------+-----+--------------------------------------------+---------+--------
                                                         Log Kow   =   8.9931
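The additivity premise behind this output can be checked by hand: the estimate is the sum of count times coefficient over all fragments and correction factors, plus the equation constant. The short Python sketch below reproduces the value reported in Figure 1; the counts, coefficients and constant are copied from the figure, while the variable and function names are hypothetical.

```python
# (count, coefficient) pairs copied from the Figure 1 output.
TERMS = [
    (12, 0.5473),   # -CH3 [aliphatic carbon]
    (1, 0.4911),    # -CH2- [aliphatic carbon]
    (12, 0.2940),   # Aromatic Carbon
    (2, -0.4802),   # -OH [hydroxy, aromatic attach]
    (4, 0.2676),    # -tert Carbon [3 or more carbon attach]
    (1, -0.2326),   # correction: -CH2- (aliphatic), 2 phenyl attach
    (2, -0.8500),   # correction: ring reaction -OH / di-ortho
]
EQUATION_CONSTANT = 0.2290

def additive_log_kow(terms, constant):
    """Group contribution estimate: sum of count * coefficient plus a constant."""
    return sum(count * coeff for count, coeff in terms) + constant

estimate = additive_log_kow(TERMS, EQUATION_CONSTANT)
print(round(estimate, 4))
```

Rounded to four decimals, the sum agrees with the Log Kow line of the figure.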
Figure 2. Projections of training set (¼) and validation set (…) coverage: (a) web plot of 7 of
the individual descriptors, (b) fragment C and fragment F, (c) fragment –O- and fragment CH2.
Figure 3. The correspondence between domain assessment and prediction error for ranges in
descriptor space and PC rotated descriptor space approaches. y chemicals in the domain; U–
chemicals out of the domain.
(a) Results with ranges in descriptor space
(b) Results with ranges in PC space
Figure 4. The correspondence between domain assessment and prediction error for Euclidean
distance in descriptor space and PC rotated descriptor space approaches; y chemicals in the
domain; U– chemicals out of the domain.
(b) Results after PC rotation are not shown; they are very similar to (a).