REVIEW OF METHODS FOR ASSESSING THE APPLICABILITY DOMAINS OF SARS AND QSARS

PAPER 2: An approach to determining applicability domain for QSAR group contribution models: an analysis of SRC KOWWIN

Authors:
Dr Nina Nikolova (Bulgarian Academy of Sciences, Sofia, Bulgaria), E-mail: nina@acad.bg
Dr Joanna Jaworska (Procter & Gamble, Strombeek-Bever, Belgium), E-mail: jaworska.j@pg.com

Sponsor: The European Commission - Joint Research Centre, Institute for Health & Consumer Protection - ECVAM, 21020 Ispra (VA), Italy
Contact: Dr Andrew Worth, E-mail: andrew.worth@jrc.it
http://ecb.jrc.it/QSAR
JRC Contract ECVA-CCR.496575-Z

VERSION OF 28 JANUARY 2005

Nina Nikolova1 and Joanna Jaworska2*
1 IPP - Bulgarian Academy of Sciences, 25A “acad. G.Bonchev” str., 1113 Sofia, Bulgaria, nina@acad.bg
2 Procter and Gamble, Eurocor, Central Product Safety, 100 Temselaan, B-1853 Strombeek-Bever, Belgium, Fax 32 2 5683098, Tel 32 2 456 2076, Jaworska.j@pg.com
* To receive all correspondence and reprints

Summary

The Setubal Workshop report [1] provided conceptual guidance on the definition of a (Q)SAR applicability domain. However, applying the guidance in practice requires an operational definition that allows an automatic (computerized) procedure to be designed for determining the applicability domain of a model. This paper attempts to address that need for models that use a large number of descriptors, such as group contribution based models. The high dimensionality of these models imposes special practical computational restrictions on the estimation of the interpolation region.
As an example we analyse the KOWWIN model for n-octanol/water partition coefficient prediction from Syracuse Research Corporation (SRC), which uses 508 descriptors. We conclude that the ranges approach combined with Principal Component rotation is an acceptable compromise between a method suited to the data distribution of a given training set and a method suited to the number of points available in that set.

Key words: QSAR, applicability domain, KOWWIN, group contribution method

Introduction

Predictions made within the applicability domain of a QSAR model should be reliable. The Setubal workshop report [1] offered the following guidance on applicability domain assessment: “The applicability domain of a (Q)SAR is the physico-chemical, structural, or biological space, knowledge or information on which the training set of the model has been developed, and for which it is applicable to make predictions for new compounds. The applicability domain of a (Q)SAR should be described in terms of the most relevant parameters i.e. usually those that are descriptors of the model. Ideally the (Q)SAR should only be used to make predictions within that domain by interpolation not extrapolation”. This description is helpful in explaining the intuitive meaning of the “applicability domain” concept. However, practical assessment of the domain needs guidance on the method and the boundary criteria.

Two approaches to defining the applicability domain have been developed to date. The first estimates the training set coverage in the model’s descriptor space and has recently been reviewed for regression and classification models [2]. The second is based on similarity analysis, on the premise that a QSAR prediction is reliable if the compound is “similar” to the compounds in the training set.
This approach is more difficult to apply than the first, because “similarity” is a subjective term and different notions of similarity are relevant to different endpoints [3, 4].

This paper attempts to develop practical guidance for applicability domain assessment for high dimensional models such as group contribution based models. The underlying premise of these methods is that the property of a compound is the sum of the contributions associated with its atoms or fragments (additivity), assuming that the contribution of an identical atom or fragment is the same as in the original compounds used to develop these contributions (transferability). The group contribution method is a very robust approach to developing QSAR models for broad chemical classes. Group contribution models use hundreds of fragments as model descriptors. We examine which interpolation methods are suitable for high dimensional models and analyse the n-octanol/water partition coefficient model KOWWIN from SRC [5, 6], with 508 descriptors, as a working case study.

Methods

Interpolation in the multivariate space

Calculation of the interpolation region in a multivariate space is equivalent to estimation of the convex hull [2]. Convex hull calculation in high dimensional space is very computationally intensive. In this paper we compare the following convex hull approximations:
1. Ranges in descriptor space
2. Euclidean distance
3. City-block distance
4. Mahalanobis distance
5. Leverage/Hotelling T2

Table 1 gives an overview of the formulas and underlying assumptions for each method. For more detail, see a recent review [2]. We do not consider the probability density approach because parametric methods based on the normal distribution yield the same results as approaches 3, 4 and 5, and there is not enough data in the training set to use a nonparametric probability density method [7]. We explore and compare results obtained for the raw data and for data that have been scaled, centered and rotated to Principal Components, as advocated in [8, 9].
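To make the approximations listed above concrete, the following is a minimal sketch (not the authors' implementation) of the ranges check and of two of the distance checks. It assumes the descriptors are available as a plain table of fragment counts, and it takes the distance threshold to be the largest distance from a training point to the training-set centre, which is the criterion adopted in this paper. All function names and the toy data are hypothetical.

```python
import math

def fit_ranges(train):
    """Per-descriptor (min, max) over the training rows."""
    return [(min(col), max(col)) for col in zip(*train)]

def n_out_of_range(x, ranges):
    """Number of descriptors of x outside the training range (0 = in domain)."""
    return sum(1 for v, (lo, hi) in zip(x, ranges) if v < lo or v > hi)

def centroid(train):
    """Mean of each descriptor over the training rows."""
    n = len(train)
    return [sum(col) / n for col in zip(*train)]

def euclidean(x, mu):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mu)))

def city_block(x, mu):
    return sum(abs(a - b) for a, b in zip(x, mu))

def fit_distance_domain(train, dist):
    """Return (centre, threshold); threshold = largest training distance."""
    mu = centroid(train)
    return mu, max(dist(row, mu) for row in train)

def in_distance_domain(x, mu, threshold, dist):
    return dist(x, mu) <= threshold

# Toy fragment-count table (rows = compounds, columns = fragments).
# The numbers are invented for illustration only.
train = [[0, 1, 0], [2, 3, 1], [4, 1, 0], [6, 2, 1]]
ranges = fit_ranges(train)
mu, thr = fit_distance_domain(train, euclidean)

query = [7, 0, 1]
print(n_out_of_range(query, ranges))                 # descriptors out of range
print(in_distance_domain(query, mu, thr, euclidean)) # inside the distance domain?
```

The Mahalanobis and Hotelling T2 variants are omitted here because they additionally need the inverse covariance matrix of the training data; the thresholding logic would be the same.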
Table 1. Formulas and assumptions for different interpolation methods.

Applicability domain criteria

A compound is labelled out of the domain if:
1. At least one fragment count or correction factor is out of range (ranges approach);
2. The distance between the chemical and the center of the training data set exceeds a threshold (distance approaches). The threshold for all distances and for Hotelling T2 is the largest distance from a training set data point to the center of the training data set (i.e. the distance to the most distant point).

Though the criteria may appear quite different, they serve the same purpose: to estimate the smallest space encompassing the whole training set.

Results of SRC KOWWIN case study

KOWWIN descriptors determination

The description of the atom/fragment contribution (AFC) method [5] provides only a partial list of fragments and correction factors. Fragments are described in textual form and in many cases no explicit structure is given. Several listed fragments have ambiguous descriptions, which makes it difficult to reuse the AFC method directly. Furthermore, the list of fragments and correction factors differs slightly between versions of the SRC KOWWIN software, because more compounds and fragments are added. Therefore we decided to use the most reliable source for the KOWWIN descriptor space: the full text output of the SRC KOWWIN v1.66 software (Figure 1). This is a text file listing all fragments and factors, their frequencies and weights applicable to each compound in the training set.

Figure 1. SRC KOWWIN text output for a compound.

A software tool was developed to parse the text output and produce a table in which the columns are all possible fragments and correction factors and the rows are compounds. Each cell in the table gives the number of times a fragment occurs in a compound. The descriptor space of the KOWWIN model was obtained by running all 2434 compounds from the training set through the software.
This revealed 186 different fragments and 322 different correction factors, resulting in a 508-dimensional descriptor space (Table 2). The log Kow values vary between 4.57 and 8.19. All 10910 compounds from the validation set were also run through the software. The validation set makes use of 172 (out of 186) fragments and 316 (out of 322) correction factors (Table 2). The log Kow values in the validation set vary between -4.99 and 11.71. The quality of the very high log Kow values (above ca. 8) may need to be reviewed, but this is outside the scope of this paper.

Table 2. Fragment list for the KOWWIN training and validation sets.

The full list of fragments and correction factors used in the KOWWIN model, together with the ranges for each fragment and correction factor, is not presented here but is available from the authors.

The descriptors were evaluated for uniform distribution with the Kolmogorov-Smirnov test in MATLAB at the default rejection level of 5%, and all of them failed the test. Distributions of individual descriptors were also evaluated for normality with the Jarque-Bera test in MATLAB at the default rejection level of 5%. The Jarque-Bera test evaluates the hypothesis that X has a normal distribution with unspecified mean and variance, against the alternative that X does not have a normal distribution. According to this test none of the descriptors is normally distributed. This suggests that the ranges and distance-based approaches may not reflect the data distribution well, and that determining the interpolation region calls for a more sophisticated technique such as nonparametric probability density estimation. However, we lack sufficient data to use this method.

Finally, we scaled and centered the data and rotated the axes to orthogonal principal component (PC) axes. This step is important for KOWWIN because the descriptors in the KOWWIN model are highly correlated.
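The pretreatment just described (centering, scaling, rotation to principal components) followed by a ranges check in PC space can be sketched as below. This is an illustrative sketch using NumPy, not the MATLAB procedure used in the paper; the function names, the small tolerance `tol`, the handling of constant descriptors and the toy data are all assumptions made for the example.

```python
import numpy as np

def fit_pc_ranges(X_train):
    """Centre, scale to unit variance, rotate to PCs, store per-PC ranges."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # assumption: leave constant descriptors unscaled
    Z = (X_train - mean) / std
    # Rows of Vt are the principal axes of the scaled, centred data.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt.T
    return {
        "mean": mean, "std": std, "axes": Vt,
        "lo": scores.min(axis=0), "hi": scores.max(axis=0),
        "explained": s**2 / np.sum(s**2),  # variance fraction per PC
    }

def n_pcs_for(explained, fraction):
    """Smallest number of leading PCs whose cumulative variance reaches fraction."""
    k = int(np.searchsorted(np.cumsum(explained), fraction)) + 1
    return min(k, len(explained))

def in_pc_ranges(model, X, tol=1e-9):
    """True for each row of X whose PC scores all lie within the training ranges."""
    X = np.asarray(X, dtype=float)
    scores = ((X - model["mean"]) / model["std"]) @ model["axes"].T
    ok = (scores >= model["lo"] - tol) & (scores <= model["hi"] + tol)
    return np.all(ok, axis=1)

# Two strongly correlated toy descriptors (invented data).
X = np.array([[0.0, 0.0], [1.0, 1.2], [2.0, 1.8], [3.0, 3.0]])
model = fit_pc_ranges(X)

# [0, 3] lies inside both raw descriptor ranges but far off the
# correlation line, so it falls outside the ranges in PC space.
inside = in_pc_ranges(model, [[1.5, 1.5], [0.0, 3.0]])
print(inside, n_pcs_for(model["explained"], 0.9))
```

The second query point shows why rotation matters for correlated descriptors: a ranges check on the raw axes would accept it, while the check on the rotated axes does not.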
The results of PCA on the original data reveal that the first 16 principal components (PCs) explain 90% of the variance and the first 36 PCs explain 95% of the variance. The PCA on the scaled, centered data showed more balance: the first 197 PCs explain 90% of the variance and the first 282 PCs explain 95% of the variance.

Comparison between KOWWIN training and validation set predictions

In order to assess the quality of the applicability domain assessment we compared the observed and predicted results for the chemicals in the validation set. The validation set only partially overlaps the training set, so it splits into compounds in the domain and compounds out of the domain. Figure 2 shows different projections of both sets. Statistics for experimental and estimated Log Kow values and for absolute and relative prediction errors are shown in Table 3.

Figure 2. Projections of training set and validation set coverage: (a) web plot of 7 of the individual descriptors, (b) fragment C and fragment F, (c) fragment -O- and fragment CH2.

Table 3. Selected KOWWIN validation set compounds out of the training set ranges and corresponding experimental and calculated Log Kow values.

Comparison of different methods to approximate training set coverage by interpolation

Table 4 provides a summary of the statistics for the different applicability domain estimation methods applied to the validation set: the number of compounds in and out of the domain and the root mean squared error (RMSE). The RMSE for the training set is 0.22.

Table 4. Summary of the statistics for different applicability domain estimation methods applied to the validation set.

The developers of the KOWWIN model did not perform the scaling and PCA preprocessing steps. This may affect the quality and stability of the model. To compensate for it and to obtain a correct applicability domain, such pretreatment is necessary. Applying pretreatment for the applicability domain assessment when none was applied during model development complicates the interpretation of the domain.
Ideally, pretreatment should be performed during model development so that the domain can be assessed in the model space. Pretreatment, specifically PC rotation, had an effect on the ranges approach and little effect on the distance based approaches (Table 4). For the KOWWIN data set, the small difference in the results of the distance based approaches arises because the scale factors are almost the same along all dimensions and because the distance approaches considered assume a normal, and hence symmetric, data distribution. Because principal component rotation is not invariant to scaling, i.e. the principal components extracted from the original data are not the same as the principal components extracted from the scaled data [9], we carried out the scaling step before PC rotation.

The numbers of validation compounds in the domain are similar for the methods examined, except for the ranges approach after PC rotation of the axes (Table 4). All the approaches result in a lower RMSE for the validation compounds in the domain (0.43 to 0.60) than for the compounds out of the domain (0.57 to 1.10). The ranges approach after PC rotation of the axes had the lowest RMSE for the in-domain chemicals, 0.43 (Table 4).

Figures 3 and 4 illustrate the correspondence between domain assessment and prediction error for the approaches examined. The left plots are scatter plots of calculated vs. experimental values. The right plots show the distance or range of each point plotted against the residual (prediction error) for that compound. In the case of ranges, the number on the abscissa is the number of dimensions in which the point is outside the training set range (i.e. zero means in range). No clear correlation exists between distances and prediction error, but Table 4 shows that, on average, validation compounds outside the training set coverage have much larger prediction errors than compounds inside the training set coverage.
For example, using the ranges approach, the relative prediction error for compounds inside the training set coverage spans from 0.65% to 33%. For compounds outside the coverage of the training set the relative prediction error spans from 8% to 600%.

Figure 3. The correspondence between domain assessment and prediction error for ranges approach.

Figure 4. The correspondence between domain assessment and prediction error for Euclidean distance approach.

Discussion

In this paper, we examined the definition of the applicability domain as the training data set coverage in the multivariate space of the model parameters and assessed it with several interpolation methods. We conclude that for high dimension models the ranges approach is the simplest practical approach; however, it is a compromise, because the data distribution in the training set does not meet the assumption of uniformity. This means that a lot of empty space not covered by the training set is deemed to be in the domain of the model. The ranges approach refines the check currently implemented in KOWWIN, which only verifies whether a given fragment exists in the training set. The PC rotation was necessary because the fragments are highly correlated.

The training space as defined by the fragment and correction factor ranges consists of 5.44E+41 unique points. Out of this enormous space, the training set uses only 2113 unique points (some of the 2434 points coincide). This means that only 3.88E-37 % of the training space is covered by the training set points. Good practical experience with the KOWWIN model means that additivity and transferability of fragments work reasonably well within the training set space. The AFC method has problems with additivity of fragments for rigid aromatic molecules and for compounds in which the same fragment occurs many times in a molecule, such as a long aliphatic chain.
The method also fails for molecules with “uncommon” functional groups: transferability of these fragments is difficult to establish because of large uncertainties in their estimated contributions. Fragment ranges provide a rough estimate of the additivity boundaries.

More precise assessment of the applicability domain for high dimension models requires the development of approaches for which high dimensionality is not a limiting factor. One possible approach is to define where the model assumptions are valid. Let us examine the possibility of verifying the additivity and transferability assumptions of a group contribution method. Additivity [10, 11] implies that each of the structural components of a compound makes a separate and additive contribution to the property of interest for the compound. Transferability assumes that these contributions are the same across a wide variety of compounds [10]. Additivity is a widely accepted hypothesis, with evidence provided by both empirical studies [11] and contemporary quantum theories [12]. While quantum mechanics predicts the properties of open systems to be additive, this “additivity” can be observed experimentally only when the contribution of the atom or fragment is also transferable without apparent change from one compound to another. Defining additivity and transferability boundaries has so far been difficult to formalize. This is, in part, because until recently fragments were determined empirically, as was done in KOWWIN. Advances in the understanding of additivity, and especially of the transferability of fragmental contributions, may lead the way to redefining fragments on the basis of theoretical considerations, which are far easier to verify [11, 12, 13].

Even if progress is made in estimating the domain by better characterizing the training set coverage and by verifying the model’s assumptions, the assessment will still provide a warning rather than an ultimate reason for rejecting or accepting a prediction.
The representation of chemical compounds by their properties may not always be unique (i.e. two different compounds may have the same representation in the subset of selected properties), and a non-unique representation carries the risk of obtaining a correct result for one compound and a wrong result for another. The lack of uniqueness could be avoided only if the set of descriptors used contained all the information about a chemical compound, but this is practically impossible. Thus models using a small number of descriptors are especially prone to this problem, while models using a large number of parameters, like AFC, are less prone because the chances of missing a parameter relevant to explaining the activity are smaller.

Conclusions

A key component of evaluating the quality of a QSAR prediction is determining whether the prediction falls within the applicability domain. The training data set coverage provides a basis for estimating the model’s applicability domain. For high dimension models the choice of estimation method is not trivial. One has to find a compromise between a method suited to the given data distribution and a method suited to the number of data points available. We recommend the simplest approach, ranges, as a practical and acceptable compromise for group contribution models. At the same time, we recognize the need for more research towards the development of methods for which dimensionality is not a limiting factor. One possible approach is to develop a theoretical understanding of the two key assumptions of the group contribution method, additivity and transferability, that can be used for verification of applicability domain boundaries.

Acknowledgements: The training and validation sets of the KOWWIN model were kindly provided by Syracuse Research Corp. (P. Howard). Nina Nikolova's work was funded by a Procter & Gamble postdoctoral fellowship. We also acknowledge partial funding by ECVAM project CCR.496575-Z.

References

[1] Jaworska J., Comber M., Van Leeuwen C., Auer C.
(2003) Summary of the workshop on regulatory acceptance of QSARs. Environmental Health Perspectives 111(10), 1358-1360.
[2] Jaworska J., Nikolova-Jeliazkova N., Aldenberg T. (2005) Review of methods for QSAR applicability domain estimation by the training set. ATLA.
[3] Nikolova N., Jaworska J. (2003) Approaches to measure chemical similarity - a review. QSAR & Combinatorial Science 22, 1006-1026.
[4] Bender A., Glen R.C. (2004) Molecular similarity: a key technique in molecular informatics. Journal of Organic and Biomolecular Chemistry 2, 3204-3218.
[5] Meylan W.M., Howard P.H. (1995) Atom/fragment contribution method for estimating octanol-water partition coefficients. Journal of Pharmaceutical Sciences 84, 83-92.
[6] Meylan W.M., Howard P.H., Boethling R.S. (1996) Improved Method for Estimating Bioconcentration / Bioaccumulation Factor from Octanol/Water Partition Coefficient. Environmental Toxicology and Chemistry 18(4), 664-672.
[7] Silverman B.W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall, Monographs on Statistics and Applied Probability 26, London. 9, p.170
[8] Eriksson L., Jaworska J., Worth A., Cronin M.T.D., McDowell R.M., Gramatica P. (2003) Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification- and Regression-Based QSARs. Environmental Health Perspectives 111(10), 1351-1375.
[9] Seber G.A.F. (1984) Multivariate Observations. Wiley and Sons Inc, New York, pp 671.
[10] McNaught A.D., Wilkinson A., eds. (1997) Compendium of Chemical Terminology. Blackwell Science, London, pp 103.
[11] Benson S.W., Cruickshank F.R., Golden D.M., Haugen G.R., O'Neal H.E., Rodgers A.S., Shaw R., Walsh R. (1969) Additivity rules for the estimation of thermochemical properties. Chemical Reviews 69, 279-324.
[12] Bader R., Bayles D. (2000) Properties of Atoms in Molecules: Group Additivity. Journal of Physical Chemistry A 104(23), 5579-5589.
[13] Curutchet C., Salichs A., Barril X., Orozco M., Luque F.J. (2003) Transferability of Fragmental Contributions to the octanol/water partition coefficient: an NDDO based MST study. Journal of Computational Chemistry 24, 32-45.

Table 1. Formulas and assumptions for different interpolation methods.

Method | Formula | Assumptions on data distribution
Ranges | $d(x, y) = \lvert x - y \rvert$ | Uniform
Euclidean distance | $D_E(x, \mu) = \sqrt{(x - \mu)^T (x - \mu)}$ | Normal, equal variances, uncorrelated variables
City block | $d(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$ | Uniform
Mahalanobis distance (leverage and Hotelling T2 are proportional to M.D.) | $d(x, y) = \sqrt{(x - y)^T \Sigma^{-1} (x - y)}$, where $\Sigma^{-1}$ is the inverse of the covariance matrix | Normal, arbitrary variances, arbitrary correlation

Table 2. Fragment list for the KOWWIN training and validation sets (1).

                                        | Training set                | Validation set
Fragment                                | Frequency (2) | Min | Max   | Frequency (2) | Min | Max
Aromatic Carbon                         | 1786 (73%)    | 2   | 24    | 8725 (80%)    | 1   | 30
CH3 [aliphatic carbon]                  | 1388 (57%)    | 1   | 13    | 7353 (67%)    | 1   | 20
CH2 [aliphatic carbon]                  | 1076 (44%)    | 1   | 18    | 7016 (64%)    | 1   | 28
CH [aliphatic carbon]                   | 457 (18%)     | 1   | 16    | 3839 (35%)    | 1   | 23
C [aliphatic carbon-No H not tert]      | 229 (9%)      | 1   | 3     | 1343 (12%)    | 1   | 11
O [oxygen aliphatic attach]             | 108 (4%)      | 1   | 5     | 1231 (11%)    | 1   | 12
F [fluorine aliphatic attach]           | 103 (4%)      | 1   | 6     | 540 (5%)      | 1   | 23
Cl [chlorine aliphatic attach]          | 100 (4%)      | 1   | 6     | 354 (3%)      | 1   | 12
Si- [silicon aromatic or oxygen attach] | 15 (0.6%)     | 1   | 4     | 14 (0.1%)     | 1   | 9

(1) Full list available from the authors
(2) Absolute (relative)

Table 3. Selected KOWWIN validation set compounds out of the training set ranges and corresponding experimental and calculated Log KOW values *.
CAS       | Name                                                                              | SRC KOWWIN estimated | Experimental value | Absolute error | Relative error %
3298      | PERFLUOROPMETHYLCYCLOHEXYL                                                        | 4.79                 | 7.1                | -2.31          | 33
678262    | Pentane,dodecafluoro                                                              | 5.05                 | 4.4                | 0.65           | 15
355680    | Perfluorocyclohexane                                                              | 3.33                 | 2.91               | 0.42           | 14
47071114  | 4,6-NH2 2,2-DiMe1(4-CF3)Ph s-triazene                                             | 1.28                 | 1.22               | 0.06           | 4.92
80616597  | Butanamide,N(5amino1H1,2,4triazol3yl)2, tafluo                                    | 1.53                 | 1.54               | -0.01          | 0.65
77963509  | B30C10 Benzocrownether                                                            | -0.15                | 0.03               | -0.18          | 600
104946625 | B33C11 Benzocrownether                                                            | -0.43                | -0.09              | -0.34          | 378
63144763  | B27C9 Benzocrownether                                                             | 0.12                 | 0.23               | -0.11          | 48
88116590  | Iohexolderivative                                                                 | 1.94                 | -2.80              | 4.74           | 169
2915      | 3AZAGLUTARAMIDEANALOGA37                                                          | 4                    | 3.6                | 0.4            | 11
93414552  | Benzoic acid, 3,4,5-trimethoxy-, 2-[4-[[(2-oxoethoxy)imino]methyl]-2-methoxypheno | 3.26                 | 2.94               | 0.32           | 11
121284206 | [8,8]DB48C16Dibenzocrownether                                                     | 0.56                 | 0.52               | 0.04           | 8

* Full list available from the authors; CAS numbers and names are taken from KOWWIN output

Table 4. Summary of the statistics for different applicability domain estimation methods applied to validation set.

   |                                           |          | Validation (in)       | Validation (out)
Nr | Domain defined by                         | PC space | No. compounds | RMSE  | No. compounds | RMSE
1  | Ranges                                    |          | 10247         | 0.46  | 597           | 0.74
3  | Euclidean distance                        |          | 10796         | 0.47  | 48            | 0.94
5  | City block distance                       |          | 10797         | 0.47  | 47            | 0.96
7  | Hotelling T2                              |          | 10685         | 0.59  | 160           | 0.73
11 | Ranges (scaled data)                      | yes      | 7460          | 0.43  | 3384          | 0.57
13 | Euclidean (Mahalanobis) distance (scaled) | yes      | 10187         | 0.47  | 27            | 1.10
15 | Hotelling T2/leverage (scaled)            | yes      | 10749         | 0.60  | 96            | 0.67
17 | City block distance (scaled data)         | yes      | 10708         | 0.46  | 136           | 0.97

Figure 1. SRC KOWWIN output for a compound.

SMILES : Oc(c(cc(c1)Cc(cc(c(O)c2C(C)(C)C)C(C)(C)C)c2)C(C)(C)C)c1C(C)(C)C
CHEM   : Phenol, 4,4'-methylenebis 2,6-bis(1,1-dimethylethyl)-
MOL FOR: C29 H44 O2
MOL WT : 424.67
-------+-----+--------------------------------------------+---------+--------
 TYPE  | NUM |        LOGKOW FRAGMENT DESCRIPTION         |  COEFF  |  VALUE
-------+-----+--------------------------------------------+---------+--------
 Frag  |  12 | -CH3 [aliphatic carbon]                    |  0.5473 |  6.5676
 Frag  |   1 | -CH2- [aliphatic carbon]                   |  0.4911 |  0.4911
 Frag  |  12 | Aromatic Carbon                            |  0.2940 |  3.5280
 Frag  |   2 | -OH [hydroxy, aromatic attach]             | -0.4802 | -0.9604
 Frag  |   4 | -tert Carbon [3 or more carbon attach]     |  0.2676 |  1.0704
 Factor|   1 | -CH2- (aliphatic), 2 phenyl attach correc  | -0.2326 | -0.2326
 Factor|   2 | Ring rx: -OH / di-ortho;sec- or t- carbon  | -0.8500 | -1.7000
 Const |     | Equation Constant                          |         |  0.2290
-------+-----+--------------------------------------------+---------+--------
                                                           Log Kow  =  8.9931

Figure 2. Projections of training set and validation set coverage: (a) web plot of 7 of the individual descriptors, (b) fragment C and fragment F, (c) fragment -O- and fragment CH2.

Figure 3. The correspondence between domain assessment and prediction error for ranges in descriptor space and PC rotated descriptor space approaches. Different markers distinguish chemicals in the domain from chemicals out of the domain. (a) Results with ranges in descriptor space. (b) Results with ranges in PC space.

Figure 4. The correspondence between domain assessment and prediction error for Euclidean distance in descriptor space and PC rotated descriptor space approaches. Different markers distinguish chemicals in the domain from chemicals out of the domain. (a) Results in descriptor space. (b) Results after PC rotation are not shown; they are very similar to (a).