Science of the Total Environment 409 (2010) 412–422 Contents lists available at ScienceDirect Science of the Total Environment j o u r n a l h o m e p a g e : w w w. e l s ev i e r. c o m / l o c a t e / s c i t o t e n v Multimedia environmental chemical partitioning from molecular information Izacar Martínez a, Jordi Grifoll a, Francesc Giralt a,⁎, Robert Rallo b a b Departament d'Enginyeria Quimica, Universitat Rovira i Virgili, Av. Paisos Catalans, 26, 43007 Tarragona, Catalunya, Spain Departament d'Enginyeria Informatica i Matematiques, Universitat Rovira i Virgili, Av. Paisos Catalans, 26, 43007 Tarragona, Catalunya, Spain a r t i c l e i n f o Article history: Received 11 July 2010 Received in revised form 1 October 2010 Accepted 11 October 2010 Available online 6 November 2010 Keywords: Multimedia environmental model Uncertainty analysis Quantitative structure–fate relationships QSAR molecular descriptors Support vector regression Domain of applicability a b s t r a c t The prospect of assessing the environmental distribution of chemicals directly from their molecular information was analyzed. Multimedia chemical partitioning of 455 chemicals, expressed in dimensionless compartmental mass ratios, was predicted by SimpleBox 3, a Level III Fugacity model, together with the propagation of reported uncertainty for key physicochemical and transport properties, and degradation rates. Chemicals, some registered in priority lists, were selected according to the availability of experimental property data to minimize the influence of predicted information in model development. Chemicals were emitted in air or water in a fixed geographical scenario representing the Netherlands and characterized by five compartments (air, water, sediments, soil and vegetation). Quantitative structure–fate relationship (QSFR) models to predict mass ratios in different compartments were developed with support vector regression algorithms. A set of molecular descriptors, including the molecular weight and 38 counts of molecular constituents were adopted to characterize the chemical space. Out of the 455 chemicals, 375 were used for training and testing the QSFR models, while 80 were excluded from model development and were used as an external validation set. Training and test chemicals were selected and the domain of applicability (DOA) of the QSFRs established by means of self-organizing maps according to structural similarity. Best results were obtained with QSFR models developed for chemicals belonging to either the class [C] and [C; O], or the class with at least one heteroatom different than oxygen in the structure. These two class-specific models, with respectively 146 and 229 chemicals, showed a predictive squared coefficient of q2 ≥ 0.90 both for air and water, which respectively dropped to q2 ≈ 0.70 and 0.40 for outlying chemicals. Prediction errors were of the same order of magnitude as the deviations associated to the uncertainty of the physicochemical and transport properties, and degradation rates. © 2010 Elsevier B.V. All rights reserved. 1. Introduction Multimedia environmental models (MEMs) are routinely used to estimate the environmental distribution of chemical pollutants based on their physicochemical and transport properties, degradation rates, site-specific geographical parameters and source emission rates (Cohen, 1986; Mackay, 2001; Cohen and Cooter, 2002a, b; den Hollander et al., 2004). MEMs serve to screen chemicals with respect to their persistence in the environment and to provide information needed to estimate the exposures and associated risks to human and ecological receptors (Efroymson and Murphy, 2001; Cohen and Cooter, 2002a; Breivik et al., 2004, 2006; Lohmann et al., 2007; and references therein). The reliability of predictions of chemical partitioning from MEMs is affected by model formulation (i.e., system definition, included environmental processes, calculation methods, etc.) and the uncertainties introduced via model parameters (Webster et al., 2004), including estimates of physicochemical parameters ⁎ Corresponding author. Tel.: + 34 977559638; fax: + 34 977559621. E-mail address: fgiralt@urv.cat (F. Giralt). 0048-9697/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.scitotenv.2010.10.016 (Cohen and Cooter, 2002a,b; Breivik and Wania, 2003). In particular, uncertainty in partitioning and degradation parameters can significantly affect predictions of chemical distribution in the environment (Kühne et al., 1997; Eisenberg et al., 1998; Kawamoto et al., 2001; Citra, 2004; Toose et al., 2004). The lack of adequate physicochemical and toxicological information for most commercial chemicals and the risk that they may represent for human health and the environment have motivated the development of new regulatory efforts (Tickner et al., 2005) such as REACH in the European Union and the Inventory Update Rule (USEPA, 2006) in the United States. These rules aim to collect information about the characteristics, emission rates and existing volumes of commercial chemicals for facilitating their screening and deciding whether to authorize or ban their production. Compiling all mandatory data will be a formidable task given the large number of chemicals that may be of concern. For example, in September 2009, the CAS registry, one of the largest substance registry databases, reported its 50-millionth unique chemical (Toussant, 2009). It is accepted that the regulatory assessments of the multimedia distribution of chemicals for which physicochemical properties and degradation data are lacking will require the use of estimation methods that I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 rely on quantitative structure–activity relationships (QSARs) (Fjodorova et al., 2008; Worth et al., 2007). In general, partitioning data (Boethling et al., 2004; Mackay, 2001) are more readily available (from experiments or estimations) relative to degradation data (Aronson et al., 2006; Howard et al., 1991; Klöpffer and Wagner, 2007; Raymond et al., 2001). QSARs are accepted worldwide in standard environmental assessments and decision-making tasks (Walker et al., 2002; Cronin et al., 2003). They are based on establishing quantitative relations between the target physicochemical (Hugo, 2002), or toxicological properties (Devillers, 2003; Mackay et al., 2003; Mackay and Webster, 2003) of chemicals and their molecular information. Often, even small structural differences between different molecules can lead to significant differences in chemical activity (Nikolova and Jaworska, 2003). Therefore, QSARs must be developed with careful considerations of data quality and diversity (Furusjö et al., 2006), and accurate discrimination of chemical descriptors (Cronin and Schultz, 2003; Stouch et al., 2003). QSARs are appropriate to use when the chemical of concern has a molecular structure (or chemical descriptors) similar to that of the chemicals used in the QSAR development (Taskinen and Yliruusi, 2003). Selecting appropriate chemical descriptors is crucial for the development of accurate QSARs as demonstrated, for example, for vapor pressure (Yaffe and Cohen, 2001; Godavarthy et al., 2006), water solubility (Yaffe et al., 2001), Henry's law constant (Yaffe et al., 2003; Modarresi et al., 2007) and octanol–water partition coefficient (Yaffe et al., 2002). QSAR development must consider the selection of model input features (Saeys et al., 2007), often from a large number of descriptors (Todeschini and Consonni, 2000; Duca and Hopfinger, 2001; Senese et al., 2004; Bredow and Jug, 2005; Burden et al., 2009), the selection and tuning of learning algorithms for building relationships (Basheer and Hajmeer, 2000; Xu et al., 2006), the risk of overtraining (Byvatov et al., 2003), the external validation of the models (Golbraikh and Tropsha, 2002; OECD, 2007; Schüürmann et al., 2008) and the definition of applicability domains (Weaver and Gleeson, 2008; Kühne et al., 2009). There are essentially two possible approaches to estimate the set of chemical properties required for modeling the environmental multimedia distribution of chemicals (Fig. 1). The first is to estimate the properties of each required chemical parameter from independent quantitative structure–property relationship (QSPR) and quantitative structure–biodegradation relationship (QSBR) models. Indeed, QSPRs for various chemical families have been proposed in the literature for an array of different physiochemical properties (Diudea, 2001). In this case the uncertainty associated with each QSARs is introduced into 413 the MEM analysis with such uncertainties scaled with respect to other MEM parameters (e.g., areas and volumes of the environmental compartments, as well as advection rates), thereby affecting the estimated mass distribution and concentrations of the chemicals in the various media. The second is to consider a single QSPR/QSBR for the collective chemical properties whereby, given a set of chemical descriptors, the various environmentally relevant physicochemical properties and reaction rate parameters are predicted by the single QSPR/QSBR model. When a specific regulatory multimedia model is used with specified emission scenario and geographical and meteorological settings, it may be beneficial to integrate the QSPR/QSBR approach with the multimedia model. In such an approach, a quantitative structure–fate relationship (QSFR) model is derived relating quantitative chemical descriptors information directly with MEM predictions of chemical distribution in the environment. Preliminary proposals in the above direction have considered the implementation of QSPRs in standard MEMs (Breivik and Wania, 2003; Zukowska et al., 2006) or the establishment of structure–fate relationships by partial orders (Brüggemann et al., 2006). In the current study, we propose training of machine-learning algorithms (Witten and Frank, 2005) to map directly the output of MEMs (in terms of media mass distribution) to relevant chemical descriptors. The resulting correlation model, which is referred to herein as a quantitative structure–fate relationship (QSFR), has the advantage of providing direct information on the environmental distribution of chemicals using a consistent set of chemical descriptors with respect to chemically relevant multimedia model properties. The present paper reports on the prospect of assessing the environmental fate of chemicals directly from their molecular information using QSFRs trained based on the MEM model output as an alternative to using a MEM with chemical properties and biodegradation rates estimated using QSPRs and QSBR (Fig. 1) models. The proposed approach is self-consistent because the same molecular descriptors were used for training the QSFR and their relevance (and as outcome also the uncertainty) was weighted with respect to geographical and meteorological parameters of the various media. QSFR models were developed with support vector regression (SVR) algorithms (Drucker et al., 1997) and trained with MEMs' chemical mass distribution outputs as an alternative to using a collection of QSPRs to estimate physicochemical and reaction rate parameters as MEM inputs (Fig. 1). The approach was evaluated with a data set of 455 chemicals using SimpleBox 3 (SB3) multimedia model (van de Meent, 1993; Brandes et al., 1996; den Hollander and van de Meent, 2004; den Hollander et al., 2004) for the specific geographical setting of the Netherlands, with air, water, sediments, soil, and vegetation compartments, and two air and water-emission scenarios. Chemicals were characterized by the molecular weight and 38 counts of molecular constituents. The performance of the QSFR models developed was compared to the conventional approach of applying MEMs with experimental and estimated input parameters by taking into consideration, for a statistically significant sample of the selected chemicals and environmental conditions, the propagation of the uncertainty associated to the range of errors reported in the literature (Boethling et al., 2004; Kühne et al., 2007) for the MEM input parameters. 2. Scenario for chemical multimedia distributions 2.1. Multimedia analysis of chemical distribution Fig. 1. Two approaches for assessing environmental chemical partitioning from molecular information when experimental input information is incomplete or lacking. Multimedia environmental simulations were carried out using the Level III (steady state with mass transfer limitations) SB3 fugacity model (van de Meent, 1993; Brandes et al., 1996; den Hollander and van de Meent, 2004; den Hollander et al., 2004) to assess the multimedia distribution of 455 chemicals (Martínez, 2010; see also the provided Supplementary information) in the Netherlands as a model environment (Struijs and Peijnenburg, 2002). Five 414 I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 homogeneous compartments, including air, water (including fresh and sea water), sediments (including fresh water sediments and sea water sediments), soil (including natural, agricultural and other soil) and vegetation (including natural and agricultural vegetation) were considered at the regional scale of this MEM. A total of 375 chemicals were selected for training and testing QSFR models according to chemical similarity in the descriptor space by means of self-organizing maps (SOM), while 80 chemicals were excluded from the modeling process and set aside for model validation. It is noted that for steadystate SB3 analysis (Level III), the variation of mass partitioning among the different chemicals is governed only by their physicochemical, transport and degradation constants since the simulations were carried out for fixed geographical and meteorological conditions. Model calculations were carried out for each of the selected chemicals given their physicochemical and degradation rate parameters (Section 2.2). The chemical concentration in the inflows (i.e., air and water) into the air and water compartments was assumed to be zero. The steady-state compartmental chemical mass concentrations calculated from the SB3 model were expressed as the dimensionless mass ratio of the chemical mass in each compartment according to: wn;g = Cn;g Vg ṁt T ð1Þ aerosol partition coefficients following the approach of Junge (1977a,b). The parameters MW, Tm, Pv, and Kow, were obtained from the PHYSPROP database (SRC, 2008), while the parameters Kaw, Ksw, and kair, were estimated from data in this database. The mass diffusivities Dair and Dwater were estimated internally by SB3 considering that they vary inversely with the square root of the MW and using as reference values the diffusion coefficient of water in air and the diffusion coefficient of oxygen in water (Schwarzenbach et al., 2003). The chemical atmospheric degradation rate constant was estimated as kair = kOH• COH• (where the rate of chemical degradation is given as rair = kOH•COH•Cn,air), in which kOH• (m3/g•s) is the second-order reaction constant (SRC, 2008), COH• (g/m3) is the concentration of hydroxyl radicals in air and Cn,air(g/m3) is the chemical concentration. In the present analysis, the global average hydroxyl radical concentration of 2.66× 10−11 g/m3 (Prinn et al., 2001) was assumed for simplicity. The degradation rate constant in water, kwater, was estimated from results of MITI-I biodegradability tests (NITE, 2006). The MITI-I tests are expressed as the degradation fraction of chemical samples over time periods ranging from 2 to 4 weeks, with sample mass determined by direct methods (using total organic carbon, high performance liquid chromatography and gas chromatography) and indirect methods (measuring biological oxygen demand). In the current work, kwater values were estimated as follows: −1 ln 1−fdeg ð2Þ kwater = t where Cn,g (kg/m3) is the steady-state concentration of chemical n in compartment g of volume Vg (m3), ṁt is the chemical emission rate (kg/s) and T is a time reference period (s). In this study this time reference period has been taken as one year. It should be noted that the addition of these mass ratios for the different compartments is usually less than one because, under steady state, the amount of chemical that remains in the overall system is not necessarily the amount emitted over the time reference period. Compartment concentrations are proportional to pollutant emission rates under the assumptions of linear equilibrium relationships, linear mass transfer coefficients limitations, steady compartment outflows and first-order degradation kinetics. As a consequence, mass ratios defined in Eq. (1) are emission-rate independent. Mackay and Webster (2006) defined persistence as the proportionality constant relating mass in the environment to the emission rate. According to this definition, the mass ratios given in Eq. (1) coincide with the persistence in years of the chemical in each compartment. The addition of the mass ratios for all the compartments defined in the system provides the total persistence (in years) of the chemical in the overall system. where t (seconds) is the range period of a test and fdeg is the degradation fraction determined by the biological oxygen demand (BOD) methodology. Only compounds for which the BOD and total organic carbon methods yielded degradation fraction fdeg that deviated by no more than 10% were included in the chemical data set. For modeling consistency, all fdeg values that were experimentally reported to be higher than 1 or lower than 0, (due to error measurements in the MITI-I tests), have been set to be equal to 0.99 (extremely fast degradability) or 0.01 (extremely low degradability), respectively. As noted in the literature (Boethling et al., 1995), for screening purposes the degradation half lives in water can be taken as similar to those in soil, which in turn tend to degrade 3 to 4 times faster than an anaerobic flooded soil. Accordingly, in the present analysis ksoil values were estimated to be equal to kwater values, while ksed values were taken to be ksoil/3.5. This approach of taking the kinetic parameters for degradation in soil and sediment as a fraction of the value for water is common for screening purposes (e.g., Fenner et al., 2005; U.S. EPA PBT Profiler, 2001). 2.2. Physicochemical properties Molecular information consisting of 39 molecular parameters was compiled for each of the 455 chemicals considered. The set of 39 molecular parameters included molecular weight, 10 atom counts (all atoms, bromine, carbon, chlorine, fluorine, hydrogen, nitrogen, oxygen, phosphorus, and sulfur), 4 bond counts (all bonds, single bonds, double bonds and triple bonds), 16 group counts (aldehyde, amide, amine, sec-amine, carbonyl, carboxyl, cyano, ether, hydroxyl, methyl, methylene, nitro, nitroso, sulfide, sulfone, and thiol), 8 ring counts (all rings, aromatic rings, small rings, 5-membered rings, 5membered aromatic rings, 6-membered rings, 6-membered aromatic rings and 7-to-12-membered rings). The SB3 model requires a total of 6 physicochemical, 2 transport and 4 degradation parameters for Level III type simulations. The physicochemical parameters include molecular weight (MW, g/mol), melting point (Tm, K), vapor pressure (Pv, Pa), octanol–water partition coefficient (Kow, dimensionless), air–water partition coefficient (Kaw, dimensionless), and the solid–water partition coefficient (Ksw, dimensionless). The chemical degradation rate parameters in air (kair, 1/s), water (kwater, 1/s), sediment (ksed, 1/s), and soil (ksoil, 1/s) media were all for first-order kinetics, and the fundamental transport coefficients were the mass diffusivity of the chemical in air (Dair, m2/s) and water (Dwater, m2/s). Kaw values (Mackay, 2001) were calculated as H/RT (H is the Henry's law constant, R is the ideal gas constant and T is the absolute temperature). Ksw was estimated internally by SB3 from Kow (European Commission, 2003) assuming an average organic carbon content of 2% for the sediment solids and solid soil density of 2.5 kg/L. It is noted that for particle-bound chemicals, SB3 uses Tm and Pv to estimate the air– 2.3. Molecular information 3. Methods 3.1. Uncertainty assessment of the MEM To simulate the impact that uncertainties of physicochemical properties and degradation rates estimated from QSPRs or QSBRs have on the outputs of MEMs (thus affecting any assessment of the I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 environment distribution of chemicals), a series of SB3 model simulations was carried out for all 455 chemicals applying 1000 random combinations (Monte Carlo simulations) of the following independent chemical properties: Tm, Pv, H, Kow, kair and kwater. Since this set of independent properties was used to derive a set of dependent properties (Kaw, Ksw, ksed, and ksoil), as described in Section 2.2, variations in the independent set generated variations in the dependent set as well. Uncertainty in MW was not considered and, thus, uncertainty in its dependent properties (Dair and Dwater) was also neglected. The uncertainty sources, in terms of statistical distributions, assigned to the varying independent properties are listed in Table 1. For Tm, Pv, H, and Kow standard deviations were taken from statistics of widely recommended QSPRs (Boethling et al., 2004), considering results for external validation chemicals wherever possible. For kair and kwater the statistical distributions were taken from QSBRs (Kühne et al., 2007). It was assumed that the mean value of every distribution coincided with the property value compiled as described in Section 2.2. Finally, it was assumed that a variable followed a normal distribution if the standard deviation given by Boethling et al. (2004) was in unit variables. A lognormal distribution was considered when the standard deviation was reported in logarithmic units. Although the standard deviation of Pv is given in terms of mm Hg, a lognormal distribution was used to avoid negative values in chemicals with very low Pv. The outputs (dimensionless mass ratios) of the SB3 model from the 1000 random combinations for each chemical, schematized in Fig. 2 for endrin, were used to generate a database. This database provided an estimation of the output distribution that one can expect when using recommended QSPRs and QSBRs to estimate the environmental distribution of chemicals. This database was used as a reference to evaluate the direct QSFR approach depicted in Fig. 1. 3.2. QSFR model development QSFRs were developed to estimate directly from the chemical's molecular descriptors [d1,...,dL] the chemical mass ratios wg predicted by SB3 for each environmental compartment of the reference pollution scenario. It is expected that these QSFR models will perform better than or at least similarly to the SB3 model when fed with properties estimated from several QSPRs and QSBRs. 3.2.1. Fundamentals Given N chemicals (characterized by K properties) emitted in a geographic region described by G compartments, a reference MEM can be considered to be a multivariate function of the form, C = f ðP;E;SÞ ð3Þ 415 where C is a matrix of mass ratio predictions of size N × G, P is a matrix of physicochemical properties of size N × K, E is a matrix of emission rates of size N × G and S is a matrix of site-specific parameters. When E and S remain constant, the chemical distribution in the environment can be solely analyzed in terms of P, the collection of physicochemical properties and degradation rates of the chemicals to assess. When key physicochemical properties and degradation rates are unavailable for chemicals of concern (P is unknown), alternative multimedia environmental models can be developed from L molecular descriptors in a matrix D (of size N × L) by means of QSFRs of the form, C≈fQSFR ðDÞ: ð4Þ To develop the QSFR model given by Eq. (4), a set of Ntr training chemicals (with Ntr b N) is required for which all properties and molecular structures are known. The model is then adjusted to emulate the output of the reference MEM (Eq. (3)), by tuning its internal parameters with respect to a set of Nte test chemicals. Its QSFR performance for new chemicals is later evaluated with a set of Nval validation chemicals. 3.2.2. Data preprocessing All input and output variables with values that span more than two orders of magnitude were logarithmically (base 10) scaled and then normalized in the range [−1,1] according to, yi −ymin −1 N½−1;1 ðyi Þ = 2 ymax −ymin ð5Þ where yi is a value to be normalized and N[−1,1](yi) is its normalized counterpart and ymin and ymax respectively are the minimum and maximum values in the working data set. Since the available molecular information spans less than two orders of magnitude, all molecular descriptors were normalized in the range [−1,1] with no prior logarithmic scaling. 3.2.3. Training, test and validation data sets To build a QSFR model, the original set of 375 working chemicals was split into a training data set and a test data set. About 80% of the working chemicals were dedicated to training every QSFR model, while the remaining 20% of working chemicals were allocated for testing model performance. The data selection scheme, based on the self-organizing map (SOM) algorithm (Kohonen et al., 1996), has to assure that models capture the diversity of chemical structures present in the data set during the training process and that the test set is also well represented in the training set. The SOM is a procedure for Table 1 Uncertainty distribution parameters assigned to independent properties affecting the chemical environmental partitioning. Input Tm Melting point Pv Vapor pressure H Henry's law constant Kow Octanol–water partition coefficient kair Kinetic constant for degradation in air kwater Kinetic constant for degradation in water a Assumed distribution Typical uncertainty distribution of parameters predicted by QSPRs Data set Statistic parametersa,b Statistic parameters units Source Normal Validation SD = 58. K Boethling et al. (2004) Log-normal Validation SD = 96. Pa Boethling et al. (2004) 3 Log-normal Training SD = 0.440 log10(Pa m /mol) Boethling et al. (2004) Log-normal Validation SD = 0.427 log10(–) Boethling et al. (2004) Discrete Training – Kühne et al. (2007) Discrete Training P(0) = 0.48, P(± 1) = 0.37, P(± 2) = 0.13, P(±N2) = 0.02 P(0) = 0.52, P(± 1) = 0.35, P(± 2) = 0.08, P(±N2) = 0.05 – Kühne et al. (2007) The parameters have been reported for QSPRs in standard deviations, SD. The reported parameters for QSBRs are probabilities, P(C), that indicate if a chemical has been classified as member of a degradation class C (0 = correct class, ± 1 = neighbor category predicted, ± 2 = two categories differing and ±N2 = more than two categories differing) in the 9-class scale proposed by Mackay et al. (1992). b 416 I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 Fig. 2. Random realizations of the Monte Carlo approach on SB3 for endrin. mapping and clustering high-dimensional data by fitting an optimal number of units (also called neurons, cells or nodes) to the data. It minimizes distances (e.g., the Euclidean distance) between units and data points (i.e., minimum average quantization error) while preserving the vicinity of units in both the map and the data space (i.e., minimum average topological error). The procedure for selecting the training and test data sets is as follows. First, SOMs of sizes in the range of 10 to 150% of the number of work chemicals (375), were trained to fit these chemicals in the input-target space of the desired QSFR model using the SOM toolbox 5 for Matlab (Vesanto et al., 2000). All SOMs were set to have toroidal shape to avoid border effects. Lattices were hexagonal to minimize both the mean quantization error (qerror ) and the mean topological error (terror ). Chemicals were represented by vectors whose elements included the selected of counts of molecular constituents and the P P target chemical partitioning variable. Note that qerror and terror are defined by (Kohonen, 2001), qerror = 1 Nwk ∑ ‖x −mxi ‖ Nwk n = 1 i ð6Þ 1 Nwk ∑ uðxi Þ Nwk n = 1 ð7Þ and terror = where Nwk is the number of work data vectors, mxi is the best matching unit (BMU) for the data vector xi , and u(xi ) is a function that takes the value of 1 when the BMU and the next BMU of xi are adjacent, and 0 otherwise. Second, for each trained SOM, chemicals were included into the training data set when showing the lowest or highest quantization error with respect to the vector (weights) of their corresponding BMUs. Chemicals with the lowest or highest values in the descriptor and target spaces were also included in the training data set. All remaining working chemicals not following the above rules were allocated to the test set. This procedure was designed to ensure that the test sets belonged to the applicability domain of the corresponding training sets. The number of training chemicals was kept at about 80% (±5%) of the number of working chemicals to ensure good generalization of the QSFR models. This also determines the dimension of the SOM since a higher dimension implies less populated SOM clusters and the selection of more training chemicals and less test chemicals. The 80 validation chemicals were not used in any stage of the development of QSFRs. 3.2.4. Supervised learning algorithms QSFR models were built with support vector regression (SVR) algorithms (Drucker et al., 1997) using RBF kernel functions. The εSVR implementation in the software package RapidMiner 4.4 (Mierswa et al., 2006) was used. For any environmental compartment g, the QSFR can be defined as, N½−1;1 log10 wg = fQSFR N½−1;1 ðd1 Þ;…;N½−1;1 ðdL Þ ð8Þ where the function fQSFR represents the SVR linking normalized molecular descriptors to normalized logarithmic mass ratios. An iterative evaluation of 4000 models was implemented to optimize the parameters of the SVR model (Mierswa et al., 2006) for every compartment and set of input features considered. The SVRbased QSFR model was developed with the training data set and evaluated on the test and validation data sets for every combination of parameters. The optimal SVR model in terms of generalization capabilities was the one with lowest mean absolute error (MAE) selected among the 10 with highest squared correlation (R2) values on the test data set. The MAE and R2 values measure the performance of the SVR-based models by comparing the target (SB3 MEM output tn) and predicted (SVR output pn) normalized logarithmic mass ratios in a given compartment for the N chemicals in the data set (tr = training, te = test or val = validation), Nset ∑ jtn −pn j MAEset = n = 1 ; set = tr; te or val: Nset ð9Þ and !2 Nset ∑ tn −t ðpn −p Þ n=1 2 Rset = Nset ∑ tn −t n=1 2 ! Nset ! ; set = tr; te or val: 2 ∑ ðpn −p Þ n=1 ð10Þ I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 The overbar indicates averages over the Nset chemicals in the data set (set = tr, te or val). The accuracy of an optimized SVR model was estimated by means of both a 10-fold cross-validation (CV) and a leave-one-out (LOO) validation procedure for all 375 working chemicals. Note that the evaluation of the SVRs is based on normalized logarithms of mass ratios. Normalized QSFR predictions (Eq. (8)) for all chemicals were denormalized (Eq. (5)) to yield logarithmic mass ratios. To avoid model overfitting, the predictive performance of all models was assessed in terms of the predictive squared coefficient (q2), as suggested by Schüürmann et al. (2008), q2set Nset 2 ∑ log10 wpredicted −log10 wtarget n;g n;g = 1− ! Nset 1 Nset 2 ; set = tr; te or val: target target − ∑ log10 wn;g ∑ log10 wn;g N n=1 n=1 n=1 ð11Þ This q2 coefficient varies in the range (−∞,1). Models with q2 values closer to 1 have a high predictive performance, while models having q2 values equal or lower than zero perform worst than simply averaging all targets. 4. Results and discussion Mass ratios predicted by the QSFR models emulating the reference scenario are presented and discussed in Section 4.1 for emissions in the water and air compartments. Section 4.2 explores how model performance can be improved by clustering chemicals and developing class-specific QSFRs models. groups and rings) listed in Table 2 with their minimum and maximum values in current data sets, together with the MW were used to develop the current QSFR models (Eq. (8)) to predict the environmental distribution of chemicals by following the direct approach depicted in Fig. 1. In this case the performance for the 80 validation chemicals increased to q2 = 0.64 and q2 = 0.68 for the air and water compartments, respectively. Thus, all QSFR models presented and discussed thereafter were built using MW and counts of molecular constituents. 4.1.2. Selection of training and test chemicals Chemicals, which were represented by vectors of dimension 40 in terms of their mass ratios in a single compartment, the molecular weight (MW) and 38 non-zero counts of molecular constituents, were clustered in a two-dimensional SOM representation of the input space. The selection of the training and test chemicals is illustrated in Fig. 3, where three SOM units arbitrarily labeled as K15, O2 and U8 are used as examples. The higher the similarities between chemicals within a single SOM unit in terms of structure and chemical partitioning, the lower the differences between their qerror values are; between dieldrin (training) and endrin (test) in unit K15, between propanal (training) and butanal (test), ethanal (test), and isobutylraldehide (test) in unit U8. On the other hand, cinnamyl alcohol (training) and hydroxymethyl benzene (training) encompass Table 2 Molecular constituents used in QSFR model development. Count Symbol Molecular weight (g/mol) Count of all atoms Count of bromine atoms Count of carbon atoms Count of chlorine atoms Count of fluorine atoms Count of hydrogen atoms Count of nitrogen atoms Count of oxygen atoms Count of phosphorus atoms Count of suplhur atoms Count of all bonds Count of single bonds Count of double bonds Count of triple bonds Count of aldehyde groups Count of amide groups Count of amine groups Count of sec-amine groups Count of carbonyl groups Count of carboxyl groups Count of cyano groups Count of ether groups Count of hydroxyl groups Count of methyl groups Count of methylene groups Count of nitro groups Count of nitroso groups Count of sulfide groups Count of sulfone groups Count of thiol groups Count of all rings Count of aromatic rings Count of small rings Count of 5-membered rings Count of aromatic 5-membered rings Count of 6-membered rings Count of aromatic 6-membered rings Count of (7–12)-membered rings MW 44.05 959.17 85.11 402.49 ACall 5 89 10 81 ACbromine 0 10 0 3 ACcarbon 1 32 3 26 ACchlorine 0 8 0 3 ACfluorine 0 27 0 3 AChydrogen 0 60 3 54 ACnitrogen 0 6 0 3 ACoxygen 0 8 0 8 ACphosphorus 0 1 0 1 ACsulphur 0 4 0 2 BCall 4 88 10 80 BCsingle 4 88 9 80 BCdouble 0 18 0 8 BCtriple 0 2 0 2 GCaldehyde 0 1 0 1 GCamide 0 2 0 2 GCamine 0 2 0 2 GCsec-amine 0 2 0 2 GCcarbonyl 0 2 0 2 GCcarboxyl 0 2 0 2 GCcyano 0 2 0 2 GCether 0 4 0 3 GChydroxyl 0 4 0 2 GCmethyl 0 9 0 7 GCmethylene 0 3 0 0 GCnitro 0 3 0 1 GCnitroso 0 1 0 0 GCsulfide 0 4 0 2 GCsulfone 0 1 0 1 GCthiol 0 1 0 1 RCall 0 12 0 2 RCaromatic 0 4 0 2 RCsmall 0 7 0 0 RC5-m 0 4 0 1 RCa-5-m 0 2 0 0 RC6-m 0 4 0 2 RCa-6-m 0 4 0 2 RC7–12-m 0 2 0 1 4.1. Chemical distribution 4.1.1. Molecular feature selection Two types of molecular parameters were tested as input for the QSFR models: molecular descriptors calculated from a semiempirical approximation of molecular orbital (MO) theory (Bredow and Jug, 2005) and simple counts of molecular constituents. Even though the former have been successfully used in the past to estimate physicochemical properties (Raymond et al., 2001; Devillers, 2003; Taskinen and Yliruusi, 2003), preliminary QSFR models developed with several combinations of the MW and 22 semiempirical descriptors estimated with the Parameterized Model 3 (PM3) software CACHE (Fujitsu, 2004) did not perform adequately. For example, the performance of these preliminary QSFR models was always q2 ≤ 0.15 and q2 ≤ 0.49 for the air and water compartments, respectively, for the 80 validation chemicals. In this preliminary screening of QSFR models, descriptors were selected by means of the CFS filtering algorithm (Hall, 1999) from an initial set of molecular information that included the heat of formation (ΔHf, kcal/mol), molar refractivity (MR, m3/mol), polarizability (PO, Å 3), total hybridization dipole moment (μhyb, debye), total point charge dipole moment (μpc, debye), total sum dipole moment (μ, debye), area (Area, Å2), volume (Vol, Å3), number of filled levels (NFL), highest occupied molecular orbital energy (HOMO, eV), lowest occupied molecular orbital energy (LUMO, eV), ionization potential (IP, eV), electron affinity (EA, eV), connectivity indexes (0χ, 1χ, 2χ), valence connectivity indexes (0χv, 1χv, 2χv) and kappa alpha shape indexes (1κ, 2κ, 3κ). Counts of molecular constituents were also tested as input to QSFR models since fragment contributions have provided adequate structural information in the development of QSPRs (Boethling et al., 2004) and QSBRs (Raymond et al., 2001) for a wide range of chemicals. This is also the case for the models traditionally included in EPI suiteTM (SRC, 2008). The molecular constituents (atoms, bonds, functional 417 Working data set Validation data set Min Min Max Max 418 I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 Fig. 3. Example of work chemicals distribution within three SOM units arbitrarily labeled as K15, O2 and U8. The SOM clustered the 375 work chemicals based on their 39 counts of molecular constituents and mass ratios in the air phase. Distances from the unit center are measured in terms of the quantization error (qerror). 2-phenylethanol (test) and hydroxybenzene (test) in unit O2. Thus, by selecting the training chemicals in each SOM unit, as the ones with the lowest and highest qerror, ensures diversity in the training set while keeping the vicinity (similarity) between the test and the training chemicals. Units that cluster two or more work chemicals contribute with a maximum of two training chemicals; while, SOM units clustering one chemical only make one contribution. The SOMs yielded 300 (80.0%) and 299 (79.7%) training chemicals for the air and water, respectively. The domain of applicability (DOA) of a QSFR is defined here as the aggregation of the domains covered by each non-empty SOM unit, i.e., the set of qerror ranges covered within all non-empty BMUs. The training chemical with the largest qerror in each non-empty SOM unit is taken as the domain border of the unit and chemicals with larger qerror are considered not well represented in their BMU and, thus, are outside of the DOA. 4.1.3. Multimedia chemical partitioning SVR-based QSFRs models for multimedia environmental chemical portioning in air and water compartments of the reference scenario have been developed and their generalization capabilities validated with 80 chemicals not seen at all by the models. Fig. 4a and b compares the target partitioning values (reference mass ratios) respectively generated for the air and water compartments by the MEM with the predictions obtained from the two specific (QSFRair) and (QSFRwater) models with chemicals characterized by the MW and 38 non-zero counts of molecular constituents. These figures also include the ranges and 95% confidence interval limits of the MEM output obtained from the Monte Carlo simulation. Fig. 4 shows that mass ratios predicted by the QSFR models both for the air and water compartments deviate from the MEMs' estimates within the envelope of variability obtained with the Monte Carlo realizations of MEM (MC-MEM). QSFR predictions with the highest deviations tend to be close to the extreme values delimited by the variation ranges of the MC-MEM, especially for the air compartment where mass ratios are very small and sensitive to models' input uncertainties. The variations in property values generated by the statistical distributions of standard property estimation methods (Table 1) produced variations in the outputs of MC-MEM of up to 12 logarithmic units. These results suggest that the outputs of standard MEMs should undergo a similar variability if input variables were estimated from available QSPR and QSBR models. Table 3 shows that the 1000 realizations of 455 chemicals of the MC-MEM yield q2mean = 0.88 and MAEmean = 0.80 for the air compartment and q2mean = 0.86 and MAEmean = 0.17 for the water compartment. Table 3 also shows that the predictive performances of QSFRair and QSFRwater per data set are reasonably high. The overall performances of these models, including all the 455 chemicals (i.e., the training, test and validation sets simultaneously) are q2 = 0.82 and MAE = 0.91 for air and q2 = 0.81 and MAE = 0.32 for water. Table 3 also indicates that the predictive capacity of the QSFR models in Fig. 4 is high for chemicals located within the boundaries of the DOA, which is the case of the training and test chemicals previously selected with the SOM. QSFR model performance drops for the validation set since some chemicals fall out of the DOA. QSFR models using simple counts of molecular constituents, as the ones we propose here, cannot distinguish between isomers that have the exact same number of bonds, functional groups and ring structures. With these characteristics, there are 81 working and 10 validation isomeric chemicals out of the 375 working and 80 validation chemicals, respectively. This is not a serious drawback because transport and degradation properties for these isomers in the current working and validation data sets are not significantly different. On the other hand, molecular constituent counts have the advantage that they can be easily retrieved or calculated given the molecular formula or the SMILES (Weininger, 1988), making them suitable for simple and rapid screenings of chemical partitioning. Since the constituents of a chemical (atoms, bonds, groups and rings) can be counted without errors and the SVR algorithm would always yield the same model for the same training data and parameters (unlike ANNs, which adjust internal parameters in search of a local minimum error), current SVR-based QSFR models can easily be reproduced. This represents an additional advantage over models using as input semi-empirical descriptors since their values vary depending on the specific MO method (Bredow and Jug, 2005) used to estimate them. I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 419 Table 3 Evaluation of different approaches to estimate mass ratios in air and water compartments when the chemical is emitted in the water compartment. Compartment Estimation Performance Performances per data seta approach measure Training Test Validation set set set Air MC-MEM Air QSFRair Air QSFRair,X/Y Water MC-MEM Water QSFRwater Water QSFRwater, X/Y q2 MAE q2 MAE q2 MAE q2 MAE q2 MAE q2 MAE 0.88 0.80 0.85 0.81 0.92 0.54 0.89 0.16 0.86 0.30 0.94 0.11 0.87 0.79 0.86 0.81 0.91 0.59 0.79 0.15 0.60 0.34 0.78 0.19 0.86 0.82 0.64 1.34 0.68 1.30 0.78 0.22 0.68 0.39 0.62 0.36 All sets 0.88 0.80 0.82 0.91 0.88 0.68 0.86 0.17 0.81 0.32 0.87 0.17 a The number of chemicals per data set varies per compartment. For the air compartment there are 300 training, 75 test and 80 validation chemicals. For the water compartment there are 299 training, 76 test and 80 validation chemicals. 4.2. Assessment of the chemical domain 4.2.1. QSFRs models for two broad chemical classes The development of QSFR models for more homogeneous classes of chemicals was undertaken with the aim of assessing ways of improving the performances. This implies (i) the definition of chemical classes (families), (ii) the labeling of chemicals according to the characteristics of the two classes, and (iii) the development and use of specialized QSFR models (one per chemical class). Different criteria could be proposed for creating chemical families with respect to molecular structure, but the performance of any classtailored QSFR is hampered by the availability of sufficient training data. In a preliminary screening of the 375 work chemicals in the reference scenario, it was observed that 39 chemicals were solely formed by carbon and hydrogen atoms, while the remaining 336 chemicals have at least one heteroatom (bromine, chlorine, fluorine, nitrogen, oxygen, phosphorus or sulfur atoms). These two classes would constitute a reasonable starting point for developing specialized QSFR models if they were equally populated. In the current data sets there is an unbalanced distribution of chemicals with only 39 chemicals in the first class, which is not sufficient to generate appropriate training and test data sets. Thus, an adjustment was made to create two meaningful chemical families, one with 146 chemicals containing solely carbon, hydrogen and oxygen (Class X) and another with 229 chemicals containing at least one heteroatom different than oxygen (Class Y). Chemicals in class X cover broad and meaningful ranges of sizes (44.05 ≤ MW ≤ 434.58) and hydrophobicity (1.05 × 10−2 ≤ Kow ≤ 3.80 × 1014). Chemicals in this class X can be described by solely MW, 4 atom counts (all atoms, carbon, hydrogen and oxygen), 3 bond counts (all bonds, single bonds and double bonds), 7 functional group counts (aldehyde, carbonyl, carboxyl, ether, hydroxyl, methyl and methylene) and 8 ring counts (all rings, aromatic rings, small rings, 5-membered, aromatic 5-membered, 6-membered, aromatic 6-membered and 7–12membered). On the other hand, chemicals in class Y also cover appropriate ranges of sizes (50.49 ≤ MW ≤ 959.17) and hydrophobicity (5.62 × 10−3 ≤ Kow ≤ 4.57 × 1012). They are described by the MW and the 38 constituent counts listed in Table 2 (like in the QSFR models of Section 4.1). For optimal results, specific training and test data sets should be used every time a new SVR-based QSFR is developed. Nevertheless, the same training and test chemicals previously selected for the models QSFRair and QSFRwater were maintained, for comparison purposes, when developing class-tailored QSFRs (i.e., for classes X and Y). In this Fig. 4. Comparison between the logarithmic mass ratios determined by SB3 and those estimated by (a) QSFRair for the air compartment (q2all = 0.82 and MAE = 0.91) and (b) QSFRwater for the water compartment (q2all = 0.81 and MAE = 0.32). Emissions are in the water compartment. case, four class-tailored models were developed: QSFRair,X, QSFRair,Y, QSFRwater,X and QSFRwater,Y. Logarithmic mass ratios were predicted for each chemical, according to its chemical class (X or Y) by using the appropriate model per compartment. The results for both classes (X and Y) are presented in the following discussion for each compartment with the acronyms QSFRair,X/Y (i.e., models QSFRair,X or QSFRair,Y) and QSFRwater,X/Y (i.e., models QSFRwater,X or QSFRwater,Y). Fig. 5 depicts predictions of logarithmic mass ratios for the air compartment (Fig. 5a) and the water compartment (Fig. 5b) obtained with the models QSFRair,X/Y and QSFRwater,X/Y, respectively. An improvement in model performance has been achieved with the class-tailored models in Fig. 5 with respect to the overall simple QSFRair and QSFRwater models in Fig. 4. Table 3 shows that the class-tailored QSFRair,X/Y and QSFRwater,X/Y, plotted in Fig. 5, yield higher q2 and lower MAE values for air (q2all = 0.88 and MAE= 0.68 for QSFRair,X/Y) and water (q2all =0.87 and MAE =0.17 for QSFRwater,X/Y), compared to the overall simple models QSFRair (q2all = 0.82 and MAE = 0.91) and QSFRwater (q2all = 0.81 and MAE =0.32), plotted in Fig. 4, when all data sets (training, test and 420 I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 tailored models. This is discussed in the next subsection. Also, the training and test data sets selected from SOMs for the general QSFR models (Fig. 4) were kept unchanged when training the classspecialized QSFRs (Fig. 5) to facilitate the comparison of performances among the different models (Table 3). 4.2.2. Domain of applicability QSFR predictions outside the DOA of the models should be avoided (Johnson, 2008). The DOA of any model is primarily defined by its training chemicals (Weaver and Gleeson, 2008). Thus, by identifying the DOA of an existing QSFR model it is possible to approximately assess how appropriate it is for a new chemical (Kühne et al., 2009). Reasonable estimations of the DOA of a model can be performed (Schroeter et al., 2007) by measuring distances or probability density distributions of training data vectors (training chemicals) to new data vectors (validation chemicals or new chemicals of concern). Since the SOM algorithm is based on the distances between data vectors in a multivariate space (Kohonen et al., 1996), it has been currently used to define the DOA of the QSFR models developed. Three different SOM-based approaches have been tested: (i) DOA defined by the same SOM as in the selection of training and test data sets. The training chemical with the highest qerror in each non-empty SOM units defines the DOA border. Chemicals clustered in the original SOM were represented by vector of dimension 40 (MW, 38 constituent counts and a mass ratio). When presenting new chemicals to the SOM to evaluate whether or not they belong to the DOA, their mass ratios are unknown and only 39 out of 40 elements of the vector are used for classification purposes. Nevertheless, the error in the determination of the distance used to identify the BMUs with only 39 elements out of 40 is negligible. (ii) DOA defined by the principal component analysis (Pearson, 1901) of the 39 input variables. It was found that the first five principal components (eigenvectors) accounted for about 59% of the cumulative variance. The SOM was trained with these five principal components and, again, the DOA was defined by the highest qerror of the training chemicals in each non-empty SOM unit. (iii) DOA defined by the intersection of the two approaches above. Fig. 5. Comparison between the logarithmic mass ratios determined by SB3 and those estimated by (a) QSFRair,X/Y for the air compartment (q2all = 0.88 and MAE = 0.68) and (b) QSFRwater,X/Y for the water compartment (q2all = 0.87 and MAE = 0.17). Emissions are in the water compartment. validation) are considered. These results of QSFRair,X/Y and QSFRwater,X/Y are on average very close to those from MC-MEM when considering all data sets simultaneously, as indicated also in Table 3. The same improvement pattern is observed in Table 3 for the test sets. QSFRair,X/Y and QSFRwater,X/Y respectively yield q2all = 0.91 and q2all = 0.78 compared to the lower q2all = 0.86 and q2all = 0.60 obtained with QSFRair and QSFRwater, respectively. Thus, the discrimination of chemicals with respect to their structure (i.e., by using classes X and Y) improves the generalization capability of the SVR-based QSFRs by establishing better relationships between chemical distribution and molecular structure. On the other hand, the performance of any QSFR model with the validation set is significantly poorer both in terms of q2 and MAE values for the air and water compartments, as shown in Figs. 4 and 5, and Table 3. Also, no improvement is observed when two chemical classes are considered. It should be noted that the number of training chemicals per SVR-based QSFR is reduced to about half when implementing chemical classes. This in turn increases the chances of having some validation chemicals outside the DOA of the class- Table 4 summarizes the performances of models QSFRair,X/Y and QSFRwater,X/Y in terms of q2 and MAE for the test and validation chemicals emitted in water, both for the complete data sets or only for those chemicals belonging to the DOAs defined previously. In the first two approaches (i) and (ii), test or validation chemicals with quantization errors higher than those of the upper bounding training chemicals in each BMU are considered to be out the DOA of the models. Since the number of chemicals within the (i) and (ii) DOA definitions differs because of the different variables considered and the errors in each SOM, their intersection (iii) is preferred because more restrictive conditions are imposed. Table 4 shows that by using this third approach, the mass ratios of 48 and 50 “new” chemicals (test plus validation) in the air and water compartments, respectively, when emitted in water and belonging to the DOA can be optimally predicted by QSFRair,X,Y (q2te + val =0.92 and MAE = 0.54) and QSFRwater,X,Y (q2te + val = 0.94 and MAE = 0.15). Table 4 also shows that model performances remain as high for emissions in air when using the same training, test and validation data sets already used in the water-emission models and for test and validation chemicals belong to the DOA defined according to approach (iii). The lowest performances (q2val ≈ 0.75) are attained for the 12 validation chemicals in the air compartment when emitted in either water or air. I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 421 Table 4 Effect of belonging or not to the domain of applicability (DOA) for various DOA definitions when using specialized QSFR for the air and water compartments. The DOAs are established by SOM classes generated with (i) 39 descriptors, (ii) by the first five principal components of the PCA of these 39 descriptors and (iii) by the intersection of the two previous SOM-based DOAs. Model DOA Parameters Emissions in water Emissions in air Chemicals within DOA QSFRair,X/Y (i) (ii) (iii) QSFRwater,X/Y (i) (ii) (iii) # Chem q2 MAE # Chem q2 MAE # Chem q2 MAE # Chemi q2 MAE # Chem q2 MAE # Chemi q2 MAE Chemicals out DOA Chemicals within DOA te val te, val te val te, val te val te, val 62 0.93 0.55 36 0.95 0.45 36 0.95 0.45 56 0.84 0.15 44 0.86 0.13 40 0.91 0.12 29 0.69 0.99 15 0.64 1.03 12 0.78 0.79 21 0.88 0.30 16 0.81 0.38 10 0.95 0.25 91 0.89 0.69 51 0.89 0.62 48 0.92 0.54 77 0.87 0.19 60 0.84 0.20 50 0.94 0.15 13 0.70 0.79 39 0.86 0.73 39 0.86 0.73 20 0.57 0.29 32 0.69 0.26 36 0.66 0.26 51 0.67 1.47 65 0.68 1.36 68 0.66 1.39 59 0.28 0.39 64 0.21 0.36 70 0.25 0.38 64 0.68 1.33 104 0.75 1.12 107 0.73 1.15 79 0.38 0.36 96 0.43 0.33 106 0.40 0.34 62 0.95 0.21 36 0.97 0.16 36 0.97 0.16 56 0.90 0.29 44 0.92 0.27 40 0.93 0.25 29 0.54 0.46 15 0.65 0.41 12 0.76 0.34 21 0.88 0.30 16 0.73 0.58 10 0.93 0.32 91 0.89 0.29 51 0.92 0.23 48 0.94 0.20 77 0.90 0.29 60 0.84 0.35 50 0.93 0.27 5. Conclusions Environmental concentrations of chemical pollutants have been assessed by developing quantitative structure–fate relationship (QSFR) models for 455 chemicals, 375 for training and testing and 80 for validation purposes, in a model scenario representing the Netherlands. Multimedia chemical partitioning for chemical emissions in air and water has been predicted by the Level III Fugacity model SimpleBox 3 by using mostly experimental information of partition coefficients and degradation rates as inputs to the models to minimize the influence of predicted information in QSFR model development. Best results have been obtained when dividing the chemical space into two classes, one with chemicals containing only carbon, hydrogen and oxygen, and the other with chemicals containing at least one heteroatom different than oxygen in their structure. Logarithmic mass ratios in the air and water compartments have been predicted within the uncertainty envelope of SimpleBox 3 (i.e., within deviations associated to the uncertainty of physicochemical and transport properties, and degradation rates) for the training, test and validation sets. The two-class QSFR models have shown good performance with predictive square coefficients q2 ≥ 0.87 for all data sets of 455 chemicals combined. This predictive performance increases to q2 ≈ 0.93 for the combination of test and validation data sets when chemicals outside the DOA of the models have been excluded. These results obtained demonstrate the feasibility of predicting multimedia chemical partitioning directly from molecular information. Supplementary data associated with this article can be found, in the online version, at doi: 10.1016/j.scitotenv.2010.10.016. Acknowledgements The authors are grateful to Professor Y. Cohen of UCLA for many fruitful discussion held during the course of this investigation. The research was financially supported by the European Union (NOMIRACLE, European Commission, FP6 contract No. 003956) and the Generalitat de Catalunya (2009SGR-01529). Francesc Giralt acknowledges the Distinguished Researcher Award, Generalitat de Catalunya. References Aronson D, Boethling R, Howard P, Stiteler W. Estimating biodegradation half-lives for use in chemical screening. Chemosphere 2006;63:1953–60. Basheer IA, Hajmeer M. Artificial neural networks: fundamentals, computing, design, and application. J Microbiol Meth 2000;43:3-31. Boethling RS, Howard PH, Beauman AJ, Lanosche ME. Factors in intermedia extrapolation in biodegradability assessment. Chemosphere 1995;30:741–52. Boethling RS, Howard PH, Meylan WM. Finding and estimating chemical property data for environmental assessment. Environ Toxicol Chem 2004;23:2290–308. Brandes LJ, den Hollander H, van de Meent D. SimpleBox 2.0: a nested multimedia fate model for evaluating the environmental fate of chemicals. Bilthoven, The Netherlands: RIVM; 1996. p. 156. Bredow T, Jug K. Theory and range of modern semiempirical molecular orbital methods. Theor Chim Acta 2005;113:1-14. Breivik K, Wania F. Expanding the applicability of multimedia fate models to polar organic chemicals. Environ Sci Technol 2003;37:4934–43. Breivik K, Alcock R, Li Y-F, Bailey RE, Fiedler H, Pacyna JM. Primary sources of selected POPs: regional and global scale emission inventories. Environ Pollut 2004;128: 3-16. Breivik K, Vestreng V, Rozovskaya O, Pacyna JM. Atmospheric emissions of some POPs in Europe: a discussion of existing inventories and data needs. Environ Sci Policy 2006;9:663–74. Brüggemann R, Restrepo G, Voigt K. Structure–fate relationships of organic chemicals derived from the software packages E4CHEM and WHASSE. J Chem Inf Model 2006;46:894–902. Burden FR, Polley MJ, Winkler DA. Toward novel universal descriptors: charge fingerprints. J Chem Inf Model 2009;49:710–5. Byvatov E, Fechner U, Sadowski J, Schneider G. Comparison of support vector machine and artificial neural network systems for drug/nondrug classification. J Chem Inf Comput Sci 2003;43:1882–9. Citra MJ. Incorporating Monte Carlo analysis into multimedia environmental fate models. Environ Toxicol Chem 2004;23:1629–33. Cohen Y. Pollutants in a multimedia environment. In: Cohen Y, editor. Workshop on pollutant transport and accumulation in a multimedia environment. New York: Plenum Press; 1986. Cohen Y, Cooter EJ. Multimedia environmental distribution of toxics (Mend-Tox). I: hybrid compartmental–spatial modeling framework. Pract Period Hazard Toxic Radioact Waste Manage 2002a;6:70–86. Cohen Y, Cooter EJ. Multimedia environmental distribution of toxics (Mend-Tox). II: software implementation and case studies. Pract Period Hazard Toxic Radioact Waste Manage 2002b;6:87-101. European Commission. Technical guidance document on risk assessment, part III. Institute for Health and Consumer Protection, European Chemicals Bureau, 2003. Cronin MTD, Schultz TW. Pitfalls in QSAR. Theochem J Mol Struct 2003;622:39–51. Cronin MT, Walker JD, Jaworska JS, Comber MH, Watts CD, Worth AP. Use of QSARs in international decision-making frameworks to predict health effects of chemical substances. Environ Health Perspect 2003;111:1376–90. den Hollander HA, van de Meent D. Appendix to SimpleBox 3.0: A multimedia mass balance model for evaluating the environmental fate of Chemicals. RIVM; 2004. den Hollander HA, van Eijkeren JCH, van de Meent D. SimpleBox 3.0. Bilthoven, The Netherlands: RIVM; 2004. 422 I. Martínez et al. / Science of the Total Environment 409 (2010) 412–422 Devillers J. A decade of research in environmental QSAR. SAR QSAR Environ Res 2003;14:1–6. Diudea MV. QSPR/QSAR studies by molecular descriptors. New York: Nova; 2001. Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support vector regression machines. Advances in neural information processing systems 9. Proceedings of the 1996 conference. Denver, CO, USA: MIT Press; 1997. p. 155–61. Dec 2–5. Duca JS, Hopfinger AJ. Estimation of molecular similarity based on 4D-QSAR analysis: formalism and validation. J Chem Inf Comput Sci 2001;41:1367–87. Efroymson RA, Murphy DL. Ecological risk assessment of multimedia hazardous air pollutants: estimating exposure and effects. Sci Total Environ 2001;274(1–3): 219–30. Eisenberg JNS, Bennett DH, McKone TE. Chemical dynamics of persistent organic pollutants: a sensitivity analysis relating soil concentration levels to atmospheric emissions. Environ Sci Technol 1998;32:115–23. Fenner K, Scheringer M, Macleod M, Matthies M, McKone T, Stroebe M, et al. Comparing estimates of persistence and long-range transport potential among multimedia models. Environ Sci Technol 2005;39:1932–42. Fjodorova N, Novich M, Vrachko M, Smirnov V, Kharchevnikova N, Zholdakova Z, et al. Directions in QSAR modeling for regulatory uses in OECD member countries, EU Russia. J Environ Sci Health Pt C Environ Carcinog Ecotoxicol Rev 2008;26:201–36. Fujitsu BGo. CAChe software. Beaverton: BioSciences Group, Fujitsu Computer Systems; 2004. Furusjö E, Svenson A, Rahmberg M, Andersson M. The importance of outlier detection and training set selection for reliable environmental QSAR predictions. Chemosphere 2006;63:99-108. Godavarthy SS, Robinson JRL, Gasem KAM. SVRC-QSPR model for predicting saturated vapor pressures of pure fluids. Fluid Phase Equilib 2006;246:39–51. Golbraikh A, Tropsha A. Beware of q2! J Mol Graph Model 2002;20:269–76. Hall MA. Correlation-based Feature Selection for Machine Learning. Department of Computer Science. Ph.D. thesis. The University of Waikato, Hamilton, New Zealand, 1999. Howard PH, Boethling RS, Jarvis WF, Meylan WM, Michalenko EM. Handbook of environmental degradation rates. Boca Ratón: Lewis Publishers; 1991. Hugo K. From narcosis to hyperspace: the history of QSAR. Quant Struct Act Relat 2002;21:348–56. Johnson SR. The trouble with QSAR (or how i learned to stop worrying and embrace fallacy). J Chem Inf Model 2008;48:25–6. Junge CE. Fate of pollutants in the air and water environment. Wiley-Interscience; 1977a. Junge CE. Basic considerations about trace constituents in the atmosphere is related to the fate of global pollutants. In: Suffet IH, editor. Fate of pollutants in the air and water environment. Advances in Environmental Science and TechnologyNew York: Wiley-Interscience; 1977b. Part I. Kawamoto K, MacLeod M, Mackay D. Evaluation and comparison of multimedia mass balance models of chemical fate: application of EUSES and ChemCAN to 68 chemicals in Japan. Chemosphere 2001;44:599–612. Klöpffer W, Wagner BO. Persistence revisited. Environ Sci Pollut Res 2007;14:141–2. Kohonen T. Self-organizing maps. 3rd ed. Berlin: Springer; 2001. Kohonen T, Oja E, Simula O, Visa A, Kangas J. Engineering applications of the selforganizing map. Proc IEEE 1996;84(10):1358–84. Kühne R, Breitkopf C, Schüürmann G. Error propagation in fucacity level-III models in the case of uncertain physicochemical properties. Environ Toxicol Chem 1997;16: 2067–9. Kühne R, Ebert R-U, Schüürmann G. Estimation of compartmental half-lives of organic compounds—structural similarity versus EPI-suite. QSAR Comb Sci 2007;26:542–9. Kühne R, Ebert RU, Schüürmann G. Chemical domain of QSAR models form atomcentered fragments. J Chem Inf Model 2009;49:2660–9. Lohmann R, Breivik K, Dachs J, Muir D. Global fate of POPs: current and future research directions. Environ Pollut 2007;150:150–65. Mackay D. Multimedia environmental models—the fugacity approach. Boca Ratón: Lewis Publishers; 2001. Mackay D, Webster E. A perspective on environmental models and QSARs. SAR QSAR Environ Res 2003;14:7-16. Mackay D, Webster E. Environmental persistence of chemicals. Environ Sci Pollut Res 2006;13:43–9. Mackay D, Shiu W-Y, Ma KC. Illustrated handbook of physical–chemical properties and environmental fate for organic chemicals. Lewis Publishers Inc.; 1992 Mackay D, Hubbarde J, Webster E. The role of QSARs and fate models in chemical hazard and risk assessment. QSAR Comb Sci 2003;22:106–12. Martínez I, Quantitative structure fate relationships for multimedia environmental analysis. Ph.D. thesis. Universitat Rovira i Virgili, Tarragona, Spain, 2010. Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T. YALE: rapid prototyping for complex data mining tasks. In: Ungar L, Craven M, Gunopulos D, Eliassi-Rad T, editors. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-06). Philadelphia, PA, USA: ACM; 2006. p. 935–40. Modarresi H, Modarress H, Dearden JC. QSPR model of Henry's law constant for a diverse set of organic chemicals based on genetic algorithm-radial basis function network approach. Chemosphere 2007;66:2067–76. Nikolova N, Jaworska J. Approaches to measure chemical similarity—a review. QSAR Comb Sci 2003;22:1006–26. NITE. Chemical risk information platform (CHRIP). National Institute of Technology and Evaluation; 2006. OECD. Guidance document on the validation of (quantitative) structure–activity relationship [(Q)SAR] models. OECD Series on Testing and Assessment, 69; 2007. Pearson K. On lines and planes of closest fit to systems of points in space. Phil Mag 1901;2:559–72. Prinn RG, Huang J, Weiss RF, Cunnold DM, Fraser PJ, Simmonds PG, et al. Evidence for substantial variations of atmospheric hydroxyl radicals in the past two decades. Science 2001;292:1882–8. Raymond JW, Rogers TN, Shonnard DR, Kline AA. A review of structure-based biodegradation estimation methods. J Hazard Mater 2001;84:189–215. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007;23:2507–17. Schroeter T, Schwaighofer A, Mika S, Ter Laak A, Suelzle D, Ganzer U, et al. Estimating the domain of applicability for machine learning QSAR models: a study on aqueous solubility of drug discovery molecules. J Comput Aided Mol Des 2007;21:651–64. Schüürmann G, Ebert R-U, Chen J, Wang B, Kühne R. External validation and prediction employing the predictive squared correlation coefficient—test set activity mean vs training set activity mean. J Chem Inf Model 2008;48:2140–5. Schwarzenbach RP, Gschwend PM, Imboden DM. Environmental organic chemistry. 2nd ed. New Jersey: Wiley; 2003. Senese CL, Duca J, Pan D, Hopfinger AJ, Tseng YJ. 4D-Fingerprints, Universal QSAR and QSPR Descriptors. J Chem Inf Comput Sci 2004;44:1526–39. SRC. Interactive PhysProp Database Demo. Syracuse Research Corporation.SRC. EPI Suite v4.00. SRC, 2008. Stouch TR, Kenyon JR, Johnson SR, Chen X-Q, Doweyko A, Li Y. In silico ADME/Tox: why models fail. J Comput Aided Mol Des 2003;17:83–92. Struijs J, Peijnenburg WJGM. Predictions by the multimedia environmental fate model SimpleBox compared to field data: Intermedia concentration ratios of two phthalate esters. Bilthoven: RIVM; 2002. p. 62. Taskinen J, Yliruusi J. Prediction of physicochemical properties based on neural network modelling. Adv Drug Deliv Rev 2003;55:1163–83. Tickner J, Geiser K, Coffin M. The U.S. experience in promoting sustainable chemistry. Environ Sci Pollut Res 2005;12:115–23. Todeschini R, Consonni V. Handbook of molecular descriptors. Weinheim: Wiley-VCH; 2000. Toose L, Woodfine DG, MacLeod M, Mackay D, Gouin J. BETR-World: a geographically explicit model of chemical fate: application to transport of [alpha]-HCH to the Arctic. Environ Pollut 2004;128:223–40. Toussant M. A scientific milestone. Chem Eng News 2009;87:3. U.S. EPA, Persistence, bioaccumulative and toxic (PBT) profiler. U.S. EPA Office of pollution prevention and toxics; 2001http://www.epa.gov/oppt/sf/tools/pbtprofiler.htm. Washington DC. US-EPA. Inventory Update Rule. Office of Pollution Prevention and Toxics. Washington: Environmental Protection Agency; 2006http://www.epa.gov/oppt/iur. van de Meent D. SIMPLEBOX: a generic multimedia fate evaluation model. Bilthoven, The Netherlands: RIVM; 1993. Vesanto J, Himberg J, Alhoniemi E, Parhankangas J. SOM Toolbox for Matlab 5, 2000; 2000. Walker JD, Carlsen L, Hulzebos E, Simon-Hettich B. Global Government applications of analogues, SARs and QSARs to predict aquatic toxicity, chemical or physical properties, environmental fate parameters and health effects of organic chemicals. SAR QSAR Environ Res 2002;13:607–16. Weaver S, Gleeson MP. The importance of the domain of applicability in QSAR modeling. J Mol Graph Model 2008;26:1315–26. Webster E, Mackay D, Di Guardo A, Kane D, Woodfine D. Regional differences in chemical fate model outcome. Chemosphere 2004;55:1361–76. Weininger D. A chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–6. Witten IH, Frank E. Data mining: practical machine learning tools and techniques. San Francisco, U.S.: Morgan Kaufmann; 2005 Worth AP, Bassan A, De Bruijn J, Saliner AG, Netzeva T, Patlewicz G, et al. The role of the European Chemicals Bureau in promoting the regulatory use of (Q)SAR methods. QSAR Environ Res 2007;18:111–25. Xu Y, Zomer S, Brereton RG. Support vector machines: a recent method for classification in chemometrics. Crit Rev Anal Chem 2006;36:177–88. Yaffe D, Cohen Y. Neural network based temperature-dependent quantitative structure property relations (QSPRs) for predicting vapor pressure of hydrocarbons. J Chem Inf Comput Sci 2001;41:463–77. Yaffe D, Cohen Y, Espinosa G, Arenas A, Giralt F. A fuzzy ARTMAP based on quantitative structure—property relationships (QSPRs) for predicting aqueous solubility of organic compounds. J Chem Inf Comput Sci 2001;41:1177–207. Yaffe D, Cohen Y, Espinosa G, Arenas A, Giralt F. Fuzzy ARTMAP and back-propagation neural networks based quantitative structure–property relationships (QSPRs) for octanol–water partition coefficient of organic compounds. J Chem Inf Model 2002;42:162–83. Yaffe D, Cohen Y, Espinosa G, Giralt F, Arenas A. A fuzzy ARTMAP-based quantitative structure–property relationship (QSPR) for the Henry's Law constant of organic compounds. J Chem Inf Comput Sci 2003;43:85-112. Zukowska B, Breivik K, Wania F. Evaluating the environmental fate of pharmaceuticals using a level III model based on poly-parameter linear free energy relationships. Sci Total Environ 2006;359:177–87.