
Multimedia environmental chemical partitioning from molecular information

Science of the Total Environment 409 (2010) 412–422
Izacar Martínez a, Jordi Grifoll a, Francesc Giralt a,⁎, Robert Rallo b
a Departament d'Enginyeria Quimica, Universitat Rovira i Virgili, Av. Paisos Catalans, 26, 43007 Tarragona, Catalunya, Spain
b Departament d'Enginyeria Informatica i Matematiques, Universitat Rovira i Virgili, Av. Paisos Catalans, 26, 43007 Tarragona, Catalunya, Spain
Article info
Article history:
Received 11 July 2010
Received in revised form 1 October 2010
Accepted 11 October 2010
Available online 6 November 2010
Keywords:
Multimedia environmental model
Uncertainty analysis
Quantitative structure–fate relationships
QSAR molecular descriptors
Support vector regression
Domain of applicability
Abstract
The prospect of assessing the environmental distribution of chemicals directly from their molecular
information was analyzed. Multimedia chemical partitioning of 455 chemicals, expressed in dimensionless
compartmental mass ratios, was predicted by SimpleBox 3, a Level III Fugacity model, together with the
propagation of reported uncertainty for key physicochemical and transport properties, and degradation rates.
Chemicals, some registered in priority lists, were selected according to the availability of experimental
property data to minimize the influence of predicted information in model development. Chemicals were
emitted in air or water in a fixed geographical scenario representing the Netherlands and characterized by five
compartments (air, water, sediments, soil and vegetation). Quantitative structure–fate relationship (QSFR)
models to predict mass ratios in different compartments were developed with support vector regression
algorithms. A set of molecular descriptors, including the molecular weight and 38 counts of molecular
constituents, was adopted to characterize the chemical space. Out of the 455 chemicals, 375 were used for
training and testing the QSFR models, while 80 were excluded from model development and were used as an
external validation set. Training and test chemicals were selected and the domain of applicability (DOA) of the
QSFRs established by means of self-organizing maps according to structural similarity. Best results were
obtained with QSFR models developed for chemicals belonging either to the classes [C] and [C; O] or to the class
with at least one heteroatom other than oxygen in the structure. These two class-specific models, with
146 and 229 chemicals, respectively, showed a predictive squared coefficient of q² ≥ 0.90 for both air and
water, which dropped to q² ≈ 0.70 and 0.40, respectively, for outlying chemicals. Prediction errors were of the
same order of magnitude as the deviations associated with the uncertainty of the physicochemical and transport
properties, and degradation rates.
© 2010 Elsevier B.V. All rights reserved.
1. Introduction
Multimedia environmental models (MEMs) are routinely used to
estimate the environmental distribution of chemical pollutants based
on their physicochemical and transport properties, degradation rates,
site-specific geographical parameters and source emission rates
(Cohen, 1986; Mackay, 2001; Cohen and Cooter, 2002a, b; den
Hollander et al., 2004). MEMs serve to screen chemicals with respect
to their persistence in the environment and to provide information
needed to estimate the exposures and associated risks to human and
ecological receptors (Efroymson and Murphy, 2001; Cohen and
Cooter, 2002a; Breivik et al., 2004, 2006; Lohmann et al., 2007; and
references therein). The reliability of predictions of chemical
partitioning from MEMs is affected by model formulation (i.e., system
definition, included environmental processes, calculation methods,
etc.) and the uncertainties introduced via model parameters (Webster
et al., 2004), including estimates of physicochemical parameters
(Cohen and Cooter, 2002a,b; Breivik and Wania, 2003). In particular,
uncertainty in partitioning and degradation parameters can significantly affect predictions of chemical distribution in the environment
(Kühne et al., 1997; Eisenberg et al., 1998; Kawamoto et al., 2001;
Citra, 2004; Toose et al., 2004).
⁎ Corresponding author. Tel.: +34 977559638; fax: +34 977559621.
E-mail address: fgiralt@urv.cat (F. Giralt).
0048-9697/$ – see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.scitotenv.2010.10.016
The lack of adequate physicochemical and toxicological information for most commercial chemicals and the risk that they may
represent for human health and the environment have motivated the
development of new regulatory efforts (Tickner et al., 2005) such as
REACH in the European Union and the Inventory Update Rule (USEPA, 2006) in the United States. These rules aim to collect information
about the characteristics, emission rates and existing volumes of
commercial chemicals for facilitating their screening and deciding
whether to authorize or ban their production. Compiling all
mandatory data will be a formidable task given the large number of
chemicals that may be of concern. For example, in September 2009,
the CAS registry, one of the largest substance registry databases,
reported its 50-millionth unique chemical (Toussant, 2009). It is
accepted that the regulatory assessments of the multimedia distribution of chemicals for which physicochemical properties and degradation data are lacking will require the use of estimation methods that
rely on quantitative structure–activity relationships (QSARs) (Fjodorova et al., 2008; Worth et al., 2007). In general, partitioning data
(Boethling et al., 2004; Mackay, 2001) are more readily available
(from experiments or estimations) relative to degradation data
(Aronson et al., 2006; Howard et al., 1991; Klöpffer and Wagner,
2007; Raymond et al., 2001).
QSARs are accepted worldwide in standard environmental assessments and decision-making tasks (Walker et al., 2002; Cronin et al.,
2003). They are based on establishing quantitative relations between
the target physicochemical (Hugo, 2002), or toxicological properties
(Devillers, 2003; Mackay et al., 2003; Mackay and Webster, 2003) of
chemicals and their molecular information. Often, even small
structural differences between different molecules can lead to
significant differences in chemical activity (Nikolova and Jaworska,
2003). Therefore, QSARs must be developed with careful considerations of data quality and diversity (Furusjö et al., 2006), and accurate
discrimination of chemical descriptors (Cronin and Schultz, 2003;
Stouch et al., 2003). QSARs are appropriate to use when the chemical
of concern has a molecular structure (or chemical descriptors) similar
to that of the chemicals used in the QSAR development (Taskinen and
Yliruusi, 2003). Selecting appropriate chemical descriptors is crucial
for the development of accurate QSARs as demonstrated, for example,
for vapor pressure (Yaffe and Cohen, 2001; Godavarthy et al., 2006),
water solubility (Yaffe et al., 2001), Henry's law constant (Yaffe et al.,
2003; Modarresi et al., 2007) and octanol–water partition coefficient
(Yaffe et al., 2002). QSAR development must consider the selection of
model input features (Saeys et al., 2007), often from a large number of
descriptors (Todeschini and Consonni, 2000; Duca and Hopfinger,
2001; Senese et al., 2004; Bredow and Jug, 2005; Burden et al., 2009),
the selection and tuning of learning algorithms for building relationships (Basheer and Hajmeer, 2000; Xu et al., 2006), the risk of
overtraining (Byvatov et al., 2003), the external validation of the
models (Golbraikh and Tropsha, 2002; OECD, 2007; Schüürmann et
al., 2008) and the definition of applicability domains (Weaver and
Gleeson, 2008; Kühne et al., 2009).
There are essentially two possible approaches to estimate the set
of chemical properties required for modeling the environmental
multimedia distribution of chemicals (Fig. 1). The first is to estimate
the properties of each required chemical parameter from independent
quantitative structure–property relationship (QSPR) and quantitative
structure–biodegradation relationship (QSBR) models. Indeed, QSPRs
for various chemical families have been proposed in the literature for
an array of different physicochemical properties (Diudea, 2001). In this
case the uncertainty associated with each QSAR is introduced into
the MEM analysis, with such uncertainties scaled with respect to other
MEM parameters (e.g., areas and volumes of the environmental
compartments, as well as advection rates), thereby affecting the
estimated mass distribution and concentrations of the chemicals in
the various media. The second is to consider a single QSPR/QSBR for
the collective chemical properties whereby, given a set of chemical
descriptors, the various environmentally relevant physicochemical
properties and reaction rate parameters are predicted by the single
QSPR/QSBR model. When a specific regulatory multimedia model is
used with specified emission scenario and geographical and meteorological settings, it may be beneficial to integrate the QSPR/QSBR
approach with the multimedia model. In such an approach, a
quantitative structure–fate relationship (QSFR) model is derived
relating quantitative chemical descriptors information directly with
MEM predictions of chemical distribution in the environment.
Preliminary proposals in the above direction have considered the
implementation of QSPRs in standard MEMs (Breivik and Wania,
2003; Zukowska et al., 2006) or the establishment of structure–fate
relationships by partial orders (Brüggemann et al., 2006). In the
current study, we propose training of machine-learning algorithms
(Witten and Frank, 2005) to map directly the output of MEMs (in
terms of media mass distribution) to relevant chemical descriptors.
The resulting correlation model, which is referred to herein as a
quantitative structure–fate relationship (QSFR), has the advantage of
providing direct information on the environmental distribution of
chemicals using a consistent set of chemical descriptors with respect
to chemically relevant multimedia model properties.
The present paper reports on the prospect of assessing the
environmental fate of chemicals directly from their molecular
information using QSFRs trained on MEM output, as
an alternative to using a MEM with chemical properties and
biodegradation rates estimated using QSPR and QSBR models (Fig. 1).
The proposed approach is self-consistent because the same molecular
descriptors were used for training the QSFR and their relevance (and,
as a result, also the uncertainty) was weighted with respect to
geographical and meteorological parameters of the various media.
QSFR models were developed with support vector regression (SVR)
algorithms (Drucker et al., 1997) and trained with MEMs' chemical
mass distribution outputs as an alternative to using a collection of
QSPRs to estimate physicochemical and reaction rate parameters as
MEM inputs (Fig. 1). The approach was evaluated with a data set of 455
chemicals using SimpleBox 3 (SB3) multimedia model (van de Meent,
1993; Brandes et al., 1996; den Hollander and van de Meent, 2004; den
Hollander et al., 2004) for the specific geographical setting of the
Netherlands, with air, water, sediments, soil, and vegetation compartments, and two emission scenarios (emission to air and emission to water). Chemicals were
characterized by the molecular weight and 38 counts of molecular
constituents. The performance of the QSFR models developed was
compared to the conventional approach of applying MEMs with
experimental and estimated input parameters by taking into consideration, for a statistically significant sample of the selected chemicals
and environmental conditions, the propagation of the uncertainty
associated with the range of errors reported in the literature (Boethling et
al., 2004; Kühne et al., 2007) for the MEM input parameters.
2. Scenario for chemical multimedia distributions
2.1. Multimedia analysis of chemical distribution
Fig. 1. Two approaches for assessing environmental chemical partitioning from
molecular information when experimental input information is incomplete or lacking.
Multimedia environmental simulations were carried out using the
Level III (steady state with mass transfer limitations) SB3 fugacity
model (van de Meent, 1993; Brandes et al., 1996; den Hollander and
van de Meent, 2004; den Hollander et al., 2004) to assess the
multimedia distribution of 455 chemicals (Martínez, 2010; see also
the provided Supplementary information) in the Netherlands as a
model environment (Struijs and Peijnenburg, 2002). Five
homogeneous compartments, including air, water (including fresh
and sea water), sediments (including fresh water sediments and sea
water sediments), soil (including natural, agricultural and other soil)
and vegetation (including natural and agricultural vegetation) were
considered at the regional scale of this MEM. A total of 375 chemicals
were selected for training and testing QSFR models according to
chemical similarity in the descriptor space by means of self-organizing
maps (SOM), while 80 chemicals were excluded from the modeling
process and set aside for model validation. It is noted that for steady-state SB3 analysis (Level III), the variation of mass partitioning
among the different chemicals is governed only by their physicochemical, transport and degradation constants since the simulations
were carried out for fixed geographical and meteorological
conditions.
Model calculations were carried out for each of the selected
chemicals given their physicochemical and degradation rate parameters (Section 2.2). The chemical concentration in the inflows (i.e.,
air and water) into the air and water compartments was assumed to
be zero. The steady-state compartmental chemical mass concentrations calculated from the SB3 model were expressed as the
dimensionless mass ratio of the chemical mass in each compartment
according to:
$$ w_{n,g} = \frac{C_{n,g}\, V_g}{\dot{m}_t\, T} \tag{1} $$
where Cn,g (kg/m3) is the steady-state concentration of chemical n in
compartment g of volume Vg (m3), ṁt is the chemical emission rate
(kg/s) and T is a time reference period (s). In this study this time
reference period has been taken as one year. It should be noted that
the sum of these mass ratios over the different compartments is
usually less than one because, under steady state, the amount of
chemical that remains in the overall system is not necessarily the
amount emitted over the time reference period. Compartment concentrations are proportional to pollutant emission rates under the
assumptions of linear equilibrium relationships, linear mass transfer
limitations, steady compartment outflows and first-order
degradation kinetics. As a consequence, the mass ratios defined in Eq. (1)
are emission-rate independent.
Mackay and Webster (2006) defined persistence as the proportionality constant relating mass in the environment to the emission
rate. According to this definition, the mass ratios given in Eq. (1)
coincide with the persistence in years of the chemical in each
compartment. The sum of the mass ratios for all the compartments defined in the system provides the total persistence (in years)
of the chemical in the overall system.
2.2. Physicochemical properties
The SB3 model requires a total of 6 physicochemical, 2 transport
and 4 degradation parameters for Level III type simulations. The
physicochemical parameters include molecular weight (MW, g/mol),
melting point (Tm, K), vapor pressure (Pv, Pa), octanol–water partition
coefficient (Kow, dimensionless), air–water partition coefficient (Kaw,
dimensionless), and the solid–water partition coefficient (Ksw,
dimensionless). The chemical degradation rate parameters in air
(kair, 1/s), water (kwater, 1/s), sediment (ksed, 1/s), and soil (ksoil, 1/s)
media were all for first-order kinetics, and the fundamental transport
coefficients were the mass diffusivity of the chemical in air (Dair, m2/s)
and water (Dwater, m2/s).
Kaw values (Mackay, 2001) were calculated as H/RT (H is the
Henry's law constant, R is the ideal gas constant and T is the absolute
temperature). Ksw was estimated internally by SB3 from Kow (European
Commission, 2003) assuming an average organic carbon content of 2%
for the sediment solids and solid soil density of 2.5 kg/L. It is noted that
for particle-bound chemicals, SB3 uses Tm and Pv to estimate the air–aerosol
partition coefficients following the approach of Junge (1977a,b).
The parameters MW, Tm, Pv, and Kow were obtained from the PHYSPROP
database (SRC, 2008), while the parameters Kaw, Ksw, and kair were
estimated from data in this database. The mass diffusivities Dair and
Dwater were estimated internally by SB3 considering that they vary
inversely with the square root of the MW and using as reference values
the diffusion coefficient of water in air and the diffusion coefficient of
oxygen in water (Schwarzenbach et al., 2003).
The chemical atmospheric degradation rate constant was estimated
as kair = kOH• COH• (where the rate of chemical degradation is given as
rair = kOH• COH• Cn,air), in which kOH• (m3/(g·s)) is the second-order reaction
constant (SRC, 2008), COH• (g/m3) is the concentration of hydroxyl
radicals in air and Cn,air (g/m3) is the chemical concentration. In the
present analysis, the global average hydroxyl radical concentration of
2.66 × 10^−11 g/m3 (Prinn et al., 2001) was assumed for simplicity.
The degradation rate constant in water, kwater, was estimated from
results of MITI-I biodegradability tests (NITE, 2006). The MITI-I tests
are expressed as the degradation fraction of chemical samples over
time periods ranging from 2 to 4 weeks, with sample mass
determined by direct methods (using total organic carbon, high
performance liquid chromatography and gas chromatography) and
indirect methods (measuring biological oxygen demand). In the
current work, kwater values were estimated as follows:
$$ k_{water} = -\frac{1}{t} \ln\left(1 - f_{deg}\right) \tag{2} $$
where t (seconds) is the duration of a test and fdeg is the
degradation fraction determined by the biological oxygen demand
(BOD) methodology. Only compounds for which the BOD and total
organic carbon methods yielded degradation fractions fdeg that
deviated by no more than 10% were included in the chemical data
set. For modeling consistency, all fdeg values that were experimentally
reported to be higher than 1 or lower than 0 (due to measurement
errors in the MITI-I tests) have been set equal to 0.99
(extremely fast degradability) or 0.01 (extremely low degradability),
respectively. As noted in the literature (Boethling et al., 1995), for
screening purposes the degradation half-lives in water can be taken as
similar to those in soil, where degradation in turn tends to be 3 to 4 times
faster than in an anaerobic flooded soil. Accordingly, in the present
analysis ksoil values were estimated to be equal to kwater values, while
ksed values were taken to be ksoil/3.5. This approach of taking the
kinetic parameters for degradation in soil and sediment as a fraction
of the value for water is common for screening purposes (e.g., Fenner
et al., 2005; U.S. EPA PBT Profiler, 2001).
2.3. Molecular information
Molecular information consisting of 39 molecular parameters was
compiled for each of the 455 chemicals considered. The set of 39
molecular parameters included molecular weight, 10 atom counts (all
atoms, bromine, carbon, chlorine, fluorine, hydrogen, nitrogen,
oxygen, phosphorus, and sulfur), 4 bond counts (all bonds, single
bonds, double bonds and triple bonds), 16 group counts (aldehyde,
amide, amine, sec-amine, carbonyl, carboxyl, cyano, ether, hydroxyl,
methyl, methylene, nitro, nitroso, sulfide, sulfone, and thiol), and 8 ring
counts (all rings, aromatic rings, small rings, 5-membered rings, 5-membered
aromatic rings, 6-membered rings, 6-membered aromatic
rings and 7-to-12-membered rings).
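The derived quantities used in Section 2 can be illustrated with a short sketch. It evaluates the mass ratio of Eq. (1), the pseudo-first-order rate constant in air (kair = kOH• COH•), the rate constant in water from a MITI-I degradation fraction (Eq. (2)), and the screening assumptions ksoil = kwater and ksed = ksoil/3.5. The function names, the length of the reference year in seconds, and the handling of fdeg values exactly equal to 0 or 1 are assumptions made here for illustration; they are not taken from SB3.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed length of the one-year reference period T
C_OH = 2.66e-11                         # g/m3, global average OH radical concentration quoted in the text


def mass_ratio(conc_kg_m3, volume_m3, emission_kg_s, t_ref_s=SECONDS_PER_YEAR):
    """Eq. (1): dimensionless mass ratio w = C * V / (m_dot * T)."""
    return conc_kg_m3 * volume_m3 / (emission_kg_s * t_ref_s)


def k_air(k_oh_m3_per_g_s, c_oh_g_m3=C_OH):
    """Pseudo-first-order degradation rate constant in air (1/s)."""
    return k_oh_m3_per_g_s * c_oh_g_m3


def k_water(f_deg, t_test_s):
    """Eq. (2): first-order rate constant (1/s) from a degradation fraction."""
    # The text resets fractions above 1 to 0.99 and below 0 to 0.01.
    # Treating exactly 1 and exactly 0 the same way is an assumption
    # made here to keep the logarithm finite and the rate positive.
    if f_deg >= 1.0:
        f_deg = 0.99
    elif f_deg <= 0.0:
        f_deg = 0.01
    return -math.log(1.0 - f_deg) / t_test_s


def k_soil_and_sediment(k_w):
    """Screening assumptions of the text: k_soil = k_water, k_sed = k_soil / 3.5."""
    k_soil = k_w
    return k_soil, k_soil / 3.5


# Hypothetical example: 50% degradation measured over a 4-week test.
k_w = k_water(0.5, 28 * 24 * 3600)
k_soil, k_sed = k_soil_and_sediment(k_w)
```

Because the mass ratios of Eq. (1) coincide with compartmental persistence in years, summing `mass_ratio` over all compartments would give the total persistence of the chemical in the system.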
3. Methods
3.1. Uncertainty assessment of the MEM
To simulate the impact that uncertainties of physicochemical
properties and degradation rates estimated from QSPRs or QSBRs have
on the outputs of MEMs (thus affecting any assessment of the
environmental distribution of chemicals), a series of SB3 model simulations was carried out for all 455 chemicals applying 1000 random
combinations (Monte Carlo simulations) of the following independent
chemical properties: Tm, Pv, H, Kow, kair and kwater. Since this set of
independent properties was used to derive a set of dependent properties
(Kaw, Ksw, ksed, and ksoil), as described in Section 2.2, variations in the
independent set generated variations in the dependent set as well.
Uncertainty in MW was not considered and, thus, uncertainty in its
dependent properties (Dair and Dwater) was also neglected.
The uncertainty sources, in terms of statistical distributions,
assigned to the varying independent properties are listed in Table 1.
For Tm, Pv, H, and Kow standard deviations were taken from statistics of
widely recommended QSPRs (Boethling et al., 2004), considering
results for external validation chemicals wherever possible. For kair
and kwater the statistical distributions were taken from QSBRs (Kühne
et al., 2007). It was assumed that the mean value of every distribution
coincided with the property value compiled as described in Section 2.2.
Finally, it was assumed that a variable followed a normal distribution if
the standard deviation given by Boethling et al. (2004) was expressed in
the units of the variable itself. A lognormal distribution was considered when the standard
deviation was reported in logarithmic units. Although the standard
deviation of Pv is given in terms of mm Hg, a lognormal distribution
was used to avoid negative values in chemicals with very low Pv.
The outputs (dimensionless mass ratios) of the SB3 model from
the 1000 random combinations for each chemical, schematized in
Fig. 2 for endrin, were used to generate a database. This database
provided an estimation of the output distribution that one can expect
when using recommended QSPRs and QSBRs to estimate the
environmental distribution of chemicals. This database was used as
a reference to evaluate the direct QSFR approach depicted in Fig. 1.
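The sampling step of this Monte Carlo procedure can be sketched as follows. The sketch draws perturbed inputs for one chemical: a normal draw for Tm, log10-space draws for H and Kow, and a discrete shift of the degradation class for kair. Several choices are assumptions made only for illustration: the function names, centering the lognormal draws on the compiled value in log10 space (so the compiled value is the median), splitting each ± probability evenly between positive and negative shifts, representing "more than two classes" by a shift of 3, and attributing the first set of class probabilities in Table 1 to kair. Pv is left out because its standard deviation is reported in linear units while a lognormal distribution is assumed, and the text does not detail the conversion.

```python
import math
import random

# Standard deviations and class-shift probabilities as listed in Table 1.
SD_TM_K = 58.0          # normal distribution, K
SD_LOG10_H = 0.440      # lognormal distribution, log10 units
SD_LOG10_KOW = 0.427    # lognormal distribution, log10 units
P_SHIFT_KAIR = {0: 0.48, 1: 0.37, 2: 0.13, 3: 0.02}  # key 3 stands for "more than two classes" (assumption)


def sample_lognormal(rng, value, sd_log10):
    """Perturb a positive property in log10 space, so every draw stays positive."""
    return 10.0 ** rng.gauss(math.log10(value), sd_log10)


def sample_class_shift(rng, p_abs_shift):
    """Draw a signed shift in degradation class; the even +/- split is an assumption."""
    sizes = list(p_abs_shift)
    size = rng.choices(sizes, weights=[p_abs_shift[s] for s in sizes])[0]
    return size if size == 0 else size * rng.choice((-1, 1))


def monte_carlo_inputs(tm_k, h, kow, n_draws=1000, seed=0):
    """Return n_draws random combinations of perturbed inputs for one chemical."""
    rng = random.Random(seed)
    return [
        {
            "Tm": rng.gauss(tm_k, SD_TM_K),
            "H": sample_lognormal(rng, h, SD_LOG10_H),
            "Kow": sample_lognormal(rng, kow, SD_LOG10_KOW),
            "kair_class_shift": sample_class_shift(rng, P_SHIFT_KAIR),
        }
        for _ in range(n_draws)
    ]


# Hypothetical property values, used only to exercise the sampler.
draws = monte_carlo_inputs(tm_k=400.0, h=1.0, kow=1.0e5)
```

Each dictionary in `draws` would then be passed to the multimedia model, with the dependent properties (Kaw, Ksw, ksed, ksoil) recomputed from the perturbed independent ones.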
3.2. QSFR model development
QSFRs were developed to estimate directly from the chemical's
molecular descriptors [d1,...,dL] the chemical mass ratios wg predicted
by SB3 for each environmental compartment of the reference
pollution scenario. It is expected that these QSFR models will perform
better than or at least similarly to the SB3 model when fed with
properties estimated from several QSPRs and QSBRs.
3.2.1. Fundamentals
Given N chemicals (characterized by K properties) emitted in a
geographic region described by G compartments, a reference MEM
can be considered to be a multivariate function of the form,
$$ C = f(P, E, S) \tag{3} $$
where C is a matrix of mass ratio predictions of size N × G, P is a matrix
of physicochemical properties of size N × K, E is a matrix of emission
rates of size N × G and S is a matrix of site-specific parameters. When E
and S remain constant, the chemical distribution in the environment
can be solely analyzed in terms of P, the collection of physicochemical
properties and degradation rates of the chemicals to assess.
When key physicochemical properties and degradation rates are
unavailable for chemicals of concern (P is unknown), alternative
multimedia environmental models can be developed from L molecular
descriptors in a matrix D (of size N × L) by means of QSFRs of the form,
$$ C \approx f_{QSFR}(D). \tag{4} $$
To develop the QSFR model given by Eq. (4), a set of Ntr training
chemicals (with Ntr < N) is required for which all properties and
molecular structures are known. The model is then adjusted to
emulate the output of the reference MEM (Eq. (3)) by tuning its
internal parameters with respect to a set of Nte test chemicals. The
performance of the QSFR for new chemicals is later evaluated with a set of Nval
validation chemicals.
3.2.2. Data preprocessing
All input and output variables with values that span more than two
orders of magnitude were logarithmically (base 10) scaled and then
normalized in the range [−1,1] according to,
$$ N_{[-1,1]}(y_i) = 2\, \frac{y_i - y_{min}}{y_{max} - y_{min}} - 1 \tag{5} $$
where yi is a value to be normalized, N[−1,1](yi) is its normalized
counterpart, and ymin and ymax are, respectively, the minimum and
maximum values in the working data set. Since the available
molecular information spans less than two orders of magnitude, all
molecular descriptors were normalized in the range [−1,1] with no
prior logarithmic scaling.
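The preprocessing rule can be written compactly. The sketch below applies Eq. (5), provides its inverse (needed later to turn normalized predictions back into logarithmic mass ratios), and applies the base-10 logarithm only when a variable spans more than two orders of magnitude. The function names and the test used to detect the two-decade span (a max/min ratio above 100 for strictly positive values) are assumptions; the variable is also assumed not to be constant, otherwise Eq. (5) is undefined.

```python
import math


def normalize(y, y_min, y_max):
    """Eq. (5): map y from [y_min, y_max] onto [-1, 1]."""
    return 2.0 * (y - y_min) / (y_max - y_min) - 1.0


def denormalize(z, y_min, y_max):
    """Inverse of Eq. (5): map z from [-1, 1] back onto [y_min, y_max]."""
    return (z + 1.0) * (y_max - y_min) / 2.0 + y_min


def preprocess(values):
    """Log-scale a variable spanning more than two decades, then normalize it."""
    use_log = min(values) > 0 and max(values) / min(values) > 100.0
    scaled = [math.log10(v) for v in values] if use_log else list(values)
    lo, hi = min(scaled), max(scaled)
    return [normalize(s, lo, hi) for s in scaled], use_log, lo, hi


# Hypothetical mass ratios spanning six decades, and hypothetical atom counts.
w_norm, w_logged, w_lo, w_hi = preprocess([1e-6, 1e-3, 1.0])
d_norm, d_logged, d_lo, d_hi = preprocess([0, 5, 10])
```

Keeping the `lo` and `hi` values returned by `preprocess` is what makes the later denormalization of QSFR outputs possible.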
Table 1
Uncertainty distribution parameters assigned to independent properties affecting the chemical environmental partitioning. The columns "Data set" to "Source" describe the typical uncertainty distribution of parameters predicted by QSPRs.

| Input | Description | Assumed distribution | Data set | Statistic parameters (a, b) | Statistic parameters units | Source |
|---|---|---|---|---|---|---|
| Tm | Melting point | Normal | Validation | SD = 58 | K | Boethling et al. (2004) |
| Pv | Vapor pressure | Log-normal | Validation | SD = 96 | Pa | Boethling et al. (2004) |
| H | Henry's law constant | Log-normal | Training | SD = 0.440 | log10(Pa m3/mol) | Boethling et al. (2004) |
| Kow | Octanol–water partition coefficient | Log-normal | Validation | SD = 0.427 | log10(–) | Boethling et al. (2004) |
| kair | Kinetic constant for degradation in air | Discrete | Training | P(0) = 0.48, P(±1) = 0.37, P(±2) = 0.13, P(±>2) = 0.02 | – | Kühne et al. (2007) |
| kwater | Kinetic constant for degradation in water | Discrete | Training | P(0) = 0.52, P(±1) = 0.35, P(±2) = 0.08, P(±>2) = 0.05 | – | Kühne et al. (2007) |

a The parameters have been reported for QSPRs in standard deviations, SD.
b The reported parameters for QSBRs are probabilities, P(C), that indicate if a chemical has been classified as member of a degradation class C (0 = correct class, ±1 = neighbor category predicted, ±2 = two categories differing and ±>2 = more than two categories differing) in the 9-class scale proposed by Mackay et al. (1992).

Fig. 2. Random realizations of the Monte Carlo approach on SB3 for endrin.

3.2.3. Training, test and validation data sets
To build a QSFR model, the original set of 375 working chemicals
was split into a training data set and a test data set. About 80% of the
working chemicals were dedicated to training every QSFR model,
while the remaining 20% of working chemicals were allocated for
testing model performance. The data selection scheme, based on the
self-organizing map (SOM) algorithm (Kohonen et al., 1996), must
ensure that models capture the diversity of chemical structures
present in the data set during the training process and that the test set
is also well represented in the training set. The SOM is a procedure for
mapping and clustering high-dimensional data by fitting an optimal
number of units (also called neurons, cells or nodes) to the data. It
minimizes distances (e.g., the Euclidean distance) between units and
data points (i.e., minimum average quantization error) while
preserving the vicinity of units in both the map and the data space
(i.e., minimum average topological error). The procedure for selecting
the training and test data sets is as follows.
First, SOMs with sizes ranging from 10 to 150% of the number of
working chemicals (375) were trained to fit these chemicals in the
input-target space of the desired QSFR model using the SOM toolbox 5
for Matlab (Vesanto et al., 2000). All SOMs were set to have toroidal
shape to avoid border effects. Lattices were hexagonal to minimize
both the mean quantization error (qerror) and the mean topological
error (terror). Chemicals were represented by vectors whose elements
included the selected counts of molecular constituents and the
target chemical partitioning variable. Note that qerror and terror are
defined by (Kohonen, 2001),
$$ q_{error} = \frac{1}{N_{wk}} \sum_{i=1}^{N_{wk}} \lVert x_i - m_{x_i} \rVert \tag{6} $$
and
$$ t_{error} = \frac{1}{N_{wk}} \sum_{i=1}^{N_{wk}} u(x_i) \tag{7} $$
where Nwk is the number of working data vectors, mxi is the best
matching unit (BMU) for the data vector xi, and u(xi) is a function
that takes the value of 1 when the BMU and the second-best matching unit of xi are
not adjacent, and 0 otherwise.
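The two map-quality measures can be computed with a few lines of code. The sketch below is not the SOM Toolbox implementation; it is a minimal illustration in which the unit weight vectors are given as plain lists and lattice adjacency is supplied by the caller through a hypothetical `adjacent(j, k)` predicate (for a hexagonal toroidal map this predicate would encode the six wrapped neighbors of each unit). It follows the usual convention, assumed here, that a data vector contributes a topological error when its best and second-best matching units are not lattice neighbors.

```python
import math


def two_best_matching_units(x, units):
    """Indices of the closest and second-closest unit weight vectors to x."""
    order = sorted(range(len(units)), key=lambda j: math.dist(x, units[j]))
    return order[0], order[1]


def som_errors(data, units, adjacent):
    """Mean quantization error (Eq. (6)) and mean topological error (Eq. (7))."""
    q_sum, t_sum = 0.0, 0
    for x in data:
        bmu, second = two_best_matching_units(x, units)
        q_sum += math.dist(x, units[bmu])          # Euclidean distance to the BMU
        t_sum += 0 if adjacent(bmu, second) else 1  # 1 when the two units are not neighbors
    n = len(data)
    return q_sum / n, t_sum / n


# Hypothetical one-dimensional ring of four units with wrap-around adjacency.
units = [[0.0], [1.0], [2.0], [3.0]]
ring_adjacent = lambda j, k: (j - k) % 4 in (1, 3)
q_err, t_err = som_errors([[0.1], [2.9]], units, ring_adjacent)
```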
Second, for each trained SOM, chemicals were included in the
training data set when they showed the lowest or highest quantization
error with respect to the weight vector of their corresponding
BMUs. Chemicals with the lowest or highest values in the descriptor
and target spaces were also included in the training data set. All
remaining working chemicals not following the above rules were
allocated to the test set. This procedure was designed to ensure that
the test sets belonged to the applicability domain of the
corresponding training sets. The number of training chemicals was
kept at about 80% (±5%) of the number of working chemicals to
ensure good generalization of the QSFR models. This requirement also determines
the size of the SOM, since a larger map implies less
populated SOM clusters and the selection of more training chemicals
and fewer test chemicals. The 80 validation chemicals were not used in
any stage of the development of QSFRs.
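One possible reading of this selection rule is sketched below. Within each SOM unit, the chemicals with the lowest and the highest quantization error go to the training set, chemicals holding the minimum or maximum of any descriptor or target variable are also moved to training, and everything else forms the test set. The data layout (dictionaries keyed by chemical name), the function names, and the choice of exactly one lowest and one highest chemical per unit are assumptions; the text does not specify how ties or sparsely populated units were handled.

```python
def split_by_som_units(bmu_of, q_error_of):
    """Assign chemicals to training or test sets from their SOM placement.

    bmu_of:      dict chemical -> index of its best matching unit
    q_error_of:  dict chemical -> quantization error with respect to that unit
    """
    members = {}
    for chem, unit in bmu_of.items():
        members.setdefault(unit, []).append(chem)

    training = set()
    for chems in members.values():
        ranked = sorted(chems, key=lambda c: q_error_of[c])
        training.add(ranked[0])    # closest to the unit's weight vector
        training.add(ranked[-1])   # farthest from it
    test = set(bmu_of) - training
    return training, test


def move_extremes_to_training(training, test, values_of):
    """Also train on chemicals holding the minimum or maximum of any variable."""
    n_vars = len(next(iter(values_of.values())))
    for j in range(n_vars):
        for pick in (min, max):
            chem = pick(values_of, key=lambda c: values_of[c][j])
            training.add(chem)
            test.discard(chem)
    return training, test


# Hypothetical placement of four chemicals on two units.
bmu_of = {"chem_a": 0, "chem_b": 0, "chem_c": 0, "chem_d": 1}
q_error_of = {"chem_a": 0.1, "chem_b": 0.5, "chem_c": 0.9, "chem_d": 0.2}
training, test = split_by_som_units(bmu_of, q_error_of)
```

Under this reading, a larger map leaves fewer chemicals per unit, so a larger share of them ends up in the training set, which is how the map size controls the training fraction.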
3.2.4. Supervised learning algorithms
QSFR models were built with support vector regression (SVR)
algorithms (Drucker et al., 1997) using RBF kernel functions. The ε-SVR implementation in the software package RapidMiner 4.4
(Mierswa et al., 2006) was used. For any environmental compartment
g, the QSFR can be defined as,
$$ N_{[-1,1]}\left(\log_{10} w_g\right) = f_{QSFR}\left(N_{[-1,1]}(d_1), \ldots, N_{[-1,1]}(d_L)\right) \tag{8} $$
where the function fQSFR represents the SVR linking normalized
molecular descriptors to normalized logarithmic mass ratios.
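To make the form of fQSFR concrete, recall that a trained ε-SVR with an RBF kernel predicts through a kernel expansion over its support vectors, f(x) = Σ βi K(si, x) + b with K(x, z) = exp(−γ‖x − z‖²). The sketch below evaluates that expansion in plain Python. It is not the RapidMiner implementation used in the study, and the support vectors, coefficients, bias and γ shown are hypothetical placeholders; in practice they result from training.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Evaluate f(x) = sum_i beta_i * K(s_i, x) + b for a trained epsilon-SVR."""
    return bias + sum(
        beta * rbf_kernel(s, x, gamma)
        for s, beta in zip(support_vectors, dual_coefs)
    )


# Hypothetical trained model with two support vectors in a 3-descriptor space.
# Inputs are normalized descriptors; the output is a normalized log10 mass ratio.
support_vectors = [[-1.0, 0.0, 0.5], [0.2, -0.4, 1.0]]
dual_coefs = [0.7, -0.3]
prediction = svr_predict([0.0, 0.0, 0.0], support_vectors, dual_coefs,
                         bias=0.1, gamma=0.5)
```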
An iterative evaluation of 4000 models was implemented to
optimize the parameters of the SVR model (Mierswa et al., 2006) for
every compartment and set of input features considered. The SVR-based QSFR model was developed with the training data set and
evaluated on the test and validation data sets for every combination of
parameters. The optimal SVR model in terms of generalization
capabilities was the one with the lowest mean absolute error (MAE)
among the 10 models with the highest squared correlation (R²) values
on the test data set. The MAE and R² values measure the performance
of the SVR-based models by comparing the target (SB3 MEM output
tn) and predicted (SVR output pn) normalized logarithmic mass ratios
in a given compartment for the Nset chemicals in the data set (tr =
training, te = test or val = validation),
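$$ MAE_{set} = \frac{1}{N_{set}} \sum_{n=1}^{N_{set}} \lvert t_n - p_n \rvert, \quad set = tr,\ te\ \mathrm{or}\ val \tag{9} $$
and
$$ R^2_{set} = \frac{\left[\sum_{n=1}^{N_{set}} (t_n - \bar{t})(p_n - \bar{p})\right]^2}{\left[\sum_{n=1}^{N_{set}} (t_n - \bar{t})^2\right] \left[\sum_{n=1}^{N_{set}} (p_n - \bar{p})^2\right]}, \quad set = tr,\ te\ \mathrm{or}\ val \tag{10} $$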
The overbar indicates averages over the Nset chemicals in the data
set (set = tr, te or val).
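The two performance measures and the model-selection rule can be sketched as follows. `mae` and `r_squared` implement Eqs. (9) and (10); `select_model` keeps the candidates with the highest test-set R² and returns the one with the lowest test-set MAE. The dictionary layout of the candidate models and the function names are hypothetical and serve only to illustrate the rule.

```python
def mean(values):
    return sum(values) / len(values)


def mae(targets, preds):
    """Eq. (9): mean absolute error between targets and predictions."""
    return mean([abs(t - p) for t, p in zip(targets, preds)])


def r_squared(targets, preds):
    """Eq. (10): squared Pearson correlation between targets and predictions."""
    t_bar, p_bar = mean(targets), mean(preds)
    cov = sum((t - t_bar) * (p - p_bar) for t, p in zip(targets, preds))
    var_t = sum((t - t_bar) ** 2 for t in targets)
    var_p = sum((p - p_bar) ** 2 for p in preds)
    return cov ** 2 / (var_t * var_p)


def select_model(candidates, top_k=10):
    """Among the top_k candidates by test R2, return the one with the lowest test MAE."""
    best_by_r2 = sorted(candidates, key=lambda m: m["r2"], reverse=True)[:top_k]
    return min(best_by_r2, key=lambda m: m["mae"])


# Hypothetical test-set scores for three parameter combinations.
candidates = [
    {"name": "m1", "r2": 0.95, "mae": 0.12},
    {"name": "m2", "r2": 0.93, "mae": 0.08},
    {"name": "m3", "r2": 0.60, "mae": 0.05},
]
chosen = select_model(candidates, top_k=2)
```

Filtering on R² first and then minimizing MAE guards against picking a model with a small average error but poor correlation with the targets, as would happen with `m3` in the hypothetical list above.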
The accuracy of an optimized SVR model was estimated by means
of both a 10-fold cross-validation (CV) and a leave-one-out (LOO)
validation procedure for all 375 working chemicals. Note that the
evaluation of the SVRs is based on normalized logarithms of mass
ratios. Normalized QSFR predictions (Eq. (8)) for all chemicals were
denormalized (Eq. (5)) to yield logarithmic mass ratios. To avoid
model overfitting, the predictive performance of all models was
assessed in terms of the predictive squared coefficient (q2), as
suggested by Schüürmann et al. (2008),
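$$ q^2_{set} = 1 - \frac{\sum_{n=1}^{N_{set}} \left(\log_{10} w^{predicted}_{n,g} - \log_{10} w^{target}_{n,g}\right)^2}{\sum_{n=1}^{N_{set}} \left(\log_{10} w^{target}_{n,g} - \frac{1}{N_{set}} \sum_{m=1}^{N_{set}} \log_{10} w^{target}_{m,g}\right)^2}, \quad set = tr,\ te\ \mathrm{or}\ val \tag{11} $$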
This q2 coefficient varies in the range (−∞, 1]. Models with q2
values closer to 1 have a high predictive performance, while models
with q2 values equal to or lower than zero perform no better than simply
averaging all targets.
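The behaviour of q2 at its reference points can be checked with a short sketch of Eq. (11). The function name and the sample values below are illustrative assumptions and are not data from this study.

```python
def q2(log_w_target, log_w_pred):
    """Predictive squared coefficient of Eq. (11) for logarithmic mass ratios."""
    n = len(log_w_target)
    mean_target = sum(log_w_target) / n
    # Squared deviations of the predictions from the targets.
    ss_res = sum((p - t) ** 2 for t, p in zip(log_w_target, log_w_pred))
    # Squared deviations of the targets from their own mean.
    ss_tot = sum((t - mean_target) ** 2 for t in log_w_target)
    return 1.0 - ss_res / ss_tot


# Made-up logarithmic mass ratios, for illustration only.
targets = [-1.0, -2.0, -3.0]
perfect = q2(targets, targets)                # exact predictions
mean_only = q2(targets, [-2.0, -2.0, -2.0])   # always predicting the target mean
poor = q2(targets, [0.0, 0.0, 0.0])           # worse than predicting the mean
```

In this sketch `perfect` equals 1, `mean_only` equals 0 and `poor` is negative, which matches the interpretation given above.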
4. Results and discussion
Mass ratios predicted by the QSFR models emulating the reference
scenario are presented and discussed in Section 4.1 for emissions in
the water and air compartments. Section 4.2 explores how model
performance can be improved by clustering chemicals and developing
class-specific QSFR models.
groups and rings) listed in Table 2 with their minimum and maximum
values in current data sets, together with the MW were used to
develop the current QSFR models (Eq. (8)) to predict the environmental distribution of chemicals by following the direct approach
depicted in Fig. 1. In this case the performance for the 80 validation
chemicals increased to q2 = 0.64 and q2 = 0.68 for the air and water
compartments, respectively. Thus, all QSFR models presented and
discussed hereafter were built using MW and counts of molecular
constituents.
4.1.2. Selection of training and test chemicals
Chemicals, which were represented by vectors of dimension 40 in
terms of their mass ratios in a single compartment, the molecular
weight (MW) and 38 non-zero counts of molecular constituents, were
clustered in a two-dimensional SOM representation of the input
space.
The selection of the training and test chemicals is illustrated in
Fig. 3, where three SOM units arbitrarily labeled as K15, O2 and U8 are
used as examples. The higher the similarities between chemicals
within a single SOM unit in terms of structure and chemical
partitioning, the lower the differences between their qerror values
are, for example between dieldrin (training) and endrin (test) in unit K15,
and between propanal (training) and butanal (test), ethanal (test) and
isobutyraldehyde (test) in unit U8. On the other hand, cinnamyl
alcohol (training) and hydroxymethyl benzene (training) encompass
Table 2
Molecular constituents used in QSFR model development.

Count                                   Symbol         Working data set      Validation data set
                                                       Min       Max         Min       Max
Molecular weight (g/mol)                MW             44.05     959.17      85.11     402.49
Count of all atoms                      ACall          5         89          10        81
Count of bromine atoms                  ACbromine      0         10          0         3
Count of carbon atoms                   ACcarbon       1         32          3         26
Count of chlorine atoms                 ACchlorine     0         8           0         3
Count of fluorine atoms                 ACfluorine     0         27          0         3
Count of hydrogen atoms                 AChydrogen     0         60          3         54
Count of nitrogen atoms                 ACnitrogen     0         6           0         3
Count of oxygen atoms                   ACoxygen       0         8           0         8
Count of phosphorus atoms               ACphosphorus   0         1           0         1
Count of sulphur atoms                  ACsulphur      0         4           0         2
Count of all bonds                      BCall          4         88          10        80
Count of single bonds                   BCsingle       4         88          9         80
Count of double bonds                   BCdouble       0         18          0         8
Count of triple bonds                   BCtriple       0         2           0         2
Count of aldehyde groups                GCaldehyde     0         1           0         1
Count of amide groups                   GCamide        0         2           0         2
Count of amine groups                   GCamine        0         2           0         2
Count of sec-amine groups               GCsec-amine    0         2           0         2
Count of carbonyl groups                GCcarbonyl     0         2           0         2
Count of carboxyl groups                GCcarboxyl     0         2           0         2
Count of cyano groups                   GCcyano        0         2           0         2
Count of ether groups                   GCether        0         4           0         3
Count of hydroxyl groups                GChydroxyl     0         4           0         2
Count of methyl groups                  GCmethyl       0         9           0         7
Count of methylene groups               GCmethylene    0         3           0         0
Count of nitro groups                   GCnitro        0         3           0         1
Count of nitroso groups                 GCnitroso      0         1           0         0
Count of sulfide groups                 GCsulfide      0         4           0         2
Count of sulfone groups                 GCsulfone      0         1           0         1
Count of thiol groups                   GCthiol        0         1           0         1
Count of all rings                      RCall          0         12          0         2
Count of aromatic rings                 RCaromatic     0         4           0         2
Count of small rings                    RCsmall        0         7           0         0
Count of 5-membered rings               RC5-m          0         4           0         1
Count of aromatic 5-membered rings      RCa-5-m        0         2           0         0
Count of 6-membered rings               RC6-m          0         4           0         2
Count of aromatic 6-membered rings      RCa-6-m        0         4           0         2
Count of (7–12)-membered rings          RC7–12-m       0         2           0         1
4.1. Chemical distribution
4.1.1. Molecular feature selection
Two types of molecular parameters were tested as input for the
QSFR models: molecular descriptors calculated from a semi-empirical approximation of molecular orbital (MO) theory (Bredow
and Jug, 2005) and simple counts of molecular constituents. Even
though the former have been successfully used in the past to
estimate physicochemical properties (Raymond et al., 2001; Devillers, 2003; Taskinen and Yliruusi, 2003), preliminary QSFR models
developed with several combinations of the MW and 22 semi-empirical descriptors estimated with the Parameterized Model 3
(PM3) method in the CAChe software (Fujitsu, 2004) did not perform adequately.
For example, the performance of these preliminary QSFR models was
always q2 ≤ 0.15 and q2 ≤ 0.49 for the air and water compartments,
respectively, for the 80 validation chemicals. In this preliminary
screening of QSFR models, descriptors were selected by means of the
CFS filtering algorithm (Hall, 1999) from an initial set of molecular
information that included the heat of formation (ΔHf, kcal/mol),
molar refractivity (MR, m3/mol), polarizability (PO, Å3), total
hybridization dipole moment (μhyb, debye), total point charge dipole
moment (μpc, debye), total sum dipole moment (μ, debye), area
(Area, Å2), volume (Vol, Å3), number of filled levels (NFL), highest
occupied molecular orbital energy (HOMO, eV), lowest unoccupied
molecular orbital energy (LUMO, eV), ionization potential (IP, eV),
electron affinity (EA, eV), connectivity indexes (0χ, 1χ, 2χ), valence
connectivity indexes (0χv, 1χv, 2χv) and kappa alpha shape indexes
(1κ, 2κ, 3κ).
Counts of molecular constituents were also tested as input to QSFR
models since fragment contributions have provided adequate structural information in the development of QSPRs (Boethling et al., 2004)
and QSBRs (Raymond et al., 2001) for a wide range of chemicals. This
is also the case for the models traditionally included in EPI Suite™
(SRC, 2008). The molecular constituents (atoms, bonds, functional
Table 2 column headings: Working data set (Min, Max); Validation data set (Min, Max).
Fig. 3. Example of work chemicals distribution within three SOM units arbitrarily labeled as K15, O2 and U8. The SOM clustered the 375 work chemicals based on their 39 counts of
molecular constituents and mass ratios in the air phase. Distances from the unit center are measured in terms of the quantization error (qerror).
2-phenylethanol (test) and hydroxybenzene (test) in unit O2. Thus,
selecting as training chemicals those with the lowest and highest
qerror in each SOM unit ensures diversity in the training set
while keeping the vicinity (similarity) between the test and the
training chemicals. Units that cluster two or more work chemicals
contribute a maximum of two training chemicals, while SOM
units clustering only one chemical contribute one. The SOMs
yielded 300 (80.0%) and 299 (79.7%) training chemicals for the air and
water compartments, respectively.
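The per-unit selection rule can be sketched as follows. This is a simplified reading of the procedure, assuming that each work chemical has already been assigned to a SOM unit with a known qerror; the function name, the input layout and the sample labels are hypothetical. The sketch always takes two training chemicals from units with two or more members, whereas the text only states a maximum of two.

```python
from collections import defaultdict


def split_training_test(assignments):
    """Split work chemicals into training and test sets, unit by unit.

    assignments: iterable of (chemical, som_unit, qerror) tuples.
    In each unit the chemicals with the lowest and highest qerror go to
    the training set; the remaining members of the unit go to the test set.
    """
    by_unit = defaultdict(list)
    for name, unit, qerror in assignments:
        by_unit[unit].append((qerror, name))

    training, test = [], []
    for members in by_unit.values():
        members.sort()  # ascending qerror (ties broken by name)
        training.append(members[0][1])
        if len(members) > 1:
            training.append(members[-1][1])
            test.extend(name for _, name in members[1:-1])
    return training, test


# Made-up assignments, for illustration only.
training, test = split_training_test([
    ("chem_a", "unit_1", 0.2),
    ("chem_b", "unit_1", 0.5),
    ("chem_c", "unit_1", 0.9),
    ("chem_d", "unit_2", 0.1),
])
```

With these inputs, `chem_a` and `chem_c` bracket `chem_b` within `unit_1`, so `chem_b` becomes a test chemical that lies between two training chemicals in terms of qerror.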
The domain of applicability (DOA) of a QSFR is defined here as the
aggregation of the domains covered by each non-empty SOM unit, i.e.,
the set of qerror ranges covered within all non-empty BMUs. The
training chemical with the largest qerror in each non-empty SOM unit
is taken as the domain border of the unit and chemicals with larger
qerror are considered not well represented in their BMU and, thus, are
outside of the DOA.
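A minimal sketch of this DOA test is given below. It assumes that the BMU and the qerror of each chemical are already known; the function names and sample values are hypothetical. Treating a chemical whose BMU contains no training chemical as outside the DOA is an assumption of the sketch.

```python
def doa_borders(training):
    """Return the DOA border (largest training qerror) of each non-empty unit.

    training: iterable of (som_unit, qerror) pairs for the training chemicals.
    """
    borders = {}
    for unit, qerror in training:
        borders[unit] = max(qerror, borders.get(unit, 0.0))
    return borders


def in_doa(bmu, qerror, borders):
    """A chemical is inside the DOA if its qerror does not exceed its BMU border."""
    return bmu in borders and qerror <= borders[bmu]


# Made-up values, for illustration only.
borders = doa_borders([("unit_1", 0.4), ("unit_1", 0.7), ("unit_2", 0.3)])
inside = in_doa("unit_1", 0.6, borders)
too_far = in_doa("unit_1", 0.8, borders)
empty_unit = in_doa("unit_3", 0.1, borders)
```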
4.1.3. Multimedia chemical partitioning
SVR-based QSFR models for multimedia environmental chemical
partitioning in air and water compartments of the reference scenario
have been developed and their generalization capabilities validated
with 80 chemicals not seen at all by the models. Fig. 4a and b
compares the target partitioning values (reference mass ratios)
respectively generated for the air and water compartments by the
MEM with the predictions obtained from the two specific (QSFRair)
and (QSFRwater) models with chemicals characterized by the MW and
38 non-zero counts of molecular constituents. These figures also
include the ranges and 95% confidence interval limits of the MEM
output obtained from the Monte Carlo simulation.
Fig. 4 shows that mass ratios predicted by the QSFR models both
for the air and water compartments deviate from the MEMs' estimates
within the envelope of variability obtained with the Monte Carlo
realizations of MEM (MC-MEM). QSFR predictions with the highest
deviations tend to be close to the extreme values delimited by the
variation ranges of the MC-MEM, especially for the air compartment
where mass ratios are very small and sensitive to models' input
uncertainties. The variations in property values generated by the
statistical distributions of standard property estimation methods
(Table 1) produced variations in the outputs of MC-MEM of up to 12
logarithmic units. These results suggest that the outputs of standard
MEMs should undergo a similar variability if input variables were
estimated from available QSPR and QSBR models.
Table 3 shows that the 1000 realizations of 455 chemicals of the
MC-MEM yield q2mean = 0.88 and MAEmean = 0.80 for the air compartment and q2mean = 0.86 and MAEmean = 0.17 for the water
compartment. Table 3 also shows that the predictive performances
of QSFRair and QSFRwater per data set are reasonably high. The overall
performances of these models, including all the 455 chemicals (i.e.,
the training, test and validation sets simultaneously) are q2 = 0.82
and MAE = 0.91 for air and q2 = 0.81 and MAE = 0.32 for water.
Table 3 also indicates that the predictive capacity of the QSFR models
in Fig. 4 is high for chemicals located within the boundaries of the
DOA, which is the case of the training and test chemicals previously
selected with the SOM. QSFR model performance drops for the
validation set since some chemicals fall outside the DOA.
QSFR models using simple counts of molecular constituents, as the
ones we propose here, cannot distinguish between isomers that have
the exact same number of bonds, functional groups and ring
structures. With these characteristics, there are 81 working and 10
validation isomeric chemicals out of the 375 working and 80
validation chemicals, respectively. This is not a serious drawback
because transport and degradation properties for these isomers in the
current working and validation data sets are not significantly
different. On the other hand, molecular constituent counts have the
advantage that they can be easily retrieved or calculated given the
molecular formula or the SMILES (Weininger, 1988), making them
suitable for simple and rapid screenings of chemical partitioning.
Since the constituents of a chemical (atoms, bonds, groups and rings)
can be counted without errors and the SVR algorithm would always
yield the same model for the same training data and parameters
(unlike ANNs, which adjust internal parameters in search of a local
minimum error), current SVR-based QSFR models can easily be
reproduced. This represents an additional advantage over models
using as input semi-empirical descriptors since their values vary
depending on the specific MO method (Bredow and Jug, 2005) used to
estimate them.
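As a rough illustration of how easily part of these inputs can be obtained, the sketch below derives the MW and the atom counts of Table 2 from a molecular formula. It is a simplified example, not the procedure used in this work: the function names are hypothetical, formulas with parentheses or charges are not handled, and bond, group and ring counts would require the molecular structure (e.g., a SMILES string) rather than the formula.

```python
import re
from collections import Counter

# Standard atomic masses (g/mol) of the elements listed in Table 2.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999, "N": 14.007, "F": 18.998,
               "P": 30.974, "S": 32.06, "Cl": 35.45, "Br": 79.904}

# Element symbol -> atom-count symbol used in Table 2.
TABLE2_SYMBOL = {"Br": "ACbromine", "C": "ACcarbon", "Cl": "ACchlorine",
                 "F": "ACfluorine", "H": "AChydrogen", "N": "ACnitrogen",
                 "O": "ACoxygen", "P": "ACphosphorus", "S": "ACsulphur"}


def atom_counts(formula):
    """Count atoms per element in a simple formula such as 'C2H4O'."""
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_MASS:
            raise ValueError(f"unsupported element: {symbol}")
        counts[symbol] += int(number) if number else 1
    return counts


def atom_features(formula):
    """Return MW, ACall and the per-element atom counts keyed by Table 2 symbols."""
    counts = atom_counts(formula)
    features = {name: counts.get(el, 0) for el, name in TABLE2_SYMBOL.items()}
    features["ACall"] = sum(counts.values())
    features["MW"] = sum(ATOMIC_MASS[el] * k for el, k in counts.items())
    return features


ethanal = atom_features("C2H4O")
```

Because isomers such as the three xylenes share the formula C8H10, a function of this kind returns identical atom counts and MW for all of them, which is the limitation discussed above.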
Table 3
Evaluation of different approaches to estimate mass ratios in air and water
compartments when the chemical is emitted in the water compartment.

Compartment   Estimation       Performance   Performances per data set (a)
              approach         measure       Training set   Test set   Validation set   All sets
Air           MC-MEM           q2            0.88           0.87       0.86             0.88
                               MAE           0.80           0.79       0.82             0.80
Air           QSFRair          q2            0.85           0.86       0.64             0.82
                               MAE           0.81           0.81       1.34             0.91
Air           QSFRair,X/Y      q2            0.92           0.91       0.68             0.88
                               MAE           0.54           0.59       1.30             0.68
Water         MC-MEM           q2            0.89           0.79       0.78             0.86
                               MAE           0.16           0.15       0.22             0.17
Water         QSFRwater        q2            0.86           0.60       0.68             0.81
                               MAE           0.30           0.34       0.39             0.32
Water         QSFRwater,X/Y    q2            0.94           0.78       0.62             0.87
                               MAE           0.11           0.19       0.36             0.17

(a) The number of chemicals per data set varies per compartment. For the air
compartment there are 300 training, 75 test and 80 validation chemicals. For the water
compartment there are 299 training, 76 test and 80 validation chemicals.
4.2. Assessment of the chemical domain
4.2.1. QSFR models for two broad chemical classes
The development of QSFR models for more homogeneous classes
of chemicals was undertaken with the aim of assessing ways of
improving the performances. This implies (i) the definition of
chemical classes (families), (ii) the labeling of chemicals according
to the characteristics of the two classes, and (iii) the development and
use of specialized QSFR models (one per chemical class).
Different criteria could be proposed for creating chemical families
with respect to molecular structure, but the performance of any class-tailored QSFR is limited by the availability of sufficient training
data. In a preliminary screening of the 375 work chemicals in the
reference scenario, it was observed that 39 chemicals were solely
formed by carbon and hydrogen atoms, while the remaining 336
chemicals have at least one heteroatom (bromine, chlorine, fluorine,
nitrogen, oxygen, phosphorus or sulfur atoms). These two classes
would constitute a reasonable starting point for developing specialized QSFR models if they were equally populated. In the current data
sets there is an unbalanced distribution of chemicals with only 39
chemicals in the first class, which is not sufficient to generate
appropriate training and test data sets. Thus, an adjustment was made
to create two meaningful chemical families, one with 146 chemicals
containing solely carbon, hydrogen and oxygen (Class X) and another
with 229 chemicals containing at least one heteroatom other than
oxygen (Class Y).
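The class assignment, and the later use of one specialized model per class and compartment, can be sketched as below. The function names, the `models` dictionary and its keys are hypothetical; the sketch assumes, consistent with the class sizes given above (39 + 336 = 375 and 146 + 229 = 375), that pure hydrocarbons belong to Class X.

```python
# Elements other than C, H and O that appear in Table 2.
NON_CHO_ELEMENTS = ("Br", "Cl", "F", "N", "P", "S")


def chemical_class(atom_counts):
    """Return 'X' for chemicals made only of C, H and O, and 'Y' otherwise.

    atom_counts: mapping from element symbol to number of atoms.
    """
    has_other = any(atom_counts.get(el, 0) > 0 for el in NON_CHO_ELEMENTS)
    return "Y" if has_other else "X"


def predict_with_class_model(features, atom_counts, compartment, models):
    """Dispatch a chemical to the class-tailored model of a compartment.

    models: mapping from (compartment, class) to a callable; a placeholder
    for trained class-tailored QSFR models.
    """
    return models[(compartment, chemical_class(atom_counts))](features)
```

A hydrocarbon such as propane (`{"C": 3, "H": 8}`) and an oxygenated chemical such as ethanal (`{"C": 2, "H": 4, "O": 1}`) are both assigned to Class X, while chloromethane (`{"C": 1, "H": 3, "Cl": 1}`) falls in Class Y.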
Chemicals in class X cover broad and meaningful ranges of sizes
(44.05 ≤ MW ≤ 434.58) and hydrophobicity (1.05 × 10^−2 ≤ Kow ≤
3.80 × 10^14). Chemicals in class X can be described solely by the MW,
4 atom counts (all atoms, carbon, hydrogen and oxygen), 3 bond counts
(all bonds, single bonds and double bonds), 7 functional group counts
(aldehyde, carbonyl, carboxyl, ether, hydroxyl, methyl and methylene)
and 8 ring counts (all rings, aromatic rings, small rings, 5-membered,
aromatic 5-membered, 6-membered, aromatic 6-membered and (7–12)-membered). On the other hand, chemicals in class Y also cover
appropriate ranges of sizes (50.49 ≤ MW ≤ 959.17) and hydrophobicity
(5.62 × 10^−3 ≤ Kow ≤ 4.57 × 10^12). They are described by the MW and
the 38 constituent counts listed in Table 2 (like in the QSFR models of
Section 4.1).
For optimal results, specific training and test data sets should be
used every time a new SVR-based QSFR is developed. Nevertheless, the
same training and test chemicals previously selected for the models
QSFRair and QSFRwater were maintained, for comparison purposes,
when developing class-tailored QSFRs (i.e., for classes X and Y). In this
Fig. 4. Comparison between the logarithmic mass ratios determined by SB3 and those
estimated by (a) QSFRair for the air compartment (q2all = 0.82 and MAE = 0.91) and
(b) QSFRwater for the water compartment (q2all = 0.81 and MAE = 0.32). Emissions are in
the water compartment.
case, four class-tailored models were developed: QSFRair,X, QSFRair,Y,
QSFRwater,X and QSFRwater,Y. Logarithmic mass ratios were predicted for
each chemical, according to its chemical class (X or Y) by using the
appropriate model per compartment. The results for both classes
(X and Y) are presented in the following discussion for each
compartment with the acronyms QSFRair,X/Y (i.e., models QSFRair,X or
QSFRair,Y) and QSFRwater,X/Y (i.e., models QSFRwater,X or QSFRwater,Y).
Fig. 5 depicts predictions of logarithmic mass ratios for the air
compartment (Fig. 5a) and the water compartment (Fig. 5b) obtained
with the models QSFRair,X/Y and QSFRwater,X/Y, respectively. An improvement in model performance has been achieved with the class-tailored
models in Fig. 5 with respect to the overall simple QSFRair and QSFRwater
models in Fig. 4. Table 3 shows that the class-tailored QSFRair,X/Y and
QSFRwater,X/Y, plotted in Fig. 5, yield higher q2 and lower MAE values for
air (q2all = 0.88 and MAE = 0.68 for QSFRair,X/Y) and water (q2all = 0.87 and
MAE = 0.17 for QSFRwater,X/Y), compared to the overall simple models
QSFRair (q2all = 0.82 and MAE = 0.91) and QSFRwater (q2all = 0.81 and
MAE = 0.32), plotted in Fig. 4, when all data sets (training, test and
tailored models. This is discussed in the next subsection. Also, the
training and test data sets selected from SOMs for the general QSFR
models (Fig. 4) were kept unchanged when training the class-specialized QSFRs (Fig. 5) to facilitate the comparison of performances
among the different models (Table 3).
4.2.2. Domain of applicability
QSFR predictions outside the DOA of the models should be
avoided (Johnson, 2008). The DOA of any model is primarily defined
by its training chemicals (Weaver and Gleeson, 2008). Thus, by
identifying the DOA of an existing QSFR model it is possible to
approximately assess how appropriate it is for a new chemical
(Kühne et al., 2009).
Reasonable estimations of the DOA of a model can be performed
(Schroeter et al., 2007) by measuring distances or probability density
distributions of training data vectors (training chemicals) to new data
vectors (validation chemicals or new chemicals of concern). Since the
SOM algorithm is based on the distances between data vectors in a
multivariate space (Kohonen et al., 1996), it has been used here
to define the DOA of the QSFR models developed. Three different
SOM-based approaches have been tested:
(i) DOA defined by the same SOM as in the selection of training
and test data sets. The training chemical with the highest qerror
in each non-empty SOM unit defines the DOA border.
Chemicals clustered in the original SOM were represented by
vectors of dimension 40 (MW, 38 constituent counts and a mass
ratio). When presenting new chemicals to the SOM to evaluate
whether or not they belong to the DOA, their mass ratios are
unknown and only 39 out of 40 elements of the vector are used
for classification purposes. Nevertheless, the error in the
determination of the distance used to identify the BMUs with
only 39 elements out of 40 is negligible.
(ii) DOA defined by the principal component analysis (Pearson,
1901) of the 39 input variables. It was found that the first five
principal components (eigenvectors) accounted for about 59%
of the cumulative variance. The SOM was trained with these
five principal components and, again, the DOA was defined by
the highest qerror of the training chemicals in each non-empty
SOM unit.
(iii) DOA defined by the intersection of the two approaches above.
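The three approaches can be illustrated with a short sketch. It assumes that each SOM is available as a codebook (one weight vector per unit) together with the DOA borders of its non-empty units, and that qerror is the Euclidean distance to the BMU weight vector. All names and data layouts are hypothetical. For approach (i) the unknown mass ratio is simply left out of the distance.

```python
import math


def bmu_and_qerror(x, codebook, skip_index=None):
    """Find the best-matching unit of x and the corresponding qerror.

    codebook: mapping from unit label to weight vector.
    skip_index: position of a component to ignore (the unknown mass ratio
    in approach (i)); None means all components are used.
    """
    def distance(weights):
        return math.sqrt(sum((xi - wi) ** 2
                             for i, (xi, wi) in enumerate(zip(x, weights))
                             if i != skip_index))

    unit = min(codebook, key=lambda u: distance(codebook[u]))
    return unit, distance(codebook[unit])


def inside(unit, qerror, borders):
    """True if the qerror does not exceed the border of a non-empty unit."""
    return unit in borders and qerror <= borders[unit]


def in_doa_intersection(x_counts, x_pca, som_counts, som_pca, ratio_index):
    """Approach (iii): a chemical must be inside both SOM-based DOAs.

    som_counts, som_pca: dicts with the keys 'codebook' and 'borders'.
    """
    unit_i, q_i = bmu_and_qerror(x_counts, som_counts["codebook"], ratio_index)
    unit_ii, q_ii = bmu_and_qerror(x_pca, som_pca["codebook"])
    return (inside(unit_i, q_i, som_counts["borders"])
            and inside(unit_ii, q_ii, som_pca["borders"]))
```

Requiring both conditions can only shrink the set of accepted chemicals, which is why the intersection is the most restrictive of the three definitions.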
Fig. 5. Comparison between the logarithmic mass ratios determined by SB3 and those
estimated by (a) QSFRair,X/Y for the air compartment (q2all = 0.88 and MAE = 0.68) and
(b) QSFRwater,X/Y for the water compartment (q2all = 0.87 and MAE = 0.17). Emissions
are in the water compartment.
validation) are considered. These results of QSFRair,X/Y and QSFRwater,X/Y
are on average very close to those from MC-MEM when considering all
data sets simultaneously, as indicated also in Table 3. The same
improvement pattern is observed in Table 3 for the test sets. QSFRair,X/Y
and QSFRwater,X/Y respectively yield q2te = 0.91 and q2te = 0.78 compared
to the lower q2te = 0.86 and q2te = 0.60 obtained with QSFRair and
QSFRwater, respectively. Thus, the discrimination of chemicals with
respect to their structure (i.e., by using classes X and Y) improves the
generalization capability of the SVR-based QSFRs by establishing better
relationships between chemical distribution and molecular structure.
On the other hand, the performance of any QSFR model with the
validation set is significantly poorer both in terms of q2 and MAE
values for the air and water compartments, as shown in Figs. 4 and 5,
and Table 3. Also, no improvement is observed when two chemical
classes are considered. It should be noted that the number of training
chemicals per SVR-based QSFR is reduced to about half when
implementing chemical classes. This in turn increases the chances of
having some validation chemicals outside the DOA of the class-
Table 4 summarizes the performances of models QSFRair,X/Y and
QSFRwater,X/Y in terms of q2 and MAE for the test and validation
chemicals emitted in water, both for the complete data sets or only
for those chemicals belonging to the DOAs defined previously. In
the first two approaches (i) and (ii), test or validation chemicals
with quantization errors higher than those of the upper bounding
training chemicals in each BMU are considered to be outside the DOA
of the models. Since the number of chemicals within the (i) and (ii)
DOA definitions differs because of the different variables considered
and the errors in each SOM, their intersection (iii) is preferred
because more restrictive conditions are imposed. Table 4 shows
that by using this third approach, the mass ratios of 48 and 50
“new” chemicals (test plus validation) in the air and water compartments, respectively, when emitted in water and belonging to the DOA
can be optimally predicted by QSFRair,X/Y (q2te+val = 0.92 and
MAE = 0.54) and QSFRwater,X/Y (q2te+val = 0.94 and MAE = 0.15).
Table 4 also shows that model performances remain as high for
emissions in air when using the same training, test and validation data
sets already used in the water-emission models and for test and
validation chemicals belonging to the DOA defined according to approach
(iii). The lowest performances (q2val ≈ 0.75) are attained for the 12
validation chemicals in the air compartment when emitted in either
water or air.
Table 4
Effect of belonging or not to the domain of applicability (DOA) for various DOA definitions when using specialized QSFRs for the air and water compartments. The DOAs are
established by SOM classes generated with (i) 39 descriptors, (ii) the first five principal components of the PCA of these 39 descriptors and (iii) the intersection of the two
previous SOM-based DOAs.

Model           DOA     Parameter   Emissions in water                                      Emissions in air
                                    Chemicals within DOA      Chemicals out of DOA          Chemicals within DOA
                                    te      val     te, val   te      val     te, val       te      val     te, val
QSFRair,X/Y     (i)     # Chem      62      29      91        13      51      64            62      29      91
                        q2          0.93    0.69    0.89      0.70    0.67    0.68          0.95    0.54    0.89
                        MAE         0.55    0.99    0.69      0.79    1.47    1.33          0.21    0.46    0.29
                (ii)    # Chem      36      15      51        39      65      104           36      15      51
                        q2          0.95    0.64    0.89      0.86    0.68    0.75          0.97    0.65    0.92
                        MAE         0.45    1.03    0.62      0.73    1.36    1.12          0.16    0.41    0.23
                (iii)   # Chem      36      12      48        39      68      107           36      12      48
                        q2          0.95    0.78    0.92      0.86    0.66    0.73          0.97    0.76    0.94
                        MAE         0.45    0.79    0.54      0.73    1.39    1.15          0.16    0.34    0.20
QSFRwater,X/Y   (i)     # Chem      56      21      77        20      59      79            56      21      77
                        q2          0.84    0.88    0.87      0.57    0.28    0.38          0.90    0.88    0.90
                        MAE         0.15    0.30    0.19      0.29    0.39    0.36          0.29    0.30    0.29
                (ii)    # Chem      44      16      60        32      64      96            44      16      60
                        q2          0.86    0.81    0.84      0.69    0.21    0.43          0.92    0.73    0.84
                        MAE         0.13    0.38    0.20      0.26    0.36    0.33          0.27    0.58    0.35
                (iii)   # Chem      40      10      50        36      70      106           40      10      50
                        q2          0.91    0.95    0.94      0.66    0.25    0.40          0.93    0.93    0.93
                        MAE         0.12    0.25    0.15      0.26    0.38    0.34          0.25    0.32    0.27
5. Conclusions
Environmental concentrations of chemical pollutants have been
assessed by developing quantitative structure–fate relationship
(QSFR) models for 455 chemicals, 375 for training and testing and
80 for validation purposes, in a model scenario representing the
Netherlands. Multimedia chemical partitioning for chemical emissions
in air and water has been predicted by the Level III Fugacity model
SimpleBox 3 by using mostly experimental information of partition
coefficients and degradation rates as inputs to the models to minimize
the influence of predicted information in QSFR model development.
Best results have been obtained when dividing the chemical space into
two classes, one with chemicals containing only carbon, hydrogen and
oxygen, and the other with chemicals containing at least one
heteroatom other than oxygen in their structure. Logarithmic
mass ratios in the air and water compartments have been predicted
within the uncertainty envelope of SimpleBox 3 (i.e., within deviations
associated with the uncertainty of physicochemical and transport
properties, and degradation rates) for the training, test and validation
sets. The two-class QSFR models have shown good performance with
predictive squared coefficients q2 ≥ 0.87 for all data sets of 455
chemicals combined. This predictive performance increases to
q2 ≈ 0.93 for the combination of test and validation data sets when
chemicals outside the DOA of the models have been excluded. These
results demonstrate the feasibility of predicting multimedia
chemical partitioning directly from molecular information.
Supplementary data associated with this article can be found, in
the online version, at doi: 10.1016/j.scitotenv.2010.10.016.
Acknowledgements
The authors are grateful to Professor Y. Cohen of UCLA for many
fruitful discussions held during the course of this investigation.
The research was financially supported by the European Union
(NOMIRACLE, European Commission, FP6 contract No. 003956) and
the Generalitat de Catalunya (2009SGR-01529). Francesc Giralt
acknowledges the Distinguished Researcher Award, Generalitat de
Catalunya.
References
Aronson D, Boethling R, Howard P, Stiteler W. Estimating biodegradation half-lives for
use in chemical screening. Chemosphere 2006;63:1953–60.
Basheer IA, Hajmeer M. Artificial neural networks: fundamentals, computing, design,
and application. J Microbiol Meth 2000;43:3-31.
Boethling RS, Howard PH, Beauman AJ, Lanosche ME. Factors in intermedia
extrapolation in biodegradability assessment. Chemosphere 1995;30:741–52.
Boethling RS, Howard PH, Meylan WM. Finding and estimating chemical property data
for environmental assessment. Environ Toxicol Chem 2004;23:2290–308.
Brandes LJ, den Hollander H, van de Meent D. SimpleBox 2.0: a nested multimedia fate
model for evaluating the environmental fate of chemicals. Bilthoven, The
Netherlands: RIVM; 1996. p. 156.
Bredow T, Jug K. Theory and range of modern semiempirical molecular orbital methods.
Theor Chim Acta 2005;113:1-14.
Breivik K, Wania F. Expanding the applicability of multimedia fate models to polar
organic chemicals. Environ Sci Technol 2003;37:4934–43.
Breivik K, Alcock R, Li Y-F, Bailey RE, Fiedler H, Pacyna JM. Primary sources of selected
POPs: regional and global scale emission inventories. Environ Pollut 2004;128:
3-16.
Breivik K, Vestreng V, Rozovskaya O, Pacyna JM. Atmospheric emissions of some POPs in
Europe: a discussion of existing inventories and data needs. Environ Sci Policy
2006;9:663–74.
Brüggemann R, Restrepo G, Voigt K. Structure–fate relationships of organic chemicals
derived from the software packages E4CHEM and WHASSE. J Chem Inf Model
2006;46:894–902.
Burden FR, Polley MJ, Winkler DA. Toward novel universal descriptors: charge
fingerprints. J Chem Inf Model 2009;49:710–5.
Byvatov E, Fechner U, Sadowski J, Schneider G. Comparison of support vector machine
and artificial neural network systems for drug/nondrug classification. J Chem Inf
Comput Sci 2003;43:1882–9.
Citra MJ. Incorporating Monte Carlo analysis into multimedia environmental fate
models. Environ Toxicol Chem 2004;23:1629–33.
Cohen Y. Pollutants in a multimedia environment. In: Cohen Y, editor. Workshop on
pollutant transport and accumulation in a multimedia environment. New York:
Plenum Press; 1986.
Cohen Y, Cooter EJ. Multimedia environmental distribution of toxics (Mend-Tox). I:
hybrid compartmental–spatial modeling framework. Pract Period Hazard Toxic
Radioact Waste Manage 2002a;6:70–86.
Cohen Y, Cooter EJ. Multimedia environmental distribution of toxics (Mend-Tox). II:
software implementation and case studies. Pract Period Hazard Toxic Radioact
Waste Manage 2002b;6:87-101.
European Commission. Technical guidance document on risk assessment, part III.
Institute for Health and Consumer Protection, European Chemicals Bureau, 2003.
Cronin MTD, Schultz TW. Pitfalls in QSAR. Theochem J Mol Struct 2003;622:39–51.
Cronin MT, Walker JD, Jaworska JS, Comber MH, Watts CD, Worth AP. Use of QSARs in
international decision-making frameworks to predict health effects of chemical
substances. Environ Health Perspect 2003;111:1376–90.
den Hollander HA, van de Meent D. Appendix to SimpleBox 3.0: A multimedia mass
balance model for evaluating the environmental fate of Chemicals. RIVM; 2004.
den Hollander HA, van Eijkeren JCH, van de Meent D. SimpleBox 3.0. Bilthoven, The
Netherlands: RIVM; 2004.
Devillers J. A decade of research in environmental QSAR. SAR QSAR Environ Res
2003;14:1–6.
Diudea MV. QSPR/QSAR studies by molecular descriptors. New York: Nova; 2001.
Drucker H, Burges CJC, Kaufman L, Smola A, Vapnik V. Support vector regression
machines. Advances in neural information processing systems 9. Proceedings of the
1996 conference. Denver, CO, USA: MIT Press; 1997. p. 155–61. Dec 2–5.
Duca JS, Hopfinger AJ. Estimation of molecular similarity based on 4D-QSAR analysis:
formalism and validation. J Chem Inf Comput Sci 2001;41:1367–87.
Efroymson RA, Murphy DL. Ecological risk assessment of multimedia hazardous air
pollutants: estimating exposure and effects. Sci Total Environ 2001;274(1–3):
219–30.
Eisenberg JNS, Bennett DH, McKone TE. Chemical dynamics of persistent organic
pollutants: a sensitivity analysis relating soil concentration levels to atmospheric
emissions. Environ Sci Technol 1998;32:115–23.
Fenner K, Scheringer M, Macleod M, Matthies M, McKone T, Stroebe M, et al. Comparing
estimates of persistence and long-range transport potential among multimedia
models. Environ Sci Technol 2005;39:1932–42.
Fjodorova N, Novich M, Vrachko M, Smirnov V, Kharchevnikova N, Zholdakova Z, et al.
Directions in QSAR modeling for regulatory uses in OECD member countries, EU
Russia. J Environ Sci Health Pt C Environ Carcinog Ecotoxicol Rev 2008;26:201–36.
Fujitsu BGo. CAChe software. Beaverton: BioSciences Group, Fujitsu Computer Systems;
2004.
Furusjö E, Svenson A, Rahmberg M, Andersson M. The importance of outlier detection
and training set selection for reliable environmental QSAR predictions. Chemosphere 2006;63:99-108.
Godavarthy SS, Robinson JRL, Gasem KAM. SVRC-QSPR model for predicting saturated
vapor pressures of pure fluids. Fluid Phase Equilib 2006;246:39–51.
Golbraikh A, Tropsha A. Beware of q2! J Mol Graph Model 2002;20:269–76.
Hall MA. Correlation-based Feature Selection for Machine Learning. Department of
Computer Science. Ph.D. thesis. The University of Waikato, Hamilton, New Zealand,
1999.
Howard PH, Boethling RS, Jarvis WF, Meylan WM, Michalenko EM. Handbook of
environmental degradation rates. Boca Ratón: Lewis Publishers; 1991.
Hugo K. From narcosis to hyperspace: the history of QSAR. Quant Struct Act Relat
2002;21:348–56.
Johnson SR. The trouble with QSAR (or how I learned to stop worrying and embrace
fallacy). J Chem Inf Model 2008;48:25–6.
Junge CE. Fate of pollutants in the air and water environment. Wiley-Interscience;
1977a.
Junge CE. Basic considerations about trace constituents in the atmosphere as related to
the fate of global pollutants. In: Suffet IH, editor. Fate of pollutants in the air and
water environment. Advances in Environmental Science and Technology. New York:
Wiley-Interscience; 1977b. Part I.
Kawamoto K, MacLeod M, Mackay D. Evaluation and comparison of multimedia mass
balance models of chemical fate: application of EUSES and ChemCAN to 68
chemicals in Japan. Chemosphere 2001;44:599–612.
Klöpffer W, Wagner BO. Persistence revisited. Environ Sci Pollut Res 2007;14:141–2.
Kohonen T. Self-organizing maps. 3rd ed. Berlin: Springer; 2001.
Kohonen T, Oja E, Simula O, Visa A, Kangas J. Engineering applications of the self-organizing map. Proc IEEE 1996;84(10):1358–84.
Kühne R, Breitkopf C, Schüürmann G. Error propagation in fugacity level-III models in
the case of uncertain physicochemical properties. Environ Toxicol Chem 1997;16:
2067–9.
Kühne R, Ebert R-U, Schüürmann G. Estimation of compartmental half-lives of organic
compounds—structural similarity versus EPI-suite. QSAR Comb Sci 2007;26:542–9.
Kühne R, Ebert R-U, Schüürmann G. Chemical domain of QSAR models from atom-centered fragments. J Chem Inf Model 2009;49:2660–9.
Lohmann R, Breivik K, Dachs J, Muir D. Global fate of POPs: current and future research
directions. Environ Pollut 2007;150:150–65.
Mackay D. Multimedia environmental models—the fugacity approach. Boca Ratón:
Lewis Publishers; 2001.
Mackay D, Webster E. A perspective on environmental models and QSARs. SAR QSAR
Environ Res 2003;14:7–16.
Mackay D, Webster E. Environmental persistence of chemicals. Environ Sci Pollut Res
2006;13:43–9.
Mackay D, Shiu W-Y, Ma KC. Illustrated handbook of physical–chemical properties and
environmental fate for organic chemicals. Lewis Publishers Inc.; 1992.
Mackay D, Hubbarde J, Webster E. The role of QSARs and fate models in chemical hazard
and risk assessment. QSAR Comb Sci 2003;22:106–12.
Martínez I. Quantitative structure fate relationships for multimedia environmental
analysis. Ph.D. thesis. Universitat Rovira i Virgili, Tarragona, Spain, 2010.
Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T. YALE: rapid prototyping for
complex data mining tasks. In: Ungar L, Craven M, Gunopulos D, Eliassi-Rad T,
editors. Proceedings of the 12th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-06). Philadelphia, PA, USA: ACM;
2006. p. 935–40.
Modarresi H, Modarress H, Dearden JC. QSPR model of Henry's law constant for a
diverse set of organic chemicals based on genetic algorithm-radial basis function
network approach. Chemosphere 2007;66:2067–76.
Nikolova N, Jaworska J. Approaches to measure chemical similarity—a review. QSAR
Comb Sci 2003;22:1006–26.
NITE. Chemical risk information platform (CHRIP). National Institute of Technology and
Evaluation; 2006.
OECD. Guidance document on the validation of (quantitative) structure–activity
relationship [(Q)SAR] models. OECD Series on Testing and Assessment, 69; 2007.
Pearson K. On lines and planes of closest fit to systems of points in space. Phil Mag
1901;2:559–72.
Prinn RG, Huang J, Weiss RF, Cunnold DM, Fraser PJ, Simmonds PG, et al. Evidence for
substantial variations of atmospheric hydroxyl radicals in the past two decades.
Science 2001;292:1882–8.
Raymond JW, Rogers TN, Shonnard DR, Kline AA. A review of structure-based
biodegradation estimation methods. J Hazard Mater 2001;84:189–215.
Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics.
Bioinformatics 2007;23:2507–17.
Schroeter T, Schwaighofer A, Mika S, Ter Laak A, Suelzle D, Ganzer U, et al. Estimating
the domain of applicability for machine learning QSAR models: a study on aqueous
solubility of drug discovery molecules. J Comput Aided Mol Des 2007;21:651–64.
Schüürmann G, Ebert R-U, Chen J, Wang B, Kühne R. External validation and prediction
employing the predictive squared correlation coefficient—test set activity mean vs
training set activity mean. J Chem Inf Model 2008;48:2140–5.
Schwarzenbach RP, Gschwend PM, Imboden DM. Environmental organic chemistry.
2nd ed. New Jersey: Wiley; 2003.
Senese CL, Duca J, Pan D, Hopfinger AJ, Tseng YJ. 4D-Fingerprints, Universal QSAR and
QSPR Descriptors. J Chem Inf Comput Sci 2004;44:1526–39.
SRC. Interactive PhysProp Database Demo. Syracuse Research Corporation.
SRC. EPI Suite v4.00. SRC; 2008.
Stouch TR, Kenyon JR, Johnson SR, Chen X-Q, Doweyko A, Li Y. In silico ADME/Tox: why
models fail. J Comput Aided Mol Des 2003;17:83–92.
Struijs J, Peijnenburg WJGM. Predictions by the multimedia environmental fate model
SimpleBox compared to field data: Intermedia concentration ratios of two
phthalate esters. Bilthoven: RIVM; 2002. p. 62.
Taskinen J, Yliruusi J. Prediction of physicochemical properties based on neural network
modelling. Adv Drug Deliv Rev 2003;55:1163–83.
Tickner J, Geiser K, Coffin M. The U.S. experience in promoting sustainable chemistry.
Environ Sci Pollut Res 2005;12:115–23.
Todeschini R, Consonni V. Handbook of molecular descriptors. Weinheim: Wiley-VCH;
2000.
Toose L, Woodfine DG, MacLeod M, Mackay D, Gouin J. BETR-World: a geographically
explicit model of chemical fate: application to transport of α-HCH to the
Arctic. Environ Pollut 2004;128:223–40.
Toussant M. A scientific milestone. Chem Eng News 2009;87:3.
U.S. EPA. Persistence, bioaccumulative and toxic (PBT) profiler. Washington DC: U.S. EPA
Office of Pollution Prevention and Toxics; 2001. http://www.epa.gov/oppt/sf/tools/pbtprofiler.htm.
US-EPA. Inventory Update Rule. Office of Pollution Prevention and Toxics. Washington:
Environmental Protection Agency; 2006. http://www.epa.gov/oppt/iur.
van de Meent D. SIMPLEBOX: a generic multimedia fate evaluation model. Bilthoven,
The Netherlands: RIVM; 1993.
Vesanto J, Himberg J, Alhoniemi E, Parhankangas J. SOM Toolbox for Matlab 5; 2000.
Walker JD, Carlsen L, Hulzebos E, Simon-Hettich B. Global Government applications of
analogues, SARs and QSARs to predict aquatic toxicity, chemical or physical
properties, environmental fate parameters and health effects of organic chemicals.
SAR QSAR Environ Res 2002;13:607–16.
Weaver S, Gleeson MP. The importance of the domain of applicability in QSAR
modeling. J Mol Graph Model 2008;26:1315–26.
Webster E, Mackay D, Di Guardo A, Kane D, Woodfine D. Regional differences in
chemical fate model outcome. Chemosphere 2004;55:1361–76.
Weininger D. A chemical language and information system. 1. Introduction to
methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–6.
Witten IH, Frank E. Data mining: practical machine learning tools and techniques. San
Francisco, U.S.: Morgan Kaufmann; 2005.
Worth AP, Bassan A, De Bruijn J, Saliner AG, Netzeva T, Patlewicz G, et al. The role of the
European Chemicals Bureau in promoting the regulatory use of (Q)SAR methods.
SAR QSAR Environ Res 2007;18:111–25.
Xu Y, Zomer S, Brereton RG. Support vector machines: a recent method for classification
in chemometrics. Crit Rev Anal Chem 2006;36:177–88.
Yaffe D, Cohen Y. Neural network based temperature-dependent quantitative structure
property relations (QSPRs) for predicting vapor pressure of hydrocarbons. J Chem
Inf Comput Sci 2001;41:463–77.
Yaffe D, Cohen Y, Espinosa G, Arenas A, Giralt F. A fuzzy ARTMAP based on quantitative
structure–property relationships (QSPRs) for predicting aqueous solubility of
organic compounds. J Chem Inf Comput Sci 2001;41:1177–207.
Yaffe D, Cohen Y, Espinosa G, Arenas A, Giralt F. Fuzzy ARTMAP and back-propagation
neural networks based quantitative structure–property relationships (QSPRs) for
octanol–water partition coefficient of organic compounds. J Chem Inf Comput Sci
2002;42:162–83.
Yaffe D, Cohen Y, Espinosa G, Giralt F, Arenas A. A fuzzy ARTMAP-based quantitative
structure–property relationship (QSPR) for the Henry's Law constant of organic
compounds. J Chem Inf Comput Sci 2003;43:85–112.
Zukowska B, Breivik K, Wania F. Evaluating the environmental fate of pharmaceuticals
using a level III model based on poly-parameter linear free energy relationships. Sci
Total Environ 2006;359:177–87.