Data Quality Issues in Toxicological

advertisement
From: AAAI Technical Report SS-99-01. Compilation copyright © 1999, AAAI (www.aaai.org). All rights reserved.
Data Quality Issues
in Toxicological
Christoph
Helma
Institute for TumorBiology-Cancer Research
Borschkegasse 8a, A-1090 Vienna
Christoph.Helma@univie.ac.at
Knowledge Discovery
Eva Gottmann
Institute for Environmental Hygiene
Kinderspitalgasse 15, A-1095 Vienna
Stefan
Kramer
and Bernhard
Pfahringer
Austrian Research Institute
for Artificial
Intelligence
Schottengasse
3, A-1010 Vienna
Abstract
Every SARtechnique for toxicity prediction relies on
the exact estimation and representation of chemical
and toxicological properties. Wewill present potential sources of errors associated with the utilization of
l~trge, noncongenericdatasets and complextoxicologi(:al endpoints (e.g. carcinogenicity). Accordingto
(~xperience we have identified the major problems in
the areas of compound
identification, descriptor calculation and toxicity data. Generally, we consider the
chemical data as more reliable than the results from
toxicity experiments.As it is impossible to tackle the
data quality problemon a case by case basis for a large
mlmberof compounds,we will propose somepossibilitie.~ for routine quality control of large datasets.
Introduction
The (tevelopment of Structure Activity Relationships
(SARs) relies on the comparison of chemical structures
or their properties (descriptors) with their toxicological
effects. Although it is generally accepted that the exact estimation and representation of chemical and toxicological properties is a prerequisite for a good SAR
mo(lel, this topic has been rarely addressed in a systematic manner.
In this paper we will present our experience resulting
fi’om the application of Machine Learning techniques
to large, noncongeneric datasets and complex toxicological endpoints (e.g. carcinogenicity).
Our source
of toxicological information is the Carcinogenic Pot, ency Database (CPDB)(GoldZeiger 19 97). It con
tains very detailed information from long-term in vivo
car(:inogenicity experiments and consists of two major
parts. ()he dataset contains the results of carcinogenicit). eXl)eriments performed within the National Toxicolog:q Program (NTP). These studies were conducted in
c()mpliance with FDAGood Laboratory Practice Regulations. The second dataset contains data from the
general literature which meet a set of standard criteria
(details in (Gold & Zeiger 1997)).
Du(~to 1)u(lget restrictions, we obtained structural
(tarsi almost exclusively from free sources: the NCI
Datal)ase t, Chemical Structures from NTPTechnical
t NCI Database: http://epnusl,
ncifcrf,
gov: 2345/
Reports 2 and ChemFinder from CambridgeSoft 3 (Table 3).
NTP
Compounds
393
Literature
whole CPDB
1028
1299
Table 1: Summary of the Carcinogenic
Database (CPDB) (Gold & Zeiger 1997).
Identification
of
Potency
Compounds
The correct assignment of chemical structures and their
toxicity data is crucial for the development of a valid
SARmodel, because faulty structures in the training
set will prohibit the correct detection of features responsible for toxic action.
As we use data from different sources, we had to rely
on Chemical Abstracts Registry (CAS) Numbers for
their identification. Wewere able to find CASRegistry
numbers for 1257 of 1299 CPDBcompounds, 11 CAS
numbers were associated with more than one compound
and 14 compounds were marked as mixtures in the
CPDB. Although the CAS was designed as an unique
identification of chemicals, it is sometimes not a good
identifier for toxicological purposes. Toxicologically irrelevant differences (e.g. crystal water) lead for example
to different CASnumbers for similar structures. Typos
may cause a wrong assignment of structures.
On the
other hand it is sometimes difficult to assign a CASif
only nonstandard nomenclatures are available: For the
CPDBwe were unable to find a CASfor 27 compounds.
Table 2 lists some possible errors associated with CAS
numbers and possibilities to check them.
Another source of variability is associated with finding the "correct" structures for a given CASNumber.
dis3d/Sddatabase/NCl_SMIL.HTML,237,771
structures in
SMILESformat
2NTPSpecial Reports: http://n%p-db.niehs.nih.
gov/Main_Pages/pub-Structures.html, Structural data for
489 NTPTechnical Reports
3ChemFinder:
http://chemf inder, camsoft, corn/,
WWW-based
chemical search engine
CAS Error
typos
I’
mixtures
missing
equivocal
Solution
Nr. Compounds
Check Digit
aVerification
onfit
manual search
0
14
62
24
" seehttp://info.
cas.org/EO/checkdig,
htmlfor
the algorithm
r, including mixtures with CASRegistry numbers
Tal)le 2: Possible errors associated with the identification of CPDBcompounds by their CAS Registry Numl)er
Tal)le 3 lists the databases we used for our work. Structures wet(; obtained in different formats (NCI: SMILES,
NTP: MDLMolfiles,
ChemFinder: ChemDraw Binaries). Weconverted them with Babel to SMILESnotati()n, a compact, linear representation of 2-dimensional
structures.
SMILESstrings are a commonlyaccepted
inlmt tbrmat for computational chemistry programs.
Som’ce"
Nr. Structures
NCI Database
NTP Special Reports
ChelnFinder
678
109
360
1147
b bNr. CAS
693
110
366
1169
" Structures were retrieved sequentially in the
order of this table
i, There is no 1:1 matching between structures and
CAS,see preceding table
Tal)lc 3: Sources of structural
data for the CPDB
SMILESstrings may be checked for correct syntax,
1)y using them as an input for various programs. Although in manycases errors are caused by restrictions
and bugs of computational chemistry programs rather
than real SMILESsyntax errors, this is a good procedm’e tbr checking their suitability for real world applications. Our experiments with five different computational (:hemistry programs (Helmaet al. 1999) indicate
that the SMILESstrings for the CPDBcompounds are
synta(:tically correct, although wrongvalences were dete(:ted in some compounds. There were several probl(~ms associated with the inability to read large structures, to interpret structures with heavy metals (As,
Ti, Cd, Bi, Hg) and most importantly, to deal with
dis(:omm(:tcd structures. It was therefore necessary
edit the SMILESstrings manually, to connect (e.g. covalent bonds instead of ions) or remove (e.g. crystal
water) s(~.parate(l structures. This procedure was very
laborious, especially for the NCI Database, where alkali
metals were stripped from organic structures.
A correct SMILESsyntax is still not a guarantee
for a "correct" structure. Especially in our situation
where we are using large, noncongeneric datasets and
do not have any a priori knowledge of molecular mechanisms it is impossible to estimate the structure under
physiological conditions (apart from alterations due to
metabolism, etc). It is therefore more important to find
a consistent representation instead of a "correct" structure for each compound.
Another source of variability is arising from the possible presence of impurities or the incomplete knowledge about the structures (e.g. isomers) in the chemical
tested. Wetried to eliminate all compoundscontaining
more than one structure from the CPDB(54 mixtures
in 1169 identified compounds).This procedure relies to
a large extent on correct information in the underlying
database and the correct assignment of CASRegistry
Numbers. In many cases it is virtually impossible to
check the correctness of this data.
Calculation
of Descriptors
In the development and particularly in the application
of SARsit is essential to identify the structural or chemical properties that are predictive to the endpoint of interest. Presently the choice of structural and property
descriptors for complextoxicological effects strongly relies on the intuition of the individual researcher, especially if no detailed knowledgeof the underlying molecular mechanismsis available.
Measured properties
Descriptors
based on measured properties (e.g. lipophilicity, elctrophilicity)
have been historically
the most favored approach
when generating SARs. Their determination is expensive and time-consuming and they are therefore
not very suitable for large datasets.
Presence of substructures Every chemist is used to
think in terms of functional groups which compose
a molecule. SAR models based on substructures
are therefore very intuitive.
Apart from predicting untested compounds they can be used to gain
a deeper understanding of molecular mechanisms of
toxic effects and they can provide good guidelines
for the construction of new compounds. There are
basically two approaches in using substructures as
molecular descriptors.
The classical method checks whether predefined substructures are present in a molecule. Their presence
or absence may be represented in tabular form. The
disadvantage of this procedure is that substructures
have to be defined in advance, and fragments not defined are consequently not considered.
The other possibility is to generate structural fragments automatically,
and search for those which
occur frequently in toxic compounds (Klopman &:
Rosenkranz 1994). This procedure is computationally very expensive, but it has the advantage that
10
fragments (:all be detected in an unbiased way. As
this procedure results in different numbersof descript{}rs for every compound
it is advisable to use a relat.ional representation of the data.
Connectivity indices Molecular connectivity indices
are ~ compact representation of the topological information of a molecular graph. In our opinion their
application is becoming more and more obsolete, because more intuitive connectivity and topological intbrmation can be derived from higher level strucrural representations (Pfahringer et al. 1999) and
3-dimensional models (e.g. with NACCESS
or VolSurf ). The use of molecular connectivity indices was
{~xt{ulsiwfly discussed by Kier and Hall (Kier & Hall
t!)8(i).
Calculated structural
and electronic descriptors
Chemicals are toxic because they interact with biol{)f~ic:al ma{:romoleculesand reactivity is determined
by electronic and steric properties. With the increasing speed of computers it is possible to calculate the
"electronic nature" and three-dimensional structure
of chemicals within a reasonable amount of time.
As these properties can be calculated for almost
{,very molecule, it should, at least theoretically, be
possible to make predictions for compounds with
novel substructures.
Programs fbr the calculation of 3D-structures and
ele{:trolfiC descriptors were not designed for batch
processing. It is often necessary to try different settings, if a calculation does not converge. As they use
an iterative process to calculate the final structure,
the results may depend on the initial structure, end
in a local minimumor the solution mayoscillate between different states (Clark 1985). In light of these
results it maybe advantageous to use rule based systerns (e.g. CORINA,PETRA)for the calculation
3-dimensional structures and electronic properties of
large and diverse datasets. Once more it seems to be
mor(: important to obtain consistent instead of "corr{~c:t" {lescriptors.
Calculated chemical properties
Recent developmeats <}f Quantitative Structure-Property
Relationships (QSPRs) have enabled the calculation
()f physico-ehemical properties (e.g. logP (Meylan
,~: Howard 1995)) which can be utilized in SAR
models. They are usually very predictive and can
I}e readily interpreted in terms of chemical and
t(}xicological knowledge. Problems may arise from
{wr(}r t}r{}t}agation, if the results are not accurate
(ul(}ugh or calculations fail for certain types
{’oml)ouuds. We use in our work the logP as
lil)ol}tfilicity
indicator. A review of algorithms for
the cal(:ulation of lipophilicity parameters can be
f(}un{l in (Mannhold& Dross 1996).
[n l}ra{:tice, the choice of descriptors will vary
str{)ngly with the scope of the desired model and the
(’;q}al}ilities (}f the learning algorithm. Within a drug
design process it is sensible to work with substructures,
SARsfor the elucidation of molecular mechanismswill
contain structural, electronic or physico-chemical descriptors and high-predictivity models will use descriptors even if they are not instantly intuitive. Generally,
it is advantageous to use different types of descriptors
to enhance the predicitivity
of a SARmodel (Helma,
Kramer, & Pfahringer in press).
Despite the problems associated with descriptor selection and calculation, we consider chemical data as
generally more reliable than toxicological data.
Toxicological
Data
Most biological effects are a complex expression of several mechanisms happening in sequential and/or parallel order leading to highly variable endpoints.
Heterogeneous datasets are caused by the limited
availability of good, validated data. Large standardized
testing programs (e.g. NCI/NTP)were not designed for
the development of SARmodels, therefore they cover
only a limited set of possible structures.
For rodent carcinogenicity assays the reproducibility
was reported to be approximately 80% (Gold & Zeiger
1997) under standardized conditions. A comparison of
the compounds present in both (Literature
and NTP)
parts of the CPDBshows, that 71%are classified similarly (Table 4). Figure 1 depicts a comparison of the
tumorigenic doses TDso for 43 compoundsclassified as
carcinogens in both CPDBparts.
This comparison may underestimate
the reproducibility of carcinogenicity experiments, because only
NTPexperiments were conducted with a standardized
protocol. But it resembles closely the real world situation of many SARmodelers who have to aggregate data
from different sources to obtain enough data for their
investigations.
concordant discordant
classifications
73
30
Nr. compounds present
in both parts
103
Table4: A comparison
oftheclassifications
in theLiteratureand NTP partsof the CPDB
It is beyondthe scope of the present article to present
a detailed discussion about the sources of variability in
toxicity experiments, and in manycases it is impossible
for the SARmodeler to check the validity of the data.
Nevertheless, we want to stress the importance of using preferably results of standardized experiments. The
definition of classifications derived from original data
(e.g. rodent carcinogen/noncarcinogen from a variety of
sex, strain, organ specific effects) should be documented
to enable a comparison with other research groups.
Whenadding data from non-standardized sources, it
should be carefully considered, if the increased amount
of data outweights the additional variability. Including
11
O
o
¢0o o
°8 o~oo
~Oo
0
0
0
!
-15-10-5
0 5
logTD50-NTP
I
10
Figure 1: Correlation of carcinogenicity TDs0’S from
the NTPand the Literature Part of the CPDB
indicators ibr data quality and variability in SARmod(’,Is is certainly highly desirable and an important topic
for future research.
Acknowledgements
This research is part of the project "CarcinogenicityDetection by MachineLearning" supported by the Austrian Fed(~ral Ministry of Science and Transport. Partial support is
also provided by the "JubilEumsfondder Osterreichischen
Nationalbank" under grant number 6930. The Austrian
Frdcral Ministry of Science and Transport provides general
financial support ibr the Austrian ResearchInstitute for AI.
References
Clark, T. 1985. A Handbook of Computational Chemistry. Wiley-Interscience.
Gold, L. S., and Zeiger, E. 1997. Handbook of Carcinogenic Potency and Genotoxicity Databases. CRC
Press.
Hehna, C.; Gottmann, E.; Kramer, S.; and Pfahringer,
B.
1999.
Data quality issues in toxicologi(:al lCnowledge discovery. Technical report, Aust.rian II.(;se, ar(:h Institute for Artificial Intelligence,
htt.l): / /www.ai.univie.ac.at/cgi-bin/tr-online , Vienna.
Hehna, C.; Kramer, S.; and Pfahringer,
B. in
press. Car(:inogenicity prediction for noncongeneric
(:()ml)ounds: Experiments with the Machine Learning
Program SRTand various sets of chemical descriptors.
In Moh’.cular Modelling and Prediction of Bioactivity.
Kier, L. B., and Hall, L. H. 1986. Molecular Connectivity in Structure-Activity Analysis. Letchworth, UK:
Res. Studio Press Ltd.
Klopman, G., and Rosenkranz, H. 1994. Approaches
to SARin carcinogenesis and mutagenesis. Prediction
of carcinogenicity/mutagenicity
using MULTI-CASE.
Mutation Res. 305:33-46.
Mannhold,R., and Dross, K. 1996. Calculation procedures for molecularlipophilicity: A comparativestudy.
Quant. Struct.-Act. Relat. 15:403-409.
Meylan, W. M., and Howard, P. H.
1995.
Atom/Fragmentcontribution method for estimating
octanol-water partition coefficients. J. Pharm.Sci.
84:83-92.
Pfahringer, B.; Gottmann, E.; Helma, C.; and
Kramer,S. 1999. Efficiency/representational issues in
toxicological knowledgediscovery. In Predictive Toxicology of Chemicals: Experiences and Impact of AI
Tools.
Download