Quantitative Structure-Activity Relationships by GUHA Method

Premysl Zak (a), Jaroslava Halova (b)
Academy of Sciences of the Czech Republic
(a) Institute of Computer Science, Pod vodarenskou vezi 2, CZ-182 07 Prague 8, Czech Republic,
e-mail: zak@cs.cas.cz
(b) Institute of Inorganic Chemistry, CZ 250 68 Rez, Czech Republic,
e-mail: halova@iic.cas.cz
Abstract
We show that describing a compound structure by nominal variables can yield interesting results. We
have successfully processed a structure-activity data set of antimelanoma analogs and compared the results with
those of Catalyst RTM, the specialized program from Molecular Simulations. The GUHA method was also used to
determine relationships between structure and mutagenicity in the well-known mutagenicity data set of
nitroaromatic hydrocarbons. This is the first successful application of fingerprint descriptors in SAR.
Fingerprint Descriptors in Structure Coding
Fingerprint descriptors are used for coding simple data, or for coding data in a simple way. The name
"fingerprint" originates from the description of the shapes of papillary lines in dactyloscopy, where such
descriptors capture a shape as complex as a fingerprint. The main principle of description by fingerprint
descriptors is the choice of suitable variables (mostly nominal). For example, to describe trees, one has to
choose features that differentiate the particular species.
Our task is to describe those features of structural formulas that could be connected with the investigated
properties. The better our choice of descriptors, the higher the chance that our effort will be successful. If the
data were simple, no selection would be necessary and we could use all possible descriptors. But our data are
complex, so our approach must include added knowledge.
In the case of mutagenicity, knowledge of the mechanism of information transfer among cells is useful. To
affect a cell, a molecule of the compound must get into the vicinity of the cell and must successfully get
through its surface. The polarity of individual parts of the molecule and the shape of its surface can
play an important role in the attack of the molecule. But the shape and the physical or chemical properties are
connected to the presence and mutual position of the atoms and groups of atoms in a given molecule.
This is the application field for fingerprint description.
Aromatic rings are the main building blocks in our data set. Their arrangement produces different shapes, and
the shapes of structural formulas reflect some of the structural patterns of the real molecule [Fig. 1]. Examples
of such shape patterns are in Fig. 2. The presence of a particular pattern, or the number of its occurrences, can
be used as a fingerprint descriptor of the shape of the structural formula.
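As a minimal sketch of this coding step, assume that the shape patterns of one structural formula have already been identified by hand and are given as a list of names. The pattern names are those of Fig. 2; the function name `fingerprint` and the descriptor names `has_...` and `n_...` are hypothetical and serve only as illustration.

```python
from collections import Counter

# Pattern names taken from Fig. 2; the annotation itself is assumed to be manual.
PATTERNS = ("groove", "bulge", "channel", "hill")

def fingerprint(observed_patterns):
    """Code the patterns seen in one structural formula as simple descriptors.

    For every known pattern we record its presence (nominal yes/no)
    and the number of its occurrences.
    """
    counts = Counter(observed_patterns)
    descriptors = {}
    for pattern in PATTERNS:
        descriptors[f"has_{pattern}"] = "yes" if counts[pattern] > 0 else "no"
        descriptors[f"n_{pattern}"] = counts[pattern]
    return descriptors

# Hypothetical compound with two grooves and one hill.
example = fingerprint(["groove", "hill", "groove"])
```

Each compound then becomes one object described by the same set of variables, which is the form of input the GUHA method expects.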
Figure 1. Relation between the coding data and the reality. (Diagram labels: shapes of compounds; coding data;
the part contributing to the mutagenicity; total yield; the part reflecting the shape.)
Processing data made of such simple variables (descriptors) is very simple. But the number of variables needed
to describe the shape must be high. Another problem is that many of the patterns cannot affect mutagenicity, and
among them there are phantoms that only seem to be of some importance. A large number of variables and a large
number of results require a fast and effective implementation to select all interesting patterns.
Figure 2. Example of shape patterns: groove, bulge, channel, hill.
GUHA method – a tool for nominal data processing
The basic ideas of the GUHA (General Unary Hypotheses Automaton) method were presented in [1] as early as 1966.
The starting notion of the method is an object. The object has properties expressed by variables assigned to
it. For example, the object can be a person with properties given by variables such as sex, age, eye color,
etc. For reasonable knowledge discovery we need a set of objects of the same kind which
differ in the values of the variables defined on them.
The aim of the GUHA method is to generate hypotheses on relations among properties of the objects which are
in some sense interesting. The generation is systematic: the machine generates, in a sense, all
possible hypotheses and collects the interesting ones. A hypothesis is generally composed of two parts, an
antecedent and a succedent, tied together by a so-called generalized
quantifier describing the relation between them. The antecedents and succedents are propositions on the object in
the sense of classical propositional logic, so they are true or false on a particular object. As in
propositional logic, these propositions can be simple or compound. Compound propositions are usually
composed of simple ones (literals) by the conjunction connective. The formulation of these propositions is
enabled by the categorization of the original variables. Given an antecedent and a succedent, the frequencies of
the four possible combinations can be computed and expressed in compressed form as a so-called four-fold table
(ff-table). A general ff-table looks like this:
ff-table            Succedent    Non(succedent)
Antecedent          a            b
Non(antecedent)     c            d
Here a is the number of the objects satisfying the antecedent and the succedent, b is the number of the objects
satisfying the antecedent but not the succedent (i.e. countercases of the hypotheses examined), etc.
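The counting behind an ff-table can be sketched as follows. This is an illustration only: the function name `ff_table`, the variable names, and the six toy objects are assumptions made for the example and are not data from the studies discussed here.

```python
def ff_table(objects, antecedent, succedent):
    """Return the frequencies (a, b, c, d) of the four-fold table.

    antecedent and succedent are functions that return True or False
    for a single object.
    """
    a = b = c = d = 0
    for obj in objects:
        ant, suc = antecedent(obj), succedent(obj)
        if ant and suc:
            a += 1          # antecedent and succedent
        elif ant:
            b += 1          # antecedent, not succedent (countercase)
        elif suc:
            c += 1          # succedent without antecedent
        else:
            d += 1          # neither
    return a, b, c, d

# Toy data, invented for illustration.
compounds = [
    {"has_groove": "yes", "mutagenic": "yes"},
    {"has_groove": "yes", "mutagenic": "yes"},
    {"has_groove": "yes", "mutagenic": "no"},
    {"has_groove": "no", "mutagenic": "yes"},
    {"has_groove": "no", "mutagenic": "no"},
    {"has_groove": "no", "mutagenic": "no"},
]

table = ff_table(
    compounds,
    antecedent=lambda o: o["has_groove"] == "yes",
    succedent=lambda o: o["mutagenic"] == "yes",
)
```

For the toy data the result is a = 2, b = 1, c = 1, d = 2.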
A generalized quantifier is a decision procedure assigning 1 or 0 to each ff-table. If the value is 1, we accept
the hypothesis with the corresponding ff-table; if it is 0, we reject it. The basic generalized
quantifier defined and used in GUHA is based on Fisher's exact test, known from mathematical statistics. In
this case the decision to accept is made on the basis of a statistical test at a particular level alpha. For
each hypothesis, the value of the Fisher statistic is computed from the values a, b, c, and d of the ff-table.
Simply said, its value describes the measure of association between the antecedent and the succedent. More
precisely, the value of the Fisher statistic is the level at which we can accept (in a statistical sense) the
hypothesis that the conditional probability of the succedent given the antecedent, P(S/A), is greater than the
unconditional probability of the succedent, P(S). The hypothesis tested is accepted or rejected by comparing the
value of the Fisher statistic with the alpha value. Note that the Fisher statistic has the following
symmetry property: the level for accepting P(S/A) > P(S) is the same as the level for accepting P(A/S) > P(A). A
further property of the Fisher statistic is negation symmetry: Fisher(A, S) = Fisher(non(A), non(S)).
Several other quantifiers are used in GUHA.[2,3]
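A minimal sketch of such a quantifier is given below. It assumes the usual one-sided form of Fisher's exact test, that is, the probability under fixed marginal sums of a table with at least a objects in the upper-left cell. The function names, the default alpha = 0.05, and the example tables are assumptions made for illustration; this is not the GUHA +/- implementation.

```python
from math import comb

def fisher_level(a, b, c, d):
    """One-sided Fisher exact test for an ff-table (a, b, c, d).

    Returns the hypergeometric probability of observing a or more
    objects satisfying both antecedent and succedent, with the
    marginal sums of the table held fixed.
    """
    r = a + b              # objects satisfying the antecedent
    k = a + c              # objects satisfying the succedent
    n = a + b + c + d      # all objects
    upper = min(r, k)
    favourable = sum(comb(k, x) * comb(n - k, r - x) for x in range(a, upper + 1))
    return favourable / comb(n, r)

def fisher_quantifier(a, b, c, d, alpha=0.05):
    """Return 1 (accept) or 0 (reject) for the given ff-table.

    The condition a*d > b*c is equivalent to the observed
    a/(a+b) being greater than (a+c)/n.
    """
    return 1 if a * d > b * c and fisher_level(a, b, c, d) <= alpha else 0

# Invented example tables.
strong = fisher_level(8, 2, 1, 9)
swapped = fisher_level(8, 1, 2, 9)   # antecedent and succedent exchanged
negated = fisher_level(9, 1, 2, 8)   # non(A), non(S)
```

Exchanging the antecedent and the succedent swaps b and c, and negating both swaps a with d and b with c. In this sketch all three calls give the same level, in agreement with the two symmetry properties stated above.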
How to use GUHA in QSAR
Describing the shape and the structural features of compounds is quite a difficult task. It therefore requires
either good prior knowledge or a huge amount of simple data. GUHA can be used for processing such data,
but the data must be preprocessed first.
The coding of the compound structure is one of the main problems of this approach. We can do it manually or
we can use some kind of automatic preprocessing. Manual coding preserves the most information, and
information can be lost in automated coding. But manual coding of big databases would take too
much time and effort, so we have to look for a middle way.
Some descriptors can be cardinal or can take too many values. We divide each such variable into several
intervals. Among the variables there can be many features and indexes whose meaning is mostly unknown to us, so
we divided them automatically into n equifrequent intervals (e.g. Low, Medium, High). That means that one
interval (one variable category) contains about 1/n of the cases. Conversely, two or more values can be merged
into one category.
(More sophisticated dividing or merging is also possible here.)
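The equifrequent division can be sketched as follows. This is a simplified illustration, assuming rank-based splitting; the function name `equifrequent_categories` and the sample values are hypothetical, and equal values are not treated specially here.

```python
def equifrequent_categories(values, labels=("Low", "Medium", "High")):
    """Assign each value to one of n = len(labels) categories so that
    every category contains about 1/n of the cases.

    Simplification: the split is by rank, so equal values lying on a
    boundary may end up in different categories.
    """
    n = len(labels)
    total = len(values)
    # Indices of the values ordered from the smallest to the largest value.
    order = sorted(range(total), key=lambda i: values[i])
    categories = [None] * total
    for rank, index in enumerate(order):
        categories[index] = labels[rank * n // total]
    return categories

# Hypothetical values of one topological index for six compounds.
index_values = [0.1, 5.0, 2.2, 9.7, 3.3, 7.1]
coded = equifrequent_categories(index_values)
```

With six values and three labels, each category receives two compounds.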
Some variables take only one value (0) and therefore carry no information, so they were omitted. After these
steps, the data could be used directly as the input of GUHA +/-.[3]
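Omitting such variables is straightforward. The sketch below assumes that every object is a dictionary with the same variable names; the function name and the three toy objects are invented for illustration.

```python
def drop_constant_variables(rows):
    """Remove variables that take a single value across all objects."""
    informative = [
        name for name in rows[0]
        if len({row[name] for row in rows}) > 1
    ]
    return [{name: row[name] for name in informative} for row in rows]

# Toy data: n_channel is 0 for every object and is therefore dropped.
rows = [
    {"n_groove": 1, "n_channel": 0, "activity": "High"},
    {"n_groove": 2, "n_channel": 0, "activity": "Low"},
    {"n_groove": 0, "n_channel": 0, "activity": "Low"},
]
cleaned = drop_constant_variables(rows)
```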
Data preprocessing is necessary for GUHA to work. But the output of GUHA procedures includes many
hypotheses, and we must choose only those that are interesting from the user's point of view. The GUHA +/-
software has tools for processing the output, but there is still much room for new, effective methods in this
respect.
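The systematic generation and the subsequent selection can be sketched together on toy data. The sketch below is not the GUHA +/- procedure: it enumerates antecedents as conjunctions of at most two literals, and in place of a statistical quantifier it uses a simple stand-in rule (at least `min_support` supporting cases and a share of at least `min_conf` among the cases satisfying the antecedent). All names, thresholds, and data are assumptions made for the example.

```python
from itertools import combinations

def generate_hypotheses(rows, target, target_value,
                        max_len=2, min_support=2, min_conf=0.9):
    """Enumerate conjunctions of literals (variable = category) as
    antecedents for the fixed succedent target = target_value and
    keep those passing the stand-in acceptance rule."""
    literals = sorted({(var, val) for row in rows
                       for var, val in row.items() if var != target})
    found = []
    for size in range(1, max_len + 1):
        for antecedent in combinations(literals, size):
            # Skip conjunctions that use the same variable twice.
            if len({var for var, _ in antecedent}) < size:
                continue
            a = b = 0
            for row in rows:
                if all(row[var] == val for var, val in antecedent):
                    if row[target] == target_value:
                        a += 1      # supporting case
                    else:
                        b += 1      # countercase
            if a > 0 and a >= min_support and a / (a + b) >= min_conf:
                found.append((antecedent, a, b))
    # Fewest countercases first, then the largest support.
    found.sort(key=lambda h: (h[2], -h[1]))
    return found

# Toy data, invented for illustration.
rows = [
    {"has_groove": "yes", "index_1": "High", "mutagenic": "yes"},
    {"has_groove": "yes", "index_1": "High", "mutagenic": "yes"},
    {"has_groove": "yes", "index_1": "Low", "mutagenic": "no"},
    {"has_groove": "no", "index_1": "High", "mutagenic": "yes"},
    {"has_groove": "no", "index_1": "Low", "mutagenic": "no"},
    {"has_groove": "no", "index_1": "Low", "mutagenic": "no"},
]
hypotheses = generate_hypotheses(rows, "mutagenic", "yes")
```

Even on six objects the enumeration visits every admissible conjunction, which shows why the number of candidate hypotheses grows quickly with the number of variables and why the output has to be filtered and ordered.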
Success of GUHA in QSAR
We have processed two data sets by the GUHA method in QSAR. First, a set of 40
antimelanoma catechol analogs was studied.[4] In spite of the simple fingerprint coding, the results were
practically the same as those of Catalyst RTM (a very complex and expensive knowledge-based software).[5]
Next, we used the GUHA method for the well-known mutagenicity data set (230 nitroaromatic hydrocarbons).[6]
The structural formulas were encoded by fingerprint descriptors, and new toxicological knowledge as well as facts
already known in toxicology were found.[7] Later, the same data set was described by approximately 200
topological indexes. There was a huge redundancy in the data, and hardly any of the indexes are interpretable
to us. Nevertheless, the results were again good and interesting from the toxicological point of view.[8]
At present, data mining is at the center of interest in both theory and application, and many methods are used
for processing different data sets. GUHA ranks among the successful data mining methods, in QSAR as well.
References
1. Chytil, M., Hajek, P., Havel, I.: The GUHA method of automated hypotheses generation, Computing, 293-308, 1966
2. Hajek, P., Sochorova, A., Zvarova, J.: GUHA for personal computers. Computational Statistics and Data
Analysis 19, 149-151, 1995
3. Honzikova, Z.: GUHA +/- User's guide, manual for GUHA +/- software package 1999
4. Halova J., et al: Quant. Struct.-Act. Relat., 1998, 17, 37
5. CATALYST RTM, Release 2.3 Manual, Molecular Simulations Inc., Burlington, MA (1995)
6. Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., Hansch, C.: Structure-Activity
Relationship of Mutagenic Aromatic and Heteroaromatic Nitro Compounds. Correlation with molecular orbital
energies and hydrophobicity, Journal of Medicinal Chemistry 34 (2), 786-797, 1991
7. Zak, P., Halova, J.: Mutagenes Discovery Using PC GUHA Software System. In: S. Arikawa, K. Furukawa
(Eds.): Proceedings of the Second International Conference on Discovery Science 1999, Tokyo (Waseda
University), Lecture Notes in Computer Science, LNCS 1721, Springer Verlag, Berlin, Heidelberg, New York,
Tokyo 1999
8. Zak, P., Halova, J.: Coping Discovery Challenge of Mutagenes Discovery GUHA +/- for Windows, KDD
Challenge 2000 technology spotlight paper, 55-60, The Fourth Pacific-Asia Conference on Knowledge
Discovery and Data Mining, International Workshop on KDD Challenge on Real-world Data, Kyoto 2000