Quantitative Structure-Activity Relationships by GUHA Method

Premysl Zak (a), Jaroslava Halova (b)

Academy of Sciences of The Czech Republic
(a) Institute of Computer Science, Pod vodarenskou vezi 2, CZ-182 07 Prague 8, Czech Republic, e-mail: zak@cs.cas.cz
(b) Institute of Inorganic Chemistry, CZ 250 68 Rez, Czech Republic, e-mail: halova@iic.cas.cz

Abstract

We show that describing a compound structure by nominal variables can yield interesting results. We have successfully processed a structure-activity data set of antimelanoma analogs and compared the results with those of Catalyst RTM, a specialized program from Molecular Simulations. The GUHA method was also used to determine relationships between structure and mutagenicity in the famous mutagenicity data set of nitroaromatic hydrocarbons. This is the first successful application of fingerprint descriptors in SAR.

Fingerprint Descriptors in Structure Coding

Fingerprint descriptors are used for coding simple data, or for coding data in a simple way. The name "fingerprint" originates from dactyloscopy, where the shapes of papillary lines are described in this way; such descriptors can capture a shape as complex as a fingerprint. The main principle of description by fingerprint descriptors is the choice of suitable variables (mostly nominal). For example, to describe trees, one has to choose signs that differentiate particular species. Our task is to describe those signs of structural formulas that could be connected with the investigated properties. The better our choice of descriptors, the higher the chance that our effort will succeed. If the data were simple, no selection would be necessary: we could use all possible descriptors. But our data are complex, so our approach must include added knowledge. In the case of mutagenicity, knowledge of the mechanism of information transfer among cells is useful.
To affect the cell, a molecule of the compound must get into the vicinity of the cell and must successfully pass through its surface. The polarity of individual parts of the molecule and the particular shapes of its surface can play an important role in the attack of the molecule. But the shape and the physical or chemical properties are connected to the presence and mutual position of the atoms and groups of atoms in a given molecule. This is the application field for fingerprint description.

Aromatic rings are the main building blocks in our data set. Their arrangement produces different shapes, and the shapes of structural formulas reflect some of the structural patterns of the real molecule (Fig. 1). Examples of such shape patterns are shown in Fig. 2. The presence of a particular pattern, or the number of its occurrences, can be used as a fingerprint descriptor of the shape of the structural formula.

[Figure 1. Relation between the coding data and the reality. Labels: Shapes of compounds; Coding data; The part contributing to the mutagenicity; Total yield; The part reflecting the shape.]

Processing data made from such simple variables (descriptors) is very simple. But the number of variables needed to describe the shape must be high. Another problem is that many of the patterns cannot affect mutagenicity, and among them there are phantoms that only seem to be of some importance. A large number of variables and a large number of results require a fast and effective implementation to select all interesting patterns.

[Figure 2. Examples of shape patterns: groove, bulge, channel, hill.]

GUHA method – a tool for nominal data processing

The basic ideas of the GUHA (General Unary Hypotheses Automaton) method were given in [1] as early as 1966. The starting notion of the method is an object. The object has properties expressed by variables assigned to it. For example, the object can be a man with properties given by variables such as sex, age, color of eyes, etc.
In order to make a reasonable knowledge discovery, we need a set of objects of the same kind which differ in the values of the variables defined on them. The aim of the GUHA method is to generate hypotheses on relations among properties of the objects which are in some sense interesting. This generation is systematic: the machine generates, in a sense, all possible hypotheses and collects the interesting ones.

A hypothesis is generally composed of two parts, an antecedent and a succedent. The antecedent and the succedent are tied together by a so-called generalized quantifier describing the relation between them. The antecedents and succedents are propositions on the object in the sense of classical propositional logic, so they are true or false on a particular object. As in propositional logic, these propositions can be simple or compound. Compound propositions are usually built from literals using the conjunction connective. The formulation of these propositions is made possible by the categorization of the original variables.

Given an antecedent and a succedent, the frequencies of the four possible combinations can be computed and expressed in compressed form as a so-called four-fold table (ff-table). A general ff-table looks like this:

ff-table            Succedent    Non(succedent)
Antecedent              a              b
Non(antecedent)         c              d

Here a is the number of objects satisfying both the antecedent and the succedent, b is the number of objects satisfying the antecedent but not the succedent (i.e. countercases of the hypothesis examined), etc.

A generalized quantifier is a decision procedure assigning 1 or 0 to each ff-table. If the value is 1, we accept the hypothesis with the corresponding ff-table; if it is 0, we reject it. The basic generalized quantifier defined and used in GUHA is based on Fisher's exact test, known from mathematical statistics.
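The ff-table and a Fisher-type quantifier can be illustrated by a minimal Python sketch. It is not the GUHA implementation: the function names `ff_table`, `fisher_p` and `fisher_quantifier`, the default level `alpha=0.05`, and the choice of a one-sided test for positive association are assumptions made here for illustration, and real GUHA software may impose further conditions.

```python
from math import comb


def ff_table(antecedent, succedent):
    """Count (a, b, c, d) from two equal-length sequences of truth values."""
    a = b = c = d = 0
    for ant, suc in zip(antecedent, succedent):
        if ant and suc:
            a += 1      # antecedent and succedent both hold
        elif ant:
            b += 1      # countercase: antecedent holds, succedent does not
        elif suc:
            c += 1      # succedent holds without the antecedent
        else:
            d += 1      # neither holds
    return a, b, c, d


def fisher_p(a, b, c, d):
    """One-sided Fisher exact p-value: chance of at least `a` joint cases
    when the row and column totals of the ff-table are held fixed."""
    r = a + b           # objects satisfying the antecedent
    s = a + c           # objects satisfying the succedent
    n = a + b + c + d   # all objects
    tail = sum(comb(r, k) * comb(n - r, s - k) for k in range(a, min(r, s) + 1))
    return tail / comb(n, s)


def fisher_quantifier(a, b, c, d, alpha=0.05):
    """Return 1 (accept the hypothesis) or 0 (reject) for a given ff-table."""
    return 1 if fisher_p(a, b, c, d) <= alpha else 0


if __name__ == "__main__":
    ant = [True, True, True, True, False, False, False, False]
    suc = [True, True, True, False, True, False, False, False]
    table = ff_table(ant, suc)
    print(table, fisher_p(*table), fisher_quantifier(*table))
```

In this sketch the p-value is unchanged when b and c are swapped, and also when a is swapped with d and b with c, so it can be used to check symmetry properties of a candidate quantifier on small tables.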
With the Fisher quantifier, the decision to accept a hypothesis is based on a statistical test at a particular level alpha. For each hypothesis, the value of the Fisher statistic given by the values a, b, c, and d of the ff-table is computed. Simply said, its value measures the association between the antecedent and the succedent. More precisely, the value of the Fisher statistic refers to the level at which we can accept the hypothesis (in a statistical sense) that the conditional probability of the succedent given the antecedent, P(S/A), is greater than the unconditional probability of the succedent, P(S). Accepting or rejecting the tested hypothesis is determined by comparing the value of the Fisher statistic with the alpha value. Note that the Fisher statistic has the following symmetry property: the level for accepting P(S/A) > P(S) is the same as the level for accepting P(A/S) > P(A). A further property of the Fisher statistic is negation symmetry: Fisher(A, S) = Fisher(non(A), non(S)). Several other quantifiers are used in GUHA.[2,3]

How to use GUHA in QSAR

Describing the shape and the structural features of compounds is quite a difficult task. It therefore requires either good knowledge or a huge amount of simple data. GUHA can be used for processing such data, but the data must be preprocessed first. The coding of the compound structure is one of the main problems of this approach. We can do it manually or use some kind of automatic preprocessing. Manual coding preserves the most information, and some of it can be lost in automated coding. But manual coding of big databases would take too much time and effort, so we have to look for a middle way.

Some descriptors can be cardinal or take too many values. We divide each such variable into several intervals.
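One common way to divide a variable into intervals is equal-frequency binning, in which each interval holds roughly the same share of the cases. The following is a minimal Python sketch under stated assumptions: the function names `equifrequent_categories` and `drop_constant` and the default labels are hypothetical, tied values are deliberately kept in the same category, and this is not the preprocessing code of any GUHA package.

```python
from bisect import bisect_left


def equifrequent_categories(values, labels=("Low", "Medium", "High")):
    """Map each value to one of len(labels) categories of roughly equal size.

    A value's rank is the number of cases strictly below it, so equal
    values always receive the same category.
    """
    n = len(labels)
    ordered = sorted(values)
    return [labels[bisect_left(ordered, v) * n // len(values)] for v in values]


def drop_constant(columns):
    """Remove variables that take only one value; they cannot discriminate."""
    return {name: col for name, col in columns.items() if len(set(col)) > 1}


if __name__ == "__main__":
    index_values = [0.1, 2.5, 0.7, 3.3, 1.9, 0.4]   # made-up descriptor values
    print(equifrequent_categories(index_values))
    print(drop_constant({"x": [0, 0, 0], "y": [1, 2, 3]}))
```

With many tied values the categories can become uneven, which is one reason a more sophisticated division or merging of categories may be preferable.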
Among the variables there can be many features and indexes whose meaning is mostly unknown to us, so we divided them automatically into n equifrequent intervals (e.g. Low, Medium, High). That means that one interval, i.e. one category of the variable, contains about 1/n of the cases. Conversely, two values can be merged into one category. (More sophisticated dividing or merging is also possible here.) Some variables take only one value (0) and cannot be useful in any way, so they were omitted. After this preprocessing, the data could be used directly as input to GUHA +/-.[3]

Data preprocessing is necessary for GUHA to work. But the output of GUHA procedures includes many hypotheses, and we must choose only those that are interesting from the user's point of view. The GUHA +/- software has tools for processing the output data, but there is still much room for the search for new effective methods in this respect.

Success of GUHA in QSAR

We have presented the processing of two data sets by the GUHA method in QSAR. First, a set of 40 antimelanoma catechol analogs was studied.[4] In spite of the simple fingerprint coding, the results were practically the same as those of Catalyst RTM (a very complex and expensive knowledge-based software package).[5] Next, we used the GUHA method for the famous mutagenicity data set (230 nitroaromatic hydrocarbons).[6] The structural formulas were encoded by fingerprint descriptors, and new toxicological knowledge as well as facts already known in toxicology were found.[7] Later, the same data set was described by approximately 200 topological indexes. There was a huge redundancy in the data, and we can interpret hardly any of these indexes. Nevertheless, the results were again good and interesting from the toxicological point of view.[8]

At present, data mining is at the center of interest in both theory and application. Many methods are used for processing different data sets, and GUHA ranks among the successful data mining methods in QSAR as well.

References

1.
Chytil, M., Hajek, P., Havel, I.: The GUHA method of automated hypotheses generation, Computing, 293-308, 1966.
2. Hajek, P., Sochorova, A., Zvarova, J.: GUHA for personal computers, Computational Statistics and Data Analysis 19, 149-151, 1995.
3. Honzikova, Z.: GUHA +/- User's guide, manual for GUHA +/- software package, 1999.
4. Halova, J., et al.: Quant. Struct.-Act. Relat. 17, 37, 1998.
5. CATALYST RTM, Release 2.3 Manual, Molecular Simulations Inc., Burlington, MA, 1995.
6. Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Schusterman, A. J., Hansch, C.: Structure-Activity Relationship of Mutagenic Aromatic and Heteroaromatic Nitro Compounds. Correlation with molecular orbital energies and hydrophobicity, Journal of Medicinal Chemistry 34 (2), 786-797, 1991.
7. Zak, P., Halova, J.: Mutagenes Discovery Using PC GUHA Software System. In: S. Arikawa, K. Furukawa (Eds.): Proceedings of The Second International Conference on Discovery Science, Tokyo (Waseda University), 1999, Lecture Notes in Computer Science, LNCS 1721, Springer Verlag, Berlin, Heidelberg, New York, Tokyo, 1999.
8. Zak, P., Halova, J.: Coping Discovery Challenge of Mutagenes Discovery GUHA +/- for Windows, KDD Challenge 2000 technology spotlight paper, 55-60, The Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining, International Workshop on KDD Challenge on Real-world Data, Kyoto, 2000.