Application of data mining techniques in the classification of anti-HIV compounds

Maria Seyagh, Mazouz El Mostapha, Abdellah Jarid, Driss Cherqaoui*
Faculté des Sciences Semlalia BP 2390, Université Cadi Ayyad, Marrakech, Morocco. cherqaoui@ucam.ac.ma

Didier Villemin
Ecole Nationale Supérieure d'Ingénieurs (E.N.S.I.) I. S. M. R. A., LCMT, UMR CNRS n° 6507, 6 boulevard Maréchal Juin, 14050 Caen, France

Abstract— Support vector machines, artificial neural networks and decision trees were used to develop structure-anti-HIV activity models for a set of organic compounds. The aim of this work is to establish reliable models that distinguish highly active compounds from weakly active ones and, furthermore, to identify the structural features related to the activity. The approach uses the occurrence of substructural fragments as descriptors encoding molecular structures. The results show that all three methods are good classifiers: the percentage of correctly classified compounds ranges from 87.7% to 100% for the training set and from 85.7% to 100% for the test set. This study shows that data mining provides medicinal chemists with valuable information for drug design and for the prediction of drug activity.

Keywords-SVM; ANN; Decision Trees; data mining; QSAR

I. INTRODUCTION

The purpose of research in chemistry is the development of new products that offer advantages in terms of effectiveness, tolerance or price. In medicinal chemistry, despite considerable work, there does not seem to be a method that predicts with certainty the pharmacological action of a new molecule. Drug design methods based on trial and error, aided by intuition and organic synthesis, are becoming more expensive and are often misdirected. In other words, the drug discovery process is time consuming and expensive: it can often take 12 to 15 years for a drug to reach the market from the laboratory [1]. Pharmaceutical companies have therefore turned to chemoinformatics tools for predicting the biological activities of molecules. These tools rely on computational and information techniques to support the drug discovery process.

Quantitative structure-activity relationship (QSAR) models are a solution to the problem of directly calculating physical and biological properties of molecules from their physical structure. QSAR represents molecular structures as objects in a chemical space defined by molecular descriptors and attempts to determine relationships between a compound's biological activity and its descriptors.

Data mining methods are used in many scientific areas and have become essential for the pharmaceutical industry. They have been used in QSAR investigations to predict the biological activity of organic molecules from their molecular structures [2]. These relationships are established by several computational techniques such as support vector machines (SVM), decision trees (DT), artificial neural networks (ANN) and k-nearest neighbors (k-NN).

Human immunodeficiency virus (HIV), the causative agent of acquired immunodeficiency syndrome (AIDS), has become the focus of many studies because of its massive spread all over the world, and current strategies for treating AIDS depend essentially on disrupting the replication of HIV.

The objective of this work is to predict the class (low or high anti-HIV activity) of a molecule from its molecular structure. This paper applies three computational techniques (SVM, ANN and DT) to the investigation of structure-anti-HIV activity relationships, using substructural fragments as molecular descriptors for encoding chemical compounds.

II. CHEMICAL DATA

A. Compounds studied

The data set in this investigation consists of 79 molecules with high or low anti-HIV activity. They were taken from papers published by Tanaka et al. [3-6]. These compounds are hydroxyethoxy-methyl-(phenylthio)thymine (HEPT) derivatives.
Based on the anti-HIV activity (Y), the data set was divided into two classes: class H (Y ≥ 6.11) represents the highly active compounds and class L (Y < 6.11) the weakly active ones. The value 6.11 is the average of the minimum and maximum values of the anti-HIV activity. The full data set contains 38 H and 41 L compounds. Based on a molecular similarity criterion, the data set was divided into two subsets: a training set of 65 compounds (31 H and 34 L) and a test set of 14 compounds (7 H and 7 L). The models established from the training set are used to predict the anti-HIV class of the remaining 14 compounds.

B. Molecular descriptors

The main step in a QSAR study consists in parametrizing the variation in chemical structure. The performance of QSAR models depends mostly on the parameters used to describe the molecular structures. In this study the molecular descriptors (input variables) are the frequencies of selected substructures (or fragments). These nonnegative integers give the number of times a given substructure appears in a molecule.

Using the ISIDA program [7], 180 substructural fragments were obtained for each molecule. A stepwise multiple linear regression (MLR) procedure based on the forward-selection and backward-elimination methods was used to select the most informative variables. Fourteen fragments were finally retained and used as input variables for building the classification models.

III. RESULTS AND DISCUSSION

A. Artificial Neural Networks

ANN have their origins in efforts to produce computer models of the information processing that takes place in the brain. They have found application in a wide variety of fields such as image analysis of facial features and stock market prediction. The application of ANN methods to problems in chemistry and biochemistry has rapidly gained popularity in recent years.
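As a reminder of the targets that all classifiers in this section are trained on, the labelling rule of Section II.A can be sketched in a few lines of Python. This is only an illustration: the function names and the activity values in the usage lines are hypothetical and are not taken from the HEPT data set; only the threshold 6.11 and the rule "class H if Y ≥ threshold, class L otherwise" come from the text.

```python
from typing import Sequence


def midpoint_threshold(activities: Sequence[float]) -> float:
    """Average of the minimum and maximum activity values."""
    return (min(activities) + max(activities)) / 2.0


def activity_class(y: float, threshold: float = 6.11) -> str:
    """Return 'H' for Y >= threshold and 'L' otherwise (default 6.11 as in the text)."""
    return "H" if y >= threshold else "L"


# Hypothetical activity values, used only to exercise the two functions.
toy_activities = [4.0, 5.5, 7.2, 8.0]
toy_threshold = midpoint_threshold(toy_activities)
toy_labels = [activity_class(y, toy_threshold) for y in toy_activities]
print(toy_threshold, toy_labels)
```

Each compound is then represented by its vector of fragment counts (the input) together with one of these two labels (the target).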
The architecture of an ANN is determined by the way in which the outputs of the neurons are connected to the other neurons. While there are a number of different ANN architectures, the most frequently used type in QSAR [8], and the one used in this paper, is the three-layer feed-forward network. In this type of network the neurons are arranged in layers (an input layer, one hidden layer and an output layer), and the connections are unidirectional from the input to the output. Adjacent layers are fully connected and there are no connections between neurons within the same layer.

The ANN architecture we employed consists of:
* an input layer (fed with the molecular descriptors) composed of 14 neurons,
* a hidden layer encoding the interactions in the data; various runs were carried out to determine the best number of units in that hidden layer,
* an output layer of one neuron delivering the predicted class of anti-HIV activity.

The network used in this study was trained with a back-propagation (BP) algorithm. The specific algorithm adopted here has been described previously with a simple example of application [9], and a detailed description is given elsewhere [10]. The learning rate was initially set to 1 and was gradually decreased until the error function could no longer be minimized. We used the sigmoid function as the transfer function and the delta rule as the error correction formula [10]. The ANN computations were performed using the Weka data mining software [11].

Many ANN architectures of the form 14-x-1 (where x is the number of hidden neurons) were tested. Each architecture was trained with four different initial random sets of weights, with the number of cycles limited to 1,000. Among all the architectures, the best one is 14-4-1. Its overall classification performance is 98.5% for the training set (96.8% for class H and 100% for class L) and 100% for the test set.

B. Support vector machines

SVM are a supervised learning technique from the field of machine learning, applicable to both classification and regression. SVM was originally developed by Vapnik and co-workers [12] and has shown promising capability for solving a number of pharmacological and biological classification problems [13]. Detailed descriptions of the theory of SVM can be found in several books and tutorials [12,14]; only a brief description is given here.

For linearly separable cases, SVM performs classification by constructing a hyperplane ($w \cdot x + b = 0$) in a multi-dimensional space that separates two classes of feature vectors with a maximum margin. This amounts to finding a vector $w$ and a parameter $b$ that minimize $\|w\|^2$ and satisfy the following conditions:

$$ w \cdot x_i + b \geq +1 \quad \text{for } y_i = +1 \text{ (positive class)} \qquad (1) $$

$$ w \cdot x_i + b \leq -1 \quad \text{for } y_i = -1 \text{ (negative class)} \qquad (2) $$

where $w$ is a vector normal to the hyperplane, $x_i$ is a feature vector, $y_i$ is the class label, $|b| / \|w\|$ is the perpendicular distance from the hyperplane to the origin, and $\|w\|$ is the Euclidean norm of $w$. After determining $w$ and $b$, a given vector $x$ can be classified using:

$$ \operatorname{sign}(w \cdot x + b) \qquad (3) $$

For nonlinearly separable cases, SVM projects the input feature vectors into a high-dimensional feature space using a kernel function $K(x, x_i)$. A linear SVM is then applied in this feature space, and the decision function is given by:

$$ f(x) = \operatorname{sign}\left( \sum_{i=1}^{l} \alpha_i y_i K(x, x_i) + b \right) \qquad (4) $$

where sign is the sign function, which returns +1 for a positive argument and -1 for a negative argument, and $K(x, x_i) = \Phi(x) \cdot \Phi(x_i)$ is the inner product of the images $\Phi(x)$ and $\Phi(x_i)$ of the two vectors $x$ and $x_i$ in the feature space. Any function that satisfies Mercer’s condition [15] can be used as the kernel function, and $\alpha_i$ is the Lagrange multiplier.

The performance of SVM for classification depends on the combination of several parameters. The kernel function should be chosen first.
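Equation (4) can be evaluated directly once the support vectors, their labels, the multipliers and the bias are known. The sketch below is illustrative only: the Gaussian kernel, the function names and all numerical values (support vectors, multipliers, bias, kernel parameter) are assumptions chosen for the example, not quantities from the model fitted in this work. A score of exactly zero is assigned to the positive class here, which is a convention of this sketch.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def rbf_kernel(u: Vector, v: Vector, gamma: float) -> float:
    """Gaussian kernel exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def svm_decision(
    x: Vector,
    support_vectors: Sequence[Vector],
    labels: Sequence[int],
    alphas: Sequence[float],
    b: float,
    kernel: Callable[[Vector, Vector], float],
) -> int:
    """Evaluate sign(sum_i alpha_i * y_i * K(x, x_i) + b) as in Eq. (4)."""
    score = b + sum(
        alpha * y * kernel(x, sv)
        for alpha, y, sv in zip(alphas, labels, support_vectors)
    )
    return 1 if score >= 0 else -1


# Hypothetical two-support-vector model in a two-dimensional descriptor space.
toy_svs = [[0.0, 0.0], [2.0, 2.0]]
toy_labels = [-1, +1]
toy_alphas = [1.0, 1.0]
toy_kernel = lambda u, v: rbf_kernel(u, v, gamma=0.5)

print(svm_decision([1.8, 1.9], toy_svs, toy_labels, toy_alphas, 0.0, toy_kernel))
print(svm_decision([0.1, 0.2], toy_svs, toy_labels, toy_alphas, 0.0, toy_kernel))
```

Passing the kernel as a callable keeps the decision rule independent of the kernel choice discussed next.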
A number of kernel functions have been found to provide good generalization capabilities, including the linear, polynomial and radial basis functions. For classification tasks, a commonly used kernel is the radial basis function because of its good generalization performance and the small number of parameters to be optimized [16]. This function is formulated as follows:

$$ K(x, x_i) = \exp\left( -\gamma \, \|x - x_i\|^2 \right) \qquad (5) $$

where $\gamma$ is the parameter of the kernel and $x$ and $x_i$ are two independent variables. This function is used in the present work.

Once the kernel function has been chosen, the width ($\gamma$) of the radial basis function and the capacity parameter C should be optimized. C is a regularization parameter that controls the tradeoff between maximizing the margin and minimizing the training error. The kernel parameter $\gamma$ controls the amplitude of the radial basis function and, furthermore, the generalization ability of the SVM.

In the present study, the grid search feature of LIBSVM was used to find the optimal values of C and $\gamma$. Values of log2 C ranging from 5 to 10 in steps of 1 and values of log2 $\gamma$ ranging from 0 to -5 in steps of -1 were explored. The optimal values of C and $\gamma$ for the training set were found to be 512 and 0.03125, respectively. These optimized parameters were then used in LIBSVM to build the SVM classification model. The results obtained give a total accuracy of 100% for both the training and test sets.

C. Decision trees

The DT is one of the most popular classification algorithms in current use in data mining and pattern recognition. Each node in such a tree is associated with a test on the values of an attribute. Each edge from the node is labelled with a particular value of the attribute, and each leaf of the tree is associated with a value of the class. Each path in the tree corresponds to a production rule whose premises are the tests associated with the nodes and whose conclusion is the class associated with the leaf at the end of the path.

Many classification and DT generation methods have been investigated. C4.5 is an algorithm for generating a DT developed by Quinlan [17]. The C4.5 algorithm, used in the present study, has two important parameters:
• The minimum test weight m is the minimum number of cases that must be covered by at least two of the outcomes of every test. A larger value of m can be useful to reduce the size of the tree by avoiding near-useless tests. This can increase its generalisation ability when the data is noisy.
• The pruning confidence level CDT controls the level of pruning. Pruning is performed after tree induction and removes branches that are likely to reduce the generalisation ability of the tree. A smaller value of CDT leads to more aggressive pruning.

DT was performed using the Weka data mining software. By varying CDT = 0.01, 0.05, 0.1, . . ., 0.5 and m = 2, 3, . . . , 10, the optimal values of CDT and m were identified as 0.3 and 5, respectively.

The obtained DT (Fig. 1) shows that only two descriptors (F1 and F2) are needed to separate the learning data set. They represent the occurrences of two fragments containing two and three carbon atoms, respectively. The rules are defined as follows:
Rule 1: if F1 = 0 then Class L (92.8%);
Rule 2: if F1 ≠ 0 and F2 = 0 then Class H (89.6%);
Rule 3: if F1 ≠ 0 and F2 ≠ 0 then Class L (62.5%).

Figure 1. Decision tree. The root tests F1: F1 = 0 leads to a leaf L (92.8%); for F1 ≠ 0 the descriptor F2 is tested: F2 = 0 leads to a leaf H (89.6%) and F2 ≠ 0 to a leaf L (62.5%).

The overall classification accuracies were 87.7% for the training set (83.9% for class H and 91.2% for class L) and 85.7% for the test set (71.4% for class H and 100% for class L).

The results obtained are encouraging and show that the SVM model slightly outperforms the ANN and DT models. However, it should be noted that the DT results are obtained from only two molecular descriptors.
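The three rules read from the tree amount to two nested tests on the fragment counts. The sketch below is a direct transcription of those rules; the function name and the example counts are hypothetical, and since the descriptors are nonnegative occurrence counts, "not equal to 0" is treated as "greater than 0".

```python
def classify_by_rules(f1: int, f2: int) -> str:
    """Apply Rules 1-3: f1 and f2 are the occurrence counts of fragments F1 and F2."""
    if f1 == 0:
        return "L"  # Rule 1
    if f2 == 0:
        return "H"  # Rule 2
    return "L"      # Rule 3


# Hypothetical (F1, F2) count pairs, used only to exercise each rule once.
for f1, f2 in [(0, 0), (1, 0), (2, 1)]:
    print((f1, f2), classify_by_rules(f1, f2))
```

Because only F1 and F2 are consulted, the remaining twelve selected fragments play no role in this classifier.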
The main advantage of SVM is that it adopts the structural risk minimization principle, which has been shown to be superior to the traditional empirical risk minimization principle.

The two descriptors selected by the DT algorithm correspond to the main fragments in the molecule and contain only carbon atoms. The DT eliminated all fragments containing heteroatoms such as oxygen, sulfur and nitrogen. These two fragments can thus be considered as a set of structural features that are responsible for a molecule's anti-HIV activity.

Using SVM, ANN or DT, chemists can select highly active molecules from a pool of compounds. The reduction in the number of molecular descriptors obtained by the DT can guide the chemist in the synthesis of new biologically active molecules.

IV. CONCLUSION

SVM, ANN and DT were applied to classify 79 compounds using substructural fragments calculated from their molecular structures. The classification results are satisfactory. The established classification models can be used in biological screening processes and in the prediction of the anti-HIV activities (or other molecular properties) of untested molecules. The DT model proposes fragments that differentiate between the high and low anti-HIV classes. These fragments can also be used as filters in high-throughput screening.

REFERENCES

[1] C. Shekhar, “In Silico Pharmacology: Computer-Aided Methods Could Transform Drug Development,” Chem. Biol., vol. 15, pp. 413–414, 2008.
[2] T. Scior, J. L. Medina-Franco, Q.-T. Do, K. Martínez-Mayorga, J. A. Yunes Rojas and P. Bernard, “How to Recognize and Workaround Pitfalls in QSAR Studies: A Critical,” Curr. Med. Chem., vol. 16, pp. 4297–4313, 2009.
[3] H. Tanaka, M. Baba, H. Hayakawa, T. Sakamaki, T. Miyasaka, M. Ubasawa, H. Takashima, K. Sekiya, I. Nitta and S. Shigeta, “A new class of HIV-1-specific 6-substituted acyclouridine derivatives: synthesis and anti-HIV-1 activity of 5- or 6-substituted analogues of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT),” J. Med. Chem., vol. 34, pp. 349–357, 1991.
[4] H. Tanaka, M. Baba, H. Hayakawa, T. Sakamaki, T. Miyasaka, M. Ubasawa, H. Takashima, K. Sekiya, I. Nitta, S. Shigeta, R. T. Walker, J. Balzarini and E. De Clercq, “Synthesis and anti-HIV activity of 2-, 3-, and 4-substituted analogues of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT),” J. Med. Chem., vol. 34, pp. 1394–1399, 1991.
[5] H. Tanaka, H. Takashima, M. Ubasawa, K. Sekiya, I. Nitta, M. Baba, S. Shigeta, R. T. Walker, E. De Clercq and T. Miyasaka, “Structure-activity relationships of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine analogues: effect of substitutions at the C-6 phenyl ring and at the C-5 position on anti-HIV-1 activity,” J. Med. Chem., vol. 35, pp. 337–345, 1992.
[6] H. Tanaka, H. Takashima, M. Ubasawa, K. Sekiya, I. Nitta, M. Baba, S. Shigeta, R. T. Walker, E. De Clercq and T. Miyasaka, “Synthesis and antiviral activity of deoxy analogs of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) as potent and selective anti-HIV-1 agents,” J. Med. Chem., vol. 35, pp. 4713–4719, 1992.
[7] ISIDA (In Silico Design and Data Analysis) informational system, http://infochim.u-strasbg.fr/recherche/isida/index.php.
[8] J. Zupan and J. Gasteiger, Neural Networks for Chemists: An Introduction, Weinheim: Wiley-VCH, 1993.
[9] D. Cherqaoui and D. Villemin, “Use of a neural network to determine boiling point of alkanes,” J. Chem. Soc. Faraday. Trans., vol. 90, pp. 97–102, 1994.
[10] J. A. Freeman and D. M. Skapura, Neural Networks: Algorithms, Applications, and Programming Techniques, Reading: Addison-Wesley Publishing Company, 1991.
[11] http://www.cs.waikato.ac.nz/ml/weka/.
[12] V. N. Vapnik, The Nature of Statistical Learning Theory, Berlin: Springer, 1995.
[13] X. G. Yang, D. Chen, M. Wang, Y. Xue and Y. Z. Chen, “Prediction of antibacterial compounds by machine learning approaches,” J. Comput. Chem., vol. 30, pp. 1202–1211, 2009.
[14] C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Min. Knowl. Disc., vol. 2, pp. 127–167, 1998.
[15] J. Mercer, “Functions of positive and negative type and their connection with the theory of integral equations,” Philos. Trans. Roy. Soc. London Ser., vol. 209, pp. 415–446, 1909.
[16] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge: Cambridge University Press, 2000.
[17] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, pp. 86–106, 1986.