Application of data mining techniques in the
classification of anti-HIV compounds
Maria Seyagh, Mazouz El Mostapha, Abdellah Jarid,
Driss Cherqaoui*
Faculté des Sciences Semlalia BP 2390, Université Cadi
Ayyad, Marrakech, Morocco.

Abstract: Support vector machines, artificial neural
networks and decision trees were utilized to develop
structure-anti-HIV models for a set of organic compounds.
The aim of this work is to establish reliable models that
distinguish highly active compounds from weakly active ones
and, furthermore, to identify the structural features related
to the activity. This approach uses the occurrence of
substructural fragments as descriptors encoding molecular
structures. The results obtained show that all methods were
good classifiers. The percentage of correctly classified
compounds ranges from 87.7% to 100% for the training set and
from 85.7% to 100% for the test set. This study shows that
data mining provides medicinal chemists with valuable
information that is useful for drug design and for the
prediction of drug activity.
Keywords: SVM; ANN; decision trees; data mining; QSAR

I. Introduction

The purpose of research in chemistry is the development of
new products that offer advantages in terms of
effectiveness, tolerance or price. In medicinal chemistry,
despite considerable work, there does not seem to be a
method that predicts with certainty the pharmacological
action of a new molecule. Drug design methods based on a
trial and error process, aided by intuition and organic
synthesis, are becoming increasingly expensive and
misdirected. In other words, the drug discovery process is
time consuming and expensive. It can often take 12 to 15
years for a drug to reach the market from the laboratory [1].
Thus, the pharmaceutical industry has turned to new
methods for predicting the biological activities of
molecules using chemoinformatics tools. These tools rely on
computational and information techniques to support the
drug discovery process. Quantitative structure-activity
relationship (QSAR) models are one solution to the problem
of calculating physical and biological properties of
molecules directly from their molecular structure. QSAR
represents molecular structures as objects in a chemical
space defined by molecular descriptors and attempts to
determine relationships between a compound's biological
activity and its descriptors.
Didier Villemin
Ecole Nationale Supérieure d'Ingénieurs (E.N.S.I.)
n° 6507, 6 boulevard Maréchal Juin, 14050
Caen, France
Data mining methods are used in many scientific areas and
have become essential to the pharmaceutical industry. They
have been used in QSAR investigations to predict the
biological activity of organic molecules from their
molecular structures [2]. These relationships are
established by several computational techniques such as
support vector machines (SVM), decision trees (DT),
artificial neural networks (ANN) and k-nearest neighbors
(k-NN).
Human immunodeficiency virus (HIV), the causative agent of
acquired immunodeficiency syndrome (AIDS), has become the
focus of many studies because of its massive spread all
over the world. Current strategies for treating AIDS depend
essentially on disrupting the replication of HIV. The
objective of this work is to predict the class (low or high
anti-HIV activity) of a molecule from its molecular
structure. This paper applies three computational
techniques (SVM, ANN and DT) to the investigation of
structure-anti-HIV relationships, using substructural
fragments as molecular descriptors for encoding chemical
compounds.
A. Compounds studied
The data set in this investigation consists of 79 molecules
with high or low anti-HIV activity. They were taken from
papers published by Tanaka et al. [3-6]. These compounds
are 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine
(HEPT) derivatives. Based on the anti-HIV activity (Y), the
data set was divided into two classes: class H (Y ≥ 6.11)
contains the highly active compounds and class L (Y < 6.11)
the weakly active ones. The threshold 6.11 is the average
of the minimum and maximum values of the anti-HIV activity.
In total there were 38 H and 41 L compounds. Using a
molecular similarity criterion, the data set was divided
into two subsets: a training set of 65 compounds (31 H and
34 L) and a test set of 14 compounds (7 H and 7 L). The
models established from the training set are used to
predict the anti-HIV class of the remaining 14 compounds.
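The class assignment described above can be sketched in a few lines of Python. This is only an illustration: the activity values are hypothetical and are not taken from the HEPT data set, and the function names are ours.

```python
def activity_threshold(activities):
    """Midpoint between the smallest and largest activity values."""
    return (min(activities) + max(activities)) / 2.0


def assign_class(y, threshold):
    """Class H for y >= threshold, class L otherwise."""
    return "H" if y >= threshold else "L"


# Hypothetical activity values, for illustration only.
toy_activities = [3.5, 4.2, 5.9, 6.5, 7.8, 8.5]
threshold = activity_threshold(toy_activities)
labels = [assign_class(y, threshold) for y in toy_activities]
print(threshold, labels)
```

With the threshold of 6.11 reported for the real data, a compound with Y = 6.11 falls in class H because the comparison is inclusive.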
B. Molecular descriptors
The main step in a QSAR study is parametrizing the
variation in chemical structure. The performance of QSAR
models depends mostly on the parameters used to describe
the molecular structures. In this study the molecular
descriptors (input variables) are the frequencies of
selected substructures (or fragments). These nonnegative
integers give the number of occurrences of a given
substructure in a molecule. Using the ISIDA program [7],
180 substructural fragments were obtained for each
molecule. A stepwise multiple linear regression (MLR)
procedure based on forward selection and backward
elimination was used to select the most informative
variables. Finally, fourteen fragments were chosen and used
as input variables for the classification models.
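To make the idea of fragment-count descriptors concrete, the sketch below turns a list of fragment occurrences into a descriptor vector. It assumes the fragments of a molecule have already been enumerated (the ISIDA fragmentation itself is not reproduced here); the fragment labels and the function name are hypothetical.

```python
from collections import Counter


def descriptor_vector(fragment_occurrences, selected_fragments):
    """Count how often each selected fragment occurs in one molecule."""
    counts = Counter(fragment_occurrences)
    # Counter returns 0 for fragments that are absent from the molecule.
    return [counts[fragment] for fragment in selected_fragments]


# Hypothetical fragment labels, for illustration only.
molecule_fragments = ["C-C", "C-C", "C-C-C", "C-S", "C=O"]
selected_fragments = ["C-C", "C-C-C", "C-N"]

x = descriptor_vector(molecule_fragments, selected_fragments)
print(x)
```

In the study, the selected list would hold the fourteen fragments retained by the stepwise procedure, so each molecule is encoded by fourteen nonnegative integers.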
A. Artificial Neural Networks
ANN have their origins in efforts to produce computer
models of the information processing that takes place in
the brain. They have found application in a wide variety of
fields such as image analysis of facial features and stock
market prediction. The application of ANN methods to
problems in chemistry and biochemistry has rapidly gained
popularity in recent years. The architecture of an ANN is
determined by the way in which the outputs of the neurons
are connected to the other neurons. While there are a
number of different ANN architectures, the type most
frequently used in QSAR [8], and the one we use in this
paper, is the three-layer feed-forward network. In this
type of network the neurons are arranged in layers (an
input layer, one hidden layer and an output layer), and the
connections are unidirectional from the input to the
output. Adjacent layers are fully connected and there are
no connections between neurons within the same layer. The
ANN architecture we employed consists of:
* an input layer (fed with molecular descriptors)
composed of 14 neurons,
* a hidden layer encoding the interactions in the data;
various runs were carried out to determine the best number
of units in that hidden layer,
* an output layer of one neuron delivering the
predicted class of anti-HIV activity.
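The forward pass of such a three-layer network can be sketched as follows. This is a minimal illustration and not the Weka implementation: the weights are random, the input vector is a hypothetical set of fragment counts, four hidden neurons are chosen only as an example, and the coding of class H as an output of at least 0.5 is our assumption.

```python
import math
import random


def sigmoid(z):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Propagate one input vector through a fully connected 3-layer net."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


rng = random.Random(0)
N_INPUT, N_HIDDEN = 14, 4  # 14 descriptors, example hidden-layer size

# Random initial weights; a trained network would use learned values.
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_INPUT)]
            for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]
b_out = 0.0

# Hypothetical fragment counts for one molecule.
x = [1, 0, 2, 0, 0, 1, 0, 0, 3, 0, 0, 0, 1, 0]
output = forward(x, w_hidden, b_hidden, w_out, b_out)
predicted_class = "H" if output >= 0.5 else "L"  # assumed coding
print(output, predicted_class)
```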
The ANN used in this study was trained with the
back-propagation (BP) algorithm. The specific algorithm
adopted in this research has been described previously
with a simple example of application [9], and a detailed
description is given elsewhere [10]. The learning rate was
initially set to 1 and was gradually decreased until the
error function could no longer be minimized. We used the
sigmoid function as the transformation function and the
delta rule as the error correction formula [10].
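As an illustration of the delta rule with a sigmoid transfer function, the sketch below performs one weight update for a single output neuron. It is a simplified, assumed form of the update: the full BP algorithm also propagates the error term back through the hidden layer, which is not shown, and the numerical values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic transfer function."""
    return 1.0 / (1.0 + math.exp(-z))


def delta_rule_step(weights, bias, x, target, learning_rate):
    """One delta-rule update for a single sigmoid neuron."""
    out = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    # Error term: (target - output) times the sigmoid derivative.
    delta = (target - out) * out * (1.0 - out)
    new_weights = [w + learning_rate * delta * xi
                   for w, xi in zip(weights, x)]
    new_bias = bias + learning_rate * delta
    return new_weights, new_bias, out


# Hypothetical example: two inputs, target output 1, learning rate 1.
weights, bias = [0.0, 0.0], 0.0
x, target = [1.0, 2.0], 1.0

new_weights, new_bias, out_before = delta_rule_step(
    weights, bias, x, target, learning_rate=1.0)
out_after = sigmoid(
    sum(w * xi for w, xi in zip(new_weights, x)) + new_bias)
print(out_before, out_after)
```

After the update the output moves toward the target. Decreasing the learning rate over the cycles, as described in the text, makes the later updates smaller.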
The ANN calculations were performed using the Weka data
mining software [11]. Many 14-x-1 architectures (x is the
number of hidden neurons) were tested. Each architecture
was trained with four different random initial sets of
weights, with the number of cycles limited to 1,000. Among
all the ANN architectures, the best one is 14-4-1. Its
overall classification accuracy is 98.5% for the training
set (96.8% for class H and 100% for class L) and 100% for
the test set.
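The reported percentages can be cross-checked against the class sizes of the training set (31 H and 34 L). The counts of correctly classified compounds used below are inferred from the percentages and are not stated in the text, so they should be read as an assumption.

```python
def accuracy(correct, total):
    """Percentage of correctly classified compounds."""
    return 100.0 * correct / total


# Inferred counts: 30 of 31 H and 34 of 34 L correctly classified.
train_h = accuracy(30, 31)
train_l = accuracy(34, 34)
train_overall = accuracy(30 + 34, 31 + 34)

print(round(train_h, 1), round(train_l, 1), round(train_overall, 1))
```

Under this assumption a single misclassified class H compound accounts for the difference between 98.5% and 100% on the training set.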
B. Support vector machines
SVM are a supervised learning technique from the field of
machine learning, applicable to both classification and
regression. SVM were originally developed by Vapnik and
co-workers [12] and have shown promising capability for
solving a number of pharmacological and biological
classification problems [13]. A detailed description of the
theory of SVM can be found in several excellent books and
articles [12,14]. Thus, only a brief description is given
here.
For linearly separable cases, SVM performs the
classification task by constructing a hyperplane
($w \cdot x + b = 0$) in a multi-dimensional space that
separates two classes of feature vectors with a maximum
margin. This amounts to finding a vector $w$ and a
parameter $b$ that minimize $\|w\|^2$ and satisfy the
following conditions:

$$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1 \text{ (positive class)}$$

$$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1 \text{ (negative class)}$$

where $w$ is a vector normal to the hyperplane, $x_i$ is a
feature vector, $y_i$ is the class label, $|b|/\|w\|$ is
the perpendicular distance from the hyperplane to the
origin, and $\|w\|$ is the Euclidean norm of $w$. After
determining $w$ and $b$, a given vector $x$ can be
classified by using:

$$\operatorname{sign}(w \cdot x + b)$$
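The linear decision rule and the margin conditions can be illustrated with a small two-dimensional example. The points, w and b below are hypothetical and chosen by hand; no optimization is performed. The two inequalities are checked in the equivalent combined form y_i (w.x_i + b) >= 1, and an argument of exactly zero is assigned to the positive class by convention.

```python
def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def classify(w, b, x):
    """Linear SVM decision: sign(w.x + b), with 0 mapped to +1."""
    return 1 if dot(w, x) + b >= 0 else -1


def satisfies_margin(w, b, samples):
    """True if y_i (w.x_i + b) >= 1 holds for every sample."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in samples)


# Hypothetical, linearly separable 2-D data: (feature vector, label).
samples = [([2.0, 2.0], +1), ([3.0, 3.0], +1),
           ([0.0, 0.0], -1), ([-1.0, 0.0], -1)]
w, b = [0.5, 0.5], -1.0  # hand-picked separating hyperplane

print(satisfies_margin(w, b, samples))
print(classify(w, b, [2.5, 2.5]), classify(w, b, [0.5, 0.5]))
```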
For nonlinearly separable cases, SVM projects the input
feature vectors into a high-dimensional feature space using
a kernel function $K(x, x_i)$. A linear SVM is then applied
in this feature space and the decision function is given
by:

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + b\right)$$

where $n$ is the number of training vectors, sign is the
sign function, which returns +1 for a positive argument and
-1 for a negative argument, and
$K(x, x_i) = \phi(x) \cdot \phi(x_i)$ is equal to the inner
product of the images $\phi(x)$ and $\phi(x_i)$ of the two
vectors $x$ and $x_i$ in the feature space. Any function
that satisfies Mercer's condition [15] can be used as the
kernel function, and $\alpha_i$ is the Lagrangian
multiplier.
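The kernel form of the decision function can be sketched as below. The support vectors, multipliers and b are hypothetical values picked by hand rather than the result of training, and the kernel is passed in as an ordinary function. A linear kernel is used here so that the result can be checked by hand; with it the expression reduces to the linear case.

```python
def linear_kernel(u, v):
    """K(u, v) = u.v, i.e. the feature map is the identity."""
    return sum(a * b for a, b in zip(u, v))


def decision(x, support_vectors, alphas, labels, b, kernel):
    """sign(sum_i alpha_i * y_i * K(x, x_i) + b), with 0 mapped to +1."""
    s = sum(a * y * kernel(x, sv)
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return 1 if s >= 0 else -1


# Hypothetical model: two support vectors with equal multipliers.
support_vectors = [[2.0, 2.0], [0.0, 0.0]]
labels = [+1, -1]
alphas = [0.25, 0.25]
b = -1.0

print(decision([3.0, 3.0], support_vectors, alphas, labels, b,
               linear_kernel))
print(decision([0.5, 0.5], support_vectors, alphas, labels, b,
               linear_kernel))
```

Only vectors with a nonzero multiplier contribute to the sum, which is why the model can be stored as a list of support vectors.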
The performance of SVM for classification depends on the
combination of several parameters. The kernel function
should be chosen first. A number of kernel functions have
been found to provide good generalization capabilities,
including the linear, polynomial and radial basis
functions. For classification tasks, a commonly used kernel
function is the radial basis function because of its good
generalization performance and the small number of
parameters to be optimized [16]. This function is
formulated as:

$$K(u, v) = \exp\left(-\gamma \|u - v\|^2\right)$$

where $\gamma$ is the parameter of the kernel, and $u$ and
$v$ are two independent variables.
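A minimal sketch of the radial basis function kernel follows, assuming the usual form exp(-gamma * ||u - v||^2) in which gamma multiplies the squared Euclidean distance. The input vectors are hypothetical.

```python
import math


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum value 1.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))
# The value decays as the vectors move apart.
print(rbf_kernel([0.0, 0.0], [4.0, 4.0], gamma=0.03125))
```

A larger gamma makes the kernel decay faster with distance, so each training vector influences a smaller neighbourhood.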
This function is used in the present work. Once the kernel
function has been chosen, the width ($\gamma$) of the
radial basis function and the capacity parameter $C$ should
be optimized. $C$ is a regularization parameter that
controls the tradeoff between maximizing the margin and
minimizing the training error. $\gamma$, the parameter of
the kernel, controls the width of the radial basis function
and, furthermore, the generalization ability of SVM. In the
present study, the grid search feature of LIBSVM was used
to find the optimal values of the $C$ and $\gamma$
parameters. A series of $\log_2 C$ values ranging from 5 to
10 in steps of 1 and $\log_2 \gamma$ values ranging from 0
to -5 in steps of -1 was explored. The optimal values of
$C$ and $\gamma$ for the training set were found to be 512
and 0.03125, respectively. Once the parameters were
optimized, they were introduced into the LIBSVM software to
establish the SVM classification model. The resulting model
gives a total accuracy of 100% for both the training and
test sets.
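The parameter grid described above can be enumerated as follows. The sketch does not call LIBSVM: the scoring function is a toy stand-in (an assumption made only so that the example runs) for whatever validation accuracy the real grid search evaluates at each grid point.

```python
import math
from itertools import product

log2_c_values = range(5, 11)          # 5, 6, ..., 10
log2_gamma_values = range(0, -6, -1)  # 0, -1, ..., -5

# Every (C, gamma) pair on the grid.
grid = [(2.0 ** lc, 2.0 ** lg)
        for lc, lg in product(log2_c_values, log2_gamma_values)]


def grid_search(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)


def toy_score(params):
    """Toy stand-in for validation accuracy; NOT the paper's criterion."""
    c, gamma = params
    return -(abs(math.log2(c) - 9) + abs(math.log2(gamma) + 5))


best_c, best_gamma = grid_search(grid, toy_score)
print(len(grid), best_c, best_gamma)
```

The reported optimum corresponds to the grid point with log2 C = 9 and log2 gamma = -5, that is C = 512 and gamma = 0.03125. The latter is the smallest gamma value on this grid.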
C. Decision trees
The DT is one of the most popular classification
algorithms currently used in data mining and pattern
recognition. Each node in such a tree is associated with a
test on the values of an attribute. Each edge from the node
is labelled with a particular value of the attribute, and
each leaf of the tree is associated with a value of the
class. Each path in the tree corresponds to a production
rule, whose premises are the tests associated with the
nodes and whose conclusion is the class associated with the
leaf of the path. Many classification and DT generation
methods have been investigated. C4.5 is a DT induction
algorithm developed by Quinlan [17]. The C4.5 algorithm,
used in the present study, has two important parameters:
* The minimum test weight m is the minimum number of cases
that must be covered by at least two of the outcomes of
every test. A larger value of m can be useful to reduce the
size of the tree by avoiding near-useless tests. This can
increase its generalisation ability when the data is noisy.
* The pruning confidence level CDT controls the level of
pruning. Pruning is performed after tree induction and
removes branches that are likely to reduce the
generalisation ability of the tree. A smaller value of CDT
leads to more aggressive pruning.
DT was performed using the Weka data mining software. By
varying CDT = 0.01, 0.05, 0.1, ..., 0.5 and m = 2, 3, ...,
10, the optimal values of CDT and m were identified as 0.3
and 5, respectively.
The obtained DT (Fig. 1) shows that only two descriptors
(F1 and F2) are needed to partition the learning data set.
They represent the occurrences of two fragments containing
two and three carbon atoms, respectively. The rules are
defined as follows:
Rule 1: if F1 = 0 then Class L (92.8%);
Rule 2: if F1 ≠ 0 and F2 = 0 then Class H (89.6%);
Rule 3: if F1 ≠ 0 and F2 ≠ 0 then Class L (62.5%).
Figure 1. Decision tree (leaves: 92.8% L, 89.6% H, 62.5% L)
The overall classification accuracies were 87.7% (83.9%
for class H and 91.2% for class L) for the training set and
85.7% (71.4% for class H and 100% for class L) for the test
set.
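The three rules extracted from the tree translate directly into a small function. The sketch assumes that the conditions in Rules 2 and 3 mean "not equal to zero" for F1 and F2, which for occurrence counts is the same as "at least one occurrence"; the function name is ours.

```python
def classify_by_rules(f1, f2):
    """Apply the three decision-tree rules to fragment counts F1, F2."""
    if f1 == 0:
        return "L"  # Rule 1: F1 absent
    if f2 == 0:
        return "H"  # Rule 2: F1 present, F2 absent
    return "L"      # Rule 3: both fragments present


for f1, f2 in [(0, 0), (0, 3), (2, 0), (1, 1)]:
    print(f1, f2, classify_by_rules(f1, f2))
```

Because F2 is only tested when F1 is present, a molecule without the first fragment is assigned to class L whatever its F2 count.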
The obtained results are encouraging and show that the SVM
model slightly outperforms the ANN and DT models. However,
it should be noted that the DT results are obtained from
only two molecular descriptors. The main advantage of SVM
is that it adopts the structural risk minimization
principle, which has been shown to be superior to the
traditional empirical risk minimization principle. The two
descriptors selected by the DT algorithm correspond to
fragments containing only carbon atoms; DT eliminated all
fragments containing heteroatoms such as oxygen, sulfur and
nitrogen. Thus, the two selected fragments can be
considered as structural features of a molecule that are
responsible for its anti-HIV activity.
Using SVM, ANN or DT, chemists can select highly active
molecules from a pool of molecular compounds. The reduction
of molecular descriptors obtained by the DT can guide the
chemist in the synthesis of new biologically active
compounds.
SVM, ANN and DT were applied to classify 79 compounds
using substructural fragments calculated from molecular
structure. The classification results are satisfactory. The
established classification models can be used in biological
screening processes and in the prediction of the anti-HIV
activities (or other molecular properties) of untested
molecules. The DT model proposes fragments that can
differentiate between the high and low anti-HIV classes.
They can also be used as filters in high throughput
screening.

References
[1] C. Shekhar, “In Silico Pharmacology: Computer-Aided Methods
Could Transform Drug Development,” Chem. Biol., vol. 15, pp.
413–414, 2008.
[2] T. Scior, J. L. Medina-Franco, Q.-T. Do, K. Martínez-Mayorga, J.
A. Yunes Rojas and P. Bernard, “How to Recognize and
Workaround Pitfalls in QSAR Studies: A Critical Review,” Curr.
Med. Chem., vol. 16, pp. 4297–4313, 2009.
[3] H. Tanaka, M. Baba, H. Hayakawa, T. Sakamaki, T. Miyasaka, M.
Ubasawa, H. Takashima, K. Sekiya, I. Nitta and S. Shigeta, “A
new class of HIV-1-specific 6-substituted acyclouridine
derivatives: synthesis and anti-HIV-1 activity of 5- or 6-substituted
analogues of 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine
(HEPT),” J. Med. Chem., vol. 34, pp. 349-357, 1991.
[4] H. Tanaka, M. Baba, H. Hayakawa, T. Sakamaki, T. Miyasaka, M.
Ubasawa, H. Takashima, K. Sekiya, I. Nitta, S. Shigeta, R. T.
Walker, J. Balzarini and E. De Clercq, “Synthesis and anti-HIV
activity of 2-, 3-, and 4-substituted analogues of
1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT),” J.
Med. Chem., vol. 34, pp. 1394-1399, 1991.
[5] H. Tanaka, H. Takashima, M. Ubasawa, K. Sekiya, I. Nitta, M.
Baba, S. Shigeta, R. T. Walker, E. De Clercq and T. Miyasaka,
“Structure-activity relationships of
1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine analogues:
effect of substitutions at the C-6 phenyl ring and at the C-5
position on anti-HIV-1 activity,” J. Med. Chem., vol. 35, pp.
337-345, 1992.
[6] H. Tanaka, H. Takashima, M. Ubasawa, K. Sekiya, I. Nitta, M.
Baba, S. Shigeta, R. T. Walker, E. De Clercq and T. Miyasaka,
“Synthesis and antiviral activity of deoxy analogs of
1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) as
potent and selective anti-HIV-1 agents,” J. Med. Chem., vol. 35,
pp. 4713-4719, 1992.
[7] ISIDA (In Silico Design and Data Analysis) informational system
[8] J. Zupan, J. Gasteiger, Neural Networks for Chemists: An
Introduction, Weinheim: Wiley-VCH, 1993.
[9] D. Cherqaoui and D. Villemin, “Use of a neural network to
determine boiling point of alkanes,” J. Chem. Soc. Faraday. Trans.,
vol. 90, pp. 97–102, 1994.
[10] J. A. Freeman and D. M. Skapura, Neural Networks: Algorithms,
Applications, and Programming Techniques, Reading:
Addison-Wesley Publishing Company, 1991.
[11] http://www.cs.waikato.ac.nz/ml/weka/.
[12] V. N. Vapnik, The Nature of Statistical Learning Theory, Berlin:
Springer, 1995.
[13] X. G. Yang, D. Chen, M. Wang, Y. Xue and Y. Z. Chen, “Prediction
of antibacterial compounds by machine learning approaches,” J.
Comput. Chem., vol. 30, pp. 1202–1211, 2009.
[14] C. J. C. Burges, “A tutorial on support vector machines for pattern
recognition,” Data Min. Knowl. Disc., vol. 2, pp. 127–167, 1998.
[15] J. Mercer, “Functions of positive and negative type and their
connection with the theory of integral equations,” Philos. Trans.
Roy. Soc. London Ser. A, vol. 209, pp. 415–446, 1909.
[16] N. Cristianini and J. Shawe-Taylor, An Introduction to Support
Vector Machines, Cambridge: Cambridge University Press, 2000.
[17] J. R. Quinlan, “Induction of decision trees,” Machine Learning,
vol. 1, pp. 81-106, 1986.