Hierarchical Multi-Label Classification for
Protein Function Prediction: A Local Approach
based on Neural Networks
Ricardo Cerri, Rodrigo C. Barros, and André C. P. L. F. de Carvalho
Department of Computer Science, ICMC
University of São Paulo (USP)
São Carlos - SP, Brazil
{cerri,rcbarros,andre}@icmc.usp.br
Abstract—In Hierarchical Multi-Label Classification problems, unlike in conventional classification, each instance can be classified into two or more classes simultaneously. Additionally, the classes are structured in a hierarchy, in the form of either a tree or a directed acyclic graph. Hence, an instance can be assigned to two or more paths of the hierarchical structure, resulting in a complex classification problem with possibly hundreds of classes.
Many methods have been proposed to deal with such
problems, some of them employing a single classifier to
deal with all classes simultaneously (global methods), and
others employing many classifiers to decompose the original
problem into a set of subproblems (local methods). In this
work, we propose a novel local method named HMC-LMLP,
which uses one Multi-Layer Perceptron per hierarchical
level. The predictions in one level are used as inputs to the
network responsible for the predictions in the next level. We
make use of two distinct Multi-Layer Perceptron algorithms:
Back-propagation and Resilient Back-propagation. In
addition, we make use of an error measure specially
tailored to multi-label problems for training the networks.
Our method is compared to state-of-the-art hierarchical multi-label classification algorithms on protein function prediction datasets. The experimental results show that
our approach presents competitive predictive accuracy,
suggesting that artificial neural networks constitute a
promising alternative to deal with hierarchical multi-label
classification of biological data.
Index Terms—Machine learning; neural networks;
hierarchical multi-label classification; protein function
prediction
I. INTRODUCTION
In typical classification problems, a classifier assigns
a given instance to just one class, and the classes
involved in the problem are not hierarchically structured.
However, in many real-world classification problems
(e.g., classification of biological data), one or more classes
can be divided into subclasses or grouped into superclasses.
In these cases, the classes form a hierarchical structure,
usually in the form of a tree or of a Directed Acyclic
Graph (DAG). These problems are known in the Machine
Learning (ML) literature as hierarchical classification
problems, in which new instances are assigned to classes
associated with nodes belonging to a hierarchy [1].
Two main approaches have been used to deal with
hierarchical problems: the local (top-down) and global
(one-shot, big-bang) approaches. In the local approach,
conventional classification algorithms are trained to produce a tree of classifiers, which are in turn used
in a top-down fashion for the classification of new
instances. Initially, the most generic class (located at the
first hierarchical level) is predicted, and then it is used
to reduce the set of possible classes for the next level. A
disadvantage of this approach is that, as the hierarchy
is traversed toward the leaves, classification errors are
propagated to the deeper levels, unless some procedure
is adopted to avoid this problem.
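To make the top-down strategy concrete, the following Python sketch (with hypothetical per-level classifier functions and a children map standing in for the class hierarchy) illustrates how each level's prediction restricts the candidate classes of the next level:

```python
def top_down_classify(instance, level_classifiers, children):
    """Classify an instance level by level. `level_classifiers` holds one
    classifier per hierarchical level; `children` maps a class to its
    subclasses (the key None holds the first-level classes)."""
    predictions = []
    candidates = children[None]              # most generic classes (level 1)
    for classify in level_classifiers:       # one local classifier per level
        predicted = classify(instance, candidates)
        if predicted is None:                # allow stopping at an internal node
            break
        predictions.append(predicted)
        candidates = children.get(predicted, [])  # only subclasses remain
        if not candidates:                   # reached a leaf class
            break
    return predictions
```

Note that a misclassification at one level removes the correct subtree from the candidate set, which is exactly the error-propagation issue mentioned above.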
Hierarchical problems can be structured in a more
complex manner. For example, there are problems in
which the classes are not only structured in a hierarchy,
but an instance can be assigned to more than one
class in the same hierarchical level. These problems are
known as Hierarchical Multi-Label Classification (HMC)
problems, and are very common in tasks of protein and
gene function prediction [2]–[9]. In HMC problems, an
instance can be assigned to two or more paths in a class
hierarchy. Given a space of instances X, the objective of
the training process is to find a function which maps
each instance xi to a set of classes, respecting the
constraints of the hierarchical structure, and optimizing
some quality criterion. An example of an HMC problem
structured as a tree is depicted in Figure 1, in which
an instance is assigned to three paths of the hierarchy,
formed by the classes 11.02.03.01, 11.02.03.04, 11.06.01
and all their superclasses.
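As a small worked example (hypothetical helper; FunCat-style codes encode their own ancestry), the consistent label set for the three paths above can be obtained by closing the leaf classes under their superclasses:

```python
def ancestors(code):
    """Superclasses of a FunCat-style code: '11.02.03.01' -> {'11', '11.02', '11.02.03'}."""
    parts = code.split(".")
    return {".".join(parts[:i]) for i in range(1, len(parts))}

leaves = {"11.02.03.01", "11.02.03.04", "11.06.01"}
label_set = set(leaves)
for leaf in leaves:
    label_set |= ancestors(leaf)          # add every superclass of every path
print(sorted(label_set))
# ['11', '11.02', '11.02.03', '11.02.03.01', '11.02.03.04', '11.06', '11.06.01']
```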
In this paper, we propose a novel method named
HMC-LMLP (Hierarchical Multi-Label Classification
with Local Multi-Layer Perceptron). It is a local HMC
method in which a neural network is associated with each hierarchical level and is responsible for the predictions at that level. The predictions of a level are then used as inputs for the neural network associated with the next level.
We investigate the use of Multi-Layer Perceptrons (MLPs) trained with both the Back-propagation algorithm [10] and the Resilient back-propagation algorithm [11]. We train the MLPs with an additional error measure proposed specifically for multi-label problems [12].

Fig. 1. Example of a tree-structured class hierarchy, in which an instance is assigned to three paths formed by the leaf classes 11.02.03.01, 11.02.03.04, and 11.06.01 and all their superclasses.
This paper is organized as follows. In Section II, we
briefly review some works related to our approach. Our
novel local method for HMC which employs artificial
neural networks is described in Section III. We detail
the experimental methodology in Section IV, and we
present the experimental analysis in Section V, in which
our method is compared with state-of-the-art decision
trees for HMC problems on protein function prediction datasets. Finally, we summarize our conclusions and
point to future research steps in Section VI.
II. RELATED WORK
Many methods have been proposed to deal with HMC problems. This section presents some of these
works, organized according to the taxonomy presented
in [13], which describes each algorithm as a 4-tuple
<∆, Ξ, Ω, Θ>, where: ∆ indicates whether the algorithm is
hierarchical single-label (SPP - Single Path Prediction) or
hierarchical multi-label (MPP - Multiple Path Prediction);
Ξ indicates the prediction depth of the algorithm - MLNP (Mandatory Leaf-Node Prediction) or NMLNP
(Non-Mandatory Leaf-Node Prediction); Ω indicates the
hierarchy structure the algorithm can handle - T (Tree
structure) or D (DAG structure); and Θ indicates the
categorization of the algorithm under the proposed
taxonomy - LCN (Local Classifier per Node), LCL (Local
Classifier per Level), LCPN (Local Classifier per Parent
Node) or GC (Global Classifier).
In [14], a method based on the LCN approach is proposed. It uses a hierarchy of SVM classifiers, trained for each class separately, whose predictions are combined using a Bayesian network model [15]. The method is applied to gene function prediction using the Gene Ontology (GO) hierarchy [16], and is categorized as <MPP, NMLNP, D, LCN>.
In [17], the authors propose an ensemble of classifiers, extending the method proposed in [14]. The ensemble is based on three different methods: (i) the training of a single SVM for each GO node; (ii) the combination of the SVMs using Bayesian networks to correct the predictions according to the GO hierarchical relationships; and (iii) the induction of a Naive Bayes classifier [18] for each GO term to combine the results provided by the independent SVM classifiers. This method is categorized as <MPP, NMLNP, D, LCN>.
An ensemble of classifiers based on the LCN approach is proposed in [19]. The method was applied to gene datasets annotated according to the FunCat scheme developed by MIPS [20]. Each classifier is trained to become specialized in the classification of a single class, by estimating the local probability p̂i(x) that a given instance x is assigned to a class ci. The ensemble phase estimates the consensual global probability pi(x). This method is categorized as <MPP, NMLNP, T, LCN>.
Artificial neural networks are used as base classifiers in a method named HMC-Label-Powerset (HMC-LP) [8]. At each hierarchical level, HMC-LP combines the classes assigned to an instance into a new, unique class, transforming the original HMC problem into a hierarchical single-label problem. This approach is categorized as <MPP, MLNP, T, LCPN>.
In [5], three methods based on the concept of Predictive Clustering Trees (PCT) are compared over functional genomics datasets. The authors make use of the Clus-HMC [21] method, which induces a single decision tree for the whole hierarchy, and of two other methods named Clus-HSC and Clus-SC. The Clus-SC method trains a decision tree for each class, ignoring the hierarchical relationships, whereas the Clus-HSC method exploits the hierarchical relationships between the classes to induce decision trees for each hierarchical node. Clus-HMC is categorized as <MPP, NMLNP, D, GC>, whilst Clus-HSC and Clus-SC are categorized as <MPP, NMLNP, D, LCN>.
Another global method, named HMC4.5, was proposed in [2]. It is based on the C4.5 algorithm [22] and was applied to the prediction of gene functions. The authors modified the entropy formula of the original C4.5 algorithm, using the sum of the entropies of all classes together with information about the hierarchy. The entropy is used to decide the best data split in the decision tree, i.e., the best attribute to be placed in a node of the tree. The method is categorized as <MPP, NMLNP, T, GC>.
The work of Otero et al. [7] extends a global method named hAnt-Miner [23], a swarm-intelligence-based technique originally proposed for hierarchical single-label classification. The original method discovers classification rules using two ant colonies, one for the antecedents and one for the consequents of the rules. A rule is constructed by pairing ants responsible for constructing its antecedent and consequent, respectively. The method is categorized as <MPP, NMLNP, D, GC>.
III. HMC-LMLP
The HMC-LMLP (Hierarchical Multi-Label Classification with Local Multi-Layer Perceptron) method (<MPP, NMLNP, D, LCL>) incrementally trains an MLP neural network for each hierarchical level.
First, an MLP is trained for the first hierarchical level.
This network consists of an input layer, a hidden layer
and an output layer. After the end of the training
process, two new layers are added to the first MLP for
the training in the second hierarchical level. Thus, the
outputs of the network responsible for the prediction
in the first level are given as inputs to the hidden
layer of the network responsible for the predictions
in the second level. This procedure is repeated for
all hierarchical levels. Recall that each output layer
has as many neurons as the number of classes in the
corresponding level. In other words, each neuron is
responsible for the prediction of one class, according to
its activation state.
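A minimal NumPy sketch of this forward pass (illustrative layer sizes and random initialization; bias terms and the training loop are omitted, and the class counts are hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

def make_level(n_in, n_classes):
    """One level of HMC-LMLP: hidden layer with 50% of the input units
    (the sizing used in Section IV) and one output neuron per class."""
    n_hidden = max(1, n_in // 2)
    return {"W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
            "W2": rng.normal(0.0, 0.1, (n_hidden, n_classes))}

def forward(levels, x):
    """The attribute vector feeds level 1; every deeper level receives the
    previous level's class outputs as its input."""
    outputs, inputs = [], x
    for level in levels:
        hidden = sigmoid(inputs @ level["W1"])
        out = sigmoid(hidden @ level["W2"])   # logistic outputs in [0, 1]
        outputs.append(out)
        inputs = out                          # predictions feed the next level
    return outputs

# Example: 77 attributes (as in Cellcycle), hypothetical 18 first-level
# and 80 second-level classes.
levels = [make_level(77, 18), make_level(18, 80)]
per_level_outputs = forward(levels, rng.random(77))
```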
Figure 2 presents an illustration of the neural network
architecture of HMC-LMLP for a two-level hierarchy. As can be seen, the network is fully connected. When the network associated with a specific hierarchical level is being trained, the synaptic weights of the networks associated with the previous levels are not adjusted, because their adjustment has already occurred in the
earlier training phases. In the test phase, to classify
an instance, a neuron activation threshold is applied
to each output layer corresponding to a hierarchical
level. The output neurons with values higher than
the given threshold are activated, indicating that their
corresponding classes are being predicted.
It is expected that different threshold values result
in different classes being predicted. As the activation
function used is the logistic sigmoid function, the
outputs of the neurons range from 0 to 1. The higher the
threshold value used, the lower the number of predicted
classes. Conversely, the lower the threshold value used,
the larger the number of predicted classes.
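This thresholding step amounts to one comparison per output neuron; a short sketch (the threshold value here is arbitrary):

```python
import numpy as np

def predict_classes(per_level_outputs, threshold=0.5):
    """For each level, return the indices of output neurons whose activation
    exceeds the threshold; lower thresholds activate more classes."""
    return [np.flatnonzero(out > threshold) for out in per_level_outputs]
```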
After the final predictions for new instances are
provided by HMC-LMLP, a post-processing phase is
used to correct inconsistencies which may have occurred
during the classification, i.e., when a subclass is
predicted but its superclass is not. These inconsistencies
may occur because every MLP is trained using the same
set of instances. In other words, the instances used for
training at a given level were not filtered according
to the classes predicted at the previous level. The
post-processing phase guarantees that only consistent predictions are made, by removing those predicted classes which do not have predicted superclasses.

Fig. 2. Architecture of HMC-LMLP for a two-level hierarchy: the input instance x (attributes X1, ..., XN) feeds the hidden neurons of the first level, whose class outputs for the first level feed the hidden neurons of the second level, which in turn produce the second-level class outputs.
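A minimal sketch of such a post-processing pass (hypothetical class-code representation, with a parent map standing in for the class hierarchy):

```python
def enforce_consistency(predicted, parent):
    """Keep only predicted classes whose superclass was also predicted.
    `parent` maps each class to its superclass (None for level-1 classes)."""
    consistent = set()
    # Visit classes shallowest-first so parents are decided before children.
    for c in sorted(predicted, key=lambda code: code.count(".")):
        if parent[c] is None or parent[c] in consistent:
            consistent.add(c)
    return consistent

parent = {"11": None, "11.02": "11", "11.02.03": "11.02",
          "11.02.03.01": "11.02.03"}
print(enforce_consistency({"11", "11.02.03.01"}, parent))
# {'11'}: the subclass is dropped because 11.02 and 11.02.03 were not predicted
```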
Any training algorithm can be used to induce the base
neural networks in HMC-LMLP. In this work, we make
use of both the conventional Back-propagation [10] and
the Resilient back-propagation [11] algorithms. The latter
tries to eliminate the influence of the size of the partial
derivative by considering only the sign of the derivative
to indicate the direction of the weight update.
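The sign-based rule can be sketched as follows (a simplified Rprop variant without weight back-tracking, using the parameter values listed in Section IV; the full algorithm in [11] includes further details):

```python
import numpy as np

def rprop_step(w, grad, prev_grad, delta,
               eta_plus=1.2, eta_minus=0.5, d_min=1e-6, d_max=50.0):
    """One Rprop update: per-weight step sizes `delta` grow while the
    gradient keeps its sign and shrink when it flips; only the sign of
    the derivative determines the update direction."""
    sign_change = grad * prev_grad
    delta = np.where(sign_change > 0, np.minimum(delta * eta_plus, d_max), delta)
    delta = np.where(sign_change < 0, np.maximum(delta * eta_minus, d_min), delta)
    grad = np.where(sign_change < 0, 0.0, grad)   # skip the update after a flip
    return w - np.sign(grad) * delta, grad, delta
```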
Additionally, we investigate the performance of
the neural networks by evaluating them during the
training process through two distinct error measures:
the conventional Back-propagation error of a neuron
(desired output − obtained output) and an error measure
tailored for the training of neural networks in multi-label
problems, proposed by Zhang and Zhou [12], given by:
E = \sum_{i=1}^{N} \frac{1}{|C_i||\hat{C}_i|} \sum_{(l,m) \in C_i \times \hat{C}_i} \exp(o^i_m - o^i_l)    (1)

where N is the number of instances, C_i is the set of positive classes of instance x_i, \hat{C}_i is its complement, and o_k is the output of the k-th neuron, which corresponds to a class c_k. The error e_j of neuron j is defined as:

e_j = \begin{cases} \dfrac{1}{|C_i||\hat{C}_i|} \sum_{m \in \hat{C}_i} \exp(o_m - o_j) & \text{if } c_j \in C_i \\[1ex] -\dfrac{1}{|C_i||\hat{C}_i|} \sum_{l \in C_i} \exp(o_j - o_l) & \text{if } c_j \in \hat{C}_i \end{cases}    (2)
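For concreteness, a direct per-instance sketch of Equations 1 and 2 (class sets given as index lists into the output vector o):

```python
import numpy as np

def multilabel_error(o, positive):
    """Equation 1 for one instance: `o` is the output vector and `positive`
    the indices of the instance's classes; the complement is everything else."""
    neg = [k for k in range(len(o)) if k not in positive]
    total = sum(np.exp(o[m] - o[l]) for l in positive for m in neg)
    return total / (len(positive) * len(neg))

def neuron_error(o, positive, j):
    """Equation 2: the error term back-propagated for output neuron j."""
    neg = [k for k in range(len(o)) if k not in positive]
    norm = len(positive) * len(neg)
    if j in positive:
        return sum(np.exp(o[m] - o[j]) for m in neg) / norm
    return -sum(np.exp(o[j] - o[l]) for l in positive) / norm
```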
This multi-label error measure considers the correlation between the different class labels of a given instance. The main feature of Equation 1 is that it focuses on the difference between the MLP outputs on class labels belonging and not belonging to a given instance, i.e., it captures the nuances of the multi-label problem at hand.
Our novel method is motivated by the fact that
neural networks can be naturally considered multi-label
classifiers, as their output layers can predict two or more
classes simultaneously. This makes it possible to use just one classifier per hierarchical level. Most existing methods either employ a single classifier to distinguish among all classes, using complex internal mechanisms, or decompose the original problem into many subproblems through the use of many classifiers per level, losing important information in the process [13].
IV. EXPERIMENTAL METHODOLOGY
Four freely available datasets regarding protein functions of the Saccharomyces cerevisiae organism are used in our experiments (http://www.cs.kuleuven.be/~dtai/clus/hmcdatasets.html). The datasets comprise different types of bioinformatics data, such as phenotype data and gene expression levels. They are organized in a tree structure according to the FunCat classification scheme. Table I shows the main characteristics of the training, validation, and testing datasets used.

TABLE I
Number of attributes (|A|), number of classes (|C|), total number of instances (Total), and number of multi-label instances (Multi-label) of the four datasets used during experimentation.

Dataset   | |A| | |C| | Training Total | Training Multi-label | Valid Total | Valid Multi-label | Testing Total | Testing Multi-label
Cellcycle |  77 | 499 | 1628           | 1323                 | 848         | 673               | 1281          | 1059
Church    |  27 | 499 | 1630           | 1322                 | 844         | 670               | 1281          | 1057
Derisi    |  63 | 499 | 1608           | 1309                 | 842         | 671               | 1275          | 1055
Eisen     |  79 | 461 | 1058           |  900                 | 529         | 441               |  837          |  719
The performance of our method is compared with
three state-of-the-art decision tree algorithms for HMC
problems introduced in [5]: Clus-HMC, a global method
that induces a single decision tree for the whole set of
classes; Clus-HSC, a local method which explores the
hierarchical relationships to build a decision tree for
each hierarchical node; and Clus-SC, a local method
which builds a binary decision tree for each class of the
hierarchy. These methods are based on the concept of
Predictive Clustering Trees (PCT) [24].
We base our evaluation analysis on Precision-Recall
curves (PR curves), which reflect the precision of
a classifier as a function of its recall, and give an
informative description of the performance of each
method when dealing with highly skewed datasets [25]
(recall that this is the case in HMC problems). The
hierarchical precision (hP ) and hierarchical recall (hR)
measures (Equations 3 and 4) used to construct the PR
curves assume that an instance belongs not only to a
class, but also to all ancestor classes of this class [26].
Therefore, given an instance (x_i, C'_i), with x_i belonging to the space X of instances, C'_i the set of its predicted classes, and C_i the set of its real classes, C'_i and C_i can be extended to contain their corresponding ancestor classes: \hat{C}'_i = \bigcup_{c_k \in C'_i} Ancestors(c_k) and \hat{C}_i = \bigcup_{c_l \in C_i} Ancestors(c_l), where Ancestors(c_k) is the set of ancestors of class c_k.

hP = \frac{\sum_i |\hat{C}_i \cap \hat{C}'_i|}{\sum_i |\hat{C}'_i|}    (3)

hR = \frac{\sum_i |\hat{C}_i \cap \hat{C}'_i|}{\sum_i |\hat{C}_i|}    (4)
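Both measures reduce to simple set operations once each class set is closed under the ancestor relation; a minimal sketch (reusing an ancestors helper such as the one sketched in Section I):

```python
def extend(classes, ancestors):
    """Close a class set under the ancestor relation."""
    extended = set(classes)
    for c in classes:
        extended |= ancestors(c)
    return extended

def hierarchical_pr(true_sets, pred_sets, ancestors):
    """Micro-averaged hierarchical precision and recall (Equations 3 and 4)."""
    inter = pred_total = true_total = 0
    for true, pred in zip(true_sets, pred_sets):
        t, p = extend(true, ancestors), extend(pred, ancestors)
        inter += len(t & p)
        pred_total += len(p)
        true_total += len(t)
    hp = inter / pred_total if pred_total else 0.0
    hr = inter / true_total if true_total else 0.0
    return hp, hr
```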
A PR curve is obtained by varying a threshold value that is applied to the methods' outputs, generating different hP and hR values. The outputs of
the methods are represented by vectors of real values,
where each value denotes the pertinence degree of a
given instance to a given class. For each threshold, a
point in the P R curve is obtained, and final curves are
then plotted by the interpolation of these points [25]. The
areas under these curves (AU(PRC)) are approximated
by summing the trapezoidal areas between each point.
These areas are then used to compare the performances
of the methods, where the higher the AU(PRC) of a
method, the better its predictive performance.
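A sketch of this evaluation loop (reusing hierarchical_pr from the sketch above; a plain trapezoidal sum over threshold-swept points, leaving out the interpolation refinements of [25]):

```python
import numpy as np

def au_prc(scores_per_instance, true_sets, ancestors):
    """Sweep thresholds over per-class pertinence scores, collect (hR, hP)
    points, and approximate the area under the PR curve with trapezoids."""
    points = []
    for t in np.linspace(0.0, 1.0, 101):
        pred_sets = [{c for c, s in scores.items() if s > t}
                     for scores in scores_per_instance]
        hp, hr = hierarchical_pr(true_sets, pred_sets, ancestors)
        points.append((hr, hp))
    points.sort()                          # order points by recall
    recalls = [r for r, _ in points]
    precisions = [p for _, p in points]
    return float(np.trapz(precisions, recalls))
```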
To verify the statistical significance of the results, we
have employed the well-known Friedman and Nemenyi
tests, recommended for comparisons involving distinct
datasets and several classifiers [27]. As in [5], 2/3 of
each dataset were used for training and validation of
the algorithms, and 1/3 for testing.
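For reference, the Friedman test is available in SciPy, and the Nemenyi critical difference can be computed from the formula in Demšar [27] (placeholder scores below, for illustration only; the q value is the 0.05 critical value for three classifiers):

```python
import math
from scipy.stats import friedmanchisquare

# One list of AU(PRC) values per method, aligned across the datasets
# (placeholder numbers for illustration only).
method_scores = [
    [0.14, 0.14, 0.14, 0.17],
    [0.17, 0.17, 0.17, 0.20],
    [0.11, 0.13, 0.09, 0.13],
]
stat, p_value = friedmanchisquare(*method_scores)

# Nemenyi: two methods differ significantly when their average ranks
# differ by more than CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
k, n = len(method_scores), len(method_scores[0])
q_005 = 2.343     # critical value for k = 3 classifiers at alpha = 0.05
cd = q_005 * math.sqrt(k * (k + 1) / (6 * n))
```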
Our method is executed with the number of neurons in each hidden layer equal to 50% of the number of neurons in the corresponding input layer. As each MLP
is comprised of three layers (input, hidden and output),
the learning rate values used in the Back-propagation
algorithm are 0.2 and 0.1 in the hidden and output
layers, respectively. In the same manner, the momentum
constant values 0.1 and 0.05 are used in each of these
layers. These values were chosen based on preliminary
non-exhaustive experiments. In the Rprop algorithm, the parameter values used were those suggested in [11]: initial step size ∆0 = 0.1, maximum step size ∆max = 50.0, minimum step size ∆min = 1e−6, increase factor η+ = 1.2, and decrease factor η− = 0.5. No attempt was made to tune these parameter values.
Fig. 3. Examples of PR curves obtained by the methods (Recall on the x-axis, Precision on the y-axis) on the (a) Cellcycle, (b) Church, (c) Derisi, and (d) Eisen datasets, comparing Clus-HMC, Clus-HSC, Clus-SC, Bp-CE, Bp-ZZE, Rprop-CE, and Rprop-ZZE.

The training process lasts a period
of 1000 epochs, and every 10 epochs PR curves are calculated over the validation dataset. The model that achieves the best performance on the validation dataset is then evaluated on the test set. For each dataset, the algorithm is executed 10 times, each time initializing the synaptic weights randomly. The averages of the AU(PRC) values obtained in the individual executions are then calculated. The Clus-HMC, Clus-HSC, and Clus-SC methods are each executed once, with their default configurations, as described in [5].
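The validation-driven model selection described above can be sketched as follows (train_for and validation_auprc are hypothetical stand-ins for the training and evaluation routines):

```python
import copy

def select_best_model(model, train_for, validation_auprc,
                      max_epochs=1000, check_every=10):
    """Train for up to 1000 epochs, score the model on the validation set
    every 10 epochs, and keep the snapshot with the best AU(PRC)."""
    best_score, best_epoch, best_state = -1.0, 0, None
    for epoch in range(check_every, max_epochs + 1, check_every):
        train_for(model, check_every)             # advance training 10 epochs
        score = validation_auprc(model)
        if score > best_score:
            best_score, best_epoch = score, epoch
            best_state = copy.deepcopy(model)     # snapshot of the best weights
    return best_state, best_epoch, best_score
```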
V. EXPERIMENTAL ANALYSIS
Figure 3 shows examples of PR curves resulting from the experiments with HMC-LMLP and the three state-of-the-art methods. Table II presents their respective AU(PRC) values, together with the number of times each method was within the top-three best AU(PRC) values (bottom part of the table). Table II also shows, for HMC-LMLP, the standard deviation and the number of training epochs needed for the networks to provide their results. Recall that the neural networks are executed several times with randomly initialized weights, and that the PR curves depicted in Figure 3 were obtained from a single execution of the method. Hence, they are shown for illustration purposes and do not represent the average AU(PRC) values shown in Table II.
In the curves of Figure 3 and in the AU(PRC) values in Table II, four variations of HMC-LMLP are represented: conventional Back-propagation and Resilient back-propagation with the conventional error (Bp-CE and Rprop-CE), and conventional Back-propagation and Resilient back-propagation with the multi-label error measure proposed by Zhang and Zhou [12] (Bp-ZZE and Rprop-ZZE).
According to Table II, the best results in all datasets were obtained by Clus-HMC, followed by Bp-CE, which obtained the second rank position three times, and Rprop-CE, which was three times within the top-three best AU(PRC) values, according to the ranking provided by the Friedman test. Notwithstanding, the pairwise Nemenyi test identified statistically significant differences only between Clus-HMC and Bp-ZZE, and between Clus-HMC and Rprop-ZZE. For the remaining pairwise comparisons, no statistically significant differences were detected, which means the methods presented similar behavior regarding the AU(PRC) analysis.
Table II also shows that the performance obtained by HMC-LMLP, especially when using the conventional error measure, was competitive with that of the PCT-based methods (Clus-HMC, Clus-HSC and Clus-SC). This is quite motivating, since traditional MLPs trained with back-propagation were employed without any attempt to tune their parameter values. In other words, HMC-LMLP performed quite consistently even though it has a good margin for improvement. According to the experiments, competitive performance could be obtained with few training epochs and relatively few hidden neurons (50% of the corresponding number of input units).
TABLE II
AU(PRC) on the four datasets (average ± s.d.; number of training epochs in parentheses).

Dataset   | Bp-CE             | Rprop-CE          | Bp-ZZE            | Rprop-ZZE           | Clus-HMC | Clus-HSC | Clus-SC
Cellcycle | 0.14 ± 0.009 (20) | 0.13 ± 0.012 (30) | 0.08 ± 0.005 (10) | 0.07 ± 0.008 (990)  | 0.17     | 0.11     | 0.11
Church    | 0.14 ± 0.002 (10) | 0.13 ± 0.010 (40) | 0.07 ± 0.008 (10) | 0.07 ± 0.004 (1000) | 0.17     | 0.13     | 0.13
Derisi    | 0.14 ± 0.010 (30) | 0.14 ± 0.005 (30) | 0.08 ± 0.008 (10) | 0.07 ± 0.004 (980)  | 0.17     | 0.09     | 0.09
Eisen     | 0.17 ± 0.007 (60) | 0.15 ± 0.014 (70) | 0.09 ± 0.006 (10) | 0.09 ± 0.004 (1000) | 0.20     | 0.13     | 0.13
#Rank 1   | 0                 | 0                 | 0                 | 0                   | 4        | 0        | 0
#Rank 2   | 3                 | 1                 | 0                 | 0                   | 0        | 0        | 0
#Rank 3   | 1                 | 2                 | 0                 | 0                   | 0        | 1        | 0

The HMC-LMLP method achieved unsatisfying results when employed with the multi-label error measure. This could be explained by the fact that the number of predicted classes is much higher than when using
the conventional error measure. This assumption is confirmed by the behavior of the PR curves of Bp-ZZE and Rprop-ZZE (Figure 3). In these curves, the precision values always remain between 0.0 and 0.2 as the recall values vary. In the curves resulting from the other methods, the precision values tend to increase as the recall values decrease. Usually, lower precision values are indicative of predictions at deeper levels of the hierarchy (more predictions), while higher precision values are indicative of predictions at shallower levels (fewer predictions).
The unsatisfying results obtained with the multi-label measure were somewhat unexpected, especially because satisfactory results were achieved when it was first used with non-hierarchical multi-label data [12]. Nevertheless, for the datasets tested here, its use seems to have a harmful effect, possibly due to the much more difficult nature of the classification problem considered (hundreds of classes to be predicted).
It is also interesting to note that, unlike the other HMC-LMLP variations, the Resilient Back-propagation algorithm with the Zhang and Zhou multi-label error measure achieved its best performance only after 980 to 1000 training epochs. Originally, the
multi-label error measure was meant to be used with
the conventional back-propagation algorithm, with an
online training mode. The Resilient Back-propagation
algorithm, however, works in batch mode, which may
have influenced the way the error measure captures
the characteristics of multi-label learning. On the other
hand, the fact that the AU (P RC) values obtained by
Rprop-ZZE keep increasing with the training process
execution may indicate that this variation is more robust
to local optima, and further experiments using more
training epochs may lead to better results.
A deeper analysis of the HMC-LMLP predictions shows its tendency to predict more classes at the first hierarchical levels. This behavior is a
consequence of the top-down local strategy employed
by the method, which first classifies instances according
to the classes located in the first hierarchical level,
and then tries to predict their subclasses. Also, as the
hierarchy becomes deeper, the datasets become sparser
(very few positive instances), making the classification
task quite difficult. Nevertheless, classes located at
deeper levels of the hierarchy can be predicted by using proper threshold values. Usually, lower threshold values increase recall, reflecting predictions at deeper levels of the hierarchy, whereas higher threshold values increase precision, reflecting predictions at shallower levels.
HMC-LMLP is not without disadvantages. In comparison to the PCT-based methods, HMC-LMLP does not produce classification rules. It works as a black-box model, which may be undesirable in domains in which the specialist is interested in understanding the reasons for each prediction. Notwithstanding, the investigation of traditional MLPs applied to HMC problems seems to be a very promising field, because an MLP network can be naturally considered a multi-label classifier, as its output layer can simultaneously predict more than one class. In addition, neural networks are regarded as robust classifiers able to find approximate solutions for very complex problems, which is clearly the case of HMC.
VI. CONCLUSIONS AND FUTURE WORK
This work presented a novel local method for HMC problems that uses Multi-Layer Perceptrons as base classifiers. The proposed method, named HMC-LMLP, trains a separate MLP neural network for each hierarchical level. The outputs of the network responsible for the predictions at a given level are used as inputs to the network associated with the next level, and so forth. Two algorithms were employed for training the base MLPs: Back-propagation [10] and Resilient back-propagation [11]. In addition, an error measure proposed for multi-label problems [12] was investigated. Experimental results suggested that HMC-LMLP achieves competitive predictive performance when compared to state-of-the-art decision trees for HMC problems [5]. These results are quite encouraging, especially considering that we employed conventional MLPs with no specific design modifications to deal with multi-label problems, and that we made no attempt to tune the MLP parameter values. The PCT-based methods, conversely, have been investigated and tuned for more than a decade [4], [24], [28].
For future research, we plan to investigate the use of other neural network approaches, such as Radial Basis Function networks [29], as base classifiers in our method. Moreover, we plan to test our approach in other domains, such as hierarchical multi-label text categorization [30], [31]. Hierarchies structured as DAGs will also be investigated, requiring modifications in the evaluation of the results provided by the method.
ACKNOWLEDGEMENTS
We would like to thank the Brazilian research agencies
Fundação de Amparo à Pesquisa do Estado de São Paulo
(FAPESP), Conselho Nacional de Desenvolvimento
Cientı́fico e Tecnológico (CNPq), and Coordenação de
Aperfeiçoamento de Pessoal de Nı́vel Superior (CAPES).
We would also like to thank Dr. Celine Vens for
providing support with the PCT-based methods.
REFERENCES
[1] A. Freitas and A. C. Carvalho, “A tutorial on hierarchical classification with applications in bioinformatics,” in Research and Trends in Data Mining Technologies and Applications. Idea Group, 2007, ch. VII, pp. 175–208.
[2] A. Clare and R. D. King, “Predicting gene function in Saccharomyces cerevisiae,” Bioinformatics, vol. 19, pp. 42–49, 2003.
[3] J. Struyf, H. Blockeel, and A. Clare, “Hierarchical multi-classification with predictive clustering trees in functional genomics,” in Workshop on Computational Methods in Bioinformatics, ser. LNAI, vol. 3808. Springer, 2005, pp. 272–283.
[4] H. Blockeel, L. Schietgat, J. Struyf, S. Dzeroski, and A. Clare, “Decision trees for hierarchical multilabel classification: A case study in functional genomics,” in Knowledge Discovery in Databases, 2006, pp. 18–29.
[5] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel, “Decision trees for hierarchical multi-label classification,” Machine Learning, vol. 73, pp. 185–214, 2008.
[6] R. Alves, M. Delgado, and A. Freitas, “Knowledge discovery with artificial immune systems for hierarchical multi-label classification of protein functions,” in International Conference on Fuzzy Systems, 2010, pp. 2097–2104.
[7] F. Otero, A. Freitas, and C. Johnson, “A hierarchical multi-label classification ant colony algorithm for protein function prediction,” Memetic Computing, vol. 2, pp. 165–181, 2010.
[8] R. Cerri and A. C. P. L. F. Carvalho, “Hierarchical multilabel classification using top-down label combination and artificial neural networks,” in Brazilian Symposium on Artificial Neural Networks, 2010, pp. 253–258.
[9] R. Cerri, A. C. P. de Leon Ferreira de Carvalho, and A. A. Freitas, “Adapting non-hierarchical multilabel classification methods for hierarchical multilabel classification,” Intelligent Data Analysis, to appear, 2011.
[10] D. E. Rumelhart and J. L. McClelland, Eds., Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press, 1986, vol. 1.
[11] M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in International Conference on Neural Networks, 1993, pp. 586–591.
[12] M.-L. Zhang and Z.-H. Zhou, “Multilabel neural networks with applications to functional genomics and text categorization,” IEEE Transactions on Knowledge and Data Engineering, vol. 18, pp. 1338–1351, 2006.
[13] C. Silla and A. Freitas, “A survey of hierarchical classification across different application domains,” Data Mining and Knowledge Discovery, vol. 22, pp. 31–72, 2010.
[14] Z. Barutcuoglu, R. E. Schapire, and O. G. Troyanskaya, “Hierarchical multi-label prediction of gene function,” Bioinformatics, vol. 22, pp. 830–836, 2006.
[15] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[16] M. Ashburner et al., “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium,” Nature Genetics, vol. 25, pp. 25–29, 2000.
[17] Y. Guan, C. Myers, D. Hess, Z. Barutcuoglu, A. Caudy, and O. Troyanskaya, “Predicting gene function in a hierarchical context with an ensemble of classifiers,” Genome Biology, vol. 9, p. S3, 2008.
[18] P. Langley, W. Iba, and K. Thompson, “An analysis of Bayesian classifiers,” in National Conference on Artificial Intelligence, 1992, pp. 223–228.
[19] G. Valentini, “True path rule hierarchical ensembles,” in International Workshop on Multiple Classifier Systems, 2009, pp. 232–241.
[20] H. W. Mewes et al., “MIPS: a database for genomes and protein sequences,” Nucleic Acids Research, vol. 30, pp. 31–34, 2002.
[21] H. Blockeel, M. Bruynooghe, S. Dzeroski, J. Ramon, and J. Struyf, “Hierarchical multi-classification,” in Workshop on Multi-Relational Data Mining, 2002, pp. 21–35.
[22] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1993.
[23] F. Otero, A. Freitas, and C. Johnson, “A hierarchical classification ant colony algorithm for predicting gene ontology terms,” in European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, ser. LNCS. Springer, 2009, pp. 68–79.
[24] H. Blockeel, L. De Raedt, and J. Ramon, “Top-down induction of clustering trees,” in International Conference on Machine Learning, 1998, pp. 55–63.
[25] J. Davis and M. Goadrich, “The relationship between precision-recall and ROC curves,” in International Conference on Machine Learning, 2006, pp. 233–240.
[26] S. Kiritchenko, S. Matwin, and A. F. Famili, “Hierarchical text categorization as a tool of associating genes with gene ontology codes,” in European Workshop on Data Mining and Text Mining in Bioinformatics, 2004, pp. 30–34.
[27] J. Demšar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
[28] L. Schietgat, C. Vens, J. Struyf, H. Blockeel, D. Kocev, and S. Dzeroski, “Predicting gene function using hierarchical multi-label decision tree ensembles,” BMC Bioinformatics, vol. 11, p. 2, 2010.
[29] M. J. D. Powell, Radial Basis Functions for Multivariable Interpolation: A Review. New York, NY, USA: Clarendon Press, 1987, pp. 143–167.
[30] S. Kiritchenko, S. Matwin, and A. Famili, “Functional annotation of genes using hierarchical text categorization,” in Proc. of the ACL Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, 2005.
[31] A. Esuli, T. Fagni, and F. Sebastiani, “Boosting multi-label hierarchical text categorization,” Information Retrieval, vol. 11, no. 4, pp. 287–313, 2008.