INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 2, NO. 2, AUGUST 2011
Ordered Incremental Attribute Learning based
on mRMR and Neural Networks
Ting Wang, Sheng-Uei Guan, and Fei Liu
Abstract—Current feature reduction approaches such as
feature selection and feature extraction are insufficient for
dealing with high-dimensional pattern recognition problems when
all features carry similar significance. An applicable method for
coping with these problems is incremental attribute learning (IAL), which gradually imports pattern features one at a time or in groups. Hence a new preprocessing step called feature ordering should be
introduced in pattern classification and regression, and the
ordering of imported features should be calculated before
recognition. In previous studies, feature ordering was calculated by an approach similar to wrapper methods in feature selection; however, such a process is time-consuming. In this paper, a substitute approach for feature ordering is presented, where features are ranked by metrics of redundancy and relevance using the mRMR criterion. Based on ITID, a neural IAL model is derived. Experimental results verified that the feature ordering derived by mRMR not only saves time, but also achieves better classification rates than those in previous
studies. In addition, it is also feasible to apply mRMR to calculate
feature ordering for regression problems.
Index Terms—feature ordering, incremental attribute learning,
mRMR, neural networks
I. INTRODUCTION
Problems like gene analysis and text classification often have a high-dimensional feature space consisting of a large number of features, also called attributes. The number of dimensions often reflects the complexity of a problem: the more features a problem has, the more complex it is. Complex high-dimensional problems suffer from the curse of dimensionality, which can make computation intractable. To solve
these problems, some dimensional reduction strategies like
feature selection and feature extraction [1] have been presented
[2]. However, these methods are ineffective when a problem has a large number of features that are all crucial and of similar importance. Thus feature reduction alone is not sufficient for coping with high-dimensional problems.

This work was supported in part by the National Natural Science Foundation of China under Grant 61070085.
T. Wang is with the University of Liverpool, Liverpool, L69 3BX UK, and is offsite studying at Xi'an Jiaotong-Liverpool University, Suzhou, 215123 China. Phone: 86-13812296645. E-mail: ting.wang@liverpool.ac.uk
S. Guan is with Xi'an Jiaotong-Liverpool University, Suzhou, 215123 China. E-mail: Steven.Guan@xjtlu.edu.cn
F. Liu is with La Trobe University, Bundoora, Victoria 3086, Australia. E-mail: f.liu@latrobe.edu.au
One useful strategy for solving high-dimensional problems
is “divide-and-conquer”, where a complex problem is firstly
separated into some smaller modules by features. These
modules will be integrated after they have been tackled
independently. A representative of such methods is Incremental
Attribute Learning (IAL), which incrementally trains pattern features one at a time or in groups. It has been shown to be an applicable approach for solving machine learning problems in regression and classification [3-6]. Moreover, some previous studies have shown that IAL based on neural networks usually obtains better results than conventional methods, which train all pattern features in one batch [3, 7]. For example, based on
machine learning datasets from University of California, Irvine
(UCI), Guan and his colleagues employed IAL to solve some
classification and regression problems by neural networks.
Almost all their results were better than those derived from
traditional methods [5, 6]. More specifically, classification errors of IAL using neural networks on the Diabetes, Thyroid and Glass datasets were reduced by 8.2%, 14.6% and 12.6%, respectively [8].
However, because IAL incrementally imports features into
systems, it is necessary to know which feature should be
introduced in an earlier step. Thus feature ordering should be
implemented as a new preprocess apart from conventional
preprocessing tasks like feature selection and feature extraction.
Usually, feature ordering relies on feature’s discriminative
ability. In previous studies on neural IAL, feature ordering was
derived by an approach which is similar to wrappers in feature
selection, where discriminative ability of a single feature is
calculated by some predictive algorithms like neural networks.
The only input consists of the feature to be evaluated, and the
output refers to the discrimination ability of this feature.
However, compared with filters, the other main class of approaches in feature selection, wrappers are more time-consuming. Therefore,
it is necessary to do some studies on feature ordering based on
filter methods.
In this paper, a new feature ordering method of IAL is
presented based on a filter feature selection approach called
minimal-redundancy-maximal-relevance (mRMR) criterion.
Furthermore, as a neural network algorithm of IAL, ITID will
be used to test the applicability and accuracy of this new
method. In this paper, some background knowledge of ITID and mRMR will be introduced in Sections 2 and 3, respectively; in Section 4, an IAL feature ordering model based on mRMR will be presented; benchmarks with the datasets from UCI will be tested in Section 5, followed by some analysis of the experimental results; and conclusions will be drawn in the last section with an outline of future work.
II. IAL BASED ON NEURAL NETWORKS
Incremental attribute learning is a novel approach which imports features gradually, one at a time or in groups. According to previous studies, IAL is well suited to coping with multivariate high-dimensional pattern recognition problems. At present, based on
some intelligent predictive methods like neural networks, new
approaches and algorithms have been presented for IAL. For
example, incremental neural network training with an
increasing input dimension (ITID) [9] is an incremental neural
training method derived from ILIA1 and ILIA2 [7], which have
been shown applicable for classification and regression. ITID divides the whole input dimension into several sub-dimensions, each of which corresponds to an input feature.
Instead of learning input features altogether as an input vector
in a training instance, ITID learns input features one after
another through their corresponding sub-networks, and the
structure of neural networks gradually grows with an
increasing input dimension as shown in Fig. 1. During training,
information obtained by a new sub-network is merged together
with the information obtained by the old network. Such an architecture is based on ILIA1. After training, ILIA2 is obtained by collapsing the outputs of the neural networks and adding a network on top, with links to the collapsed output units and all the input units to collect more information from the inputs, as shown in Fig. 2. Finally, a
pruning technique is adopted to find out the appropriate
network architecture. With less internal interference among
input features, ITID achieves higher generalization accuracy
than conventional methods [9].
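The incremental-input idea can be illustrated with a deliberately simplified sketch. The code below uses a single logistic unit whose input dimension grows by one feature at a time, keeping previously learned weights as a warm start; this is only a rough analogue of ITID's merging of old-network information with a new sub-network, not the authors' implementation, and the function name and parameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def incremental_train(X, y, order, epochs=200, lr=0.5):
    """Train a logistic unit whose input dimension grows one feature
    at a time, following `order`. Previously learned weights are kept
    as a warm start when a new input is added (a rough analogue of
    merging old-network information with a new sub-network)."""
    w = np.zeros(0)
    b = 0.0
    used = []
    for f in order:
        used.append(f)
        w = np.append(w, 0.0)          # grow the input dimension by one
        Xs = X[:, used]
        for _ in range(epochs):        # plain gradient descent on the sub-problem
            p = sigmoid(Xs @ w + b)
            g = p - y
            w -= lr * Xs.T @ g / len(y)
            b -= lr * g.mean()
    return w, b, used
```

At each step only the enlarged weight vector is retrained, so features introduced early influence the solution longest, which is why the order of introduction matters.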
III. MRMR CRITERION
Minimal-redundancy-maximal-relevance criterion (mRMR)
is a method for first-order incremental feature selection [10]. Under the mRMR criterion, features which have both minimum redundancy with the other input features and maximum relevance to the output classes should be selected. Thus
this method is based on two important metrics. One is mutual
information between an output and each input, which is used to
measure relevancy, and the other is mutual information
between every two inputs, which is used to calculate
redundancy between these inputs.
More specifically, let $S$ denote the subset of selected features and $\Omega$ the pool of all input features. The minimum redundancy can be computed by

  $\min_{S \subset \Omega} \; \frac{1}{|S|^2} \sum_{i,j \in S} I(f_i, f_j)$    (1)

where $I(f_i, f_j)$ is the mutual information between $f_i$ and $f_j$, and $|S|$ is the number of input features in $S$. On the other hand, the mutual information $I(c, f_i)$ is usually employed to calculate the discrimination ability from feature $f_i$ to class $c = \{c_1, \dots, c_k\}$. Therefore, the maximum relevance can be calculated by

  $\max_{S \subset \Omega} \; \frac{1}{|S|} \sum_{i \in S} I(c, f_i)$    (2)

Combining (1) with (2), the mRMR feature selection criterion can be obtained as below, either in quotient form:

  $\max_{S \subset \Omega} \left\{ \sum_{i \in S} I(c, f_i) \middle/ \left[ \frac{1}{|S|} \sum_{i,j \in S} I(f_i, f_j) \right] \right\}$    (3)

or in difference form:

  $\max_{S \subset \Omega} \left\{ \sum_{i \in S} I(c, f_i) - \frac{1}{|S|} \sum_{i,j \in S} I(f_i, f_j) \right\}$    (4)
In the solutions of mRMR, features are incrementally added
into the selected feature subset. According to such a process,
the sequence of incremental addition can be regarded as an
order of discrimination ability of features. Thus feature
ordering also can be calculated by (3) or (4), if all features have
been put into the selected subset by mRMR.
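As a sketch of how this incremental addition yields a full feature ordering, the following example greedily ranks every feature by its relevance minus (or divided by) its mean redundancy against the features already selected. The histogram-based mutual information estimator and the function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def mutual_info(a, b, bins=8):
    """Histogram-based mutual information estimate between two 1-D arrays."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def mrmr_order(X, y, form="difference", bins=8):
    """Rank ALL features by greedily applying the mRMR criterion:
    at each step pick the unselected feature whose relevance I(c, f)
    minus (difference form) or divided by (quotient form) its mean
    redundancy against the already selected features is maximal."""
    n = X.shape[1]
    rel = [mutual_info(X[:, i], y, bins) for i in range(n)]
    selected = [int(np.argmax(rel))]          # start from the most relevant feature
    while len(selected) < n:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            red = np.mean([mutual_info(X[:, i], X[:, j], bins) for j in selected])
            score = rel[i] - red if form == "difference" else rel[i] / (red + 1e-12)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```

Because no feature is discarded, the returned sequence is a feature ordering rather than a selected subset: a redundant duplicate of an already chosen feature is pushed toward the end even if its individual relevance is high.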
IV. FEATURE ORDERING BASED ON MRMR
Fig. 1. The basic network structure of ITID
Fig. 2. The network structure of ITID with a sub-network on the top
Feature ordering is a data preparation step unique to IAL. Compared with conventional approaches, where input features are trained in one batch, in IAL features are gradually imported into pattern recognition one after another. In this process, how to derive an order for training is very important. Hence feature ordering, seldom used in conventional pattern recognition techniques, is indispensable in IAL. Moreover, the computing procedure of feature ordering differs from that of feature selection: feature selection discards some features from the original feature set, while feature ordering merely arranges all features in a given order, which may differ from the original sequence.
Because the calculation of feature ordering in previous studies is based on wrappers, which are time-consuming compared with filters, using filter methods can bring benefits for feature ordering in IAL. Moreover, apart from saving time in preprocessing, there are other advantages in calculating feature ordering with filters. For example, feature reduction and feature ordering can be applied simultaneously, because the former relies heavily on discriminability, while the calculation of discriminability is the key to the latter.
Although these two mRMR methods are different, both are
applicable in the calculation of input feature ordering, and each
ordering can be employed in training. Fig. 3 is a model of
ordered IAL, where feature ordering is calculated by mRMR.
In this model, there are two phases of ordered IAL: obtaining
feature ordering and applying pattern recognition. The step of
obtaining feature ordering refers to the computing of Feature
Ordering Vector, which is derived from Feature Ordering
Calculator. In this step, the discriminability of each feature is
calculated on the basis of training dataset, and the results of
discriminability are placed in a descending order, which will
obtain better results than the other orders [11]. In the second
phase, the formal machine learning process starts. Patterns are
randomly divided into three different datasets: training,
validation and testing [12]. The process in this step is based on
these datasets, and feature ordering for importing features is a
foundation for pattern recognition in each round.
Fig. 3. The model of Ordered IAL based on mRMR
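The random division of patterns into training, validation and testing sets used in the second phase can be sketched as follows, with the 50/25/25 proportions taken from the experiments in Section V (the function name and seed handling are illustrative assumptions):

```python
import numpy as np

def split_patterns(X, y, seed=0):
    """Randomly divide patterns into training (50%), validation (25%)
    and testing (25%) sets, as in the second phase of the model."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train = len(y) // 2
    n_valid = len(y) // 4
    train = idx[:n_train]
    valid = idx[n_train:n_train + n_valid]
    test = idx[n_train + n_valid:]
    return (X[train], y[train]), (X[valid], y[valid]), (X[test], y[test])
```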
V. EXPERIMENTS AND ANALYSIS
The proposed ordered IAL method using mRMR and ITID was tested on four benchmarks from the UCI machine learning repository: Diabetes, Cancer, Thyroid and Flare. The
first three are classification problems while the last one is a
regression problem. In these experiments, all the patterns were
randomly divided into three groups: training set (50%),
validation set (25%) and testing set (25%). In particular, the training data were first used to rank the feature ordering based on mRMR as a preprocessing task, and ITID was then employed for classification or regression according to this feature ordering. Furthermore, to compare
with previous studies which merely focus on ILIA1, all the
results about ITID were based on ILIA1 as well.
To evaluate the performance, two types of metrics were
employed for analysis, preprocessing time and error rate of the
pattern recognition process.
In terms of time, previous feature ordering computation employed one feature as the only input of a neural network to classify or predict all patterns. Such processing estimates a feature's discriminability according to the error rate of classification or regression, and this discriminability is then used to rank the feature ordering. This processing is similar to the wrappers widely used in feature selection.
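The wrapper-like procedure just described can be sketched as follows. The scorer passed in is a stand-in for the single-feature neural network used in previous studies; both the function names and the thresholding scorer in the usage note are hypothetical:

```python
import numpy as np

def wrapper_ordering(X, y, eval_error):
    """Wrapper-style feature ordering: each feature alone serves as the
    sole input of a predictive model; the resulting error rate estimates
    that feature's discriminability, and features are ranked with the
    most discriminative (lowest error) first. `eval_error` is a
    caller-supplied scorer; previous studies trained a neural network
    per feature in its place, which is what makes wrappers slow."""
    errors = [eval_error(X[:, [i]], y) for i in range(X.shape[1])]
    return list(np.argsort(errors))
```

Note that the model must be trained once per feature, so the cost scales with both the number of features and the cost of training, whereas a filter such as mRMR only computes mutual information statistics.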
Compared with wrapper-like feature ordering computing, the calculation using mRMR is quite different: several statistical metrics are employed to measure discriminability, without training a predictive model per feature. Such processing is similar to filters in feature selection. A filter-like approach usually takes much less time than a wrapper method does, so computing feature ordering using mRMR is considerably more time-saving.
In terms of error rate, both mRMR methods are applicable
for feature ordering, thus two streams of experiments based on
mRMR were implemented as well. The following subsections
present the details of different experiments using different
datasets.
A. Diabetes
Diabetes is a two-category classification problem which has
8 continuous input features that are used to diagnose whether a
Pima Indian has diabetes. There are 768 patterns in this dataset, 65% of which belong to class 1 (no diabetes) and 35% to class 2 (diabetes). Table I compares the classification results of ITID based on different feature orderings derived from mRMR (Difference), mRMR (Quotient), wrappers, and the conventional method, which has
no feature ordering. According to the results, mRMR
(Difference) obtained the lowest error rate (22.86459%) and
the result of mRMR (Quotient) is as good as that of wrappers
(22.96876%). All of them are better than that of the conventional method (23.93229%), which trains patterns in one
batch. Thus using mRMR can obtain a better feature ordering
for IAL in Diabetes.
B. Cancer
Cancer is a classification problem including 9 continuous
inputs, 2 outputs, and 699 patterns, which is used to diagnose
breast cancer. 66% of the patterns belong to class 1 (benign)
and 34% of them belong to class 2 (malignant). Table II shows
Cancer’s experimental results. By comparison, both mRMR
methods obtained the same best classification results
(2.29885%) in this test, while those of wrappers and
conventional methods are 2.4999985% and 2.87356%,
respectively. Hence using mRMR for feature ordering is better
than using the other approaches in Cancer.
C. Thyroid
Thyroid diagnoses whether a patient’s thyroid has
over-function, normal function, or under-function based on
patient query data and patient examination data. This
classification problem has 21 input features, 3 outputs, and
7200 patterns where class 1, 2 and 3 have 2.3%, 5.1% and
92.6% of all the patterns, respectively. Table III presents the
classification results of Thyroid. Compared with the error rate
of wrappers (1.838888%) and conventional method
(1.8638875%), both mRMR (Difference) and mRMR
(Quotient) exhibited better performance, where the result of the
former is 1.619443% and that of the latter is 1.625001%.
Therefore, mRMR is a better approach for feature ordering in
Thyroid.
D. Flare
The Flare problem is a regression problem that predicts three
outputs of solar flares. There are 10 inputs, 3 outputs and 1066
patterns in this dataset, where the first three inputs have 7, 6, and 4 features, respectively, and each of the latter inputs has only one feature. Thus the total number of input features in Flare
is 24. Table IV exhibits the performance of regression where
the input ordering is derived either from mRMR or wrappers.
According to the column of testing error, mRMR is feasible for
feature ordering in regression, but the results are not better than
those derived in previous studies using conventional method
and wrappers. The reason for this phenomenon is that, in previous studies, the first three inputs, which consist of multiple features, were not trained separately and individually, but group by group. Thus feature grouping may have an impact on the final results, which coincides with the idea in [13].
According to the analysis presented above, in classification
problems, feature ordering derived from mRMR exhibits better
performance than that obtained by the conventional method
and wrappers; while in regression problems, the ordering
calculated by mRMR is not the best because of the feature
grouping in Flare. Therefore, as a well-known method of
feature selection, mRMR is also available for feature ordering.
It takes less preprocessing time to compute the input feature ordering, and brings acceptable results with stable performance for simulation and application. However, because different approaches produce different feature orderings, their final results differ, and how to obtain the best feature ordering is still unknown. This will be studied in the future.
VI. CONCLUSION
IAL is a novel approach which gradually trains input attributes one at a time or in groups. Feature ordering in training is a
unique preprocessing step in IAL pattern recognition. In
previous studies, feature ordering of IAL was derived by
wrapper methods which are more time-consuming than filter
approaches like mRMR. Moreover, experimental results also
demonstrated that the feature ordering obtained by mRMR can
exhibit better performance than those derived by wrappers or
traditional methods which train features in one batch in the process of pattern recognition.

TABLE I
RESULTS OF DIABETES

Method               Feature Ordering   Classification Error
mRMR-Difference      2-6-1-7-3-8-4-5    22.86459%
mRMR-Quotient        2-6-1-7-3-8-5-4    22.96876%
Wrappers             2-6-1-7-3-8-5-4    22.96876%
Conventional method  (one batch)        23.93229%

TABLE II
RESULTS OF CANCER

Method               Feature Ordering     Classification Error
mRMR-Difference      2-6-1-7-3-8-5-4-9    2.29885%
mRMR-Quotient        2-6-1-7-8-3-5-4-9    2.29885%
Wrappers             2-3-5-8-6-7-4-1-9    2.4999985%
Conventional method  (one batch)          2.87356%

TABLE III
RESULTS OF THYROID

Method               Feature Ordering                                        Classification Error
mRMR-Difference      3-7-17-10-6-8-13-16-4-5-12-21-18-19-2-20-15-9-14-11-1   1.619443%
mRMR-Quotient        3-10-16-7-6-17-2-8-13-5-1-4-11-12-14-9-21-15-18-19-20   1.625001%
Wrappers             18-17-19-20-11-21-15-10-3-8-13-7-1-2-12-16-6-5-4-14-9   1.838888%
Conventional method  (one batch)                                             1.8638875%

TABLE IV
RESULTS OF FLARE

Method               Feature Ordering                                                      Testing Error
mRMR-Difference      16-23-5-18-10-4-13-6-17-19-11-21-7-20-12-1-24-2-3-9-22-15-8-14        0.568722%
mRMR-Quotient        16-23-5-18-10-4-13-6-19-17-11-21-7-20-14-2-9-15-12-22-8-3-1-24        0.573042%
Wrappers             (1-2-3-4-5-6-7)-(8-9-10-11-12-13)-(14-15-16-17)-18-21-23-22-20-19-24  0.5255421%
Conventional method  (one batch)                                                           0.55%

Nonetheless, there are a number
of further studies to be done in the future. For example, although mRMR can attain better performance than the other approaches, how to obtain the feature ordering that yields an optimal pattern recognition result is still unknown. Moreover, although the regression results were not better than those of other approaches, whether feature ordering for regression can be improved is worth researching. In addition, whether
feature grouping of input attributes is a factor which may
influence pattern recognition also needs to be researched in the
future.
Generally, using mRMR to calculate feature ordering is
applicable for saving time and enhancing the classification rate
in pattern classification problems based on neural IAL
approaches. Although the performance exhibited in regression
is not the best compared with results derived from some
previous studies, mRMR is still applicable in feature ordering
calculation for prediction with neural IAL.
REFERENCES
[1] H. Liu, "Evolving feature selection," IEEE Intelligent Systems, vol. 20, no. 6, pp. 64-76, 2005.
[2] S.H. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann Publishers, San Francisco, CA, 1998.
[3] S. Chao and F. Wong, "An incremental decision tree learning methodology regarding attributes in medical data mining," Proc. of the 8th Int'l Conf. on Machine Learning and Cybernetics, Baoding, pp. 1694-1699, 2009.
[4] R. K. Agrawal and R. Bala, "Incremental Bayesian classification for multivariate normal distribution data," Pattern Recognition Letters, vol. 29, no. 13, pp. 1873-1876, Oct. 2008.
[5] S.U. Guan and J. Liu, "Feature selection for modular networks based on incremental training," Journal of Intelligent Systems, vol. 14, no. 4, pp. 353-383, 2005.
[6] F. Zhu and S.U. Guan, "Ordered incremental training for GA-based classifiers," Pattern Recognition Letters, vol. 26, no. 14, pp. 2135-2151, Oct. 2005.
[7] S.U. Guan and S. Li, "Incremental learning with respect to new incoming input attributes," Neural Processing Letters, vol. 14, no. 3, pp. 241-260, Dec. 2001.
[8] S.U. Guan and S. Li, "Parallel growing and training of neural networks using output parallelism," IEEE Trans. on Neural Networks, vol. 13, no. 3, pp. 542-550, May 2002.
[9] S.U. Guan and J. Liu, "Incremental neural network training with an increasing input dimension," Journal of Intelligent Systems, vol. 13, no. 1, pp. 43-69, 2004.
[10] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[11] S.U. Guan and J. Liu, "Incremental ordered neural network training," Journal of Intelligent Systems, vol. 12, no. 3, pp. 137-172, 2002.
[12] B. D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996.
[13] J.H. Ang, S.U. Guan, K.C. Tan, and A.A. Mamun, "Interference-less neural network training," Neurocomputing, vol. 71, no. 16-18, pp. 3509-3524, 2008.
Ting WANG was born in Wuxi, China, on May 28th, 1981.
He obtained his BSc degree in computer science at the
China University of Mining and Technology, in 2003 and
his MSc degree in computer science at the Guilin University
of Technology, China, in 2008, and is now a PhD candidate
in computer science, at the University of Liverpool.
From July 2003 to July 2004, he was a system analyst at
the Microstar International Inc. Before he started his PhD
program in March 2009, he was a research and development engineer at the
Jiangnan Institution of Computing Technology. He is now interested in the field
of artificial intelligence and information management.
Sheng-Uei Guan received his M.Sc. & Ph.D. from the
University of North Carolina at Chapel Hill. He is currently
a professor and head of the computer science and software
engineering department at Xi'an Jiaotong-Liverpool
University. Before joining XJTLU, he was a professor and
chair in intelligent systems at Brunel University, UK.
Prof. Guan has worked in a prestigious R&D
organization for several years, serving as a design engineer, project leader, and
manager. After leaving the industry, he joined Yuan-Ze University in Taiwan
for three and half years. He served as deputy director for the Computing Center
and the chairman for the Department of Information & Communication
Technology. Later he joined the Electrical & Computer Engineering
Department at National University of Singapore as an associate professor.
Fei Liu completed her PhD from The Department of
Computer Science & Computer Engineering, La Trobe
University in 1998. Before joining the department as an
academic staff member in 2002, she worked as a lecturer in
The School of Computer & Information Science, The
University of South Australia, and The School of Computer
Science & Information Technology, Royal Melbourne
Institute of Technology. She also worked as a software engineer in Ericsson
Australia. Her research interests include Logic Programming, Semantic Web
and Security in Electronic Commerce. She has been teaching in Artificial
Intelligence, Programming, and Security in Electronic Commerce.