INTERNATIONAL JOURNAL OF DESIGN, ANALYSIS AND TOOLS FOR CIRCUITS AND SYSTEMS, VOL. 2, NO. 2, AUGUST 2011

Ordered Incremental Attribute Learning based on mRMR and Neural Networks

Ting Wang, Sheng-Uei Guan, and Fei Liu

Abstract—Current feature reduction approaches such as feature selection and feature extraction are insufficient for high-dimensional pattern recognition problems in which all features carry similar significance. An applicable method for such problems is incremental attribute learning (IAL), which gradually imports pattern features in one or more sizes. IAL therefore requires a new preprocessing step, feature ordering: the order in which features are imported must be calculated before recognition. In previous studies, feature ordering was calculated by a process similar to wrapper methods, which are time-consuming in feature selection. This paper presents an alternative approach in which features are ranked by metrics of redundancy and relevance using the mRMR criterion. Based on ITID, a neural IAL model is derived. Experimental results verify that feature ordering derived by mRMR not only saves time but also obtains the best classification rates compared with those in previous studies. In addition, mRMR is also feasible for calculating feature ordering in regression problems.

Index Terms—feature ordering, incremental attribute learning, mRMR, neural networks

I. INTRODUCTION

Problems like gene analysis and text classification often have a high-dimensional feature space consisting of a large number of features, also called attributes. The number of dimensions often reflects the complexity of a problem: the more features a problem has, the more complex it is. Complex high-dimensional problems suffer from the curse of dimensionality, which can make computation intractable.
To solve these problems, dimensionality reduction strategies like feature selection and feature extraction [1] have been presented [2]. However, these methods are ineffective when a problem has a large number of features that are all crucial and of similar importance. Thus feature reduction is not the ultimate technique for coping with high-dimensional problems.

One useful strategy for solving high-dimensional problems is "divide-and-conquer", where a complex problem is first separated into smaller modules by features. These modules are integrated after they have been tackled independently. A representative of such methods is Incremental Attribute Learning (IAL), which incrementally trains pattern features in one or more sizes. It has been shown to be an applicable approach for solving machine learning problems in regression and classification [3-6]. Moreover, previous studies have shown that IAL based on neural networks usually obtains better results than conventional methods, which train all pattern features in one batch [3, 7]. For example, based on machine learning datasets from the University of California, Irvine (UCI), Guan and his colleagues employed IAL to solve classification and regression problems with neural networks. Almost all their results were better than those derived from traditional methods [5, 6].

This work was supported in part by the National Natural Science Foundation of China under Grant 61070085.
T. Wang is with the University of Liverpool, Liverpool, L69 3BX UK, and is studying offsite at Xi'an Jiaotong-Liverpool University, Suzhou, 215123 China. Phone: 86-13812296645. E-mail: ting.wang@liverpool.ac.uk
S. Guan is with Xi'an Jiaotong-Liverpool University, Suzhou, 215123 China. E-mail: Steven.Guan@xjtlu.edu.cn
F. Liu is with La Trobe University, Bundoora, Victoria 3086, Australia. E-mail: f.liu@latrobe.edu.au
More specifically, classification errors of IAL using neural networks on the Diabetes, Thyroid and Glass datasets were reduced by 8.2%, 14.6% and 12.6%, respectively [8]. However, because IAL imports features into the system incrementally, it is necessary to know which features should be introduced earlier. Thus feature ordering should be implemented as a new preprocessing step, apart from conventional preprocessing tasks like feature selection and feature extraction. Usually, feature ordering relies on each feature's discriminative ability. In previous studies on neural IAL, feature ordering was derived by an approach similar to wrappers in feature selection, where the discriminative ability of a single feature is estimated by a predictive algorithm such as a neural network: the only input is the feature to be evaluated, and the output error indicates that feature's discrimination ability. However, compared with filters, the other main approach in feature selection, wrappers are more time-consuming. It is therefore worthwhile to study feature ordering based on filter methods. In this paper, a new feature ordering method for IAL is presented based on a filter feature selection approach, the minimal-redundancy-maximal-relevance (mRMR) criterion. Furthermore, ITID, a neural network algorithm for IAL, is used to test the applicability and accuracy of this new method. Background knowledge of ITID and mRMR is introduced in Sections 2 and 3, respectively; Section 4 presents an IAL feature ordering model based on mRMR; benchmarks with UCI datasets are tested in Section 5, followed by experimental result analysis; and conclusions are drawn in the last section with an outline of future work.

II.
IAL BASED ON NEURAL NETWORKS

Incremental attribute learning imports features gradually in one or more groups. According to previous studies, IAL is well suited to coping with multivariate, high-dimensional pattern recognition problems. Based on intelligent predictive methods like neural networks, new approaches and algorithms have been presented for IAL. For example, incremental neural network training with an increasing input dimension (ITID) [9] is an incremental neural training method derived from ILIA1 and ILIA2 [7], which have been shown applicable to classification and regression. It divides the whole input dimension into several sub-dimensions, each of which corresponds to an input feature. Instead of learning the input features all together as one input vector per training instance, ITID learns input features one after another through their corresponding sub-networks, and the structure of the neural network gradually grows with the increasing input dimension, as shown in Fig. 1. During training, information obtained by a new sub-network is merged with the information obtained by the old network. This architecture is based on ILIA1. After training, if the network outputs are collapsed by an additional network sitting on top, with links to the collapsed output units and all the input units built to collect more information from the inputs, the result is ILIA2, as shown in Fig. 2. Finally, a pruning technique is adopted to find an appropriate network architecture. With less internal interference among input features, ITID achieves higher generalization accuracy than conventional methods [9].

III. MRMR CRITERION

Minimal-redundancy-maximal-relevance (mRMR) is a criterion for first-order incremental feature selection [10].
In the mRMR criterion, the selected features should have both minimum redundancy among the input features and maximum relevance to the output classes. The method is thus based on two metrics: the mutual information between the output and each input, which measures relevance, and the mutual information between every pair of inputs, which measures redundancy. More specifically, let S denote the subset of selected features and Ω the pool of all input features. The minimum redundancy condition is

$$\min_{S\subset\Omega}\ \frac{1}{|S|^2}\sum_{i,j\in S} I(f_i, f_j), \qquad (1)$$

where I(f_i, f_j) is the mutual information between f_i and f_j, and |S| is the number of input features in S. On the other hand, the mutual information I(c, f_i) is employed to measure the discrimination ability of feature f_i with respect to the class c = {c_1, …, c_k}. The maximum relevance condition is therefore

$$\max_{S\subset\Omega}\ \frac{1}{|S|}\sum_{i\in S} I(c, f_i). \qquad (2)$$

Combining (1) with (2), the mRMR feature selection criterion is obtained either in quotient form:

$$\max_{S\subset\Omega}\ \left\{ \sum_{i\in S} I(c, f_i) \Big/ \left[\frac{1}{|S|}\sum_{i,j\in S} I(f_i, f_j)\right] \right\} \qquad (3)$$

or in difference form:

$$\max_{S\subset\Omega}\ \left\{ \sum_{i\in S} I(c, f_i) - \frac{1}{|S|}\sum_{i,j\in S} I(f_i, f_j) \right\}. \qquad (4)$$

In mRMR solutions, features are incrementally added to the selected feature subset. The sequence of incremental addition can be regarded as an ordering of the features' discrimination ability. Thus a feature ordering can also be calculated by (3) or (4), once all features have been put into the selected subset by mRMR.

IV. FEATURE ORDERING BASED ON MRMR

Fig. 1. The basic network structure of ITID
Fig. 2. The network structure of ITID with a sub-network on the top

Feature ordering is unique to the data preparation work of IAL. Compared with conventional approaches, where input features are trained in one batch, in IAL features are imported into pattern recognition gradually, one after another.
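The greedy procedure by which mRMR yields a complete feature ordering can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes discrete-valued features (continuous features would first be discretised) and uses the standard first-order step, where each round adds the feature maximising relevance minus (or divided by) its mean redundancy with the features already chosen.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete arrays."""
    x_vals, x_idx = np.unique(x, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(x_vals), len(y_vals)))
    for i, j in zip(x_idx, y_idx):
        joint[i, j] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def mrmr_ordering(X, y, form="difference"):
    """Rank ALL features by greedy mRMR: start from the most relevant
    feature, then repeatedly add the feature with the best
    relevance-vs-mean-redundancy score against those already chosen."""
    n = X.shape[1]
    relevance = [mutual_information(X[:, i], y) for i in range(n)]
    order = [int(np.argmax(relevance))]
    remaining = set(range(n)) - set(order)
    while remaining:
        best, best_score = None, -np.inf
        for f in remaining:
            red = np.mean([mutual_information(X[:, f], X[:, s]) for s in order])
            if form == "difference":      # analogue of (4)
                score = relevance[f] - red
            else:                         # quotient form, analogue of (3)
                score = relevance[f] / (red + 1e-12)
            if score > best_score:
                best, best_score = f, score
        order.append(best)
        remaining.remove(best)
    return order
```

Running the greedy loop until the candidate pool is empty, rather than stopping at a target subset size, is what turns mRMR from a selection method into an ordering method.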
In this process, how to derive an order for training is very important. Hence feature ordering, seldom used in conventional pattern recognition techniques, is indispensable in IAL. Moreover, the computation of feature ordering differs from that of feature selection: feature selection discards some features from the original feature set, while feature ordering merely arranges all features in an order that may differ from the original sequence.

Because the calculation of feature ordering in previous studies is based on wrappers, which are time-consuming compared with filters, using filter methods can bring benefits to feature ordering in IAL. Moreover, apart from saving preprocessing time, filters have other advantages for the calculation of feature ordering. For example, feature reduction and feature ordering can be applied simultaneously, because the former relies heavily on discriminability, while computing discriminability is the key to the latter. Although the two mRMR criteria are different, both are applicable to the calculation of input feature ordering, and either ordering can be employed in training. Fig. 3 shows a model of ordered IAL in which feature ordering is calculated by mRMR. In this model, there are two phases: obtaining the feature ordering and applying pattern recognition. The first phase computes the Feature Ordering Vector, which is produced by the Feature Ordering Calculator. In this step, the discriminability of each feature is calculated on the basis of the training dataset, and the results are placed in descending order, which obtains better results than other orders [11]. In the second phase, the formal machine learning process starts.
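The second phase can be sketched as follows. This is a simplified illustration, not ITID itself: ITID's per-feature sub-networks and merging step are collapsed here into a single logistic unit whose input dimension grows by one feature per round, warm-starting each round from the previous round's weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, w=None, lr=0.1, epochs=500):
    """Batch gradient descent for logistic regression; `w` lets us
    warm-start from the weights of the previous, smaller model."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias column
    if w is None:
        w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        grad = Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
        w -= lr * grad
    return w

def ordered_ial(X, y, ordering):
    """ITID-flavoured sketch: introduce input features one at a time in
    the given order; each round the input dimension grows by one and the
    new feature's weight is initialised to zero before retraining."""
    w = None
    for k in range(1, len(ordering) + 1):
        cols = ordering[:k]
        if w is not None:
            # grow the weight vector: zero weight for the new feature,
            # keeping the bias weight at the end
            w = np.concatenate([w[:-1], [0.0], w[-1:]])
        w = train_logistic(X[:, cols], y, w=w)
    return w, ordering
```

In the real ITID architecture each new feature trains its own sub-network before merging, and a pruning step follows; only the "grow the input dimension, reuse what was already learned" idea is retained here.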
Patterns are randomly divided into three datasets: training, validation and testing [12]. The process in this phase is based on these datasets, and the feature ordering governs how features are imported for pattern recognition in each round.

Fig. 3. The model of Ordered IAL based on mRMR

V. EXPERIMENTS AND ANALYSIS

The proposed ordered IAL method using mRMR and ITID was tested on four benchmarks from the UCI machine learning datasets: Diabetes, Cancer, Thyroid and Flare. The first three are classification problems, while the last is a regression problem. In these experiments, all patterns were randomly divided into three groups: training set (50%), validation set (25%) and testing set (25%). The training data were first used to rank the feature ordering based on mRMR as a preprocessing task, and ITID was then employed for classification or regression according to this feature ordering. Furthermore, to compare with previous studies, which focus on ILIA1, all ITID results were based on ILIA1 as well.

To evaluate performance, two metrics were employed: preprocessing time and the error rate of the pattern recognition process. In terms of time, previous feature ordering computations employed one feature as the only input of a neural network to classify or predict all patterns. Such processing estimates a feature's discriminability from the error rate of classification or regression, and this discriminability is used to rank the feature ordering. This processing is similar to the wrappers widely used in feature selection. The calculation using mRMR is quite different: several information-theoretic metrics are employed to measure discriminability, which is similar to filters in feature selection. A filter-like approach usually takes much less time than a wrapper method does.
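For contrast, the wrapper-like ordering used in previous studies can be sketched as below. Here a one-feature threshold classifier stands in for the single-input neural network of the earlier work (an assumption made for brevity), but the cost structure is the same: one full predictor must be trained and evaluated per feature.

```python
import numpy as np

def single_feature_error(x, y):
    """Error of the best threshold classifier on one feature -- a cheap
    stand-in for training a single-input neural network on it."""
    best = 1.0
    for t in np.unique(x):
        for sign in (1, -1):
            pred = (sign * (x - t) > 0).astype(int)
            best = min(best, float((pred != y).mean()))
    return best

def wrapper_ordering(X, y):
    """Rank features by ascending single-feature error: every feature is
    scored by fitting a predictor on that feature alone, so the cost is
    one training run per feature."""
    errors = [single_feature_error(X[:, i], y) for i in range(X.shape[1])]
    return list(np.argsort(errors))
```

With n features, this requires n training runs of the predictor, whereas the mRMR ordering needs only mutual-information estimates, each far cheaper than training a network.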
Thus computing feature ordering with mRMR is considerably faster. In terms of error rate, both mRMR criteria are applicable for feature ordering, so two streams of experiments based on mRMR were implemented as well. The following subsections present the details of the experiments on the different datasets.

A. Diabetes

Diabetes is a two-category classification problem with 8 continuous input features, used to diagnose whether a Pima Indian has diabetes. There are 768 patterns in this dataset, 65% of which belong to class 1 (no diabetes) and 35% to class 2 (diabetes). Table I compares classification using ITID based on feature orderings derived from mRMR (Difference), mRMR (Quotient), wrappers, and the conventional method, which uses no feature ordering. According to the results, mRMR (Difference) obtained the lowest error rate (22.86459%), and the result of mRMR (Quotient) is as good as that of wrappers (22.96876%). All of them are better than the conventional method (23.93229%), which trains patterns in one batch. Thus mRMR obtains a better feature ordering for IAL on Diabetes.

B. Cancer

Cancer is a classification problem with 9 continuous inputs, 2 outputs, and 699 patterns, used to diagnose breast cancer. 66% of the patterns belong to class 1 (benign) and 34% to class 2 (malignant). Table II shows Cancer's experimental results. Both mRMR methods obtained the same best classification result (2.29885%), while those of the wrappers and the conventional method are 2.4999985% and 2.87356%, respectively. Hence mRMR feature ordering is better than the other approaches on Cancer.

C.
Thyroid

Thyroid diagnoses whether a patient's thyroid has over-function, normal function, or under-function, based on patient query and examination data. This classification problem has 21 input features, 3 outputs, and 7200 patterns, where classes 1, 2 and 3 account for 2.3%, 5.1% and 92.6% of the patterns, respectively. Table III presents the classification results for Thyroid. Compared with the error rates of the wrappers (1.838888%) and the conventional method (1.8638875%), both mRMR (Difference) and mRMR (Quotient) performed better, at 1.619443% and 1.625001%, respectively. Therefore, mRMR is a better approach for feature ordering on Thyroid.

D. Flare

Flare is a regression problem that predicts three outputs describing solar flares. There are 10 inputs, 3 outputs and 1066 patterns in this dataset, where the first three inputs have 7, 6, and 4 features, respectively, and each of the latter inputs has only one feature, so the total number of input features in Flare is 24. Table IV shows the regression performance with input orderings derived from mRMR and from wrappers. According to the testing error column, mRMR is feasible for feature ordering in regression, but the results are not better than those derived in previous studies using the conventional method and wrappers. The reason is that, in previous studies, the first three inputs, which consist of multiple features, were not trained separately and individually, but group by group. Thus feature grouping may affect the final results, which coincides with the idea in [13].

According to the analysis presented above, in classification problems feature ordering derived from mRMR performs better than that obtained by the conventional method and wrappers, while in regression the ordering calculated by mRMR is not the best, because of the feature grouping in Flare.
Therefore, as a well-known feature selection method, mRMR is also usable for feature ordering. It takes less preprocessing time to compute the input feature ordering and brings acceptable results with stable performance in simulation and application. However, because different approaches produce different feature orderings, their final results differ, and how to obtain the best feature ordering is still unknown. This question will be studied in the future.

VI. CONCLUSION

IAL is a novel approach which gradually trains input attributes in one or more sizes. Feature ordering is a preprocessing step unique to IAL pattern recognition. In previous studies, the feature ordering of IAL was derived by wrapper methods, which are more time-consuming than filter approaches like mRMR. Moreover, the experimental results demonstrate that feature orderings obtained by mRMR can perform better than those derived by wrappers or traditional methods, which train features in one batch. Nonetheless, a number of further studies are needed. For example, although mRMR attains better performance than the other approaches, how to obtain a feature ordering with an optimal pattern recognition result is still unknown. Moreover, although the predictive results in regression are not better than others, whether feature ordering can be improved for regression is worth researching. In addition, whether feature grouping of input attributes is a factor that influences pattern recognition also needs to be researched in the future. Generally, using mRMR to calculate feature ordering is applicable for saving time and enhancing the classification rate in pattern classification problems based on neural IAL approaches. Although the performance exhibited in regression is not the best compared with results from previous studies, mRMR is still applicable to feature ordering calculation for prediction with neural IAL.

TABLE I
RESULTS OF DIABETES
Method | Feature Ordering | Classification Error
mRMR-Difference | 2-6-1-7-3-8-4-5 | 22.86459%
mRMR-Quotient | 2-6-1-7-3-8-5-4 | 22.96876%
Wrappers | 2-6-1-7-3-8-5-4 | 22.96876%
Conventional method | (none) | 23.93229%

TABLE II
RESULTS OF CANCER
Method | Feature Ordering | Classification Error
mRMR-Difference | 2-6-1-7-3-8-5-4-9 | 2.29885%
mRMR-Quotient | 2-6-1-7-8-3-5-4-9 | 2.29885%
Wrappers | 2-3-5-8-6-7-4-1-9 | 2.4999985%
Conventional method | (none) | 2.87356%

TABLE III
RESULTS OF THYROID
Method | Feature Ordering | Classification Error
mRMR-Difference | 3-7-17-10-6-8-13-16-4-5-12-21-18-19-2-20-15-9-14-11-1 | 1.619443%
mRMR-Quotient | 3-10-16-7-6-17-2-8-13-5-1-4-11-12-14-9-21-15-18-19-20 | 1.625001%
Wrappers | 18-17-19-20-11-21-15-10-3-8-13-7-1-2-12-16-6-5-4-14-9 | 1.838888%
Conventional method | (none) | 1.8638875%

TABLE IV
RESULTS OF FLARE
Method | Feature Ordering | Testing Error
mRMR-Difference | 16-23-5-18-10-4-13-6-17-19-11-21-7-20-12-1-24-2-3-9-22-15-8-14 | 0.568722%
mRMR-Quotient | 16-23-5-18-10-4-13-6-19-17-11-21-7-20-14-2-9-15-12-22-8-3-1-24 | 0.573042%
Wrappers | (1-2-3-4-5-6-7)-(8-9-10-11-12-13)-(14-15-16-17)-18-21-23-22-20-19-24 | 0.5255421%
Conventional method | (none) | 0.55%

REFERENCES

[1] H. Liu, "Evolving feature selection," IEEE Intelligent Systems, vol. 20, no. 6, pp. 64-76, 2005.
[2] S.H. Weiss and N. Indurkhya, Predictive Data Mining: A Practical Guide, Morgan Kaufmann, San Francisco, CA, 1998.
[3] S. Chao and F. Wong, "An incremental decision tree learning methodology regarding attributes in medical data mining," Proc. 8th Int'l Conf. on Machine Learning and Cybernetics, Baoding, pp. 1694-1699, 2009.
[4] R.K. Agrawal and R. Bala, "Incremental Bayesian classification for multivariate normal distribution data," Pattern Recognition Letters, vol. 29, no. 13, pp. 1873-1876, Oct. 2008.
[5] S.U. Guan and J. Liu, "Feature selection for modular networks based on incremental training," Journal of Intelligent Systems, vol. 14, no. 4, pp. 353-383, 2005.
[6] F. Zhu and S.U. Guan, "Ordered incremental training for GA-based classifiers," Pattern Recognition Letters, vol. 26, no. 14, pp. 2135-2151, Oct. 2005.
[7] S.U. Guan and S. Li, "Incremental learning with respect to new incoming input attributes," Neural Processing Letters, vol. 14, no. 3, pp. 241-260, Dec. 2001.
[8] S.U. Guan and S. Li, "Parallel growing and training of neural networks using output parallelism," IEEE Trans. on Neural Networks, vol. 13, no. 3, pp. 542-550, May 2002.
[9] S.U. Guan and J. Liu, "Incremental neural network training with an increasing input dimension," Journal of Intelligent Systems, vol. 13, no. 1, pp. 43-69, 2004.
[10] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[11] S.U. Guan and J. Liu, "Incremental ordered neural network training," Journal of Intelligent Systems, vol. 12, no. 3, pp. 137-172, 2002.
[12] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996.
[13] J.H. Ang, S.U. Guan, K.C. Tan, and A.A. Mamun, "Interference-less neural network training," Neurocomputing, vol. 71, no. 16-18, pp. 3509-3524, 2008.

Ting Wang was born in Wuxi, China, on May 28, 1981. He received his BSc degree in computer science from the China University of Mining and Technology in 2003 and his MSc degree in computer science from the Guilin University of Technology, China, in 2008, and is now a PhD candidate in computer science at the University of Liverpool. From July 2003 to July 2004, he was a system analyst at Micro-Star International Inc. Before starting his PhD program in March 2009, he was a research and development engineer at the Jiangnan Institute of Computing Technology.
He is now interested in artificial intelligence and information management.

Sheng-Uei Guan received his M.Sc. and Ph.D. from the University of North Carolina at Chapel Hill. He is currently a professor and head of the computer science and software engineering department at Xi'an Jiaotong-Liverpool University. Before joining XJTLU, he was a professor and chair in intelligent systems at Brunel University, UK. Prof. Guan worked in a prestigious R&D organization for several years, serving as a design engineer, project leader, and manager. After leaving industry, he joined Yuan-Ze University in Taiwan for three and a half years, where he served as deputy director of the Computing Center and chairman of the Department of Information & Communication Technology. Later he joined the Electrical & Computer Engineering Department at the National University of Singapore as an associate professor.

Fei Liu received her PhD from the Department of Computer Science & Computer Engineering, La Trobe University, in 1998. Before joining the department as an academic staff member in 2002, she worked as a lecturer in the School of Computer & Information Science, University of South Australia, and the School of Computer Science & Information Technology, Royal Melbourne Institute of Technology. She also worked as a software engineer at Ericsson Australia. Her research interests include logic programming, the Semantic Web, and security in electronic commerce. She has taught artificial intelligence, programming, and security in electronic commerce.