International Journal of Engineering Trends and Technology (IJETT) – Volume 8 Number 9 - Feb 2014

Hybrid Feature Selection Algorithm for High Dimensional Database

E. Poonguzhali#1, M. Vijayalakshmi*2, K. Shamily*3, V. Priyadarshini*4
# Assistant Professor, IT, Pondicherry University
* Student, IT, Pondicherry University
India

Abstract— Feature subset selection is applied to select the most important features from the original set of features, and is usually evaluated from the efficiency and effectiveness points of view. It removes irrelevant and redundant features, which reduces the dimensionality of the data, increases its understandability, and helps to avoid slow algorithm performance. In this work, the FCBF (Fast Correlation Based Filter) algorithm is combined with the FAST algorithm, which is already very effective compared with other feature subset selection algorithms, in order to improve the efficiency of FAST. The resulting hybrid method is effective for both text and image data and gives higher accuracy than the other algorithms.

Keywords— Feature selection, Distributed clustering, Hybrid algorithm, Redundancy analysis, Removal of irrelevant data.

I. INTRODUCTION

Feature selection is the process of choosing a subset of the original features by removing irrelevant and redundant data; removing redundant features is equally critical. It is a frequently used technique for high-dimensional databases and increases efficiency. A feature subset may be evaluated in terms of both efficiency and effectiveness: efficiency is the time required to find the subset, and effectiveness is the quality of the subset found. Feature subset selection can be used to access the required data from the database, improving both efficiency and accuracy. The main aim of feature selection is to focus the search on relevant data. Irrelevant and redundant features hurt accuracy, so feature selection should identify such data and remove it as far as possible.

The FAST algorithm is faster than the other algorithms; it can be used to reduce redundancy and irrelevant data. It also produces output quickly and effectively, which reduces time complexity. The efficiency of FAST rests on a minimum spanning tree (MST), which drives its clustering step. The target of clustering is to group the data so that each group is internally homogeneous and different groups are heterogeneous; it organizes the data into groups and searches for the required data within a group. FAST is very effective for microarray and text data, where it produces good accuracy, but its disadvantage is that it does not produce good accuracy on image data. Because of this problem, FCBF (Fast Correlation Based Filter) can be used. FCBF works well for both image and text data. It runs faster than other algorithms used in feature subset selection, such as ReliefF and CFS, and it is able to find redundant features. On most data sets, FCBF increases or at least maintains accuracy.
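Both FAST and FCBF rank features with an information-theoretic correlation measure, symmetric uncertainty, SU(X, Y) = 2 IG(X; Y) / (H(X) + H(Y)). The following is a minimal sketch of this measure, assuming discrete-valued features (continuous features would first need to be discretized); the function names are ours, not from the paper.

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a discrete sequence."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X; Y) / (H(X) + H(Y)); ranges over [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    hxy = entropy(list(zip(x, y)))      # joint entropy H(X, Y)
    ig = hx + hy - hxy                  # information gain = mutual information
    return 2.0 * ig / (hx + hy) if hx + hy > 0 else 0.0
```

Because SU is symmetric in its arguments and normalized to [0, 1], the same measure can serve both for feature-class relevance and for feature-feature redundancy.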
Nowadays, in many applications, data keeps growing in both directions: rows, which indicate the number of instances, and columns, which indicate the number of features. This causes problems for machine learning algorithms. For example, a high-dimensional data set with more than a thousand features may contain a large amount of redundant and irrelevant information, which can degrade the performance of learning algorithms. Hence, for high-dimensional data, feature selection has become necessary.

Feature selection methods fall into two models: the filter model and the wrapper model. The filter model does not involve a learning algorithm in selecting the features. The wrapper model needs a learning algorithm to determine which features to select, but its disadvantage is that it becomes expensive when the number of features grows large. Among the various filter methods, the Fast Correlation Based Filter is very effective. Feature selection is a frequently used technique in data mining because it reduces redundancy, irrelevant data, and dimensionality, and it improves results.
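The filter model can be illustrated with a stand-alone relevance filter that never consults a classifier. The sketch below uses scikit-learn's mutual-information filter as one example of the idea; the library, the synthetic data set, and the choice of k = 10 are our illustrative assumptions, not part of this paper's method.

```python
# Filter model: rank features by a class-relevance score computed
# independently of any learning algorithm, then keep the top k.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)   # keep the 10 highest-scoring features
print(X_reduced.shape)                     # -> (200, 10)
```

A wrapper, by contrast, would retrain a classifier for every candidate subset, which is what makes it expensive at high dimensionality.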
II. LITERATURE REVIEW

A survey of feature subset selection algorithms for high-dimensional data was carried out; some representative works are summarized below.

Mark A. Hall [1999] proposed a correlation-based feature selection algorithm. The paper suggests that the CFS (Correlation-based Feature Selection) algorithm works well for identifying interactive features [1].

Mitra, Murthy, and Pal (2002) proposed an unsupervised feature selection algorithm using feature similarity for data sets that are large in both dimension and size. The paper also demonstrates redundancy reduction and feature selection with an entropy measure [2].

Butterworth, Piatetsky-Shapiro, and Simovici (2005) proposed feature selection through clustering. The paper describes how the analyzed data is given more importance than the raw data in the selection process [3].

Molina, Belanche, and Nebot [2002] presented a survey and experimental evaluation of feature selection algorithms. The paper defines a ranking category for algorithms based on the degree of matching between their outputs and the optimal solution [4].

Pereira, Tishby, and Lee [1993] proposed a text categorization technique based on distributional clustering of English words. The paper shows that the method is a good choice for applications with a limited amount of labeled training data [5].

Yu and Liu [2003] proposed the FCBF algorithm for high-dimensional data: a fast filter method that first identifies the relevant features and then identifies redundancy among the relevant features without pairwise correlation analysis [6].

Yu and Liu performed redundancy and relevance analysis using a correlation-based method, observing that feature relevance alone is not sufficient for feature selection on high-dimensional data; they introduced a new framework combining relevance analysis and redundancy analysis [7].

Zhao and Liu addressed the selection of interacting features. Handling feature interaction can be intractable, and their method is designed to achieve feature selection effectively in its presence [8].

Yu and Liu introduced a novel concept for reducing redundant as well as irrelevant data in a high-dimensional database without pairwise correlation, using the Fast Correlation Based Filter method [9].

Yu and Liu also used a correlation-based approach within the filter model, so that redundancy in high-dimensional data can be removed along with the irrelevant data; feature selection is the most frequently used technique for high-dimensional databases [10].

Kohavi and John studied the problem of selecting the relevant features from a database; they used a wrapper method to select the optimal features from the data set [11].

Yu and Liu proposed an efficient method that compares the relationship between redundancy and relevance; the method effectively removes redundant features [12].

III. EXISTING SYSTEM

The FAST algorithm first removes the irrelevant features. The redundant features are then removed in two steps: the first step constructs a minimum spanning tree over the remaining features, and the second step partitions the minimum spanning tree and selects the required representative features. FAST is accurate on microarray and text data, but it does not produce good accuracy on image data. The methods compared with FAST are ReliefF, FOCUS-SF, Consist, CFS and FCBF. The ReliefF method leads to a mismatch problem: the needed output does not appear. FOCUS-SF does not focus on the particular topic. The problem with the Consist method is that the time required to complete the task is extended, and it also incurs some loss. The CFS (Correlation-based Feature Selection) method is based on priority, so it is not effective. The FCBF (Fast Correlation Based Filter) method is faster than the other methods mentioned above; FCBF is accurate on both text and image data, achieves the greatest degree of dimensionality reduction, and enhances classification accuracy with the predominant features [6].

Fig1: Removal of irrelevant data and redundant data
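FAST's two redundancy-removal steps can be sketched as graph operations. In the sketch below, edge weights are 1 - SU(Fi, Fj), so strongly correlated features are joined by light edges; Kruskal's algorithm builds the MST, and edges heavier than a hypothetical threshold tau are cut to split the tree into feature clusters. The fixed threshold is a simplification of ours: the published FAST criterion removes an edge by comparing the feature-feature SU against each endpoint's feature-class SU instead.

```python
def kruskal_mst(n, edges):
    """edges: list of (weight, i, j) over n vertices; returns the MST edges."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    mst = []
    for w, i, j in sorted(edges):           # consider edges lightest-first
        ri, rj = find(i), find(j)
        if ri != rj:                        # join two components
            parent[ri] = rj
            mst.append((w, i, j))
    return mst

def partition(n, mst, tau):
    """Cut MST edges heavier than tau; return the resulting feature clusters."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for w, i, j in mst:
        if w <= tau:                        # keep only strong (light) edges
            parent[find(i)] = find(j)
    clusters = {}
    for f in range(n):
        clusters.setdefault(find(f), []).append(f)
    return list(clusters.values())
```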
IV. PROPOSED SYSTEM

For feature subset selection, several algorithms have been implemented and compared from the efficiency and effectiveness points of view. Here, the FCBF and FAST algorithms are combined to improve efficiency on all types of data; the hybrid algorithm is proposed to improve efficiency on text, image and microarray data. FAST does not give good accuracy for text and image data, whereas FCBF gives better accuracy for text and image data [2] and is an efficient algorithm for both. FCBF uses the interdependence of features together with their dependence on the class: it selects the best features from the subset of features by means of a backward elimination technique, and it concentrates mainly on the features that depend on the class rather than on the correlation between the features. In FCBF, backward elimination is achieved by means of graph-theoretic clustering.

Fig2: Grouping the relevant data for Feature Selection

This hybrid algorithm can maintain or even improve the accuracy of the data. It works in two steps: first it identifies the relevant features, and then it removes the redundant features. The hybrid algorithm performs the clustering process twice: once the relevant features are identified, it clusters them with the help of the minimum spanning tree construction, and it then clusters the relevant data again for higher output accuracy. The hybrid algorithm eliminates only one feature in each iteration, which makes the elimination process more balanced. By this means, the output is more efficient for microarray, text and image data.

This hybrid (combination of FAST and FCBF) algorithm consists of two stages:

1. The first stage is relevance analysis, which orders the input variables by a threshold value on their symmetric uncertainty with respect to the target output. This stage eliminates the irrelevant variables whose relevance score is below a predefined threshold.

2. The second stage is redundancy analysis, which selects the leading features from the relevant set obtained in the first stage. The selection process continues until the approximate (redundant) variables have been eliminated from the subset.

By this means, it proves to be a fast best-subset selection method that gives more accuracy on text and image data.

MODULES

A. Distributed clustering

1. Distributed clustering is used to cluster words into groups based either on their grammatical relations with other words, as by Pereira et al., or on the distribution of class labels, as by Baker and McCallum. Distributed clustering boosts the quality of the results by combining the sets of clusters obtained from partial views of the data.

2. Each cluster has access to all of the objects. Here the data is clustered according to its class: data is grouped by class rather than by nearness to other data.

3. The partitioning used here is hierarchical clustering. Distributional clustering of words is generally agglomerative in nature, which results in suboptimal word clusters and requires high computational cost, so a new information-theoretic clustering algorithm was proposed for word clustering and applied to text classification; it clusters features based on a special distance metric over a hierarchy of features in order to choose the best features.

4. K-means and K-medoids are used as the partitioning algorithms. Hierarchical clustering partitions the data into different levels, in the manner of a hierarchy; this kind of clustering helps in the visualization and summarization of data.

B. Subset selection algorithm

1. Irrelevant and redundant features severely affect the accuracy of learning algorithms. Feature subset selection identifies the relevant features and removes as much of the irrelevant information as possible.

2. Good feature subset selection selects the best features from the original set: those that are highly correlated with the class label yet uncorrelated with each other.

3. We propose a novel concept of predominant correlation, used to analyze redundant data with the fast correlation based filter approach, and then propose a FAST algorithm with less than quadratic time complexity.

4. The filter model separates feature selection from classifier learning, so that the selected features are independent of any learning algorithm. The main aim of feature selection is to reduce the number of features and to improve the computation of predictions.

C. Time complexity

The algorithm has linear time complexity in terms of the number of features. FCBF can remove a number of redundant features in the current iteration. The major work of the algorithm is the computation of the relevance analysis and the correlation analysis. In the best case, all of the remaining features in the ranked list are removed at once; on average, half of the remaining features are removed in each iteration. The relevant features are selected in the first part; in the limiting case k = 1, only a single feature is selected. A sketch of this ranked-list elimination is given below.
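The elimination loop described above can be sketched as follows. This is a hedged reconstruction of FCBF-style redundancy removal under the predominant-correlation heuristic, reusing symmetric_uncertainty from the earlier sketch; the function name fcbf_select and the relevance threshold delta are our illustrative assumptions.

```python
# Rank features by SU with the class, then walk the ranked list and drop any
# later feature Fj that is more correlated with an already-kept feature Fi
# than with the class, i.e. SU(Fi, Fj) >= SU(Fj, C).
def fcbf_select(X, y, delta=0.0):
    """X: (n_samples, n_features) array of discrete values; y: class labels."""
    n_features = X.shape[1]
    su_class = [symmetric_uncertainty(X[:, f], y) for f in range(n_features)]
    ranked = sorted((f for f in range(n_features) if su_class[f] > delta),
                    key=lambda f: su_class[f], reverse=True)
    selected = []
    for fj in ranked:
        # keep fj only if no kept feature "predominates" it
        if all(symmetric_uncertainty(X[:, fi], X[:, fj]) < su_class[fj]
               for fi in selected):
            selected.append(fj)
    return selected
```

In the best case the first kept feature predominates everything after it; on average each kept feature removes a fraction of the remaining ranked list, which matches the complexity discussion above.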
D. Data Resource

The proportion of features removed by each selection algorithm is compared across the data sets; this indicates that the six algorithms work well for microarray data and are good for text and image data. To evaluate the performance and effectiveness of our proposed hybrid algorithm, and to verify that the method is potentially useful in practice by allowing researchers to confirm the results, 35 publicly available data sets were used. The number of features of the 35 data sets varies from 37 to 49,152 with a mean of 7,874. The dimensionality of 54.3 percent of the data sets exceeds 5,000, and 28.6 percent of the data sets have more than 10,000 features. The 35 data sets cover a range of applications such as text, image and microarray data classification with continuous-valued features.

E. Microarray data

The proportion of selected features is improved by the six feature selection algorithms compared with the given data sets, which shows that the six algorithms work well for microarray data. For microarray data, the hybrid algorithm ranks 1; for image data and microarray data, FCBF ranks 1. For image data, the classification accuracy of Naive Bayes can be improved by FAST and FCBF. This indicates that the hybrid algorithm outperforms all the other algorithms. Of the six algorithms, CFS cannot handle the two data sets with the largest dimensionalities.

F. Irrelevant feature

Irrelevant features reduce accuracy, hence they should be removed along with the redundant data. Irrelevant feature removal is simple and straightforward once the relevant features have been identified, but identifying which features are irrelevant is the difficult task. In our proposed hybrid algorithm, this involves three steps (a combined sketch is given after the list):

1. Construction of the minimum spanning tree from the weighted complete graph.

2. Partitioning of the minimum spanning tree into a forest, with each tree representing a cluster.

3. Selection of the representative features from each cluster.
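Putting the pieces together, the following hypothetical end-to-end run combines the three steps above, reusing symmetric_uncertainty, kruskal_mst, and partition from the earlier sketches. The random data and the thresholds delta and tau are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 20))      # 100 samples, 20 discrete features
y = rng.integers(0, 2, size=100)            # binary class labels

# Stage 1: relevance analysis -- keep features whose SU with the class
# exceeds delta.
delta = 0.01
relevant = [f for f in range(X.shape[1])
            if symmetric_uncertainty(X[:, f], y) > delta]

# Stage 2: redundancy analysis -- build the MST over the relevant features,
# partition it into a forest, and keep the most class-relevant feature of
# each resulting cluster.
edges = [(1 - symmetric_uncertainty(X[:, a], X[:, b]), i, j)
         for i, a in enumerate(relevant)
         for j, b in enumerate(relevant) if i < j]
mst = kruskal_mst(len(relevant), edges)
clusters = partition(len(relevant), mst, tau=0.95)
selected = [max(c, key=lambda i: symmetric_uncertainty(X[:, relevant[i]], y))
            for c in clusters]
print([relevant[i] for i in selected])      # indices of the chosen features
```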
CONCLUSION

In this paper we propose a novel correlation concept that provides an efficient way of analyzing feature redundancy, and we design a hybrid approach around it. The new hybrid algorithm is implemented and evaluated through extensive experiments against the other feature selection algorithms. Our approach demonstrates its efficiency and effectiveness in dealing with high-dimensional data for classification. Our future work will extend this work to even higher dimensionality, on the order of many thousands of features; the hybrid algorithm will also be more efficient if it runs on multiprocessor systems.

REFERENCES

[1] M. Hall, "Correlation-Based Feature Selection for Machine Learning," doctoral dissertation, Dept. of Computer Science, University of Waikato, 1999.
[2] P. Mitra, C.A. Murthy, and S.K. Pal, "Unsupervised Feature Selection Using Feature Similarity," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, pp. 301-312, 2002.
[3] R. Butterworth, G. Piatetsky-Shapiro, and D.A. Simovici, "On Feature Selection through Clustering," Proc. IEEE Fifth Int'l Conf. Data Mining, pp. 581-584, 2005.
[4] L.C. Molina, L. Belanche, and A. Nebot, "Feature Selection Algorithms: A Survey and Experimental Evaluation," Proc. IEEE Int'l Conf. Data Mining, pp. 306-313, 2002.
[5] F. Pereira, N. Tishby, and L. Lee, "Distributional Clustering of English Words," Proc. 31st Ann. Meeting of the Assoc. for Computational Linguistics, pp. 183-190, 1993.
[6] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," Proc. 20th Int'l Conf. Machine Learning, vol. 20, no. 2, pp. 856-863, 2003.
[7] L. Yu and H. Liu, "Efficient Feature Selection via Analysis of Relevance and Redundancy," J. Machine Learning Research, vol. 5, pp. 1205-1224, 2004.
[8] Z. Zhao and H. Liu, "Searching for Interacting Features," Proc. 20th Int'l Joint Conf. Artificial Intelligence, 2007.
[9] L. Yu and H. Liu, "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution," Proc. 20th Int'l Conf. Machine Learning, vol. 20, no. 2, pp. 856-863, 2003.
[10] L. Yu and H. Liu, "Efficiently Handling Feature Redundancy in High-Dimensional Data," Proc. Ninth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '03), pp. 685-690, 2003.
[11] R. Kohavi and G.H. John, "Wrappers for Feature Subset Selection," Artificial Intelligence, vol. 97, nos. 1/2, pp. 273-324, 1997.
[12] L. Yu and H. Liu, "Redundancy Based Feature Selection for Microarray Data," Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, pp. 737-742, 2004.