Hybrid Feature Selection Algorithm for High Dimensional Database
E. Poonguzhali#1, M. Vijayalakshmi*2, K. Shamily*3, V. Priyadarshini*4
Assistant Professor, IT & Pondicherry University
Student, IT & Pondicherry University
Abstract— Feature subset selection is mainly applied to select
the most important features from the original set of features.
Feature subset selection is mainly evaluated from the efficiency
and effectiveness points of view. It is important for removing
irrelevant and redundant features, which reduces the
dimensionality of the data, increases the understandability of
the data, and helps to avoid slow performance of the learning
algorithm. The FCBF (Fast Correlation-Based Filter) algorithm is
implemented for feature subset selection. The FAST algorithm is
very effective when compared with other algorithms. FCBF is
combined with the FAST algorithm to improve the efficiency of
the FAST algorithm. This hybrid method is more effective for
both text and image data, and it also gives higher accuracy
than the other algorithms.
Keywords— Feature selection, Distributed clustering, Hybrid
algorithm, Redundancy analysis, Removal of irrelevant data.
Feature selection is the process of choosing a subset
of the original features by removing irrelevant and
redundant data. Removing redundant features is as
critical as removing irrelevant ones. Feature selection
is a frequently used technique for high dimensional
databases. A feature subset may be evaluated in terms
of both efficiency and effectiveness. Efficiency
concerns the time required to find a subset of
features, and effectiveness concerns the quality of
that subset. Feature subset selection can be used for
accessing the required data from the database, and it
increases both efficiency and accuracy. The main aim
of feature selection is to focus the search on
relevant data. Irrelevant features and redundant
features affect accuracy, so feature selection should
be able to identify them and remove as many of them
as possible.
The FAST algorithm is faster than the other
algorithms, and it can be used to reduce redundant
and irrelevant data. It also produces its output
quickly and effectively, and hence reduces the
running time. The efficiency of the FAST algorithm
is ensured by a minimum spanning tree (MST), which
is used as the clustering method. The goal of
clustering is to group the data so that data within
a group are homogeneous and data in different groups
are heterogeneous. It organizes the data into groups
and searches for the required data within a group.
The FAST algorithm is very effective for microarray
data and text data, where it produces good accuracy,
but its disadvantage is that it does not produce
good accuracy on image data. To address this
problem, FCBF (Fast Correlation-Based Filter) can be
used.
FCBF is good for both image and text data. FCBF
runs faster than other algorithms such as ReliefF
and CFS, which are also used for feature subset
selection. FCBF is able to identify redundant
features. On most data sets, FCBF maintains or
increases accuracy. Nowadays, in many applications
the data grow in both dimensions (rows and columns),
where rows indicate the number of instances and
columns indicate the number of features. This causes
problems for machine learning algorithms. For
example, a high dimensional data set with thousands
of features may contain a large amount of redundant
and irrelevant information, which may reduce the
performance of machine learning algorithms. Hence,
for high dimensional data, feature selection has
become necessary. Feature selection methods fall
into two models: the filter model and the wrapper
model. The filter model does not involve a learning
algorithm in selecting the features. The wrapper
model needs a learning algorithm to determine which
features are selected; its disadvantage is that it
becomes expensive when the number of features
becomes large. Among the various methods, the Fast
Correlation-Based Filter is very effective. Feature
selection is a frequently used technique in data
mining because it reduces redundancy, irrelevant
data and dimensionality. It also improves the
results.
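The difference between the filter and wrapper models can be illustrated with a minimal sketch. The function names `filter_select` and `wrapper_select`, the `threshold` parameter and the `evaluate` callback are hypothetical and introduced only for illustration; the greedy forward search is one common wrapper strategy, assumed here, and is not taken from the paper.

```python
from typing import Callable, List, Sequence


def filter_select(scores: Sequence[float], threshold: float) -> List[int]:
    """Filter model: keep the features whose precomputed relevance score
    reaches the threshold. No learning algorithm is consulted."""
    return [i for i, s in enumerate(scores) if s >= threshold]


def wrapper_select(n_features: int,
                   evaluate: Callable[[List[int]], float]) -> List[int]:
    """Wrapper model: greedy forward selection.

    `evaluate` is assumed to train and score a learner on the given
    feature subset, so every call is expensive. In the worst case it is
    called on the order of n_features ** 2 times.
    """
    selected: List[int] = []
    best_score = float("-inf")
    while True:
        best_candidate = None
        for f in range(n_features):
            if f in selected:
                continue
            score = evaluate(selected + [f])
            if score > best_score:
                best_score, best_candidate = score, f
        if best_candidate is None:
            # No remaining feature improves the learner's score.
            return selected
        selected.append(best_candidate)
```

The filter needs only one score per feature, whereas the wrapper re-evaluates the learner for every candidate subset, which is why it becomes expensive as the number of features grows.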
A survey of feature subset selection algorithms for
high dimensional data was carried out, and some
examples are given as follows. Mark A. Hall [1999]
proposed the Correlation-based Feature Selection
(CFS) algorithm. This paper suggests that the CFS
algorithm works well for identifying interactive
features [1].
P. Mitra, C.A. Murthy, and S.K. Pal (2002) proposed
an unsupervised feature selection algorithm using
feature similarity for large data sets that are high
in both dimension and size. This paper also showed
how redundancy and feature selection can be
evaluated with an entropy measure [2].
R. Butterworth, G. Piatetsky-Shapiro, and D.A.
Simovici (2005) proposed feature selection through
clustering. This paper states that the analyzed data
are given more importance than the data in the
selection process [3].
L.C. Molina, L. Belanche, and A. Nebot [2002]
carried out a survey and evaluation of feature
selection algorithms. This paper defines a ranking
of algorithms based on the degree of matching
between their outputs and the optimal solution [4].
F. Pereira, N. Tishby, and L. Lee [1993] proposed
distributional clustering of English words as a text
categorization technique. This paper states that the
method is a good choice for applications that have a
limited amount of labeled training data [5].
L. Yu and H. Liu [2003] proposed the FCBF algorithm
for high dimensional data. This paper proposed a
fast filter method that identifies the relevant
features as well as the redundancy among the
relevant features without pairwise correlation
analysis [6].
L. Yu and H. Liu performed redundancy and relevance
analysis using a correlation-based method, since
feature relevance alone is not sufficient for
feature selection in high dimensional data. In this
paper, they introduced a new framework that combines
redundancy analysis and relevance analysis [7].
Z. Zhao and H. Liu proposed a method for selecting
interacting features. Handling feature interaction
can be intractable. Their treatment of feature
interaction was developed to achieve feature
selection in an effective way [8].
L. Yu and H. Liu introduced a novel concept for
reducing redundant as well as irrelevant data in a
high dimensional database, without pairwise
correlation analysis, using the Fast
Correlation-Based Filter method [9].
L. Yu and H. Liu noted that feature selection is the
most frequently used technique for high dimensional
databases. In this paper they used a
correlation-based approach within the filter model,
so that the redundancy in high dimensional data can
be removed along with the irrelevant data [10].
R. Kohavi and G.H. John addressed the problem in
feature subset selection of selecting the relevant
features from the database. In this paper they used
a wrapper method to select the optimal features from
the data set [11].
L. Yu and H. Liu proposed an efficient method based
on comparing the relationship between redundancy and
relevance. This method effectively removes redundant
features [12].
The FAST algorithm first removes the irrelevant
data. The redundant data are then removed in two
steps. The first step is the construction of a
minimum spanning tree. The second step is
partitioning the minimum spanning tree and selecting
the required features. It provides good accuracy on
microarray data and text data, but not on image
data. The other methods considered with FAST are
ReliefF, FOCUS-SF, Consist, CFS and FCBF. The
ReliefF method leads to a problem called mismatch,
in which the needed output does not appear. FOCUS-SF
does not focus on a particular topic. The problem
with the Consist method is that the time required to
complete the task is extended and it also incurs
some loss. The CFS (Correlation-based Feature
Selection) method is based on priority, so it is not
effective. The FCBF (Fast Correlation-Based Filter)
method is faster than the other methods mentioned
above. FCBF gives good accuracy on text data and
image data. FCBF achieves the greatest degree of
dimensionality reduction and enhances classification
accuracy with the predominant features [6]. The
hybrid algorithm can maintain or even improve the
accuracy on the data.
This hybrid algorithm works in two steps. First it
identifies the relevant features, and then it
removes the redundant features. The hybrid algorithm
performs the clustering process twice. Once it has
identified the relevant features, it clusters them
with the help of minimum spanning tree construction.
Then it clusters the relevant data again for a more
accurate output. The hybrid algorithm eliminates
only one feature in each iteration, which makes the
elimination process more balanced. By this means,
the output is more efficient for microarray, text
and image data.
Fig1: Removal of irrelevant data and redundant data
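Clustering the relevant features with a minimum spanning tree can be sketched as follows. This is only an illustration under stated assumptions: the paper does not specify the edge weights or the partitioning rule, so the sketch assumes a symmetric matrix `dist` of feature-to-feature distances (for example, one minus a correlation measure), cuts every tree edge longer than a hypothetical `max_dist`, and keeps the most relevant feature of each resulting cluster. All function and parameter names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def minimum_spanning_tree(dist: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Prim's algorithm on a complete graph given as a symmetric matrix."""
    n = len(dist)
    if n == 0:
        return []
    visited = [False] * n
    visited[0] = True
    edges: List[Tuple[int, int]] = []
    for _ in range(n - 1):
        best = None  # (weight, u, v) of the cheapest edge leaving the tree
        for u in range(n):
            if not visited[u]:
                continue
            for v in range(n):
                if visited[v]:
                    continue
                if best is None or dist[u][v] < best[0]:
                    best = (dist[u][v], u, v)
        _, u, v = best
        visited[v] = True
        edges.append((u, v))
    return edges


def mst_clusters(dist: Sequence[Sequence[float]], max_dist: float) -> List[List[int]]:
    """Partition the MST by cutting edges longer than max_dist (assumed rule)."""
    n = len(dist)
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in minimum_spanning_tree(dist):
        if dist[u][v] <= max_dist:      # edge survives the cut
            parent[find(u)] = find(v)
    groups: Dict[int, List[int]] = {}
    for f in range(n):
        groups.setdefault(find(f), []).append(f)
    return sorted(groups.values())


def representatives(dist: Sequence[Sequence[float]],
                    relevance: Sequence[float],
                    max_dist: float) -> List[int]:
    """Pick the feature with the highest relevance from each cluster."""
    return [max(cluster, key=lambda f: relevance[f])
            for cluster in mst_clusters(dist, max_dist)]
```

Features that end up in the same cluster are treated as redundant with each other, so only one representative per cluster is retained.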
Among feature subset selection methods, several
algorithms have been implemented from the efficiency
and effectiveness points of view. The FCBF and FAST
algorithms are combined to improve efficiency for
all types of data.
The hybrid algorithm is proposed to improve
efficiency on text, image and microarray data. The
FAST algorithm does not give better accuracy for
text and image data. FCBF gives better accuracy for
text and image data [2].
FCBF is an efficient algorithm for text and image
data. FCBF uses the interdependence of features
together with their dependence on the class. FCBF
selects the best features from the subset of
features by means of a backward elimination
technique. FCBF mainly concentrates on the features
that are dependent on the class rather than on the
correlation between features. In FCBF, backward
elimination is achieved by means of graph-theoretic
clustering.
Fig2: Grouping the relevant data for Feature Selection
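The dependence of a feature on the class, and of one feature on another, is measured in FCBF by symmetric uncertainty, the measure this paper names in its relevance analysis. For reference, its standard definition is given below. The symbols X, Y, H and IG are notation assumed here and do not appear in the original text.

```latex
\begin{align}
H(X) &= -\sum_{x} p(x)\,\log_2 p(x) \\
H(X \mid Y) &= -\sum_{y} p(y) \sum_{x} p(x \mid y)\,\log_2 p(x \mid y) \\
\mathit{IG}(X \mid Y) &= H(X) - H(X \mid Y) \\
\mathit{SU}(X, Y) &= \frac{2\,\mathit{IG}(X \mid Y)}{H(X) + H(Y)}
\end{align}
```

SU takes values in [0, 1]: a value of 1 means that either variable completely predicts the other, and a value of 0 means that the two are independent.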
This hybrid algorithm (a combination of FAST and
FCBF) consists of two stages:
1. The first stage is relevance analysis, which
orders the input variables by their symmetric
uncertainty with respect to the target output. This
stage eliminates the irrelevant variables whose
relevance score is below a predefined threshold.
2. The second stage is redundancy analysis, which
aims at selecting the leading features from the
relevant set obtained in the first stage. This
selection process continues until the redundant
variables are eliminated from the subset. By this
means, it proves to be a fast subset selection
method that gives higher accuracy on text and image
data.
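The two stages can be sketched for discrete features as follows. This is a minimal illustration, not the authors' implementation: the redundancy rule used here (drop a feature when its symmetric uncertainty with an already selected feature is at least its own relevance to the class) is the rule commonly associated with FCBF and is an assumption, as are all function and parameter names.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x: Sequence[Hashable], y: Sequence[Hashable]) -> float:
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0                       # both variables are constant
    joint = entropy(list(zip(x, y)))     # H(X, Y)
    info_gain = hx + hy - joint          # equals H(X) - H(X | Y)
    return 2.0 * info_gain / (hx + hy)


def two_stage_select(features: Sequence[Sequence[Hashable]],
                     target: Sequence[Hashable],
                     threshold: float) -> List[int]:
    """Return the indices of the selected features."""
    # Stage 1: relevance analysis. Score each feature against the target,
    # drop those below the threshold, and rank the rest in descending order.
    relevance = [symmetric_uncertainty(f, target) for f in features]
    ranked = sorted((i for i, r in enumerate(relevance) if r >= threshold),
                    key=lambda i: relevance[i], reverse=True)

    # Stage 2: redundancy analysis (assumed rule, see the note above).
    selected: List[int] = []
    for i in ranked:
        if all(symmetric_uncertainty(features[j], features[i]) < relevance[i]
               for j in selected):
            selected.append(i)
    return selected
```

Continuous-valued features would need to be discretized before this sketch applies, because the entropy estimate counts distinct values.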
A. Distributed clustering
1. Distributed clustering is used to cluster words
into groups based on their grammatical relations
with other words, by Pereira et al., or based on the
distribution of class labels, by Baker and
McCallum. Distributed clustering is used to boost
the quality of results by combining the set of
clusters obtained from the partial views of the
data.
2. Every cluster has access to all the objects.
Here the data are clustered according to the class:
data are not clustered by their nearest neighbours
but by class.
3. The partitioning used here is hierarchical
clustering. Generally, distributed clustering of
words is agglomerative in nature, which results in
suboptimal word clusters and requires high
computational cost. A new information-theoretic
clustering algorithm was therefore proposed for word
clustering and applied to text classification. It
clusters features based on a special metric called
distance, and it uses the hierarchy of features to
choose the best features.
4. Here K-means and K-medoids are used as the
partitioning algorithms. Hierarchical clustering
partitions the data into different levels, which are
arranged in a hierarchical manner. This kind of
clustering helps in visualization and summarization
of data.
B. Subset selection algorithm
1. Irrelevant features and redundant features
severely affect the accuracy of learning algorithms.
Feature subset selection is used to identify the
relevant features and remove as much irrelevant
information as possible.
2. Good feature subset selection selects the best
features from the original set: those that are
highly correlated with the class label yet
uncorrelated with the other features.
3. We propose a novel concept of predominant
correlation, which is used to analyze redundant data
with the Fast Correlation-Based Filter approach, and
then propose a FAST algorithm with less than
quadratic time complexity.
4. The filter model separates feature selection from
classifier learning, so that the best features are
selected independently of any learning algorithm.
The main aim of feature selection is to reduce the
number of features and to improve prediction.
C. Time complexity
The algorithm has linear time complexity in terms of
the number of features. FCBF can remove a number of
redundant features in the current iteration. The
major amount of work for the algorithm involves the
computation of relevance analysis and correlation
analysis. The best case would be that all of the
remaining features in the ranked list are removed.
In the average case, half of the remaining features
are removed in each iteration. Suppose k relevant
features are selected in the first part; when k = 1,
only one feature is selected.
D. Data Resource
The proportion of selected features obtained by each
selection algorithm is compared with that of the
original data sets. This indicates that the six
algorithms work well for microarray data and are
good for text and image data. For the purpose of
evaluating the performance and effectiveness of our
proposed hybrid algorithm, verifying that the method
is potentially useful in practice, and allowing
researchers to confirm the results, 35 publicly
available data sets were used. The number of
features of the 35 data sets varies from 37 to
49,152, with a mean of 7,874. The dimensionality of
54.3 percent of the data sets exceeds 5,000, of
which 28.6 percent have more than 10,000 features.
The 35 data sets cover a range of applications such
as text, image and microarray data classification
with continuous-valued features.
E. Microarray data
The proportion of selected features is improved by
each of the six feature selection algorithms
compared with that of the given data sets. This
shows that the six algorithms work well for
microarray data. For microarray data the hybrid
algorithm ranks 1. For image data and microarray
data FCBF ranks 1.
3. The representative features can be selected from
the subset.
In this paper we propose a novel correlation
technique that introduces an efficient way of
analyzing feature redundancy, and we design a hybrid
approach. The new hybrid algorithm is implemented
and evaluated through extensive experiments in which
it is compared with other feature selection
algorithms. Our approach demonstrates its efficiency
and effectiveness in dealing with high dimensional
data for classification. Our future work will extend
this work to higher dimensionality, such as
thousands of features. The hybrid algorithm will be
more efficient if it runs on multiprocessor systems.
