Practical evaluation of feature selection methods
Vladislav Dolganov, Ivan Smetannikov, Fedor Tsarev
Abstract. Feature selection is an important step in data preprocessing for machine learning. In this paper we report on a practical evaluation of feature selection methods. We use several methods to rank features, most of them based on the correlation between a feature and the target vector. We also describe a new feature selection algorithm that combines the feature rankings produced by different methods. We compare the results of each method independently using AUC (the area under the receiver operating characteristic curve) on a user classification problem, with a logistic regression classifier providing the true positive and false positive rates needed for the AUC calculation. The results show that our algorithm provides the feature set with the best AUC value.
Keywords: Machine learning, feature selection, classification, internet databases.
1 Introduction
Nowadays, machine learning is widely used in science and industry. One of the
most common problems of machine learning is the classification problem. There are
many applications where it is necessary to classify large amounts of data, for example targeted advertising, social networks, and tumor recognition. In such problems the number of features used for classification can reach hundreds of thousands and sometimes even millions. Thus the problem of decreasing the number of features for faster processing arises. Moreover, discarding useless features can improve the accuracy of the classifier in some problems. Two approaches to reducing the dimensionality of the feature space are known: feature selection and feature extraction [1]. In the first approach some features are discarded, so the resulting set is a subset of the original feature set. In the second approach new features are created from the old ones. In this paper we consider only feature selection methods.
There are a number of commonly used feature selection methods. They can be divided into three types of techniques, each with its own advantages and disadvantages [2]. Filtering techniques filter features according to some criterion: a feature metric is defined and used for feature evaluation, and then the subset of features with the best metric values is selected and used as input for a classification algorithm. In the second approach, the wrapper technique, several feature subsets are generated and their effectiveness is evaluated with a classification algorithm; this approach trains a new model for each subset, so it is computationally expensive. The third type, embedded techniques, exploits particular properties of the chosen classification algorithm.
Previous studies showed that embedded and filtering methods are rather fast in comparison with wrapper methods [3]. In this paper we use only embedded and filtering techniques due to the large number of features in the training and test sets.
The problem used for evaluating the methods was provided by Mail.Ru Group [4] as a real industrial problem. The task was formulated as follows: determine how much users watch TV using the list of websites they have visited. The initial data contained 30000 users with information about how much they watch TV and which websites they had visited in the previous 3 months. There were about 170000 different websites. For each website, a binary feature indicates whether the user visited this site or not. Before processing, all data were anonymized: websites and users were replaced with numeric identifiers.
The paper is structured as follows. Section 2 describes the feature evaluation methods; these methods are the basis of the filtering techniques and have a strong impact on classifier performance. Our feature selection method is described in Section 3. Section 4 presents the experimental results, and Section 5 concludes.
2 Feature evaluation algorithms
In this section, we briefly describe the implemented methods. We chose commonly used feature ranking methods; each of them computes a score for every feature and then ranks the features by these scores.
2.1 Symmetrical uncertainty
One approach to solving the feature selection problem is to consider features as random variables. For each feature X it is possible to calculate its entropy as

H(X) = -\sum_i p(x_i) \log_2 p(x_i),    (1)
where p(x_i) is the prior probability of value x_i. For any two features X and Y, the conditional entropy of X given Y is defined as

H(X|Y) = -\sum_j p(y_j) \sum_i p(x_i|y_j) \log_2 p(x_i|y_j),    (2)
where p(x_i|y_j) is the posterior probability of x_i given the value y_j. It is also possible to calculate the information gain:

IG(X|Y) = H(X) - H(X|Y).    (3)
Information gain is a symmetrical measure but it needs to be normalized to be
comparable across different features. Symmetrical uncertainty [5] is a normalized
information gain measure:
SU(X, Y) = \frac{2\,IG(X|Y)}{H(X) + H(Y)}.    (4)
It normalizes information gain to the range [0, 1]. If SU(X, Y) equals zero, then X and Y are independent. If SU(X, Y) equals one, then X and Y are fully correlated.
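As an illustration, symmetrical uncertainty can be estimated from empirical value frequencies as in the following sketch (illustrative code, not part of the original study; the function names are ours):

```python
# Illustrative sketch of eqs. (1)-(4); not the original experimental code.
import numpy as np
from collections import Counter

def entropy(values):
    """Empirical entropy H(X) in bits, estimated from value frequencies, eq. (1)."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def conditional_entropy(x, y):
    """Empirical conditional entropy H(X|Y), eq. (2)."""
    x, y = np.asarray(x), np.asarray(y)
    h = 0.0
    for y_val in np.unique(y):
        mask = (y == y_val)
        h += mask.mean() * entropy(x[mask])
    return h

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), eq. (4); lies in [0, 1]."""
    ig = entropy(x) - conditional_entropy(x, y)   # information gain, eq. (3)
    denom = entropy(x) + entropy(y)
    return 2.0 * ig / denom if denom > 0 else 0.0
```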
2.2 Spearman rank correlation coefficient
The Spearman correlation coefficient [6] is defined as the Pearson correlation coefficient between ranked variables:

\rho = \frac{\sum_i (x_{ij} - \bar{x}_j)(y_i - \bar{y})}{\sqrt{\sum_i (x_{ij} - \bar{x}_j)^2 \sum_i (y_i - \bar{y})^2}},    (5)
where i is the instance index, j is the feature index, \bar{y} is the mean of the target vector, and \bar{x}_j is the mean of the j-th feature. The value of ρ lies in [-1, 1]. Features whose |ρ| is closest to one are the most strongly correlated with the target, while values close to zero indicate little or no correlation.
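For illustration, a per-feature Spearman ranking could be computed with scipy as in this sketch (illustrative code, assuming a dense feature matrix X and target vector y; names are ours):

```python
# Illustrative sketch of a Spearman-based ranking; not the original experimental code.
import numpy as np
from scipy.stats import spearmanr

def rank_by_spearman(X, y):
    """Return feature indices sorted by |rho| with the target, strongest first.

    X is an (n_instances, n_features) array, y is the target vector.
    """
    scores = []
    for j in range(X.shape[1]):
        rho, _ = spearmanr(X[:, j], y)            # eq. (5) on ranked variables
        scores.append(0.0 if np.isnan(rho) else abs(rho))   # constant feature -> 0
    return np.argsort(-np.array(scores))
```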
2.3 Value difference metric
In this section the Value Difference Metric (VDM) [5] for a binary target vector is described. The VDM between two distributions X_1 and X_2 is defined as

VDM(X_1, X_2) = \frac{1}{2} \int |p(X_1 = x) - p(X_2 = x)| \, dx.    (6)
The VDM relevance between a feature and the class is defined as

VDM(X, Y) = \frac{1}{2} \int |p(X = x | c_1) - p(X = x | c_2)| \, dx,    (7)
where c_1 and c_2 are the possible class values. In the discrete case the integral is replaced by a sum, so equation (7) can be rewritten as

VDM(X, Y) = \frac{1}{2} \sum_i |p(X = x_i | c_1) - p(X = x_i | c_2)|,    (8)
where the sum runs over the possible values x_i of the feature. The resulting relevance lies in the range [0, 1]. Features with values closest to zero have the lowest relevance to the class vector, and features with values closest to one have the highest relevance to it.
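As an illustration, equation (8) for a discrete feature and a binary class vector can be computed as follows (illustrative sketch; names are ours):

```python
# Illustrative sketch of eq. (8); not the original experimental code.
import numpy as np

def vdm_relevance(x, y, c1=0, c2=1):
    """VDM relevance of a discrete feature x to a binary class vector y, eq. (8)."""
    x, y = np.asarray(x), np.asarray(y)
    values = np.unique(x)
    p1 = np.array([np.mean(x[y == c1] == v) for v in values])   # p(X = v | c1)
    p2 = np.array([np.mean(x[y == c2] == v) for v in values])   # p(X = v | c2)
    return 0.5 * np.sum(np.abs(p1 - p2))
```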
2.4 Fit criterion
Fit criterion [5] is a measure similar to the z-score used in statistics. The decision whether a point x belongs to distribution X_1 or to distribution X_2 is defined as

FCP(x, X_1, X_2) =
\begin{cases}
1, & \text{if } \dfrac{|x - \bar{X}_1|}{\mathrm{var}(X_1)} < \dfrac{|x - \bar{X}_2|}{\mathrm{var}(X_2)}, \\
2, & \text{if } \dfrac{|x - \bar{X}_1|}{\mathrm{var}(X_1)} > \dfrac{|x - \bar{X}_2|}{\mathrm{var}(X_2)}.
\end{cases}    (9)
For k distributions and a feature, this formula can be generalized to

FCP(x, X) = \arg\min_{i=1,\dots,k} \frac{|x - \bar{X}_i|}{\mathrm{var}(X_i)}.    (10)
The resulting fit criterion weight of a feature is the normalized sum of an indicator that returns one or zero depending on the correctness of the FCP prediction:

FC(X_k, Y) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}_{\mathrm{FCP}(x_k^i,\,[X_k \mid y_i = c_1],\,[X_k \mid y_i = c_2]) = y_i},    (11)

where N is the number of instances and x_k^i is the value of feature X_k for the i-th instance.
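One possible reading of equations (9)-(11) in code, for a binary class vector, is the following sketch (illustrative code, not part of the original study):

```python
# Illustrative sketch of eqs. (9)-(11); not the original experimental code.
import numpy as np

def fit_criterion(x, y, c1=0, c2=1):
    """Fit criterion weight of feature x for a binary class vector y, eq. (11)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    x1, x2 = x[y == c1], x[y == c2]
    m1, m2 = x1.mean(), x2.mean()
    v1, v2 = x1.var(), x2.var()
    # FCP, eqs. (9)-(10): assign each point to the class whose distribution it fits best.
    d1 = np.abs(x - m1) / v1 if v1 > 0 else np.full_like(x, np.inf)
    d2 = np.abs(x - m2) / v2 if v2 > 0 else np.full_like(x, np.inf)
    predicted = np.where(d1 < d2, c1, c2)
    return float(np.mean(predicted == y))    # fraction of correct FCP predictions
```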
2.5 χ² statistic
The χ² statistic [7] evaluates the lack of independence between feature t and class c. The feature is evaluated by the following formula:

χ²(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)},    (12)
where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of instances. Larger values of the statistic indicate stronger dependence between the feature and the class; values close to zero indicate near-independence.
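For a binary feature and a binary class, the counts A, B, C and D form a 2x2 contingency table, and equation (12) can be computed directly (illustrative sketch; names are ours):

```python
# Illustrative sketch of eq. (12); not the original experimental code.
import numpy as np

def chi_squared(t, c):
    """Chi-squared statistic between a binary feature t and a binary class c, eq. (12)."""
    t, c = np.asarray(t, dtype=bool), np.asarray(c, dtype=bool)
    A = float(np.sum(t & c))      # t and c co-occur
    B = float(np.sum(t & ~c))     # t occurs without c
    C = float(np.sum(~t & c))     # c occurs without t
    D = float(np.sum(~t & ~c))    # neither t nor c occurs
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom > 0 else 0.0
```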
3 Feature set generation
In this section we describe the proposed methods for choosing the final feature set using the feature scores provided by the feature evaluation algorithms.
The first approach, which we call raw sorted sets, sorts features using the score
from one of the feature evaluation methods, and takes the top N features. These N
features are defined as the new feature set.
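In code, this amounts to a single sort over the scores (a minimal illustrative sketch, not part of the original study):

```python
# Illustrative sketch of the raw sorted sets approach.
import numpy as np

def raw_sorted_set(scores, n):
    """Indices of the n features with the highest score, best first."""
    return np.argsort(-np.asarray(scores))[:n]
```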
The main idea of our second approach is inspired by the fact that an ensemble of classifiers often works better than each of the classifiers separately. That is why we decided to combine several ranking measures for better effectiveness. This approach, which we call the mixture method, uses several feature evaluation measures at once. In the first step, the top feature of each ranked feature list is selected. In the second step, these features are added to the resulting set if they are not already in it. In the third step, the current top features are removed from each ranked feature list. All three steps are repeated until N features are selected for the new feature set. In this paper we used combinations of two and three ranking methods.
For example, suppose we need to select 6 features out of 10 and two methods returned the following orders: “B A C D E F G H I J” and “A I D C G B H E F J”. At the first iteration the algorithm selects features B and A as the top features of the two methods and puts them into the resulting set “A B”. At the second iteration it selects features A and I as the current top features, but puts only I into the resulting set “A B I”, as A is already in it. At the third iteration the algorithm puts both features C and D into the resulting set “A B I C D”. At the fourth iteration it selects features D and C, but neither of them is included into the resulting set, as both are already there. At the fifth iteration the algorithm selects features E and G, but puts only one of them, chosen randomly, into the resulting set, as we need only 6 features. The resulting set is “A B I C D E” or “A B I C D G”.
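The following sketch implements the described merging procedure (illustrative code, not part of the original study; the final tie is broken by list order here rather than randomly):

```python
# Illustrative sketch of the mixture method; not the original experimental code.
def mixture_select(rankings, n):
    """Merge several feature rankings round-robin into a set of n features.

    rankings: list of lists, each a feature ranking (best first).
    Returns the selected features in the order they were added.
    """
    rankings = [list(r) for r in rankings]   # work on copies
    selected = []
    while len(selected) < n and any(rankings):
        # Steps 1-2: take the current top of every ranking, skipping duplicates.
        for r in rankings:
            if r and r[0] not in selected:
                selected.append(r[0])
                if len(selected) == n:
                    break
        # Step 3: drop the current top feature from every ranking.
        for r in rankings:
            if r:
                r.pop(0)
    return selected

# The example from the text: two rankings of ten features, select six.
print(mixture_select([list("BACDEFGHIJ"), list("AIDCGBHEFJ")], 6))
# -> ['B', 'A', 'I', 'C', 'D', 'E']
```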
4 Results
We divided the data set into a training set and a test set ten times. Each time, 60% of the data was chosen randomly for the training set and the remaining 40% for the test set. Thus, we prepared ten data sets for the experiments. Then, in each experiment, features were evaluated on the training set by all the methods described in Section 2. The resulting scores were used for feature set generation as
shown in Section 3. After that, we used one of the most popular metrics, AUC (the
area under the ROC curve), for evaluating the obtained sets. The ROC (receiver operating characteristic) curve is the curve created by plotting the true positive rate vs. the
false positive rate. We used a logistic regression classifier trained on the training set
for calculating the true positive rate and the false positive rate in each experiment.
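The experimental loop can be reproduced roughly as follows with scikit-learn (an illustrative sketch; the original experimental code is not published with the paper, and the function and parameter names are ours):

```python
# Illustrative sketch of one experiment (60/40 split, logistic regression, AUC).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def run_experiment(X, y, select_features, n_features, seed):
    """One 60/40 split: rank features on the training part, measure AUC on the test part."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=seed)
    idx = select_features(X_tr, y_tr, n_features)      # e.g. a raw sorted or mixture set
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, idx], y_tr)
    scores = clf.predict_proba(X_te[:, idx])[:, 1]     # probability of the positive class
    return roc_auc_score(y_te, scores)

# Ten experiments with different random splits, as in the paper:
# aucs = [run_experiment(X, y, my_selector, 1000, seed) for seed in range(10)]
```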
Then we evaluated the AUC score of each method in each experiment. After that, we used the paired Wilcoxon signed-rank test to compare pairs of methods. If the p-value obtained from the test was less than 0.05, we considered the two methods different, and the one with the greater AUC score was considered the better of the two. The mean AUC values for all methods are shown in Table 1.
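The pairwise comparison over the ten splits can be done with scipy's Wilcoxon signed-rank test (illustrative sketch; names are ours):

```python
# Illustrative sketch of the pairwise method comparison.
from scipy.stats import wilcoxon

def methods_differ(aucs_a, aucs_b, alpha=0.05):
    """Paired Wilcoxon signed-rank test on the per-split AUC scores of two methods."""
    _, p_value = wilcoxon(aucs_a, aucs_b)
    return p_value < alpha
```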
The first column contains the method combination used; the second column contains the AUC value for that method. Rows in Table 1 are sorted by the AUC metric. The combination of the Fit Criterion and Symmetrical Uncertainty methods gave the best results. The classifier trained on all features took about 47 seconds on the test set, while the classifier trained on the selected features took about 10 seconds. The table of p-values is available online in the supplementary materials [8].
Table 1. The AUC metrics for the feature selection methods used. FC stands for Fit Criterion, SU for Symmetrical Uncertainty, CS for Chi Squared, VDM for Value Difference Metric, and Sp for Spearman.

Method             AUC        Method             AUC
FC + SU            0.71866    Sp + VDM           0.71081
FC + Sp + SU       0.71828    CS + Sp + SU       0.71071
FC + Sp            0.71815    CS + Sp            0.71042
SU                 0.71757    CS + SU            0.71039
Sp + SU            0.71692    CS + FC            0.71029
Sp                 0.71655    CS + SU + VDM      0.71028
Sp + SU + VDM      0.71232    CS + Sp + VDM      0.71022
Base feature set   0.71229    CS                 0.70960
FC + SU + VDM      0.71221    CS + VDM           0.70956
SU + VDM           0.71209    CS + FC + VDM      0.70950
CS + FC + SU       0.71154    FC + VDM           0.66733
FC + Sp + VDM      0.71143    VDM                0.66652
CS + FC + Sp       0.71141    FC                 0.66124
ROC curves for different feature sets are presented in Fig. 1. The farther a classifier's ROC curve rises above the diagonal of a random classifier, the better its feature set is.
Fig. 1. ROC curve for the top filtered feature set and base feature set
The best of the raw sorted methods was the filter based on Symmetrical Uncertainty, with an AUC score of 0.71757. The best of the mixture methods was the filter based on the combination of Fit Criterion and Symmetrical Uncertainty, with an AUC score of 0.71866. The ROC curve for the top feature set obtained by this mixture of Fit Criterion and Symmetrical Uncertainty and the ROC curve for the set of all features are shown in Fig. 1.
As shown in the p-value table [8], the five methods with the greatest AUC scores have p-values below 0.05 when tested against the other methods. We can therefore suppose that these five methods are better than the others for the problem described in this paper. Four of them are mixture methods. Thus, mixture methods usually work better than raw sorted methods.
5 Conclusion
In this paper a feature selection method was presented and tested on a real-world problem that required reducing the number of features. Most of the methods used performed similarly; for example, the top ten features of the sets built by the Chi squared, Fit criterion and Symmetrical uncertainty methods were almost the same. As demonstrated by the results, the selected sets have about the same value of the AUC metric as the full feature set.
A slight increase in effectiveness can be seen when some mixture methods are used. Furthermore, the classifier works much faster after these combined filtering methods are applied. As shown in Table 1, the best combination of methods for our task was Fit Criterion together with Symmetrical Uncertainty, which obtained the highest AUC score. The worst filtering metrics for our problem were VDM and Fit Criterion, although in [5] it was shown that Fit Criterion and VDM were the best methods for the problem considered there. Thus, our assumption about method combinations is justified. However, the presented methods ignore the relationships between features, which could reduce classifier effectiveness. Therefore, for each problem different methods should be tried to achieve the best effectiveness.
6 Acknowledgments
This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01. The authors would like to thank Mail.Ru Group.
References
1. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: An introduction to variable and feature selection. Journal of Machine Learning Research, vol. 3, 1157–1182 (2003).
2. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics, vol. 23, issue 19, 2507–2517 (2007).
3. Yu, L., Liu, H.: Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML), 856–863 (2003).
4. Mail.Ru Group Company, http://corp.mail.ru/en/
5. Auffarth, B., Lopez, M., Cerquides, J.: Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images. In: Advances in Data Mining. Applications and Theoretical Aspects, Lecture Notes in Computer Science, vol. 6171, 248–262 (2010).
6. Spearman, C.: The proof and measurement of association between two things. The American Journal of Psychology, vol. 15, no. 1, 72–101 (1904).
7. Yang, Y., Pedersen, J. O.: A comparative study on feature selection in text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning (ICML), 412–420 (1997).
8. Supplementary materials (p-values), http://genome.ifmo.ru/files/papers_files/IDEAL2014/pvalues.csv