Practical evaluation of feature selection methods

Vladislav Dolganov, Ivan Smetannikov, Fedor Tsarev

Abstract. Feature selection is an important step in data preprocessing for machine learning. In this paper we report on a practical evaluation of feature selection methods. We use several methods to rank features, a number of which are based on the correlation between features and the target vector. We also describe a new feature selection algorithm that combines feature rankings produced by different methods. We compare the results of each method independently using AUC (the area under the receiver operating characteristic curve) on a user classification problem. A logistic regression classifier is used to compute the true positive and false positive rates needed for the AUC calculation. The results show that our algorithm can provide a feature set with the best AUC value.

Keywords: machine learning, feature selection, classification, internet databases.

1 Introduction

Nowadays, machine learning is widely used in science and industry. One of the most common machine learning problems is classification. There are many applications where large amounts of data have to be classified, for example targeted advertising, social networks and tumor recognition. In such problems the number of features used for classification can reach hundreds of thousands and sometimes even millions. Thus, the problem of decreasing the number of features for faster processing arises. Moreover, discarding useless features can improve the accuracy of the classifier in some problems.

Two approaches for reducing the dimensionality of the feature space are known: feature selection and feature extraction [1]. In the first approach some features are discarded, so the resulting set is a subset of the original feature set. In the second approach new features are created from the old ones. In this paper we consider only feature selection methods.

There are a number of commonly used feature selection methods. They can be divided into three types of techniques, each with its own advantages and disadvantages [2]. Filtering techniques select features according to some criterion: a feature metric is computed for every feature, and a subset of features with the best metric values is selected and used as input for a classification algorithm. In the second approach, wrapper techniques, several feature subsets are generated and their effectiveness is evaluated with a classification algorithm; this approach trains a new model for each subset, so it is computationally expensive. The third type, embedded techniques, exploits particular properties of the chosen classification algorithm. Previous studies showed that embedded and filtering methods are rather fast in comparison with wrapper methods [3]. In this paper we use only embedded and filtering techniques because of the large number of features in the training and test sets.

The problem used for the evaluation was provided by the Mail.Ru Group company [4] as a real industrial problem. The task was formulated as follows: determine how much a user watches TV from the list of websites he or she has visited. The initial data contained 30000 users with information about how much they watch TV and which websites they had visited in the previous three months. There were about 170000 different websites; for each website a binary feature indicates whether the user visited this site or not.
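A user-by-website matrix of this size (roughly 30000 x 170000 binary indicators) is almost entirely zeros, so it is natural to store it in sparse form. The following Python sketch shows one possible way to build such a matrix; the function name and the input format are illustrative assumptions of ours and are not part of the original data set.

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_feature_matrix(visits, n_sites):
    """Build the binary user-by-website matrix.

    visits  -- list where visits[u] is an iterable of website ids
               visited by user u (hypothetical input format)
    n_sites -- total number of distinct websites (about 170000 here)
    """
    rows, cols = [], []
    for user_id, sites in enumerate(visits):
        for site_id in sites:
            rows.append(user_id)
            cols.append(site_id)
    data = np.ones(len(rows), dtype=np.int8)
    # entry (u, s) is 1 if user u visited website s, and 0 otherwise
    return csr_matrix((data, (rows, cols)), shape=(len(visits), n_sites))
```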
Before processing, all data were anonymized: websites and users were replaced with numeric identifiers.

The paper is structured as follows. In Section 2 the feature evaluation methods are described; they are the core of the filtering approach and have a large impact on classifier efficiency. Our feature selection method is described in Section 3. Experimental results are provided in Section 4, and Section 5 concludes.

2 Feature evaluation algorithms

In this section we briefly describe the implemented methods. We chose commonly used feature ranking methods. Each of them computes a score for every feature and then ranks the features by this score.

2.1 Symmetrical uncertainty

One approach to the feature selection problem is to treat features as random variables. For each feature X its entropy can be calculated as

H(X) = -\sum_i p(x_i) \log_2 p(x_i),                                        (1)

where p(x_i) is the prior probability of the value x_i. For two features X and Y the conditional entropy of X given Y is defined as

H(X|Y) = -\sum_j p(y_j) \sum_i p(x_i|y_j) \log_2 p(x_i|y_j),                (2)

where p(x_i|y_j) is the posterior probability of x_i given the value y_j. From these quantities the information gain can be calculated:

IG(X|Y) = H(X) - H(X|Y).                                                    (3)

Information gain is a symmetrical measure, but it needs to be normalized to be comparable across different features. Symmetrical uncertainty [5] is such a normalized information gain measure:

SU(X, Y) = \frac{2\,IG(X|Y)}{H(X) + H(Y)}.                                  (4)

It normalizes information gain to the range [0, 1]. If SU(X, Y) equals zero, X and Y are independent; if it equals one, X and Y are fully correlated.

2.2 Spearman rank correlation coefficient

The Spearman correlation coefficient [6] is defined as the Pearson correlation coefficient between ranked variables:

\rho_j = \frac{\sum_i (x_{ij} - \bar{x}_j)(y_i - \bar{y})}{\sqrt{\sum_i (x_{ij} - \bar{x}_j)^2 \sum_i (y_i - \bar{y})^2}},      (5)

where i is the instance index, j is the feature index, \bar{y} is the mean of the (ranked) target vector and \bar{x}_j is the mean of the (ranked) j-th feature. The value of \rho_j lies in [-1, 1]: values close to zero indicate weakly correlated features, while values close to -1 or 1 indicate strongly correlated ones.

2.3 Value difference metric

In this section the Value Difference Metric (VDM) [5] for a binary target vector is described. The VDM distance between two distributions X_1 and X_2 is defined as

VDM(X_1, X_2) = \frac{1}{2} \int |p(X_1 = x) - p(X_2 = x)|\,dx.             (6)

The VDM relevance between a feature and the class is defined as

VDM(X, C) = \frac{1}{2} \int |p(X = x \mid c_1) - p(X = x \mid c_2)|\,dx,   (7)

where c_1 and c_2 are the possible class values. In the discrete case the integral is replaced by a sum, so equation (7) can be rewritten as

VDM(X, C) = \frac{1}{2} \sum_i |p(X = x_i \mid c_1) - p(X = x_i \mid c_2)|, (8)

where the sum runs over the possible values x_i of the feature. The resulting relevance lies in the range [0, 1]: features with values close to zero have the lowest relevance to the class vector, and features with values close to one have the highest relevance to it.
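To illustrate how such scores are computed in practice, the sketch below evaluates the symmetrical uncertainty of equation (4) for one feature against the target; the other measures of this section are computed from a similar per-feature pass over the data. This is a minimal Python sketch under our own naming, not the authors' implementation.

```python
import numpy as np

def entropy(values):
    """Empirical entropy H(X) of a discrete sample, equation (1)."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetrical_uncertainty(x, y):
    """Symmetrical uncertainty SU(X, Y) of equations (1)-(4)."""
    h_x, h_y = entropy(x), entropy(y)
    # conditional entropy H(X|Y), equation (2)
    h_x_given_y = sum((y == v).mean() * entropy(x[y == v]) for v in np.unique(y))
    ig = h_x - h_x_given_y                            # information gain, equation (3)
    return 0.0 if h_x + h_y == 0 else 2.0 * ig / (h_x + h_y)    # equation (4)
```

Ranking the website features then amounts to calling symmetrical_uncertainty on every feature column (as a 1-D array) against the target vector and sorting the features by the resulting scores.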
2.4 Fit criterion

The fit criterion [5] is a measure similar to the z-score used in statistics. The decision of whether a point x belongs to distribution D_1 or distribution D_2 is defined as

FCP(x, D_1, D_2) = \begin{cases} 1, & \text{if } \frac{|x - \bar{D}_1|}{\mathrm{var}(D_1)} < \frac{|x - \bar{D}_2|}{\mathrm{var}(D_2)}, \\ 2, & \text{if } \frac{|x - \bar{D}_1|}{\mathrm{var}(D_1)} > \frac{|x - \bar{D}_2|}{\mathrm{var}(D_2)}, \end{cases}      (9)

where \bar{D}_k and var(D_k) are the mean and the variance of distribution D_k. For k distributions and a feature, this formula can be generalized to

FCP(x, f) = \arg\min_{c = 1, \dots, k} \frac{|x - \bar{D}_c|}{\mathrm{var}(D_c)}.      (10)

The resulting fit criterion weight of a feature is a sum of indicators that return one or zero depending on the correctness of the FCP prediction:

FC(f_j, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\mathrm{FCP}(x_{ij}, [f_j \mid y = c_1], [f_j \mid y = c_2]) = y_i\},      (11)

where n is the number of instances and [f_j \mid y = c_k] denotes the values of feature f_j on the instances of class c_k.

2.5 χ² statistic

The χ² statistic [7] evaluates the lack of independence between a feature t and a class c. The feature is evaluated by the following formula:

\chi^2(t, c) = \frac{N (AD - CB)^2}{(A + C)(B + D)(A + B)(C + D)},      (12)

where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of instances. The statistic is close to zero for features that are independent of the class, while the most relevant features receive the largest values.

3 Feature set generation

In this section we describe the proposed methods for choosing the final feature set from the feature scores provided by the feature evaluation algorithms.

The first approach, which we call raw sorted sets, sorts the features by the score of one feature evaluation method and takes the top N features. These N features form the new feature set.

The main idea of our second approach is inspired by the fact that an ensemble of classifiers often works better than each of the classifiers separately. That is why we decided to combine several ranking measures. This approach, which we call mixture methods, uses several feature evaluation measures at once. On the first step, the top feature of each ranked feature list is selected. On the second step, these features are added to the resulting set if they are not already in it. On the third step, the current top features are removed from each ranked feature list. The three steps are repeated until N features are selected for the new feature set. In this paper we used combinations of two and three ranking methods.

For example, suppose we need to select 6 features out of 10 and two methods returned the orders "B A C D E F G H I J" and "A I D C G B H E F J". At the first iteration the algorithm selects features B and A as the top features of the two lists and puts them into the resulting set {A, B}. At the second iteration it selects features A and I, but puts only I into the resulting set {A, B, I}, as A is already there. At the third iteration it adds both C and D, giving {A, B, I, C, D}. At the fourth iteration it selects D and C, but neither is added because both are already in the set. At the fifth iteration it selects E and G, but only one of them, chosen randomly, is added, since only 6 features are needed. The resulting set is therefore {A, B, I, C, D, E} or {A, B, I, C, D, G}.
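The merging procedure described above can be sketched as follows. This is a minimal Python sketch under our own naming; unlike the description above, ties for the last slot are broken by the order of the rankings rather than randomly.

```python
def mix_rankings(rankings, n):
    """Merge several ranked feature lists (best feature first) into one
    set of n features, following the mixture procedure described above."""
    selected = []
    position = 0
    while len(selected) < n and position < max(len(r) for r in rankings):
        # take the current top feature of every ranking ...
        for ranking in rankings:
            if position < len(ranking) and ranking[position] not in selected:
                # ... and add it to the result if it is not there yet
                selected.append(ranking[position])
                if len(selected) == n:
                    return selected
        position += 1
    return selected
```

On the example above, mix_rankings([list("BACDEFGHIJ"), list("AIDCGBHEFJ")], 6) returns ['B', 'A', 'I', 'C', 'D', 'E'], i.e. the set {A, B, I, C, D, E} from the worked example.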
4 Results

We divided the data set into a training set and a test set ten times. Each time 60% of the data was chosen randomly for the training set and the remaining 40% for the test set. Thus, we prepared ten data sets for the experiments. In each experiment the features were evaluated on the training set by all methods described in Section 2, and the resulting scores were used for feature set generation as described in Section 3. After that, we used one of the most popular metrics, AUC (the area under the ROC curve), to evaluate the obtained sets. The ROC (receiver operating characteristic) curve is created by plotting the true positive rate against the false positive rate.

We used a logistic regression classifier trained on the training set to calculate the true positive and false positive rates in each experiment, and then evaluated the AUC score of each method in each experiment. After that, we used the paired Wilcoxon signed-rank test to compare pairs of methods. If the p-value obtained from the test was less than 0.05, we considered the two methods different, and the one with the greater AUC score was regarded as the better of the two.

The mean AUC values of all methods are shown in Table 1. Each Method column contains a method combination and the adjacent AUC column contains its mean AUC value; the rows are sorted by the AUC metric. The combination of the Fit Criterion and Symmetrical Uncertainty methods gave the best result. The classifier trained on all features took about 47 seconds on the test set, while the classifier trained on the selected features took about 10 seconds. The table with p-values is presented in the supplementary materials [8] available online.

Table 1. The AUC metrics for the used feature selection methods. FC stands for Fit Criterion, SU for Symmetrical Uncertainty, CS for Chi Squared, VDM for Value Difference Metric and Sp for Spearman.

Method              AUC        Method              AUC
FC + SU             0.71866    Sp + VDM            0.71081
FC + Sp + SU        0.71828    CS + Sp + SU        0.71071
FC + Sp             0.71815    CS + Sp             0.71042
SU                  0.71757    CS + SU             0.71039
Sp + SU             0.71692    CS + FC             0.71029
Sp                  0.71655    CS + SU + VDM       0.71028
Sp + SU + VDM       0.71232    CS + Sp + VDM       0.71022
Base feature set    0.71229    CS                  0.70960
FC + SU + VDM       0.71221    CS + VDM            0.70956
SU + VDM            0.71209    CS + FC + VDM       0.70950
CS + FC + SU        0.71154    FC + VDM            0.66733
FC + Sp + VDM       0.71143    VDM                 0.66652
CS + FC + Sp        0.71141    FC                  0.66124

ROC curves for different feature sets are presented in Fig. 1: the further a classifier's ROC curve lies above the ROC curve of a random classifier, the better its feature set is.

Fig. 1. ROC curves for the top filtered feature set and the base feature set

The best of the raw sorted methods was the filter based on Symmetrical Uncertainty with an AUC score of 0.71757. The best of the mixture methods was the filter based on the mixture of the Fit Criterion and Symmetrical Uncertainty measures with an AUC score of 0.71866. The ROC curve of the top feature set obtained by this mixture and the ROC curve of the set of all features are shown in Fig. 1.

As shown in the p-value table [8], the five methods with the greatest AUC scores have p-values below 0.05 when tested against the other methods, so we may suppose that these five methods are better than the others for the problem described in this paper. Four of them are mixture methods. Thus, mixture methods usually work better than raw methods.
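The evaluation protocol of this section can be summarized by the following sketch. The paper does not name its implementation; this version assumes scikit-learn and SciPy, and the function names are ours.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_of_feature_set(X, y, feature_indices, seed):
    """One experiment: random 60/40 split, logistic regression on the
    selected feature columns, AUC on the held-out 40%.
    X: feature matrix, y: binary class labels."""
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, feature_indices], y, test_size=0.4, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = clf.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, scores)

def compare_feature_sets(X, y, features_a, features_b, n_runs=10):
    """Paired comparison of two feature sets over n_runs random splits."""
    auc_a = [auc_of_feature_set(X, y, features_a, s) for s in range(n_runs)]
    auc_b = [auc_of_feature_set(X, y, features_b, s) for s in range(n_runs)]
    _, p_value = wilcoxon(auc_a, auc_b)   # paired Wilcoxon signed-rank test
    return np.mean(auc_a), np.mean(auc_b), p_value
```

A table like the published p-value table [8] can then be produced by running such a paired comparison for every pair of methods.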
5 Conclusion

In this paper a feature evaluation method was presented and tested on a real-world feature reduction problem. Most of the methods used performed similarly; for example, the top ten features of the sets built by the Chi Squared, Fit Criterion and Symmetrical Uncertainty methods were almost the same. As demonstrated by the results, the selected sets have about the same AUC value as the full feature set, and a slight increase in effectiveness can be seen when some mixture methods are used. Furthermore, the classifier works considerably faster on the reduced feature sets produced by these combined filtering methods.

As shown in Table 1, the best method combination for our task was the combination of Fit Criterion and Symmetrical Uncertainty, which has the highest AUC score. The worst filtering metrics for our problem were VDM and Fit Criterion, although it was shown in [5] that Fit Criterion and VDM were the best methods for the problem considered there. Thus, the performance of individual ranking methods is problem-dependent, which supports our decision to combine them. However, the presented methods ignore relationships between features, which may reduce classifier effectiveness. Therefore, for each problem several methods should be tried to achieve the best results.

6 Acknowledgments

This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01. The authors would like to thank the Mail.Ru Group company.

References

1. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: An introduction to variable and feature selection. Journal of Machine Learning Research, vol. 3, 1157-1182 (2003)
2. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics, vol. 23, no. 19, 2507-2517 (2007)
3. Yu, L., Liu, H.: Feature selection for high-dimensional data: a fast correlation-based filter solution. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML), 856-863 (2003)
4. Mail.Ru Group Company, http://corp.mail.ru/en/
5. Auffarth, B., López, M., Cerquides, J.: Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images. In: Advances in Data Mining. Applications and Theoretical Aspects, Lecture Notes in Computer Science, vol. 6171, 248-262 (2010)
6. Spearman, C.: The proof and measurement of association between two things. The American Journal of Psychology, vol. 15, no. 1, 72-101 (1904)
7. Yang, Y., Pedersen, J. O.: A comparative study on feature selection in text categorization. In: Proceedings of the Fourteenth International Conference on Machine Learning (ICML), 412-420 (1997)
8. Supplementary materials (p-values), http://genome.ifmo.ru/files/papers_files/IDEAL2014/pvalues.csv