Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference

Robustness of Threshold-Based Feature Rankers with Data Sampling on Noisy and Imbalanced Data

Ahmad Abu Shanab, Taghi Khoshgoftaar and Randall Wald
Florida Atlantic University
777 Glades Road, Boca Raton, FL 33431

Copyright © 2012, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract
Gene selection has become a vital component in the learning process when using high-dimensional gene expression data. Although extensive research has been done towards evaluating the performance of classifiers trained with the selected features, the stability of feature ranking techniques has received relatively little study. This work evaluates the robustness of eleven threshold-based feature selection techniques, examining the impact of data sampling and class noise on the stability of feature selection. To assess the robustness of feature selection techniques, we use four groups of gene expression datasets, employ eleven threshold-based feature rankers, and generate artificial class noise to better simulate real-world datasets. The results demonstrate that although no ranker consistently outperforms the others, MI and Dev show the best stability on average, while GI and PR show the least stability on average. Results also show that trying to balance datasets through data sampling has, on average, no positive impact on the stability of feature ranking techniques applied to those datasets. In addition, increased feature subset sizes improve stability, but only do so reliably for noisy datasets.

Introduction
One of the major challenges to cancer classification and prediction is the high abundance of features (genes), in most cases exceeding the number of instances. Most of these attributes provide little or no useful information for building a classification model. The process of removing irrelevant and redundant attributes is known as feature selection. Reducing the number of attributes in a dataset can lead to better performance. For this reason, feature selection has received a lot of attention in the past few years. Most research has focused on evaluating feature selection techniques by assessing the performance of a chosen classifier trained with the selected features. Relatively little research has focused on evaluating the stability of feature selection techniques to changes in the datasets. The stability of a feature selection method is defined as the degree of agreement between the outputs of the feature selection method when applied to randomly selected subsets of the same input data (Kuncheva 2007; Loscalzo, Yu, & Ding 2009). With stable feature selection techniques, practitioners can be confident that the selected features are relatively robust to variations in the training data.

Class imbalance is another major challenge to machine learning. Many important gene expression datasets are characterized by class imbalance, where there are few cases of the positive class (also called the class of interest) and many more cases of the negative class. This can result in suboptimal classification performance, because many classifiers assume that the classes are equal in size, and some performance metrics reach their maximum value without properly balancing the weight of each class. Thus, the classifier will have a very high rate of false negatives, which mainly affects the positive class, the most important class. A variety of techniques have been proposed to alleviate the problems associated with class imbalance. The most popular such technique is data sampling, where the dataset is transformed into a more balanced one by adding or removing instances. This study applies random undersampling, a widely-used data sampling technique, to investigate the effect of sampling on stability.

Another factor that can characterize real-world datasets is noise. Noise refers to errors or missing values contained in real-world data. Noise in the independent features is called attribute noise, while noise in the class label is described as class noise. To the best of our knowledge, the stability of feature selection techniques in the presence of noise has received little attention. Given the prevalence of noise in real-world datasets, there is clearly a need to understand the impact of noise on the stability of feature selection. Thus, all experiments in this paper were performed on data which was first determined to be relatively free of noise and which then had artificial class noise injected in a controlled fashion. This way, the results can be used to determine the impact of class noise and sampling on the stability of feature selection.

In this paper, we evaluate threshold-based feature ranking techniques based on the degree of agreement between a feature ranker's output on the original datasets and its output on modified versions of those datasets (which have had noise injected into them, have had some instances removed through random undersampling, or both). Note that we are comparing the feature subsets before and after modification, rather than comparing the subsets from different runs of the modification approach. This method has not been greatly studied in the literature, and constitutes a contribution of this paper.

Related Work
Feature selection is a common preprocessing technique used to select a subset of the original features to be used in the learning process. Feature selection has been extensively studied for many years in data mining and machine learning.
A comprehensive survey of feature selection algorithms can be found in the work of Liu and Yu (Liu & Yu 2005). Hall and Holmes (Hall & Holmes 2003) evaluated six feature ranking methods and applied them to 15 datasets from the UCI repository. They came to the conclusion that there is no single best approach for all situations. Saeys et al. (Saeys, Abeel, & Peer 2008) studied the use of ensemble feature selection methods and showed that the ensemble approach provides more robust feature subsets than a single feature selection method.

Data sampling is another important preprocessing activity in data mining. Data sampling is used to deal with the class imbalance problem, that is, the overabundance of negative class instances versus positive class instances. This problem is seen in many real-world datasets. Comprehensive studies on different sampling techniques were performed by Kotsiantis (Kotsiantis, Kanellopoulos, & Pintelas 2006) and Guo (Guo et al. 2008), including both oversampling and undersampling techniques (which add instances to the minority class and remove instances from the majority class, respectively), and both random and directed forms of sampling. Chawla (Chawla et al. 2002) proposed an intelligent oversampling method called the Synthetic Minority Oversampling Technique (SMOTE). SMOTE adds new, artificial minority examples by extrapolating between preexisting minority instances rather than simply duplicating original instances. In this study, due to space considerations (and prior research showing its effectiveness), we used random undersampling (Seiffert, Khoshgoftaar, & Van Hulse 2009).

A common way to evaluate feature selection techniques is based on their classification performance, comparing the classification performance of learners built with the selected features to those built with the complete set of attributes. Another evaluation criterion is the stability of feature ranking techniques. Dunne et al. (Dunne, Cunningham, & Azuaje 2002) proposed a framework for comparing different sets of features. They evaluated the stability of standard feature selection methods and an aggregated approach, and concluded that the aggregated approach was superior to the standard wrapper-based feature selection techniques. Křížek et al. (Křížek, Kittler, & Hlaváč 2007) proposed an entropy-based measure for assessing the stability of feature selection methods. Kuncheva (Kuncheva 2007) proposed a stability index for measuring the discrepancy in different sequences of features obtained from different runs of sequential forward selection, a widely used feature selection method. Wang (Wang & Khoshgoftaar 2011) compared the stability of 11 threshold-based feature ranking techniques on a software engineering dataset, discovering that significant variations existed, with AUC and PRC performing well above the rest.

In this paper, we evaluate feature ranking techniques based on the degree of agreement between a feature ranker's output on the original datasets and on the modified datasets which have had noise injected into them and then have had some instances removed from them due to sampling. This study compares three different scenarios. The first scenario involves sampling on the original datasets, with sampling repeated 30 times. The second scenario involves injecting nine levels of noise with no sampling performed, with each level of noise injected 30 times. The third scenario is similar to scenario two, except that it involves sampling after noise injection. Given the prevalence of noise in real-world datasets, there is clearly a need to understand the impact of noise on the stability of feature selection. This paper shows how to distinguish the most and least stable threshold-based feature rankers and points out the importance of considering the impact of noise and sampling on the stability of feature rankers.

Feature Ranking Techniques
In this paper, we examine filter-based feature rankers, since wrapper-based techniques can be very computationally expensive. Eleven threshold-based feature selection (TBFS) techniques were employed within WEKA (Witten & Frank 2005). The family of threshold-based feature rankers is a novel approach that permits the use of a classification performance metric as a feature ranker (Wang, Khoshgoftaar, & Van Hulse 2010). Note that while none of these feature rankers use a classifier, they do use the feature values (normalized to lie between 0 and 1) as a posterior probability, choosing a threshold and "classifying" instances based directly on the values of the feature being examined. Classifier performance metrics are then used to evaluate the quality of the feature.
In effect, this allows the use of the performance metrics to describe how well the feature correlates with the class; since no actual classifiers are built, this still qualifies as filter-based feature selection. A short code sketch illustrating this threshold sweep is given after the list of metrics below.

1. F-Measure (FM) is a single measure that combines both precision and recall. In particular, FM is the harmonic mean of precision and recall. Using a tunable parameter β to indicate the relative importance of precision and recall, it is calculated as follows:

FM = max_{t∈[0,1]} [(1 + β²)·R(t)·PRE(t)] / [β²·R(t) + PRE(t)]    (1)

where R(t) and PRE(t) are Recall and Precision at threshold t, respectively. Note that Recall, R(t), is equivalent to TPR(t), while Precision, PRE(t), represents the proportion of positive predictions that are truly positive at each threshold t ∈ [0,1]. More precisely, PRE(t) is defined as the number of positive instances with X̂^j > t divided by the total number of instances with X̂^j > t.

2. Odds Ratio (OR) is a measure used to describe the strength of association between an independent variable and the dependent variable. It is defined as:

OR = max_{t∈[0,1]} [TP(t)·TN(t)] / [FP(t)·FN(t)]    (2)

where TP(t) and TN(t) represent the number of true positives and true negatives at threshold t, respectively, while FP(t) and FN(t) represent the number of false positives and false negatives at threshold t, respectively.

3. Power (Pow) is a measure that avoids false positive cases while giving stronger preference to positive cases. It is defined as:

Pow = max_{t∈[0,1]} [(1 − FPR(t))^k − (1 − TPR(t))^k]    (3)

where k = 5.

4. Probability Ratio (PR) is the sample estimate probability of the feature given the positive class divided by the sample estimate probability of the feature given the negative class:

PR = max_{t∈[0,1]} TPR(t) / FPR(t)    (4)

5. Gini Index (GI) is derived from a decision tree construction process where a score is used as a splitting criterion to grow the tree along a particular branch. It measures the impurity of each feature towards categorization, and it is obtained by:

GI = min_{t∈[0,1]} [2·PRE(t)·(1 − PRE(t)) + 2·NPV(t)·(1 − NPV(t))]    (5)

where NPV(t), the negative predicted value at threshold t, is the percentage of examples predicted to be negative that are actually negative. The GI of a feature is thus the minimum over all decision thresholds t ∈ [0,1].

6. Mutual Information (MI) computes the mutual information criterion with respect to the number of times a feature value and a class co-occur, the feature value occurs without the class, and the class occurs without the feature value. Mutual information is defined as:

MI = max_{t∈[0,1]} Σ_{ŷ^t∈{P,N}} Σ_{y∈{P,N}} p(ŷ^t, y) · log [ p(ŷ^t, y) / (p(ŷ^t)·p(y)) ]    (6)

where y(x) is the actual class of instance x, ŷ^t(x) is the predicted class based on the value of the attribute X^j at threshold t, and

p(ŷ^t = α, y = β) = |{x : ŷ^t(x) = α and y(x) = β}| / (|P| + |N|),
p(y = α) = |{x : y(x) = α}| / (|P| + |N|),

with p(ŷ^t = α) defined analogously and α, β ∈ {P, N}. Note that the class (actual or predicted) can be either positive (P) or negative (N).

7. The Kolmogorov-Smirnov Statistic (KS) measures a feature's relevance by dividing the data into clusters based on the class and comparing the distribution of that particular attribute among the clusters. It is effectively the maximum difference between the curves generated by the true positive and false positive rates (TPR(t) and FPR(t)) of the ersatz "classifier" as the decision threshold changes from 0 to 1, and its formula is given as follows:

KS = max_{t∈[0,1]} |TPR(t) − FPR(t)|    (7)

8. Deviance (Dev) is the minimum residual sum of squares based on a threshold t. It measures the sum of the squared errors from the mean class given a partitioning of the space based on the threshold t, as shown in the equation below:

Dev = min_{t∈[0,1]} [ Σ_{ŷ^t(x_i)=N} (μ_N − x_i)² + Σ_{ŷ^t(x_i)=P} (μ_P − x_i)² ]    (8)

Here, ŷ^t(x) represents the predicted class of instance x (either N or P), μ_N is the mean value of all instances actually found in the negative class, and μ_P is the mean value of all instances actually found in the positive class. As it represents the total error found in the partitioning, lower values are preferred.

9. Geometric Mean (GM) is a single-value performance measure obtained by calculating the square root of the product of the true positive rate, TPR(t), and the true negative rate, TNR(t). GM ranges from 0 to 1, with a value of 1 attributed to a feature that is perfectly correlated to the class:

GM = max_{t∈[0,1]} sqrt( TPR(t) × TNR(t) )    (9)

Thus, a feature's predictive power is determined by the maximum value of GM as different GM values are obtained, one at each value of the normalized attribute range.

10. Area Under the ROC Curve (AUC), the area under the receiver operating characteristic (ROC) curve, is a single-value measure based on statistical decision theory and was developed for the analysis of electronic signal detection. It is the result of plotting FPR(t) against TPR(t). In this study, ROC is used to determine each feature's predictive power. ROC curves are generated by varying the decision threshold t used to transform the normalized attribute values into a predicted class. That is, as the threshold for the normalized attribute varies from 0 to 1, the true positive and false positive rates are calculated.

11. Area Under the PRC Curve (PRC), the area under the precision-recall characteristic curve, is a single-value measure depicting the trade-off between precision and recall. It is the result of plotting recall, TPR(t), against precision, PRE(t). Its value ranges from 0 to 1, with 1 denoting a feature with the highest predictive power. The PRC curve is generated by varying the decision threshold t from 0 to 1 and plotting the recall (x-axis) and precision (y-axis) at each point, in a similar manner to the ROC curve.
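To make the threshold-sweep idea above concrete, the following minimal sketch scores a single normalized feature with two of the metrics described above (KS and GM). It is an illustration only, not the authors' WEKA implementation: the function name tbfs_score, the evenly spaced threshold grid, and the 0/1 label encoding are assumptions made for this example.

import numpy as np

def tbfs_score(feature, labels, metric="KS", thresholds=None):
    """Score one normalized feature by sweeping a decision threshold.

    feature : 1-D array of values already scaled to [0, 1]
    labels  : 1-D array of 0 (negative) / 1 (positive) class labels
    metric  : "KS" (max |TPR - FPR|) or "GM" (max sqrt(TPR * TNR))
    """
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)   # illustrative threshold grid

    pos = labels == 1
    neg = ~pos
    best = 0.0
    for t in thresholds:
        pred_pos = feature > t                    # "classify" directly on the feature value
        tpr = np.mean(pred_pos[pos]) if pos.any() else 0.0
        fpr = np.mean(pred_pos[neg]) if neg.any() else 0.0
        tnr = 1.0 - fpr
        if metric == "KS":
            score = abs(tpr - fpr)
        elif metric == "GM":
            score = np.sqrt(tpr * tnr)
        else:
            raise ValueError("unsupported metric")
        best = max(best, score)
    return best

# Example (hypothetical names): rank the columns of an already min-max-scaled
# matrix X (n_samples x n_features) with labels y, keeping the top k features.
# scores = np.array([tbfs_score(X[:, j], y, "KS") for j in range(X.shape[1])])
# top_k = np.argsort(scores)[::-1][:k]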
Empirical Evaluation

Datasets
Table 1 lists the four datasets used in this study, including their characteristics in terms of the total number of attributes, number of instances, percentage of positive instances, and percentage of negative instances. They are all binary class datasets; that is, for all the datasets, each instance is assigned one of two class labels. We chose these because the TBFS ranking techniques can only be used on binary datasets. All datasets considered are gene expression datasets. The Lung Cancer dataset is a classification of malignant pleural mesothelioma (MPM) vs. adenocarcinoma (ADCA) of the lung, and consists of 181 tissue samples (31 MPM, 150 ADCA) (Wang & Gotoh 2009). The acute lymphoblastic leukemia (ALL) dataset consists of 327 tumor samples, of which 79 are positive (24.2%). The Lung Clean dataset was derived from a noisy lung cancer dataset containing 203 instances, including 64 (31.53%) minority instances and 139 (68.47%) majority instances. To produce a dataset that both was imbalanced and could be considered 'clean' (as defined by many classifiers having relatively near-perfect classification on the dataset), a supervised cleansing process was used to reduce the original lung dataset: 5-fold cross-validation was performed on the original lung dataset using a 5NN classifier, and any instances which produced a probability of membership in the opposite class greater than 0.1 were removed. The Ovarian Cancer dataset consists of proteomic spectra derived from analysis of serum to distinguish ovarian cancer from non-cancer (Petricoin et al. 2002).

Table 1: Datasets
Data set         # attributes   # instances   % positive   % negative
Lung cancer      12534          181           17.1         82.9
ALL              12559          327           24.2         75.8
Lung clean       12601          132           17.4         82.6
Ovarian cancer   15155          253           36.0         64.0

Noise Injection Mechanism
To accomplish our goal of analyzing filters in the presence of class noise, class noise is injected into the training datasets using two simulation parameters. These datasets are chosen because preliminary analysis showed near-perfect classification. Ensuring that the datasets are relatively clean prior to noise injection is important because it is very undesirable to inject class noise into already noisy datasets. For the noise injection mechanism, the same procedure as reported by (Van Hulse & Khoshgoftaar 2009) is used. That is, the levels of class noise are regulated by two noise parameters. The first parameter, denoted α (α = 40%, 50%), is used to determine the overall class noise level (NL) in the data. Precisely, α is the noise level relative to the number of instances belonging to the positive class, i.e., the number of examples to be injected with noise is 2 × α × |P|, where |P| is the number of examples in the smaller class (often referred to as the positive class). This ensures that the positive class is not drastically impacted by the level of corruption, especially if the data is highly imbalanced. The second parameter, denoted β (β = 0%, 25%, 50%, 75%, 100%), represents the percentage of class noise injected into the positive instances and is referred to as the noise distribution (ND). In other words, if there are 125 positive class examples in the training dataset and α = 20% and β = 75%, then 50 examples will be injected with noise, and 75% of those (38) will be from the positive class. These parameters serve to ensure systematic control of the training data corruption. Due to space constraints, more details on the noise injection scheme are not included; for those details, readers are referred to (Van Hulse & Khoshgoftaar 2009).

Sampling Techniques
Sampling is a family of preprocessing techniques used for modifying a dataset to improve its balance, to help resolve the problem of class imbalance. There are four major classes of sampling techniques, depending on two choices: whether the sampling will be undersampling (removing samples from the majority) or oversampling (adding samples to the minority), and whether the sampling will be random (removing/adding arbitrary samples) or focused/algorithmic (e.g., removing majority samples near the class border, or adding artificially-generated minority samples). In this paper, due to space considerations (and prior research showing its effectiveness), we used random undersampling (Seiffert, Khoshgoftaar, & Van Hulse 2009), which deleted instances from the majority class until the class ratio was 50:50 majority:minority. Future research will consider a wider range of sampling techniques and balance levels.
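The sketch below illustrates the two data modifications described above: class noise injection controlled by the α and β parameters, and random undersampling to a 50:50 class ratio. The function names (inject_class_noise, random_undersample), the 0/1 label encoding, and the fixed random seed are illustrative assumptions; this is not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility of the example

def inject_class_noise(y, alpha, beta):
    """Flip labels following the alpha/beta scheme described above.

    alpha : overall noise level as a fraction (e.g., 0.4 for 40%);
            2 * alpha * |P| labels are flipped in total
    beta  : fraction of the flipped labels drawn from the positive class
    Returns a corrupted copy of the 0/1 label vector y.
    """
    y_noisy = np.asarray(y).copy()
    pos_idx = np.flatnonzero(y_noisy == 1)
    neg_idx = np.flatnonzero(y_noisy == 0)
    n_noisy = int(round(2 * alpha * len(pos_idx)))
    n_from_pos = min(int(round(beta * n_noisy)), len(pos_idx))
    n_from_neg = min(n_noisy - n_from_pos, len(neg_idx))
    flip_pos = rng.choice(pos_idx, size=n_from_pos, replace=False)
    flip_neg = rng.choice(neg_idx, size=n_from_neg, replace=False)
    y_noisy[flip_pos] = 0   # positive instances relabeled as negative
    y_noisy[flip_neg] = 1   # negative instances relabeled as positive
    return y_noisy

def random_undersample(X, y):
    """Randomly delete majority-class instances until the classes are 50:50."""
    X, y = np.asarray(X), np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    minority, majority = (pos_idx, neg_idx) if len(pos_idx) <= len(neg_idx) else (neg_idx, pos_idx)
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    keep = np.concatenate([minority, kept_majority])
    return X[keep], y[keep]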
Stability Measure
Previous work assessed the stability of feature selection techniques using different measures. Liu and Yu used an entropy-based measure (Liu & Yu 2005), Fayyad and Irani used the Hamming distance (Fayyad & Irani 1992), Kononenko used the correlation coefficient (Kononenko 1994), and Křížek et al. used the consistency index (Křížek, Kittler, & Hlaváč 2007). In this study, and to avoid bias due to chance, we used the consistency index. First, the original dataset is assumed to have n features. T_i and T_j are two subsets of features, where k is the number of features in each subset (i.e., k = |T_i| = |T_j|). When comparing T_i and T_j, the consistency index is defined as follows:

I_C(T_i, T_j) = (d·n − k²) / (k·(n − k))

where d is the cardinality of the intersection between subsets T_i and T_j, and −1 < I_C(T_i, T_j) ≤ 1. The greater the consistency index I_C, the more similar the subsets are. Note that for this experiment, all I_C values are found by comparing features chosen from a modified dataset to those chosen from the original dataset; no pairwise comparison of modified datasets was employed.
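A minimal sketch of the consistency index as used here, comparing the top-k features selected from a modified dataset against those selected from the original dataset; the function name and the convention of returning 0 at the degenerate subset sizes are assumptions made for this example.

def consistency_index(subset_a, subset_b, n_features):
    """Consistency index between two feature subsets of equal size k.

    subset_a, subset_b : iterables of feature indices (e.g., the top-k features
                         from the original and from a modified dataset)
    n_features         : total number of features n in the dataset
    """
    a, b = set(subset_a), set(subset_b)
    k = len(a)
    assert len(b) == k, "subsets must have the same size"
    if k == 0 or k == n_features:
        return 0.0            # index is undefined at the extremes; 0 by convention here
    d = len(a & b)            # cardinality of the intersection
    return (d * n_features - k * k) / (k * (n_features - k))

# Example: I_C between the top-25 features chosen before and after noise injection
# ic = consistency_index(top25_original, top25_noisy, n_features=12534)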
Results
As mentioned earlier, experiments were conducted with eleven threshold-based feature rankers (FM, OR, Pow, PR, GI, MI, KS, Dev, GM, AUC, and PRC). Four datasets were used in these experiments. These datasets are relatively clean, to avoid validity problems caused by injecting noise into a dataset that already has noise. We investigated three scenarios to assess the robustness of feature rankers under different circumstances and for different feature subset sizes. In the first scenario, sampling takes place on the clean (original) datasets, and each sampling technique is performed 30 times on each dataset. In the second scenario, noise is injected into the dataset and no sampling is performed. The third scenario is similar to scenario two, except that it involves sampling, with sampling performed after noise injection. Given that the noise injection process is performed 30 times for each noise level, sampling is only performed once per noisy dataset. For all of these scenarios, the assessment is based on the degree of agreement between a ranker's output on the original datasets and on the modified datasets, which have had noise injected into them, have had some instances removed due to sampling, or both. We used the average of the consistency index I_C over all runs to evaluate the stability of feature rankers. In the experiments, we used eight feature subset sizes for each dataset (10, 14, and 25 attributes, and 0.25%, 0.5%, 1%, 2%, and 5% of the attributes). Preliminary experiments conducted on the corresponding datasets show that these numbers are appropriate. As we have four datasets, nine levels of noise, and eleven feature rankers, we repeat the experiment 11,880 times for each of scenarios two and three, and 1,320 times for scenario one. Only the average results of the 30 repetitions are presented in the tables. Further discussion of the breakdown based on the different noise injection patterns could not be included due to space considerations.

Tables 2 through 4 present the average I_C value for each scenario, for each feature ranker, and for every subset size, across all nine levels of injected noise (scenarios two and three). We also present (1) the average performance of each feature ranker over the four datasets for each scenario (the last row of each table), and (2) the average performance of each subset size over the eleven feature rankers for each scenario (the last column of each table). In all tables, "Attributes" is abbreviated as "Att" for space considerations.

Table 2: Average I_C values for scenario one
Subset size   FM        OR        Pow       PR        GI        MI        KS        Dev       GM        AUC       PRC       Avg
10 Att        0.565491  0.528800  0.551316  0.489600  0.482930  0.631377  0.698930  0.626372  0.689756  0.718112  0.592179  0.597715
14 Att        0.626969  0.585259  0.609693  0.527457  0.514349  0.670470  0.712183  0.656168  0.715759  0.724697  0.632929  0.634176
25 Att        0.658330  0.637627  0.606901  0.544779  0.506368  0.720118  0.775227  0.710431  0.769884  0.749509  0.688386  0.669778
0.25% Att     0.690415  0.644247  0.585301  0.520774  0.499123  0.757126  0.791286  0.703754  0.789790  0.780315  0.713555  0.679608
0.5% Att      0.689724  0.644915  0.585008  0.480706  0.439665  0.749852  0.789925  0.709854  0.789321  0.794064  0.753931  0.675179
1% Att        0.675187  0.591508  0.598641  0.453528  0.389938  0.754507  0.794178  0.670589  0.797754  0.810691  0.750514  0.662458
2% Att        0.662877  0.573474  0.621122  0.451060  0.343508  0.734643  0.799993  0.672898  0.801379  0.826301  0.756214  0.658497
5% Att        0.655833  0.578386  0.675688  0.488098  0.348842  0.745050  0.803384  0.693443  0.805352  0.830997  0.764504  0.671780
Avg           0.653103  0.598027  0.604209  0.494500  0.440590  0.720393  0.770638  0.680439  0.769874  0.779336  0.706527  0.656149

Table 3: Average I_C values for scenario two
Subset size   FM        OR        Pow       PR        GI        MI        KS        Dev       GM        AUC       PRC       Avg
10 Att        0.190497  0.194015  0.258048  0.170294  0.125541  0.370723  0.293995  0.360160  0.238490  0.227744  0.265181  0.244972
14 Att        0.222853  0.198748  0.279128  0.174980  0.130030  0.413791  0.328112  0.393598  0.268989  0.262041  0.302894  0.270469
25 Att        0.269098  0.197321  0.297444  0.181145  0.138629  0.456812  0.395208  0.433659  0.335945  0.328268  0.372952  0.309680
0.25% Att     0.287799  0.200472  0.306716  0.180260  0.140979  0.463218  0.416617  0.440437  0.361507  0.356831  0.394186  0.322638
0.5% Att      0.315697  0.200271  0.336771  0.178505  0.146150  0.471262  0.448645  0.438906  0.406476  0.405106  0.430831  0.343511
1% Att        0.325431  0.201967  0.368565  0.190917  0.159357  0.460313  0.457815  0.422539  0.429074  0.429674  0.442980  0.353512
2% Att        0.334663  0.219185  0.395241  0.210272  0.177443  0.460358  0.473543  0.430886  0.452450  0.461940  0.457345  0.370302
5% Att        0.356951  0.252927  0.441514  0.250986  0.209229  0.484584  0.497963  0.453408  0.485961  0.508882  0.494060  0.403315
Avg           0.287874  0.208113  0.335428  0.192170  0.153420  0.447633  0.413987  0.421699  0.372362  0.372561  0.395054  0.327300

Table 4: Average I_C values for scenario three
Subset size   FM        OR        Pow       PR        GI        MI        KS        Dev       GM        AUC       PRC       Avg
10 Att        0.061783  0.159728  0.182339  0.137860  0.149813  0.241085  0.207633  0.238860  0.167047  0.165937  0.193272  0.173214
14 Att        0.072486  0.172727  0.201794  0.151938  0.159948  0.274088  0.235553  0.272696  0.189605  0.193647  0.219468  0.194905
25 Att        0.091967  0.192903  0.224819  0.169455  0.175425  0.314981  0.281248  0.313904  0.233787  0.242103  0.276613  0.228837
0.25% Att     0.100093  0.196442  0.229973  0.164765  0.172535  0.326698  0.300766  0.322458  0.258900  0.264428  0.293711  0.239161
0.5% Att      0.119167  0.204697  0.247015  0.155742  0.156996  0.337231  0.327553  0.335287  0.294405  0.304828  0.335506  0.256221
1% Att        0.138073  0.198318  0.264133  0.157956  0.150343  0.343192  0.343155  0.331503  0.319431  0.334212  0.354537  0.266805
2% Att        0.159457  0.202956  0.284858  0.169800  0.151347  0.348091  0.364082  0.339854  0.347838  0.362618  0.377724  0.282602
5% Att        0.190835  0.229403  0.325600  0.199801  0.163191  0.373361  0.395971  0.366484  0.388530  0.410338  0.416624  0.314558
Avg           0.116733  0.194647  0.245066  0.163415  0.159950  0.319841  0.306995  0.315131  0.274943  0.284764  0.308432  0.244538
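Before turning to the analysis, the sketch below shows how the per-cell averages in Tables 2 through 4 could, in principle, be computed: a ranking function and a data-modification function (for example, the noise injection and undersampling helpers sketched earlier) are applied repeatedly, and the consistency index against the original ranking is averaged over the repetitions. The function signature, parameter names, and the 30-repetition default are illustrative assumptions, not the authors' code.

import numpy as np

def average_stability(X, y, rank_features, modify, k, n_repeats=30):
    """Average consistency index between the top-k features selected from the
    original data and from repeatedly modified copies of it (assumes 0 < k < n).

    rank_features(X, y) -> 1-D array of scores, one per feature (e.g., a TBFS metric)
    modify(X, y)        -> a modified (noisy and/or undersampled) copy of (X, y)
    """
    n = X.shape[1]
    top_original = set(np.argsort(rank_features(X, y))[::-1][:k])
    ics = []
    for _ in range(n_repeats):
        Xm, ym = modify(X, y)
        top_modified = set(np.argsort(rank_features(Xm, ym))[::-1][:k])
        d = len(top_original & top_modified)
        ics.append((d * n - k * k) / (k * (n - k)))   # consistency index I_C
    return float(np.mean(ics))

# Scenario three, for example, could pass (using the earlier hypothetical helpers):
#   modify = lambda X, y: random_undersample(X, inject_class_noise(y, alpha=0.4, beta=0.5))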
The results demonstrate that while there was no clear winner among the eleven filters, Gini Index, Probability Ratio, and Odds Ratio show the worst stability across all scenarios and sampling techniques on average. In scenario one (sampling without noise injection), AUC shows the best stability on average, followed closely by KS and GM. When considering the other scenarios, AUC is closer to the middle of the pack; MI performs best, followed closely by Dev and KS. This indicates that these three feature rankers (MI, Dev, and KS) are less sensitive to class noise and are good choices for stable feature extraction. The results also show that the size of the subset of selected features can influence the stability of a feature ranking technique. All feature rankers show more stable behavior as the feature subset size is increased when class noise is present (scenarios two and three). However, without injected class noise (scenario one), many feature rankers have an internal optimum for feature subset size, and increasing the size beyond that point reduces performance. The exact location of this optimum varies from 25 attributes to 1% of the total original attributes, although some rankers (including KS, Dev, GM, AUC, and PRC) show consistent improvement as subset size increases. In addition, when looking across all scenarios, scenario three shows the worst stability, which demonstrates that sampling does not improve the stability of feature selection techniques when noise is present. The only exception is GI at small subset sizes, where random undersampling improved the stability of GI in the presence of class noise. Nevertheless, GI is still among the worst-performing feature rankers.

Conclusion
To the best of our knowledge, this is the first study to investigate the stability of threshold-based feature selection techniques. We conducted stability analysis on eleven threshold-based feature selection techniques and four groups of real-world gene expression datasets. We injected noise into these datasets to better simulate real-world datasets. We investigated three scenarios (sampling only, noise injection only, and noise injection followed by sampling) to assess the impact of data sampling and class noise on stability. The experimental results demonstrate that GI performed worst among the eleven feature rankers across all scenarios. Furthermore, in the presence of class noise, the best three filters overall are MI, Dev, and KS. Results also show that trying to balance datasets through data sampling has, on average, a negative impact on the stability of feature ranking techniques applied to those datasets. In addition, although for noisy data larger feature subset sizes are almost always better, the same cannot be said for clean data (which often shows an internal optimum past which larger sizes hurt performance). Future research may involve conducting more experiments, using other feature selection techniques, using other data sampling balance levels (e.g., 65:35), examining more datasets from other application domains, and considering other feature subset sizes.

References
Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer, W. P. 2002. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–357.
Dunne, K.; Cunningham, P.; and Azuaje, F. 2002. Solutions to instability problems with sequential wrapper-based approaches to feature selection. Technical Report TCD-CD2002-28, Department of Computer Science, Trinity College, Dublin, Ireland.
Fayyad, U. M., and Irani, K. B. 1992. On the handling of continuous-valued attributes in decision tree generation. Machine Learning 8:87–102.
Guo, X.; Yin, Y.; Dong, C.; Yang, G.; and Zhou, G. 2008. On the class imbalance problem. In Fourth International Conference on Natural Computation (ICNC '08), volume 4, 192–201.
Hall, M. A., and Holmes, G. 2003. Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering 15(6):1437–1447.
Kononenko, I. 1994. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the European Conference on Machine Learning, 171–182. Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Kotsiantis, S.; Kanellopoulos, D.; and Pintelas, P. 2006. Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering 30(1):25–36.
Křížek, P.; Kittler, J.; and Hlaváč, V. 2007. Improving stability of feature selection methods. In Proceedings of the 12th International Conference on Computer Analysis of Images and Patterns (CAIP '07), 929–936. Berlin, Heidelberg: Springer-Verlag.
Kuncheva, L. I. 2007. A stability index for feature selection. In Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications, 390–395. Anaheim, CA, USA: ACTA Press.
Liu, H., and Yu, L. 2005. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17(4):491–502.
Loscalzo, S.; Yu, L.; and Ding, C. 2009. Consensus group stable feature selection. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09), 567–576. New York, NY, USA: ACM.
Petricoin, E. F.; Ardekani, A. M.; Hitt, B. A.; Levine, P. J.; Fusaro, V. A.; Steinberg, S. M.; Mills, G. B.; Simone, C.; Fishman, D. A.; Kohn, E. C.; and Liotta, L. A. 2002. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359(9306):572–577.
Saeys, Y.; Abeel, T.; and Peer, Y. 2008. Robust feature selection using ensemble feature selection techniques. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD '08), Part II, 313–325. Berlin, Heidelberg: Springer-Verlag.
Seiffert, C.; Khoshgoftaar, T.; and Van Hulse, J. 2009. Improving software-quality predictions with data sampling and boosting. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 39(6):1283–1294.
Van Hulse, J., and Khoshgoftaar, T. M. 2009. Knowledge discovery from imbalanced and noisy data. Data & Knowledge Engineering 68(12):1513–1542.
Wang, X., and Gotoh, O. 2009. Accurate molecular classification of cancer using simple rules. BMC Medical Genomics 2(1):64.
Wang, H., and Khoshgoftaar, T. M. 2011. Measuring stability of threshold-based feature selection techniques. In Proceedings of the 23rd IEEE International Conference on Tools with Artificial Intelligence, 986–993.
Wang, H.; Khoshgoftaar, T. M.; and Van Hulse, J. 2010. A comparative study of threshold-based feature selection techniques. In Proceedings of the 2010 IEEE International Conference on Granular Computing, 499–504.
Witten, I. H., and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition.