Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference
Robustness of Threshold-Based Feature Rankers with
Data Sampling on Noisy and Imbalanced Data
Ahmad Abu Shanab, Taghi Khoshgoftaar and Randall Wald
Florida Atlantic University
777 Glades Road, Boca Raton, FL 33431
Abstract
Gene selection has become a vital component in the learning process when using high-dimensional gene expression data. Although extensive research has been done towards evaluating the performance of classifiers trained with the selected features, the stability of feature ranking techniques has received relatively little study. This work evaluates the robustness of eleven threshold-based feature selection techniques, examining the impact of data sampling and class noise on the stability of feature selection. To assess the robustness of these techniques, we use four gene expression datasets, employ eleven threshold-based feature rankers, and inject artificial class noise to better simulate real-world datasets. The results demonstrate that although no ranker consistently outperforms the others, MI and Dev show the best stability on average, while GI and PR show the least. Results also show that trying to balance datasets through data sampling has, on average, no positive impact on the stability of the feature ranking techniques applied to those datasets. In addition, larger feature subset sizes improve stability, but they do so reliably only for noisy datasets.

Introduction

One of the major challenges in cancer classification and prediction is the sheer abundance of features (genes), which in most cases exceeds the number of instances/cases. Most of these attributes provide little or no useful information for building a classification model. The process of removing irrelevant and redundant attributes is known as feature selection. Reducing the number of attributes in a dataset can lead to better performance. For this reason, feature selection has received a lot of attention in the past few years. Most research has focused on evaluating feature selection techniques by assessing the performance of a chosen classifier trained with the selected features. Relatively little research has focused on evaluating the stability of feature selection techniques to changes in the datasets. The stability of a feature selection method is defined as the degree of agreement between the outputs of the feature selection method when applied to randomly selected subsets of the same input data (Kuncheva 2007; Loscalzo, Yu, & Ding 2009). With stable feature selection techniques, practitioners can be confident that the selected features are relatively robust to variations in the training data.

Class imbalance is another major challenge to machine learning. Many important gene expression datasets are characterized by class imbalance, where there are few cases of the positive class, also called the class of interest, and many more cases of the negative class. This can result in suboptimal classification performance, because many classifiers assume that the classes are equal in size and some performance metrics reach their maximum value without properly balancing the weight of each class. As a result, the classifier will have a very high rate of false negatives, which primarily harms the positive class, the class of greatest interest. A variety of techniques have been proposed to alleviate the problems associated with class imbalance. The most popular such technique is data sampling, where the dataset is transformed into a more balanced one by adding or removing instances. This study applies random undersampling, a widely-used data sampling technique, to investigate the effect of sampling on stability.

Another factor that can characterize real-world datasets is noise. Noise refers to errors or missing values contained in real-world data. Noise in the independent features is called attribute noise, while noise in the class label is described as class noise. To the best of our knowledge, the stability of feature selection techniques in the presence of noise has received little attention. Given the prevalence of noise in real-world datasets, there is clearly a need to understand the impact of noise on the stability of feature selection. Thus, all experiments in this paper were performed on data which was first determined to be relatively free of noise and which then had artificial class noise injected in a controlled fashion. This way, the results can be used to determine the impact of class noise and sampling on the stability of feature selection.

In this paper, we evaluate threshold-based feature ranking techniques based on the degree of agreement between a feature ranker's output on the original datasets and its output on datasets which have been modified (have had noise injected into them, have had some instances removed from them due to random undersampling, or both). Note that we
are comparing the feature subsets before and after modification, rather than comparing the subsets from different runs of
the modification approach. This method has not been greatly
studied in the literature, and constitutes a contribution of this
paper.
Related Work
Feature selection is a common preprocessing technique used to select a subset of the original features to be used in the learning process. Feature selection has been extensively studied for many years in data mining and machine learning. A comprehensive survey of feature selection algorithms can be found in the work of Liu and Yu (Liu & Yu 2005). Hall and Holmes (Hall & Holmes 2003) evaluated six feature ranking methods and applied them to 15 datasets from the UCI repository. They came to the conclusion that there is no single best approach for all situations. Saeys et al. (Saeys, Abeel, & Peer 2008) studied the use of ensemble feature selection methods and showed that the ensemble approach provides more robust feature subsets than a single feature selection method.
Data sampling is another important preprocessing activity
in data mining. Data sampling is used to deal with the class
imbalance problem, that is, the overabundance of negative
class instances versus positive class instances. This problem is seen in many real-world datasets. Comprehensive
studies on different sampling techniques were performed by
Kotsiantis (Kotsiantis, Kanellopoulos, & Pintelas 2006) and
Guo (Guo et al. 2008), including both oversampling and
undersampling techniques (which add instances to the minority class and remove instances from the majority class,
respectively), and both random and directed forms of sampling. Chawla (Chawla et al. 2002) proposed an intelligent oversampling method called Synthetic Minority Oversampling Technique (SMOTE). SMOTE adds new, artificial
minority examples by extrapolating between preexisting minority instances rather than simply duplicating original instances. In this study, due to space considerations (and prior
research showing its effectiveness), we used random undersampling (Seiffert, Khoshgoftaar, & Van Hulse 2009).
A common way to evaluate feature selection techniques
is based on their classification performance, comparing the
classification performance of learners built with the selected
features to those built with the complete set of attributes.
Another evaluation criterion is the stability of feature ranking techniques. Dunne et al. (Dunne, Cunningham, &
Azuaje 2002) proposed a framework for comparing different sets of features. They evaluated the stability of standard feature selection methods and an aggregated approach,
and concluded that the aggregated approach was superior
to the standard wrapper-based feature selection techniques.
Křížek et al. (Křížek, Kittler, & Hlaváč 2007) proposed an
entropy-based measure for assessing the stability of feature
selection methods. Kuncheva (Kuncheva 2007) proposed a
stability index for measuring the discrepancy in different
sequences of features obtained from different runs of sequential forward selection, a widely used feature selection
method. Wang (Wang & Khoshgoftaar 2011) compared the stability of 11 threshold-based feature ranking techniques on a software engineering dataset, discovering that significant variations existed, with AUC and PRC performing well above the rest.

In this paper, we evaluate feature ranking techniques based on the degree of agreement between a feature ranker's output on the original datasets and its output on the modified datasets, which have had noise injected into them and then have had some instances removed from them due to sampling. This study compares three different scenarios. The first scenario involves sampling on the original datasets, with sampling being repeated 30 times. The second scenario involves injecting nine levels of noise but with no sampling performed, with each level of noise injected 30 times. The third scenario is similar to scenario two, except that it involves sampling after noise injection. Given the prevalence of noise in real-world datasets, there is clearly a need to understand the impact of noise on the stability of feature selection. This paper shows how to distinguish the most and least stable threshold-based feature rankers and points out the importance of considering the impact of noise and sampling on the stability of feature rankers.
Feature Ranking Techniques
In this paper, we examine filter-based feature rankers, since wrapper-based techniques can be very computationally expensive. Eleven threshold-based feature selection techniques were employed within WEKA (Witten & Frank 2005). The family of threshold-based feature rankers is built on a novel approach that permits the use of a classification performance metric as a feature ranker (Wang, Khoshgoftaar, & Van Hulse 2010). Note that while none of these feature rankers use a classifier, they do use the feature values (normalized to lie between 0 and 1) as a posterior probability, choosing a threshold and "classifying" instances based directly on the values of the feature being examined. Classifier performance metrics are then used to evaluate the quality of the feature. In effect, this allows the use of the performance metrics to describe how well the feature correlates with the class; since no actual classifiers are built, this still qualifies as filter-based feature selection.
1. F-Measure (FM) is a single measure that combines both precision and recall. In particular, FM is the harmonic mean of precision and recall. Using a tunable parameter β to indicate the relative importance of precision and recall, it is calculated as follows:

   FM = \max_{t \in [0,1]} \frac{(1 + \beta^2)\, R(t)\, PRE(t)}{\beta^2 R(t) + PRE(t)}   (1)

   where R(t) and PRE(t) are Recall and Precision at threshold t, respectively. Note that Recall, R(t), is equivalent to TPR(t), while Precision, PRE(t), represents the proportion of positive predictions that are truly positive at each threshold t ∈ [0, 1]. More precisely, PRE(t) is defined as the number of positive instances with X̂^j > t divided by the total number of instances with X̂^j > t.

2. Odds Ratio (OR) is a measure used to describe the strength of association between an independent variable and the dependent variable. It is defined as:

   OR = \max_{t \in [0,1]} \frac{TP(t) \cdot TN(t)}{FP(t) \cdot FN(t)}   (2)

   where TP(t) and TN(t) represent the number of true positives and true negatives at threshold t, respectively, while FP(t) and FN(t) represent the number of false positives and false negatives at threshold t, respectively.

3. Power (Pow) is a measure that avoids false positive cases while giving stronger preference to positive cases. It is defined as:

   Pow = \max_{t \in [0,1]} \left( (1 - FPR(t))^k - (1 - TPR(t))^k \right)   (3)

   where k = 5.

4. Probability Ratio (PR) is the sample estimate probability of the feature given the positive class divided by the sample estimate probability of the feature given the negative class:

   PR = \max_{t \in [0,1]} \frac{TPR(t)}{FPR(t)}   (4)

5. Gini Index (GI) is derived from the decision tree construction process, where a score is used as a splitting criterion to grow the tree along a particular branch. It measures the impurity of each feature towards categorization, and it is obtained by:

   GI = \min_{t \in [0,1]} \left[ 2\, PRE(t)(1 - PRE(t)) + 2\, NPV(t)(1 - NPV(t)) \right]   (5)

   where NPV(t), the negative predicted value at threshold t, is the percentage of examples predicted to be negative that are actually negative. The GI of a feature is thus the minimum over all decision thresholds t ∈ [0, 1].

6. Mutual Information (MI) computes the mutual information criterion with respect to the number of times a feature value and a class co-occur, the feature value occurs without the class, and the class occurs without the feature value. Mutual information is defined as:

   MI = \max_{t \in [0,1]} \sum_{\hat{y}^t \in \{P,N\}} \sum_{y \in \{P,N\}} p(\hat{y}^t, y) \log \frac{p(\hat{y}^t, y)}{p(\hat{y}^t)\, p(y)}   (6)

   where y(x) is the actual class of instance x and ŷ^t(x) is the predicted class based on the value of the attribute X^j at threshold t. The probabilities are estimated as

   p(\hat{y}^t = \alpha, y = \beta) = \frac{|\{x : (\hat{y}^t(x) = \alpha) \cap (y(x) = \beta)\}|}{|P| + |N|},
   p(\hat{y}^t = \alpha) = \frac{|\{x : \hat{y}^t(x) = \alpha\}|}{|P| + |N|},
   p(y = \alpha) = \frac{|\{x : y(x) = \alpha\}|}{|P| + |N|},   α, β ∈ {P, N}.

   Note that the class (actual or predicted) can be either positive (P) or negative (N).

7. The Kolmogorov-Smirnov statistic (KS) measures a feature's relevance by dividing the data into clusters based on the class and comparing the distribution of that particular attribute among the clusters. It is effectively the maximum difference between the curves generated by the true positive and false positive rates (TPR(t) and FPR(t)) of the ersatz "classifier" as the decision threshold changes from 0 to 1, and its formula is given as follows:

   KS = \max_{t \in [0,1]} |TPR(t) - FPR(t)|   (7)

8. Deviance (Dev) is the minimum residual sum of squares based on a threshold t. It measures the sum of the squared errors from the mean class given a partitioning of the space based on the threshold t, as shown in the equation below:

   Dev = \min_{t \in [0,1]} \left[ \sum_{\hat{y}^t(x_i) = N} (\mu_N - x_i)^2 + \sum_{\hat{y}^t(x_i) = P} (\mu_P - x_i)^2 \right]   (8)

   Here, ŷ^t(x) represents the predicted class of instance x (either N or P), μ_N is the mean value of all instances actually found in the negative class, and μ_P is the mean value of all instances actually found in the positive class. As Dev represents the total error found in the partitioning, lower values are preferred.

9. Geometric Mean (GM) is a single-value performance measure obtained by calculating the square root of the product of the true positive rate, TPR(t), and the true negative rate, TNR(t). GM ranges from 0 to 1, with a value of 1 attributed to a feature that is perfectly correlated with the class:

   GM = \max_{t \in [0,1]} \sqrt{TPR(t) \times TNR(t)}   (9)

   Thus, a feature's predictive power is determined by the maximum value of GM, as different GM values are obtained, one at each value of the normalized attribute range.

10. Area Under the ROC Curve (AUC), the area under the receiver operating characteristic (ROC) curve, is a single-value measure based on statistical decision theory that was developed for the analysis of electronic signal detection. It is the result of plotting FPR(t) against TPR(t). In this study, ROC is used to determine each feature's predictive power. ROC curves are generated by varying the decision threshold t used to transform the normalized attribute values into a predicted class; that is, as the threshold for the normalized attribute varies from 0 to 1, the true positive and false positive rates are calculated.

11. Area Under the PRC Curve (PRC), the area under the precision-recall characteristic curve, is a single-value measure depicting the trade-off between precision and recall. It is the result of plotting recall, TPR(t), against precision, PRE(t). Its value ranges from 0 to 1, with 1 denoting a feature with the highest predictive power. The PRC curve is generated by varying the decision threshold t from 0 to 1 and plotting the recall (x-axis) and precision (y-axis) at each point, in a similar manner to the ROC curve.
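To make the threshold-based scheme concrete, the following sketch treats a single normalized feature as an ersatz classifier (predict positive when the value exceeds a threshold t) and sweeps t over [0, 1] to score the feature with two of the metrics defined above, KS (Eq. 7) and GM (Eq. 9). This is only a simplified illustration of the idea, not the WEKA-based implementation used in the paper; the threshold grid and the names used below are assumptions.

```python
import numpy as np

def ks_and_gm(feature, labels, thresholds=None):
    """Score one feature with the threshold-based KS and GM metrics.
    `feature` is a 1-D array of raw values; `labels` holds 1 (positive) / 0 (negative)."""
    x = np.asarray(feature, dtype=float)
    y = np.asarray(labels)
    # Normalize the attribute to [0, 1], as the TBFS family requires.
    x = (x - x.min()) / (x.max() - x.min())
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    pos, neg = (y == 1), (y == 0)
    ks_best, gm_best = 0.0, 0.0
    for t in thresholds:
        pred_pos = x > t                              # ersatz "classifier" at threshold t
        tpr = np.mean(pred_pos[pos]) if pos.any() else 0.0
        fpr = np.mean(pred_pos[neg]) if neg.any() else 0.0
        tnr = 1.0 - fpr
        ks_best = max(ks_best, abs(tpr - fpr))        # Eq. (7)
        gm_best = max(gm_best, np.sqrt(tpr * tnr))    # Eq. (9)
    return ks_best, gm_best

# Toy usage: a feature that partially separates the classes.
rng = np.random.default_rng(1)
labels = np.array([1] * 20 + [0] * 80)
feature = np.concatenate([rng.normal(2.0, 1.0, 20), rng.normal(0.0, 1.0, 80)])
print(ks_and_gm(feature, labels))
```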
Table 1: Datasets

Data set          # attributes   # instances   % positive   % negative
Lung cancer              12534           181         17.1         82.9
ALL                      12559           327         24.2         75.8
Lung clean               12601           132         17.4         82.6
Ovarian cancer           15155           253         36.0         64.0
Empirical Evaluation
Datasets
Table 1 lists the four datasets used in this study, including
their characteristics in terms of the total number of attributes,
number of instances, percentage of positive instances, and
percentage of negative instances. All are binary class datasets; that is, each instance is assigned one of two class labels. We chose binary datasets because the threshold-based feature selection (TBFS) ranking techniques can only be used on binary class data.
All datasets considered are gene expression datasets. The
Lung Cancer dataset is a classification of malignant pleural mesothelioma (MPM) vs. adenocarcinoma (ADCA) of
the lung, and consists of 181 tissue samples (31 MPM, 150
ADCA) (Wang & Gotoh 2009). The acute lymphoblastic
leukemia dataset consists of 327 tumor samples of which
79 are positive (24.2%). The Lung Clean dataset was derived from a noisy lung cancer dataset containing 203 instances, including 64 (31.53%) minority instances and 139
(68.47%) majority instances. To produce a dataset that was both imbalanced and could be considered "clean" (defined as many classifiers achieving nearly perfect classification on the dataset), a supervised cleansing process was used to reduce the original lung dataset. Five-fold cross-validation
was performed on the original lung dataset using a 5NN
classifier, and any instances which produced a probability of
membership in the opposite class that was greater than 0.1
were removed. The ovarian cancer dataset consists of proteomic spectra derived from analysis of serum to distinguish
ovarian cancer from non-cancer (Petricoin et al. 2002).
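As an illustration of the cleansing step described for the Lung Clean dataset, the sketch below assumes scikit-learn is available: it obtains 5-fold cross-validated class probabilities from a 5NN classifier and removes every instance whose probability of membership in the opposite class exceeds 0.1. The library choice and parameter names are assumptions; the original preprocessing may differ in detail.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def supervised_cleanse(X, y, threshold=0.1):
    """Drop instances whose 5-fold cross-validated 5NN probability of belonging
    to the opposite class exceeds `threshold`. Labels are assumed to be 0/1."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    proba = cross_val_predict(KNeighborsClassifier(n_neighbors=5), X, y,
                              cv=cv, method="predict_proba")
    # Columns of `proba` follow the sorted labels, so column y[i] holds the
    # probability of instance i's true class; the rest is opposite-class mass.
    own_class_proba = proba[np.arange(len(y)), y]
    keep = (1.0 - own_class_proba) <= threshold
    return X[keep], y[keep]

# Toy usage with a small synthetic two-class dataset.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, (40, 5)), rng.normal(1.5, 1.0, (20, 5))])
y = np.array([0] * 40 + [1] * 20)
X_clean, y_clean = supervised_cleanse(X, y)
print(len(y), "->", len(y_clean), "instances kept")
```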
Sampling Techniques
Sampling is a family of preprocessing techniques used to modify a dataset to improve its balance, helping to resolve the problem of class imbalance. There are four major classes of sampling techniques, depending on two choices: whether the sampling is undersampling (removing instances from the majority class) or oversampling (adding instances to the minority class), and whether the sampling is random (removing/adding arbitrary instances) or focused/algorithmic (e.g., removing majority instances near the class border, or adding artificially-generated minority instances). In this paper, due to space considerations (and prior research showing its effectiveness), we used random undersampling (Seiffert, Khoshgoftaar, & Van Hulse 2009), which deletes instances from the majority class until the class ratio is 50:50 majority:minority. Future research will consider a wider range of sampling techniques and balance levels.
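The sketch below shows random undersampling in its simplest form, as used here: randomly discard majority-class instances until the two classes are the same size (a 50:50 ratio). Names are illustrative; this is not the exact implementation from the cited work.

```python
import numpy as np

def random_undersample(X, y, rng=None):
    """Return a copy of (X, y) in which the majority class has been randomly
    reduced to the size of the minority class (a 50:50 ratio)."""
    rng = np.random.default_rng() if rng is None else rng
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    keep_minority = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y == majority)
    keep_majority = rng.choice(majority_idx, size=len(keep_minority), replace=False)
    keep = np.sort(np.concatenate([keep_minority, keep_majority]))
    return X[keep], y[keep]

# Toy usage: 10 positive and 40 negative instances -> 10 and 10.
rng = np.random.default_rng(3)
X = rng.random((50, 4))
y = np.array([1] * 10 + [0] * 40)
X_bal, y_bal = random_undersample(X, y, rng=rng)
print(np.bincount(y_bal))
```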
Stability Measure
Previous work assessed the stability of feature selection
techniques using different measures. Liu and Yu used
an entropy-based measure (Liu & Yu 2005), Fayyad and
Irani used the Hamming distance (Fayyad & Irani 1992),
Kononenko used the correlation coefficient (Kononenko
1994), and Křı́žek et al used the consistency index (Křı́žek,
Kittler, & Hlaváč 2007). In this study and to avoid bias due
to chance we used the consistency index. First the original dataset is assumed to have n features. Ti and Tj are two
subsets of features, where k is the number of features in each
subset (e.g., k = |Ti | = |Tj |). When comparing Ti and Tj
the consistency index is defined as follows:
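To make the measure concrete, here is a small sketch of the consistency index defined above, applied to a top-k feature list chosen from the original dataset and one chosen from a modified dataset. The gene names and the value of k are made up for illustration.

```python
def consistency_index(subset_a, subset_b, n):
    """Consistency index between two feature subsets of equal size k drawn from
    n total features: (d*n - k^2) / (k*(n - k)), where d is the intersection size."""
    a, b = set(subset_a), set(subset_b)
    k = len(a)
    assert k == len(b), "subsets must have the same size"
    d = len(a & b)
    return (d * n - k ** 2) / (k * (n - k))

# Toy usage: top-5 genes picked from the original vs. a noise-injected dataset,
# out of n = 1000 candidate genes (names are made up for illustration).
original = ["g17", "g42", "g105", "g256", "g731"]
modified = ["g17", "g42", "g256", "g888", "g940"]
print(consistency_index(original, modified, n=1000))   # d = 3 -> about 0.598
```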
Noise Injection Mechanism

To accomplish our goal of analyzing filters in the presence of class noise, noise is injected into the training datasets using two simulation parameters. These datasets were chosen because preliminary analysis showed near-perfect classification on them. Ensuring that the datasets are relatively clean prior to noise injection is important, because it is very undesirable to inject class noise into already noisy datasets.

Class noise is injected into the datasets using the same procedure as reported by (Van Hulse & Khoshgoftaar 2009). That is, the levels of class noise are regulated by two noise parameters. The first parameter, denoted α (α = 40%, 50%), is used to determine the overall class noise level (NL) in the data. Precisely, α is the noise level relative to the number of instances belonging to the positive class; i.e., the number of examples to be injected with noise is 2 × α × |P|, where |P| is the number of examples in the smaller class (often referred to as the positive class). This ensures that the positive class is not drastically impacted by the level of corruption, especially if the data is highly imbalanced. The second parameter, denoted β (β = 0%, 25%, 50%, 75%, 100%), represents the percentage of class noise injected into the positive instances and is referred to as the noise distribution (ND). In other words, if there are 125 positive class examples in the training dataset and α = 20% and β = 75%, then 50 examples will be injected with noise, and 75% of those (38) will be from the positive class. These parameters serve to ensure systematic control of the training data corruption. Due to space constraints, more details on the noise injection scheme are not included; readers are referred to (Van Hulse & Khoshgoftaar 2009).
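A minimal sketch of this two-parameter corruption scheme, under the definitions above: 2·α·|P| labels are flipped in total, with a fraction β of the flips drawn from the positive class and the remainder from the negative class. It illustrates the idea rather than reproducing the exact procedure of Van Hulse and Khoshgoftaar (2009); rounding and sampling details are assumptions.

```python
import numpy as np

def inject_class_noise(y, alpha, beta, rng=None):
    """Flip 2*alpha*|P| class labels; a fraction beta of the flips come from
    the positive class (label 1) and the remainder from the negative class (0)."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.array(y)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    total = int(round(2 * alpha * len(pos)))          # overall noise level (NL)
    n_pos = int(round(beta * total))                  # noise distribution (ND)
    n_neg = total - n_pos
    flip = np.concatenate([rng.choice(pos, n_pos, replace=False),
                           rng.choice(neg, n_neg, replace=False)])
    y[flip] = 1 - y[flip]                             # flip the selected labels
    return y

# Worked example from the text: 125 positives, alpha = 20%, beta = 75%
# -> 50 labels flipped, 38 of them from the positive class.
rng = np.random.default_rng(4)
y = np.array([1] * 125 + [0] * 500)
noisy = inject_class_noise(y, alpha=0.20, beta=0.75, rng=rng)
print((y != noisy).sum(), (y[y == 1] != noisy[y == 1]).sum())
```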
Table 2: Average Ic values for scenario one

Filter   10 Att     14 Att     25 Att     0.25% Att  0.5% Att   1% Att     2% Att     5% Att     Avg
FM       0.565491   0.626969   0.658330   0.690415   0.689724   0.675187   0.662877   0.655833   0.653103
OR       0.528800   0.585259   0.637627   0.644247   0.644915   0.591508   0.573474   0.578386   0.598027
Pow      0.551316   0.609693   0.606901   0.585301   0.585008   0.598641   0.621122   0.675688   0.604209
PR       0.489600   0.527457   0.544779   0.520774   0.480706   0.453528   0.451060   0.488098   0.494500
GI       0.482930   0.514349   0.506368   0.499123   0.439665   0.389938   0.343508   0.348842   0.440590
MI       0.631377   0.670470   0.720118   0.757126   0.749852   0.754507   0.734643   0.745050   0.720393
KS       0.698930   0.712183   0.775227   0.791286   0.789925   0.794178   0.799993   0.803384   0.770638
Dev      0.626372   0.656168   0.710431   0.703754   0.709854   0.670589   0.672898   0.693443   0.680439
GM       0.689756   0.715759   0.769884   0.789790   0.789321   0.797754   0.801379   0.805352   0.769874
AUC      0.718112   0.724697   0.749509   0.780315   0.794064   0.810691   0.826301   0.830997   0.779336
PRC      0.592179   0.632929   0.688386   0.713555   0.753931   0.750514   0.756214   0.764504   0.706527
Avg      0.597715   0.634176   0.669778   0.679608   0.675179   0.662458   0.658497   0.671780   0.656149

Table 3: Average Ic values for scenario two

Filter   10 Att     14 Att     25 Att     0.25% Att  0.5% Att   1% Att     2% Att     5% Att     Avg
FM       0.190497   0.222853   0.269098   0.287799   0.315697   0.325431   0.334663   0.356951   0.287874
OR       0.194015   0.198748   0.197321   0.200472   0.200271   0.201967   0.219185   0.252927   0.208113
Pow      0.258048   0.279128   0.297444   0.306716   0.336771   0.368565   0.395241   0.441514   0.335428
PR       0.170294   0.174980   0.181145   0.180260   0.178505   0.190917   0.210272   0.250986   0.192170
GI       0.125541   0.130030   0.138629   0.140979   0.146150   0.159357   0.177443   0.209229   0.153420
MI       0.370723   0.413791   0.456812   0.463218   0.471262   0.460313   0.460358   0.484584   0.447633
KS       0.293995   0.328112   0.395208   0.416617   0.448645   0.457815   0.473543   0.497963   0.413987
Dev      0.360160   0.393598   0.433659   0.440437   0.438906   0.422539   0.430886   0.453408   0.421699
GM       0.238490   0.268989   0.335945   0.361507   0.406476   0.429074   0.452450   0.485961   0.372362
AUC      0.227744   0.262041   0.328268   0.356831   0.405106   0.429674   0.461940   0.508882   0.372561
PRC      0.265181   0.302894   0.372952   0.394186   0.430831   0.442980   0.457345   0.494060   0.395054
Avg      0.244972   0.270469   0.309680   0.322638   0.343511   0.353512   0.370302   0.403315   0.327300
Results
Table 4: Average Ic values for scenario three

Filter   10 Att     14 Att     25 Att     0.25% Att  0.5% Att   1% Att     2% Att     5% Att     Avg
FM       0.061783   0.072486   0.091967   0.100093   0.119167   0.138073   0.159457   0.190835   0.116733
OR       0.159728   0.172727   0.192903   0.196442   0.204697   0.198318   0.202956   0.229403   0.194647
Pow      0.182339   0.201794   0.224819   0.229973   0.247015   0.264133   0.284858   0.325600   0.245066
PR       0.137860   0.151938   0.169455   0.164765   0.155742   0.157956   0.169800   0.199801   0.163415
GI       0.149813   0.159948   0.175425   0.172535   0.156996   0.150343   0.151347   0.163191   0.159950
MI       0.241085   0.274088   0.314981   0.326698   0.337231   0.343192   0.348091   0.373361   0.319841
KS       0.207633   0.235553   0.281248   0.300766   0.327553   0.343155   0.364082   0.395971   0.306995
Dev      0.238860   0.272696   0.313904   0.322458   0.335287   0.331503   0.339854   0.366484   0.315131
GM       0.167047   0.189605   0.233787   0.258900   0.294405   0.319431   0.347838   0.388530   0.274943
AUC      0.165937   0.193647   0.242103   0.264428   0.304828   0.334212   0.362618   0.410338   0.284764
PRC      0.193272   0.219468   0.276613   0.293711   0.335506   0.354537   0.377724   0.416624   0.308432
Avg      0.173214   0.194905   0.228837   0.239161   0.256221   0.266805   0.282602   0.314558   0.244538
As mentioned earlier, experiments were conducted with
eleven threshold-based feature rankers (FM, OR, Pow, PR,
GI, MI, KS, Dev, GM, AUC, and PRC). Four datasets were
used in these experiments. These datasets are relatively
clean to avoid validity problems caused by injecting noise
into a dataset that already has noise. We investigated three
scenarios to assess the robustness of feature rankers under
different circumstances and for different feature subset sizes.
In the first scenario, sampling takes place on the clean (original) datasets, and the sampling is performed 30 times on each dataset. In the second scenario, noise is injected into each dataset and no sampling is performed. The third scenario is similar to scenario two, except that it also involves sampling, with sampling performed after noise injection. Given that the noise injection process is performed 30 times for each noise level, sampling is only performed once per noisy dataset. For all of these scenarios, the assessment is based on the degree of agreement between a ranker's output on the original datasets and its output on the modified datasets, which have had noise injected into them, have had some instances removed due to sampling, or both.
We used the average of the consistency index Ic over all runs to evaluate the stability of the feature rankers. In the experiments, we used eight feature subset sizes for each dataset (10, 14, and 25 attributes, and 0.25%, 0.5%, 1%, 2%, and 5% of the total number of attributes). Preliminary experiments conducted on the corresponding datasets show that these sizes are appropriate. As we have four datasets, nine levels of noise, eleven feature rankers, and 30 repetitions, we repeat the experiment 11,880 (4 × 9 × 11 × 30) times for each of scenarios two and three, and 1,320 (4 × 11 × 30) times for scenario one. Only the average results of the 30 repetitions are presented in the tables. Further discussion of the breakdown based on the different noise injection patterns could not be included due to space considerations.
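The table entries can be read as the outcome of the following aggregation, sketched here with a hypothetical stand-in ranker (rank_features below is not one of the eleven TBFS metrics): for each ranker and subset size, average the consistency index between the original top-k feature list and the top-k list obtained from each of the 30 modified (sampled and/or noise-injected) copies of the data.

```python
import numpy as np

def consistency_index(a, b, n):
    """(d*n - k^2) / (k*(n - k)) for two equal-size subsets of n features."""
    a, b = set(a), set(b)
    k, d = len(a), len(a & set(b))
    return (d * n - k ** 2) / (k * (n - k))

def rank_features(X, y, k):
    """Hypothetical stand-in for a feature ranker: top-k features by the absolute
    correlation between each feature (column) and the class label."""
    scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1][:k])

def average_stability(X, y, modified_versions, k):
    """Mean consistency index between the original top-k list and the top-k
    list from each modified copy of the data (sampled, noise-injected, or both)."""
    n = X.shape[1]
    baseline = rank_features(X, y, k)
    scores = [consistency_index(baseline, rank_features(Xm, ym, k), n)
              for Xm, ym in modified_versions]
    return float(np.mean(scores))

# Toy usage: 30 "modified" copies made by subsampling rows of a synthetic dataset.
rng = np.random.default_rng(5)
X = rng.random((120, 50))
y = rng.integers(0, 2, 120)
mods = []
for _ in range(30):
    rows = rng.choice(120, 90, replace=False)
    mods.append((X[rows], y[rows]))
print(average_stability(X, y, mods, k=10))
```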
Tables 2 through 4 represent the average Ic value for each
scenario for each feature ranker for every subset size, across
all nine levels of injected noise (scenario two and scenario
three). We also present (1) the average performance (last
column of the tables) of each of the feature rankers for each
scenario over the four datasets, and (2) the average performance (last row of each section of the tables) of each subset
size over the eleven feature rankers and for each scenario. In
all tables “Attributes” is abbreviated as “Att” for space considerations, and bold values represent the best performance
for that combination of feature ranker and subset size for
Filter
FM
OR
Pow
PR
GI
MI
KS
Dev
GM
AUC
PRC
Avg
10 Att
0.061783
0.159728
0.182339
0.137860
0.149813
0.241085
0.207633
0.238860
0.167047
0.165937
0.193272
0.173214
14 Att
0.072486
0.172727
0.201794
0.151938
0.159948
0.274088
0.235553
0.272696
0.189605
0.193647
0.219468
0.194905
25 Att
0.091967
0.192903
0.224819
0.169455
0.175425
0.314981
0.281248
0.313904
0.233787
0.242103
0.276613
0.228837
0.25% Att
0.100093
0.196442
0.229973
0.164765
0.172535
0.326698
0.300766
0.322458
0.258900
0.264428
0.293711
0.239161
0.5% Att
0.119167
0.204697
0.247015
0.155742
0.156996
0.337231
0.327553
0.335287
0.294405
0.304828
0.335506
0.256221
1% Att
0.138073
0.198318
0.264133
0.157956
0.150343
0.343192
0.343155
0.331503
0.319431
0.334212
0.354537
0.266805
2% Att
0.159457
0.202956
0.284858
0.169800
0.151347
0.348091
0.364082
0.339854
0.347838
0.362618
0.377724
0.282602
5% Att
0.190835
0.229403
0.325600
0.199801
0.163191
0.373361
0.395971
0.366484
0.388530
0.410338
0.416624
0.314558
Avg
0.116733
0.194647
0.245066
0.163415
0.159950
0.319841
0.306995
0.315131
0.274943
0.284764
0.308432
0.244538
each scenario.
The results demonstrate that while there was no clear winner among the eleven filters, Gini Index, Probability Ratio, and Odds Ratio show the worst stability on average across all scenarios. In scenario one (sampling without noise injection), AUC shows the best stability on average, followed closely by KS and GM. In the other scenarios, AUC is closer to the middle of the pack; MI performs best, followed closely by Dev and KS. This indicates that these three feature rankers (MI, Dev, and KS) are less sensitive to class noise and are good choices for stable feature selection.
The results also show that the size of the subset of selected features can influence the stability of a feature ranking technique. All feature rankers show more stable behavior as the feature subset size is increased when class noise is
present (scenarios two and three). However, without injected class noise (scenario one), many feature rankers have
an internal optimum for feature subset size, and increasing
it beyond that will reduce performance. The exact location
of this optimum varies from 25 attributes to 1% of the total original attributes, although some rankers (including KS,
Dev, GM, AUC, and PRC) show consistent improvement as
subset size increases. In addition, when looking across all scenarios, scenario three shows the worst stability, which demonstrates that sampling does not improve the stability of feature selection techniques when noise is present. The only exception is GI with small subset sizes, for which random undersampling improved stability in the presence of class noise; nevertheless, GI remains among the worst-performing feature rankers.
Conclusion
To the best of our knowledge, this is the first study to investigate the stability of threshold-based feature selection techniques in the presence of class noise and data sampling. We conducted a stability analysis of eleven threshold-based feature selection techniques on four real-world gene expression datasets, injecting noise into these datasets to better simulate real-world conditions. We investigated three scenarios (sampling only, noise injection only, and noise injection followed by sampling) to assess the impact of data sampling and class noise on stability. The experimental results demonstrate that GI performed worst among the eleven feature rankers across all scenarios. Furthermore, in the presence of class noise, the three best overall filters are MI, Dev, and KS. Results also show that trying to balance datasets through data sampling has, on average, a negative impact on the stability of the feature ranking techniques applied to those datasets. In addition, although for noisy data larger feature subset sizes are almost always better, the same cannot be said for clean data, which often shows an internal optimum past which larger subset sizes hurt stability.
Future research may involve conducting more experiments, using other feature selection techniques and other data sampling balance levels (e.g., 65:35), examining more datasets from other application domains, and considering other feature subset sizes.
References
Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; and Kegelmeyer,
W. P. 2002. Smote: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16:321–
357.
Dunne, K.; Cunningham, P.; and Azuaje, F. 2002. Solutions
to Instability Problems with Sequential Wrapper-Based Approaches To Feature Selection. Technical Report TCD-CD2002-28, Department of Computer Science, Trinity College,
Dublin, Ireland.
Fayyad, U. M., and Irani, K. B. 1992. On the handling
of continuous-valued attributes in decision tree generation.
Mach. Learn. 8:87–102.
Guo, X.; Yin, Y.; Dong, C.; Yang, G.; and Zhou, G. 2008.
On the class imbalance problem. In Fourth International
Conference on Natural Computation, 2008. ICNC ’08., volume 4, 192–201.
Hall, M. A., and Holmes, G. 2003. Benchmarking attribute selection techniques for discrete class data mining.
IEEE Transactions on Knowledge and Data Engineering
15(6):1437 – 1447.
Kononenko, I. 1994. Estimating attributes: analysis and
extensions of relief. In Proceedings of the European conference on machine learning on Machine Learning, 171–182.
Secaucus, NJ, USA: Springer-Verlag New York, Inc.
Kotsiantis, S.; Kanellopoulos, D.; and Pintelas, P. 2006.
Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering
30(1):25–36.
Kuncheva, L. I. 2007. A stability index for feature selection. In Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications, 390–395. Anaheim, CA, USA: ACTA Press.
Křížek, P.; Kittler, J.; and Hlaváč, V. 2007. Improving stability of feature selection methods. In Proceedings of the 12th International Conference on Computer Analysis of Images and Patterns, CAIP'07, 929–936. Berlin, Heidelberg: Springer-Verlag.
Liu, H., and Yu, L. 2005. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 17(4):491–502.
Loscalzo, S.; Yu, L.; and Ding, C. 2009. Consensus group stable feature selection. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 567–576. New York, NY, USA: ACM.
Petricoin, E. F.; Ardekani, A. M.; Hitt, B. A.; Levine, P. J.; Fusaro, V. A.; Steinberg, S. M.; Mills, G. B.; Simone, C.; Fishman, D. A.; Kohn, E. C.; and Liotta, L. A. 2002. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359(9306):572–577.
Saeys, Y.; Abeel, T.; and Peer, Y. 2008. Robust feature selection using ensemble feature selection techniques. In ECML PKDD '08: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases - Part II, 313–325. Berlin, Heidelberg: Springer-Verlag.
Seiffert, C.; Khoshgoftaar, T.; and Van Hulse, J. 2009. Improving software-quality predictions with data sampling and boosting. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 39(6):1283–1294.
Van Hulse, J., and Khoshgoftaar, T. M. 2009. Knowledge discovery from imbalanced and noisy data. Data & Knowledge Engineering 68(12):1513–1542.
Wang, X., and Gotoh, O. 2009. Accurate molecular classification of cancer using simple rules. BMC Medical Genomics 2(1):64.
Wang, H., and Khoshgoftaar, T. M. 2011. Measuring stability of threshold-based feature selection techniques. In 23rd IEEE International Conference on Tools with Artificial Intelligence, 986–993.
Wang, H.; Khoshgoftaar, T. M.; and Van Hulse, J. 2010. A comparative study of threshold-based feature selection techniques. In 2010 IEEE International Conference on Granular Computing, 499–504.
Witten, I. H., and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2nd edition.