Asian Journal of Control, Vol. 6, No. 3, pp. 439-446, September 2004
-Brief Paper-
RANDOM APPROXIMATED GREEDY SEARCH
FOR FEATURE SUBSET SELECTION
Feng Gao and Yu-Chi Ho
ABSTRACT
We propose a sequential approach called Random Approximated Greedy Search (RAGS) in this paper and apply it to feature subset selection for regression. It is an extension of the GRASP/Super-heuristics approach to complex stochastic combinatorial optimization problems in which performance estimation is very expensive. The key points of RAGS come from the methodology of Ordinal Optimization (OO): we soften the goal and define success as "good enough" rather than necessarily optimal. In this way we can use a cruder estimation model and treat the performance estimation error as randomness, so that it directly provides the random perturbations mandated by the GRASP/Super-heuristics approach while saving a great deal of computational effort. Through multiple independent runs of RAGS, we show that it obtains better solutions than standard greedy search under comparable computational effort.
Keywords: Feature subset selection, ordinal optimization, greedy search, stochastic combinatorial optimization.
I. INTRODUCTION
A common challenge in regression analysis and pattern classification is selecting the best feature subset from a set of predictor variables in terms of some specified criterion. In machine learning this is called the feature subset selection problem.
Feature subset selection is a typical combinatorial optimization problem. If a selected feature is denoted by "1" and an unselected feature by "0", a 0-1 vector can be used to denote a selected feature subset. It is called the feature select vector, and its dimension equals the number of features. So the target of feature subset selection is to find the best feature select vector.

Manuscript received May 30, 2002; revised September 12, 2002 and April 7, 2003; accepted October 23, 2003.
Feng Gao is with the Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, China.
Yu-Chi Ho is with the Division of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, U.S.A.
The work reported in this paper is sponsored in part by grants from the U.S. Army Research Office (contract DAAD19-01-1-0610) and the U.S. Air Force Office of Scientific Research (contract F49620-01-1-0288). The work of the first author is also supported in part by the National Outstanding Young Investigator Grant (6970025), the Key Project of the National Natural Science Foundation (59937150), the project of the National Natural Science Foundation (60274054), and the 863 High Tech Development Plan (2001AA413910) of China.
In this paper we commit to the "wrapper approach" [1], in which the feature subset selection algorithm searches for a good select vector while simultaneously optimizing the parameters of the performance evaluation model for each corresponding selected subset. This makes the feature subset selection procedure very time-consuming when the performance evaluation process is expensive. That is the situation when we need to build a complex prediction model, where the generalization error must be estimated as the performance of a given feature subset. The commonly used numerical method to estimate it is cross validation [2], in which the generalization error is approximated by the average test performance on different test sub-data sets. Therefore, we believe that feature subset selection becomes a large-scale, computationally intensive stochastic combinatorial optimization problem when the number of candidate variables is very large,
for example, several hundred.
Since no algorithm exists that can find the globally optimal feature subset efficiently, greedy search approaches are the traditional and widely used heuristic methods. They find locally optimal solutions, and they are the simplest. But for complex feature subset selection, which we discuss in the following sections, even such simple greedy search sometimes becomes time-consuming. Hence, in machine learning there is another kind of feature selection approach, called the filter approach. Filter methods do not take into account the biases of the induction algorithms and select feature subsets independently of them. They are much more computationally efficient, but their main disadvantage is that they totally ignore the effect of the selected feature subset on the performance of the induction algorithm [1].
The motivation for the research in this paper is to make the computation of the wrapper approach feasible for complex feature subset selection problems, i.e., to obtain good results under a reasonable limit on computational power.

Most early research on the problem dealt with the linear case [3], and up to now most of it has addressed classification problems [1,4]; only a few studies consider general regression problems [5-7]. In this paper we focus on feature selection for regression problems.
1.1 The content of this paper
In combinatorial optimization applications, the greedy approach is only a local search method, and considerable effort has been made to improve its results. Greedy Randomized Adaptive Search Procedures (GRASP) [8] and the Super-heuristics method [9] propose similar ideas for this purpose: by introducing randomness into the greedy search procedure, one can expect to escape from local optima and have a chance to obtain better results.
Furthermore, the Super-heuristics method is inherited from the Ordinal Optimization (OO) [10,11] methodology for stochastic optimization. OO proposes two ideas that are very important in complex optimization problems: (1) order comparison is much easier than estimating performance differences accurately; (2) finding a good-enough solution set is much easier than finding the best solution.
In problems like the large-scale complex feature subset selection problem, it is almost impossible to find the globally optimal feature subset. So our target in this paper is only to find a set of good-enough feature subsets under a reasonable limit on computational power. According to OO, when we soften our target to finding good-enough solutions, we do not always need to calculate accurate performance values; sometimes a crude, ordinal comparison is enough, and the computational burden can thus be lessened.
Keeping this idea of OO in mind, when we reconsider the basic idea of GRASP/Super-heuristics, it is very interesting that the two can be combined to fruitful effect. That is, if we treat the approximation error of a crude model of feature subset performance directly as the random perturbation required by the GRASP/Super-heuristics methods, we can bypass the need for accurate performance estimation and obtain the improvement effect of the GRASP/Super-heuristics approach at the same time (literally, we kill two birds with one stone).
In summary, in this paper we propose a new approach that combines GRASP/Super-heuristics with the ordinal optimization methodology to improve greedy search. We name it Random Approximated Greedy Search (RAGS). GRASP/Super-heuristics is a remedy for escaping the local optima of greedy search, and OO is a remedy for decreasing the computational burden. By combining them, the new RAGS method can better improve the result of the greedy search approach in stochastic situations, and it can find a good solution with reasonably limited computational power.
The rest of the paper is organized as follows. In Section 2 we introduce general greedy search and its improvement by GRASP/Super-heuristics. In Section 3 we discuss the basic idea of this paper, propose the RAGS method, and discuss how to apply RAGS to the feature subset selection problem. In Section 4 we show the experimental results of our method. Finally, we conclude the paper in Section 5.
II. GREEDY SEARCH AND GRASP/SUPER-HEURISTICS IMPROVEMENT
In general, greedy search is in the same spirit as steepest ascent in continuous optimization. It is very easy to implement, and it is the simplest method for finding a local optimum.
2.1 Greedy search in feature subset selection
Currently, the most commonly used greedy search algorithms [1] for feature subset selection are forward greedy search, forward stepwise regression, backward elimination, and sequential replacement. We focus on forward greedy search in this paper: we improve it with the RAGS approach and compare the results with and without our approach. It should be clear, however, that our idea for improving forward greedy search can be applied in the same way to the other greedy search algorithms.
Forward greedy search works sequentially. Usually it begins with the empty feature subset and, at each stage, chooses one feature from the unselected feature set and puts it into the selected feature subset.
At the $i$-th stage of forward greedy search, let $s_{i-1}$ denote the feature select vector of the previous stage. Then $s_{i-1}$ corresponds to a feature subset with $i-1$ features; that is, it has $i-1$ elements with value 1, and the other elements are 0. These $i-1$ "1"s indicate the features that have been selected. Let $I_0(s_{i-1})$ be the index set of the elements with value 0, indicating the unselected features, and let $e_j$ be the unit vector whose $j$-th element is 1. Then the target of the $i$-th stage is to find $j^* \in I_0(s_{i-1})$ such that

$$j^* = \arg\min_{j \in I_0(s_{i-1})} J(s_{i-1} + e_j). \tag{1}$$

The feature select vector obtained at the $i$-th stage is $s_i = s_{i-1} + e_{j^*}$.
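For concreteness, the following is a minimal Python sketch of this forward search, assuming a user-supplied criterion function J that evaluates a candidate select vector (the function name and interface are our illustration, not part of the original method):

    import numpy as np

    def forward_greedy(J, n_features, n_stages):
        # Plain forward greedy search: at each stage add the single
        # feature that minimizes the criterion J, as in Eq. (1).
        s = np.zeros(n_features, dtype=int)       # s_0 = 0: empty subset
        for _ in range(n_stages):
            unselected = np.flatnonzero(s == 0)   # index set I_0(s_{i-1})
            scores = []
            for j in unselected:                  # evaluate J(s_{i-1} + e_j)
                e_j = np.zeros(n_features, dtype=int)
                e_j[j] = 1
                scores.append(J(s + e_j))
            j_star = unselected[int(np.argmin(scores))]
            s[j_star] = 1                         # s_i = s_{i-1} + e_{j*}
        return s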
Generally, forward greedy search, like other greedy search algorithms, is a sequential decision-making procedure. It is efficient and intuitive, but it is near-sighted, because the best solution at each stage is selected only from a sub-domain. Moreover, when performance estimation is expensive, the computational burden of conducting greedy search strictly becomes unendurable. This is exactly the situation in large-scale complex feature subset selection. Thus the questions are: do we really need to distinguish which feature is best if doing so costs a great deal of computational power? What happens if we soften our target and only find one of the good features in the forward procedure?

The OO methodology tells us that such target softening is often effective in complex optimization applications. In this paper, we show that this idea is also suitable for greedy search in feature subset selection.
2.2 GRASP/Super-heuristics improvement
GRASP and Super-heuristics are two methods based on the same idea: they try to improve the result of heuristic greedy search by adding a sequence of small random perturbations to the greedy search procedure.
GRASP is an iterative process in which each iteration consists of two phases: a construction phase and a local search phase. In the construction phase, a feasible solution is iteratively constructed, one element at a time. At each construction iteration, the choice of the next element to be added is determined by ordering all elements in a candidate list with respect to a performance function. The probabilistic component of GRASP comes from randomly choosing one of the best candidates in the list, not necessarily the top candidate. This choice technique allows different solutions to be obtained at each GRASP iteration. The local search phase is then applied in an attempt to improve each constructed solution to a local optimum.
This approach is successful in many deterministic combinatorial optimization applications. The basic idea behind it is this: if the heuristic of greedy search is not optimal, there may exist a small probability of improving it by random perturbation, and such a small probability becomes a large probability under multiple independent repetitions.
In simple words, the application of GRASP to forward greedy search for feature subset selection is this: at the $i$-th stage, let the feature select vector be $s_i = s_{i-1} + e_j$, where $j$ is determined by

$$j = \arg\big(\text{one of the top minima of } J(s_{i-1} + e_\ell)\big), \quad \ell \in I_0(s_{i-1}). \tag{2}$$

Here $j$ is picked according to a given selection probability, e.g., one of the top-5 best with equal probability.
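The only change from the strict greedy stage is this selection rule. A small sketch, with the top-K uniform choice of Eq. (2) made explicit (the helper name is ours; candidates is an array of feature indices):

    import numpy as np

    def grasp_select(scores, candidates, K=5, rng=None):
        # Pick one of the top-K candidates (lowest scores) uniformly at
        # random instead of always taking the single best, as in Eq. (2).
        rng = rng or np.random.default_rng()
        order = np.argsort(scores)            # best (smallest J) first
        top = order[:min(K, len(order))]      # restricted candidate list
        return candidates[rng.choice(top)]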
The basic assumption of GRASP/Super-heuristics is that the computational burden of performance estimation is very small, so one can try out many random perturbations, e.g., several hundred, in exchange for one opportunity to improve the final solution. But in our problem, where the performance is not easy to estimate, this approach would be too time-consuming to apply. In this paper, we try to resolve this infeasibility in the spirit of OO.
III. RANDOM APPROXIMATED
GREEDY SEARCH
As shown in Fig. 1, when we examine the characteristics of a search stage in a GRASP/Super-heuristics implementation in a stochastic situation, we find that one needs to estimate the performance as accurately as possible and determine which design is the best, the second best, the third best, and so on. Then, instead of choosing the best one, one randomly selects one of the good designs. That means the effort spent distinguishing the best, the second, and the third is wasted in some sense. Therefore we ask: can we use a crude estimation to achieve the same or a similar purpose directly?
3.1 Remedies from ordinal optimization
It turns out that the theory of Ordinal Optimization (OO) provides an answer. OO attempts to separate the "good enough" from the "bad", say the top-x% of the possible solutions from the bad ones. The OO theory says that one can do this with high probability using only crude models of the problem [10]. This considerably simplifies the computational burden. Secondly, since the crude model only guarantees that we locate a "good enough" solution with high probability, i.e., not necessarily the best or the second best, etc., but only one of the top-x solutions, we automatically achieve the random perturbations mandated by the GRASP/Super-heuristics approach. Thus we essentially kill two birds with one stone.

[Fig. 1. The basic idea to improve GRASP/Super-heuristics: GRASP obtains an accurate performance by average estimation and then selects one of the good designs; the proposed idea is to run greedy search on a crude performance estimation and "select the best" directly, which likewise yields a good design.]
3.2 Random approximated greedy search algorithm
From this understanding, we obtain a new approach to improve greedy search in complex stochastic situations. We name it "Random Approximated Greedy Search (RAGS)."
Algorithm RAGS(s0, Φ, K, N)
  For i = 1 to N
    Get the extension sub-domain Φ(si−1);
    Evaluate the crude performance g(sj) for each sj ∈ Φ(si−1);
    Sort all sj according to g(sj);
    Choose k randomly from [1, K];
    Select the rank-k solution among all sj and let it be si;
  End for
  Return sN

Here s0 is the initial solution, Φ(·) is the extension sub-domain mapping, K is the random perturbation scale, and N is the number of forward stages.
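A runnable Python rendering of this loop, assuming a user-supplied crude evaluator g (lower is better) and extension mapping phi; these names are our illustration:

    import numpy as np

    def rags(s0, phi, g, K, N, rng=None):
        # Random Approximated Greedy Search: at each stage, rank the
        # extension candidates by their *crude* performance estimates
        # and accept the rank-k candidate for a random k in [1, K].
        rng = rng or np.random.default_rng()
        s = s0
        for _ in range(N):
            candidates = phi(s)                 # extension sub-domain of s_{i-1}
            scores = [g(c) for c in candidates] # crude estimates only
            order = np.argsort(scores)          # sort by crude performance
            k = int(rng.integers(1, K + 1))     # k drawn uniformly from [1, K]
            s = candidates[order[min(k, len(order)) - 1]]  # rank-k candidate
        return s

For feature subset selection, phi(s) would return the select vectors s + e_j for all unselected j, and g would be an M-fold cross-validation error, as discussed below.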
The main benefit of RAGS is that it provides an opportunity to improve a greedy search algorithm without increasing the computational burden, as the experimental results below will show. How much benefit can be obtained, and under what conditions? These further questions need to be answered, so more analysis of the mechanism of RAGS remains to be done; this will be the authors' future work.
When we apply it to forward greedy search for feature subset selection, we obtain a new feature subset selection algorithm, as shown in Fig. 2.
In the step that determines the rank-k feature subset, we first use crude performance estimation to introduce random perturbation, and then select the rank-k subset randomly to enhance the randomness. The crude performance estimation can be obtained by M-fold cross validation, where the estimate is the average of the test errors on M data subsets. Furthermore, M can be used to control the scale of the randomness: if M is smaller, the randomness is larger because fewer test error samples are used; if M is bigger, the randomness is smaller. But M is limited by the number of samples in a given data set. An alternative approach for crude estimation is bootstrapping.
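A sketch of such a crude M-fold cross-validation estimator for a candidate select vector, using an ordinary least-squares fit on the selected columns (plain NumPy; a smaller M gives a cheaper but noisier estimate):

    import numpy as np

    def crude_cv_mse(X, y, select, M=4, rng=None):
        # Average test MSE of linear regression over M folds, using
        # only the feature columns where the select vector is 1.
        rng = rng or np.random.default_rng(0)
        cols = np.flatnonzero(select)
        folds = np.array_split(rng.permutation(len(y)), M)
        errs = []
        for m in range(M):
            test = folds[m]
            train = np.hstack([folds[i] for i in range(M) if i != m])
            A = np.column_stack([np.ones(len(train)), X[np.ix_(train, cols)]])
            w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
            B = np.column_stack([np.ones(len(test)), X[np.ix_(test, cols)]])
            errs.append(np.mean((B @ w - y[test]) ** 2))
        return float(np.mean(errs))             # crude generalization estimate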
IV. EXPERIMENTAL RESULTS
4.1 Problem 1
Let us consider an artificially constructed feature subset selection problem¹ to explain the idea of RAGS more clearly; here we can obtain enough test data to validate the goodness of the resulting feature subsets.

This is a linear regression problem. There are 100 features in total, as described in Table 1. They belong to three classes: basic features, derived features, and random features. The real response value (i.e., the model) is a linear combination of the 20 basic features, with coefficients generated randomly from a normal distribution with mean 0.5 and standard deviation one. The observed response value is the real response value plus i.i.d. normally distributed noise. Owing to the derived features and the random noise, it is hard to distinguish the basic features from the others; however, some derived features can provide more useful information than single basic features. So our target in this experiment is to find effective feature subsets within a 25-feature limit; that is, we conduct 25 stages of forward greedy search. In the experimental runs below, we find that some irrelevant random features are chosen in the last several stages, which is evidence that enough features have been included; thus we consider 25 features a reasonable choice for this problem.

¹ The Matlab program to generate the data set can be downloaded from http://www.sei.xjtu.edu.cn/seien/staff.files/fgao/linearfs.html.
Initialize:
  i = 1, K = 4.
  Set the initial feature select vector s0 = 0.
  Set the feature-in threshold ratio α = 1.02.
i-th forward stage:
  Determine k randomly in [1, K].
  Construct the current forward subset population: for each feature in the unselected set I0(si−1), combine it with the current feature vector si−1 to form a new feature subset design.
  Determine the rank-k feature subset: calculate the crude estimated performance of each feature subset, find the rank-k good feature subset, and let the stage performance perfi be the crude estimated performance of this rank-k subset.
  If the stage performance perfi is improved at least at rate α compared with perfi−1, accept the rank-k extension and set i = i + 1; else, if perfi is improved at least at rate α compared with perfi−2, accept the rank-k extension and set i = i + 1; otherwise, reject the new extension of the selected feature set, and the algorithm terminates.

Fig. 2. Flow chart of RAGS for feature subset selection.

Table 1. Problem definition of experiment 1.

Feature No. | Type             | Description
1~20        | Basic features   | #1~4: i.i.d. uniform/normal random numbers; #5~12: squares/square roots of features #1-4; #13~18: cross-products of features #1-4; #19~20: uniform/normal random numbers.
21~93       | Derived features | #21~38: cross-products of features #5-12; #39~68: random linear combinations of #1-38 with i.i.d. noise; #69~93: random linear combinations of #1-38 and the real response with i.i.d. noise.
94~100      | Random features  | #94~95: independent random numbers with a little relevance to the observed response; #96~100: irrelevant independent random numbers.
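A minimal sketch of the stage-acceptance test in the flow chart above, under our reading of Fig. 2 (performance is an error to be minimized, so "improved at rate α" is taken as the ratio of reference error to new error reaching α; this interpretation is an assumption):

    def accept_stage(perf_prev, perf_new, alpha=1.02):
        # Feature-in test from Fig. 2: the new stage error must improve
        # on the reference error by at least the threshold ratio alpha.
        return perf_prev / perf_new >= alpha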
In this experiment, we generate two data sets:
1. Working data set: a small data set with 200 samples. The feature subset selection algorithm is run on this data set to obtain the resulting feature subsets.
2. Testing data set: a large data set with 100,000 samples, used to test the generalization error of a feature subset. The first half of the samples is used to calculate the regression coefficients for a given feature subset (i.e., training with the selected features), and the generalization performance is then calculated on the other half.
The experiment was performed using MATLAB. Linear regression is used as the inner modeling method in the wrapper approach, and the MSE (mean square error) is used as the performance criterion. We use leave-one-out cross validation (LOOCV) as the "best" performance estimate, because this is the best way to use the working data samples sufficiently, and we treat K-fold CV as its approximation. The whole experiment contains 4 steps:
1. On the working data set, we execute the forward greedy search feature selection procedure with the LOOCV criterion and obtain a feature subset FSLOO. This feature subset is treated as the baseline for comparison.
2. On the working data set, we execute the RAGS-revised forward greedy search. We compare two precision criteria, 10-fold CV and 4-fold CV, where 10-fold CV is treated as a better approximation to LOOCV than 4-fold CV, and we compare the cases with and without random perturbation. We thus obtain four groups of feature subsets by independent repeated runs, as shown in Table 2; the group sizes are controlled so that the same computational cost is spent, to give a fair comparison.
3. The four groups of feature subsets are assessed by LOOCV, and only the top 25 feature subsets in each group are retained as the final result.
4. The four groups of top-25 feature subsets, together with feature subset FSLOO from step 1, are assessed on the test data set to compare their generalization errors.
The experimental results are shown in Fig. 3 and Table 2. In Fig. 3, a box-and-whisker plot produced by the MATLAB function "boxplot" displays the performance of the top-25 feature subsets in each group: the notched box has lines at the lower quartile, median, and upper quartile of the performance values; the notches represent a robust estimate of the uncertainty about the medians; the whiskers are lines extending from each end of the box to show the extent of the rest of the performance data; and outliers, if any, are marked with plus signs beyond the ends of the whiskers. The resulting analysis is as follows:
1. The feature similarity between feature subset FSLOO and the four final groups of feature subsets is significant. The counts of feature subsets that contain at least 80% of the features of FSLOO are 15, 11, 14, and 8 in the four resulting groups, respectively, and the total numbers of distinct features appearing in the four groups are 63, 78, 69, and 82, respectively. The frequency with which each feature appears in the group with the 4-fold CV criterion and top-4 random perturbation is shown in Fig. 3(b). So RAGS extends the search scope to try more different candidates around FSLOO; that is the basic idea of RAGS.
2. From Fig. 3(a), it can be seen that the performance diversity increases when cruder estimation is used and random perturbation is added; that is the result of a broader search scope. Moreover, the median performances of the four groups are better than that of feature subset FSLOO from step 1. This shows that an improvement over the "best"-performance forward greedy search procedure can indeed be obtained by random sampling around it through cruder performance estimation and random perturbation, which is just what we hope for from RAGS. In this experiment, the group with the cruder CV criterion plus random perturbation, which tries the broadest search scope, has the best mean performance and contains the best feature subset under the same computational burden.
3. The MSE of the feature subset consisting of the 20 basic features on the test data set is 1.0562, much better than all the feature subsets found. However, its LOOCV performance estimate on the working data set is only 1.0009, while feature subset FSLOO from step 1 has the value 0.7855; so it is neglected because it is not attractive enough from the viewpoint of the LOOCV criterion.
In summary, there really exist better solutions in the neighborhood of the "best"-performance-estimation greedy search solution in this problem, and RAGS really found some of these better solutions under a comparable time limit and computational load.
Table 2. The results of experiment 1.

Group No.                                                  | #1      | #2            | #3            | #4            | #5
Performance assessing criterion                            | LOOCV   | 10-fold CV    | 10-fold CV    | 4-fold CV     | 4-fold CV
Random perturbation                                        | No      | No            | within top-4  | No            | within top-4
Group size                                                 | 1       | 200           | 200           | 500           | 500
Regressions needed to assess each candidate feature subset | 200     | 10            | 10            | 4             | 4
Total computation burden (regressions in 25 forward stages)| 940,000 | 9,400,000     | 9,400,000     | 9,400,000     | 9,400,000
MSE of LOOCV criterion on working set (mean/std)           | 0.7855  | 0.7861/0.0061 | 0.8128/0.0071 | 0.8015/0.0049 | 0.8230/0.0123
MSE of resulting feature subsets on testing set (mean/std) | 1.4443  | 1.3970/0.1037 | 1.3561/0.1391 | 1.3680/0.1024 | 1.3529/0.1416
[Fig. 3. Comparison results in experiment 1. (a) Comparison of test performance: box plots of the MSE on the test data set for the groups LOO, 10-fold, 10-fold & top-4, 4-fold, and 4-fold & top-4. (b) Feature distribution in the resulting feature subset group with the 4-fold CV criterion and top-4 random perturbation: frequency of each of the 100 features among the 25 resulting feature subsets.]
4.2 Problem 2
This problem is Problem 8 from the NIPS2000 Unlabeled Data Supervised Learning Competition². It involves predicting the bioreactivity of pharmaceutical compounds based on properties of the compounds' molecular structure. The inputs are 787 real-valued features that are a mixture of various types of descriptors for the molecules, based on their one-dimensional, two-dimensional, and electron density properties. The training set contains 147 examples; the test set contains 50 examples. With so many features and so few samples, feature selection is very important.

² The data set can be found at http://q.cis.uoguelph.ca/~skremer/Research/NIPS2000/ with further details.
The best result of the competition is reported in [6]. They used polynomial modeling to find a suitable feature subset; based on this feature subset, their best reported forecasting results on the test set were obtained by an NN model trained on the training set. They used polynomials up to degree 3 to build 2325 monomials, omitting cross-products. The best feature subset they found is: $x_{172}$, $x_{36}$, $x_{159}$, $x_{172}^2$, $x_{222}^3$, $x_{565}^3$.
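A sketch of how such a monomial feature expansion can be built: per-feature powers up to degree 3 with no cross-products (the exact count of 2325 depends on which base features [6] retained, which we do not reproduce):

    import numpy as np

    def monomial_features(X, max_degree=3):
        # Expand each column of X into the powers x, x^2, ..., x^max_degree,
        # omitting cross-products between different features.
        return np.hstack([X ** d for d in range(1, max_degree + 1)])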
We test our RAGS approach on these 2325 monomial features too. The linear regression model is used, and 3-fold CV is used to estimate the crude performance of a feature subset on the training set. The forward greedy procedure extends to 10 features while the performance decreases continuously; we stop at 10 features, and then backward greedy search is executed to check whether any features can be removed while keeping the performance. We obtained 100 feature subsets by independent runs. The features appearing in more than 10% of the feature subsets are: ($x_{36}$, 100), ($x_{172}^3$, 72), ($x_{222}^3$, 54), ($x_{172}$, 45), ($x_{159}$, 35), ($x_{26}$, 27), ($x_{779}^3$, 26), ($x_{173}$, 21), ($x_{594}^3$, 20), ($x_{172}^2$, 18), ($x_{326}$, 18), ($x_{565}^3$, 13), ($x_{424}^2$, 12), ($x_{261}$, 12), ($x_{446}^3$, 11), ($x_{260}$, 11), where the number in each bracket is the frequency with which the feature appears. It can be seen that the 6 features in the final feature subset of [6] are all in this list.
Then LOOCV is used to estimate the accurate performance of each feature subset on the training set, and the final feature subsets are selected according to LOOCV performance. When we tested the 100 feature subsets on the test data set, we found that the test errors are not in total alignment with the LOOCV performances, and there are some obvious outliers. The test result of the top 25 feature subsets according to LOOCV performance is shown in Fig. 4; the dots on each vertical line indicate the predicted values for a given point in the test set by the models based on the top 25 feature subsets. There exists a feature subset with performance MSEtest = 0.203. This is very good for this test set, but MSEtest cannot be a criterion for selecting features, because the true response values of the test set are supposed to be unknown while selecting features and building the model; so this feature subset cannot be the answer. The plus signs in Fig. 4 are the predictions of the bagging approach, where we use the median instead of the mean to avoid the effect of outliers. We obtained MSEtest = 0.456 from the bagged polynomial models and MSEtest = 0.378 from the bagged NN models; both results are better than those reported in [6]. The results are summarized in Table 3.
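A sketch of the median-based bagging used here: each of the top-25 models predicts every test point, and the per-point median is taken as the bagged prediction (model objects with a predict method are assumed):

    import numpy as np

    def bagged_median_prediction(models, X_test):
        # Aggregate several models' predictions by the per-sample median
        # rather than the mean, to reduce the effect of outlier models.
        preds = np.stack([m.predict(X_test) for m in models])  # (n_models, n_test)
        return np.median(preds, axis=0)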
In summary, the important features are found by RAGS through numerical computation, and the test performance is better.
Table 3. The experimental results of problem 2.

Method            | MSEtrain | MSELOO | MSEtest
Polynomial in [6] | 0.266    | 0.289  | 0.523
NN in [6]         | 0.241    | 0.384  | 0.417
RAGS polynomial   | —        | —      | 0.456
RAGS NN           | —        | —      | 0.378
[Fig. 4. Test result of problem 2: predicted response versus real response on the test set. On each vertical line, the dots indicate the predictions for a given test point by the models based on the top 25 feature subsets, while the plus sign is their bagged (median) prediction.]
4.3 Problem 3
This problem is another drug design problem [7]: molecular Caco-2 permeability. There are 713 features and 27 samples in the given data set. Bi et al. studied feature selection for this problem with the VS-SSVM (Variable Selection via Sparse SVMs) approach they proposed.
We test RAGS on this problem too. The linear regression model is used, and 3-fold CV is used to estimate the crude performance, as in problem 2. The forward greedy search procedure extends to 20 stages, and then the backward greedy search procedure is executed. We obtain 200 feature subsets independently and estimate their LOOCV performances; there are 11.5 features per feature subset on average. Finally, a forward greedy search by LOO performance is also conducted directly. The mean LOOCV Q² of the 200 feature subsets is 0.0854, much smaller than the result of 0.293 in [7].
Following the basic idea of RAGS, we treat LOOCV as the accurate performance estimation here and 3-fold CV as the crude performance estimation. From Fig. 5, it can be seen that good 3-fold CV performance usually does indicate good LOOCV performance. But the best LOOCV Q² among the 200 feature subsets is 0.0037, while the feature subset obtained by LOOCV directly has Q² = 0.0015. That means the accurate greedy search obtains the best solution in this experiment. It seems that RAGS fails to obtain a better result, but this is not a failure of the idea of RAGS: this best solution is probably the global optimum, and RAGS provides some evidence supporting that conclusion. So in cases like this problem, the results of RAGS are still helpful.
[Fig. 5. Ordered performance curves in problem 3: performance of the 200 ordered feature sets. The upper curve is the 3-fold CV performance, and the lower curve is the LOO performance.]
V. CONCLUSION
We have proposed a new approach named Random Approximated Greedy Search (RAGS) to solve stochastic combinatorial optimization problems. It is an extension of the GRASP/Super-heuristics approach to stochastic problems where performance estimation is very expensive.
RAGS can be applied in situations where the objective function of a problem is expressed as the expected value of an evaluation function, the evaluation procedure is very time-consuming, and only solution approaches based on heuristic greedy search algorithms are available.
Instead of spending a great deal of computation to perform the heuristic greedy search strictly by estimating performance accurately, we suggest using a crude estimation combined with some extent of random perturbation in the heuristic greedy search procedure. One can then expect to generate opportunities to find better solutions at comparable computational effort. In fact, RAGS is a biased random sampling in the neighborhood of the strict heuristic greedy search solution.
The complex, large-scale feature subset selection problem is just such a case. In this paper we applied RAGS to feature subset selection and showed its effectiveness through several experiments. In the next step of this research, we will apply RAGS to other real feature subset selection problems, and then try to extend it to other complex stochastic optimization applications.
REFERENCES
1. Kohavi, R. and G. John, "Wrappers for Feature Subset Selection," Artif. Intell. J., Special Issue on Relevance, Vol. 97, No. 1-2, pp. 273-324 (1997).
2. Stone, M., "Cross-Validatory Choice and Assessment of Statistical Predictions," J. R. Statist. Soc. B, Vol. 36, pp. 111-147 (1974).
3. Langley, P., "Selection of Relevant Features in Machine Learning," Proc. AAAI Fall Symp. on Relevance, pp. 140-144 (1994).
4. Guyon, I. and A. Elisseeff, "An Introduction to Variable and Feature Selection," J. Mach. Learn. Res., Vol. 3, pp. 1157-1182 (2003).
5. Miller, A.J., "Selection of Subsets of Regression Variables," J. R. Statist. Soc. A, Vol. 147, Part 3, pp. 389-425 (1984).
6. Rivals, I. and L. Personnaz, "MLPs for Nonlinear Modeling," J. Mach. Learn. Res., Vol. 3, pp. 1383-1398 (2003).
7. Bi, J., K.P. Bennett, et al., "Dimensionality Reduction via Sparse Support Vector Machines," J. Mach. Learn. Res., Vol. 3, pp. 1229-1243 (2003).
8. Feo, T.A. and M.G.C. Resende, "Greedy Randomized Adaptive Search Procedures," J. Glob. Optim., Vol. 6, pp. 109-133 (1995).
9. Lau, E. and Y.-C. Ho, "Super-Heuristics and Its Applications to Combinatorial Problems," Asian J. Contr., Vol. 1, No. 1, pp. 1-13 (1999).
10. Ho, Y.C., "An Explanation of Ordinal Optimization: Soft Optimization for Hard Problems," Inf. Sci., Vol. 113, pp. 169-192 (1999).
11. Shi, L. and C.H. Chen, "A New Algorithm for Stochastic Discrete Resource Allocation Optimization," Discrete Event Dyn. Syst., Vol. 10, pp. 271-294 (2000).