Asian Journal of Control, Vol. 6, No. 3, pp. 439-446, September 2004

-Brief Paper-

RANDOM APPROXIMATED GREEDY SEARCH FOR FEATURE SUBSET SELECTION

Feng Gao and Yu-Chi Ho

ABSTRACT

In this paper we propose a sequential approach called Random Approximated Greedy Search (RAGS) and apply it to feature subset selection for regression. RAGS extends the GRASP/Super-heuristics approach to complex stochastic combinatorial optimization problems in which performance estimation is very expensive. Its key ideas come from the methodology of Ordinal Optimization (OO): we soften the goal and define success as finding good enough, rather than necessarily optimal, solutions. This allows us to use a cruder estimation model and to treat the performance estimation error as randomness, so that the error itself supplies the random perturbations mandated by the GRASP/Super-heuristics approach while saving a great deal of computation. Through multiple independent runs of RAGS, we show that it obtains better solutions than standard greedy search under comparable computational effort.

Keywords: Feature subset selection, ordinal optimization, greedy search, stochastic combinatorial optimization.

Manuscript received May 30, 2002; revised September 12, 2002 and April 7, 2003; accepted October 23, 2003. Feng Gao is with the Systems Engineering Institute, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, China. Yu-Chi Ho is with the Division of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, U.S.A. The work reported in this paper was sponsored in part by grants from the U.S. Army Research Office (contract DAAD19-01-1-0610) and the U.S. Air Force Office of Scientific Research (contract F49620-01-1-0288). The work of the first author was also supported in part by the National Outstanding Young Investigator Grant (6970025), the Key Project of the National Natural Science Foundation (59937150), the National Natural Science Foundation project (60274054), and the 863 High Tech Development Plan (2001AA413910) of China.

I. INTRODUCTION

A common challenge in regression analysis and pattern classification is the selection of the best feature subset from a set of predictor variables in terms of some specified criterion. In machine learning this is called the feature subset selection problem. Feature subset selection is a typical combinatorial optimization problem. If a selected feature is denoted by "1" and an unselected feature by "0", a 0-1 vector can be used to denote a selected feature subset. This vector is called the feature select vector, and its dimension equals the number of features. The target of feature subset selection is therefore to find the best feature select vector. In this paper we commit to the "wrapper approach" [1], in which the feature subset selection algorithm searches for a good select vector while simultaneously optimizing the parameters of the performance evaluation model for each candidate subset. This makes the feature subset selection procedure very time-consuming when the performance evaluation process is expensive. That is the situation when a complex prediction model must be built and the generalization error has to be estimated as the performance of a given feature subset. The most commonly used numerical method for this estimate is cross validation [2]: the generalization error is approximated by the average of the test performance over different test sub-data sets.
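To make this wrapper-style evaluation concrete, the following minimal sketch estimates the cross-validated MSE of one candidate feature subset. It is only an illustration under our own assumptions: the original experiments were carried out in MATLAB, and the names used here (cv_mse, X, y, subset, n_folds) are ours, not the paper's; any inner model and fold count could be substituted.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    def cv_mse(X, y, subset, n_folds=10, seed=0):
        # Estimate the generalization MSE of the features indexed by `subset`
        # by averaging the test error over the folds, as described above.
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
        errors = []
        for train_idx, test_idx in kf.split(X):
            model = LinearRegression().fit(X[train_idx][:, subset], y[train_idx])
            pred = model.predict(X[test_idx][:, subset])
            errors.append(mean_squared_error(y[test_idx], pred))
        return float(np.mean(errors))

Note that each such estimate requires n_folds model fits, which is what makes wrapper-style selection expensive when the inner model or the number of candidate subsets is large.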
Therefore, we believe that feature subset selection is a large-scale, computationally intensive stochastic combinatorial optimization problem when the number of candidate variables is very large, for example several hundred. Since no algorithm can find the globally optimal feature subset efficiently, greedy search approaches are the traditional and most widely used heuristics. They are the simplest methods and can find locally optimal solutions. But for complex feature subset selection, which we discuss in the following subsections, even such simple greedy searches can become time-consuming. In the field of machine learning there is another kind of feature selection approach, called the filter approach. Filter methods do not take the biases of the induction algorithm into account and select feature subsets independently of it. They are much more computationally efficient, but their main disadvantage is that they completely ignore the effect of the selected feature subset on the performance of the induction algorithm [1]. The motivation for this paper is to make the wrapper approach computationally feasible for complex feature subset selection problems, i.e., to obtain good results under a reasonable limitation on computational power. Most of the early research on the problem dealt with the linear case [3], and to date most work addresses classification problems [1,4]; only a few studies consider general regression problems [5-7]. In this paper we focus on feature selection for regression problems.

1.1 The content of this paper

In combinatorial optimization applications the greedy approach is only a local search method, and considerable effort has been made to improve its results. Greedy Randomized Adaptive Search Procedures (GRASP) [8] and the Super-heuristics method [9] propose similar ideas for this purpose: by introducing randomness into the greedy search procedure, one can hope to escape from a local optimum and obtain a better result. Furthermore, the Super-heuristics method inherits from the Ordinal Optimization (OO) methodology for stochastic optimization [10,11]. OO rests on two ideas that are very important in complex optimization problems: (1) comparing the order of solutions is much easier than estimating their performance differences accurately; (2) finding a set of good enough solutions is much easier than finding the best solution. In problems like large-scale complex feature subset selection, it is almost impossible to find the globally optimal feature subset. So our target in this paper is only to find a set of good enough feature subsets under a reasonable limitation of computational power. According to OO, once we soften our target to finding good enough solutions, we do not always need to calculate accurate performance values; sometimes a crude, ordinal comparison is enough, and the computational burden can be lessened. Keeping the idea of OO in mind and reconsidering the basic idea of GRASP/Super-heuristics, it turns out that the two can be combined to fruitful effect.
That is, if we treat the approximation error of a crude model of feature subset performance directly as the random perturbation required by the GRASP/Super-heuristics methods, we can bypass the need for accurate performance estimation and still obtain the improvement effect of the GRASP/Super-heuristics approach (literally, we kill two birds with one stone).

In summary, in this paper we propose a new approach that combines GRASP/Super-heuristics with the ordinal optimization methodology to improve greedy search. We name it Random Approximated Greedy Search (RAGS). GRASP/Super-heuristics is a remedy for escaping the local optimum of greedy search, and OO is a remedy for decreasing the computational burden. By combining them, RAGS can improve the result of the greedy search approach in stochastic situations and can find a good solution with reasonably limited computational power.

The rest of the paper is organized as follows. In Section 2 we introduce general greedy search and its improvement by GRASP/Super-heuristics. In Section 3 we discuss the basic idea of this paper, propose the RAGS method, and describe how to apply RAGS to the feature subset selection problem. In Section 4 we present experimental results, and in Section 5 we conclude the paper.

II. GREEDY SEARCH AND GRASP/SUPER-HEURISTICS IMPROVEMENT

In general, greedy search is in the same spirit as steepest ascent in continuous optimization. It is very easy to implement, and it is the simplest method for finding a local optimum.

2.1 Greedy search in feature subset selection

Currently, the most commonly used greedy search algorithms [1] for feature subset selection are forward greedy search, forward stepwise regression, backward elimination, and sequential replacement. We focus on forward greedy search in this paper: we improve it with the RAGS approach and compare the results with and without our approach. It should be clear, however, that the same idea can be applied to improve the other greedy search algorithms in the same way.

Forward greedy search works sequentially. It usually begins with the empty feature subset and, at each stage, chooses one feature from the unselected feature set and puts it into the selected feature subset. At the i-th stage of forward greedy search, let s_{i-1} denote the feature select vector of the previous stage. Then s_{i-1} corresponds to a feature subset with i-1 features, i.e., it has i-1 elements with value 1 and the remaining elements 0; these i-1 "1"s indicate the features that have already been selected. Let I_0(s_{i-1}) be the index set of the elements with value 0, indicating the unselected features, and let e_j be the unit vector whose j-th element is "1". The target of the i-th stage is then to find j* in I_0(s_{i-1}) such that

\[ j^{*} = \arg\min_{j \in I_0(s_{i-1})} J(s_{i-1} + e_j). \qquad (1) \]

The feature select vector obtained at the i-th stage is s_i = s_{i-1} + e_{j*}.

Generally, forward greedy search, like the other greedy search algorithms, is a sequential decision-making procedure. It is efficient and intuitive, but it is near-sighted, because the best solution at each stage is selected only from a sub-domain. Moreover, when performance estimation is expensive, the computational burden of conducting greedy search strictly becomes unbearable. This is exactly the situation in large-scale complex feature subset selection.
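To fix ideas, here is a minimal sketch of one such stage, assuming an evaluator like the cv_mse function above plays the role of J; the names (forward_greedy_stage, evaluate, selected) are our own illustration, not code from the paper.

    def forward_greedy_stage(evaluate, selected, n_features):
        # One stage of Eq. (1): try every unselected feature and keep the best.
        candidates = [j for j in range(n_features) if j not in selected]
        scores = {j: evaluate(selected + [j]) for j in candidates}  # one expensive call per candidate
        best = min(scores, key=scores.get)                          # argmin over I_0(s_{i-1})
        return selected + [best], scores[best]

    # Running this for 25 stages with an accurate evaluator is exactly the costly
    # procedure discussed above, e.g.:
    # subset = []
    # for _ in range(25):
    #     subset, perf = forward_greedy_stage(lambda s: cv_mse(X, y, s), subset, X.shape[1])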
Thus the questions are: do we really need to distinguish which feature is the best if doing so requires a lot of computational power? What happens if we soften our target and only find one of the good features at each forward stage? The OO methodology tells us that this kind of goal softening is often effective in complex optimization applications. In this paper we show that the idea is also suitable for greedy search in feature subset selection.

2.2 GRASP/Super-heuristics improvement

GRASP and Super-heuristics are two kinds of methods based on the same idea: they try to improve the result of heuristic greedy search by adding a sequence of small random perturbations to the greedy search procedure. GRASP is an iterative process in which each GRASP iteration consists of two phases, a construction phase and a local search phase. In the construction phase, a feasible solution is built iteratively, one element at a time. At each construction iteration, the choice of the next element to be added is determined by ordering all elements of a candidate list with respect to a performance function. The probabilistic component of GRASP is that one of the best candidates in the list is chosen at random, not necessarily the top candidate. This choice technique allows different solutions to be obtained at each GRASP iteration. The local search phase is then applied to attempt to improve each constructed solution to a local optimum. This approach has been successful in many deterministic combinatorial optimization applications. The basic idea behind it is this: if the greedy heuristic is not optimal, there may be a small probability of improving it by random perturbation, and this small probability can be made large by multiple independent repetitions.

In simple terms, the application of GRASP to forward greedy search for feature subset selection is as follows: at the i-th stage, let the feature select vector be s_i = s_{i-1} + e_{j*}, where j* is determined by

\[ j^{*} = \arg \bigl( \text{one of the top minima of } J(s_{i-1} + e_j),\ j \in I_0(s_{i-1}) \bigr). \qquad (2) \]

Here j* is picked according to a given selection probability, e.g., one of the top-5 candidates with equal probability.

The basic assumption of GRASP/Super-heuristics is that the computational burden of performance estimation is very small, so one can try out many random perturbations, e.g., several hundred repetitions, in exchange for one opportunity to improve the final solution. But in our problem, where performance is not easy to evaluate, this approach is too time-consuming to apply directly. In this paper we resolve this infeasibility in the spirit of OO.
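As an illustration of Eq. (2), the following sketch shows one GRASP-style stage decision, assuming an accurate (and therefore expensive) evaluator is available; the names (grasp_stage, evaluate, top_k) are ours, and the top-5 equal-probability rule mentioned above is taken as the default.

    import random

    def grasp_stage(evaluate, selected, n_features, top_k=5, rng=random):
        # One GRASP construction step: rank all one-feature extensions accurately,
        # then pick one of the top_k candidates uniformly at random (Eq. (2)).
        candidates = [j for j in range(n_features) if j not in selected]
        ranked = sorted(candidates, key=lambda j: evaluate(selected + [j]))
        choice = rng.choice(ranked[:top_k])   # one of the best, not necessarily the best
        return selected + [choice]

Each repetition of the whole construction still pays the full cost of accurate ranking, which is exactly the burden the next section removes.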
III. RANDOM APPROXIMATED GREEDY SEARCH

As shown in Fig. 1, when we examine one search stage of a GRASP/Super-heuristics implementation in a stochastic situation, we find that one must estimate the performance as accurately as possible and determine which candidate is the best, the second best, the third best, and so on; then, instead of choosing the best one, one randomly selects one of the good designs. This means the effort spent distinguishing the best from the second and the third is, in a sense, wasted. This raises a question: can we use a crude estimation to achieve the same or a similar purpose directly?

Fig. 1. The basic idea to improve GRASP/Super-heuristics: GRASP uses accurate performance estimation by averaging and then selects one of the good designs, whereas the proposed approach uses crude performance estimation to select a good design directly.

3.1 Remedies from ordinal optimization

It turns out that the theory of Ordinal Optimization (OO) provides an answer. OO attempts to separate the "good enough" from the "bad", say the top x% of the possible solutions from the rest. The OO theory says that one can do this with high probability using only crude models of the problem [10]. This considerably lightens the computational burden. Secondly, since a crude model only guarantees that we locate a "good enough" solution with high probability, i.e., not necessarily the best or the second best but one of the top-x solutions, we automatically obtain the random perturbations mandated by the GRASP/Super-heuristics approach. Thus we essentially kill two birds with one stone.

3.2 Random approximated greedy search algorithm

From this understanding we obtain a new approach to improve greedy search in complex stochastic situations. We name it "Random Approximated Greedy Search (RAGS)."

Algorithm RAGS(s_0, Φ, K, N)
  For i = 1 to N
    Get the extension sub-domain Φ(s_{i-1});
    Evaluate the crude performance g(s_j) for each s_j in Φ(s_{i-1});
    Sort all s_j according to g(s_j);
    Choose k randomly from [1, K];
    Let s_i be the rank-k solution among all s_j;
  End for
  Return s_N

Here s_0 is the initial solution, Φ(·) is the extension sub-domain mapping, K is the random perturbation scale, and N is the number of loop iterations. The main benefit of RAGS is that it provides an opportunity to improve a greedy search algorithm without increasing the computational burden, as the experimental results below will show. How much benefit can be obtained, and under what conditions? These questions remain open; further analysis of the mechanism of RAGS is left as future work of the authors.

When we apply RAGS to forward greedy search for feature subset selection, we obtain a new feature subset selection algorithm, as shown in Fig. 2. In the step that determines the rank-k feature subset, we first use crude performance estimation to introduce random perturbation, and then select the rank-k subset randomly to enhance the randomness. The crude performance estimation can be obtained by M-fold cross validation, where the estimate is the average of the test errors on M data subsets. Furthermore, M can be used to control the scale of the randomness: if M is smaller, the randomness is larger because fewer test error samples are used; if M is larger, the randomness is smaller. But M is limited by the number of data samples in a given data set. An alternative approach to crude estimation is bootstrapping.
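The following minimal sketch renders this procedure in Python for the feature subset selection case, assuming a crude evaluator such as the cv_mse function sketched earlier used with a small number of folds. It is our illustration, not the authors' code: the function and parameter names (rags, crude_evaluate, K, N) are ours, and the acceptance test with the improvement-ratio threshold of Fig. 2 is simplified here to a fixed number of stages.

    import random

    def rags(crude_evaluate, n_features, K=4, N=25, rng=random):
        # One RAGS run: crude ranking plus a random rank-k pick at every stage.
        selected, perf = [], float("inf")
        for _ in range(N):
            candidates = [j for j in range(n_features) if j not in selected]
            if not candidates:
                break
            # crude (cheap, noisy) performance estimate of every one-feature extension
            scores = {j: crude_evaluate(selected + [j]) for j in candidates}
            ranked = sorted(scores, key=scores.get)       # ordinal comparison only
            k = rng.randint(1, min(K, len(ranked)))       # random rank in [1, K]
            pick = ranked[k - 1]                          # keep the rank-k extension
            selected, perf = selected + [pick], scores[pick]
        return selected, perf

    # Independent repetitions give a group of candidate subsets, which can then be
    # re-ranked by a more accurate criterion such as LOOCV, as in Section IV:
    # runs = [rags(lambda s: cv_mse(X, y, s, n_folds=4), X.shape[1]) for _ in range(200)]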
IV. EXPERIMENT RESULTS

4.1 Problem 1

Let us consider an artificially constructed feature subset selection problem to explain the idea of RAGS more clearly; here we can obtain enough test data to validate the goodness of the resulting feature subsets. (The MATLAB program that generates the data set can be downloaded from http://www.sei.xjtu.edu.cn/seien/staff.files/fgao/linearfs.html.) This is a linear regression problem. There are 100 features in total, as described in Table 1; they belong to three classes: basic features, derived features, and random features. The real response value (i.e., the model) is a linear combination of the 20 basic features, with coefficients generated randomly from a normal distribution with mean 0.5 and standard deviation one. The observed response value is the real response value plus i.i.d. normally distributed noise. Because of the derived features and the random noise, it is hard to distinguish the basic features from the others. However, some derived features can provide more useful information than single basic features. The target of this experiment is therefore to find effective feature subsets within a 25-feature limit, i.e., we conduct 25 stages of forward greedy search. In the experiment runs reported below, some irrelevant random features are chosen in the last several stages, which is evidence that the limit is large enough; we therefore consider 25 features a reasonable choice for this problem.

Fig. 2. Flow chart of RAGS for feature subset selection. Initialize i = 1, K = 4, the initial feature select vector s_0 = 0, and the feature-in threshold ratio 1.02. At the i-th forward stage: determine k randomly in [1, K]; construct the current forward subset population by combining each feature of the unselected set I_0(s_{i-1}) with the current feature vector s_{i-1} to form a new candidate feature subset; calculate the crude estimated performance of each candidate, find the rank-k good feature subset, and let the stage performance perf_i be the crude estimated performance of this rank-k subset. If perf_i improves on perf_{i-1} by at least the threshold ratio, accept the rank-k extension and set i = i + 1; otherwise reject the new extension of the selected set and terminate the algorithm.

Table 1. Problem definition of experiment 1.

  Feature No.   Feature type       Description
  1~20          Basic features     #1~4: i.i.d. uniform/normal random numbers; #5~12: squares/square roots of features #1-4; #13~18: cross-products of features #1-4; #19~20: uniform/normal random numbers.
  21~93         Derived features   #21~38: cross-products of features #5-12; #39~68: random linear combinations of features #1-38 with i.i.d. noise; #69~93: random linear combinations of features #1-38 and the real response with i.i.d. noise.
  94~100        Random features    #94~95: independent random numbers with a little relevance to the observed response; #96~100: irrelevant independent random numbers.

In this experiment we generate two data sets:
1. Working data set: a small data set with 200 data samples. The feature subset selection algorithm is run on this data set to obtain the resulting feature subsets.
2. Testing data set: a large data set with 100,000 data samples, used to test the generalization error of a feature subset. The first half of the samples is used to calculate the regression coefficients for a given feature subset (i.e., training with the features selected on data set #1), and the generalization performance is then calculated on the other half.

The experiment was performed using MATLAB. Linear regression is used as the inner modeling method of the wrapper approach, and the MSE (mean square error) is used as the performance criterion. We use leave-one-out cross validation (LOOCV) as the "best" performance estimate, because this is the best one can do to use the limited working data sufficiently, and we treat K-fold CV as its approximation. The whole experiment consists of 4 steps:
1. On the working data set, we execute the forward greedy search feature selection procedure with the LOOCV criterion and obtain a feature subset FS_LOO.
This feature subset is treated as the baseline for comparison.
2. On the working data set, we execute the RAGS-revised forward greedy search. We compare two precision criteria, 10-fold CV and 4-fold CV, where 10-fold CV is treated as a better approximation to LOOCV than 4-fold CV, and we compare the cases with and without random perturbation. We thus obtain four groups of feature subsets by independent repeated runs, as shown in Table 2. The group sizes are chosen so that the same computational cost is spent on each group, to give a fair comparison.
3. The four groups of feature subsets are assessed by LOOCV, and only the top 25 feature subsets in each group are retained as the final result.
4. The four groups of top-25 feature subsets, together with the feature subset FS_LOO from step 1, are assessed on the testing data set to compare their generalization errors.

The experiment results are shown in Fig. 3 and Table 2. In Fig. 3, a box and whisker plot produced by the MATLAB function "boxplot" displays the performance of the top-25 feature subsets in each group: the notched box has lines at the lower quartile, median, and upper quartile of the performance values; the notches represent a robust estimate of the uncertainty about the medians; the whiskers are lines extending from each end of the box to show the extent of the rest of the performance data; and outliers, if any, are marked with plus signs beyond the ends of the whiskers. The resulting analysis is as follows:
1. The feature similarity between the feature subset FS_LOO and the four final groups of feature subsets is significant. The numbers of feature subsets that contain at least 80% of the features of FS_LOO are 15, 11, 14, and 8 in the four resulting groups, respectively, and the total numbers of features appearing in the four resulting groups are 63, 78, 69, and 82, respectively. The frequency with which each feature appears in the group with the 4-fold CV criterion and top-4 random perturbation is shown in Fig. 3(b). Thus RAGS extends the search scope to try more different candidates around FS_LOO, which is the basic idea of RAGS.
2. From Fig. 3(a) it can be seen that the performance diversity increases when cruder estimation is used and random perturbation is added; this is the result of the broader search scope. The median performance values of all four groups are better than that of the feature subset FS_LOO from step 1. This shows that an improvement over the "best"-estimation forward greedy search can indeed be obtained by random sampling around it through cruder performance estimation and random perturbation, which is just what we hope for from RAGS. In this experiment, the group with the cruder CV criterion plus random perturbation, which has the broadest search scope, has the best mean performance and contains the best feature subset under the same computational burden.
3. The MSE of the subset consisting of exactly the 20 basic features is 1.0562 on the test data set, much better than all the feature subsets found. However, its LOOCV performance estimate on the working data set is only 1.0009, while the feature subset FS_LOO from step 1 has the value 0.7855. It is therefore missed, because it is not attractive enough from the viewpoint of the LOOCV criterion.

In summary, in this problem there really exist better solutions in the neighborhood of the solution found by greedy search with the "best" performance estimation, and RAGS did find some of these better solutions under a comparable time limit and computational load.
Table 2. The results of experiment 1.

  Group No.                               #1        #2             #3             #4             #5
  Performance assessing criterion         LOOCV     10-fold CV     10-fold CV     4-fold CV      4-fold CV
  Random perturbation                     no        no             within top-4   no             within top-4
  Group size                              1         200            200            500            500
  Regressions needed to assess each
    candidate intermediate feature subset 200       10             10             4              4
  Total computation burden (total
    regressions in the 25-stage
    forward greedy search)                940,000   9,400,000      9,400,000      9,400,000      9,400,000
  MSE of LOOCV criterion on working
    set (mean/std)                        0.7855    0.7861/0.0061  0.8128/0.0071  0.8015/0.0049  0.8230/0.0123
  MSE of resulting feature subsets on
    testing set (mean/std)                1.4443    1.3970/0.1037  1.3561/0.1391  1.3680/0.1024  1.3529/0.1416

Fig. 3. Comparison results in experiment 1. (a) Comparison of test performance (MSE on the test data set) for the groups LOO, 10-fold, 10-fold & top-4, 4-fold, and 4-fold & top-4. (b) Feature distribution (frequency of each feature in the 25 resulting feature subsets) for the group with the 4-fold CV criterion and top-4 random perturbation.

4.2 Problem 2

This problem is Problem 8 of the NIPS2000 Unlabeled Data Supervised Learning Competition. (The data set, with further details, can be found at http://q.cis.uoguelph.ca/~skremer/Research/NIPS2000/.) It involves predicting the bioreactivity of pharmaceutical compounds based on properties of the compounds' molecular structure. The inputs are 787 real-valued features that are a mixture of various types of descriptors for the molecules based on their one-dimensional, two-dimensional, and electron density properties. The training set contains 147 examples; the test set contains 50 examples. With so many features and so few samples, feature selection is very important.

The best result of the competition is reported in [6]. The authors used polynomial modeling to obtain a suitable feature subset; based on this subset, their best reported forecasting result on the test set was obtained by a neural network (NN) model trained on the training set. They used polynomials up to degree 3 to build 2325 monomials, omitting cross-products. The best feature subset they found is: x172, x36, x159, x1722, x2223, x5653.

We test our RAGS approach on these 2325 monomial features as well. The linear regression model is used, and 3-fold CV is used to estimate the crude performance of a feature subset on the training set. The forward greedy procedure is extended to 10 features, over which the performance criterion keeps decreasing; we stop at 10 features, and a backward greedy pass is then executed to check whether any feature can be removed while keeping the performance. We obtain 100 feature subsets by independent runs. The features that appear in more than 10% of the feature subsets are: (x36, 100), (x1723, 72), (x2223, 54), (x172, 45), (x159, 35), (x26, 27), (x7793, 26), (x173, 21), (x5943, 20), (x1722, 18), (x326, 18), (x5653, 13), (x4242, 12), (x261, 12), (x4463, 11), (x260, 11), where the number in brackets is the number of subsets in which the feature appears. It can be seen that the 6 features of the final feature subset of [6] are all in the list.

LOOCV is then used to estimate the accurate performance of each feature subset on the training set, and the final feature subsets are selected according to their LOOCV performance. When we evaluate the 100 feature subsets on the test data set, we find that the test errors are not fully aligned with the LOOCV performances, and there are some clear outliers. The test results of the top 25 feature subsets according to LOOCV performance are shown in Fig. 4.
In Fig. 4, the dots on each vertical line indicate the predicted values for a given point of the test set by the models based on the top 25 feature subsets. One feature subset achieves MSE_test = 0.203, which is very good for this test set, but MSE_test cannot be used as a criterion for selecting features, because the true response values of the test set are supposed to be unknown while features are selected and the model is built. So this feature subset cannot be taken as the answer. The plus sign on each line in Fig. 4 is the prediction of the bagging approach, where we use the median instead of the mean to reduce the effect of outliers. We obtain MSE_test = 0.456 from the bagged polynomial models and MSE_test = 0.378 from the bagged NN models; both results are better than those reported in [6]. The results are summarized in Table 3. In summary, the important features are found by RAGS through numerical computation alone, and the test performance is better.

Table 3. The experiment results of problem 2.

              Polynomial in [6]   NN in [6]   RAGS polynomial   RAGS NN
  MSE_train   0.266               0.241       -                 -
  MSE_LOO     0.289               0.384       -                 -
  MSE_test    0.523               0.417       0.456             0.378

Fig. 4. Test result of problem 2: predicted response versus real response. On each vertical line, the dots indicate the prediction results of the models based on the top 25 feature subsets, while the plus sign is their bagged prediction.

4.3 Problem 3

This problem is another drug design problem [7], molecular Caco-2 permeability. There are 713 features and 27 samples in the given data set. Bi et al. studied feature selection for this problem with the VS-SSVM (Variable Selection via Sparse SVMs) approach they proposed. We test RAGS on this problem too. The linear regression model is used, and 3-fold CV is used to estimate the crude performance, as in problem 2. The forward greedy search procedure is extended to 20 stages, and then the backward greedy search procedure is executed. We obtain 200 feature subsets independently and estimate their LOOCV performances; there are 11.5 features in each feature subset on average. Finally, a forward greedy search driven directly by the LOO performance is also conducted.

The mean LOOCV Q² of the 200 feature subsets is 0.0854, which is much smaller than the value 0.293 reported in [7]. Following the basic idea of RAGS, we treat LOOCV as the accurate performance estimate here and 3-fold CV as the crude one. From Fig. 5 it can be seen that a good 3-fold CV value usually also indicates good LOOCV performance. But the best LOOCV Q² among the 200 subsets is 0.0037, while the feature subset obtained by greedy search directly on the LOOCV criterion has Q² = 0.0015. That means the accurate greedy search obtains the best solution in this experiment. It may seem that RAGS fails to obtain a better result, but this is not a failure of the idea of RAGS: this best solution is probably the global optimum, and RAGS provides some evidence leading to that conclusion. So in cases like this problem, the results of RAGS are also helpful.

Fig. 5. Ordered performance curves in problem 3: performance versus the ordered feature sets. The upper curve is the 3-fold CV performance, and the lower curve is the LOO performance.

V. CONCLUSION

We propose a new approach named Random Approximated Greedy Search (RAGS) to solve stochastic combinatorial optimization problems.
It is an extension of the GRASP/Super-heuristics approach to stochastic problems where performance estimation is very expensive. RAGS can be applied to situations where the objective function is expressed as the expected value of an evaluation function, the evaluation procedure is very time-consuming, and only solution approaches based on some heuristic greedy search algorithm are available. Instead of spending a great deal of computation to perform the heuristic greedy search strictly, by estimating the performance accurately, we suggest using a crude estimate combined with some degree of random perturbation in the heuristic greedy search procedure. One can thereby expect to generate opportunities to find better solutions at comparable computational effort. In effect, RAGS is a biased random sampling in the neighborhood of the strict heuristic greedy search solution.

The complex, large-scale feature subset selection problem is just such a case. In this paper we apply RAGS to feature subset selection and show its effectiveness in several experiments. In the next step of this research, we will apply RAGS to other real feature subset selection problems, and then try to extend it to other complex stochastic optimization applications.

REFERENCES

1. Kohavi, R. and G. John, "Wrappers for Feature Subset Selection," Artif. Intell. J., Special Issue on Relevance, Vol. 97, No. 1-2, pp. 273-324 (1997).
2. Stone, M., "Cross-Validatory Choice and Assessment of Statistical Predictions," J. R. Statist. Soc. B, Vol. 36, pp. 111-147 (1974).
3. Langley, P., "Selection of Relevant Features in Machine Learning," Proc. AAAI Fall Symp. on Relevance, pp. 140-144 (1994).
4. Guyon, I. and A. Elisseeff, "An Introduction to Variable and Feature Selection," J. Mach. Learn. Res., Vol. 3, pp. 1157-1182 (2003).
5. Miller, A.J., "Selection of Subsets of Regression Variables," J. R. Statist. Soc. A, Vol. 147, Part 3, pp. 389-425 (1984).
6. Rivals, I. and L. Personnaz, "MLPs for Nonlinear Modeling," J. Mach. Learn. Res., Vol. 3, pp. 1383-1398 (2003).
7. Bi, J., K.P. Bennett, et al., "Dimensionality Reduction via Sparse Support Vector Machines," J. Mach. Learn. Res., Vol. 3, pp. 1229-1243 (2003).
8. Feo, T.A. and M.G.C. Resende, "Greedy Randomized Adaptive Search Procedures," J. Glob. Optim., Vol. 6, pp. 109-133 (1995).
9. Lau, E. and Y.-C. Ho, "Super-Heuristics and Its Applications to Combinatorial Problems," Asian J. Contr., Vol. 1, No. 1, pp. 1-13 (1999).
10. Ho, Y.C., "An Explanation of Ordinal Optimization – Soft Optimization for Hard Problems," Inf. Sci., Vol. 113, pp. 169-192 (1999).
11. Shi, L. and C.H. Chen, "A New Algorithm for Stochastic Discrete Resource Allocation Optimization," Discrete Event Dyn. Syst., Vol. 10, pp. 271-294 (2000).