A New Feature Selection Algorithm for Two-Class Classification Problems and Application to Endometrial Cancer

M. Eren Ahsen^1, Nitin K. Singh^1, Todd Boren^2, M. Vidyasagar^1 and Michael A. White^2

^1 Department of Bioengineering, University of Texas at Dallas, 800 W. Campbell Road, Richardson, TX 75080. ^2 Department of Cell Biology, UT Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390. MEA, NKS and MV are supported by National Science Foundation Award #1001643, the Cecil & Ida Green Endowment, and by a Developmental Award from the Harold Simmons Comprehensive Cancer Center, UT Southwestern Medical Center. The work of TB and MAW is supported by the Welch Foundation Grant #I-1414 and the National Cancer Institute Grant #CA71443.

Abstract— In this paper, we introduce a new algorithm for feature selection in two-class classification problems, called ℓ1-StaR. The algorithm consists of first extracting the statistically relevant features using the Student t-test, and then passing the reduced feature set to an ℓ1-norm support vector machine (SVM) with recursive feature elimination (RFE). The final number of features chosen by the ℓ1-StaR algorithm can be smaller than the number of samples, unlike with ℓ1-norm regression, where the final number of features is bounded below by the number of samples. The algorithm is illustrated by applying it to the problem of determining which endometrial cancer patients are at risk of having the cancer spread to their lymph nodes. The data consisted of 1,428 micro-RNAs measured on a data set of 94 patient samples (divided evenly between those with lymph node metastasis and those without). Using the algorithm, we identified a subset of just 15 micro-RNAs, and a linear classifier based on these, that achieved two-fold cross-validation accuracies in excess of 80%, and combined accuracy, sensitivity and specificity in excess of 93%.

I. INTRODUCTION

Biological data sets are characterized by having far more features than samples, making it a challenge to identify the most relevant features in a given problem. In this paper, we introduce a new algorithm for feature selection in two-class classification problems, called ℓ1-StaR. It is a 'second-order' acronym, and stands for "ℓ1 SVM t-test and RFE", where SVM (Support Vector Machine) and RFE (Recursive Feature Elimination) are themselves acronyms. The name can be pronounced either as 'ell-one star' or as 'lone star'. Out of deference to the domicile of the authors, perhaps the second pronunciation is to be preferred.

The first step in the algorithm is to determine the features that show a statistically significant difference between the means of the two classes using the Student t-test, and to discard the rest. The reduced data set is then passed on to an ℓ1-norm SVM, which forces many features to be assigned a weight of zero. The features with zero weight are discarded and the algorithm is run again (RFE), until no further reduction is possible. While both the ℓ1-norm SVM (see e.g. [1]) and an ℓ2-norm SVM with RFE (see e.g. [2]) are standard algorithms, it appears that our approach of applying RFE to the ℓ1-norm SVM is new. The algorithm produces a final feature set that can be smaller than the number of samples, in contrast with ℓ1-norm regression [3], where the final number of features is bounded below by the number of samples [4]. The algorithm is applied to a problem in endometrial cancer.
The endometrium is the lining of the uterus, and a patient with endometrial cancer will have her uterus, ovaries and fallopian tubes removed. However, if the cancer has spread beyond these organs to the lymph nodes, then the patient runs a serious risk to her life. Consequently, the GOG (Gynecological Oncology Group) recommends that any patient with a tumor larger than 2 cm must also have her pelvic lymph nodes removed. However, post-surgery analysis of the removed lymph nodes at UT Southwestern Medical Center reveals that an astounding 78% of the lymph node resections (removals) were unnecessary. The objective of our study is to predict which patients are at risk of lymph node metastasis. The data consists of measurements of 1,428 micro-RNAs on 94 patients, divided evenly between those with lymph node metastasis and those without. Using the lone-star algorithm, a final set of 15 micro-RNAs is identified, together with a linear classifier that is able to achieve accuracy, sensitivity and specificity in excess of 93%.

II. PROBLEM FORMULATION AND LITERATURE SURVEY

A. Support Vector Machines

Suppose we are given a set of doubly indexed real numbers {x_{ij}, i = 1, ..., n, j = 1, ..., m}, where n denotes the number of features and m denotes the number of samples. It is generally the case in biology problems that n ≫ m. The first m1 samples belong to Class 1, while the remaining m2 = m − m1 samples belong to Class 2. Let us introduce the symbols

N = {1, ..., n}, M1 = {1, ..., m1}, M2 = {m1 + 1, ..., m1 + m2}, M = {1, ..., m} = M1 ∪ M2.

Further, let us define x_j := (x_{ij}, i = 1, ..., n) ∈ R^n to be the set of feature values associated with the j-th sample. Then the data is said to be linearly separable if there exists a weight vector w ∈ R^n and a threshold θ ∈ R such that

w^T x_j ≥ θ + 1 ∀ j ∈ M1,   w^T x_j ≤ θ − 1 ∀ j ∈ M2.   (1)

In other words, if we define a hyperplane H in R^n by the equation H = {z ∈ R^n : w^T z = θ}, then H separates the two classes {x_j, j ∈ M1} and {x_j, j ∈ M2}. Note that the constant 1 is introduced purely to ensure that there is a 'zone of separation' between the two classes.

It is easy to see that, if there exists one linear classifier for a given data set, then there exist infinitely many. The question therefore is: which amongst these is 'optimal' in some sense? One of the most successful and widely used linear classifiers is the Support Vector Machine (SVM) introduced in [5]. In that paper, the authors choose a classifier that maximizes the minimum ℓ2-distance of any vector x_j to the separating hyperplane. Mathematically this is equivalent to minimizing ||w||_2^2 over the set of all weight vectors w and thresholds θ that satisfy (1). This is a quadratic programming problem, in that the objective function to be minimized is quadratic and the constraints are linear. Once w and θ are determined in this fashion, a new test input x is assigned to Class 1 if w^T x − θ > 0 and to Class 2 if w^T x − θ < 0.

The SVM is attractive for two reasons. First, since it is a quadratic programming problem, it can be applied to truly enormous data sets (very large m). Second, the optimal classifier is 'supported' on a relatively small number of samples, so adding more samples to the data set often does not change the classifier. This property is extremely useful in situations where m ≫ n, that is, where there are far more samples than features. However, in biological problems, where the situation is the inverse, the second property is not so useful.
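To make the quadratic programming formulation concrete, the following is a minimal sketch of the hard-margin problem above: minimize ||w||_2^2 subject to the separation constraints (1). The sketch is not from the paper; it uses Python with the cvxpy modeling library and synthetic, well-separated toy data, and it assumes the data really is linearly separable (otherwise the problem is infeasible and the soft-margin formulations discussed below apply).

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m1, m2 = 5, 20, 20                      # features, samples per class (toy sizes)
X1 = rng.normal(+2.0, 1.0, size=(m1, n))   # Class 1 samples (rows), centered at +2
X2 = rng.normal(-2.0, 1.0, size=(m2, n))   # Class 2 samples (rows), centered at -2

w = cp.Variable(n)                         # weight vector
theta = cp.Variable()                      # threshold

# Hard-margin constraints (1): w^T x_j >= theta + 1 (Class 1), <= theta - 1 (Class 2).
constraints = [X1 @ w >= theta + 1,
               X2 @ w <= theta - 1]

# Maximizing the minimum l2-distance to the hyperplane is equivalent to
# minimizing ||w||_2^2 subject to (1): a quadratic program.
prob = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
prob.solve()

if prob.status == cp.OPTIMAL:
    scores = np.vstack([X1, X2]) @ w.value - theta.value
    print("margin width:", 2.0 / np.linalg.norm(w.value))
    print("all samples correctly separated:",
          np.all(scores[:m1] > 0) and np.all(scores[m1:] < 0))
else:
    print("data not linearly separable; use a soft-margin formulation instead")
```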
Now we discuss the existence of linear classifiers. With advances in Vapnik-Chervonenkis theory, the situation is by now quite clear. Suppose as before that we are given m samples of n features. Then there are 2^m different ways of assigning the m samples to two classes. It is known [6], [7] that if n ≥ m + 1, then for each of the 2^m possible assignments of the n-dimensional vectors to two classes, generically it is possible to find a linear classifier. (The precise statement is that, unless all m of the n-dimensional vectors belong to a lower-dimensional hyperplane, a linear classifier exists for each of the 2^m different assignments of the m vectors to two classes.) Thus, if in a given problem n < m + 1, then we increase the dimension of the feature vector, for example by taking higher powers of the components of the feature vector, until the (enlarged) number of features exceeds m + 1. The resulting classifier is referred to as a higher-order SVM, or a polynomial SVM. The advantage of a higher-order SVM is that linear separability is always guaranteed, whereas the disadvantage is that it is no longer a linear classifier in the original feature space.

The standard SVM formulation addresses the situation where the data is linearly separable. Given a data set, it is possible to determine in polynomial time whether it is linearly separable, since that is equivalent to testing the feasibility of a linear programming problem. In case the data is not linearly separable, there are several competing approaches, some of which are described below, including the one used in this paper.

We have already discussed the possibility of augmenting the dimension of the feature space by using higher powers. Another approach, known as the 'soft margin' classifier, is to replace (1) by

w^T x_j ≥ θ − ε ∀ j ∈ M1,   w^T x_j ≤ θ + ε ∀ j ∈ M2,   (2)

where ε > 0, while simultaneously minimizing ||w||_2 and ε. Note that, in contrast to (1), if the soft margin classifier is used, then any point x such that |w^T x − θ| ≤ ε cannot be unambiguously assigned to either class. This is because the two half-spaces defined by

H1 := {x ∈ R^n : w^T x ≥ θ − ε},   H2 := {x ∈ R^n : w^T x ≤ θ + ε}

are not disjoint but actually overlap. Thus a soft margin classifier is in reality a three-class classifier, in that it assigns a test input x to Class 1 if w^T x − θ > ε, to Class 2 if w^T x − θ < −ε, and to 'don't know' if w^T x − θ ∈ [−ε, ε]. Soft margin classification works well in cases where it is possible to achieve this 'overlapping linear separation' with values of ε that are relatively small compared to θ. However, depending on the nature of the data set, it is possible that the entire training data set falls into the 'don't know' category.

Another possibility is to take the given non-separable data set, and find a linear classifier of the form (1) that misclassifies the fewest data vectors. Mathematically this is equivalent to the 'minimum flipping problem', that is, determining the smallest number of data vectors that need to have their labels flipped in order to make the resulting set linearly separable. Unfortunately, it is known that this problem is NP-hard [8].

Another approach, which is the one adopted here, is to find a tractable version of the 'minimum flipping problem' described above. Choose some parameter λ ∈ (0, 1), and let e denote the column vector of all ones, with the subscript denoting its dimension.
Then the problem is:

min over w, θ, y, z of (1 − λ)(y^T e_{m1} + z^T e_{m2}) + λ||w||_2^2,   (3)

subject to the following constraints:

w^T x_j ≥ θ + 1 − y_j ∀ j ∈ M1,   w^T x_j ≤ θ − 1 + z_j ∀ j ∈ M2,
y_j ≥ 0 ∀ j ∈ M1,   z_j ≥ 0 ∀ j ∈ M2.

It is obvious that the variables y_j, z_j play the role of slack variables, introduced so that linear separability is achieved once they are included. If the data is linearly separable, then the constraints can be satisfied with y_j = 0 for all j ∈ M1, z_j = 0 for all j ∈ M2, and any choice of w and θ that achieves linear separation. In this case the cost function becomes λ||w||_2^2. Let w^0, θ^0 denote the solution to the standard SVM formulation, that is,

||w^0||_2^2 = min over w, θ of ||w||_2^2 subject to w^T x_j ≥ θ + 1 ∀ j ∈ M1, w^T x_j ≤ θ − 1 ∀ j ∈ M2.

Then it is obvious that the minimum value for the problem in (3) cannot be larger than λ||w^0||_2^2. Moreover, if λ is sufficiently small, the benefit of reducing ||w||_2^2 below ||w^0||_2^2 will be offset by the penalty due to having nonzero slack variables y_j, z_j. In other words, for linearly separable data sets, there exists a λ_0 such that for all λ ∈ (0, λ_0) the solution to the problem in (3) is w = w^0, θ = θ^0, y = 0_{m1}, z = 0_{m2}. On the other hand, if the data is not linearly separable, the problem is one of trading off the norm of the weight vector w, as measured by the second term in the objective function, against the extent of the misclassification, as measured by the first term. For this reason, we should choose λ to be quite close to zero but not exactly equal to zero. Note that, unlike the minimum flipping problem, this problem is a quadratic programming problem, and can thus be solved in polynomial time. However, there is no guarantee that the classifier found in this fashion is optimal in the sense of the number of misclassified points.

B. ℓ1-Norm SVM

Now we discuss the ℓ1-norm SVM, following the notation and content of Section 3 of [1]. Let ||·|| denote any norm on R^n, and recall that its dual norm is defined by

||w||_d = max over ||x|| ≤ 1 of |w^T x|.

For an arbitrary norm ||·|| on R^n, the corresponding support vector machine can be found by solving the following problem (see Equation (11) of [1], and note that in [1] the data vectors are taken as row vectors, whereas we denote them as column vectors). Choose some parameter λ ∈ (0, 1). Then the problem is

min over w, θ, y, z of (1 − λ)(y^T e_{m1} + z^T e_{m2}) + λ||w||_d,   (4)

subject to the following constraints:

w^T x_j ≥ θ + 1 − y_j ∀ j ∈ M1,   w^T x_j ≤ θ − 1 + z_j ∀ j ∈ M2,
y_j ≥ 0 ∀ j ∈ M1,   z_j ≥ 0 ∀ j ∈ M2.

C. Trading Off Sensitivity and Specificity

Next we discuss a simple modification of the objective function, first suggested by Veropoulos et al. [9], for 'automatically' trading off sensitivity versus specificity. Recall the definitions of these concepts for a two-class classifier. For such a classifier, there are four possible outcomes, namely TP, TN, FP, FN, representing true-positive through false-negative. The sensitivity is defined as the ratio TP/(TP + FN), while the specificity is defined as TN/(TN + FP). In other words, the sensitivity is the fraction of the truly 'positive' samples that the classifier labels as positive, while the specificity is the fraction of the truly 'negative' samples that the classifier labels as negative. It is well-known that no classifier can achieve both 100% sensitivity and 100% specificity (except in contrived examples).
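Since these definitions are easy to mis-remember, the following small Python helper (not part of the original paper; the ±1 label convention is our own choice) computes accuracy, sensitivity and specificity from true and predicted labels, with Class 1 treated as 'positive':

```python
import numpy as np

def accuracy_sensitivity_specificity(y_true, y_pred):
    """y_true, y_pred: arrays of +1 (Class 1, 'positive') and -1 (Class 2, 'negative')."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)      # fraction of true positives recovered
    specificity = tn / (tn + fp)      # fraction of true negatives recovered
    return accuracy, sensitivity, specificity

# Tiny worked example: 4 positives, 4 negatives, one mistake of each kind.
print(accuracy_sensitivity_specificity(
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, 1, 1, -1, -1, -1, -1, 1]))   # -> (0.75, 0.75, 0.75)
```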
The curve that plots the maximum sensitivity achievable by any classifier as a function of the specificity (or vice versa) is referred to, for historical reasons, as the ROC (Receiver Operating Characteristic) curve. In a given problem, the ROC curve is not known in general. However, it is possible to plot the maximum sensitivity achievable by the class of classifiers under study, and use that as an approximation of the ROC curve. This is what we do here. To trade off sensitivity against specificity, we simply make the following substitution in the cost function:

y^T e_{m1} + z^T e_{m2} ← α y^T e_{m1} + (1 − α) z^T e_{m2},

where α ∈ (0, 1). Clearly, if α < 0.5, then larger values of y are tolerated, and the classifier will have higher specificity than sensitivity; if α > 0.5 then the classifier will have higher sensitivity than specificity. If α = 0.5 then this reduces to the earlier problem formulation, because the scale factor of 0.5 clearly does not change the problem. With this modification, the final optimization problem becomes the following:

min over w, θ, y, z of (1 − λ)(α y^T e_{m1} + (1 − α) z^T e_{m2}) + λ||w||_d,   (5)

subject to the following constraints:

w^T x_j ≥ θ + 1 − y_j ∀ j ∈ M1,   w^T x_j ≤ θ − 1 + z_j ∀ j ∈ M2,
y_j ≥ 0 ∀ j ∈ M1,   z_j ≥ 0 ∀ j ∈ M2.

Since the ℓ2-norm is its own dual, in the special case where the distance to the separating hyperplane is measured using the ℓ2-norm, (4) reduces to (3) (after replacing the term λ||w||_d by λ||w||_d^2). In general, the optimal weight vector w for the minimization problem (3) will have all of its components nonzero, since (3) is a quadratic programming problem. To ensure that the optimal weight vector w has a large number of zero entries, it is suggested in [1] that the distance to the separating hyperplane should be measured using the ℓ1-norm. If we use the ℓ1-norm to measure distances in R^n, then the dual norm is the ℓ∞-norm, and the problem (4) becomes a linear programming problem. Therefore the number of nonzero components of the optimal weight vector is bounded by m, the number of samples.

D. Recursive Feature Elimination in ℓ2-Norm SVMs

Until now we have been discussing various ways of generating linear classifiers that use all the components of the vectors x_j, that is, all the features. Now we discuss the ℓ2-norm SVM RFE as defined in [2]. In that paper, the authors begin with all n features, and divide the two classes M1 and M2 into five roughly equal subsets each. Then one subset is left out from each class, leaving roughly 80% of the original samples within each class. For each such choice, a traditional ℓ2-norm SVM is computed, and the associated weight vector is determined. By cycling through all five choices for the subset to be omitted, the authors generate a total of five optimal weight vectors. These are then averaged to produce a five-fold cross-validated weight vector. From this weight vector, the least significant (in terms of magnitude) component is dropped, resulting in a reduction of one in the number of features used, from n to n − 1. Then the exercise is repeated. If several components of the averaged (or cross-validated) weight vector are small in magnitude, then more than one component can be dropped at each step. However, since the ℓ2-norm SVM formulation is a quadratic programming problem, in general there is no reason why any components of the weight vector should be small, let alone equal to zero.
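Before describing the new algorithm, the following is a minimal sketch of the weighted-slack problem (5) in which the norm term is taken to be the ℓ1-norm of w, the choice that most directly produces sparse weight vectors and the behavior the lone-star algorithm relies on; this is one common reading of the ℓ1-norm SVM of [1], not necessarily the exact formulation above. The sketch uses Python with cvxpy; the library, the toy data, and the parameter values are our own choices and are not taken from the paper.

```python
import numpy as np
import cvxpy as cp

def l1_svm(X1, X2, lam=0.05, alpha=0.5):
    """Weighted-slack SVM of the form (5) with an l1 penalty on w.

    X1: (m1, n) array of Class 1 samples (rows); X2: (m2, n) array of Class 2 samples.
    Returns (w, theta). Many entries of w typically come out exactly zero.
    """
    m1, n = X1.shape
    m2, _ = X2.shape
    w, theta = cp.Variable(n), cp.Variable()
    y = cp.Variable(m1, nonneg=True)   # slacks for Class 1
    z = cp.Variable(m2, nonneg=True)   # slacks for Class 2
    objective = (1 - lam) * (alpha * cp.sum(y) + (1 - alpha) * cp.sum(z)) \
                + lam * cp.norm1(w)
    constraints = [X1 @ w >= theta + 1 - y,
                   X2 @ w <= theta - 1 + z]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value, theta.value

# Toy data: only the first 3 of 50 features actually differ between the classes.
rng = np.random.default_rng(1)
X1 = rng.normal(0.0, 1.0, size=(30, 50)); X1[:, :3] += 2.0
X2 = rng.normal(0.0, 1.0, size=(30, 50)); X2[:, :3] -= 2.0
w, theta = l1_svm(X1, X2)
print("nonzero weights:", np.sum(np.abs(w) > 1e-6), "out of", w.size)
```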
III. THE LONE STAR ALGORITHM

In this section we present our new algorithm, which consists of two steps. First we reduce the number of features using the 'Student' t-test, and then we apply recursive feature elimination, but to the ℓ1-norm SVM rather than the ℓ2-norm SVM as in [2].

A. Pre-Processing the Feature Set

For each index i, we compute the average of the two classes for the i-th feature. Thus for all i = 1, ..., n, we compute the means

µ_{i,l} = (1/m_l) Σ_{j ∈ M_l} x_{i,j},   l = 1, 2.

Then we use the standard 'Student' t-test to determine whether or not there is a statistically significant difference between the two means. The significance level can be anything we wish, but in biology it is common to accept the difference as being significant if the likelihood of it occurring by chance is less than 0.05, that is, if the null hypothesis that the two means are equal can be rejected at a 95% level of confidence. This results in a reduction in the number of features. For convenience, we continue to use the symbol n to denote the reduced number of features.

B. ℓ1-Norm SVM with Recursive Feature Elimination

The next step in the algorithm is to combine the ℓ1-norm SVM with recursive feature elimination. Specifically (a schematic code sketch of the full procedure is given after this list):

• Choose at random a 'training set' of samples of size k1 from M1 and size k2 from M2, such that k_l ≤ m_l/2, and k1, k2 are roughly equal. In the endometrial cancer application below, m1 = m2, so the training set contains half of the samples within each class. Then compute an optimal ℓ1-norm SVM for the chosen training set.
• Repeat the above exercise several times, with different randomized choices of training and testing data. (We repeated this step with 80 randomized choices and with 1,000 choices, and there was hardly any difference in the outcomes.) This is unlike [2], where there is only one randomized division.
• For each randomized assignment of the data to the two classes, the number of nonzero entries in the optimal weight vector is more or less the same, whereas the locations of the nonzero entries vary from one run to another.
• Let k denote the average number of nonzero entries in the optimal weight vector across all randomized runs. Average all the optimal weight vectors, and choose the largest k entries and the corresponding feature set. This reduces the number of features from the original n to k in one shot. (See Section VI for an alternate method for choosing the features, and the observation that both methods lead to essentially the same results.)
• Repeat the process with the reduced feature set, and with several sets of k1, k2 randomly selected training samples, until no further reduction is possible in the number of features. This determines the final set of features to be used.
• Once the final feature set is determined, carry out two-fold cross validation by dividing the data into a training set of k1, k2 randomly selected samples and assessing the performance of the resulting ℓ1-norm classifier on the testing data set, which is the remainder of the samples. Average the weights generated by the 20 (or however many) best-performing classifiers, and call that the final classifier. At this point there is no distinction between the training and testing data sets, so run the final classifier on the entire data set to arrive at the final accuracy, sensitivity and specificity figures.
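The following is a schematic sketch of the procedure described above, written in Python. It is not the authors' code: the t-test filter uses scipy, the ℓ1-norm SVM step is stood in for by scikit-learn's LinearSVC with an ℓ1 penalty (a squared-hinge, penalized formulation rather than the exact problem (5)), and details such as the tie-breaking and stopping rule are our own simplifications.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.svm import LinearSVC

def lone_star(X, labels, n_splits=80, p_cut=0.05, seed=0):
    """Schematic lone-star feature selection.

    X: (n_features, m_samples) expression matrix; labels: array of +1/-1 per sample.
    Returns indices (rows of X) of the final feature set.
    """
    rng = np.random.default_rng(seed)
    pos, neg = np.where(labels == 1)[0], np.where(labels == -1)[0]

    # Step 1: Student t-test filter, keep features with p < p_cut.
    _, pvals = ttest_ind(X[:, pos], X[:, neg], axis=1)
    features = np.where(pvals < p_cut)[0]

    # Step 2: l1-SVM with recursive feature elimination over random half-splits.
    while True:
        weights, nonzeros = [], []
        for _ in range(n_splits):
            tr = np.concatenate([rng.choice(pos, len(pos) // 2, replace=False),
                                 rng.choice(neg, len(neg) // 2, replace=False)])
            clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False,
                            C=1.0, max_iter=10000)
            clf.fit(X[np.ix_(features, tr)].T, labels[tr])
            w = clf.coef_.ravel()
            weights.append(w)
            nonzeros.append(np.sum(np.abs(w) > 1e-6))
        k = int(round(np.mean(nonzeros)))           # average number of nonzero weights
        if k >= len(features):                      # no further reduction possible
            return features
        avg_w = np.mean(weights, axis=0)
        features = features[np.argsort(-np.abs(avg_w))[:k]]  # keep the largest k entries
```

For the data set described in Section V, X would be the 1,428 × 94 matrix of log2 miRNA expression values (after the NAN handling described there), and labels would distinguish the stage IIIC samples from the stage IA/IB samples.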
The advantage of the above approach vis-à-vis the ℓ2-norm SVM-RFE is that the number of features reduces significantly at each step, and the algorithm converges in just a few steps. This is because, with the ℓ1-norm, many components of the weight vector are 'naturally' zero, and need not be truncated. In contrast, in general all the components of the weight vector resulting from the ℓ2-norm SVM will be nonzero; as a result the features can only be eliminated one at a time, and in general the number of iterations is equal to (or comparable to) n, the initial number of features.

IV. CASE STUDY: ENDOMETRIAL CANCER

The endometrium is the lining of the uterus. Endometrial cancer is the most common gynecological malignancy, afflicting up to 48,000 women annually. Due to early detection, endometrial cancer results in 'only' about 8,000 deaths annually. The presence of pelvic and/or para-aortic lymph node metastasis decreases the 5-year survival rate from 85% to 58% and 41%, respectively [10]. Based upon surgical staging studies in the 1970s and 1980s that suggested frequent errors in the clinical staging of endometrial cancers [11], [12], the International Federation of Obstetrics and Gynecology adopted a surgical staging system in 1988, which was recently updated in 2009 [13]. Currently, primary staging surgery for endometrial cancer consists of removal of the uterus, ovaries, fallopian tubes, and pelvic and para-aortic lymph node dissection, as well as omentectomy when indicated. The incidence of pelvic and para-aortic node metastasis in patients with stage I endometrial cancer is 4-22% and varies with grade, depth of invasion, lymphovascular space invasion, and histologic subtype [12]. Therefore, 78-96% of patients with endometrial cancer will not benefit from a lymphatic dissection. Morbidities associated with pelvic and para-aortic lymph node dissection include increased operative times, increased blood loss, ileus, an increased number of thromboembolic events, lymphocyst formation, and major wound dehiscence, all of which adversely affect the patient's health and quality of life [14].

Efforts have been made to identify patients, pre-operatively or intra-operatively, who are at greatest risk for lymph node metastasis and would benefit from a formal lymph node assessment. In the two largest studies to date, patients with grade I/II tumors, tumors with < 50% uterine invasion, and tumors < 2 cm in size did not have any evidence of lymph node metastasis and could be spared a lymph node dissection [15], [16]. In patients who do not meet these criteria, lymph node resection is the accepted practice. However, post-surgery analysis shows that, even in patients not meeting the aforementioned criteria, lymph node metastasis was identified in only 22% of patients. Therefore, 78% of patients underwent a morbid procedure that they ultimately did not need, despite the use of the most up-to-date recommended practice.

The incidence of pelvic and para-aortic lymph node metastasis in endometrial cancer is likely related to the biologic aggressiveness of the tumor, reflected in the genetic determinants of the cellular mechanisms that control metastasis. MicroRNAs (miRNAs) are 19- to 25-nucleotide, non-coding RNA transcripts, thought to be instrumental in controlling eukaryotic cell function via modulation of the post-transcriptional activity of multiple target messenger RNA (mRNA) genes, by repression of translation or regulation of mRNA degradation [17], [18], [19].
As such, miRNAs may impact critical control mechanisms in tumor cells that affect metastatic potential. MicroRNA expression analysis can identify differentially expressed miRNAs in patient populations with different clinical characteristics, and has been utilized in endometrial and ovarian cancers. MicroRNA expression patterns have been identified that can predict benign vs. malignant disease, histologic subtypes, survival, and response to chemotherapy [20], [21], [22]. Since lymph node metastasis is likely driven by genetic mechanisms, we propose that miRNA expression techniques can be used to elucidate patterns of miRNA expression associated with lymphatic metastasis. The information identified in this study can be used in two ways. First, novel miRNA expression patterns associated with lymphatic metastasis can be incorporated into prospective translational-clinical trials to test their validity in the clinical setting, and hopefully improve upon the 22% predictive accuracy of clinical-pathologic parameters. Second, the individual miRNAs identified can inform on the specific genes responsible for lymphatic metastasis and direct future research into tailored therapeutics.

V. THE NATURE OF THE DATA

Fifty stage IA or IB (1988 FIGO staging) and 50 stage IIIC frozen endometrial cancer samples were obtained from the Gynecologic Oncology Group (GOG) tumor bank. The samples were collected from patients enrolled in GOG tissue acquisition protocol 210, which established a repository of clinical specimens with detailed clinical and epidemiologic data from patients with surgically staged endometrial carcinoma. The stage I and stage IIIC samples were matched for age, grade, presence of lymphovascular space invasion, and race when possible. All patients enrolled in GOG 210 had undergone comprehensive surgical staging consisting of total abdominal hysterectomy, bilateral salpingo-oophorectomy, and pelvic and para-aortic lymphadenectomy. Patients included in this study had no gross or pathologic evidence of extrauterine disease aside from lymph node metastasis, and could be considered clinical stage I tumors. All tumors underwent central pathologic review by the GOG. Out of these 100 tumors, six were rejected for unrelated reasons, leaving 47 tumors each of stage IA/IB and of stage IIIC.

MicroRNA expression analysis was performed on all 94 samples, using a commercial experimental apparatus that measured the average abundance of 1,428 miRNA molecules in all 94 tissue samples. The raw measured quantity underwent several quality control checks, and finally the binary logarithm of the measured quantity was taken as the output of the miRNA expression analysis. Because the measurements were taken using a commercial microarray chip, many of the miRNAs measured were neither relevant nor of interest to us. Out of the total of 1,428 × 94 measurements, about 42% were reported as 'NAN' (or 'Not a Number'). Therefore the first issue was how to treat all these NAN entries. In view of the manner in which the raw data was generated, it was evident that a reading of NAN resulted when the quantity of miRNA produced was not sufficient to be detectable by the measuring device. Therefore a decision was taken to replace all NAN entries by a zero entry; this is consistent with the physics of the problem (a small preprocessing sketch is given below). With this replacement, the data at hand consisted of samples of the form x_{ij}, i = 1, ..., n, j = 1, ..., m, where n = 1,428 is the number of miRNAs measured and m = 94 is the number of samples.
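As an illustration, the NAN-handling step described above amounts to only a few lines of Python; the file name and tab-separated layout below are hypothetical, since the paper does not specify the file format.

```python
import numpy as np
import pandas as pd

# Hypothetical layout: rows = 1,428 miRNAs, columns = 94 samples,
# entries = log2 expression values, with undetectable readings left as NaN.
expr = pd.read_csv("mirna_log2_expression.tsv", sep="\t", index_col=0)
raw = expr.to_numpy(dtype=float)

# Undetectable miRNA readings are reported as NaN; replace them with 0,
# consistent with 'no detectable abundance'.
X = np.nan_to_num(raw, nan=0.0)

print(X.shape)                                      # expected (1428, 94)
print("fraction of entries imputed:", np.isnan(raw).mean())
```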
The set of samples can be further subdivided into two classes: samples 1, ..., m1 belong to one class, whereas samples m1 + 1, ..., m1 + m2 belong to the second class. In the present instance, m1 = m2 = 47.

VI. RESULTS

As mentioned earlier, the first step of the lone star algorithm is to apply the Student t-test to choose those features that show a statistically significant difference between the means of the two classes. This resulted in a choice of 165 features from the original list of 1,428. The reduced data set, with 165 features and 94 samples (47 of each type), was then analyzed with the ℓ1-norm SVM-RFE. For comparison purposes, the same data set (with 165 features) was also analyzed using the ℓ2-norm SVM RFE. This section reports on the results.

We first present the outcome of the ℓ2-norm SVM RFE. Note that the algorithm presented in [2] is not truly randomized, in the following sense: only the initial division of each sample set into five roughly equal classes is random; after that, the procedure is deterministic. Thus, if one application of the RFE algorithm does not yield satisfactory results, the only option is to try again with a different initial division. After several runs of the RFE algorithm, we finally managed to find ten out of the original 165 features that had acceptable performance. Each run of the ℓ2-norm SVM RFE took approximately six hours on an Intel Xeon 2.8 GHz quad-core processor with 8 GB RAM. As stated above, we had to repeat the runs several times before obtaining a satisfactory classifier.

For the ℓ1-norm SVM-RFE, we began with the two sample classes, and at random chose half (23 samples) of each class as the training data and the other half (24 samples) as the testing data. For each random choice, we computed the associated optimal weight vector for the ℓ1-norm SVM. This randomization step was repeated 80 times. We found that the number of nonzero weights was consistently around 31, though naturally the locations, or indices, of these nonzero weights changed from one run to another. Thus we could reduce the number of features from 165 to 31 in a single iteration. Note that in the case of the ℓ2-norm SVM this type of reduction would have taken multiple iterations and much more computing time.

At this point we had one of two possible ways to proceed. First, we could simply average the weight vectors of all 80 runs, and choose the largest 31 components. Second, for each of the 165 features, we could compute the number of times (out of 80) that the particular feature had a nonzero weight, and then rank the features by this count. (A small code sketch comparing the two rankings appears below.) A happy outcome is that both approaches yield more or less the same set of features. Specifically, out of the top-ranked 31 features selected according to their average weight, the top 26 were also the top-ranked features according to the number of times that the feature had a nonzero weight; moreover, the ordering of these 26 features was the same in both lists. Only the last five indices differed. With this observation, we opted for the first approach. Therefore we used each randomized run of the ℓ1-norm SVM to determine the average number of nonzero features, call it k; then we ranked all the features according to their average weight across all randomized runs; and then chose the top k-ranked features for the next round. Subsequent runs of the above process resulted in the number of features decreasing in successive stages as follows:

165 ← 31 ← 21 ← 18 ← 15.
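The two ranking schemes just described can be compared with a few lines of Python. This sketch is purely illustrative (the weight matrix here is random, not the actual 80 × 165 weights from the study): it ranks features by the magnitude of the averaged weights and by the number of runs in which each feature received a nonzero weight, and reports the overlap between the two top-k lists.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for the real data: 80 randomized runs x 165 features,
# with each run assigning nonzero weights to roughly 31 features.
W = np.zeros((80, 165))
for run in range(80):
    active = rng.choice(165, size=31, replace=False)
    W[run, active] = rng.normal(size=31)

k = int(round(np.mean((W != 0).sum(axis=1))))              # average number of nonzero weights

by_avg_weight = np.argsort(-np.abs(W.mean(axis=0)))[:k]    # approach 1: largest averaged weights
by_nonzero_count = np.argsort(-(W != 0).sum(axis=0))[:k]   # approach 2: most frequently nonzero

overlap = len(set(by_avg_weight) & set(by_nonzero_count))
print(f"top-{k} overlap between the two rankings: {overlap} features")
```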
At this point there was no further reduction in the number of nonzero features, so the algorithm was terminated. It is interesting to note that nine out of the final fifteen features were the top nine-ranking features at the very first run. For the ℓ1-norm SVM RFE algorithm, the CPU time on the same processor as above was 3.5 minutes – roughly a hundredfold reduction compared to the ℓ2-norm SVM RFE. Just to reduce the gap, we increased the number of randomized partitionings of the data at each iteration from 80 to 1,000. This did not change the final results (reported below) but increased the CPU time to 34 minutes – still ten times faster than the ℓ2-norm SVM RFE.

Now we report the accuracy, sensitivity and specificity of the classifiers found via the ℓ1-norm and ℓ2-norm SVM RFE algorithms. As stated above, the results reported below for the ℓ2-norm SVM RFE are the best of many runs. This is because the recursive feature elimination is highly sensitive to the original (random) splitting of the data into five equal parts. In contrast, the results reported for the ℓ1-norm SVM RFE are the result of the one and only time we ran the algorithm. (To test the robustness of the algorithm, we ran it a second time and got very similar results.) In the last round, after we reduced the number of features from 18 to 15, we partitioned the samples at random 80 times into two sets, one for training and the other for testing (two-fold cross-validation), and computed the optimal ℓ1-norm SVM classifier. For each randomized run, we computed the accuracy, sensitivity and specificity on the testing data alone, and on the entire data set including both the testing and the training data. Then the weights of the 20 best-performing classifiers were averaged, and chosen as the final classifier. For each of these 20 best-performing classifiers, it makes sense to talk about the performance on the training set (usually close to 100%) and on the test set. However, since the final set of classifier weights is the average of these, there is no distinction between the training set and the testing set, and the only way to assess the final classifier is to run it on the entire data set. These figures are reported in the table below; for comparison purposes, the same quantities are reported for the ℓ2-norm SVM RFE as well.

Norm  Attribute   Accuracy  Sensitivity  Specificity
ℓ2    Testing     0.8178    0.7913       0.8443
ℓ2    Combined    0.9043    0.8936       0.9149
ℓ1    Testing     0.8854    0.9104       0.8604
ℓ1    Combined    0.9410    0.9532       0.9287

Next, we created a 'heat map' of the performance of the best-performing 20 classifiers, as well as the 'average' classifier, on all 94 test samples. The diagram below shows the result. Rows 1 to 20 are the best classifiers, with the bottom row having the best performance, while row 21 is that of the average classifier. It can be seen that one sample, namely UTSW#88-IIIC (though this label is difficult to see in the figure), is misclassified 11 times out of 20. This suggests that this sample might possibly be an outlier. Thus, in addition to providing a good method for finding a reduced set of features for two-class classification, the lone star algorithm also has the potential to detect outliers in a relatively simple manner.

VII. CONCLUDING REMARKS

In this paper we have introduced a new method for finding a classifier between two classes, in the case where the number of features is far in excess of the number of samples.
While the algorithm is completely general, we have illustrated its utility by applying it to the problem of determining a prognostic classifier for endometrial cancer, specifically, whether or not metastasis has occurred in the lymph nodes, thus necessitating their removal via surgery. The new algorithm consists of first using the Student t-test to choose only those features that demonstrate a significant difference between the two classes, and then using the ℓ1-norm support vector machine. The last step is to keep reducing the number of features (recursive feature elimination) until no further reduction is possible. For the endometrial cancer application studied here, we began with 1,428 features and finally wound up with just fifteen. The resulting classifier achieved accuracy, sensitivity and specificity of nearly 90% on two-fold cross-validated test data, and in excess of 93% on the entire data set. We have already used this approach in ovarian cancer to determine which patients are likely to respond to platinum therapy, and we plan to apply the approach to lung cancer to determine the responsiveness of cell lines to chemotherapy.

Note that a somewhat similar approach to ensuring that many components of the weight vector are zero is also the objective of the so-called LASSO (Least Absolute Shrinkage and Selection Operator) method proposed in [3]. The LASSO method minimizes the ℓ2-norm of the regression residual subject to an ℓ1-norm constraint on the model parameters. We believe that LASSO is more suited to problems where the labels are not binary, as in the present case, but real numbers in the interval [−1, 1] (or some compact set). Therefore LASSO is better suited to the problem of predicting the lethality of chemotherapy, which is never binary, than to deciding whether surgery is necessary or not, which is a binary decision. We plan to study the applicability of LASSO, in addition to the ℓ1-norm SVM, to the problem of predicting the responsiveness of lung cancer cell lines to chemotherapy. The findings should be interesting.

REFERENCES

[1] P. S. Bradley and O. L. Mangasarian, "Feature selection via concave minimization and support vector machines," in Machine Learning: Proceedings of the Fifteenth International Conference (ICML '98), Morgan Kaufmann, San Francisco, 1998, pp. 82–90.
[2] I. Guyon, J. Weston, and S. Barnhill, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389–422, 2002.
[3] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, vol. 58(1), 1996.
[4] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. Royal Stat. Soc. B, vol. 67, pp. 301–320, 2005.
[5] C. Cortes and V. N. Vapnik, "Support vector networks," Machine Learning, vol. 20, 1997.
[6] R. S. Wenocur and R. M. Dudley, "Some special Vapnik-Chervonenkis classes," Discrete Mathematics, vol. 33, pp. 313–318, 1981.
[7] E. D. Sontag, "Feedforward nets for interpolation and classification," J. Comp. Sys. Sci., vol. 45(1), pp. 20–48, 1992.
[8] K.-U. Höffgen, H.-U. Simon, and K. S. V. Horn, "Robust trainability of single neurons," Journal of Computer and System Science, vol. 50(1), pp. 114–125, 1995.
[9] K. Veropoulos, C. Campbell, and N. Cristianini, "Controlling the sensitivity of support vector machines," in IJCAI Workshop on Support Vector Machines, 1999.
[10] C. P. Morrow, B. N. Bundy, R. N. Kurman, W. T. Creasman, P. Heller, H. D. Homesley, et al., "Relationship between surgical-pathological risk factors and outcome in clinical stage I and II carcinoma of the endometrium: a Gynecologic Oncology Group study," Gynecological Oncology, vol. 40(1), pp. 55–65, 1991.
[11] R. C. Boronow, C. P. Morrow, W. T. Creasman, P. J. Disaia, S. G. Silverberg, A. Miller, et al., "Surgical staging in endometrial cancer: clinical-pathologic findings of a prospective study," Obstet Gynecol., vol. 63(6), pp. 825–832, Jun. 1984.
[12] W. T. Creasman, C. P. Morrow, B. N. Bundy, H. D. Homesley, J. E. Graham, and P. B. Heller, "Surgical pathologic spread patterns of endometrial cancer: a Gynecologic Oncology Group study," Cancer, vol. 60(8 Suppl), pp. 2035–2041, Oct. 1987.
[13] S. N. Lewin, T. J. Herzog, N. I. B. Medel, I. Deutsch, W. M. Burke, X. Sun, et al., "Comparative performance of the 2009 International Federation of Gynecology and Obstetrics' staging system for uterine corpus cancer," Obstet Gynecol., vol. 116(5), pp. 1141–1149, Nov. 2010.
[14] H. Kitchener, A. M. Swart, Q. Qian, C. Amos, and M. K. Parmar, "Efficacy of systematic pelvic lymphadenectomy in endometrial cancer (MRC ASTEC trial): a randomised study," Lancet, vol. 373(9658), pp. 125–136, Jan. 2009.
[15] A. Mariani, S. C. Dowdy, W. A. Cliby, B. S. Gostout, M. B. Jones, T. O. Wilson, et al., "Prospective assessment of lymphatic dissemination in endometrial cancer: a paradigm shift in surgical staging," Gynecol Oncol., vol. 109(1), pp. 11–18, Apr. 2008.
[16] A. Mariani, M. J. Webb, G. L. Keeney, M. G. Haddock, G. Calori, and K. C. Podratz, "Low-risk corpus cancer: is lymphadenectomy or radiotherapy necessary?" Am J Obstet Gynecol., vol. 182(6), pp. 1506–1519, Jun. 2000.
[17] D. P. Bartel, "MicroRNAs: genomics, biogenesis, mechanism, and function," Cell, vol. 116(2), pp. 281–297, Jan. 2004.
[18] W. Filipowicz, L. Jaskiewicz, F. A. Kolb, and R. S. Pillai, "Post-transcriptional gene silencing by siRNAs and miRNAs," Curr Opin Struct Biol., vol. 15(3), pp. 331–341, Jun. 2005.
[19] E. J. Sontheimer and R. W. Carthew, "Silence from within: endogenous siRNAs and miRNAs," Cell, vol. 122(1), pp. 9–12, Jul. 2005.
[20] T. Boren, Y. Xiong, A. Hakam, R. Wenham, S. Apte, Z. Wei, et al., "MicroRNAs and their target messenger RNAs associated with endometrial carcinogenesis," Gynecol Oncol., vol. 110(2), pp. 206–215, Aug. 2008.
[21] H. K. Dressman, A. Berchuck, G. Chan, J. Zhai, A. Bild, R. Sayer, et al., "An integrated genomic-based approach to individualized treatment of patients with advanced-stage ovarian cancer," J Clin Oncol., vol. 25(5), pp. 517–525, Feb. 2007.
[22] M. V. Iorio, R. Visone, G. D. Leva, V. Donati, F. Petrocca, P. Casalini, et al., "MicroRNA signatures in human ovarian cancer," Cancer Res., vol. 67(18), pp. 8699–8707, Sep. 2007.
[23] N. Cristianini and J. Shawe-Taylor, Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.