A New Feature Selection Algorithm for
Two-Class Classification Problems and
Application to Endometrial Cancer
M. Eren Ahsen¹, Nitin K. Singh¹, Todd Boren², M. Vidyasagar¹ and Michael A. White²

¹Department of Bioengineering, University of Texas at Dallas, 800 W. Campbell Road, Richardson, TX 75080. ²Department of Cell Biology, UT Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, TX 75390. MEA, NKS and MV are supported by National Science Foundation Award #1001643, the Cecil & Ida Green Endowment, and by a Developmental Award from the Harold Simmons Comprehensive Cancer Center, UT Southwestern Medical Center. The work of TB and MAW is supported by the Welch Foundation Grant #I-1414 and the National Cancer Institute Grant #CA71443.

Abstract: In this paper, we introduce a new algorithm for feature selection for two-class classification problems, called $\ell_1$-StaR. The algorithm consists of first extracting the statistically relevant features using the Student t-test, and then passing the reduced feature set to an $\ell_1$-norm support vector machine (SVM) with recursive feature elimination (RFE). The final number of features chosen by the $\ell_1$-StaR algorithm can be smaller than the number of samples, unlike with $\ell_1$-norm regression, where the final number of features is bounded below by the number of samples. The algorithm is illustrated by applying it to the problem of determining which endometrial cancer patients are at risk of having the cancer spread to their lymph nodes. The data consisted of 1,428 micro-RNAs measured on a data set of 94 patient samples (divided evenly between those with lymph node metastasis and those without). Using the algorithm, we identified a subset of just 15 micro-RNAs, and a linear classifier based on these, that achieved two-fold cross-validation accuracies in excess of 80%, and combined accuracy, sensitivity and specificity in excess of 93%.
I. INTRODUCTION
Biological data sets are characterized by having far more
features than samples, making it a challenge to identify the
most relevant features in a given problem. In this paper, we introduce a new algorithm for feature selection in two-class classification problems, called $\ell_1$-StaR. It is a 'second-order' acronym, and stands for "$\ell_1$ SVM, t-test and RFE", where SVM (Support Vector Machine) and RFE (Recursive Feature Elimination) are themselves acronyms. The name can be pronounced either as 'ell-one star' or as 'lone star'. Out of deference to the domicile of the authors, perhaps the second pronunciation is to be preferred. The first step in the algorithm is to determine the features that show a statistically significant difference between the means of the two classes using the Student t-test, and to discard the rest. The reduced data set is then passed on to an $\ell_1$-norm SVM, which forces many features to be assigned a weight of zero. The features with zero weight are discarded and the algorithm is run again (RFE), until no further reduction is possible. While both the $\ell_1$-norm SVM (see e.g. [1]) and an $\ell_2$-norm SVM with RFE (see e.g. [2]) are standard algorithms, it appears that our approach of applying RFE to the $\ell_1$-norm SVM is new. The algorithm produces a final feature set that can be smaller than the number of samples, in contrast with $\ell_1$-norm regression [3], where the final number of features is bounded below by the number of samples [4].
The algorithm is applied to a problem in endometrial
cancer. The endometrium is the lining of the uterus, and a
patient with endometrial cancer will have her uterus, ovaries
and fallopian tubes removed. However, if the cancer has
spread beyond these to the lymph nodes, then the patient
would run a serious risk to her life. Consequently, the GOG
(Gynecologic Oncology Group) recommends that any patient with a tumor larger than 2 cm must also have her pelvic
lymph nodes removed. However, post-surgery analysis of the
removed lymph nodes at UT Southwestern Medical Center
reveals that an astounding 78% of the lymph node resections
(removals) were unnecessary! The objective of our study is to
predict which patients are at risk of lymph node metastasis. The data consists of measurements of 1,428 micro-RNAs on
94 patients, divided evenly between those with lymph node
metastasis and those without. Using the lone-star algorithm,
a final set of 15 micro-RNAs is identified, together with a
linear classifier that is able to achieve accuracy, sensitivity
and specificity in excess of 93%.
II. PROBLEM FORMULATION AND LITERATURE SURVEY
A. Support Vector Machines
Suppose we are given a set of doubly indexed real numbers
{xij , i = 1, . . . , n, j = 1, . . . , m}, where n denotes the
number of features and m denotes the number of samples.
It is generally the case in biology problems that $n \gg m$.
The first m1 samples belong to Class 1, while the remaining
m2 = m − m1 samples belong to Class 2. Let us introduce
the symbols N = {1, . . . , n}, M1 = {1, . . . , m1 }, M2 =
{m1 + 1, . . . , m1 + m2 }, M = {1, . . . , m} = M1 ∪ M2 .
Further, let us define xj := (xij , i = 1, . . . , n) ∈ Rn to
be the set of feature values associated with the j-th sample.
Then the data is said to be linearly separable if there exists
a weight vector w ∈ Rn and a threshold θ ∈ R such that
$$w^t x_j \ge \theta + 1 \;\;\forall j \in M_1, \qquad w^t x_j \le \theta - 1 \;\;\forall j \in M_2. \quad (1)$$
In other words, if we define a hyperplane H in Rn by the
equation
$$H = \{ z \in \mathbb{R}^n : w^t z = \theta \},$$
then H separates the two classes {xj , j ∈ M1 } and {xj , j ∈
M2 }. Note that the constant 1 is introduced purely to ensure
that there is a ‘zone of separation’ between the two classes.
It is easy to see that, if there exists one linear classifier for
a given data set, then there exist infinitely many. The question
therefore is: Which amongst these is ‘optimal’ in some
sense? One of the most successful and widely used linear
classifiers is the Support Vector Machine (SVM) introduced
in [5]. In that paper, the authors choose a classifier that
maximizes the minimum $\ell_2$-distance of any vector $x_j$ to the separating hyperplane. Mathematically this is equivalent to minimizing $\|w\|_2^2$ over the set of all weight vectors $w$ and thresholds $\theta$ that satisfy (1). This is a quadratic programming
problem, in that the objective function to be minimized is
quadratic and the constraints are linear. Once w and θ are
determined in this fashion, a new test input $x$ is assigned to class 1 if $w^t x - \theta > 0$ and to class 2 if $w^t x - \theta < 0$.
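As an illustration of the classification rule just described, the following minimal sketch (not from the original paper) fits a linear SVM with scikit-learn on hypothetical data and recovers a weight vector and threshold in the notation above; a very large value of the regularization parameter C approximates the hard-margin classifier.

```python
# Minimal sketch (illustrative only): fit a linear SVM and recover w and theta.
# X is an (m x n) matrix of samples; labels are +1 for class 1 and -1 for class 2.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 1.0, size=(20, 5)),   # hypothetical class-1 samples
               rng.normal(-1.0, 1.0, size=(20, 5))])  # hypothetical class-2 samples
labels = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1e6)   # very large C approximates the hard-margin SVM
clf.fit(X, labels)

w = clf.coef_.ravel()               # weight vector w
theta = -clf.intercept_[0]          # scikit-learn uses w.x + b, so theta = -b
x_new = rng.normal(0.0, 1.0, size=5)
predicted_class = 1 if w @ x_new - theta > 0 else 2
```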
The SVM is attractive for two reasons. First, since it is a
quadratic programming problem, it can be applied to truly
enormous data sets (m is quite large). Second, the optimal
classifier is ‘supported’ on a relatively small number of
samples. Thus adding more samples to the data set often does
not change the classifier. This property is extremely useful in
situations where $m \gg n$, that is, there are far more samples
than features. However, in biological problems where the
situation is the inverse, the second property is not so useful.
Now we discuss the existence of linear classifiers. With
advances in Vapnik-Chervonenkis theory, the situation is by
now quite clear. Suppose as before that we are given m
samples of $n$ features. Then there are $2^m$ different ways of assigning the $m$ samples to two classes. It is known [6], [7] that if $n \ge m + 1$, then for each of the $2^m$ possible assignments of the $n$-dimensional vectors to two classes, generically it is possible to find a linear classifier. (The precise statement is that, unless all $m$ of the $n$-dimensional vectors belong to a lower-dimensional hyperplane, a linear classifier exists for each of the $2^m$ different assignments of the $m$ vectors to two classes.) Thus, if in
a given problem n < m + 1, then we increase the dimension
of the feature vector by, for example, taking higher powers
of the components of the feature vector, until the (enlarged)
number of features exceeds m + 1. The resulting classifier
is referred to as a higher-order SVM, or a polynomial
SVM. The advantage of a higher-order SVM is that linear
separability is always guaranteed, whereas the disadvantage
is that it is no longer a linear classifier in the original feature
space.
The standard SVM formulation addresses the situation
where the data is linearly separable. Given a data set, it
is possible to determine in polynomial time whether it is
linearly separable, since that is equivalent to testing the
feasibility of a linear programming problem. In case the
data is not linearly separable, there are several competing
approaches, some of which are described below, including
the one used in this paper. We have already discussed
the possibility of augmenting the dimension of the feature
space by using higher powers. Another approach, known as
the 'soft margin' classifier, is to replace (1) by
$$w^t x_j \ge \theta - \epsilon \;\;\forall j \in M_1, \qquad w^t x_j \le \theta + \epsilon \;\;\forall j \in M_2, \quad (2)$$
where $\epsilon > 0$, while simultaneously minimizing $\|w\|_2$ and $\epsilon$.
Note that, in contrast to (1), if the soft margin classifier is used, then any point $x$ such that $|w^t x - \theta| \le \epsilon$ cannot be unambiguously assigned to either class. This is because the two half-spaces defined by
$$H_1 := \{ x \in \mathbb{R}^n : w^t x \ge \theta - \epsilon \}, \qquad H_2 := \{ x \in \mathbb{R}^n : w^t x \le \theta + \epsilon \}$$
are not disjoint but actually overlap. Thus a soft margin classifier is in reality a three-class classifier, in that it assigns a test input $x$ to class 1 if $w^t x - \theta > \epsilon$, to class 2 if $w^t x - \theta < -\epsilon$, and to 'don't know' if $w^t x - \theta \in [-\epsilon, \epsilon]$. Soft margin classification works well in cases where it is possible to achieve the 'overlapping linear separation' with relatively small values of $\epsilon$ compared to $\theta$. However, depending on the nature of the data set, it is possible that the entire training data set falls into the 'don't know' category.
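The linear-programming feasibility test for separability mentioned earlier in this section can be sketched as follows. This is an illustrative reconstruction using scipy rather than code from the paper: it simply checks whether the constraints (1) admit a solution, using a zero objective.

```python
# Sketch: test linear separability by checking feasibility of the constraints (1)
# with a linear program. X1 and X2 hold the two classes as rows (hypothetical arrays).
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X1, X2):
    n = X1.shape[1]
    # Decision variables: [w (n entries), theta]. Constraints written as A_ub @ v <= b_ub:
    #   class 1:  -(w . x_j) + theta <= -1
    #   class 2:    (w . x_j) - theta <= -1
    A_ub = np.vstack([np.hstack([-X1, np.ones((X1.shape[0], 1))]),
                      np.hstack([ X2, -np.ones((X2.shape[0], 1))])])
    b_ub = -np.ones(X1.shape[0] + X2.shape[0])
    res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (n + 1), method="highs")
    return res.status == 0   # status 0: feasible solution found; status 2: infeasible
```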
Another possibility is to take the given non-separable data
set, and find a linear classifier of the form (3) that misclassifies the fewest number of data vectors. Mathematically
this is equivalent to the ‘minimum flipping problem’, that is,
determining the smallest number of data vectors that need
to have their labels flipped, in order to make the resulting
set linearly separable. Unfortunately, it is known that this
problem is NP-hard [8].
Another approach, which is the one adopted here, is to
find a tractable version of the ‘minimum flipping problem’
described above. Choose some parameter λ ∈ (0, 1), and let
e denote the column vector of all ones, with the subscript
denoting its dimension. Then the problem is:
$$\min_{w,\theta,y,z} \; (1-\lambda)\,(y^t e_{m_1} + z^t e_{m_2}) + \lambda \|w\|_2^2, \quad (3)$$
subject to the following constraints:
$$w^t x_j \ge \theta + 1 - y_j \;\;\forall j \in M_1, \qquad w^t x_j \le \theta - 1 + z_j \;\;\forall j \in M_2,$$
$$y_j \ge 0 \;\;\forall j \in M_1, \qquad z_j \ge 0 \;\;\forall j \in M_2.$$
The variables $y_j, z_j$ play the role of slack variables, so that the constraints above can always be satisfied once the slacks are allowed to be positive.
If the data is linearly separable, then the constraints can be
satisfied with yj = 0 for all j ∈ M1 , zj = 0 for all j ∈ M2 ,
and any choice of w and θ that achieves linear separation.
In this case the cost function becomes $\lambda \|w\|_2^2$. Let $w_0, \theta_0$ denote the solution to the standard SVM formulation, that is,
$$\|w_0\|_2^2 = \min_{w,\theta} \|w\|_2^2 \;\;\text{subject to}\;\; w^t x_j \ge \theta + 1 \;\forall j \in M_1, \quad w^t x_j \le \theta - 1 \;\forall j \in M_2.$$
Then it is obvious that the minimum value for the problem in (3) cannot be larger than $\lambda \|w_0\|_2^2$. Moreover, if $\lambda$ is sufficiently small, the benefit of reducing $\|w\|_2^2$ below $\|w_0\|_2^2$ will be offset by the penalty due to having nonzero slack variables $y_j, z_j$. In other words, for linearly separable data sets, there exists a $\lambda_0$ such that for all $\lambda \in (0, \lambda_0)$ the solution to the problem in (3) is $w = w_0$, $\theta = \theta_0$, $y = 0_{m_1}$, $z = 0_{m_2}$.
On the other hand, if the data is not linearly separable, the
problem is one of trading off the norm of the weight vector
w as measured by the second term in the objective function
with the extent of the misclassification as measured by the
first term. For this reason, we should choose λ to be quite
close to zero but not exactly equal to zero. Note that, unlike
the minimum flipping problem, this problem is a quadratic
programming problem, and can thus be solved in polynomial
time. However, there is no guarantee that the classifier found
in this fashion is optimal in the sense of the number of
misclassified points.
B. $\ell_1$-Norm SVM

Now we discuss the $\ell_1$-norm SVM, following the notation and content of Section 3 of [1]. Let $\|\cdot\|$ denote any norm on $\mathbb{R}^n$, and recall that its dual norm is defined by
$$\|w\|_d = \max_{\|x\| \le 1} |w^t x|.$$
For an arbitrary norm $\|\cdot\|$ on $\mathbb{R}^n$, the corresponding support vector machine can be found by solving the following problem (see Equation (11) of [1], and note that in [1] the data vectors are taken as row vectors, whereas we denote them as column vectors). Choose some parameter $\lambda \in (0, 1)$. Then the problem is
$$\min_{w,\theta,y,z} \; (1-\lambda)\,(y^t e_{m_1} + z^t e_{m_2}) + \lambda \|w\|_d, \quad (4)$$
subject to the following constraints:
$$w^t x_j \ge \theta + 1 - y_j \;\;\forall j \in M_1, \qquad w^t x_j \le \theta - 1 + z_j \;\;\forall j \in M_2,$$
$$y_j \ge 0 \;\;\forall j \in M_1, \qquad z_j \ge 0 \;\;\forall j \in M_2.$$
Since the $\ell_2$-norm is its own dual, in the special case where the distance to the separating hyperplane is measured using the $\ell_2$-norm, (4) reduces to (3) (after replacing the term $\lambda\|w\|_d$ by $\lambda\|w\|_d^2$). In general, the optimal weight vector $w$ for the minimization problem (3) will contain all nonzero components, as that is a quadratic programming problem. To ensure that the optimal weight vector $w$ has a large number of zero entries, it is suggested in [1] that the distance to the separating hyperplane should be measured using the $\ell_1$-norm. If we use the $\ell_1$-norm to measure distances in $\mathbb{R}^n$, then the dual norm is the $\ell_\infty$-norm, and the problem (4) is a linear programming formulation. Therefore the number of nonzero components of the optimal weight vector is bounded by $m$, the number of samples.
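To make the formulation concrete, here is a minimal sketch of an $\ell_1$-penalized soft-margin SVM in the spirit of [1], written with cvxpy. The matrices X1 and X2 and the value of lam are hypothetical, and this is an illustration rather than the authors' implementation; penalizing $\|w\|_1$ is what drives many components of $w$ to zero.

```python
# Sketch of an l1-penalized soft-margin linear SVM (in the spirit of [1]).
# X1 (m1 x n) holds class-1 samples as rows, X2 (m2 x n) holds class-2 samples.
import numpy as np
import cvxpy as cp

def l1_svm(X1, X2, lam=0.05):
    n = X1.shape[1]
    w = cp.Variable(n)
    theta = cp.Variable()
    y = cp.Variable(X1.shape[0], nonneg=True)   # slacks for class 1
    z = cp.Variable(X2.shape[0], nonneg=True)   # slacks for class 2
    constraints = [X1 @ w >= theta + 1 - y,
                   X2 @ w <= theta - 1 + z]
    objective = cp.Minimize((1 - lam) * (cp.sum(y) + cp.sum(z))
                            + lam * cp.norm(w, 1))   # l1 penalty promotes zero weights
    cp.Problem(objective, constraints).solve()
    return w.value, theta.value
```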
C. Trading Off Sensitivity and Specificity
Next we discuss a simple modification of the objective
function, first suggested by Veropoulos et al. [9] for ‘automatically’ trading off sensitivity versus specificity. Recall
the definitions of these concepts for a two-class classifier.
For such a classifier, there are four possible outcomes, namely TP, TN, FP, FN, representing true positive, true negative, false positive and false negative respectively. The sensitivity is defined as the ratio TP/(TP + FN), while the specificity is defined as TN/(TN + FP). In other words, the sensitivity is the fraction of samples that actually belong to the 'true' class which are classified as true, while the specificity is the fraction of samples that actually belong to the 'false' class which are classified as false. It is well-known that no classifier can achieve
both 100% sensitivity as well as 100% specificity (except
in contrived examples). The curve that plots the maximum
sensitivity achievable by any classifier as a function of the
specificity (or vice versa) is referred to for historical reasons
as the ROC (Receiver Operating Characteristic) curve. In
a given problem, the ROC curve is not known in general.
However, it is possible to plot the maximum sensitivity
achievable by the class of classifiers under study, and use
that as an approximation of the ROC curve. This is what we
do here.
To trade-off sensitivity versus specificity, we just make the
following substitution in the cost function:
$$y^t e_{m_1} + z^t e_{m_2} \;\leftarrow\; \alpha\, y^t e_{m_1} + (1 - \alpha)\, z^t e_{m_2},$$
where α ∈ (0, 1). Clearly, if α < 0.5, then larger values of
y are tolerated, and the classifier will have higher specificity
than sensitivity; if α > 0.5 then the classifier will have higher
sensitivity than specificity. If α = 0.5 then this reduces to the earlier problem formulation, since the constant factor of 0.5 can be absorbed into the parameter λ. With this modification,
the final optimization problem now becomes the following:
$$\min_{w,\theta,y,z} \; (1-\lambda)\left(\alpha\, y^t e_{m_1} + (1-\alpha)\, z^t e_{m_2}\right) + \lambda \|w\|_d, \quad (5)$$
subject to the following constraints:
$$w^t x_j \ge \theta + 1 - y_j \;\;\forall j \in M_1, \qquad w^t x_j \le \theta - 1 + z_j \;\;\forall j \in M_2,$$
$$y_j \ge 0 \;\;\forall j \in M_1, \qquad z_j \ge 0 \;\;\forall j \in M_2.$$
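To make the definitions of sensitivity and specificity given earlier in this subsection concrete, the following small helper (illustrative only, not from the paper) computes both quantities from true labels and predictions, with class 1 treated as the 'true' class.

```python
# Illustrative helper: sensitivity and specificity from labels in {1, 2},
# with class 1 as the 'true' (positive) class.
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 2))
    tn = np.sum((y_true == 2) & (y_pred == 2))
    fp = np.sum((y_true == 2) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

# Example: sensitivity_specificity([1, 1, 2, 2], [1, 2, 2, 2]) returns (0.5, 1.0)
```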
D. Recursive Feature Elimination in $\ell_2$-Norm SVMs
Until now we have been discussing various ways of
generating linear classifiers that use all the components of
the vectors xj , that is, all the features. Now we discuss
the $\ell_2$-norm SVM RFE as defined in [2]. In that paper,
the authors begin with all n features, and divide the two
classes M1 and M2 into five roughly equal subsets each.
Then one subset is left out from each class, leaving roughly
80% of the original samples from within each class. For
each such choice, a traditional $\ell_2$-norm SVM is computed,
and the associated weight vector is determined. By cycling
through all five choices for the subset to be omitted, the
authors generate a total of five optimal weight vectors. These
are then averaged to come up with a five-fold cross-validated
weight vector. From this weight vector, the least significant
(in terms of magnitude) component of the weight vector is
dropped, resulting in a reduction of one in the number of
features used, from n to n − 1. Then the exercise is repeated.
If several components of the averaged (or cross-validated)
weight vector are small in magnitude, then more than one
component can be dropped at each step. However, since
the $\ell_2$-norm SVM formulation is a quadratic programming
problem, in general there is no reason why any components
of the weight vector should be small, let alone equal to zero.
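The procedure just described can be summarized in the following schematic sketch, with scikit-learn's linear SVM standing in for the $\ell_2$-norm SVM of [2]; details such as the value of C and dropping one feature per round are assumptions made for illustration.

```python
# Schematic sketch of l2-norm SVM RFE as described above: five leave-one-subset-out
# fits per round, weight vectors averaged, least significant feature dropped.
# X is (m x n), y holds labels +1/-1; C=1.0 is an arbitrary illustrative choice.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

def l2_svm_rfe(X, y, n_final):
    active = list(range(X.shape[1]))          # indices of surviving features
    while len(active) > n_final:
        weights = []
        for train_idx, _ in StratifiedKFold(n_splits=5, shuffle=True).split(X, y):
            clf = LinearSVC(C=1.0).fit(X[np.ix_(train_idx, active)], y[train_idx])
            weights.append(clf.coef_.ravel())
        avg_w = np.abs(np.mean(weights, axis=0))   # five-fold averaged weight vector
        active.pop(int(np.argmin(avg_w)))          # drop least significant feature
    return active
```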
III. THE LONE STAR ALGORITHM

In this section we present our new algorithm, which consists of two steps. First we reduce the number of features using the 'Student' t-test, and then we apply recursive feature elimination, but to the $\ell_1$-norm SVM, not the $\ell_2$-norm SVM as in [2].
A. Pre-Processing the Feature Set
For each index $i$, we compute the average of each of the two classes for the $i$-th feature. Thus for all $i = 1, \ldots, n$, we compute the class means
$$\mu_{i,l} = \frac{1}{m_l} \sum_{j \in M_l} x_{ij}, \qquad l = 1, 2.$$
Then we use the standard ‘Student’ t-test to determine
whether or not there is a statistically significant difference
between the two means. The significance level can be anything we wish, but in biology it is common to accept the
difference as being significant if the likelihood of it occurring
by chance is less than 0.05, that is, the null hypothesis that
the two means are equal can be rejected at a 95% level
of confidence. This results in a reduction in the number of
features. For convenience, we continue to use the symbol n
to denote the reduced number of features.
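A minimal sketch of this pre-processing step is given below, using scipy's two-sample Student t-test with a 0.05 significance level; the array names X1 and X2 are illustrative assumptions.

```python
# Sketch of the pre-processing step: keep only features whose class means differ
# significantly according to a two-sample Student t-test (p < 0.05).
# X1 is (m1 x n) for class 1 and X2 is (m2 x n) for class 2.
import numpy as np
from scipy.stats import ttest_ind

def t_test_filter(X1, X2, alpha=0.05):
    _, p_values = ttest_ind(X1, X2, axis=0)   # one test per feature (column)
    keep = np.where(p_values < alpha)[0]
    return keep                               # indices of the retained features
```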
B. $\ell_1$-Norm SVM with Recursive Feature Elimination

The next step in the algorithm is to combine the $\ell_1$-norm SVM with recursive feature elimination. Specifically,
• Choose at random a 'training set' of samples of size $k_1$ from $M_1$ and size $k_2$ from $M_2$, such that $k_l \le m_l/2$, and $k_1, k_2$ are roughly equal. In the endometrial cancer application below, $m_1 = m_2$, so that the size of the training set is half of the total samples within each class. Then compute an optimal $\ell_1$-norm SVM for the chosen training set (a code sketch of the overall procedure follows this list).
• Repeat the above exercise several times, with different
randomized choices of training and testing data. (We
repeated this step with 80 randomized choices and with
1,000 choices, and there was hardly any difference in
the outcomes.) This is unlike in [2], where there is only
one randomized division.
• For each randomized assignment of the data to the two
classes, the number of nonzero entries in the optimal
weight vector is more or less the same, whereas the
location of nonzero entries in the optimal weight vector
varies from one run to another.
• Let k denote the average number of nonzero entries in
the optimal weight vector across all randomized runs.
Average all the optimal weight vectors, and choose the
largest k entries and corresponding feature set. This
results in reducing the number of features from the
original n to k in one shot. (See Section VI for an
alternate method for choosing the features, and the
observation that both methods lead to essentially the
same results.)
• Repeat the process with the reduced feature set, and
with several sets of k1 , k2 randomly selected training
samples, until no further reduction is possible in the
number of features. This determines the final set of
features to be used.
• Once the final feature set is determined, carry out
two-fold cross validation by dividing the data into a
training set of k1 , k2 randomly selected samples and
assessing the performance of the resulting $\ell_1$-norm
classifier on the testing data set, which is the remainder
of the samples. Average the weights generated by 20
(or whatever number) best-performing classifiers, and
call that the final classifier. At this point there is no
distinction between the training and testing data sets,
so run the final classifier on the entire data set to arrive
at the final accuracy, sensitivity and specificity figures.
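The sketch below summarizes the randomized reduction stage described in the list above. It is an illustration under stated assumptions rather than the authors' code: scikit-learn's $\ell_1$-penalized LinearSVC is used as a stand-in for the $\ell_1$-norm SVM of Section II-B, and the choices C=1.0 and n_rounds=80 are illustrative.

```python
# Schematic sketch of the randomized l1-SVM feature-reduction stage.
# X is (m x n), y holds class labels (e.g. 1 and 2); assumptions noted in the lead-in.
import numpy as np
from sklearn.svm import LinearSVC

def lone_star_stage2(X, y, n_rounds=80, rng=None):
    rng = rng or np.random.default_rng(0)
    active = np.arange(X.shape[1])
    while True:
        weights, nnz_counts = [], []
        for _ in range(n_rounds):
            # Random half/half split within each class for training.
            train = np.hstack([rng.permutation(np.where(y == c)[0])[:np.sum(y == c) // 2]
                               for c in np.unique(y)])
            clf = LinearSVC(penalty="l1", dual=False, C=1.0)   # l1-penalized linear SVM
            clf.fit(X[np.ix_(train, active)], y[train])
            w = clf.coef_.ravel()
            weights.append(np.abs(w))
            nnz_counts.append(np.count_nonzero(w))
        k = max(1, int(round(np.mean(nnz_counts))))   # average number of nonzero weights
        if k >= len(active):                          # no further reduction possible
            return active
        ranked = np.argsort(np.mean(weights, axis=0))[::-1]
        active = active[ranked[:k]]                   # keep the k largest-weight features
```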
The advantage of the above approach vis-à-vis the $\ell_2$-norm SVM-RFE is that the number of features reduces significantly at each step, and the algorithm converges in just a few steps. This is because, with the $\ell_1$-norm, many components of the weight vector are 'naturally' zero, and need not be truncated. In contrast, in general all the components of the weight vector resulting from the $\ell_2$-norm SVM will be nonzero; as a result the features can only be eliminated one at a time, and in general the number of iterations is equal to (or comparable to) $n$, the initial number of features.
IV. CASE STUDY: ENDOMETRIAL CANCER
The endometrium is the lining of the uterus. Endometrial
cancer is the most common gynecological malignancy, afflicting up to 48,000 women annually. Due to early detection,
endometrial cancer results in ‘only’ about 8,000 deaths annually. The presence of pelvic and/or para-aortic lymph node
metastasis decreases the 5-year survival rate from 85% to
58% and 41% respectively [10]. Based upon surgical staging
studies in the 1970s and 1980s that suggested frequent errors
in the clinical staging of endometrial cancers [11], [12],
the International Federation of Obstetrics and Gynecology
adopted a surgical staging system in 1988 which has been
recently updated in 2009 [13]. Currently, primary staging
surgery for endometrial cancer consists of removal of the
uterus, ovaries, fallopian tubes, and pelvic and para-aortic
lymph node dissection, as well as omentectomy when indicated. The incidence of pelvic and para-aortic node metastasis in patients with stage I endometrial cancer is 4-22% and varies with grade, depth of invasion, lymphovascular space invasion, and histologic subtype [12]. Therefore, 78% to 96% of patients with endometrial cancer will not benefit from
a lymphatic dissection. Morbidities associated with pelvic
and para-aortic lymph node dissection include increased
operative times, increased blood loss, ileus, increased number
of thromboembolic events, lymphocyst formation, and major
wound dehiscence, all of which adversely affect the patient's
health and quality of life [14].
Efforts have been made to identify patients pre-operatively
or intra-operatively who are at greatest risk for lymph node
metastasis and would benefit from a formal lymph node
assessment. In the two largest studies to date, patients with
tumors grade I/II, tumors with < 50% uterine invasion, and
tumors < 2cm in size did not have any evidence of lymph
node metastasis and could be spared a lymph node dissection
[15], [16]. In patients who do not meet these criteria, lymph
node resection is the accepted practice. However, post-surgery analysis shows that, even in patients not meeting
the aforementioned criteria, lymph node metastasis was
identified in only 22% of patients. Therefore, 78% of patients
underwent a morbid procedure that they ultimately did not
need despite the use of the most up to date recommended
practice.
The incidence of pelvic and para-aortic lymph node
metastasis in endometrial cancer is likely related to the
biologic aggressiveness of the tumor reflected in the genetic
determinants of cellular mechanisms that control metastasis.
MicroRNAs (miRNAs) are 19 to 25-nucleotide, non-coding,
RNA transcripts, thought to be instrumental in controlling eukaryotic cell function via modulation of post-transcriptional
activity of multiple target messenger RNA (mRNA) genes by
repression of translation or regulation of mRNA degradation
[17], [18], [19]. As such, miRNAs may impact critical
control mechanisms in tumor cells that affect metastatic
potential. MicroRNA expression analysis can identify differentially expressed miRNAs in patient populations with
different clinical characteristics and has been utilized in
endometrial and ovarian cancers.
MicroRNA expression patterns have been identified that
can predict benign vs. malignant disease, histologic subtypes,
survival, and response to chemotherapy [20], [21], [22].
Since lymph node metastasis is likely driven by genetic
mechanisms, we propose that miRNA expression techniques
can be used to elucidate patterns of miRNA expression
associated with lymphatic metastasis. The information identified in this study can be used twofold. First, novel miRNA
expression patterns associated with lymphatic metastasis can
be incorporated into prospective translational-clinical trials to
test their validity in the clinical setting and hopefully improve
upon the 22% predictive accuracy of clinical-pathologic
parameters. Second, the individual miRNAs identified can
inform on the specific genes responsible for lymphatic metastasis and direct future research into tailored therapeutics.
V. THE NATURE OF THE DATA
Fifty stage IA or IB (1988 FIGO staging) and 50 stage
IIIC frozen endometrial cancer samples were obtained from
the Gynecologic Oncology Group (GOG) tumor bank. The
samples were collected from patients enrolled in GOG tissue
acquisition protocol 210 which established a repository of
clinical specimens with detailed clinical and epidemiologic
data from patients with surgically staged endometrial carcinoma. The stage I and stage IIIC samples were matched for
age, grade, presence of lymphovascular space invasion, and race when possible. All patients enrolled in GOG 210 have
undergone comprehensive surgical staging consisting of total abdominal hysterectomy, bilateral salpingo-oophorectomy, and pelvic and para-aortic lymphadenectomy. Patients included
in this study had no gross or pathologic evidence of extrauterine disease aside from lymph node metastasis and could
be considered clinical stage I tumors. All tumors have
undergone central pathologic review by the GOG.
Out of these 100 tumors, six were rejected for unrelated
reasons, leaving 47 tumors each of stage IA/IB and of
stage IIIC. MicroRNA expression analysis was performed
on all these 94 samples, using a commercial experimental
apparatus that measured the average abundance of 1,428
miRNA molecules in all 94 tissue samples. The raw measured quantity underwent several quality control checks, and
finally the binary logarithm of the measured quantity was
taken as the output of the miRNA expression analysis.
Because the measurements were taken using a commercial
microarray chip, many of the miRNAs measured were neither
relevant nor of interest to us. Out of the total of 1428 × 94
measurements, about 42% were shown as ‘NAN’ (or ‘Not a
Number’). Therefore the first issue was how to treat all these
NAN entries. In view of the manner in which the raw data
was generated, it was evident that a reading of NAN resulted
when the quantity of miRNA produced is not sufficient to
be detectable by the measuring device. Therefore a decision
was taken to replace all NAN entries by a zero entry. This
is consistent with the physics of the problem.
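The replacement of undetectable readings by zero can be expressed in one line; the toy array below is a hypothetical stand-in for the actual 1,428 by 94 log2 expression matrix.

```python
# Sketch of the pre-processing decision described above: readings reported as
# NaN ('NAN'), i.e. below the detection limit, are replaced by zero.
import numpy as np

raw = np.array([[9.7, np.nan], [np.nan, 3.2]])   # toy stand-in for the real matrix
expression = np.nan_to_num(raw, nan=0.0)         # NaN entries become 0
```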
With this replacement, the data at hand consisted of
samples of the form $x_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, m$, where $n = 1,428$ is the number of miRNAs measured and $m = 94$
is the number of samples. The set of samples can be further
subdivided into two classes. Samples 1, . . . , m1 belong to
one class whereas samples m1 + 1, . . . , m1 + m2 belong to
the second class. In the present instance, m1 = m2 = 47.
VI. RESULTS

As mentioned earlier, the first step of the lone star algorithm is to apply the Student t-test to choose those features that showed a statistically significant difference between the means of the two classes. This resulted in a choice of 165 features from the original list of 1,428. The reduced data set with 165 features and 94 samples, 47 of each type, was analyzed with the $\ell_1$-norm SVM-RFE. For comparison purposes, the same data set (with 165 features) was also analyzed using the $\ell_2$-norm SVM RFE. This section reports on the results.
We first present the outcome of the $\ell_2$-norm SVM RFE. Note that the algorithm presented in [2] is not truly randomized, in the following sense. Only the initial division of each
sample set into five roughly equal classes is random; after
that, the procedure is deterministic. Thus if one application
of the RFE algorithm does not yield satisfactory results the
only option is to try again with a different initial division.
After several runs of the RFE algorithm, we could finally
manage to find ten out of the original 165 features that
had acceptable performance. Each run of the $\ell_2$-norm SVM RFE took approximately six hours on an Intel Xeon 2.8 GHz
Quad Core processor with 8 GB RAM. As stated above,
we had to repeat the runs several times before getting a
satisfactory classifier.
For the $\ell_1$-norm SVM-RFE, we began with the two sample classes, and at random chose half (23 samples) of each class as the training data and the other half (24 samples) as the testing data. For each random choice, we computed the associated optimal weight vector for the $\ell_1$-norm SVM.
This randomization step was repeated 80 times. We found
that the number of nonzero weights was consistently around
31, though naturally the locations or indices of these nonzero
weights changed from one run to another. Thus we could
reduce the number of features from 165 to 31 in a single
iteration. Note that in the case of the $\ell_2$-norm SVM this
type of reduction would have taken multiple iterations and
much more computing time.
At this point we had one of two possible ways to proceed.
First, we could simply average the weight vectors of all 80
runs, and choose the largest 31 components. Second, for each
of the 165 features, we could compute the number of times
(out of 80) that the particular feature had a nonzero weight;
then we could rank the features in terms of the number of
times the feature had a nonzero weight. A happy outcome
is that both approaches yield more or less the same set
of features. Specifically, out of the top-ranked 31 features
selected according to their average weight, the top 26 were
also the top-ranked features according to the number of times
that the feature had a nonzero weight; moreover, the ordering
of these 26 features was the same in both lists. Only the last
five indices differed. With this observation, we opted for the
first approach. Therefore we used each randomized run of the
$\ell_1$-norm SVM to determine the average number of nonzero
features, call it k; then we ranked all the features according
to their average weight across all randomized runs; and then
chose the top k-ranked features for the next round.
Subsequent runs of the above process resulted in the number of features decreasing in successive stages as follows:
165 → 31 → 21 → 18 → 15.
At this point there was no further reduction in the number
of nonzero features, so the algorithm was terminated. It is
interesting to note that nine out of the final fifteen features
were the top nine-ranking features at the very first run.
For the $\ell_1$-norm SVM RFE algorithm, the CPU time on the same processor as above was 3.5 minutes, roughly a hundred-fold reduction compared to the $\ell_2$-norm SVM RFE. Just to reduce the gap, we increased the number of randomized partitionings of the data at each iteration from 80 to 1,000. This did not change the final results (reported below) but increased the CPU time to 34 minutes, still ten times faster than the $\ell_2$-norm SVM RFE.
Now we report the accuracy, sensitivity and specificity of the classifiers found via the $\ell_1$-norm and the $\ell_2$-norm SVM RFE algorithms. As stated above, the results reported below for the $\ell_2$-norm SVM RFE are the best of many runs. This is because the recursive feature elimination is highly sensitive to the original (random) splitting of the data into five equal parts. In contrast, the results reported for the $\ell_1$-norm SVM RFE are the result of the one and only time we ran the algorithm.²

In the last round, after we reduced the number of features from 18 to 15, we partitioned the samples at random 80 times into two sets each, one for training and the other for testing (two-fold cross-validation), and computed the optimal $\ell_1$-norm SVM classifier. For each randomized run, we computed the accuracy, sensitivity and specificity on the testing data alone, and on the entire data set including both the testing as well as the training data. Then the weights of the 20 best-performing classifiers were averaged, and chosen as the final
classifier. For each of these best-performing 20 classifiers, it
makes sense to talk about the performance on the training
set (usually close to 100%) and on the test set. However,
since the final set of classifier weights is the average of
these, there is no distinction between the training set and
the testing set, and the only way to assess the classifier is
to run it on the entire data set. These figures are reported
below. For comparison purposes, the same type of numbers
are reported for the `2 -norm SVM RFE as well.
Next, we created a ‘heat map’ of the performance of
the best-performing 20 classifiers as well as the ‘average’
classifier on all 94 test samples. The diagram below shows
the result. Rows 1 to 20 are the best classifiers; thus the
bottom row has the best performance, while row 21 is that of
the average classifier. It can be seen that one sample, namely
UTSW#88-IIIC (though this label is difficult to see in the
figure), is misclassified 11 times out of 20. This suggests that
possibly this sample might be an outlier. Thus, in addition to
providing a good method for finding a reduced set of features
for two-class classification, the lone star algorithm also has
the potential to detect outliers in a relatively simple manner.
²To test the robustness of the algorithm, we ran it a second time and got very similar results.
VII. CONCLUDING REMARKS
In this paper we have introduced a new method for
finding a classifier between two classes, in the case where
the number of features is far in excess of the number of
samples. While the algorithm is completely general, we
have illustrated its utility by applying it to the problem of
determining a prognostic classifier for endometrial cancer,
specifically, whether or not metastasis has occurred into the
lymph nodes thus necessitating their removal via surgery.
The new algorithm consists of first using the Student t-test to choose only those features that demonstrate a significant difference between the two classes, and then using the $\ell_1$-norm
support vector machine. The last step is to keep on reducing
the number of features (recursive feature elimination) until
no further reduction is possible. For the endometrial cancer
application studied here, we began with 1,428 features and
finally wound up with just fifteen features. The resulting
classifier achieved accuracy, sensitivity and specificity of
nearly 90% on two-fold cross-validated test data, and in
excess of 93% on the entire data set. We have already used
this approach in ovarian cancer to determine which patients
are likely to respond to platinum therapy, and plan to apply
the approach to lung cancer to determine the responsiveness
of cell lines to chemotherapy.
Note that a somewhat similar approach to ensuring that
many components of the weight vector are zero is also the
objective of the so-called LASSO (Least Absolute Shrinkage
and Selection Operator) method proposed in [3]. The LASSO method minimizes the $\ell_2$-norm of the regression error subject to an $\ell_1$-norm constraint on the model parameters.
We believe that LASSO is more suited to problems where the
labels are not binary as in the present case, but real numbers
in the interval [−1, 1] (or some compact set). Therefore
LASSO is better-suited to the problem of predicting the
lethality of chemotherapy which is never binary, than to
deciding whether surgery is necessary or not, which is a
binary decision. We plan to study the applicability of LASSO
in addition to the $\ell_1$-norm SVM to the problem of predicting
the responsiveness of lung cancer cell lines to chemotherapy.
The findings should be interesting.
REFERENCES
[1] P. S. Bradley and O. L. Mangasarian, “Feature selection via concave
minimization and support vector machines,” in Machine Learning:
Proceedings of the Fifteenth International Conference (ICML ’98).
Morgan Kaufmann, San Francisco, 1998, pp. 82–90.
[2] I. Guyon, J. Weston, and S. Barnhill, “Gene selection for cancer
classification using support vector machines,” Machine Learning,
vol. 46, pp. 389–422, 2002.
[3] R. Tibshirani, “Regression shrinkage and selection via the lasso,”
Journal of the Royal Statistical Society, vol. 58(1), 1996.
[4] H. Zou and T. Hastie, “Regularization and variable selection via the
elastic net,” J. Royal Stat. Soc. B, vol. 67, pp. 301–320, 2005.
[5] C. Cortes and V. N. Vapnik, “Support-vector networks,” Machine Learning, vol. 20(3), pp. 273–297, 1995.
[6] R. S. Wenocur and R. M. Dudley, “Some special vapnik-chervonenkis
classes,” Discrete Mathematics, vol. 33, pp. 313–318, 1981.
[7] E. D. Sontag, “Feedforward nets for interpolation and classification,”
J. Comp. Sys. Sci., vol. 45(1), pp. 20–48, 1992.
[8] K.-U. Höffgen, H.-U. Simon, and K. S. V. Horn, “Robust trainability of
single neurons,” Journal of Computer and System Science, vol. 50(1),
pp. 114–125, 1995.
[9] K. Veropoulos, C. Campbell, and N. Cristianini, “Controlling the
sensitivity of support vector machines,” in IJCAI Workshop on Support
Vector Machines, 1999.
[10] C. P. Morrow, B. N. Bundy, R. N. Kurman, W. T. Creasman, P. Heller,
H. D. Homesley, et al., “Relationship between surgical-pathological
risk factors and outcome in clinical stage i and ii carcinoma of the
endometrium: a gynecologic oncology group study,” Gynecological
Oncology, vol. 40(1), pp. 55–65, 1991.
[11] R. C. Boronow, C. P. Morrow, W. T. Creasman, P. J. Disaia, S. G.
Silverberg, A. Miller, et al., “Surgical staging in endometrial cancer:
clinical-pathologic findings of a prospective study,” Obstet Gynecol.,
vol. 63(6), pp. 825–32, 1984 Jun.
[12] W. T. Creasman, C. P. Morrow, B. N. Bundy, H. D. Homesley, J. E.
Graham, and P. B. Heller, “Surgical pathologic spread patterns of
endometrial cancer. a gynecologic oncology group study,” Cancer, vol.
60(8 Suppl), pp. 2035–41, 1987 Oct 15.
[13] S. N. Lewin, T. J. Herzog, N. I. B. Medel, I. Deutsch, W. M. Burke,
X. Sun, et al., “Comparative performance of the 2009 international
federation of gynecology and obstetrics’ staging system for uterine
corpus cancer,” Obstet Gynecol., vol. 116(5), pp. 1141–9, 2010 Nov.
[14] H. Kitchener, A. M. Swart, Q. Qian, C. Amos, and M. K. Parmar,
“Efficacy of systematic pelvic lymphadenectomy in endometrial cancer
(mrc astec trial): a randomised study,” Lancet, vol. 373(9658), pp.
125–36, 2009 Jan 10.
[15] A. Mariani, S. C. Dowdy, W. A. Cliby, B. S. Gostout, M. B.
Jones, T. O. Wilson, et al., “Prospective assessment of lymphatic
dissemination in endometrial cancer: a paradigm shift in surgical
staging,” Gynecol Oncol., vol. 109(1), pp. 11–8, 2008 Apr.
[16] A. Mariani, M. J. Webb, G. L. Keeney, M. G. Haddock, G. Calori,
and K. C. Podratz, “Low-risk corpus cancer: is lymphadenectomy or
radiotherapy necessary?” Am J Obstet Gynecol., vol. 182(6), pp. 1506–
19, 2000 Jun.
[17] D. P. Bartel, “Micrornas: genomics, biogenesis, mechanism, and
function,” Cell, vol. 116(2), pp. 281–97, 2004 Jan.
[18] W. Filipowicz, L. Jaskiewicz, F. A. Kolb, and R. S. Pillai, “Posttranscriptional gene silencing by sirnas and mirnas,” Curr Opin Struct
Biol., vol. 15(3), pp. 331–41, 2005 Jun.
[19] E. J. Sontheimer and R. W. Carthew, “Silence from within: endogenous
sirnas and mirnas,” Cell, vol. 122(1), pp. 9–12, 2005 Jul.
[20] T. Boren, Y. Xiong, A. Hakam, R. Wenham, S. Apte, Z. Wei,
et al., “Micrornas and their target messenger rnas associated with
endometrial carcinogenesis,” Gynecol Oncol., vol. 110(2), pp. 206–
15, 2008 Aug.
[21] H. K. Dressman, A. Berchuck, G. Chan, J. Zhai, A. Bild, R. Sayer,
et al., “An integrated genomic-based approach to individualized treatment of patients with advanced-stage ovarian cancer,” J Clin Oncol.,
vol. 25(5), pp. 517–25, 2007 Feb 10.
[22] M. V. Iorio, R. Visone, G. D. Leva, V. Donati, F. Petrocca, P. Casalini,
et al., “Microrna signatures in human ovarian cancer,” Cancer Res.,
vol. 67(18), pp. 8699–707, 2007 Sep 15.
[23] N. Cristianini and J. Shawe-Taylor, Support Vector Machines. Cambridge University Press, Cambridge, UK, 2000.