Exploratory Basis Pursuit Classification

M. Brown* and N.P. Costen+
*
Corresponding author
Control Systems Centre,
Department of Electrical Engineering and Electronics,
UMIST,
PO Box 88,
Manchester,
M60 1QD,
UK
Tel: +44 (0)161 200 4672
Fax: +44 (0)161 200 4647
Email: martin.brown@umist.ac.uk
+
Dept Computing and Mathematics,
Manchester Metropolitan University,
Manchester,
UK
Email: n.costen@mmu.ac.uk
Keywords: feature selection, sparse classification, regularization
Abstract
Feature selection is a fundamental process in many classifier design problems. However, it is NP-complete and
approximate approaches often require extensive exploration and evaluation. This paper describes a novel
approach that represents feature selection as a continuous regularization problem which has a single, global minimum,
where the model’s complexity is measured using a 1-norm on the parameter vector. A new exploratory design process
is also described that allows the designer to efficiently construct the complete locus of sparse, kernel-based classifiers.
It allows the designer to investigate the optimal parameters’ trajectories as the regularization parameter is altered and
look for effects, such as Simpson’s paradox, that occur in many multivariate data analysis problems. The approach is
demonstrated on the well-known Australian Credit data set.
1. Introduction
Feature selection is a fundamental process within many classification algorithms, as many classification problems are
not well understood and a large dictionary of potential, flexible features exists, from which it is necessary to select a
relevant subset [Weston et al 2001]. This is typical in the vision and biometrics fields, where new measurement or
transformation techniques are explored in order to boost the classifier’s performance. The aims of feature selection
include:
- improved, more robust parameter estimation and
- improved insight into the decision making process.
Feature selection is generally an empirical process that is performed prior to, or jointly with, the parameter estimation
process. Training data is used to assess different model combinations and the most appropriate one for the classification
task is selected. However, it should also be recognized that feature selection often introduces problems because it is an
NP-complete problem and subtle parametric interactions are often not appreciated by the designer.
To perform feature selection, often multiple classifiers are constructed by using some form of forwards selection,
where a classifier’s complexity is incrementally increased in a step-wise optimal manner, or backwards elimination,
where a classifier’s complexity is incrementally decreased. These approaches are known to be sub-optimal, although
many heuristics have been proposed to make the procedures more robust. In addition, in recent years, there has been a
surge of interest in using Bayesian or hyperparameter approaches [Weston et al 2001], where relevance terms are
associated with the features and those features that have small posterior relevance are thresholded to zero. Again, this
multiple thresholding operation is not guaranteed to generate an optimal, sparse solution. This paper interprets feature
selection as the process of minimising a continuous 1-norm measure of the magnitude of the parameter vector.
To gain qualitative insights into the structure of the classification problem, it is possible to assess the trained
classifiers, by investigating whether the parameters are positive, negative or zero. This provides information about
whether certain effects may occur, and what type of effect each has. However, training multiple candidate classifiers can
produce very different qualitative interpretations in some cases, due to a phenomenon known as Simpson’s paradox
[Simpson 1951]. Simpson’s paradox refers to a change in the direction of an association between two variables when a
third variable is introduced. In the context of classification problems, a parameter may change in sign (or become zero
after being non-zero) when other variables are considered in the model. A qualitative interpretation of the role of the
original parameter is no longer valid in the presence of the third parameter and so the interpretation may be misleading
or incorrect. Therefore it is necessary to adopt a full multivariate approach where the designer is aware of all possible
sparse models and how sensitive the structure is to changes in the degree of sparseness; a central theme of this paper.
This paper describes a novel approach to building a complete family of sparse, optimal classifiers using an efficient,
iterative procedure, where the aim is to generate every optimal classifier, ranging from robust and simple designs to data
sensitive, complex models. Feature selection is performed at the same time as parameter estimation. Therefore, Section
2 describes the basic theory behind building a sparse classifier using a regularization design framework and a 1-norm
model complexity measure. Section 3 analyses the properties of the resulting optimal, sparse classifiers, by deriving both
the normal form and a set of necessary and sufficient constraints that define the active sets of parameters and data. This
theory is then utilized in Section 4, where an efficient, iterative procedure for constructing the complete locus of
optimal, sparse classifiers is described. This procedure is then applied to the Australian Credit data set, [Michie et al
1994], in Section 5, and the sparse initial models and Simpson’s paradox are both demonstrated.
2. Sparse classifier design
This section describes a sparse regularization approach for designing classifiers. A regularization framework is
introduced and a simple but naïve method for calculating the model’s parameters is also briefly described.
The empirical classifiers considered in this paper are linear in their adjustable parameters. This includes a wide range
of linear, polynomial and kernel [Cristianini and Shawe-Taylor 2000] machines:
y  sgn φT θ  b
where  and  are the n-dimensional parameter and feature vectors, respectively, b represents the bias term and the
classifier’s output, y, lies in {–1, 1}. The input features determine the classifier’s capabilities and generalization abilities,
and some techniques, such as support vector machines, assume flexible kernels are available and use regularization
methods to control complexity. An exemplar training set of the form {, t} is assumed to exist where the input data, 
is l*n, and each feature is typically scaled to be zero mean and unit variance and the classification target vector, t, is an
l-dimensional bipolar vector.
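As a concrete illustration of a linear-in-the-parameters classifier of this form (the paper gives no code; the Python below and its names are our own sketch):

```python
import numpy as np

def classify(phi, theta, b):
    """Linear-in-the-parameters classifier: y = sgn(phi^T theta + b).

    phi   : (n,) feature vector, or (l, n) matrix of feature vectors
    theta : (n,) parameter vector
    b     : scalar bias term
    Returns bipolar labels in {-1, +1}.
    """
    activation = phi @ theta + b
    # Map a zero activation to +1 so the output is always bipolar.
    return np.where(activation >= 0, 1, -1)

# Toy example: two features, one data point per class.
theta = np.array([1.0, -0.5])
Phi = np.array([[2.0, 1.0],    # activation =  2 - 0.5 + 0.1 =  1.6 -> +1
                [-1.0, 2.0]])  # activation = -1 - 1.0 + 0.1 = -1.9 -> -1
print(classify(Phi, theta, b=0.1))  # [ 1 -1]
```

Any fixed feature map (polynomial terms, kernel evaluations) can be substituted for the raw inputs, since the model remains linear in θ.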
2.1 A regularization approach to sparse classification
A large amount of research and application in statistical modelling has taken place in recent years, motivated by Vapnik’s work on
Statistical Learning Theory [Cristianini and Shawe-Taylor 2000, Weston et al 2001]. Using this approach, one way of
formulating a classification regularization function is known as basis pursuit classification and is given by:
$$ \min_{\theta, b, \zeta} f(\theta, b, \zeta, \lambda) = \tfrac{1}{2}\|\zeta\|_2^2 + \lambda\|\theta\|_1 \qquad (1) $$
where ζ and λ are the output residual vector and the regularization parameter, respectively. The non-negative output
residual vector is defined by:
$$ \mathrm{diag}(t)\left(\Phi\theta + \mathbf{1}b\right) + \zeta \ge \mathbf{1}, \qquad \zeta \ge \mathbf{0} \qquad (2) $$
and the set of active data points corresponds to those output residuals that are strictly positive and hence lie in or on the
wrong side of the classification margin. By optimizing the classifier’s parameters, the aim is to maximize the so-called
classification margin and thus ensure that the classifier provides a graded fit to the training data. In addition, it should
also be noted that this is a piecewise quadratic programming (QP) problem in n+l+1 parameters as the output residuals
are a thresholded linear function of the classifier’s active parameters.
In this regularization approach to machine learning, the loss function, $l(\zeta) = \tfrac{1}{2}\|\zeta\|_2^2$, which measures the data fitting
error, is balanced against a model complexity function, $p(\theta) = \|\theta\|_1$, and the optimal parameters are determined by the
regularization parameter. A small value of λ will ensure that greater weight is given to minimizing the output errors
by reducing the classification margin, at the expense of a model with a larger parameter vector. A large value of λ
ensures that more importance is given to smaller parameter vectors, although the output errors are larger and the
classification margin is wider. Therefore the optimal classifiers’ complexity and parameter estimates are a function of λ,
and this is further investigated in Section 4.
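The trade-off in criterion (1) can be made concrete with a small sketch (illustrative Python; the function name is our own) that evaluates the cost for a candidate classifier, using the hinge-style residuals of (2):

```python
import numpy as np

def bp_objective(theta, b, Phi, t, lam):
    """Basis pursuit classification cost (1): 0.5*||zeta||_2^2 + lam*||theta||_1,
    with non-negative residuals zeta = max(0, 1 - t*(Phi theta + b)) from (2)."""
    zeta = np.maximum(0.0, 1.0 - t * (Phi @ theta + b))
    return 0.5 * zeta @ zeta + lam * np.abs(theta).sum()

# Toy data: 20 points, 3 features, bipolar targets.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((20, 3))
t = np.sign(Phi[:, 0] + 0.1)
# With theta = 0 and b = 0 every residual is exactly 1, so the cost is 0.5*20.
print(bp_objective(np.zeros(3), 0.0, Phi, t, lam=1.0))  # 10.0
```

Increasing `lam` shifts the optimum towards smaller (and, with the 1-norm, sparser) parameter vectors at the cost of larger residuals.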
It is worthwhile noting that the regularization function, f(), is a continuous function of all its parameters and is
piecewise differentiable with respect to the classifier’s parameters, due to the presence of the 1-norm model complexity
function. It is also important to note that the model complexity term is a form of Bayesian prior. Normal ridge regression
techniques and support vector machines usually assume that the model complexity is measured with respect to a 2-norm
[Mackay 1998, Hong et al 2003]. However, basis pursuit classification, or robust classification, [Bradley and
Mangasarian 1998, Fung and Mangasarian 2002] refers to the fact that a 1-norm is used to measure the parameter
vector’s size. Most p-norms will produce parameter values that are small, but non-zero. It can be shown [Bradley and
Mangasarian 1998, Fung and Mangasarian 2002, Rosset et al 2003] that using a 1-norm produces sparse solutions while
still posing a continuous optimization problem, as will be demonstrated in the next section. In fact, p = 1 is the only p-norm (p ≥ 1) which produces a truly sparse solution without some form of thresholding procedure. The approach has been developed
for regression algorithms [Chen et al 2001, Fung and Mangasarian 2002], has recently been considered for support
vector machines [Rosset et al 2003], and weighting schemes have been used to specify prior knowledge [Brown and
Costen 2003].
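The contrast between the 1-norm and 2-norm penalties is easiest to see in one dimension, where both problems have closed-form solutions (an illustrative sketch; the function names are our own): the 1-norm gives a soft-threshold that is exactly zero for small coefficients, while the 2-norm only shrinks them.

```python
import numpy as np

def soft_threshold(c, lam):
    """Solution of min_theta 0.5*(theta - c)**2 + lam*|theta|.
    The 1-norm penalty gives an exact zero whenever |c| <= lam."""
    return np.sign(c) * np.maximum(np.abs(c) - lam, 0.0)

def ridge_1d(c, lam):
    """Solution of min_theta 0.5*(theta - c)**2 + 0.5*lam*theta**2.
    The 2-norm shrinks theta but never makes it exactly zero."""
    return c / (1.0 + lam)

c = np.array([3.0, 0.5, -2.0])
print(soft_threshold(c, lam=1.0))  # 2.0, 0.0, -1.0 : the middle entry is pruned
print(ridge_1d(c, lam=1.0))        # 1.5, 0.25, -1.0 : small but non-zero
```

This is the scalar analogue of why criterion (1) performs feature selection without any explicit thresholding step.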
2.2 Direct calculation of a sparse classifier
The sparse regularization criterion (1) is a piecewise QP problem in θ. A well-known method [Chen et al 2001] to
overcome this non-linearity is to introduce a set of 2n non-negative parameters, z, where the first n values are the
corresponding values of θ when θ is positive (and zero otherwise) and the second n values are –θ when θ is negative
(and zero otherwise). This allows the original cost function to be expressed in terms of z:
$$ \min_{z, b, \zeta} f(z, b, \zeta, \lambda) = \tfrac{1}{2}\sum_i \zeta_i^2 + \lambda\sum_i z_i \qquad (3) $$
$$ \text{s.t.} \quad z \ge \mathbf{0}, \qquad \mathrm{diag}(t)\left([\Phi,\ -\Phi]z + \mathbf{1}b\right) + \zeta \ge \mathbf{1} $$
which is now a standard QP problem in 2n+l+1 variables. While this approach is conceptually simple, it is not
necessarily very efficient. More importantly, in order to explore neighbouring classifiers and how sensitive the
calculated parameters are with respect to the regularization parameter, the QP problem would need to be
re-solved from scratch, thus ignoring knowledge about the problem’s structure.
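The split QP (3) can be prototyped with a general-purpose constrained optimizer. The sketch below (our own; a dedicated QP solver would be used in practice, and `solve_bpc_qp` is a hypothetical helper name) stacks the variables as x = [z, b, ζ] and recovers θ from the positive/negative split:

```python
import numpy as np
from scipy.optimize import minimize

def solve_bpc_qp(Phi, t, lam):
    """Naive solution of the split QP (3): variables x = [z (2n), b, zeta (l)],
    minimising 0.5*sum(zeta**2) + lam*sum(z) subject to z, zeta >= 0 and
    diag(t)([Phi, -Phi] z + 1 b) + zeta >= 1."""
    l, n = Phi.shape
    A = np.hstack([Phi, -Phi])                     # [Phi, -Phi], shape (l, 2n)

    def cost(x):
        z, zeta = x[:2 * n], x[2 * n + 1:]
        return 0.5 * zeta @ zeta + lam * z.sum()

    def margin(x):
        z, b, zeta = x[:2 * n], x[2 * n], x[2 * n + 1:]
        return t * (A @ z + b) + zeta - 1.0        # must be >= 0

    bounds = [(0, None)] * (2 * n) + [(None, None)] + [(0, None)] * l
    x0 = np.concatenate([np.zeros(2 * n), [0.0], np.ones(l)])  # feasible start
    res = minimize(cost, x0, method='SLSQP', bounds=bounds,
                   constraints=[{'type': 'ineq', 'fun': margin}])
    z, b = res.x[:2 * n], res.x[2 * n]
    theta = z[:n] - z[n:]                          # recover theta from the split
    return theta, b

# Tiny separable example: one feature aligned with the class labels.
Phi = np.array([[1.0], [-1.0], [2.0], [-2.0]])
t = np.array([1.0, -1.0, 1.0, -1.0])
theta, b = solve_bpc_qp(Phi, t, lam=0.1)
print(theta, b)
```

Re-running this solver for every value of λ is exactly the inefficiency that motivates the locus-following procedure of Section 4.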
3. Direct Calculation of the Sparse Classifier
The aim of this section is to derive an expression for directly calculating the sparse, optimal parameters. This is
achieved by considering the local, normal form of the regularization function. The expression is valid when the sets of
active parameters and data have already been correctly identified. While this may seem to be equivalent to the original
feature selection problem, it is shown that the calculation can be validated using a set of linear equality/inequality
constraints that represent the sets of active/inactive data and parameters.
In the following sections, an active parameter or datum is one that is actively considered in the calculation. The active
parameter set is the subset of parameters whose values are non-zero. The active data set refers to those data lying in, or
on the wrong side of, the classification margin, which have a positive output error, ζ_i. The active data matrix retains only the active features (columns) and active data (rows) and is denoted by Φ_A,A.
3.1. Normal form
Assuming that the active set of parameters and data are known, by differentiating (1) with respect to the active
parameters and setting equal to zero to find the minima, the gradients of the loss and model complexity functions are
related as:
$$ \Phi_{A,A}^T\left(t_A - \Phi_{A,A}\theta_A - \mathbf{1}b\right) = \lambda\,\mathrm{sgn}(\theta_A) \qquad (4) $$
where the subscripts A are used to denote the sets of active data and parameters. This is the normal form of basis pursuit
classification, and is similar to the normal form for least squares regression, where the inner product between the features
and residual is zero. The regularization parameter is “nearly” decoupled from the parameter vector, as it appears on the
right hand side of (4) as a multiplier of the locally constant sgn() vector. Therefore, the magnitude of the inner product
between each active feature and the active residuals is equal to λ.
3.2 Optimal parameter calculation
Equation (4) was derived on the assumption that the active parameter and data sets are known, and re-arranging
gives:
$$ \theta_A = H^{-1}\left(p - \lambda\,\mathrm{sgn}(\theta_A)\right) \qquad (5) $$
where $H = \Phi_{A,A}^T(I - \mathbf{1}\mathbf{1}^T/l_A)\Phi_{A,A}$ is the local, active Hessian which has had the bias term subtracted, $p = \Phi_{A,A}^T(I - \mathbf{1}\mathbf{1}^T/l_A)t_A$
represents the transformed target vector minus the bias, I is the identity matrix, 1 is a column vector of 1s and l_A is the
number of active data. This expression is a closed form solution for the optimal sparse parameter vector when the sign of
the optimal parameter vector has been identified. However, estimating the sign of the parameters, which lie in {-1, 0, 1}^n, is
equivalent to deciding which features are active; in general this is a non-trivial problem, but it will be addressed in
Section 4.
To show how the bias term is removed from (5), it should be noted that the bias term is not included in the model
complexity function, therefore, it can be calculated separately as the value that minimizes the squared output residual
across the active data set as:
$$ b = \mathbf{1}^T\left(t_A - \Phi_{A,A}\theta_A\right) / l_A \qquad (6) $$
Therefore the bias term is a function of the optimal parameters, which is why the Hessian matrix and target vector have
been modified in (5).
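Given correctly identified active sets, equations (5) and (6) are a direct linear-algebra computation. A minimal sketch (illustrative Python; the function name is our own) is:

```python
import numpy as np

def active_set_solution(Phi_AA, t_A, sign_A, lam):
    """Closed-form optimal parameters (5) and bias (6), assuming the active
    features (columns of Phi_AA), active data (rows) and the parameter signs
    sign_A = sgn(theta_A) have already been correctly identified."""
    lA = Phi_AA.shape[0]
    C = np.eye(lA) - np.ones((lA, lA)) / lA   # centring matrix I - 11^T / l_A
    H = Phi_AA.T @ C @ Phi_AA                 # local active Hessian, bias removed
    p = Phi_AA.T @ C @ t_A                    # transformed target, bias removed
    theta_A = np.linalg.solve(H, p - lam * sign_A)   # equation (5)
    b = np.mean(t_A - Phi_AA @ theta_A)              # equation (6)
    return theta_A, b
```

By construction, the returned pair satisfies the normal form (4) exactly: the inner product of each active feature with the residual t_A − Φ_A,A θ_A − 1b equals λ sgn(θ_A). Whether sign_A actually matches sgn(θ_A) must still be verified, which is the role of the constraints in Sections 3.3 and 3.4.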
3.3. Necessary and sufficient conditions on the parameters
Expression (5) is valid as long as the set of active parameters (columns of Φ_A,A) and data (rows of Φ_A,A) are known
and correct for the given value of λ. In this section, a set of linear equality/inequality constraints is derived that validate this
assumption for the sets of active and inactive parameters.
From (4), for the active residual vector, A, the conditions on the active parameters are:
$$ \Phi_{A,A}^T \zeta_A = \lambda\,\mathrm{sgn}(\theta_A) \qquad (7) $$
This gives a set of n_A equality constraints that the calculated parameters must satisfy.
By considering small perturbations of the inactive (zero) parameters, it can also be shown that the absolute
value of the corresponding inner product between their features and the residuals is bounded above by λ. Therefore the
set of inactive features must satisfy:
$$ -\lambda\mathbf{1} \le \Phi_{A,I}^T \zeta_A \le \lambda\mathbf{1} \qquad (8) $$
where Φ_A,I is the matrix of inactive features. This gives a set of 2n_I linear inequality constraints that must be satisfied,
where n_I is the number of inactive parameters.
Combining (7) and (8) gives a set of n + n_I linear equality and inequality constraints (since n_A + 2n_I = n + n_I) that must be satisfied.
3.4. Necessary and sufficient conditions on the data
Just as an active parameter has a non-zero value, an active datum has a strictly positive residual defined by (2). When
a datum becomes inactive, a row will be dropped from the feature matrix, Φ, and this will also change the inverse
Hessian matrix in (5).
For basis pursuit classification, an active datum lies inside, or on the wrong side of, the margin. Therefore, a datum
no longer affects the solution when:
$$ t_j\left(\varphi_j^T\theta + b\right) \ge 1 \qquad (9) $$
as the corresponding residual, ζ_j, is zero.
Using (6) and (9), the set of active data must satisfy:
$$ \mathrm{diag}(t_A)\left(I - \mathbf{1}\mathbf{1}^T/l_A\right)\Phi_{A,A}\theta_A \le \mathbf{1} - t_A\,\mu(t_A) \qquad (10) $$
where μ(t_A) represents the mean of the target vector for the current active data, and the transformed features are used
to remove the effect of the bias term. This gives a set of l_A constraints, one for each active datum.
A similar set of constraints occurs for the inactive data. The linear inequality constraints on the inactive data are:
$$ -\mathrm{diag}(t_I)\left(\Phi_{I,A} - \mathbf{1}\mathbf{1}^T\Phi_{A,A}/l_A\right)\theta_A \le -\mathbf{1} + t_I\,\mu(t_A) \qquad (11) $$
which are simply the negated coefficients for the active data, where the I subscript is used to denote the inactive data set.
This gives a set of l_I constraints, one for each inactive datum.
Therefore, there is a set of l inequality constraints, one for each datum and overall, the parameter and data linear
equality and inequality constraints specify necessary and sufficient conditions for the optimal direct calculation given by
(5) and (6).
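These conditions translate directly into a numerical validity check for a candidate solution. The sketch below (our own; `check_conditions` and the mask arguments are hypothetical helper names) tests the parameter conditions of Section 3.3 through the equivalent residual form of (4), and the data conditions of Section 3.4 through the margin:

```python
import numpy as np

def check_conditions(Phi, theta, b, t, lam, A_p, A_d, tol=1e-6):
    """Numerically test the conditions of Sections 3.3/3.4 for a candidate
    solution.  A_p, A_d are boolean masks over the parameters and the data.
    Returns True if all conditions hold to within tol."""
    margin = t * (Phi @ theta + b)            # t_j (phi_j^T theta + b)
    r = t[A_d] - (Phi[A_d] @ theta + b)       # signed active residuals, as in (4)
    ok = True
    # Active features: inner product with residuals equals lam * sgn(theta).
    ok &= np.allclose(Phi[np.ix_(A_d, A_p)].T @ r,
                      lam * np.sign(theta[A_p]), atol=tol)
    # Inactive features: inner product bounded in magnitude by lam.
    ok &= np.all(np.abs(Phi[np.ix_(A_d, ~A_p)].T @ r) <= lam + tol)
    # Active data lie in or on the margin; inactive data lie outside it.
    ok &= np.all(margin[A_d] <= 1.0 + tol)
    ok &= np.all(margin[~A_d] >= 1.0 - tol)
    return bool(ok)
```

For example, with θ = 0, b = 0 and all data active, the check reduces to testing whether every feature–target inner product is bounded by λ, which is exactly the activation condition (12) used in Section 4.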
4. Calculating the optimal parameter locus
The insights derived in Section 3 are now used to calculate the complete, optimal parameter locus, θ(λ). This refers
to the curve of parameter values where the regularization parameter is the independent variable. It allows a designer to
explore the optimal parameter trajectories and understand how features get introduced into the classifier, as well as exploring
their interaction. This provides the designer with important knowledge about the stability of the feature selection
process. An equivalent procedure for regression has also recently been proposed [Efron et al 2003].
It is shown that the parameter locus is a piecewise linear curve. It is locally linear in the regions where the active
parameter and data sets do not change, as λ is altered slightly. When either a parameter or a datum is added/removed
from the active sets, a knot occurs in the piecewise linear curve. The proposed algorithm starts at the simplest model
(zero parameters and all the data points active) and iteratively calculates the locus of globally optimal models.
4.1. Optimal parameter locus
The regularization function and the classifier’s parameters are parameterized by the regularization parameter λ,
where $\theta(\infty) = \arg\min_{\theta} f(\theta, b, \zeta, \infty) = \mathbf{0}$ represents the first boundary condition and the maximum likelihood solution,
$\hat{\theta} = \arg\min_{\theta} f(\theta, b, \zeta, 0)$, represents the second. The optimal parameter locus can be expressed as θ(λ), i.e. the parameter values
are a function of λ and they trace out a continuous path, or locus, of optimal sparse classifiers between these two points.
These comments hold for any p-norm model complexity function, where p ≥ 1. The only difference is the actual path
traced through parameter space as λ varies between ∞ and 0. When p = 2, the curve is a weighted quadratic where every
parameter is always non-zero, hence the classifiers are not sparse. Whereas using a p = 1 norm, the parameter locus is
piecewise linear (as will be demonstrated) and is generally sparse, as illustrated for a simple two parameter problem in
Figure 1.
()
Figure 1. An illustration of the piecewise linear optimal parameter locus, (), as the regularization parameter 
is altered from  to 0. The contours of the quadratic loss function are also shown.
4.2. Piecewise linear parameter locus
From equation (5), it can be seen that when the sets of active parameters and data do not change, the optimal
parameter values are a linear function of λ and the corresponding (local) inverse Hessian is constant.
To see how this insight can be combined with the linear equality/inequality constraints derived in Sections 3.3 and 3.4,
consider what happens as λ is reduced from ∞. At λ = ∞, the active parameter set is empty, all the parameters are zero,
θ(∞) = 0, and the active data set contains all the training data, as the classifier’s unthresholded output simply represents
the (bipolar) class priors. As λ is reduced, all the parameters will remain zero until one of the linear inequality
constraints in (8) is violated, and this occurs when:
$$ \max_i \left| \varphi_i^T\left(t - \mathbf{1}b\right) \right| = \lambda \qquad (12) $$
Hence one parameter will become active; it can be calculated using (5) and is locally a linear function of λ. As λ is
reduced further, the linear expression is valid as long as the sets of active parameters and data do not change. If they do,
the input matrix, Φ, is altered by adding or removing a row or column, and a corresponding rank-1 update occurs to the
Hessian. Therefore, it is possible to efficiently calculate the new Hessian without performing a full matrix update.
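The core tool for such cheap updates is the Sherman-Morrison identity, sketched below for the plain rank-1 case (with the centring term I − 11ᵀ/l_A, adding or removing a datum also changes l_A, so the full update is slightly more involved; this is an illustrative sketch only):

```python
import numpy as np

def sherman_morrison(H_inv, u, v):
    """Rank-1 update of an inverse: given H^{-1}, return (H + u v^T)^{-1}
    via the Sherman-Morrison identity, O(n^2) instead of a full O(n^3) solve."""
    Hu = H_inv @ u
    vH = v @ H_inv
    return H_inv - np.outer(Hu, vH) / (1.0 + v @ Hu)

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 4))
H = H @ H.T + 4 * np.eye(4)                 # symmetric positive definite Hessian
u = rng.standard_normal(4)
updated = sherman_morrison(np.linalg.inv(H), u, u)   # e.g. add a datum: H + u u^T
print(np.allclose(updated, np.linalg.inv(H + np.outer(u, u))))  # True
```

Adding or removing a column of Φ_A,A (a feature) instead changes the dimension of H, which is handled with the analogous bordered-inverse updates.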
4.3. Local parameter updates
Given that θ_A^0 is a knot value on the optimal parameter locus in the current active parameter sub-space, from Section
3.2, the local change in the (active) parameters is given by:
$$ \theta_A(\delta) = \theta_A^0 + \delta\, H^{-1}\mathrm{sgn}(\theta_A) \qquad (13) $$
where δ is the step size and the update direction is $s = H^{-1}\mathrm{sgn}(\theta_A)$. The aim is to step along the optimal parameter
locus as long as all the parameter and data constraints, derived in Sections 3.3 and 3.4, are valid. When the active sets
change, a new inverse Hessian is computed and a new step direction is calculated. It is worthwhile noting that there is a
negative monotonic relationship between δ and λ, i.e. as δ increases along the optimal parameter locus towards the
maximum likelihood solution, λ decreases.
4.4. Classifier design using sequential linear programming
The proposed “shooting algorithm” generates a sequence of Linear Programming (LP) sub-problems where the aim is
to iteratively step along the piecewise linear parameter trajectories, as the regularization parameter is reduced from ∞ to
0. The parameters change from zero to their maximum likelihood values. In addition, when the problem is reasonably
well conditioned, the number of iterations will be approximately n+l. However, it should be noted that even when the
problem is badly conditioned, the proposed procedure will find every optimal, sparse classifier.
Each knot value can now be iteratively calculated using the sequence of LP sub-problems in terms of the unknown
step length δ:
$$ \min_{\delta}\ -\delta $$
$$ \text{s.t.} \quad \delta \ge 0 $$
$$ \delta\,\mathrm{sgn}(\theta_A^0)\,.\!*\,s \ge -\left|\theta_A^0\right| $$
$$ \delta\left(\mathbf{1}\varphi_{A(1)}^T\mathrm{sgn}(\theta_{A(1)}) - \Phi_{A,I}^T\right)\left(I - \mathbf{1}\mathbf{1}^T/l_A\right)\Phi_{A,A}s \le \left(\mathbf{1}\varphi_{A(1)}^T\mathrm{sgn}(\theta_{A(1)}) - \Phi_{A,I}^T\right)\left(I - \mathbf{1}\mathbf{1}^T/l_A\right)\left(t_A - \Phi_{A,A}\theta_A^0\right) $$
$$ \delta\left(\mathbf{1}\varphi_{A(1)}^T\mathrm{sgn}(\theta_{A(1)}) + \Phi_{A,I}^T\right)\left(I - \mathbf{1}\mathbf{1}^T/l_A\right)\Phi_{A,A}s \le \left(\mathbf{1}\varphi_{A(1)}^T\mathrm{sgn}(\theta_{A(1)}) + \Phi_{A,I}^T\right)\left(I - \mathbf{1}\mathbf{1}^T/l_A\right)\left(t_A - \Phi_{A,A}\theta_A^0\right) $$
$$ \delta\,\mathrm{diag}(t_A)\left(I - \mathbf{1}\mathbf{1}^T/l_A\right)\Phi_{A,A}s \le \mathbf{1} - t_A\mu(t_A) - \mathrm{diag}(t_A)\left(I - \mathbf{1}\mathbf{1}^T/l_A\right)\Phi_{A,A}\theta_A^0 $$
$$ -\delta\,\mathrm{diag}(t_I)\left(\Phi_{I,A} - \mathbf{1}\mathbf{1}^T\Phi_{A,A}/l_A\right)s \le -\mathbf{1} + t_I\mu(t_A) + \mathrm{diag}(t_I)\left(\Phi_{I,A} - \mathbf{1}\mathbf{1}^T\Phi_{A,A}/l_A\right)\theta_A^0 $$
The first constraint is a single bound of the form δ ≥ 0. The next three constraints deal with the active/inactive parameter
set, as described in Section 3.3, where the term $\varphi_{A(1)}^T\mathrm{sgn}(\theta_{A(1)})$ expresses the current value of λ through the first active feature, using (7). The last two constraints are concerned with the active/inactive data, as described in
Section 3.4. While this LP problem may appear quite complex, it is important to note that this is simply a set of
n+n_I+l+1 linear inequality constraints on a single variable, of the form aδ ≤ b, where only non-negative values of δ are
acceptable. Therefore, once the coefficient vectors have been determined, calculating the next knot value is simply the
case of finding the minimum valid value of a vector of maximum length n+n_I+l+1.
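Once the coefficient vectors of the stacked constraints aδ ≤ b have been assembled (with b ≥ 0 at a valid knot), the next knot follows from the minimum positive ratio, as described above. A minimal sketch (the function name is our own):

```python
import numpy as np

def next_knot_step(a, b, tol=1e-12):
    """Largest feasible step delta for stacked constraints a_i * delta <= b_i,
    assuming b >= 0 at the current knot: min over b_i/a_i for a_i > 0.
    An infinite result means no constraint blocks, i.e. the maximum
    likelihood solution has been reached."""
    pos = a > tol
    return np.min(b[pos] / a[pos]) if np.any(pos) else np.inf

a = np.array([2.0, -1.0, 0.5, 4.0])
b = np.array([3.0, 1.0, 10.0, 2.0])
print(next_knot_step(a, b))   # 0.5 : the fourth constraint binds first
```

The index of the binding constraint also identifies which parameter or datum enters or leaves the active sets at the new knot.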
5. Australian credit approval data set
The Australian Credit data set is a real-world data set that has been used by Quinlan and Statlog [Michie et al 1994]
to validate a wide number of machine learning, statistical and neural classification algorithms. In the Statlog study, the
linear classification rules were ranked highly, thus demonstrating that there was little non-linearity in the Bayes
discriminant. The data set consisted of 690 observations about whether or not credit should be granted and there were 14
potential attributes - 6 continuous and 8 categorical. The categorical variables ranged from simple binary indicators to
14 separate values. The categorical variables were encoded using a “1 of C-1” mapping which produced a potential set
of 35 features (continuous features + encoded categorical features). Note that the bias term is implicitly calculated, using
(6), and not included in this variable count, although its trajectory is shown in Figure 3; it is the only non-zero
parameter when λ is large, as it represents the class priors.
The error curve and parameter trajectory for a linear model are shown in Figures 2 and 3, respectively. Here the
values are plotted against log(λ+1). As can be clearly seen, for a large range of sparse models, the error rate is
approximately 100 and there is a single dominant parameter. The first two parameters to become non-zero, θ_28 and θ_29,
are related to the 8th and 9th variables in the original data set. Both are binary variables and contain similar information.
It is also worthwhile noting that θ_28 remains almost constant while new features are introduced into the classifier, thus
demonstrating that it remains an important, stable parameter across the complete set of sparse, optimal classifiers. For
small values of λ, the number of classification errors goes down from 100 to around 80; however, this is at the expense of
over 30 variables being introduced into the model and their values increasing significantly, so their relevance is
questionable. When λ was small, a number of parameters started to reverse sign as the regularization parameter was
reduced. In practice, approximately 10 of the 35 parameters were subject to this effect, and the behaviour of one such
parameter is shown in Figure 4. It was interesting to note that this example of Simpson’s paradox was largely restricted
to the parameters associated with the two categorical input variables which had a large number of values. In addition, it only
occurred when the number of active data points was becoming small, due to the margin decreasing in size. It will be
instructive to identify how prevalent Simpson’s paradox is on other classification problems and whether it is largely
restricted to categorical inputs. It is important because it limits the qualitative interpretation of the parameter values.
The advantage of using an approach such as this is that the properties for all the sparse models can be directly
visualized and the behaviour of the parameters discussed. In addition, it was found that the approach was efficient,
compared to re-training a single classifier, even though the approach generates every sparse classifier. However, in order
to select a particular model, two critical points can be identified. When the first parameter is introduced into the
classifier, the number of errors is substantially reduced to 100, and this is for a classifier that contains just one input.
Alternatively, if a more complex classifier is required, when log(λ+1) = 1 the number of errors has been reduced to
approximately 80, and this is before the onset of the effects of Simpson’s paradox; the parameter values are reasonably
stable, so this may be considered an alternative classifier. Automatic model selection is currently being explored using
this framework.
Figure 2. Number of classification errors versus the transformed regularization parameter, log(λ+1). As the
model becomes more complex and λ decreases, the number of errors on the training set decreases, but not
monotonically. There is a large reduction when the first feature enters the classifier.
Figure 3. A plot of the parameter trajectories versus the transformed regularization parameter, log(λ+1).
Initially, a single feature/parameter dramatically decreases the number of classification errors, and the bias term
reflects the non-zero class priors for the active data.
Figure 4. A plot of θ_21 versus log(λ+1). When the parameter enters the classifier, it is initially positive; however,
when other features enter the classifier, it becomes negative.
6. Conclusions
A novel and efficient iterative method for calculating the complete set of sparse classifiers has been proposed. It
provides designers with more information about the sensitivity of the problem as they can immediately visualize how the
parameters change as the regularization parameter is altered from infinity to zero. One advantage of using a basis pursuit
classification framework is that there exists a global minimum when calculating the sparse feature set. In addition, when
the regularization parameter is altered slightly, at most one of the classification parameter will change state. This fully
supports a forwards selection (or alternatively backwards elimination) approach to optimal feature selection. The one
major difference between using basis pursuit for regression or classification problems is the extra set of constraints
imposed by the output thresholding for classification problems that determine the active set of data points. An important
part of this work is to explore parameter interaction and the selection stability and an example of this was seen in the
Australian credit data set when parameter values can change sign, and hence interpretation because the model
complexity is altered slightly.
Future work will investigate robust and efficient implementations of the iterative LP sub-problems, estimating the
number of segments/knots in the set of sparse models, specifying non-uniform priors as well as investigating how the
technique can be used for feature selection with non-linear kernels and developing error bar estimates for the sparse
regularization framework.
7. References
Bradley, P.S., Mangasarian, O.L.: Feature Selection via Concave Minimization and Support Vector Machines. Proc 15th Intl. Conf.
Machine Learning, Morgan Kaufmann, San Francisco, California, 82-90 (1998)
Brown, M., Costen, N.: Exploratory Sparse Models for Face Recognition, BMVC, Norwich, 1, 13-22 (2003)
Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic Decomposition by Basis Pursuit. SIAM Review 43(1) 129-159, (2001)
Cristianini, N., Shawe-Taylor, J.: Support Vector Machines and other kernel-based learning methods. Cambridge University Press
(2000)
Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R., Least Angle Regression. Annals of Statistics, to appear, (2003)
Fung, G., Mangasarian, O.L.: Incremental Support Vector Machine Classification. Proc 2nd SIAM Intl. Conf. Data Mining, 247-260
(2002)
Hong, X., Sharkey, P.M., Warwick, K.: A robust nonlinear identification algorithm using PRESS statistic and forward regression,
IEEE Trans on Neural Networks, Vol.14, No.2, pp454-458, (2003)
MacKay, D.J.C. Introduction to Gaussian processes. In Bishop, C., (ed.) Neural Networks and Machine Learning. NATO Asi Series.
Series F, Computer and Systems Sciences, Vol. 168, (1998)
Michie, D.,Spiegelhalter, D.J., Taylor, C.C.: Machine Learning, Neural and Statistical Classification. Ellis Horwood, (1994)
Rosset, S., Zhu, J., Hastie, T., Tibshirani, R.: 1-Norm Support Vector Machines. Accepted for NIPS (2003)
Simpson, E.H.: The Interpretation of Interaction in Contingency Tables. Journal of the Royal Statistical Society, Ser. B, 13, 238-241 (1951)
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V.: Feature Selection for SVMs. In: Advances in Neural
Information Processing Systems, NIPS, 13, 668-674 (2001)