Exploratory Basis Pursuit Classification M. Brown* and N.P. Costen+ * Corresponding author Control Systems Centre, Department of Electrical Engineering and Electronics, UMIST, PO Box 88, Manchester, M60 1QD, UK Tel: +44 (0)161 200 4672 Fax: +44 (0)161 200 4647 Email: martin.brown@umist.ac.uk + Dept Computing and Mathematics, Manchester Metropolitan University, Manchester, UK Email: n.costen@mmu.ac.uk Keywords: feature selection, sparse classification, regularization Abstract Feature selection is a fundamental process in many classifier design problems. However, it is NP-complete and approximate approaches often require requires extensive exploration and evaluation. This paper describes a novel approach that represents feature selection as a continuous regularization problem which has a single, global minimum, where the model’s complexity is measured using a 1-norm on the parameter vector. A new exploratory design process is also described that allows the designer to efficiently construct the complete locus of sparse, kernel-based classifiers. It allows the designer to investigate the optimal parameters’ trajectories as the regularization parameter is altered and look for effects, such as Simpson’s paradox, that occur in many multivariate data analysis problems. The approach is demonstrated on the well-known Australian Credit data set. 1. Introduction Feature selection is a fundamental process within many classification algorithms, as many classification problems are not well understood and a large dictionary of potential, flexible features exists, from which it is necessary to select a relevant subset [Weston et al 2001]. This is typical in the vision and biometrics fields, where new measurement or transformation techniques are explored in order to boost the classifier’s performance. The aims of feature selection include: improved, more robust parameter estimation and improved insight into the decision making process. Feature selection is generally an empirical process that is performed prior to, or jointly with, the parameter estimation process. Training data is used to assess different model combinations and the most appropriate one for the classification task is selected. However, it should also be recognized that feature selection often introduces problems because it is an NP-complete problem and subtle parametric interactions are often not appreciated by the designer. To perform feature selection, often multiple classifiers are constructed by using some form of forwards selection, where a classifier’s complexity is incrementally increased in a step-wise optimal manner, or backwards elimination, where a classifier’s complexity is incrementally decreased. These approaches are known to be sub-optimal, although many heuristics have been proposed to make the procedures more robust. In addition, in recent years, there has been a surge of interest in using Bayesian or hyperparameter approaches [Weston et al 2001], where relevance terms are associated with the features and those features that have small posterior relevance are thresholded to zero. Again, this multiple thresholding operation is not guaranteed to generate an optimal, sparse solution. This paper interprets feature selection as the process of minimising a continuous 1-norm measure of the magnitude of the parameter vector. To gain qualitative insights into the structure of the classification problem, it is possible to assess the trained classifiers, by investigating whether the parameters are positive, negative or zero. This provides information about whether certain effects may occur, and what type of effect it has. However, training multiple candidate classifiers can produce very different qualitative interpretations in some cases, due to a phenomenon known as Simpson’s paradox [Simpson 1951]. Simpson’s paradox refers to a change in the direction of an association between two variables when a third variable is introduced. In the context of classification problems, a parameter may change in sign (or become zero after being non-zero) when other variables are considered in the model. A qualitative interpretation of the role of the original parameter is no longer valid in the presence of the third parameter and so the interpretation may be misleading or incorrect. Therefore it is necessary to adopt a full multivariate approach where the designer is aware of all possible sparse models and how sensitive the structure is to changes in the degree of sparseness; a central theme of this paper. This paper describes a novel approach to building a complete family of sparse, optimal classifiers using an efficient, iterative procedure, where the aim is generate every optimal classifier that range from robust and simple designs to data sensitive, complex models. Feature selection is performed at the same time as parameter estimation. Therefore, Section 2 describes the basic theory behind building a sparse classifier using a regularization design framework and a 1-norm model complexity measure. Section 3 analyses the properties of the resulting optimal, sparse classifiers, by deriving both the normal form and a set of necessary and sufficient constraints that define the active sets of parameters and data. This theory is then utilized in Section 4, where an efficient, iterative procedure for constructing the complete locus of optimal, sparse classifiers is described. This procedure is then applied to the Australian Credit data set, [Michie et al 1994], in Section 5, and the sparse initial models and Simpson’s paradox are both demonstrated. 2. Sparse classifier design This section describes a sparse regularization approach for designing classifiers. A regularization framework is introduced and a simple but naïve method for calculating the model’s parameters is also briefly described. The empirical classifiers considered in this paper are linear in their adjustable parameters. This includes a wide range of linear, polynomial and kernel [Cristianini, and Shawe-Taylor 2000] machines: y sgn φT θ b where and are the n-dimensional parameter and feature vectors, respectively, b represents the bias term and the classifier’s output, y, lies in {–1, 1}. The input features determine the classifier’s capabilities and generalization abilities, and some techniques, such as support vector machines, assume flexible kernels are available and use regularization methods to control complexity. An exemplar training set of the form {, t} is assumed to exist where the input data, is l*n, and each feature is typically scaled to be zero mean and unit variance and the classification target vector, t, is an l-dimensional bipolar vector. 2.1 A regularization approach to sparse classification A lot of research/application in statistical modelling has taken place in recent years, motivated by Vapnik’s work on Statistical Learning Theory [Cristianini, and Shawe-Taylor 2000,Weston et al 2001]. Using this approach, one way of formulating a classification regularization function is known as basis pursuit classification and is given by: 2 (1) min θ,b ,ζ f θ, b, ζ, 1 2 ζ 2 θ 1 where ζ and are the output residual vector and the regularization parameter, respectively. The non-negative output residual vector is defined by: (2) diag (t)Φθ 1b ζ 1 and the set of active data points corresponds to those output residuals that are strictly positive and hence lie in or on the wrong side of the classification margin. By optimizing the classifier’s parameters, the aim is to maximize the so-called classification margin and thus ensure that the classifier provides a graded fit to the training data. In addition, it should also be noted that this is a piecewise quadratic programming (QP) problem in n+l+1 parameters as the output residuals are a thresholded linear function of the classifier’s active parameters. In this regularization approach to machine learning, the loss function, l ζ 1 2 2 ζ 2 , which measures the data fitting error is balanced against a model complexity function, pθ θ , and the optimal parameters are determined by the 1 regularization parameter. A small value of will ensure that greater weight is given to minimizing the output errors by reducing the classification margin, at the expense of a model with a larger parameter vector. A large value of ensures that more importance is given to smaller parameter vectors although the output errors are larger and the classification margin is wider. Therefore the optimal classifiers’ complexity and parameter estimates are a function of and this is further investigated in Section 4. It is worthwhile noting that the regularization function, f(), is a continuous function of all its parameters and is piecewise differentiable with respect to the classifier’s parameters, due to the presence of the 1-norm model complexity function. It is also important to note that the model complexity term is a form of Bayesian prior. Normal ridge regression techniques and support vector machines usually assume that the model complexity is measured with respect to a 2-norm [Mackay 1998, Hong et al 2003]. However, basis pursuit classification, or robust classification, [Bradley and Mangasarian 1998, Fung and Mangasarian 2002] refers to the fact that a 1-norm is used to measure the parameter vector’s size. Most p-norms will produce parameter values that are small, but non-zero. It can be shown, [Bradley and Mangasarian 1998, Fung and Mangasarian 2002,Rosset et al 2003], that using a 1-norm produces sparse solutions while still producing a continuous optimization problem, as will be demonstrated in the next section. In fact, p=1 is the only pnorm (p1) which produces a truly sparse solution without some form of thresholding procedure and has been developed for regression algorithms [Chen et al 2000, Fung and Mangasarian 2002] and recently been considered for support vector machines [Rosset et al 2003] and weighting schemes have been used to specify prior knowledge [Brown and Costen 2003]. 2.2 Direct calculation of a sparse classifier The sparse, regularization criterion (1) is a piecewise QP problem in . A well-known method [Chen et al 2001] to overcome this non-linearity is to introduce a set of 2n non-negative parameters, z, where the first n values are the corresponding values of when is positive (and zero otherwise) and the second n values are – when is negative (and zero otherwise). This allows the original cost function to be expressed in terms of z: min z ,b,ζ f z, b, ζ, 1 2 i i2 i z i (3) st z0 diag (t )Φ Φz 1b ζ 1 which is now a standard QP problem in 2n+l+1 variables. While this approach is conceptually simple, it is not necessarily very efficient. More importantly, in order to explore neighbouring classifiers in order to explore how sensitive the calculated parameters are with respect to the regularization parameters, the QP algorithm would need to be re-calculated, thus ignoring knowledge about the problem’s structure. 3. Direct Calculation of the Sparse Classifier The aim of this section is to derive an expression for directly calculating the sparse, optimal parameters. This is achieved by considering the local, normal form of the regularization function. The expression is valid when the sets of active parameters and data have already been correctly identified. While this may seem to be equivalent to the original feature selection problem, it is shown that the calculation can be validated using a set of linear equality/inequality constraints that represent the sets of active/inactive data and parameters. In the following sections, an active parameter or datum is actively considered in the calculation. Therefore, the active parameter set is the subset of parameters whose values are non-zero. The active data set refers to those data lying in, or on the wrong side of, the classification margin and have a positive output error, i. The active data matrix refers to the data matrix that retains only the active features (columns) and active data (rows) and is denoted by Φ A, A . 3.1. Normal form Assuming that the active set of parameters and data are known, by differentiating (1) with respect to the active parameters and setting equal to zero to find the minima, the gradients of the loss and model complexity functions are related as: (4) ΦTA, A t A Φ A, Aθ A 1b sgn( θ A ) where the sub-scripts A are used to denote the sets of active data and parameters. This is the normal form of basis pursuit classification, and is similar to the normal form for least squares regression where the inner product between the features and residual is zero. The regularization parameter is “nearly” decoupled from the parameter vector, as it appears on the right hand side of (4) as a multiplier of the locally constant sgn() vector. Therefore, the magnitude of the inner product between each active feature and the active residuals is equal to 3.2 Optimal parameter calculation Equation (4) was derived on the assumption that the active parameter and data sets are known, and re-arranging gives: (5) θ A Η 1 p sgn( θ A ) T T T where H Φ A, A I 11 / lA Φ A, A is the local, active Hessian which has had the bias term subtracted, p Φ A, A I 11T / l A t A represents the transformed target vector minus the bias, I is the identify matrix, 1 is a column vector of 1s and lA is the number of active data. This expression is a closed form solution for the optimal sparse parameter vector when the sign of the optimal parameter vector has been identified. However estimating the sign of the parameters which lie in {-1,0,1}n is equivalent to deciding which are the active features which in general, is a non-trivial problem, but will be addressed in Section 4. To show how the bias term is removed from (5), it should be noted that the bias term is not included in the model complexity function, therefore, it can be calculated separately as the value that minimizes the squared output residual across the active data set as: (6) b 1T t A Φ A, Aθ A l A Therefore the bias term is a function of the optimal parameters, which is why the Hessian matrix and target vector have been modified in (5). 3.3. Necessary and sufficient conditions on the parameters Expression (5) is valid as long as the set of active parameters (columns of Φ A, A ) and data (rows of Φ A, A ) are known and correct, for the value of . In this section, a set of linear equality/inequality constraints is derived that validate this assumption for the sets of active and inactive parameters. From (4), for the active residual vector, A, the conditions on the active parameters are: (7) ΦTA, Aζ A sgn θ A This gives a set of nA equality constraints that the calculated parameters must satisfy. Also, by considering small perturbations of the inactive (zero) parameters, it can also be shown that the absolute value of the corresponding inner product between their features and the residuals is bounded above by . Therefore the set of inactive features must satisfy: (8) 1 ΦTA, I ζ A 1 where Φ A, I is the matrix of inactive features. This gives a set of 2nI linear inequality constraints that must be satisfied, where nI is the number of inactive parameters. Combining (7) and (8) gives a set of n+nI linear equality and inequality constraints that must be satisfied. 3.4. Necessary and sufficient conditions on the data Just as an active parameter has a non-zero value, an active datum has a strictly positive residual defined by (2). When a datum becomes inactive, a row will be dropped from the feature matrix, and this will also change the inverse Hessian matrix in (5). For basis pursuit classification, an active datum lies inside, or on the wrong side of, the margin. Therefore, a datum no longer affects the solution when: (9) t j φTj θ b 1 as the corresponding residual, j , is zero. Using (6) and (9), the set of active data must satisfy: diag (t A )I 11T / l A Φ A, Aθ A 1 t A (t A ) (10) where tA) represents the mean of the target vector for the current active data, the transformed features are being used to remove the effect of the bias term. This gives a set of lA constraints, one for each active datum. A similar set of constraints occurs for the inactive data. The linear inequality constraints on the inactive data are: (11) - diag (t I )ΦI , A 11T Φ A, A / l A θ A 1 t I (t A ) which are simply the negated coefficients for the active data and the I subscript is used to denote the inactive data set. This gives a set of lI constraints, one for each inactive datum. Therefore, there is a set of l inequality constraints, one for each datum and overall, the parameter and data linear equality and inequality constraints specify necessary and sufficient conditions for the optimal direct calculation given by (5) and (6). 4. Calculating the optimal parameter locus The insights derived in Section 3 are now used to calculate the complete, optimal parameter locus, (). This refers to the curve of parameter values where the regularization parameter is the independent variable. It allows a designer to explore the optimal parameter trajectories and understand how features get introduced into classifier as well as exploring their interaction. This provides the designer with important knowledge about the stability of the feature selection process. An equivalent procedure for regression has also recently been proposed [Efron et al 2003]. It is shown that the parameter locus is a piecewise linear curve. It is locally linear in the regions where the active parameter and data sets do not change, as is altered slightly. When either a parameter or a datum is added/removed from the active sets, a knot occurs in the piecewise linear curve. The proposed algorithm starts at the simplest model (zero parameters and all the data points active) and iteratively calculates the locus of globally optimal models. 4.1. Optimal parameter locus The regularization function and the classifier’s parameters are parameterized by the regularization parameter , where 0 arg min f (θ, b, ζ, ) represents the first boundary condition and the maximum likelihood solution, θ ^ θ arg min f (θ, b, ζ,0) , represents the second. The optimal parameter locus can be expressed as (), i.e. their values θ are a function of and they trace out a continuous path or locus of optimal sparse classifiers between these two points. These comments hold for any p-norm model complexity function, where p 1 . The only difference is the actual path traced through parameter space as varies between ∞ and 0. When p=2, the curve is a weighted quadratic where every parameter is always non-zero, hence the classifiers are not sparse. Whereas using a p=1 norm, the parameter locus is piecewise linear (as will be demonstrated) and is generally sparse, as illustrated for a simple two parameter problem in Figure 1. () Figure 1. An illustration of the piecewise linear optimal parameter locus, (), as the regularization parameter is altered from to 0. The contours of the quadratic loss function are also shown. 4.2. Piecewise linear parameter locus From equation (5), it can be seen that when the sets of active parameters and data do not change, the optimal parameter values is a linear function of and the corresponding (local) inverse Hessian is constant. To see how this insight can be combined with the linear equality/inequality constraints derived in Section 3.3 and 3.4, consider what happens as is reduced from ∞. At = ∞, the active parameter set is empty, all the parameters are zero, (∞)=0, and the active data set contains all the training data as the classifier’s unthresholded output simply represents the (bipolar) class priors. As is reduced, all the parameters will remain zero until one of the linear inequality constraints in (8) is violated and this occurs when: (12) max i φTi (t 1b) Hence one parameter will become active, it can be calculated using (5) and is locally a linear function of . As is reduced further, the linear expression is valid as long as the sets of active parameters and data do not change. If they do, the input matrix, is altered by adding or removing a row or column and a corresponding rank 1 update occurs to the Hessian. Therefore, it is possible to efficiently calculate the new Hessian without performing a full matrix update. 4.3. Local parameter updates Given that θ 0A is a knot value on the optimal parameter locus in the current active parameter sub-space, from Section 3.2, the local change in the (active) parameters is given by: (13) θ A θ 0A H 1 sgn( θ A ) where is the step size and the update direction is s H 1 sgn( θ A ) . The aim is to step along the optimal parameter locus as long as all the parameter and data constraints, derived in Sections 3.3 and 3.4, are valid. When the active sets change a new inverse Hessian is computed and a new step direction is calculated. It is worthwhile noting that there is a negative monotonic relationship between and , i.e. as increases along the optimal parameter locus towards the maximum likelihood solution, decreases. 4.4. Classifier design using sequential linear programming The proposed “shooting algorithm” generates a sequence of Linear Programming (LP) sub-problems where the aim is to iteratively step along the piecewise linear parameter trajectories, as the regularization parameter is reduced from ∞ to 0. The parameters change from zero to their maximum likelihood values. In addition, when the problem is reasonably well conditioned, the number of iterations will be approximately n+l. However, it should be noted that even when the problem is badly conditioned, the proposed procedure will find every optimal, sparse classifier. Each knot value can now be iteratively calculated using the sequence of LP sub-problems in terms of the unknown step length : min st 1 sgn( θ 0A ). * s θ 0A 1φ TA(1) sgn( A(1) ) Φ TA, I I 11T / l A Φ A, As 1φ TA(1) sgn( A(1) ) Φ TA, I I 11T / l A t A Φ A, Aθ 0A 1φ TA(1) sgn( A(1) ) Φ TA, I I 11T / l A Φ A, As 1φ TA(1) sgn( A(1) ) Φ TA, I I 11T / l A t A Φ A, Aθ 0A diag (t A )I 11T / l A Φ A, As 1 t A (t A ) diag (t A )I 11T / l A Φ A, Aθ 0A diag (t I )Φ I , A 11T Φ A, A / l A s 1 t I (t A ) diag (t I )Φ I , A 11T Φ A, A / l A θ 0A The first constraint is a single bound of the form 0. The next three constraints deal with the active/inactive parameter set, as described in Section 3.3. The last two constraints are concerned with the active/inactive data, as described in Section 3.4. While this LP problem may appear quite complex, it is important to note that this is simply a set of n+nI+l+1 linear inequality constraints on a single variable of the form a b , where only non-negative values of are acceptable. Therefore, once the coefficient vectors have been determined, calculating the next knot value is simply the case of finding the minimum valid value of a vector of maximum length n+nI+l+1. 5. Australian credit approval data set The Australian Credit data set is a real-world data set that has been used by Quinlan and Statlog [Michie et al 1994] to validate a wide number of machine learning, statistical and neural classification algorithms. In the Statlog study, the linear classification rules were ranked highly, thus demonstrating that there was little non-linearity in the Bayes discriminant. The data set consisted of 690 observations about whether or not credit should be granted and there were 14 potential attributes - 6 continuous and 8 categorical. The categorical variables ranged from simple binary indicators to 14 separate values. The categorical variables were encoded using a “1 of C-1” mapping which produced a potential set of 35 features (continuous features + encoded categorical features). Note that the bias term is implicitly calculated, using (6), and not included in this variable count, although its trajectory is shown in Figure 3 and is the only non-zero parameter when is large as it represents the class priors. The error curve and parameter trajectory for a linear model are shown in Figures 2 and 3, respectively. Here the values are plotted against log(+1). As can be clearly seen, for a large range of sparse models, the error rate is approximately 100 and there is a single dominant parameter. The first two parameters to become non-zero, 28 and 29, are related to the 8th and 9th variables in the original data set. Both are binary variables and contain similar information. It is also worthwhile noting that 28 remains almost constant while new features are introduced into the classifier, thus demonstrating that it remains an important, stable parameters across the complete set of sparse, optimal classifiers. For small value of , the number of classification errors goes down from 100 to around 80, however this is at the expense of over 30 variables being introduced into the model and their values increasing significantly, so their relevance is questionable. When was small, a number of parameters started to reverse the sign as the regularization parameter was reduced. In practice, approximately 10 of the 35 parameters were subject to this effect and the behaviour of one such parameter is shown in Figure 4. It was interesting to note that this example of Simpson’s paradox was largely restricted to the parameters associated with the two categorical input variables which a large number of values. In addition, it only occurred when the number of active data points was becoming small due to the margin decreasing in size. It will be instructive to identify how prevalent Simpson’s paradox is on other classification problems and whether it is largely restricted to categorical inputs. It is important because it limits the qualitative interpretation of the parameter values. The advantage of using an approach such as this is that the properties for all the sparse models can be directly visualized and the behaviour of the parameters discussed. In addition, it was found that the approach was efficient, compared to re-training a single classifier, even though the approach generates every sparse classifier. However, in order to select a particular model, two critical points can be identified. When the first parameter is introduced into the classifier, the number of errors is substantially reduced to 100 and this is for a classifier that just contains one input. Alternatively, if a more complex classifier is required, when log( +1) = 1, the number of errors have been reduced to approximately 80, and this is before the onset of the effects of Simpson’s paradox. The parameter values are reasonably stable. So this may be considered an alternative classifier. Automatic model selection is currently being explored using this framework. Figure 2. Number of classification errors verses the transformed regularization parameter, log( +1). As the model becomes more complex and decreases, the number of errors on the training set decreases, but not monotonically. There is a large reduction when the first feature enters the classifier. Figure 3. A plot of the parameter trajectories verses the transformed regularization parameter, log(+1). Initially, a single feature/parameter dramatically decreases the number of classification errors, and the bias term reflects the non-zero class priors for the active data. Figure 4. A plot of 21 verses log(+1). When the parameter enters the classifer, it is initially positive, however when other features enter the classifier, it becomes negative. 6. Conclusions A novel and efficient iterative method for calculating the complete set of sparse classifiers has been proposed. It provides designers with more information about the sensitivity of the problem as they can immediately visualize how the parameters change as the regularization parameter is altered from infinity to zero. One advantage of using a basis pursuit classification framework is that there exists a global minimum when calculating the sparse feature set. In addition, when the regularization parameter is altered slightly, at most one of the classification parameter will change state. This fully supports a forwards selection (or alternatively backwards elimination) approach to optimal feature selection. The one major difference between using basis pursuit for regression or classification problems is the extra set of constraints imposed by the output thresholding for classification problems that determine the active set of data points. An important part of this work is to explore parameter interaction and the selection stability and an example of this was seen in the Australian credit data set when parameter values can change sign, and hence interpretation because the model complexity is altered slightly. Future work will investigate robust and efficient implementations of the iterative LP sub-problems, estimating the number of segments/knots in the set of sparse models, specifying non-uniform priors as well as investigating how the technique can be used for feature selection with non-linear kernels and developing error bar estimates for the sparse regularization framework. 7. References Bradley, P.S., Mangasarian, O.L.: Feature Selection via Concave Minimization and Support Vector Machines. Proc 15 th Intl. Conf. Machine Learning, Morgan Kaufmann, San Francisco, California, 82-90 (1998) Brown, M., Costen, N.: Exploratory Sparse Models for Face Recognition, BMVC, Norwich, 1, 13-22 (2003) Chen, S.S., Donoho, D.L., Saunders, M.A.: Atomic Decomposition by Basis Pursuit. SIAM Review 43(1) 129-159, (2001) Cristianini, N., Shawe-Taylor, J.: Support Vector Machines and other kernel-based learning methods. Cambridge University Press (2000) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R., Least Angle Regression. Annals of Statistics, to appear, (2003) Fung. G., Mangasarian, O.L.: Incremental Support Vector Machine Classification. Proc 2 nd SIAM Intl. Conf. Data Mining, 247-260 (2002) Hong, X., Sharkey, P.M., Warwick, K.: A robust nonlinear identification algorithm using PRESS statistic and forward regression, IEEE Trans on Neural Networks, Vol.14, No.2, pp454-458, (2003) MacKay, D.J.C. Introduction to Gaussian processes. In Bishop, C., (ed.) Neural Networks and Machine Learning. NATO Asi Series. Series F, Computer and Systems Sciences, Vol. 168, (1998) Michie, D.,Spiegelhalter, D.J., Taylor, C.C.: Machine Learning, Neural and Statistical Classification. Ellis Horwood, (1994) Rosset S, Zhu J, Hastie T and Tibshirani R. 1-Norm Support Vector Machines, accepted for NIPS 2003. Simpson, E.H. The Interpretation of Interaction in Contingency Tables," Journal of the Royal Statistical Society, Ser. B, 13, 238241, (1951) Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik V.: Feature Selection for SVMs”. In: Advances in Neural Information Processing Systems, NIPS, 13, 668-674, (2001)