Jordan Smith
MUMT 611
Written summary of classifiers
18 February 2008

A Review of Support Vector Machines

Abstract

A support vector machine (SVM) is a learning machine that can be used for classification problems (Cortes 1995) as well as for regression and novelty detection (Bennett 2000). SVMs look for the hyperplane that optimally separates two classes of data. Important features of SVMs are the absence of local minima, the well-controlled capacity of the solution (Cristianini 2000), and the ability to handle high-dimensional input data efficiently (Cortes 1995). The SVM is conceptually quite simple but also very powerful: despite its relative youth, it has performed well against other popular classifiers (Meyer 2002, 2003) and has been applied to problems in several fields, including music information retrieval.

1. Introduction

1.1 History

The support vector machine emerged quite recently, in the early 1990s, but it is the product of decades of research in computational learning theory by the Russian mathematicians Vladimir Vapnik and Alexey Chervonenkis. Their resulting theory, summarized in Vapnik's 1982 book Estimation of Dependences Based on Empirical Data, has been called Vapnik-Chervonenkis or VC theory (Vapnik 2006). That book describes the implementation of a support vector machine for linearly separable training data (Cortes 1995). Beginning in the early 1990s, researchers at Bell Labs made a number of important extensions to the SVM: in 1992, Boser, Guyon, and Vapnik proposed using Aizerman's kernel trick to classify data that may be separable only by polynomial or radial basis functions; in 1995, Cortes and Vapnik extended the theory to handle non-separable training data by using a cost function; finally, a method of support vector regression was developed in 1996 (Drucker et al.).

1.2 Summary

This very brief introduction to SVMs first describes, in historical order: the use of an SVM to classify linear, separable data; the use of a kernel function to make a non-linear classification; and the use of a cost function to allow for non-separable data. In section 3, a number of studies using SVMs are described, including several related to music information retrieval. Finally, some studies evaluating the performance of SVMs are summarized.

2. Support Vector Machines

2.1 Linear, separable data

The basic problem that an SVM learns and solves is a two-category classification problem. Following Bennett's discussion (2000), suppose we have a set of l observations. Each observation can be represented by a pair {xi, yi} where xi ∈ R^N and yi ∈ {-1, 1}. That is, each observation contains an N-dimensional vector x and a class assignment y. Our goal is to find the optimal separating hyperplane; that is, the flat (N-1)-dimensional surface that best separates the data. For the time being we assume that a separating hyperplane exists and is defined by the normal vector w. On either side of this plane we construct a pair of parallel planes such that:

w·xi ≥ b + 1 for yi = 1
w·xi ≤ b – 1 for yi = -1

where b indicates the offset of the plane from the origin. This situation is pictured in Figure 1, where the separating plane is the solid line and the two parallel planes are the dashed lines. The dashed lines 'push up' against some of the training data points: these points are called 'support vectors,' and in fact they completely determine the solution.

Figure 1. Two data sets, represented by squares and circles, are separated by two parallel hyperplanes subtended by support vectors (circled). The distance between these planes – the margin – is the quantity maximized by an SVM. The solid line is the optimal separating hyperplane.

The gap between these parallel planes is called the margin, and we wish to maximize the size of this gap. In terms of w, maximizing the margin is equivalent to minimizing:

½||w||²

subject to the constraint:

yi (w·xi – b) ≥ 1

The solution can be obtained using Lagrange multipliers (Burges 1998).
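For completeness, that Lagrangian step can be sketched as follows. This is the standard derivation found in tutorials such as Burges (1998), condensed here rather than reproduced from any single source:

```latex
% Primal problem from section 2.1:
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad y_{i}\,(w\cdot x_{i}-b)\ \ge\ 1 .

% Introduce multipliers \alpha_i \ge 0 and form the Lagrangian:
L(w,b,\alpha)\ =\ \tfrac{1}{2}\lVert w\rVert^{2}
 \;-\; \sum_{i}\alpha_{i}\bigl[\,y_{i}\,(w\cdot x_{i}-b)-1\,\bigr] .

% Setting the derivatives with respect to w and b to zero gives
w \;=\; \sum_{i}\alpha_{i}\,y_{i}\,x_{i} ,
\qquad
\sum_{i}\alpha_{i}\,y_{i} \;=\; 0 ,

% and substituting back yields the dual problem that is actually solved:
\max_{\alpha\,\ge\,0}\ \sum_{i}\alpha_{i}
 \;-\;\tfrac{1}{2}\sum_{i,j}\alpha_{i}\,\alpha_{j}\,y_{i}\,y_{j}\,(x_{i}\cdot x_{j})
\quad\text{subject to}\quad \sum_{i}\alpha_{i}\,y_{i}=0 .

% The training points with \alpha_i > 0 are exactly the support vectors.
```

Notice that the training data enter the dual problem only through the dot products xi·xj; this is what makes the kernel substitution of the next section possible.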
2.2 Kernel functions

Often, a non-linear decision surface is required to separate the data. To repeat the above steps and maximize the separation between two non-linear functions can be computationally expensive. Instead, the kernel trick is used: the input data are mapped into a higher-dimensional feature space via a specified kernel function, and the data become linearly separable in that higher-dimensional space. Furthermore, if a good kernel function is selected, the dot product is preserved in the feature space (Cortes 1995), so that the mathematical approach outlined in section 2.1 is still applicable. The kernel functions that have been used and whose properties have been studied most extensively are linear and polynomial functions, the radial-basis function, and the sigmoid function (Sherrod 2008).

Figure 2. Visualization of the kernel trick. Input data are mapped into a higher-dimensional feature space via a specified kernel function; the data are linearly separable in the higher-dimensional space. Source: Holbrey, R. "Dimension Reduction Algorithms for Data Mining and Visualization." <http://www.comp.leeds.ac.uk/richardh/astro/index.html> Accessed 12 February 2008.

2.3 Non-separable data

A method of accommodating errors and outliers in the input data was developed in 1995 (Cortes), and can be implemented simply by allowing each training point to violate the margin by an amount ξi (resulting in a 'fuzzy margin') and adding a cost term weighted by C to the optimization equation (Burges 1998). We then want to minimize:

½||w||² + C·(Σ ξi)

subject to the constraint:

yi (w·xi – b) + ξi ≥ 1

(Bennett 2000). This is substantially harder to solve than the separable case. In Chang and Lin's LIBSVM manual, the minimization conditions, constraints, and resulting decision functions are defined for each type of classification, along with algorithms for solving the required quadratic programming problems (2007).
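To make the last two sections concrete, the following is a minimal sketch of training a soft-margin, kernelized SVM classifier in Python with scikit-learn's SVC class, which is built on the LIBSVM library cited above; the toy data, kernel choice, and parameter values are purely illustrative:

```python
# Minimal sketch of a soft-margin, kernelized SVM (sections 2.2 and 2.3).
# scikit-learn's SVC wraps LIBSVM; data and parameter values are illustrative.
import numpy as np
from sklearn.svm import SVC

# Toy two-class training set: each x_i is in R^2, each y_i is in {-1, +1}.
X = np.array([[0.0, 0.1], [0.2, 0.3], [0.3, 0.0],
              [1.0, 1.1], [1.2, 0.9], [0.9, 1.3]])
y = np.array([-1, -1, -1, 1, 1, 1])

# kernel: 'linear', 'poly', 'rbf', or 'sigmoid' (the functions named in 2.2).
# C: the cost weighting the slack terms xi_i from section 2.3; a large C
#    penalizes margin violations heavily, a small C allows a 'fuzzier' margin.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

print(clf.support_vectors_)        # the support vectors found during training
print(clf.predict([[0.15, 0.2]]))  # class prediction for a new observation
```

In practice the kernel and the value of C are usually chosen by cross-validation, along the lines recommended in Hsu, Chang, and Lin's practical guide (2007).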
3. Studies using SVMs

3.1 Applications

Throughout his early papers, Vapnik often used optical character recognition as an experimental example application (Boser 1992; Cortes 1995; Schölkopf 1996; see also Sebastiani 1999 and Joachims 1997). Since then, many authors have used SVMs to develop classifiers in other disciplines: see, for instance, the work on face detection by Osuna et al. (1997b) or on gene expression data by Brown et al. (2000). In the field of music information retrieval, Dhanaraj and Logan used SVMs in their automatic identification of hit songs based on lyrics and acoustic features (2005); Laurier and Herrera submitted a second-place-finishing mood classifier to MIREX 2007 that relied on SVMs and acoustic features; and Meng used SVMs at multiple stages in his dissertation, first to perform temporal feature integration and second to perform automatic genre identification based on these features (2006). Both Mandel (2005, 2006) and Xu (2003) have studied musical genre classification using SVMs based on acoustic features.

The free software package LIBSVM is a library of tools for implementing various types of SVMs (Chang 2007), while DTREG can implement a number of predictive models, from SVMs to various types of neural nets and decision trees (Sherrod 2008).

3.2 Performance

According to Vapnik, his SVM hand-written digit classifier easily outperforms state-of-the-art classifiers based on other learning routines. However, since their rise in popularity in the 1990s, SVMs have been the object of closer scrutiny: a study by Meyer concluded that although SVMs performed very well in classification and regression tasks, other methods proved equally competitive (2002). The two-category classification problem is the classic problem to study analytically, but in practice more than two categories must often be distinguished. Hsu and Lin (2002) compared the performance of various methods of combining binary classifiers, concluding that one-against-one and the 'directed acyclic graph SVM' performed better than one-against-all.
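As a rough illustration of the combination schemes Hsu and Lin compare, the sketch below builds one-against-one and one-against-all classifiers from binary SVMs using scikit-learn's generic multiclass wrappers; the data and parameters are again illustrative only:

```python
# Sketch of multiclass strategies built from binary SVMs (cf. Hsu and Lin 2002).
# Illustrative only: toy data and arbitrary parameter values.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X = np.array([[0.0, 0.0], [0.1, 0.2],   # class 0
              [1.0, 1.0], [1.1, 0.9],   # class 1
              [2.0, 0.1], [2.1, 0.2]])  # class 2
y = np.array([0, 0, 1, 1, 2, 2])

# One-against-one: trains k(k-1)/2 binary SVMs, one for each pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

# One-against-all: trains k binary SVMs, each separating one class from the rest.
ova = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

print(ovo.predict([[1.05, 0.95]]))  # prediction from the one-against-one scheme
print(ova.predict([[1.05, 0.95]]))  # prediction from the one-against-all scheme
```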
Bibliography

Bennett, K., and C. Campbell. 2000. "Support vector machines: Hype or hallelujah?" Special Interest Group on Knowledge Discovery and Data Mining Explorations 2(2): 1–13.

Boser, B., I. Guyon, and V. Vapnik. 1992. "A training algorithm for optimal margin classifiers." Proceedings of the 5th Annual Workshop on Computational Learning Theory. 144–52.

Brown, M., W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares Jr., and D. Haussler. 2000. "Knowledge-based analysis of microarray gene expression data by using support vector machines." Proceedings of the National Academy of Sciences 97: 262–67.

Burges, C. 1998. "A tutorial on support vector machines for pattern recognition." Data Mining and Knowledge Discovery 2(2): 955–74.

Chang, C., and C. Lin. 2007. "LIBSVM: A library for support vector machines." Manual for software, available online: <http://www.csie.ntu.edu.tw/~cjlin/libsvm/>

Cortes, C., and V. Vapnik. 1995. "Support-vector networks." Machine Learning 20(3): 273–97.

Cristianini, N., and J. Shawe-Taylor. 2000. "Chapter 6: Further reading and advanced topics." In An Introduction to Support Vector Machines. Cambridge: Cambridge University Press. <http://www.support-vector.net/chapter_6.html>

Dhanaraj, R., and B. Logan. 2005. "Automatic prediction of hit songs." Proceedings of the International Conference on Music Information Retrieval, London, UK. 488–91.

Drucker, H., C. Burges, L. Kaufman, A. Smola, and V. Vapnik. 1996. "Support vector regression machines." Advances in Neural Information Processing Systems 9. Cambridge: MIT Press. 155–61.

Hsu, C., and C. Lin. 2002. "A comparison of methods for multiclass support vector machines." IEEE Transactions on Neural Networks 13(2): 415–25.

Hsu, C., C. Chang, and C. Lin. 2007. "A practical guide to support vector classification." <http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf>

Joachims, T. 1997. "Text categorization with support vector machines: Learning with many relevant features." Springer Lecture Notes in Computer Science 1398: 137–42.

Laurier, C., and P. Herrera. 2007. "Audio music mood classification using support vector machine." Proceedings of the 8th International Conference on Music Information Retrieval.

Mandel, M., and D. Ellis. 2005. "Song-level features and support vector machines for music classification." Proceedings of the 6th International Conference on Music Information Retrieval. 594–99.

Mandel, M., G. Poliner, and D. Ellis. 2006. "Support vector machine active learning for music retrieval." Multimedia Systems 12(1): 3–13.

Meng, A. 2006. "Temporal feature integration for music organization." PhD diss., Technical University of Denmark.

Meyer, D., F. Leisch, and K. Hornik. 2002. "Benchmarking support vector machines." Report Series SFB, Adaptive Information Systems and Modelling in Economics and Management Science. 78.

Meyer, D., F. Leisch, and K. Hornik. 2003. "The support vector machine under test." Neurocomputing 55: 169–86.

Osuna, E., R. Freund, and F. Girosi. 1997a. "An improved training algorithm for support vector machines." Proceedings of the IEEE Workshop on Neural Networks for Signal Processing. 276–85.

Osuna, E., R. Freund, and F. Girosi. 1997b. "Training support vector machines: An application to face detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 130–7.

Schölkopf, B., C. Burges, and V. Vapnik. 1996. "Incorporating invariances in support vector learning machines." Springer Lecture Notes in Computer Science 1112: 47–52.

Sebastiani, F. 1999. "Machine learning in automated text categorization." Technical Report, Consiglio Nazionale delle Ricerche, Pisa, Italy. 1–59.

Sherrod, P. 2008. "DTREG Predictive Modeling Software." Manual for software, available online: <www.dtreg.com>

Smola, A., and B. Schölkopf. 1998. "A tutorial on support vector regression." NeuroCOLT2 Technical Report NC2-TR-1998-030. Holloway College, London.

Vapnik, V. 2006. "Empirical inference science." Afterword in the reprint of Estimation of Dependences Based on Empirical Data (1982).

Xu, C., N. Maddage, X. Shao, F. Cao, and Q. Tian. 2003. "Musical genre classification using support vector machines." Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing 5: 429–32.