Jordan Smith
MUMT 611
Written summary of classifiers
18 February 2008
A Review of Support Vector Machines
Abstract
A support vector machine (SVM) is a learning machine that can be used for classification
problems (Cortes 1995) as well as for regression and novelty detection (Bennett 2000). SVMs
look for the hyperplane that optimally separates two classes of data. Important features of SVMs
are the absence of local minima, the well-controlled capacity of the solution (Cristianini 2000), and the ability to handle high-dimensional input data efficiently (Cortes 1995). The SVM is conceptually quite simple but also very powerful: even in its infancy, it has performed well against other popular classifiers (Meyer 2002, 2003), and it has been applied to problems in several fields, including music information retrieval.
1. Introduction
1.1 History
The support vector machine was developed quite recently, emerging only in the early
1990s. However, it is also the product of decades of research in computational learning theory by
Russian mathematicians Vladimir Vapnik and Alexey Chervonenkis. Their resulting theory,
summarized in Vapnik’s 1982 book Estimation of Dependences Based on Empirical Data, has
been called Vapnik-Chervonenkis or VC theory (Vapnik 2006). That book describes the
implementation of a support vector machine for linearly separable training data (Cortes 1995).
Beginning in the early 1990s, researchers at Bell Labs made a number of important extensions to the SVM: in 1992, Boser, Guyon, and Vapnik proposed using Aizerman's kernel trick to classify data that may be separable only by polynomial or radial basis functions; in 1995, Vapnik and
Cortes extended the theory to handle non-separable training data by using a cost function; finally,
a method of support vector regression was developed in 1996 (Drucker).
1.2 Summary
This very brief introduction to SVMs will first describe, in historical order, the case of using a SVM to classify linear, separable data; the case of using a kernel function to make a
non-linear classification; and the case of using a cost function to allow for non-separable data. In
section 3, a number of studies using SVMs will be described, including several related to music
information retrieval. Finally, some studies evaluating the performance of SVMs are
summarized.
2. Support Vector Machines
2.1 Linear, separable data
The basic problem that a SVM learns and solves is a two-category classification problem.
Following the method of Bennett’s discussion (2000), suppose we have a set of l observations.
Each observation can be represented by a pair {xi, yi}, where xi ∈ R^N and yi ∈ {-1, 1}. That is, each observation contains an N-dimensional vector x and a class assignment y. Our goal is to find the optimal separating hyperplane; that is, the flat (N-1)-dimensional surface that best separates the data.

Figure 1. Two data sets, represented by squares and circles, are separated by two parallel hyperplanes subtended by support vectors (circled). The distance between these planes (the margin) is the quantity maximized by a SVM. The solid line is the optimal separating hyperplane.

For the time being we assume that a separating hyperplane exists, and is defined by the normal vector w. On either side of this plane we construct a pair of parallel planes such that:
w·xi ≥ b + 1 for yi = 1
w·xi ≤ b – 1 for yi = -1
where b indicates the offset of the plane from the origin. This situation is pictured in Figure 1, where the separating plane is the solid line and the two parallel planes are the dashed lines. The dashed lines ‘push up’ against some of the training data points: these points are called ‘support vectors,’ and in fact they completely determine the solution. The gap between these lines is called the margin, and we wish to maximize the size of this gap. Since the margin equals 2/||w||, maximizing it is equivalent to minimizing:
½||w||²
subject to the constraint:
yi (w·xi – b) ≥ 1
The solution can be obtained using Lagrange multipliers (Burges 1998).
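As a rough illustration of this formulation (not part of the sources reviewed here), the short Python sketch below fits a linear SVM to a small, invented, linearly separable data set using scikit-learn, whose SVC class wraps LIBSVM. Note that scikit-learn writes the decision function as w·x + b rather than w·x – b, and that a very large C is used to approximate the hard-margin case.

    # A minimal sketch, assuming scikit-learn is available; the toy data
    # points and variable names are invented for illustration.
    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable classes in R^2, labelled -1 and +1.
    X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                  [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])
    y = np.array([-1, -1, -1, 1, 1, 1])

    clf = SVC(kernel="linear", C=1e6)   # very large C approximates a hard margin
    clf.fit(X, y)

    w = clf.coef_[0]                    # normal vector of the separating hyperplane
    b = clf.intercept_[0]               # offset (scikit-learn uses w.x + b)
    print("margin =", 2.0 / np.linalg.norm(w))
    print("support vectors:\n", clf.support_vectors_)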
2.2 Kernel functions
Often, a non-linear solution surface is required to separate data. Repeating the above steps to maximize the separation using non-linear functions can be computationally expensive. Instead, the kernel trick is used: input data are mapped into a higher dimensional feature space via a specified kernel function, and the data are linearly separable in the higher dimensional space. Furthermore, if a good kernel function is selected, the dot product will be preserved in the feature space (Cortes 1995), so that the mathematical approach outlined in section 2.1 is still applicable. The important kernel functions that have been used and whose properties have been studied most extensively are linear and polynomial functions, the radial-basis function, and the sigmoid function (Sherrod 2008).

Figure 2. Visualization of the kernel trick. Input data are mapped into a higher dimensional feature space using a kernel function, resulting in linearly separable training data. Source: Holbrey, R. “Dimension Reduction Algorithms for Data Mining and Visualization.” <http://www.comp.leeds.ac.uk/richardh/astro/index.html> Accessed 12 February 2008.
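As a hedged illustration of the effect of these kernels (the data set and parameter values below are invented and not drawn from the cited studies), the following sketch trains a SVM with each of the four kernels on data that no linear boundary can separate, two concentric rings, and reports cross-validated accuracy; the non-linear kernels should clearly outperform the linear one.

    # A sketch assuming scikit-learn; make_circles generates two concentric
    # rings, which are not linearly separable in the original input space.
    from sklearn.datasets import make_circles
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

    for kernel in ("linear", "poly", "rbf", "sigmoid"):
        clf = SVC(kernel=kernel, gamma="scale")
        score = cross_val_score(clf, X, y, cv=5).mean()
        print(kernel, round(score, 2))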
2.3 Non-separable data
A method of accommodating errors and outliers in the input data was developed in 1995 (Cortes), and can be implemented simply by allowing each training point an error of up to ξi (resulting in a ‘soft margin’) and adding a cost term, weighted by a parameter C, to the optimization problem (Burges). We then want to minimize:
½||w||² + C·(Σ ξi)
subject to the constraints:
yi (w·xi – b) + ξi ≥ 1, with ξi ≥ 0
(Bennett 2000). This is substantially harder to solve than the separable case. In Chang and Lin’s
LIBSVM manual, the minimization conditions, constraints, and resulting decision functions are
defined for each type of classification, along with algorithms for solving the required quadratic
programming problems (2007).
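To make the role of the cost parameter concrete, the sketch below (invented overlapping data; scikit-learn's LIBSVM-based SVC) varies C and reports the resulting margin width and number of support vectors: a small C tolerates more margin violations and gives a wider margin, while a large C penalizes errors heavily and narrows it.

    # A minimal sketch, assuming scikit-learn; two overlapping Gaussian
    # clusters stand in for non-separable training data.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
                   rng.normal(loc=+1.0, size=(50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        margin = 2.0 / np.linalg.norm(clf.coef_[0])
        print("C =", C, " margin =", round(margin, 3),
              " support vectors =", len(clf.support_))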
3. Studies using SVMs
3.1 Applications
Throughout his early papers, Vapnik often used optical text recognition as an
experimental example application (Boser 1992, Cortes 1995, Schölkopf 1996). (See also
Sebastiani 1999, Joachims 1997.) Since then, many authors have used SVMs to develop
classifiers in other disciplines: see, for instance, the work on face detection by Osuna et al.
(1997b) or on gene expression data by Brown et al. (2000). In the field of music information
retrieval, Dhanaraj and Logan used SVMs in their automatic identification of hit songs based on
lyrics and acoustic features (2005), Laurier and Herrera submitted a second-place finishing mood
classifier to MIREX 2007 that relied on SVMs and acoustic features, and Meng used SVMs at
multiple stages in his dissertation: first to perform temporal feature integration and second to
perform automatic genre identification based on these features (2006). Both Mandel (2005, 2006)
and Xu (2003) have studied musical genre classification using SVMs based on acoustic features.
The free software package LIBSVM is a library of tools for implementing various types of SVMs (Chang 2007), while DTREG can implement a number of predictive models, from SVMs
to various types of neural nets and decision trees (Sherrod 2008).
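As a small, hedged example of the LIBSVM tools just mentioned, the sketch below uses LIBSVM's bundled Python interface (svmutil); the exact import path depends on how the library is installed, and the four labelled points are invented purely for illustration.

    # Assumes the "libsvm" Python package; '-t 2' selects the RBF kernel and
    # '-c 1' sets the cost parameter C described in section 2.3.
    from libsvm.svmutil import svm_train, svm_predict

    y = [-1, -1, 1, 1]                          # class labels
    x = [{1: 0.1, 2: 0.2}, {1: 0.2, 2: 0.1},    # sparse feature vectors
         {1: 0.9, 2: 1.0}, {1: 1.0, 2: 0.9}]

    model = svm_train(y, x, '-t 2 -c 1')
    labels, accuracy, values = svm_predict(y, x, model)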
3.2 Performance
According to Vapnik, his SVM hand-written digit classifier easily outperforms state-of-the-art classifiers based on other learning routines. However, since their rise in popularity in the 1990s, SVMs have been the object of closer scrutiny: a study by Meyer concluded that although SVMs performed very well in classification and regression tasks, other methods were just as competitive (2002).
While the two-category classification problem is the classic problem to study analytically, in practice more categories must often be distinguished. Hsu (2002) compared the performance of
various methods of combining binary classifiers, concluding that one-against-one and ‘directed
acyclic graph SVM’ were better than one-against-all.
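As an illustration of these multiclass strategies (using scikit-learn's generic wrappers rather than the authors' implementations, and a standard three-class data set chosen only for convenience), the sketch below compares one-against-one and one-against-all combinations of binary SVM classifiers.

    # A hedged sketch assuming scikit-learn; Hsu and Lin's experiments used
    # different data sets and a dedicated DAG-SVM implementation.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)   # three classes

    for name, Wrapper in (("one-against-one", OneVsOneClassifier),
                          ("one-against-all", OneVsRestClassifier)):
        clf = Wrapper(SVC(kernel="rbf", gamma="scale"))
        print(name, round(cross_val_score(clf, X, y, cv=5).mean(), 3))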
Bibliography
Bennett, K., and C. Campbell. 2000. “Support vector machines: Hype or hallelujah?” Special
Interest Group on Knowledge Discovery and Data Mining Explorations. 2(2): 1–13.
Boser, B., I. Guyon, and V. Vapnik. 1992. “A training algorithm for optimal margin classifiers.”
Proceedings of the 5th Annual Workshop on Computational Learning Theory. 144–52.
Brown, M., W. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. Furey, M. Ares Jr., and D. Haussler. 2000. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proceedings of the National Academy of Sciences. 97: 262–267.
Burges, C. 1998. “A tutorial on support vector machines for pattern recognition.” Data Mining
and Knowledge Discovery. 2(2): 955–74.
Chang, C., and C. Lin. 2007. “LIBSVM: a library for support vector machines.” Manual for
software available online: <http://www.csie.ntu.edu.tw/~cjlin/libsvm/>
Cristianini, N., and J. Shawe-Taylor. 2000. Chapter 6: Further reading and advanced topics. In An
Introduction to Support Vector Machines. Cambridge: Cambridge University Press.
<http://www.support-vector.net/chapter_6.html>
Cortes, C., and V. Vapnik. 1995. Support-vector networks. Machine Learning. 20(3): 273–297.
Dhanaraj, R., and B. Logan. 2005. “Automatic prediction of hit songs.” International Conference
on Music Information Retrieval, London UK. 488–91.
Drucker, H., C. Burges, L. Kaufman, A. Smola, and V. Vapnik. 1996. Support vector regression machines. Advances in Neural Information Processing Systems 9. Cambridge: MIT Press. 155–61.
Hsu, C., and C. Lin. 2002. A comparison of methods for multiclass support vector machines.
IEEE Transactions on Neural Networks. 13(2): 415–425.
Hsu, C., C. Chang, and C. Lin. 2007. A practical guide to support vector classification.
<http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf>
Joachims, T. 1997. Text categorization with support vector machines: Learning with many
relevant features. Springer Lecture Notes in Computer Science. 1398: 137–42.
Laurier, C., and P. Herrera. 2007. “Audio music mood classification using support vector machine.” Proceedings of the 8th International Conference on Music Information Retrieval.
Mandel, M., and D. Ellis. 2005. Song-level features and support vector machines for music classification. Proceedings of the 6th International Conference on Music Information Retrieval. 594–599.
Mandel, M., G. Poliner, and D. Ellis. 2006. Support vector machine active learning for music
retrieval. Multimedia Systems. 12(1): 3–13.
Meng, A. 2006. Temporal feature integration for music organization. PhD diss., Technical
University of Denmark.
Meyer, D., F. Leisch, and K. Hornik. 2002. Benchmarking support vector machines. Report
Series SFB, Adaptive Information Systems and Modelling in Economics and Management
Science. 78.
Meyer, D., F. Leisch, and K. Hornik. 2003. The support vector machine under test.
Neurocomputing. 55: 169–86.
Osuna, E., R. Freund, and F. Girosi. 1997a. An improved training algorithm for support vector
machines. Proceedings of the IEEE Workshop on Neural Networks for Signal Processing.
276–85.
Osuna, E., R. Freund, and F. Girosi. 1997b. Training support vector machines: an application to
face detection. Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 130–7.
Schölkopf, B., C. Burges, and V. Vapnik. 1996. Incorporating invariances in support vector
learning machines. Springer Lecture Notes in Computer Science. 1112: 47–52.
Sebastiani, F. 1999. Machine learning in automated text categorization. Technical Report,
Consiglio Nazionale delle Ricerche. Pisa, Italy. 1–59.
Sherrod, P. 2008. “DTREG Predictive Modeling Software.” Manual for software available
online: <www.dtreg.com>
Smola, A., and B. Schölkopf. 1998. A tutorial on support vector regression. NeuroCOLT2
Technical Report NC2-TR-1998-030. Holloway College, London.
Vapnik, V. 2006. Empirical Inference Science. Afterword in 1982 reprint of Estimation of
Dependences Based on Empirical Data.
Xu, C., N. Maddage, X. Shao, F. Cao, and Q. Tian. 2003. Musical genre classification using
support vector machines. Proceedings of IEEE International Conference on Acoustics,
Speech, and Signal Processing. 5: 429–32.