Pattern Recognition Methods Part 1: Kernel Methods & SVM
London, UK, 21 May 2012
Janaina Mourao-Miranda, Machine Learning and Neuroimaging Lab, University College London, UK

Outline
• Pattern Recognition: Concepts & Framework
• Kernel Methods for Pattern Analysis
• Support Vector Machine (SVM)

Pattern Recognition Concepts
• Pattern recognition aims to assign a label to a given pattern (test example) based either on a priori knowledge or on statistical information extracted from previously seen patterns (training examples).
• The patterns to be classified are usually groups of measurements or observations (e.g. a brain image), defining points in an appropriate multidimensional vector space.

Types of learning procedure:
• Supervised learning (currently implemented in PRoNTo): the training data consist of pairs of inputs (typically vectors, X) and desired outputs or labels (y). The output of the learned function can be a continuous (regression) or a discrete (classification) value.
• Unsupervised learning: the training data are not labeled. The aim is to find inherent patterns in the data that can then be used to determine the correct output value for new data examples (e.g. clustering).
• Semi-supervised learning, reinforcement learning.

Pattern Recognition Framework
• Input (brain scans): X1, X2, X3, ...; Output (control/patient): y1, y2, y3, ...
• No mathematical model of the mapping is available, so we use a machine learning methodology: computer-based procedures that learn a function from a series of examples.
• Learning/Training phase: given training examples (X1, y1), ..., (Xs, ys), generate a function or classifier f such that f(Xi) -> yi.
• Testing phase: given a test example Xi, output the prediction f(Xi) = yi.

Standard Statistical Analysis (mass-univariate)
• Input: the BOLD signal time series at each voxel.
• Voxel-wise GLM model estimation, followed by an independent statistical test at each voxel and correction for multiple comparisons.
• Output: a univariate statistical parametric map.

Pattern Recognition Analysis (multivariate)
• Training phase: input volumes from task 1 and volumes from task 2; output a multivariate map (the classifier's or regression's weights).
• Test phase: input a new example; output a prediction y = {+1, -1} or p(y = 1|X,θ), e.g. +1 = patients and -1 = healthy controls.

How to extract features from the fMRI?
• Each brain volume (fMRI/sMRI) is a 3D matrix of voxels, which is reshaped into a feature vector whose dimensionality equals the number of voxels.

Binary classification: finding a hyperplane
[Figure: volumes from task 1 and task 2 plotted as points in a space whose axes are voxel 1 and voxel 2; a hyperplane separates the two classes, and a volume from a new subject is classified according to the side on which it falls.]
• Linear classifiers (hyperplanes) are parameterized by a weight vector w and a bias term b.
• Different classifiers/algorithms will compute different hyperplanes.

• In neuroimaging applications the dimensionality of the data is often greater than the number of examples (ill-conditioned problems).
• Possible solutions:
– Feature selection strategies (e.g. ROIs, selecting only activated voxels)
– Searchlight
– Kernel methods

Kernel Methods for Pattern Analysis
The kernel methodology provides a powerful and unified framework for investigating general types of relationships in the data (e.g. classification, regression, etc.).
Kernel methods consist of two parts:
• Computation of the kernel matrix (mapping into the feature space).
• A learning algorithm based on the kernel matrix (designed to discover linear patterns in the feature space).
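As an illustration of the first part, the sketch below computes a linear kernel matrix from vectorized brain volumes. It is a minimal sketch, not PRoNTo code: the data are a random stand-in, and the variable names (volumes, kernel) are illustrative assumptions.

```python
import numpy as np

# Assume each brain volume has already been masked and reshaped into a
# feature vector, giving an array of shape (n_examples, n_voxels).
rng = np.random.default_rng(0)
volumes = rng.standard_normal((20, 50000))   # illustrative stand-in data

# Linear kernel: K[i, j] is the dot product (similarity) between example i
# and example j. This n_examples x n_examples matrix is all a kernel
# method needs, regardless of the number of voxels.
kernel = volumes @ volumes.T

print(kernel.shape)   # (20, 20)
print(kernel[0, 1])   # similarity between the first two volumes
```

Note the size of the kernel matrix depends only on the number of examples, which is why kernel methods are attractive when the number of voxels far exceeds the number of scans.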
Advantages:
• They represent a computational shortcut that makes it possible to represent linear patterns efficiently in high-dimensional spaces.
• Using the dual representation with proper regularization* enables an efficient solution of ill-conditioned problems.
* e.g. restricting the choice of functions to favor functions that have small norm.

Examples of Kernel Methods
• Support Vector Machines (SVM)
• Gaussian Processes (GPs)
• Kernel Ridge Regression (KRR)
• Relevance Vector Regression (RVR)
• Etc.

Kernel Matrix (similarity measure)
• A kernel is a function that, given two patterns x and x*, returns a real number characterizing their similarity.
• A simple type of similarity measure between two vectors is the dot product (linear kernel).
• Example: brain scan 1 with feature vector (4, 1) and brain scan 2 with feature vector (-2, 3); dot product = (4*-2) + (1*3) = -5.

Nonlinear Kernels (original space -> feature space)
• Nonlinear kernels are used to map the data to a higher-dimensional space in an attempt to make it linearly separable.
• The kernel trick enables the computation of similarities in the feature space without having to compute the mapping explicitly.

Advantages of linear models:
• Neuroimaging data are extremely high-dimensional and the sample sizes are very small, therefore non-linear kernels often don't bring any benefit.
• Linear models reduce the risk of overfitting the data and allow direct extraction of the weight vector as an image (i.e. a discrimination map).

Linear classifiers
• Linear classifiers (hyperplanes) are parameterized by a weight vector w and a bias term b.
• The weight vector can be expressed as a linear combination of the training examples xi (where i = 1,...,N and N is the number of training examples):
w = Σi αi xi

How to make predictions?
• The general equation for making predictions for a test example x* with kernel methods is
f(x*) = w⊤x* + b                                   (primal representation)
f(x*) = Σi αi xi⊤x* + b = Σi αi K(xi, x*) + b      (dual representation)
• Here f(x*) is the predicted score for regression or the distance to the decision boundary for classification models.

How to interpret the weight vector?
• Training: examples of class 1 and examples of class 2 (each described by voxel 1 and voxel 2) yield a weight vector or discrimination map, e.g. w1 = +5, w2 = -3.
• The weight vector is a spatial representation of the decision function. It is a multivariate pattern -> no local inferences should be made!
• Testing: for a new example with v1 = 0.5 and v2 = 0.8, f(x) = (w1*v1 + w2*v2) + b = (+5*0.5 - 3*0.8) + 0 = 0.1. A positive value -> class 1.

Example of Kernel Methods (1): Support Vector Machine

Support Vector Machines (SVMs)
• A classifier derived from statistical learning theory by Vapnik et al. in 1992.
• SVM became famous when, using images as input, it gave accuracy comparable to neural networks with hand-designed features in a handwriting recognition task.
• Currently, SVM is widely used in object detection & recognition, content-based image retrieval, text recognition, biometrics, speech recognition, neuroimaging, etc.
• It is also used for regression.

Largest Margin Classifier
• Among all hyperplanes separating the data there is a unique optimal hyperplane: the one with the largest margin (the distance of the closest points to the hyperplane).
• Let us consider that all test points are generated by adding bounded noise (r) to the training examples (test and training data are assumed to have been generated by the same underlying dependence).
• If the optimal hyperplane has margin > r it will correctly separate the test points.
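The equivalence between the primal and dual predictions above can be checked numerically. Below is a minimal sketch with toy data and illustrative names (X_train, alphas, x_star), none of which come from the slides: it builds the weight vector w = Σi αi xi explicitly and verifies that the primal score w⊤x* + b equals the kernel-based dual score Σi αi K(xi, x*) + b for a linear kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.standard_normal((10, 500))   # 10 training examples, 500 voxels
alphas = rng.standard_normal(10)           # example dual coefficients
b = 0.2
x_star = rng.standard_normal(500)          # test example

# Primal route: build the weight vector explicitly, then take a dot product.
w = alphas @ X_train                       # w = sum_i alpha_i * x_i
primal_score = w @ x_star + b

# Dual route: use only kernel evaluations K(x_i, x*) = <x_i, x*>.
k_star = X_train @ x_star                  # vector of K(x_i, x*)
dual_score = alphas @ k_star + b

# Both routes give the same prediction for a linear kernel.
assert np.isclose(primal_score, dual_score)
print(primal_score, dual_score)
```

The primal route is what lets the weight vector be mapped back into voxel space as a discrimination map; the dual route is what the kernel machinery actually computes.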
Linearly separable case (Hard Margin SVM)
[Figure: separating hyperplane with normal vector w; the canonical margins satisfy (w⊤x + b) = +1 and (w⊤x + b) = -1, and points are classified by the sign of (w⊤x + b).]
• We assume that the data are linearly separable, that is, there exist w ∈ ℝᵈ and b ∈ ℝ such that yi(w⊤xi + b) > 0, i = 1,...,m.
• Rescaling w and b such that the point(s) closest to the hyperplane satisfy |w⊤xi + b| = 1, we obtain the canonical form of the hyperplane, satisfying yi(w⊤xi + b) ≥ 1.
• The distance ρx(w,b) of a point x from a hyperplane Hw,b is given by ρx = |w⊤x + b| / ||w||.
• The quantity 1/||w|| is the margin of the Optimal Separating Hyperplane.
• Our aim is to find the largest margin hyperplane.

Optimization problem
minimize (1/2)||w||² over (w, b)
subject to yi(w⊤xi + b) ≥ 1, i = 1,...,m
This is a quadratic problem with a unique optimal solution.

• The solution of this problem is equivalent to determining the saddle point of the Lagrangian function
L(w, b, α) = (1/2)||w||² - Σi αi [yi(w⊤xi + b) - 1]
where αi ≥ 0 are the Lagrange multipliers.
• We minimize L over (w, b) and maximize over α. Differentiating L w.r.t. w and b we obtain:
w = Σi αi yi xi   and   Σi αi yi = 0
• Substituting w in L leads to the dual problem
maximize Σi αi - (1/2) α⊤Aα   subject to αi ≥ 0 and Σi αi yi = 0
where A is an m × m matrix with Aij = yi yj xi⊤xj.
• Note that the complexity of this problem depends on m (the number of examples), not on the number of input components d (the number of dimensions).
• If α is a solution of the dual problem then the solution (w, b) of the primal problem is given by
w = Σi αi yi xi
• Note that w is a linear combination of only the xi for which αi > 0. These xi are called support vectors (SVs).
• The parameter b can be determined by b = yj - w⊤xj, where xj corresponds to a SV.
• A new point x is classified as
y = sign(w⊤x + b) = sign(Σi αi yi xi⊤x + b)

Some remarks
• The fact that the Optimal Separating Hyperplane is determined only by the SVs is most remarkable. Usually, the support vectors are a small subset of the training data.
• All the information contained in the data set is summarized by the support vectors. The whole data set could be replaced by only these points and the same hyperplane would be found.

Linearly non-separable case (Soft Margin SVM)
• If the data are not linearly separable, the previous analysis can be generalized by looking at the problem
minimize (1/2)||w||² + C Σi ξi over (w, b, ξ)
subject to yi(w⊤xi + b) ≥ 1 - ξi, ξi ≥ 0, i = 1,...,m
• The idea is to introduce the slack variables ξi to relax the separation constraints (ξi > 0 ⇒ xi has margin less than 1).

New dual problem
• A saddle point analysis (similar to that above) leads to the dual problem
maximize Σi αi - (1/2) α⊤Aα   subject to 0 ≤ αi ≤ C and Σi αi yi = 0
• This is like the previous dual problem except that now we have "box constraints" on αi. If the data are linearly separable, by choosing C large enough we obtain the Optimal Separating Hyperplane.
• Again we have w = Σi αi yi xi.

The role of the parameter C
• The parameter C controls the relative importance of minimizing the norm of w (which is equivalent to maximizing the margin) and satisfying the margin constraint for each data point.
• If C is close to 0, then we don't pay much for points violating the margin constraint. This is equivalent to creating a very wide tube or safety margin around the decision boundary (but having many points violate this safety margin).
• If C is close to infinity, then we pay a lot for points that violate the margin constraint, and we are close to the hard-margin formulation described above; the difficulty here is that we may be sensitive to outlier points in the training data.
• C is often selected by minimizing the leave-one-out cross-validation error.

Summary
• SVMs are prediction devices known to have good performance in high-dimensional settings.
• "The key features of SVMs are the use of kernels, the absence of local minima, the sparseness of the solution and the capacity control obtained by optimizing the margin." Shawe-Taylor and Cristianini (2004).
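To make the soft-margin formulation and the choice of C concrete, here is a minimal sketch using scikit-learn rather than PRoNTo, with a synthetic stand-in for vectorized brain volumes. It fits a linear soft-margin SVM and selects C by leave-one-out cross-validation, as suggested above; the data, labels, and grid of C values are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.svm import SVC

# Toy stand-in for vectorized brain volumes: 20 examples, 500 features,
# labeled +1 (e.g. patients) and -1 (e.g. healthy controls).
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 500))
y = np.repeat([1, -1], 10)
X[y == 1] += 0.5                 # inject a weak class difference

# Soft-margin linear SVM; C is chosen by leave-one-out cross-validation.
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=LeaveOneOut(),
                    scoring="accuracy")
grid.fit(X, y)

print("best C:", grid.best_params_["C"])
print("LOO accuracy:", grid.best_score_)

# For a linear kernel the weight vector (discrimination map) can be read
# off the fitted model and reshaped back into image space for display.
w = grid.best_estimator_.coef_.ravel()
print("weight vector shape:", w.shape)
```

Small C tolerates margin violations (wide safety margin), large C approaches the hard-margin solution and is more sensitive to outliers, exactly as described in "The role of the parameter C" above.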
References
• Shawe-Taylor, J., Cristianini, N., 2004. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge.
• Schölkopf, B., Smola, A., 2002. Learning with Kernels. MIT Press.
• Burges, C., 1998. A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2 (2), 121–167.
• Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer-Verlag, New York.
• Boser, B.E., Guyon, I.M., Vapnik, V.N., 1992. A training algorithm for optimal margin classifiers. In: Proc. Fifth Ann. Workshop on Computational Learning Theory, pp. 144–152.