Sparse Kernel Machines
Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Supervised Learning
In machine learning, applications in which the training data comprise examples of the input vectors along with their corresponding target vectors are called supervised learning.
Example training pairs (x, t): (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail), ...; the learned function y(x) maps a new input to an output.

Classification
[Figure: two-class data in the (x1, x2) plane; the decision boundary y(x) = 0 separates the region y(x) > 0, labelled t = +1, from the region y(x) < 0, labelled t = -1.]

Regression
[Figure: regression of t on x for noisy samples of a sinusoid; the fitted curve predicts t at a new input x.]

Linear Models
Linear models for regression and classification:
  y(x) = w_0 + w_1 x_1 + ... + w_D x_D
where x = (x_1, ..., x_D) is the input and the w_j are the model parameters. If we apply feature extraction,
  y(x) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = w^T \phi(x)

Problems with Feature Space
Why feature extraction? Working in a high-dimensional feature space makes it possible to express complex functions.
Problems:
- computational cost (working with very large vectors)
- the curse of dimensionality

Kernel Methods (1)
A kernel function is an inner product in some feature space, i.e. a nonlinear similarity measure:
  k(x, x') = \phi(x)^T \phi(x')
Examples:
- polynomial: k(x, x') = (x^T x' + c)^d
- Gaussian: k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)

Kernel Methods (2)
For a two-dimensional input, the quadratic kernel can be expanded explicitly:
  k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2
          = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
          = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)(z_1^2, \sqrt{2} z_1 z_2, z_2^2)^T
          = \phi(x)^T \phi(z)
Many linear models can be reformulated using a "dual representation" in which the kernel function arises naturally; they require only inner products between data (input) points.

Kernel Methods (3)
We can benefit from the kernel trick:
- choosing a kernel function is equivalent to choosing \phi, so there is no need to specify explicitly which features are being used
- we save computation by never mapping the data to feature space explicitly, instead evaluating the inner product directly from the input data

Kernel Methods (4)
Kernel methods exploit information about the inner products between data items. We can construct kernels indirectly by choosing a feature-space mapping \phi, or directly by choosing a valid kernel function. A badly chosen kernel maps the data to a space with many irrelevant features, so some prior knowledge of the target is needed.

Kernel Methods (5)
Two basic modules of a kernel method:
- a general-purpose learning model
- a problem-specific kernel function

Kernel Methods (6)
Limitation: the kernel function k(x_n, x_m) must be evaluated for all pairs x_n and x_m of training points, including when making predictions for new data points. A sparse kernel machine instead makes predictions using only a subset of the training data points.
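The kernel trick above can be checked numerically. The following sketch (in Python with NumPy, chosen here only for illustration; the slides themselves contain no code) verifies the identity k(x, z) = (x^T z)^2 = \phi(x)^T \phi(z) for the two-dimensional feature map on the Kernel Methods (2) slide, and evaluates the Gaussian kernel; the particular vectors and the value of sigma are arbitrary choices, not taken from the slides.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel (x^T z)^2 in two dimensions."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def quadratic_kernel(x, z):
    """k(x, z) = (x^T z)^2, computed without ever forming phi."""
    return float(x @ z) ** 2

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The kernel trick: both numbers agree, but quadratic_kernel never builds phi.
print(quadratic_kernel(x, z), phi(x) @ phi(z))   # both equal (1*3 + 2*(-1))^2 = 1
print(gaussian_kernel(x, z))
```

The point of the trick is that the kernel never constructs \phi(x); for high-degree polynomial or Gaussian kernels the corresponding feature space would be very large or infinite-dimensional.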
Outline
Introduction to kernel methods / Support vector machines (SVM) / Relevance vector machines (RVM) / Applications / Conclusions

Support Vector Machines (1)
Support vector machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights provided by generalization theory and exploiting optimization theory. Generalization theory describes how to control a learning machine so that it does not overfit.

Support Vector Machines (2)
To avoid overfitting, the SVM modifies the error function to a "regularized" form
  E(w) = E_D(w) + \lambda E_W(w)
where the hyperparameter \lambda balances the trade-off between the data term E_D and the regularizer E_W. The aim of E_W is to restrict the estimated functions to smooth ones. As a side effect, the SVM obtains a sparse model.

Support Vector Machines (3)
Fig. 1: Architecture of the SVM.

SVM for Classification (1)
The mechanism that prevents overfitting in classification is the maximum margin classifier. The SVM is fundamentally a two-class classifier.

Maximum Margin Classifiers (1)
The aim of classification is to find a (D-1)-dimensional hyperplane that separates the data in a D-dimensional space.
[Figure: 2-D example of a separating boundary.]

Maximum Margin Classifiers (2)
[Figure: the margin, with the support vectors lying on either side of the decision boundary.]

Maximum Margin Classifiers (3)
[Figure: a small-margin solution compared with a large-margin solution.]

Maximum Margin Classifiers (4)
Intuitively the maximum-margin boundary is a "robust" solution: if we have made a small error in the location of the boundary, it gives us the least chance of causing a misclassification. The concept of maximum margin is usually justified using Vapnik's statistical learning theory. Empirically it works well.

SVM for Classification (2)
After the optimization process, we obtain the prediction model
  y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b
where (x_n, t_n) are the N training data. The coefficients a_n turn out to be zero except for those of the support vectors, so the model is sparse.

SVM for Classification (3)
Fig. 2: Data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel.

SVM for Classification (4)
For overlapping class distributions, the SVM allows some of the training points to be misclassified by means of a soft-margin penalty.

SVM for Classification (5)
For multiclass problems, several methods combine multiple two-class SVMs:
- one versus the rest
- one versus one (more training time)
Fig. 3: Problems that arise in multiclass classification using multiple SVMs.

SVM for Regression (1)
For regression problems, the mechanism that prevents overfitting is the \epsilon-insensitive error function.
[Figure: quadratic error function versus \epsilon-insensitive error function.]

SVM for Regression (2)
Points inside the \epsilon-tube contribute no error; a point outside the tube contributes error |y(x) - t| - \epsilon.
Fig. 4: The \epsilon-tube.

SVM for Regression (3)
After the optimization process, we obtain the prediction model
  y(x) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(x, x_n) + b
Again the coefficients are zero except for those of the support vectors, so the model is sparse.

SVM for Regression (4)
Fig. 5: Regression results. The support vectors lie on the boundary of the tube or outside the tube.

Disadvantages
- The solution is not sparse enough: the number of support vectors typically grows linearly with the size of the training set.
- Predictions are not probabilistic.
- The error/margin trade-off parameters must be estimated by cross-validation, which wastes computation.
- Kernel functions are limited (they must be valid, positive-definite kernels).
- Multiclass classification requires combining several binary machines.
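To show how sparse the resulting predictor is, here is a minimal sketch of evaluating the classification model y(x) = \sum_n a_n t_n k(x, x_n) + b from the slides above. The training points, labels, dual coefficients a_n and bias b are made-up placeholders standing in for the output of a real quadratic-programming solver; only the points with a_n > 0 (the support vectors) ever enter the sum.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def svm_predict(x, X_train, t_train, a, b, kernel=gaussian_kernel):
    """y(x) = sum_n a_n t_n k(x, x_n) + b, summed only over the support vectors."""
    sv = a > 1e-9                      # support vectors: nonzero dual coefficients
    return sum(a_n * t_n * kernel(x, x_n)
               for a_n, t_n, x_n in zip(a[sv], t_train[sv], X_train[sv])) + b

# Toy setup: four training points, two classes; a and b are assumed given by a solver.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
t_train = np.array([+1, +1, -1, -1])
a = np.array([0.7, 0.0, 0.7, 0.0])     # only two points act as support vectors
b = 0.0

x_new = np.array([0.2, 0.1])
print("class:", np.sign(svm_predict(x_new, X_train, t_train, a, b)))
```

In practice a library such as scikit-learn's SVC would be used to obtain a and b; the sketch only illustrates why the prediction cost scales with the number of support vectors rather than with the full training set.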
Outline
Introduction to kernel methods / Support vector machines (SVM) / Relevance vector machines (RVM) / Applications / Conclusions

Relevance Vector Machines (1)
The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations. The RVM is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as much sparser solutions than the SVM.

Relevance Vector Machines (2)
The RVM mirrors the structure of the SVM and uses a Bayesian treatment to remove the limitations of the SVM:
  y(x) = \sum_{n=1}^{N} w_n k(x, x_n) + b
Here the kernel functions are simply treated as basis functions, rather than as dot products in some feature space.

Bayesian Inference
Bayesian inference allows one to model uncertainty about the world and the outcomes of interest by combining common-sense (prior) knowledge with observational evidence.

Relevance Vector Machines (3)
In the Bayesian framework, we use a prior distribution over w to avoid overfitting:
  p(w | \alpha) = \prod_{m=1}^{N} (\alpha / 2\pi)^{1/2} \exp(-\alpha w_m^2 / 2)
where \alpha is a hyperparameter that controls the model parameters w.

Relevance Vector Machines (4)
Goal: find the most probable \alpha* and \beta* in order to compute the predictive distribution over t_new for a new input x_new, i.e. p(t_new | x_new, X, t, \alpha*, \beta*), where X and t are the training data and their target values. We obtain \alpha* and \beta* by maximizing the likelihood function p(t | X, \alpha, \beta).

Relevance Vector Machines (5)
The RVM uses "automatic relevance determination" (ARD) to achieve sparsity:
  p(w | \alpha) = \prod_{m=1}^{N} (\alpha_m / 2\pi)^{1/2} \exp(-\alpha_m w_m^2 / 2)
where \alpha_m represents the precision of w_m. In the procedure for finding \alpha_m*, some \alpha_m are driven to infinity, which forces the corresponding w_m to zero; only the relevance vectors remain.

Comparisons - Regression
[Figure: RVM regression, with the one-standard-deviation band of the predictive distribution, compared with SVM regression.]
[Figure: further regression comparison of RVM and SVM.]

Comparison - Classification
[Figure: RVM and SVM decision boundaries on the same two-class data.]
[Figure: further classification comparison of RVM and SVM.]

Comparisons
- The RVM is much sparser and makes probabilistic predictions.
- The RVM gives better generalization in regression.
- The SVM gives better generalization in classification.
- The RVM is computationally demanding during learning.

Outline
Introduction to kernel methods / Support vector machines (SVM) / Relevance vector machines (RVM) / Applications / Conclusions

Applications (1)
SVM for face detection.

Applications (2)
Marti Hearst, "Support Vector Machines", 1998.

Applications (3)
In feature-matching-based object tracking, SVMs are used to detect false feature matches.
Weiyu Zhu et al., "Tracking of Object with SVM Regression", 2001.

Applications (4)
Recovering 3D human poses with the RVM.
A. Agarwal and B. Triggs, "3D Human Pose from Silhouettes by Relevance Vector Regression", 2004.

Outline
Introduction to kernel methods / Support vector machines (SVM) / Relevance vector machines (RVM) / Applications / Conclusions

Conclusions
The SVM is a learning machine based on kernel methods and generalization theory that can perform binary classification and real-valued function approximation. The RVM has the same functional form as the SVM but provides probabilistic predictions and sparser solutions.

References
- www.support-vector.net
- N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods," Cambridge University Press, 2000.
- M. E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," Journal of Machine Learning Research, 2001.

Underfitting and Overfitting
[Figure: an underfitted model (too simple) and an overfitted model (too complex), both evaluated on new data. Adapted from http://www.dtreg.com/svm.htm]
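To connect this appendix slide back to the regularized error E(w) = E_D(w) + \lambda E_W(w) used earlier, here is a minimal sketch, assuming plain ridge-regularized least squares on polynomial features rather than an actual SVM or RVM; the data, degree, and \lambda values are arbitrary. A near-zero \lambda reproduces the overfitted (too complex) behaviour, a very large \lambda the underfitted (too simple) one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a sine curve, as in the regression figure earlier in the slides.
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

def poly_features(x, degree=9):
    """Map scalar inputs to the basis (1, x, x^2, ..., x^degree)."""
    return np.vstack([x ** j for j in range(degree + 1)]).T

def fit_ridge(x, t, lam, degree=9):
    """Minimize ||Phi w - t||^2 + lam * ||w||^2 in closed form (data term + regularizer)."""
    Phi = poly_features(x, degree)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

for lam in (1e-12, 1e-3, 1e3):     # overfit, reasonable, underfit
    w = fit_ridge(x, t, lam)
    x_new = np.array([0.25])       # prediction at a new input
    print(f"lambda={lam:g}  y(0.25)={(poly_features(x_new) @ w).item():.3f}")
```

Printing predictions at a new input shows how the regularization weight moves the fit between the two failure modes pictured above.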