Sparse Kernel Machines
Christopher M. Bishop, Pattern Recognition and Machine Learning

Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions

Supervised Learning
In machine learning, applications in which the training data comprise examples of the input vectors along with their corresponding target vectors are called supervised learning.
Example training pairs (x, t): (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail), ...; the learned function y(x) maps a new input to an output.

Classification
[Figure: two-class data in the (x1, x2) plane; the decision boundary y(x) = 0 separates the region y(x) > 0, labelled t = +1, from the region y(x) < 0, labelled t = -1.]

Regression
[Figure: regression of t on x for noisy samples of a sinusoid; the fitted curve predicts t at a new input x.]

Linear Models
Linear models for regression and classification:
  y(x) = w_0 + w_1 x_1 + ... + w_D x_D
where x = (x_1, ..., x_D) is the input and the w_j are the model parameters. If we apply feature extraction,
  y(x) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x) = w^T \phi(x)

Problems with Feature Space
Why feature extraction? Working in a high-dimensional feature space makes it possible to express complex functions.
Problems:
- computational cost (working with very large vectors)
- the curse of dimensionality

Kernel Methods (1)
A kernel function is an inner product in some feature space, i.e. a nonlinear similarity measure:
  k(x, x') = \phi(x)^T \phi(x')
Examples:
- polynomial: k(x, x') = (x^T x' + c)^d
- Gaussian: k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)

Kernel Methods (2)
For a two-dimensional input, the quadratic kernel can be expanded explicitly:
  k(x, z) = (x^T z)^2 = (x_1 z_1 + x_2 z_2)^2
          = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2
          = (x_1^2, \sqrt{2} x_1 x_2, x_2^2)(z_1^2, \sqrt{2} z_1 z_2, z_2^2)^T
          = \phi(x)^T \phi(z)
Many linear models can be reformulated using a "dual representation" in which the kernel function arises naturally; they require only inner products between data (input) points.

Kernel Methods (3)
We can benefit from the kernel trick:
- choosing a kernel function is equivalent to choosing \phi, so there is no need to specify explicitly which features are being used
- we save computation by never mapping the data to feature space explicitly, instead evaluating the inner product directly from the input data

Kernel Methods (4)
Kernel methods exploit information about the inner products between data items. We can construct kernels indirectly by choosing a feature-space mapping \phi, or directly by choosing a valid kernel function. A badly chosen kernel maps the data to a space with many irrelevant features, so some prior knowledge of the target is needed.

Kernel Methods (5)
Two basic modules of a kernel method:
- a general-purpose learning model
- a problem-specific kernel function

Kernel Methods (6)
Limitation: the kernel function k(x_n, x_m) must be evaluated for all pairs x_n and x_m of training points, including when making predictions for new data points. A sparse kernel machine instead makes predictions using only a subset of the training data points.
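The kernel trick above can be checked numerically. The following sketch (in Python with NumPy, chosen here only for illustration; the slides themselves contain no code) verifies the identity k(x, z) = (x^T z)^2 = \phi(x)^T \phi(z) for the two-dimensional feature map on the Kernel Methods (2) slide, and evaluates the Gaussian kernel; the particular vectors and the value of sigma are arbitrary choices, not taken from the slides.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the quadratic kernel (x^T z)^2 in two dimensions."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def quadratic_kernel(x, z):
    """k(x, z) = (x^T z)^2, computed without ever forming phi."""
    return float(x @ z) ** 2

def gaussian_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# The kernel trick: both numbers agree, but quadratic_kernel never builds phi.
print(quadratic_kernel(x, z), phi(x) @ phi(z))   # both equal (1*3 + 2*(-1))^2 = 1
print(gaussian_kernel(x, z))
```

The point of the trick is that the kernel never constructs \phi(x); for high-degree polynomial or Gaussian kernels the corresponding feature space would be very large or infinite-dimensional.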
Outline
Introduction to kernel methods / Support vector machines (SVM) / Relevance vector machines (RVM) / Applications / Conclusions

Support Vector Machines (1)
Support vector machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights provided by generalization theory and exploiting optimization theory. Generalization theory describes how to control a learning machine so that it does not overfit.

Support Vector Machines (2)
To avoid overfitting, the SVM modifies the error function to a "regularized" form
  E(w) = E_D(w) + \lambda E_W(w)
where the hyperparameter \lambda balances the trade-off between the data term E_D and the regularizer E_W. The aim of E_W is to restrict the estimated functions to smooth ones. As a side effect, the SVM obtains a sparse model.

Support Vector Machines (3)
Fig. 1: Architecture of the SVM.

SVM for Classification (1)
The mechanism that prevents overfitting in classification is the maximum margin classifier. The SVM is fundamentally a two-class classifier.

Maximum Margin Classifiers (1)
The aim of classification is to find a (D-1)-dimensional hyperplane that separates the data in a D-dimensional space.
[Figure: 2-D example of a separating boundary.]

Maximum Margin Classifiers (2)
[Figure: the margin, with the support vectors lying on either side of the decision boundary.]

Maximum Margin Classifiers (3)
[Figure: a small-margin solution compared with a large-margin solution.]

Maximum Margin Classifiers (4)
Intuitively the maximum-margin boundary is a "robust" solution: if we have made a small error in the location of the boundary, it gives us the least chance of causing a misclassification. The concept of maximum margin is usually justified using Vapnik's statistical learning theory. Empirically it works well.

SVM for Classification (2)
After the optimization process, we obtain the prediction model
  y(x) = \sum_{n=1}^{N} a_n t_n k(x, x_n) + b
where (x_n, t_n) are the N training data. The coefficients a_n turn out to be zero except for those of the support vectors, so the model is sparse.

SVM for Classification (3)
Fig. 2: Data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel.

SVM for Classification (4)
For overlapping class distributions, the SVM allows some of the training points to be misclassified by means of a soft-margin penalty.

SVM for Classification (5)
For multiclass problems, several methods combine multiple two-class SVMs:
- one versus the rest
- one versus one (more training time)
Fig. 3: Problems that arise in multiclass classification using multiple SVMs.

SVM for Regression (1)
For regression problems, the mechanism that prevents overfitting is the \epsilon-insensitive error function.
[Figure: quadratic error function versus \epsilon-insensitive error function.]

SVM for Regression (2)
Points inside the \epsilon-tube contribute no error; a point outside the tube contributes error |y(x) - t| - \epsilon.
Fig. 4: The \epsilon-tube.

SVM for Regression (3)
After the optimization process, we obtain the prediction model
  y(x) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(x, x_n) + b
Again the coefficients are zero except for those of the support vectors, so the model is sparse.

SVM for Regression (4)
Fig. 5: Regression results. The support vectors lie on the boundary of the tube or outside the tube.

Disadvantages
- The solution is not sparse enough: the number of support vectors typically grows linearly with the size of the training set.
- Predictions are not probabilistic.
- The error/margin trade-off parameters must be estimated by cross-validation, which wastes computation.
- Kernel functions are limited (they must be valid, positive-definite kernels).
- Multiclass classification requires combining several binary machines.
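To show how sparse the resulting predictor is, here is a minimal sketch of evaluating the classification model y(x) = \sum_n a_n t_n k(x, x_n) + b from the slides above. The training points, labels, dual coefficients a_n and bias b are made-up placeholders standing in for the output of a real quadratic-programming solver; only the points with a_n > 0 (the support vectors) ever enter the sum.

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def svm_predict(x, X_train, t_train, a, b, kernel=gaussian_kernel):
    """y(x) = sum_n a_n t_n k(x, x_n) + b, summed only over the support vectors."""
    sv = a > 1e-9                      # support vectors: nonzero dual coefficients
    return sum(a_n * t_n * kernel(x, x_n)
               for a_n, t_n, x_n in zip(a[sv], t_train[sv], X_train[sv])) + b

# Toy setup: four training points, two classes; a and b are assumed given by a solver.
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
t_train = np.array([+1, +1, -1, -1])
a = np.array([0.7, 0.0, 0.7, 0.0])     # only two points act as support vectors
b = 0.0

x_new = np.array([0.2, 0.1])
print("class:", np.sign(svm_predict(x_new, X_train, t_train, a, b)))
```

In practice a library such as scikit-learn's SVC would be used to obtain a and b; the sketch only illustrates why the prediction cost scales with the number of support vectors rather than with the full training set.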
Outline
Introduction to kernel methods / Support vector machines (SVM) / Relevance vector machines (RVM) / Applications / Conclusions

Relevance Vector Machines (1)
The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations. The RVM is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as much sparser solutions than the SVM.

Relevance Vector Machines (2)
The RVM mirrors the structure of the SVM and uses a Bayesian treatment to remove the limitations of the SVM:
  y(x) = \sum_{n=1}^{N} w_n k(x, x_n) + b
Here the kernel functions are simply treated as basis functions, rather than as dot products in some feature space.

Bayesian Inference
Bayesian inference allows one to model uncertainty about the world and the outcomes of interest by combining common-sense (prior) knowledge with observational evidence.

Relevance Vector Machines (3)
In the Bayesian framework, we use a prior distribution over w to avoid overfitting:
  p(w | \alpha) = \prod_{m=1}^{N} (\alpha / 2\pi)^{1/2} \exp(-\alpha w_m^2 / 2)
where \alpha is a hyperparameter that controls the model parameters w.

Relevance Vector Machines (4)
Goal: find the most probable \alpha* and \beta* in order to compute the predictive distribution over t_new for a new input x_new, i.e. p(t_new | x_new, X, t, \alpha*, \beta*), where X and t are the training data and their target values. We obtain \alpha* and \beta* by maximizing the likelihood function p(t | X, \alpha, \beta).

Relevance Vector Machines (5)
The RVM uses "automatic relevance determination" (ARD) to achieve sparsity:
  p(w | \alpha) = \prod_{m=1}^{N} (\alpha_m / 2\pi)^{1/2} \exp(-\alpha_m w_m^2 / 2)
where \alpha_m represents the precision of w_m. In the procedure for finding \alpha_m*, some \alpha_m are driven to infinity, which forces the corresponding w_m to zero; only the relevance vectors remain.

Comparisons - Regression
[Figure: RVM regression, with the one-standard-deviation band of the predictive distribution, compared with SVM regression.]
[Figure: further regression comparison of RVM and SVM.]

Comparison - Classification
[Figure: RVM and SVM decision boundaries on the same two-class data.]
[Figure: further classification comparison of RVM and SVM.]

Comparisons
- The RVM is much sparser and makes probabilistic predictions.
- The RVM gives better generalization in regression.
- The SVM gives better generalization in classification.
- The RVM is computationally demanding during learning.

Outline
Introduction to kernel methods / Support vector machines (SVM) / Relevance vector machines (RVM) / Applications / Conclusions

Applications (1)
SVM for face detection.

Applications (2)
Marti Hearst, "Support Vector Machines", 1998.

Applications (3)
In feature-matching-based object tracking, SVMs are used to detect false feature matches.
Weiyu Zhu et al., "Tracking of Object with SVM Regression", 2001.

Applications (4)
Recovering 3D human poses with the RVM.
A. Agarwal and B. Triggs, "3D Human Pose from Silhouettes by Relevance Vector Regression", 2004.

Outline
Introduction to kernel methods / Support vector machines (SVM) / Relevance vector machines (RVM) / Applications / Conclusions

Conclusions
The SVM is a learning machine based on kernel methods and generalization theory that can perform binary classification and real-valued function approximation. The RVM has the same functional form as the SVM but provides probabilistic predictions and sparser solutions.

References
- www.support-vector.net
- N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods," Cambridge University Press, 2000.
- M. E. Tipping, "Sparse Bayesian Learning and the Relevance Vector Machine," Journal of Machine Learning Research, 2001.

Underfitting and Overfitting
[Figure: an underfitted model (too simple) and an overfitted model (too complex), both evaluated on new data. Adapted from http://www.dtreg.com/svm.htm]
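To connect this appendix slide back to the regularized error E(w) = E_D(w) + \lambda E_W(w) used earlier, here is a minimal sketch, assuming plain ridge-regularized least squares on polynomial features rather than an actual SVM or RVM; the data, degree, and \lambda values are arbitrary. A near-zero \lambda reproduces the overfitted (too complex) behaviour, a very large \lambda the underfitted (too simple) one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of a sine curve, as in the regression figure earlier in the slides.
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

def poly_features(x, degree=9):
    """Map scalar inputs to the basis (1, x, x^2, ..., x^degree)."""
    return np.vstack([x ** j for j in range(degree + 1)]).T

def fit_ridge(x, t, lam, degree=9):
    """Minimize ||Phi w - t||^2 + lam * ||w||^2 in closed form (data term + regularizer)."""
    Phi = poly_features(x, degree)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ t)

for lam in (1e-12, 1e-3, 1e3):     # overfit, reasonable, underfit
    w = fit_ridge(x, t, lam)
    x_new = np.array([0.25])       # prediction at a new input
    print(f"lambda={lam:g}  y(0.25)={(poly_features(x_new) @ w).item():.3f}")
```

Printing predictions at a new input shows how the regularization weight moves the fit between the two failure modes pictured above.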