Introduction to Predictive Learning
LECTURE SET 7: Support Vector Machines
Electrical and Computer Engineering

OUTLINE
• Objectives
- explain the motivation for SVM
- describe basic SVM for classification & regression
- compare SVM vs. statistical & NN methods
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

MOTIVATION for SVM
• Recall 'conventional' methods:
- model complexity ~ dimensionality
- nonlinear methods → multiple local minima
- hard to control complexity
• A 'good' learning method should have:
(a) a tractable optimization formulation
(b) tractable complexity control (1-2 parameters)
(c) flexible nonlinear parameterization
• Properties (a), (b) hold for linear methods → SVM solution approach

SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality
• Model structure: x → g(x) → z → (w·z) → ŷ

Motivation for Nonlinear Methods
1. A nonlinear learning algorithm is proposed using 'reasonable' heuristic arguments (reasonable ~ statistical or biological)
2. Empirical validation + improvement
3. Statistical explanation (why it really works)
Examples: statistical and neural network methods.
In contrast, SVM methods were originally proposed under the VC theoretic framework.

OUTLINE
• Objectives
• Motivation for margin-based loss
- Loss functions for regression
- Loss functions for classification
- Philosophical interpretation
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Main Idea
• Model complexity is controlled by a special loss function used for fitting the training data
• Such empirical loss functions may differ from the loss functions used in the learning problem setting
• Such loss functions are adaptive, i.e. they can adapt their complexity to a particular data set
• Different loss functions are used for different learning problems (classification, regression, etc.)
• Model complexity (VC-dimension) is controlled independently of the number of features

Robust Loss Function for Regression
• Squared loss ~ motivated by large-sample settings, parametric assumptions, Gaussian noise
• For practical settings it is better to use linear (least-modulus) loss

Epsilon-insensitive Loss for Regression
• Can also control model complexity (~ falsifiability):
$L_\varepsilon(y, f(\mathbf{x},\omega)) = \max(|y - f(\mathbf{x},\omega)| - \varepsilon,\; 0)$

Empirical Comparison
• Univariate regression $y = x + \xi$, $x \in [0,1]$, $\xi \sim N(0, 0.36)$
• Squared, linear and SVM loss (with $\varepsilon = 0.6$)
[Figure: estimates of the target function. Red ~ target function, dotted ~ estimate using squared loss, dashed ~ linear loss, dash-dotted ~ SVM loss]

Empirical Comparison (cont'd)
• Univariate regression $y = x + \xi$, $x \in [0,1]$, $\xi \sim N(0, 0.36)$
• Squared, linear and SVM loss (with $\varepsilon = 0.6$)
• Test error (MSE) estimated for 5 independent realizations of the training data (4 training samples each):

Realization   Squared loss   Least-modulus loss   SVM loss (eps = 0.6)
1             0.024          0.134                0.067
2             0.128          0.075                0.063
3             0.920          0.274                0.041
4             0.035          0.053                0.032
5             0.111          0.027                0.005
Mean          0.244          0.113                0.042
St. Dev.      0.381          0.099                0.025
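A minimal sketch of a similar experiment, assuming scikit-learn is available (LinearSVR, the value C = 100, and the random seed are illustrative choices, not from the lecture; the resulting numbers will differ from the table above because the noise realizations differ):

```python
# Compare squared-loss regression vs. epsilon-insensitive (SVM) regression
# on the univariate setup above: y = x + noise, x in [0,1], 4 training samples.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
n_train = 4

x_train = rng.uniform(0, 1, n_train).reshape(-1, 1)
y_train = x_train.ravel() + rng.normal(0, 0.6, n_train)   # noise variance 0.36
x_test = np.linspace(0, 1, 1000).reshape(-1, 1)
y_target = x_test.ravel()                                  # noise-free target

ols = LinearRegression().fit(x_train, y_train)             # squared loss
svr = LinearSVR(epsilon=0.6, C=100.0,                      # eps-insensitive loss
                max_iter=10000).fit(x_train, y_train)

for name, model in [("squared loss", ols), ("SVM loss (eps=0.6)", svr)]:
    mse = np.mean((model.predict(x_test) - y_target) ** 2)
    print(f"{name:20s} test MSE = {mse:.3f}")
```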
Loss Functions for Classification
• Decision rule $D(\mathbf{x}) = \mathrm{sign}(f(\mathbf{x},\omega))$
• The quantity $y\, f(\mathbf{x},\omega)$ is analogous to residuals in regression
• Common loss functions: 0/1 loss and linear loss
• Properties of a good loss function: continuous + convex + robust

Motivation for margin-based loss (1)
• Given: linearly separable data
• How to construct a linear decision boundary?
(a) Many linear decision boundaries (that have no errors)

Motivation for margin-based loss (2)
• Given: linearly separable data
• Which linear decision boundary is better?
• The model with the larger margin is more robust for future data

Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (larger falsifiability)
[Figure: separating hyperplane with margin of size $2\Delta$]

Margin-based loss for classification: SVM loss or hinge loss
$L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$
• Equivalent to minimization of the slack variables $\xi_i = \max(1 - y_i f(\mathbf{x}_i,\omega),\; 0)$
[Figure: hinge loss and slack variables for the $y = +1$ and $y = -1$ classes]

Margin-based loss for classification: margin size is adapted to training data
[Figure: Class +1, Class -1, and the margin between them]
$L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$

Motivation: philosophical
• Classical view: a good model explains the data + has low complexity
→ Occam's razor (complexity ~ number of parameters)
• VC theory: a good model explains the data + has low VC-dimension
→ VC-falsifiability (small VC-dimension ~ large falsifiability), i.e. the goal is to find a model that can explain the training data but cannot explain other data
• The idea: falsifiability ~ empirical loss function

Adaptive Loss Functions
• Both goals (explanation + falsifiability) can be encoded into the empirical loss function, where
- a (large) portion of the data has zero loss
- the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off (between the two goals) is adaptively controlled → adaptive loss function
• For classification, the degree of falsifiability is ~ margin size (see below)

Margin-based loss for classification
[Figure: margin of size $2\Delta$ around the decision boundary]
$L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$

Classification: non-separable data
• Slack variables $\xi_i = 1 - y_i f(\mathbf{x}_i,\omega)$ for samples inside the margin or misclassified
$L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$

Margin-based complexity control
• A large degree of falsifiability is achieved by
- a large margin (for classification)
- a small epsilon (for regression)
• For linear classifiers: larger margin → smaller VC-dimension:
$\Delta_1 > \Delta_2 > \Delta_3 > \ldots \;\sim\; h_1 \le h_2 \le h_3 \le \ldots$

Δ-margin hyperplanes
• Solutions provided by minimization of the SVM loss can be indexed by the value of the margin Δ
→ SRM structure: for $\Delta_1 > \Delta_2 > \Delta_3 > \ldots$ the VC-dimensions satisfy $h_1 \le h_2 \le h_3 \le \ldots$
• If the data samples belong to a sphere of radius R, then the VC-dimension is bounded by
$h \le \min(R^2/\Delta^2,\; d) + 1$
• For large-margin hyperplanes, the VC-dimension is controlled independently of the dimensionality d.

SVM Model Complexity
• Two ways to control model complexity:
- via the model parameterization $f(\mathbf{x},\omega)$, using a fixed loss function $L(y, f(\mathbf{x},\omega))$
- via an adaptive loss function $L(y, f(\mathbf{x},\omega))$, using a fixed (linear) parameterization $f(\mathbf{x},\omega) = (\mathbf{w}\cdot\mathbf{x}) + b$
• ~ two types of SRM structures
• Margin-based loss can be motivated by Popper's falsifiability

Margin-based loss: summary
• Classification: $L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$
falsifiability controlled by the margin
• Regression: $L_\varepsilon(y, f(\mathbf{x},\omega)) = \max(|y - f(\mathbf{x},\omega)| - \varepsilon,\; 0)$
falsifiability controlled by $\varepsilon$
• Single-class learning: $L_r(f(\mathbf{x},\omega)) = \max(\|\mathbf{x} - \mathbf{a}\| - r,\; 0)$
falsifiability controlled by the radius r
NOTE: the same interpretation/motivation of margin-based loss applies to the different types of learning problems.
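The three losses above share the same structure: zero loss inside a 'margin' region and linearly growing loss outside it, so only the samples outside that region (the ones that falsify the model) influence the solution. A small plain-NumPy sketch (function names are illustrative):

```python
# Margin-based losses for the three learning problems summarized above.
import numpy as np

def hinge_loss(y, f):
    """Classification: max(1 - y*f(x), 0) with y in {-1, +1}."""
    return np.maximum(1.0 - y * f, 0.0)

def eps_insensitive_loss(y, f, eps):
    """Regression: max(|y - f(x)| - eps, 0)."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

def single_class_loss(x, a, r):
    """Single-class learning: max(||x - a|| - r, 0) for center a and radius r."""
    return np.maximum(np.linalg.norm(x - a, axis=-1) - r, 0.0)

# Example: hinge loss is zero for confidently correct predictions (y*f >= 1)
print(hinge_loss(np.array([+1, +1, -1]), np.array([2.0, 0.3, 0.5])))
# loss values 0.0, 0.7, 1.5: only the first sample lies outside the margin-violation region
```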
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
- Primal formulation (linearly separable case)
- Dual optimization formulation
- Soft-margin SVM formulation
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Optimal Separating Hyperplane
• Distance between the hyperplane and a sample $\mathbf{x}'$: $|f(\mathbf{x}')| / \|\mathbf{w}\|$
• Margin $\Delta = 1/\|\mathbf{w}\|$
[Figure: optimal separating hyperplane; the shaded points are the SVs]

Optimization Formulation
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$
• Find parameters $\mathbf{w}$, $b$ of the linear hyperplane $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$ that minimize $\tfrac{1}{2}\|\mathbf{w}\|^2$ under the constraints $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1$
• Quadratic optimization with linear constraints → tractable for moderate dimensions d
• For large dimensions use the dual formulation:
- scales better with n (rather than d)
- uses only dot products

From Optimization Theory
• For a given convex minimization problem with convex inequality constraints there exists an equivalent dual maximization formulation with nonnegative Lagrange multipliers $\alpha_i \ge 0$
• Karush-Kuhn-Tucker (KKT) conditions: Lagrange coefficients $\alpha_i^* > 0$ only for samples that satisfy the original constraint $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1$ with equality
~ SVs have positive Lagrange coefficients

Convex Hull Interpretation of the Dual
• Find the convex hull of each class. The points closest to the optimal hyperplane are the support vectors.
[Figure: convex hulls of the two classes and the separating hyperplane]

Dual Optimization Formulation
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$
• Find parameters $\alpha_i^*$, $b^*$ of an optimal hyperplane
$D(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^{n} \alpha_i^* y_i (\mathbf{x}\cdot\mathbf{x}_i) + b^*\Big)$
as the solution of the maximization problem
$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i\cdot\mathbf{x}_j) \to \max$
under the constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$, $\alpha_i \ge 0$
• Note: data samples with nonzero $\alpha_i^*$ are SVs
• The formulation requires only inner products $(\mathbf{x}\cdot\mathbf{x}')$

Support Vectors
• SVs ~ training samples with non-zero loss
• SVs are the samples that falsify the model
• The model depends only on the SVs → SVs ~ a robust characterization of the data
WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."

Support Vectors (cont'd)
• SVM test error bound: $E[\text{Test error}] \le \dfrac{E[\#\,\text{support vectors}]}{n}$
→ a small number of SVs ~ good generalization
• Can be explained using leave-one-out (LOO) cross-validation
• SVM generalization can be related to data compression

Soft-Margin SVM Formulation
[Figure: $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$ with margin lines $f(\mathbf{x}) = +1$, $f(\mathbf{x}) = 0$, $f(\mathbf{x}) = -1$ and slack variables $\xi_1 = 1 - f(\mathbf{x}_1)$, $\xi_2 = 1 - f(\mathbf{x}_2)$, $\xi_3 = 1 + f(\mathbf{x}_3)$]
Minimize $\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i$
under the constraints $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$
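To make the soft-margin formulation concrete, a minimal sketch assuming scikit-learn (the two-Gaussian toy data and the value C = 1 are illustrative, not from the lecture): fit a linear soft-margin SVM and read off the weight vector w, the bias b, the margin 1/||w||, and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy Gaussian clouds as a stand-in for a two-class training set
X = np.vstack([rng.normal(-1.0, 0.7, (50, 2)), rng.normal(+1.0, 0.7, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)     # minimizes 0.5*||w||^2 + C*sum(xi_i)

w = clf.coef_.ravel()                           # weight vector w
b = clf.intercept_[0]                           # bias b
margin = 1.0 / np.linalg.norm(w)                # margin = 1/||w||

print("margin =", margin)
print("number of support vectors per class:", clf.n_support_)
# Support vectors are exactly the samples with nonzero Lagrange multipliers
print("alpha_i * y_i for the SVs:", clf.dual_coef_.ravel())
```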
SVM Dual Formulation (soft margin)
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$
• Find parameters $\alpha_i^*$, $b^*$ of an optimal hyperplane
$D(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^{n} \alpha_i^* y_i (\mathbf{x}\cdot\mathbf{x}_i) + b^*\Big)$
as the solution of the maximization problem
$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i\cdot\mathbf{x}_j) \to \max$
under the constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$, $0 \le \alpha_i \le C$
• Note: data samples with nonzero $\alpha_i^*$ are SVs
• The formulation requires only inner products $(\mathbf{x}\cdot\mathbf{x}')$

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Nonlinear Decision Boundary
• A fixed (linear) parameterization is too rigid
• A nonlinear (curved) decision boundary may yield a larger margin (falsifiability) and lower error

Nonlinear Mapping via Kernels
Nonlinear f(x,w) + margin-based loss = SVM
• Nonlinear mapping to a feature space (z-space): x → g(x) → z → (w·z) → ŷ
• Linear in z-space ~ nonlinear in x-space
• But $\mathbf{z}_i\cdot\mathbf{z}_j = \sum_{k=1}^{m} g_k(\mathbf{x}_i)\, g_k(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$ ~ a symmetric function of $\mathbf{x}_i$, $\mathbf{x}_j$
→ compute the dot product analytically via the kernel

Example of a Kernel Function
• 2D input space $\mathbf{x} = (x_1, x_2)$
• Mapping to z-space (2nd-order polynomial):
$z_1 = 1$, $z_2 = \sqrt{2}\,x_1$, $z_3 = \sqrt{2}\,x_2$, $z_4 = \sqrt{2}\,x_1 x_2$, $z_5 = x_1^2$, $z_6 = x_2^2$
• One can show by direct substitution that for two input vectors $\mathbf{u} = (u_1, u_2)$ and $\mathbf{v} = (v_1, v_2)$ the dot product of their images is calculated analytically:
$G(\mathbf{u})\cdot G(\mathbf{v}) = K(\mathbf{u}, \mathbf{v}) = \big((\mathbf{u}\cdot\mathbf{v}) + 1\big)^2$

SVM Formulation (with kernels)
• Replacing $\mathbf{z}\cdot\mathbf{z}' = K(\mathbf{x},\mathbf{x}')$ leads to:
• Find parameters $\alpha_i^*$, $b^*$ of an optimal hyperplane
$D(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^{n} \alpha_i^* y_i K(\mathbf{x}_i, \mathbf{x}) + b^*\Big)$
as the solution of the maximization problem
$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \to \max$
under the constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$, $0 \le \alpha_i \le C/n$
• Given: the training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$, an inner-product kernel $K(\mathbf{x},\mathbf{x}')$, and the regularization parameter C

Examples of Kernels
The kernel $K(\mathbf{x},\mathbf{x}')$ is a symmetric function satisfying general (Mercer's) conditions. Examples of kernels for different mappings x → z:
• Polynomials of degree m: $K(\mathbf{x},\mathbf{x}') = \big((\mathbf{x}\cdot\mathbf{x}') + 1\big)^m$
• RBF kernel: $K(\mathbf{x},\mathbf{x}') = \exp\left(-\dfrac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right)$, where $\sigma$ is the width parameter
• Neural networks: $K(\mathbf{x},\mathbf{x}') = \tanh\big(v\,(\mathbf{x}\cdot\mathbf{x}') + a\big)$ for given parameters v, a
→ automatic selection of the number of hidden units (SVs)

More on Kernels
• The kernel (Gram) matrix contains all the information (data + kernel):
K(1,1) K(1,2) ... K(1,n)
K(2,1) K(2,2) ... K(2,n)
...
K(n,1) K(n,2) ... K(n,n)
• The kernel defines a distance in some feature space (aka kernel-induced feature space)
• The kernel parameter controls nonlinearity
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)
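A quick numerical check of the second-order polynomial example above, in plain NumPy (the two test vectors are arbitrary): the dot product of the explicit 6-dimensional feature maps equals the kernel value ((u·v) + 1)^2 computed directly in input space.

```python
import numpy as np

def g(x):
    """Explicit 2nd-order polynomial feature map for 2-d input x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2,
                     x2 ** 2])

u = np.array([0.5, -1.2])
v = np.array([2.0, 0.3])

lhs = g(u) @ g(v)          # dot product in feature (z) space
rhs = (u @ v + 1.0) ** 2   # kernel evaluated analytically in input (x) space
print(lhs, rhs)            # the two values agree (up to rounding)
```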
New Insights Provided by SVM
• Why can linear classifiers generalize?
(1) the margin is large (relative to R)
(2) the fraction of SVs is small
(3) the ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection), i.e. by implementing (1) or (2) or both
• Requires common-sense parameter tuning

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
- Model Selection
- Histogram of Projections
- SVM Extensions and Modifications
• SVM for Regression
• Summary and Discussion

SVM Model Selection
• The quality of SVM classifiers depends on proper tuning of the model parameters:
- kernel type (polynomial, RBF, etc.)
- kernel complexity parameter
- regularization parameter C
• Note: the VC-dimension depends on both C and the kernel parameters
• These parameters are usually selected via cross-validation, by searching over a wide range of parameter values (on a log scale)

SVM Example 1: Ripley's data
• Ripley's data set:
- 250 training samples
- SVM using the RBF kernel $K(\mathbf{u},\mathbf{v}) = \exp(-\gamma\|\mathbf{u}-\mathbf{v}\|^2)$
- model selection via 10-fold cross-validation
• Cross-validation error table:

            gamma=2^-3  2^-2   2^-1   2^0    2^1    2^2    2^3
C = 0.1     98.4%       51.6%  33.2%  28%    20.8%  19.2%  15.6%
C = 1       23.6%       22%    19.6%  18%    16.4%  14.4%  14%
C = 10      18.8%       20%    18.8%  16.4%  14%    13.6%  15.6%
C = 100     20.4%       20%    15.6%  14%    12.8%  15.6%  16.4%
C = 1000    18.4%       16%    13.6%  12.8%  16%    15.6%  18.4%
C = 10000   14.4%       14%    14.8%  15.6%  17.2%  16%    18.4%

→ optimal C = 1,000, gamma = 1
• Note: there may be multiple optimal parameter values

Optimal SVM Model
• RBF SVM with the optimal parameters C = 1,000, gamma = 1
• Test error is 9.8% (estimated using 1,000 test samples)
[Figure: SVM decision boundary for Ripley's data in the (x1, x2) plane]

SVM Example 2: Noisy Hyperbolas
• Noisy Hyperbolas data set:
- 100 training samples (50 per class)
- 100 validation samples (used for parameter tuning)
[Figure: decision boundaries of the RBF SVM model and of the polynomial SVM model (5th degree)]
• Which model is 'better'?
• Model interpretation?

SVM Example 3: Handwritten Digits
• MNIST handwritten digits (5 vs. 8) ~ high-dimensional data
- 1,000 training samples (500 per class)
- 1,000 validation samples (used for parameter tuning)
- 1,866 test samples
• Each sample is a real-valued vector of size 28×28 = 784 (28 × 28 pixels)
• RBF SVM with optimally tuned parameters (C = 1)
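A minimal sketch of this model-selection procedure, assuming scikit-learn (the built-in 8x8 digits data set is used here as a small stand-in for the MNIST 5-vs-8 task; the parameter grid mirrors the log-scale search above):

```python
# Grid search over C and the RBF parameter gamma, scored by 10-fold cross-validation.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()
mask = np.isin(digits.target, [5, 8])              # binary 5-vs-8 problem
X = digits.data[mask] / 16.0                       # prescale inputs to [0, 1]
y = np.where(digits.target[mask] == 5, -1, +1)

param_grid = {
    "C": [0.1, 1, 10, 100, 1000, 10000],
    "gamma": [2.0 ** k for k in range(-3, 4)],     # 2^-3 ... 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validation error: %.1f%%" % (100 * (1 - search.best_score_)))
```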
How to Visualize a High-Dimensional SVM Model?
• Histogram of projections for a linear SVM:
- project the training data onto the normal vector w of the SVM model
- show a univariate histogram of the projected training samples
• On the histogram: '0' ~ decision boundary, -1/+1 ~ margins
• Similar histograms can be obtained for a nonlinear SVM

Histogram of Projections for the Digits Data
• Projections of the training/test data onto the normal direction of the RBF SVM decision boundary:
[Figure: histograms of projections for the training data and the test data, with projection values ranging over roughly (-1.5, 1.5)]

Practical Issues for SVM Classifiers
• Pre-processing: all inputs pre-scaled to the range [0,1] or [-1,+1]
• Model selection (parameter tuning)
• SVM extensions:
- multi-class problems
- unbalanced data sets
- unequal misclassification costs

SVM for Multi-Class Problems
• Digit recognition ~ a ten-class problem:
- estimate 10 binary classifiers (one digit vs. the rest)
• For prediction: a test input is applied to all 10 binary SVM classifiers, and the class with the largest SVM output value is selected

Unbalanced Settings and Unequal Costs
• Unbalanced settings:
- different numbers of positive and negative samples
- different prior probabilities for training/test data
• Different misclassification costs:
- two types of errors, false positives (FP) and false negatives (FN)
- Cost(false positive) vs. Cost(false negative)
- loss function: $\text{Cost}_{fp}\, P_{fp} + \text{Cost}_{fn}\, P_{fn} \to \min$
- these 'costs' need to be specified a priori, based on application requirements

SVM Modifications
• The problem: how to modify the standard SVM formulation for unbalanced data + unequal costs?
$\tfrac{1}{2}\|\mathbf{w}\|^2 + C^{+}\sum_{i \in \text{class}+}\xi_i + C^{-}\sum_{i \in \text{class}-}\xi_i \to \min$
where $C^{+} \sim \text{Cost(false negative)}$ and $C^{-} \sim \text{Cost(false positive)}$
• In practice, one needs to specify C and the ratio $C^{+}/C^{-}$

Example: SVM with Unequal Costs
• Ripley's data set (as before), where
- negative samples ~ 'triangles'
- the given misclassification costs are in the ratio 3:1
• Note: the boundary is shifted away from the positive samples
[Figure: decision boundary of the cost-sensitive SVM for Ripley's data in the (x1, x2) plane]
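In a library such as scikit-learn, the per-class penalties C+ and C- of the modified formulation above can be set through the class_weight argument, which multiplies C separately for each class. A minimal sketch (the toy Gaussian data stands in for Ripley's data set; the 3:1 ratio is the illustrative cost ratio from the example):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy two-class data as a stand-in for Ripley's data set
X = np.vstack([rng.normal(-0.5, 0.6, (125, 2)), rng.normal(+0.5, 0.6, (125, 2))])
y = np.hstack([-np.ones(125, dtype=int), np.ones(125, dtype=int)])

# Standard SVM: equal penalties for both error types
equal_cost = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)

# Cost-sensitive SVM: errors on the positive class penalized 3x more heavily,
# which shifts the decision boundary away from the positive samples
unequal_cost = SVC(kernel="rbf", C=10.0, gamma=1.0,
                   class_weight={-1: 1.0, +1: 3.0}).fit(X, y)

for name, clf in [("equal costs", equal_cost), ("3:1 costs", unequal_cost)]:
    print(name, "training error = %.3f" % (1 - clf.score(X, y)))
```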
SVM Applications
• Handwritten digit recognition
• Face detection in unrestricted images
• Text/document classification
• Image classification and retrieval
• ...

Handwritten Digit Recognition (mid-90's)
• Data set: postal (zip-code) images, segmented and cropped; ~7K training samples and 2K test samples
• Data encoding: 28×28 grey-scale pixel image
• Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)

Digit Recognition Results
• Summary
- prediction accuracy better than the custom NN's
- accuracy does not depend on the kernel type
- 100-400 support vectors per class (digit)
• More details:

Type of kernel    No. of Support Vectors    Error %
Polynomial        274                       4.0
RBF               291                       4.1
Neural Network    254                       4.2

• ~80-90% of the SVs coincide (for different kernels)
• Reduced-set SVM (Burges, 1996) ~ 15 SVs per class

Document Classification (Joachims, 1998)
• The problem: classification of text documents in large databases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via feature selection), which relies on a good indexing scheme. This is time-consuming and costly.
• Predictive-learning approach (SVM): construct a classifier using all possible features (words)
• Document/text representation: individual words = input features (possibly weighted)
• SVM performance:
- very promising (~90% accuracy vs. 80% by other classifiers)
- most problems are linearly separable → use linear SVM

Image Data Mining (Chapelle et al., 1999)
• Classification of images in databases, for image indexing etc.
[Figure: example Corel photo images]
• DATA SET: Corel photo images, 2,670 samples divided into 7 classes (airplanes, birds, fish, vehicles, etc.)
Training data: 1,375 images; test data: 1,375 images (50%)
• MAIN ISSUE: invariant representation / data encoding

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
- SV regression formulation
- Dual optimization formulation
- Model selection
- Example: Boston Housing
• Summary and Discussion

General SVM Modeling Approach
1. For the linear model $f(\mathbf{x},\omega) = (\mathbf{w}\cdot\mathbf{x}) + b$, minimize the SVM functional
$R_{SVM}(\mathbf{w}, b, Z_n) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\, R_{emp}(\mathbf{w}, b, Z_n)$
using the SVM loss suitable for the learning problem at hand
2. Transform (1) into the dual optimization formulation (using only dot products)
3. Use kernels to obtain a nonlinear version of (2).
Note: this approach is used for all learning problems. However, the tunable parameters of the margin-based loss differ for the various types of learning problems.

SVM Regression
• For the linear model $f(\mathbf{x},\omega) = (\mathbf{w}\cdot\mathbf{x}) + b$, minimize the SVM functional
$R_{SVM}(\mathbf{w}, b, Z_n) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\, R_{emp}(\mathbf{w}, b, Z_n)$
where the empirical loss (for regression) is given by
$L_\varepsilon(y, f(\mathbf{x},\omega)) = \max(|y - f(\mathbf{x},\omega)| - \varepsilon,\; 0)$
• Two distinct ways to control model complexity:
- by the value of C (with fixed epsilon)
- by the value of epsilon (with fixed large C)
• SVM regression tunes both epsilon and C for optimal performance

Linear SVM Regression
For the linear parameterization $f(\mathbf{x},\omega) = (\mathbf{w}\cdot\mathbf{x}) + b$, the SVM regression functional is
$R_{SVM}(\mathbf{w}, b, Z_n) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\, R_{emp}(\omega, Z_n) \to \min$
where
$R_{emp}(\omega, Z_n) = \tfrac{1}{n}\sum_{i=1}^{n} L_\varepsilon(y_i, f(\mathbf{x}_i,\omega))$,
$L_\varepsilon(y, f(\mathbf{x},\omega)) = \max(|y - f(\mathbf{x},\omega)| - \varepsilon,\; 0)$

Direct Optimization Formulation
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$
[Figure: epsilon-insensitive tube with slack variables $\xi_i$, $\xi_i^*$ above and below the tube]
• Minimize
$\tfrac{1}{2}(\mathbf{w}\cdot\mathbf{w}) + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)$
under the constraints
$y_i - (\mathbf{w}\cdot\mathbf{x}_i) - b \le \varepsilon + \xi_i$
$(\mathbf{w}\cdot\mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^*$
$\xi_i, \xi_i^* \ge 0, \quad i = 1,\ldots,n$

Dual Formulation for SVM Regression
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$, and the values of $\varepsilon$ and C
• Find coefficients $\alpha_i, \alpha_i^*$, $i = 1,\ldots,n$, which maximize
$L(\alpha, \alpha^*) = -\varepsilon\sum_{i=1}^{n}(\alpha_i^* + \alpha_i) + \sum_{i=1}^{n} y_i(\alpha_i^* - \alpha_i) - \tfrac{1}{2}\sum_{i,j=1}^{n}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(\mathbf{x}_i\cdot\mathbf{x}_j)$
under the constraints
$\sum_{i=1}^{n}(\alpha_i^* - \alpha_i) = 0$, $0 \le \alpha_i \le C$, $0 \le \alpha_i^* \le C$
• This yields the solution
$f(\mathbf{x}) = \sum_{i \in SV}(\alpha_i^* - \alpha_i)(\mathbf{x}_i\cdot\mathbf{x}) + b^*$

Example: RBF Regression
• SVM model with RBF kernels: $f(x, \mathbf{w}) = \sum_{j=1}^{m} w_j \exp\left(-\dfrac{(x - c_j)^2}{2\,(0.20)^2}\right)$
[Figure: RBF estimate (dashed line) obtained with $\varepsilon = 0.16$, C = 2000]
• The SVM model uses only 5 SVs (out of the 40 training points)

Example: Decomposition of the RBF Model
[Figure: the five weighted RBF kernel functions]
• The weighted sum of 5 RBF kernel functions gives the SVM model

SVM Model Selection: General
• Setting/tuning of SVM hyper-parameters
- usually performed by experts
- more recently, by non-expert practitioners
• Issues for SVM model selection:
(1) parameters controlling the 'margin' size
(2) kernel type and kernel complexity
• Strategies for model selection:
- exhaustive search in the parameter space (via resampling)
- efficient search using VC analytic bounds
- rule-of-thumb analytic strategies (for a particular type of learning problem)
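A minimal sketch of SVM regression in practice, assuming scikit-learn (the noisy sinc(x) target anticipates the parameter study on the next slides; the values of C, epsilon and the RBF width are illustrative): fit an RBF ε-SVR model and count how many training points become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(-10, 10, n)).reshape(-1, 1)
y = np.sinc(x.ravel() / np.pi) + rng.normal(0, 0.2, n)   # sin(x)/x + noise

svr = SVR(kernel="rbf", C=10.0, epsilon=0.2, gamma=0.5).fit(x, y)

print("support vectors: %d out of %d samples" % (len(svr.support_), n))
print("training RMSE: %.3f" % np.sqrt(np.mean((svr.predict(x) - y) ** 2)))
```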
Model Selection (continued)
• Parameters controlling the margin size:
- for classification, the parameter C
- for regression, the value of epsilon
- for single-class learning, the radius
• Complexity control via the fraction of SVs (ν-SVM):
- for classification, replace C with $\nu \in [0,1]$
- for regression, specify the fraction of points allowed to lie outside the ε-insensitive zone
• For very sparse data (d/n >> 1), use a linear SVM

Parameter Selection for SVM Regression
• Selection of parameter C. Recall the SVM solution
$f(\mathbf{x}) = \sum_{i \in SV}(\alpha_i^* - \alpha_i)(\mathbf{x}_i\cdot\mathbf{x}) + b$, where $0 \le \alpha_i^* \le C$, $0 \le \alpha_i \le C$, $i = 1,\ldots,n$
→ with bounded kernels (RBF), $C \approx y_{\max} - y_{\min}$
• Selection of $\varepsilon$: in general, $\varepsilon \sim$ (noise level), but this does not reflect the dependency on sample size.
For linear regression the variance of the estimate scales as $\sigma^2/n$, suggesting $\varepsilon \sim \sigma/\sqrt{n}$.
The final prescription: $\varepsilon = 3\sigma\sqrt{\dfrac{\ln n}{n}}$

Effect of SVM Parameters on Test Error
• Training data: the univariate sinc function $t(x) = \dfrac{\sin(x)}{x}$, $x \in [-10, 10]$, with additive Gaussian noise (sigma = 0.2)
(a) small sample size: 50   (b) large sample size: 200
[Figure: prediction risk as a function of epsilon and C/n for the two sample sizes]

SVM vs. Regularization
• System imitation → SVM
• System identification → regularization
But their risk functionals 'look similar':
$R_{SVM}(\mathbf{w}, b) = C\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i,\omega)) + \|\mathbf{w}\|^2$
$R_{reg}(\mathbf{w}, b) = \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i,\omega))^2 + \lambda\|\mathbf{w}\|^2$
• Recent claims: SVM = a special case of regularization
• These claims neglect the role of the margin-based loss

Comparison for Classification
• Linear SVM vs. penalized LDA → the comparison is fair
• Data sets: small (20 samples per class), large (100 samples per class)

Comparison Results: Classification
• Small sample size:
- linear SVM yields 0.5%-1.1% error rate
- penalized LDA yields 2.8%-3% error
• Large sample size:
- linear SVM yields 0.4%-1.1% error rate
- penalized LDA yields 1.1%-2.2% error
• Conclusion: margin-based complexity control is better than regularization

Comparison for Regression
• Linear SVM vs. linear ridge regression. Note: linear SVM has 2 parameters
• Sparse data set: 30 noisy samples, using the target function
$t(\mathbf{x}) = 2x_1 + x_2 + 0x_3 + 0x_4 + 0x_5$, $\mathbf{x} \in [0,1]^5$,
corrupted with Gaussian noise with $\sigma = 0.2$
• Complexity control:
- for ridge regression, vary the regularization parameter
- for SVM ~ the epsilon and C parameters

Complexity Control for Ridge Regression
• Coefficient shrinkage for ridge regression
[Figure: coefficients w1-w5 as a function of log(lambda)]

Complexity Control for SVM
• Coefficient shrinkage for SVM:
(a) vary C (epsilon = 0)   (b) vary epsilon (C = large)
[Figure: coefficients w1-w5 as a function of log(n/C) and as a function of epsilon]

Comparison: Ridge Regression vs. SVM
• Sparse setting: n = 10, noise $\sigma = 0.2$
• Ridge regression: $\lambda$ chosen by cross-validation
• SV regression: $\varepsilon = 0.2$, C selected by cross-validation
• Average risk (MSE, 100 realizations): 0.44 (ridge regression) vs. 0.37 (SVM)
[Figure: box plots of the risk for ridge regression and SVM]

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Summary
• Direct approach → different formulations
• Margin-based loss: robust, controls complexity (falsifiability)
• SRM: a new type of structure
• Nonlinear feature selection (~ SVs) incorporated into model estimation
• Appropriate applications:
- high-dimensional data
- content-based / content-dependent
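To close, a small sketch tying together the sparse-data comparison and the analytic prescription ε = 3σ√(ln n / n) discussed above, assuming scikit-learn (the value C = 10 and the use of RidgeCV/LinearSVR are illustrative; σ is assumed known here, while in practice it must be estimated):

```python
# Linear ridge regression with lambda chosen by cross-validation vs.
# linear SVR with the analytic epsilon prescription, on the sparse target
# t(x) = 2*x1 + x2 in 5 dimensions with Gaussian noise (sigma = 0.2).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
n, d, sigma = 30, 5, 0.2
w_true = np.array([2.0, 1.0, 0.0, 0.0, 0.0])

X = rng.uniform(0, 1, (n, d))
y = X @ w_true + rng.normal(0, sigma, n)
X_test = rng.uniform(0, 1, (10000, d))
y_test = X_test @ w_true                                 # noise-free targets

ridge = RidgeCV(alphas=np.logspace(-4, 4, 50)).fit(X, y)

eps = 3 * sigma * np.sqrt(np.log(n) / n)                 # analytic prescription
svr = LinearSVR(epsilon=eps, C=10.0, max_iter=100000).fit(X, y)

for name, model in [("ridge regression", ridge), ("linear SVR", svr)]:
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"{name:18s} test MSE = {mse:.4f}")
```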