Introduction to Predictive Learning
LECTURE SET 7: Support Vector Machines
Electrical and Computer Engineering

OUTLINE
• Objectives
- explain the motivation for SVM
- describe basic SVM for classification & regression
- compare SVM vs. statistical & NN methods
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

MOTIVATION for SVM
• Recall 'conventional' methods:
- model complexity ~ dimensionality
- nonlinear methods → multiple local minima
- hard to control complexity
• A 'good' learning method should have:
(a) a tractable optimization formulation
(b) tractable complexity control (1-2 parameters)
(c) flexible nonlinear parameterization
• Properties (a), (b) hold for linear methods → SVM solution approach

SVM APPROACH
• Linear approximation in Z-space using a special adaptive loss function
• Complexity independent of dimensionality
• Model structure: x → g(x) → z → (w·z) → ŷ

Motivation for Nonlinear Methods
1. A nonlinear learning algorithm is proposed using 'reasonable' heuristic arguments (reasonable ~ statistical or biological)
2. Empirical validation + improvement
3. Statistical explanation (why it really works)
Examples: statistical and neural network methods.
In contrast, SVM methods were originally proposed under the VC theoretic framework.

OUTLINE
• Objectives
• Motivation for margin-based loss
- Loss functions for regression
- Loss functions for classification
- Philosophical interpretation
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Main Idea
• Model complexity is controlled by a special loss function used for fitting the training data
• Such empirical loss functions may differ from the loss functions used in the learning problem setting
• Such loss functions are adaptive, i.e. they can adapt their complexity to a particular data set
• Different loss functions are used for different learning problems (classification, regression, etc.)
• Model complexity (VC-dimension) is controlled independently of the number of features

Robust Loss Function for Regression
• Squared loss ~ motivated by large-sample settings, parametric assumptions, Gaussian noise
• For practical settings it is better to use linear (least-modulus) loss

Epsilon-insensitive Loss for Regression
• Can also control model complexity (~ falsifiability):
$L_\varepsilon(y, f(\mathbf{x},\omega)) = \max(|y - f(\mathbf{x},\omega)| - \varepsilon,\; 0)$

Empirical Comparison
• Univariate regression $y = x + \xi$, $x \in [0,1]$, $\xi \sim N(0, 0.36)$
• Squared, linear and SVM loss (with $\varepsilon = 0.6$)
[Figure: estimates of the target function. Red ~ target function, dotted ~ estimate using squared loss, dashed ~ linear loss, dash-dotted ~ SVM loss]

Empirical Comparison (cont'd)
• Univariate regression $y = x + \xi$, $x \in [0,1]$, $\xi \sim N(0, 0.36)$
• Squared, linear and SVM loss (with $\varepsilon = 0.6$)
• Test error (MSE) estimated for 5 independent realizations of the training data (4 training samples each):

Realization   Squared loss   Least-modulus loss   SVM loss (eps = 0.6)
1             0.024          0.134                0.067
2             0.128          0.075                0.063
3             0.920          0.274                0.041
4             0.035          0.053                0.032
5             0.111          0.027                0.005
Mean          0.244          0.113                0.042
St. Dev.      0.381          0.099                0.025
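A minimal sketch of a similar experiment, assuming scikit-learn is available (LinearSVR, the value C = 100, and the random seed are illustrative choices, not from the lecture; the resulting numbers will differ from the table above because the noise realizations differ):

```python
# Compare squared-loss regression vs. epsilon-insensitive (SVM) regression
# on the univariate setup above: y = x + noise, x in [0,1], 4 training samples.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
n_train = 4

x_train = rng.uniform(0, 1, n_train).reshape(-1, 1)
y_train = x_train.ravel() + rng.normal(0, 0.6, n_train)   # noise variance 0.36
x_test = np.linspace(0, 1, 1000).reshape(-1, 1)
y_target = x_test.ravel()                                  # noise-free target

ols = LinearRegression().fit(x_train, y_train)             # squared loss
svr = LinearSVR(epsilon=0.6, C=100.0,                      # eps-insensitive loss
                max_iter=10000).fit(x_train, y_train)

for name, model in [("squared loss", ols), ("SVM loss (eps=0.6)", svr)]:
    mse = np.mean((model.predict(x_test) - y_target) ** 2)
    print(f"{name:20s} test MSE = {mse:.3f}")
```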
Loss Functions for Classification
• Decision rule $D(\mathbf{x}) = \mathrm{sign}(f(\mathbf{x},\omega))$
• The quantity $y\, f(\mathbf{x},\omega)$ is analogous to residuals in regression
• Common loss functions: 0/1 loss and linear loss
• Properties of a good loss function: continuous + convex + robust

Motivation for margin-based loss (1)
• Given: linearly separable data
• How to construct a linear decision boundary?
(a) Many linear decision boundaries (that have no errors)

Motivation for margin-based loss (2)
• Given: linearly separable data
• Which linear decision boundary is better?
• The model with the larger margin is more robust for future data

Largest-margin solution
• All solutions explain the data well (zero error)
• All solutions ~ the same linear parameterization
• Larger margin ~ more confidence (larger falsifiability)
[Figure: separating hyperplane with margin of size $2\Delta$]

Margin-based loss for classification: SVM loss or hinge loss
$L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$
• Equivalent to minimization of the slack variables $\xi_i = \max(1 - y_i f(\mathbf{x}_i,\omega),\; 0)$
[Figure: hinge loss and slack variables for the $y = +1$ and $y = -1$ classes]

Margin-based loss for classification: margin size is adapted to training data
[Figure: Class +1, Class -1, and the margin between them]
$L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$

Motivation: philosophical
• Classical view: a good model explains the data + has low complexity
→ Occam's razor (complexity ~ number of parameters)
• VC theory: a good model explains the data + has low VC-dimension
→ VC-falsifiability (small VC-dimension ~ large falsifiability), i.e. the goal is to find a model that can explain the training data but cannot explain other data
• The idea: falsifiability ~ empirical loss function

Adaptive Loss Functions
• Both goals (explanation + falsifiability) can be encoded into the empirical loss function, where
- a (large) portion of the data has zero loss
- the rest of the data has non-zero loss, i.e. it falsifies the model
• The trade-off (between the two goals) is adaptively controlled → adaptive loss function
• For classification, the degree of falsifiability is ~ margin size (see below)

Margin-based loss for classification
[Figure: margin of size $2\Delta$ around the decision boundary]
$L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$

Classification: non-separable data
• Slack variables $\xi_i = 1 - y_i f(\mathbf{x}_i,\omega)$ for samples inside the margin or misclassified
$L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$

Margin-based complexity control
• A large degree of falsifiability is achieved by
- a large margin (for classification)
- a small epsilon (for regression)
• For linear classifiers: larger margin → smaller VC-dimension:
$\Delta_1 > \Delta_2 > \Delta_3 > \ldots \;\sim\; h_1 \le h_2 \le h_3 \le \ldots$

Δ-margin hyperplanes
• Solutions provided by minimization of the SVM loss can be indexed by the value of the margin Δ
→ SRM structure: for $\Delta_1 > \Delta_2 > \Delta_3 > \ldots$ the VC-dimensions satisfy $h_1 \le h_2 \le h_3 \le \ldots$
• If the data samples belong to a sphere of radius R, then the VC-dimension is bounded by
$h \le \min(R^2/\Delta^2,\; d) + 1$
• For large-margin hyperplanes, the VC-dimension is controlled independently of the dimensionality d.

SVM Model Complexity
• Two ways to control model complexity:
- via the model parameterization $f(\mathbf{x},\omega)$, using a fixed loss function $L(y, f(\mathbf{x},\omega))$
- via an adaptive loss function $L(y, f(\mathbf{x},\omega))$, using a fixed (linear) parameterization $f(\mathbf{x},\omega) = (\mathbf{w}\cdot\mathbf{x}) + b$
• ~ two types of SRM structures
• Margin-based loss can be motivated by Popper's falsifiability

Margin-based loss: summary
• Classification: $L(y, f(\mathbf{x},\omega)) = \max(1 - y f(\mathbf{x},\omega),\; 0)$
falsifiability controlled by the margin
• Regression: $L_\varepsilon(y, f(\mathbf{x},\omega)) = \max(|y - f(\mathbf{x},\omega)| - \varepsilon,\; 0)$
falsifiability controlled by $\varepsilon$
• Single-class learning: $L_r(f(\mathbf{x},\omega)) = \max(\|\mathbf{x} - \mathbf{a}\| - r,\; 0)$
falsifiability controlled by the radius r
NOTE: the same interpretation/motivation of margin-based loss applies to the different types of learning problems.
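The three losses above share the same structure: zero loss inside a 'margin' region and linearly growing loss outside it, so only the samples outside that region (the ones that falsify the model) influence the solution. A small plain-NumPy sketch (function names are illustrative):

```python
# Margin-based losses for the three learning problems summarized above.
import numpy as np

def hinge_loss(y, f):
    """Classification: max(1 - y*f(x), 0) with y in {-1, +1}."""
    return np.maximum(1.0 - y * f, 0.0)

def eps_insensitive_loss(y, f, eps):
    """Regression: max(|y - f(x)| - eps, 0)."""
    return np.maximum(np.abs(y - f) - eps, 0.0)

def single_class_loss(x, a, r):
    """Single-class learning: max(||x - a|| - r, 0) for center a and radius r."""
    return np.maximum(np.linalg.norm(x - a, axis=-1) - r, 0.0)

# Example: hinge loss is zero for confidently correct predictions (y*f >= 1)
print(hinge_loss(np.array([+1, +1, -1]), np.array([2.0, 0.3, 0.5])))
# loss values 0.0, 0.7, 1.5: only the first sample lies outside the margin-violation region
```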
OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
- Primal formulation (linearly separable case)
- Dual optimization formulation
- Soft-margin SVM formulation
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Optimal Separating Hyperplane
• Distance between the hyperplane and a sample $\mathbf{x}'$: $|f(\mathbf{x}')| / \|\mathbf{w}\|$
• Margin $\Delta = 1/\|\mathbf{w}\|$
[Figure: optimal separating hyperplane; the shaded points are the SVs]

Optimization Formulation
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$
• Find parameters $\mathbf{w}$, $b$ of the linear hyperplane $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$ that minimize $\tfrac{1}{2}\|\mathbf{w}\|^2$ under the constraints $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1$
• Quadratic optimization with linear constraints → tractable for moderate dimensions d
• For large dimensions use the dual formulation:
- scales better with n (rather than d)
- uses only dot products

From Optimization Theory
• For a given convex minimization problem with convex inequality constraints there exists an equivalent dual maximization formulation with nonnegative Lagrange multipliers $\alpha_i \ge 0$
• Karush-Kuhn-Tucker (KKT) conditions: Lagrange coefficients $\alpha_i^* > 0$ only for samples that satisfy the original constraint $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1$ with equality
~ SVs have positive Lagrange coefficients

Convex Hull Interpretation of the Dual
• Find the convex hull of each class. The points closest to the optimal hyperplane are the support vectors.
[Figure: convex hulls of the two classes and the separating hyperplane]

Dual Optimization Formulation
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$
• Find parameters $\alpha_i^*$, $b^*$ of an optimal hyperplane
$D(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^{n} \alpha_i^* y_i (\mathbf{x}\cdot\mathbf{x}_i) + b^*\Big)$
as the solution of the maximization problem
$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i\cdot\mathbf{x}_j) \to \max$
under the constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$, $\alpha_i \ge 0$
• Note: data samples with nonzero $\alpha_i^*$ are SVs
• The formulation requires only inner products $(\mathbf{x}\cdot\mathbf{x}')$

Support Vectors
• SVs ~ training samples with non-zero loss
• SVs are the samples that falsify the model
• The model depends only on the SVs → SVs ~ a robust characterization of the data
WSJ, Feb 27, 2004: "About 40% of us (Americans) will vote for a Democrat, even if the candidate is Genghis Khan. About 40% will vote for a Republican, even if the candidate is Attila the Hun. This means that the election is left in the hands of one-fifth of the voters."

Support Vectors (cont'd)
• SVM test error bound: $E[\text{Test error}] \le \dfrac{E[\#\,\text{support vectors}]}{n}$
→ a small number of SVs ~ good generalization
• Can be explained using leave-one-out (LOO) cross-validation
• SVM generalization can be related to data compression

Soft-Margin SVM Formulation
[Figure: $f(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + b$ with margin lines $f(\mathbf{x}) = +1$, $f(\mathbf{x}) = 0$, $f(\mathbf{x}) = -1$ and slack variables $\xi_1 = 1 - f(\mathbf{x}_1)$, $\xi_2 = 1 - f(\mathbf{x}_2)$, $\xi_3 = 1 + f(\mathbf{x}_3)$]
Minimize $\tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i$
under the constraints $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$
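To make the soft-margin formulation concrete, a minimal sketch assuming scikit-learn (the two-Gaussian toy data and the value C = 1 are illustrative, not from the lecture): fit a linear soft-margin SVM and read off the weight vector w, the bias b, the margin 1/||w||, and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy Gaussian clouds as a stand-in for a two-class training set
X = np.vstack([rng.normal(-1.0, 0.7, (50, 2)), rng.normal(+1.0, 0.7, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)     # minimizes 0.5*||w||^2 + C*sum(xi_i)

w = clf.coef_.ravel()                           # weight vector w
b = clf.intercept_[0]                           # bias b
margin = 1.0 / np.linalg.norm(w)                # margin = 1/||w||

print("margin =", margin)
print("number of support vectors per class:", clf.n_support_)
# Support vectors are exactly the samples with nonzero Lagrange multipliers
print("alpha_i * y_i for the SVs:", clf.dual_coef_.ravel())
```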
SVM Dual Formulation (soft margin)
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$
• Find parameters $\alpha_i^*$, $b^*$ of an optimal hyperplane
$D(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^{n} \alpha_i^* y_i (\mathbf{x}\cdot\mathbf{x}_i) + b^*\Big)$
as the solution of the maximization problem
$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i\cdot\mathbf{x}_j) \to \max$
under the constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$, $0 \le \alpha_i \le C$
• Note: data samples with nonzero $\alpha_i^*$ are SVs
• The formulation requires only inner products $(\mathbf{x}\cdot\mathbf{x}')$

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Nonlinear Decision Boundary
• A fixed (linear) parameterization is too rigid
• A nonlinear (curved) decision boundary may yield a larger margin (falsifiability) and lower error

Nonlinear Mapping via Kernels
Nonlinear f(x,w) + margin-based loss = SVM
• Nonlinear mapping to a feature space (z-space): x → g(x) → z → (w·z) → ŷ
• Linear in z-space ~ nonlinear in x-space
• But $\mathbf{z}_i\cdot\mathbf{z}_j = \sum_{k=1}^{m} g_k(\mathbf{x}_i)\, g_k(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$ ~ a symmetric function of $\mathbf{x}_i$, $\mathbf{x}_j$
→ compute the dot product analytically via the kernel

Example of a Kernel Function
• 2D input space $\mathbf{x} = (x_1, x_2)$
• Mapping to z-space (2nd-order polynomial):
$z_1 = 1$, $z_2 = \sqrt{2}\,x_1$, $z_3 = \sqrt{2}\,x_2$, $z_4 = \sqrt{2}\,x_1 x_2$, $z_5 = x_1^2$, $z_6 = x_2^2$
• One can show by direct substitution that for two input vectors $\mathbf{u} = (u_1, u_2)$ and $\mathbf{v} = (v_1, v_2)$ the dot product of their images is calculated analytically:
$G(\mathbf{u})\cdot G(\mathbf{v}) = K(\mathbf{u}, \mathbf{v}) = \big((\mathbf{u}\cdot\mathbf{v}) + 1\big)^2$

SVM Formulation (with kernels)
• Replacing $\mathbf{z}\cdot\mathbf{z}' = K(\mathbf{x},\mathbf{x}')$ leads to:
• Find parameters $\alpha_i^*$, $b^*$ of an optimal hyperplane
$D(\mathbf{x}) = \mathrm{sign}\Big(\sum_{i=1}^{n} \alpha_i^* y_i K(\mathbf{x}_i, \mathbf{x}) + b^*\Big)$
as the solution of the maximization problem
$L(\alpha) = \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \to \max$
under the constraints $\sum_{i=1}^{n} y_i \alpha_i = 0$, $0 \le \alpha_i \le C/n$
• Given: the training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$, an inner-product kernel $K(\mathbf{x},\mathbf{x}')$, and the regularization parameter C

Examples of Kernels
The kernel $K(\mathbf{x},\mathbf{x}')$ is a symmetric function satisfying general (Mercer's) conditions. Examples of kernels for different mappings x → z:
• Polynomials of degree m: $K(\mathbf{x},\mathbf{x}') = \big((\mathbf{x}\cdot\mathbf{x}') + 1\big)^m$
• RBF kernel: $K(\mathbf{x},\mathbf{x}') = \exp\left(-\dfrac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right)$, where $\sigma$ is the width parameter
• Neural networks: $K(\mathbf{x},\mathbf{x}') = \tanh\big(v\,(\mathbf{x}\cdot\mathbf{x}') + a\big)$ for given parameters v, a
→ automatic selection of the number of hidden units (SVs)

More on Kernels
• The kernel (Gram) matrix contains all the information (data + kernel):
K(1,1) K(1,2) ... K(1,n)
K(2,1) K(2,2) ... K(2,n)
...
K(n,1) K(n,2) ... K(n,n)
• The kernel defines a distance in some feature space (aka kernel-induced feature space)
• The kernel parameter controls nonlinearity
• Kernels can incorporate a priori knowledge
• Kernels can be defined over complex structures (trees, sequences, sets, etc.)
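A quick numerical check of the second-order polynomial example above, in plain NumPy (the two test vectors are arbitrary): the dot product of the explicit 6-dimensional feature maps equals the kernel value ((u·v) + 1)^2 computed directly in input space.

```python
import numpy as np

def g(x):
    """Explicit 2nd-order polynomial feature map for 2-d input x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2,
                     x2 ** 2])

u = np.array([0.5, -1.2])
v = np.array([2.0, 0.3])

lhs = g(u) @ g(v)          # dot product in feature (z) space
rhs = (u @ v + 1.0) ** 2   # kernel evaluated analytically in input (x) space
print(lhs, rhs)            # the two values agree (up to rounding)
```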
New Insights Provided by SVM
• Why can linear classifiers generalize?
(1) the margin is large (relative to R)
(2) the fraction of SVs is small
(3) the ratio d/n is small
• SVM offers an effective way to control complexity (via margin + kernel selection), i.e. by implementing (1) or (2) or both
• Requires common-sense parameter tuning

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
- Model Selection
- Histogram of Projections
- SVM Extensions and Modifications
• SVM for Regression
• Summary and Discussion

SVM Model Selection
• The quality of SVM classifiers depends on proper tuning of the model parameters:
- kernel type (polynomial, RBF, etc.)
- kernel complexity parameter
- regularization parameter C
• Note: the VC-dimension depends on both C and the kernel parameters
• These parameters are usually selected via cross-validation, by searching over a wide range of parameter values (on a log scale)

SVM Example 1: Ripley's data
• Ripley's data set:
- 250 training samples
- SVM using the RBF kernel $K(\mathbf{u},\mathbf{v}) = \exp(-\gamma\|\mathbf{u}-\mathbf{v}\|^2)$
- model selection via 10-fold cross-validation
• Cross-validation error table:

            gamma=2^-3  2^-2   2^-1   2^0    2^1    2^2    2^3
C = 0.1     98.4%       51.6%  33.2%  28%    20.8%  19.2%  15.6%
C = 1       23.6%       22%    19.6%  18%    16.4%  14.4%  14%
C = 10      18.8%       20%    18.8%  16.4%  14%    13.6%  15.6%
C = 100     20.4%       20%    15.6%  14%    12.8%  15.6%  16.4%
C = 1000    18.4%       16%    13.6%  12.8%  16%    15.6%  18.4%
C = 10000   14.4%       14%    14.8%  15.6%  17.2%  16%    18.4%

→ optimal C = 1,000, gamma = 1
• Note: there may be multiple optimal parameter values

Optimal SVM Model
• RBF SVM with the optimal parameters C = 1,000, gamma = 1
• Test error is 9.8% (estimated using 1,000 test samples)
[Figure: SVM decision boundary for Ripley's data in the (x1, x2) plane]

SVM Example 2: Noisy Hyperbolas
• Noisy Hyperbolas data set:
- 100 training samples (50 per class)
- 100 validation samples (used for parameter tuning)
[Figure: decision boundaries of the RBF SVM model and of the polynomial SVM model (5th degree)]
• Which model is 'better'?
• Model interpretation?

SVM Example 3: Handwritten Digits
• MNIST handwritten digits (5 vs. 8) ~ high-dimensional data
- 1,000 training samples (500 per class)
- 1,000 validation samples (used for parameter tuning)
- 1,866 test samples
• Each sample is a real-valued vector of size 28×28 = 784 (28 × 28 pixels)
• RBF SVM with optimally tuned parameters (C = 1)
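A minimal sketch of this model-selection procedure, assuming scikit-learn (the built-in 8x8 digits data set is used here as a small stand-in for the MNIST 5-vs-8 task; the parameter grid mirrors the log-scale search above):

```python
# Grid search over C and the RBF parameter gamma, scored by 10-fold cross-validation.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()
mask = np.isin(digits.target, [5, 8])              # binary 5-vs-8 problem
X = digits.data[mask] / 16.0                       # prescale inputs to [0, 1]
y = np.where(digits.target[mask] == 5, -1, +1)

param_grid = {
    "C": [0.1, 1, 10, 100, 1000, 10000],
    "gamma": [2.0 ** k for k in range(-3, 4)],     # 2^-3 ... 2^3
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("cross-validation error: %.1f%%" % (100 * (1 - search.best_score_)))
```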
How to Visualize a High-Dimensional SVM Model?
• Histogram of projections for a linear SVM:
- project the training data onto the normal vector w of the SVM model
- show a univariate histogram of the projected training samples
• On the histogram: '0' ~ decision boundary, -1/+1 ~ margins
• Similar histograms can be obtained for a nonlinear SVM

Histogram of Projections for the Digits Data
• Projections of the training/test data onto the normal direction of the RBF SVM decision boundary:
[Figure: histograms of projections for the training data and the test data, with projection values ranging over roughly (-1.5, 1.5)]

Practical Issues for SVM Classifiers
• Pre-processing: all inputs pre-scaled to the range [0,1] or [-1,+1]
• Model selection (parameter tuning)
• SVM extensions:
- multi-class problems
- unbalanced data sets
- unequal misclassification costs

SVM for Multi-Class Problems
• Digit recognition ~ a ten-class problem:
- estimate 10 binary classifiers (one digit vs. the rest)
• For prediction: a test input is applied to all 10 binary SVM classifiers, and the class with the largest SVM output value is selected

Unbalanced Settings and Unequal Costs
• Unbalanced settings:
- different numbers of positive and negative samples
- different prior probabilities for training/test data
• Different misclassification costs:
- two types of errors, false positives (FP) and false negatives (FN)
- Cost(false positive) vs. Cost(false negative)
- loss function: $\text{Cost}_{fp}\, P_{fp} + \text{Cost}_{fn}\, P_{fn} \to \min$
- these 'costs' need to be specified a priori, based on application requirements

SVM Modifications
• The problem: how to modify the standard SVM formulation for unbalanced data + unequal costs?
$\tfrac{1}{2}\|\mathbf{w}\|^2 + C^{+}\sum_{i \in \text{class}+}\xi_i + C^{-}\sum_{i \in \text{class}-}\xi_i \to \min$
where $C^{+} \sim \text{Cost(false negative)}$ and $C^{-} \sim \text{Cost(false positive)}$
• In practice, one needs to specify C and the ratio $C^{+}/C^{-}$

Example: SVM with Unequal Costs
• Ripley's data set (as before), where
- negative samples ~ 'triangles'
- the given misclassification costs are in the ratio 3:1
• Note: the boundary is shifted away from the positive samples
[Figure: decision boundary of the cost-sensitive SVM for Ripley's data in the (x1, x2) plane]
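In a library such as scikit-learn, the per-class penalties C+ and C- of the modified formulation above can be set through the class_weight argument, which multiplies C separately for each class. A minimal sketch (the toy Gaussian data stands in for Ripley's data set; the 3:1 ratio is the illustrative cost ratio from the example):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy two-class data as a stand-in for Ripley's data set
X = np.vstack([rng.normal(-0.5, 0.6, (125, 2)), rng.normal(+0.5, 0.6, (125, 2))])
y = np.hstack([-np.ones(125, dtype=int), np.ones(125, dtype=int)])

# Standard SVM: equal penalties for both error types
equal_cost = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)

# Cost-sensitive SVM: errors on the positive class penalized 3x more heavily,
# which shifts the decision boundary away from the positive samples
unequal_cost = SVC(kernel="rbf", C=10.0, gamma=1.0,
                   class_weight={-1: 1.0, +1: 3.0}).fit(X, y)

for name, clf in [("equal costs", equal_cost), ("3:1 costs", unequal_cost)]:
    print(name, "training error = %.3f" % (1 - clf.score(X, y)))
```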
SVM Applications
• Handwritten digit recognition
• Face detection in unrestricted images
• Text/document classification
• Image classification and retrieval
• ...

Handwritten Digit Recognition (mid-90's)
• Data set: postal (zip-code) images, segmented and cropped; ~7K training samples and 2K test samples
• Data encoding: 28×28 grey-scale pixel image
• Original motivation: compare SVM with a custom MLP network (LeNet) designed for this application
• Multi-class problem: one-vs-all approach → 10 SVM classifiers (one per digit)

Digit Recognition Results
• Summary
- prediction accuracy better than the custom NN's
- accuracy does not depend on the kernel type
- 100-400 support vectors per class (digit)
• More details:

Type of kernel    No. of Support Vectors    Error %
Polynomial        274                       4.0
RBF               291                       4.1
Neural Network    254                       4.2

• ~80-90% of the SVs coincide (for different kernels)
• Reduced-set SVM (Burges, 1996) ~ 15 SVs per class

Document Classification (Joachims, 1998)
• The problem: classification of text documents in large databases, for text indexing and retrieval
• Traditional approach: human categorization (i.e. via feature selection), which relies on a good indexing scheme. This is time-consuming and costly.
• Predictive-learning approach (SVM): construct a classifier using all possible features (words)
• Document/text representation: individual words = input features (possibly weighted)
• SVM performance:
- very promising (~90% accuracy vs. 80% by other classifiers)
- most problems are linearly separable → use linear SVM

Image Data Mining (Chapelle et al., 1999)
• Classification of images in databases, for image indexing etc.
[Figure: example Corel photo images]
• DATA SET: Corel photo images, 2,670 samples divided into 7 classes (airplanes, birds, fish, vehicles, etc.)
Training data: 1,375 images; test data: 1,375 images (50%)
• MAIN ISSUE: invariant representation / data encoding

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
- SV regression formulation
- Dual optimization formulation
- Model selection
- Example: Boston Housing
• Summary and Discussion

General SVM Modeling Approach
1. For the linear model $f(\mathbf{x},\omega) = (\mathbf{w}\cdot\mathbf{x}) + b$, minimize the SVM functional
$R_{SVM}(\mathbf{w}, b, Z_n) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\, R_{emp}(\mathbf{w}, b, Z_n)$
using the SVM loss suitable for the learning problem at hand
2. Transform (1) into the dual optimization formulation (using only dot products)
3. Use kernels to obtain a nonlinear version of (2).
Note: this approach is used for all learning problems. However, the tunable parameters of the margin-based loss differ for the various types of learning problems.

SVM Regression
• For the linear model $f(\mathbf{x},\omega) = (\mathbf{w}\cdot\mathbf{x}) + b$, minimize the SVM functional
$R_{SVM}(\mathbf{w}, b, Z_n) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\, R_{emp}(\mathbf{w}, b, Z_n)$
where the empirical loss (for regression) is given by
$L_\varepsilon(y, f(\mathbf{x},\omega)) = \max(|y - f(\mathbf{x},\omega)| - \varepsilon,\; 0)$
• Two distinct ways to control model complexity:
- by the value of C (with fixed epsilon)
- by the value of epsilon (with fixed large C)
• SVM regression tunes both epsilon and C for optimal performance

Linear SVM Regression
For the linear parameterization $f(\mathbf{x},\omega) = (\mathbf{w}\cdot\mathbf{x}) + b$, the SVM regression functional is
$R_{SVM}(\mathbf{w}, b, Z_n) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\, R_{emp}(\omega, Z_n) \to \min$
where
$R_{emp}(\omega, Z_n) = \tfrac{1}{n}\sum_{i=1}^{n} L_\varepsilon(y_i, f(\mathbf{x}_i,\omega))$,
$L_\varepsilon(y, f(\mathbf{x},\omega)) = \max(|y - f(\mathbf{x},\omega)| - \varepsilon,\; 0)$

Direct Optimization Formulation
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$
[Figure: epsilon-insensitive tube with slack variables $\xi_i$, $\xi_i^*$ above and below the tube]
• Minimize
$\tfrac{1}{2}(\mathbf{w}\cdot\mathbf{w}) + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)$
under the constraints
$y_i - (\mathbf{w}\cdot\mathbf{x}_i) - b \le \varepsilon + \xi_i$
$(\mathbf{w}\cdot\mathbf{x}_i) + b - y_i \le \varepsilon + \xi_i^*$
$\xi_i, \xi_i^* \ge 0, \quad i = 1,\ldots,n$

Dual Formulation for SVM Regression
• Given training data $(\mathbf{x}_i, y_i)$, $i = 1,\ldots,n$, and the values of $\varepsilon$ and C
• Find coefficients $\alpha_i, \alpha_i^*$, $i = 1,\ldots,n$, which maximize
$L(\alpha, \alpha^*) = -\varepsilon\sum_{i=1}^{n}(\alpha_i^* + \alpha_i) + \sum_{i=1}^{n} y_i(\alpha_i^* - \alpha_i) - \tfrac{1}{2}\sum_{i,j=1}^{n}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)(\mathbf{x}_i\cdot\mathbf{x}_j)$
under the constraints
$\sum_{i=1}^{n}(\alpha_i^* - \alpha_i) = 0$, $0 \le \alpha_i \le C$, $0 \le \alpha_i^* \le C$
• This yields the solution
$f(\mathbf{x}) = \sum_{i \in SV}(\alpha_i^* - \alpha_i)(\mathbf{x}_i\cdot\mathbf{x}) + b^*$

Example: RBF Regression
• SVM model with RBF kernels: $f(x, \mathbf{w}) = \sum_{j=1}^{m} w_j \exp\left(-\dfrac{(x - c_j)^2}{2\,(0.20)^2}\right)$
[Figure: RBF estimate (dashed line) obtained with $\varepsilon = 0.16$, C = 2000]
• The SVM model uses only 5 SVs (out of the 40 training points)

Example: Decomposition of the RBF Model
[Figure: the five weighted RBF kernel functions]
• The weighted sum of 5 RBF kernel functions gives the SVM model

SVM Model Selection: General
• Setting/tuning of SVM hyper-parameters
- usually performed by experts
- more recently, by non-expert practitioners
• Issues for SVM model selection:
(1) parameters controlling the 'margin' size
(2) kernel type and kernel complexity
• Strategies for model selection:
- exhaustive search in the parameter space (via resampling)
- efficient search using VC analytic bounds
- rule-of-thumb analytic strategies (for a particular type of learning problem)
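A minimal sketch of SVM regression in practice, assuming scikit-learn (the noisy sinc(x) target anticipates the parameter study on the next slides; the values of C, epsilon and the RBF width are illustrative): fit an RBF ε-SVR model and count how many training points become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(-10, 10, n)).reshape(-1, 1)
y = np.sinc(x.ravel() / np.pi) + rng.normal(0, 0.2, n)   # sin(x)/x + noise

svr = SVR(kernel="rbf", C=10.0, epsilon=0.2, gamma=0.5).fit(x, y)

print("support vectors: %d out of %d samples" % (len(svr.support_), n))
print("training RMSE: %.3f" % np.sqrt(np.mean((svr.predict(x) - y) ** 2)))
```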
Model Selection (continued)
• Parameters controlling the margin size:
- for classification, the parameter C
- for regression, the value of epsilon
- for single-class learning, the radius
• Complexity control via the fraction of SVs (ν-SVM):
- for classification, replace C with $\nu \in [0,1]$
- for regression, specify the fraction of points allowed to lie outside the ε-insensitive zone
• For very sparse data (d/n >> 1), use a linear SVM

Parameter Selection for SVM Regression
• Selection of parameter C. Recall the SVM solution
$f(\mathbf{x}) = \sum_{i \in SV}(\alpha_i^* - \alpha_i)(\mathbf{x}_i\cdot\mathbf{x}) + b$, where $0 \le \alpha_i^* \le C$, $0 \le \alpha_i \le C$, $i = 1,\ldots,n$
→ with bounded kernels (RBF), $C \approx y_{\max} - y_{\min}$
• Selection of $\varepsilon$: in general, $\varepsilon \sim$ (noise level), but this does not reflect the dependency on sample size.
For linear regression the variance of the estimate scales as $\sigma^2/n$, suggesting $\varepsilon \sim \sigma/\sqrt{n}$.
The final prescription: $\varepsilon = 3\sigma\sqrt{\dfrac{\ln n}{n}}$

Effect of SVM Parameters on Test Error
• Training data: the univariate sinc function $t(x) = \dfrac{\sin(x)}{x}$, $x \in [-10, 10]$, with additive Gaussian noise (sigma = 0.2)
(a) small sample size: 50   (b) large sample size: 200
[Figure: prediction risk as a function of epsilon and C/n for the two sample sizes]

SVM vs. Regularization
• System imitation → SVM
• System identification → regularization
But their risk functionals 'look similar':
$R_{SVM}(\mathbf{w}, b) = C\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i,\omega)) + \|\mathbf{w}\|^2$
$R_{reg}(\mathbf{w}, b) = \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i,\omega))^2 + \lambda\|\mathbf{w}\|^2$
• Recent claims: SVM = a special case of regularization
• These claims neglect the role of the margin-based loss

Comparison for Classification
• Linear SVM vs. penalized LDA → the comparison is fair
• Data sets: small (20 samples per class), large (100 samples per class)

Comparison Results: Classification
• Small sample size:
- linear SVM yields 0.5%-1.1% error rate
- penalized LDA yields 2.8%-3% error
• Large sample size:
- linear SVM yields 0.4%-1.1% error rate
- penalized LDA yields 1.1%-2.2% error
• Conclusion: margin-based complexity control is better than regularization

Comparison for Regression
• Linear SVM vs. linear ridge regression. Note: linear SVM has 2 parameters
• Sparse data set: 30 noisy samples, using the target function
$t(\mathbf{x}) = 2x_1 + x_2 + 0x_3 + 0x_4 + 0x_5$, $\mathbf{x} \in [0,1]^5$,
corrupted with Gaussian noise with $\sigma = 0.2$
• Complexity control:
- for ridge regression, vary the regularization parameter
- for SVM ~ the epsilon and C parameters

Complexity Control for Ridge Regression
• Coefficient shrinkage for ridge regression
[Figure: coefficients w1-w5 as a function of log(lambda)]

Complexity Control for SVM
• Coefficient shrinkage for SVM:
(a) vary C (epsilon = 0)   (b) vary epsilon (C = large)
[Figure: coefficients w1-w5 as a function of log(n/C) and as a function of epsilon]

Comparison: Ridge Regression vs. SVM
• Sparse setting: n = 10, noise $\sigma = 0.2$
• Ridge regression: $\lambda$ chosen by cross-validation
• SV regression: $\varepsilon = 0.2$, C selected by cross-validation
• Average risk (MSE, 100 realizations): 0.44 (ridge regression) vs. 0.37 (SVM)
[Figure: box plots of the risk for ridge regression and SVM]

OUTLINE
• Objectives
• Motivation for margin-based loss
• Linear SVM Classifiers
• Nonlinear SVM Classifiers
• Practical Issues and Examples
• SVM for Regression
• Summary and Discussion

Summary
• Direct approach → different formulations
• Margin-based loss: robust, controls complexity (falsifiability)
• SRM: a new type of structure
• Nonlinear feature selection (~ SVs) incorporated into model estimation
• Appropriate applications:
- high-dimensional data
- content-based / content-dependent
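To close, a small sketch tying together the sparse-data comparison and the analytic prescription ε = 3σ√(ln n / n) discussed above, assuming scikit-learn (the value C = 10 and the use of RidgeCV/LinearSVR are illustrative; σ is assumed known here, while in practice it must be estimated):

```python
# Linear ridge regression with lambda chosen by cross-validation vs.
# linear SVR with the analytic epsilon prescription, on the sparse target
# t(x) = 2*x1 + x2 in 5 dimensions with Gaussian noise (sigma = 0.2).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.svm import LinearSVR

rng = np.random.default_rng(0)
n, d, sigma = 30, 5, 0.2
w_true = np.array([2.0, 1.0, 0.0, 0.0, 0.0])

X = rng.uniform(0, 1, (n, d))
y = X @ w_true + rng.normal(0, sigma, n)
X_test = rng.uniform(0, 1, (10000, d))
y_test = X_test @ w_true                                 # noise-free targets

ridge = RidgeCV(alphas=np.logspace(-4, 4, 50)).fit(X, y)

eps = 3 * sigma * np.sqrt(np.log(n) / n)                 # analytic prescription
svr = LinearSVR(epsilon=eps, C=10.0, max_iter=100000).fit(X, y)

for name, model in [("ridge regression", ridge), ("linear SVR", svr)]:
    mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"{name:18s} test MSE = {mse:.4f}")
```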