Machine Learning II
Peter Gehler, TU Darmstadt, Feb 4, 2011

Acknowledgement: slides from Christoph H. Lampert, I.S.T. Austria, Vienna.
Slides and additional material: http://www.christoph-lampert.org
See also: Christoph H. Lampert, "Kernel Methods in Computer Vision", Foundations and Trends in Computer Graphics and Vision, now publishers, 2009.

Selecting and Combining Kernels

Selecting From Multiple Kernels
Typically, one has many different kernels to choose from:
- different functional forms: linear, polynomial, RBF, ...
- different parameters: polynomial degree, Gaussian bandwidth, ...
- different image features give rise to different kernels: color histograms, SIFT bag-of-words, HOG, pyramid match, spatial pyramids, ...
How to choose? Ideally, based on the kernels' performance on the task at hand:
- estimate it by cross-validation or by the error on a validation set.
Classically, this is part of "model selection".

Cross-Validation
- Classical case: split the dataset D into N disjoint sets D_j.
- Random sub-sampling: repeatedly split the dataset at random into a training and a validation set (with repetition).
- Leave-one-out: N = |D|.
- Train f_i on ∪_{j≠i} D_j and test on D_i.
- CV error = (1/N) Σ_i err(f_i, D_i).
- Stratified CV: split such that the class distribution in each fold is the same as in the entire set.

Kernel Parameter Selection
Remark: model selection makes a difference! Action classification on the KTH dataset:

  Method                                          Accuracy (test data)
  Dollár et al., VS-PETS 2005: "SVM classifier"   80.66
  Nowozin et al., ICCV 2007: "baseline RBF"       85.19

Identical features, same kernel function. The difference: Nowozin et al. used cross-validation for model selection (bandwidth and C).
Message: never rely on default parameters!

Kernel Parameter Selection
Rule of thumb for kernel parameters: for generalized Gaussian kernels

  k(x, x') = exp( − d(x, x') / (2γ) )

with any distance d, set γ ≈ median_{i,j=1,...,n} d(x_i, x_j).
Many variants exist:
- mean instead of median,
- use only d(x_i, x_j) with y_i ≠ y_j,
- ...
In general, if there are several classes, the kernel matrix K_ij = k(x_i, x_j) should have a block structure with respect to the classes.

[Figure: scatter plot of the "two moons" dataset and 50 × 50 kernel matrices compared to the label "kernel" (the ideal block structure given by the class labels): linear kernel and Gaussian kernels with γ = 0.001, 0.01, 0.1, 1, 10, 100, 1000, as well as γ = 0.6 (rule of thumb) and γ = 1.6 (5-fold CV). The rule-of-thumb and cross-validated bandwidths show a clear block structure, while much smaller or larger γ do not.]
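The slides give no code for the bandwidth rule of thumb or for cross-validation; the following is a minimal sketch only, assuming scikit-learn, a precomputable distance matrix, and illustrative candidate grids for γ and C (the function and grid choices are not from the lecture).

```python
# Minimal sketch (not the lecture's code): median heuristic for gamma plus
# stratified cross-validation over gamma and C for a generalized Gaussian
# kernel k(x, x') = exp(-d(x, x') / (2*gamma)) on a precomputed distance matrix.
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def generalized_gaussian_kernel(D, gamma):
    """k(x, x') = exp(-d(x, x') / (2 gamma)) for a precomputed distance matrix D."""
    return np.exp(-D / (2.0 * gamma))

def select_gamma_and_C(X, y, n_splits=5):
    """X: data array, y: label array; returns the (gamma, C) with best CV accuracy."""
    D = euclidean_distances(X)               # any distance d(x_i, x_j) would do here
    gamma_rot = np.median(D[np.triu_indices_from(D, k=1)])   # rule of thumb
    results = {}
    for gamma in gamma_rot * np.array([0.1, 0.3, 1.0, 3.0, 10.0]):  # grid around it
        K = generalized_gaussian_kernel(D, gamma)
        for C in [0.1, 1.0, 10.0, 100.0]:
            accs = []
            for tr, te in StratifiedKFold(n_splits).split(X, y):
                clf = SVC(kernel="precomputed", C=C)
                clf.fit(K[np.ix_(tr, tr)], y[tr])                 # train-vs-train kernel
                accs.append(clf.score(K[np.ix_(te, tr)], y[te]))  # test-vs-train kernel
            results[(gamma, C)] = np.mean(accs)
    return max(results, key=results.get)
```

The explicit sub-matrix indexing (K[np.ix_(te, tr)]) is what makes stratified cross-validation work with a precomputed kernel.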
Kernel Selection ↔ Kernel Combination
Is really one of the kernels best?
- Kernels are typically designed to capture one aspect of the data: texture, color, edges, ...
- Choosing one kernel means selecting exactly one such aspect.
- Combining aspects is often better than selecting one:

  Method        Accuracy
  Colour        60.9 ± 2.1
  Shape         70.2 ± 1.3
  Texture       63.7 ± 2.7
  HOG           58.5 ± 4.5
  HSV           61.3 ± 0.7
  siftint       70.6 ± 1.6
  siftbdy       59.4 ± 3.3
  combination   85.2 ± 1.5

Mean accuracy on the Oxford Flowers dataset [Gehler, Nowozin: ICCV 2009].

Combining Two Kernels
For two kernels k1, k2:
- The product k = k1·k2 is again a kernel.
  Problem: very small kernel values suppress large ones.
- The average k = ½(k1 + k2) is again a kernel.
  Problem: k1, k2 may live on different scales. Re-scale first?
- The convex combination k_β = (1−β)·k1 + β·k2 with β ∈ [0, 1] is again a kernel.
  Model selection: cross-validate over β ∈ {0, 0.1, ..., 1}.

Combining Many Kernels
With multiple kernels k1, ..., kK, all convex combinations are kernels:

  k = Σ_{j=1..K} βj·kj   with βj ≥ 0 and Σ_{j=1..K} βj = 1.

- Individual kernels can be "deactivated" by setting βj = 0.
- Combinatorial explosion forbids cross-validation over all combinations of βj (even just two candidate values per βj gives 2^K combinations).
- Proxy: instead of CV, maximize the SVM objective. Each combined kernel induces a feature space. In which combined feature space can we best
  - explain the training data, and
  - achieve a large margin between the classes?

Feature Space View of Kernel Combination
Each kernel kj induces a Hilbert space H_j and a feature map ϕ_j : X → H_j.
The weighted kernel k_j^{βj} := βj·kj induces the same Hilbert space H_j, but a rescaled feature map ϕ_j^{βj}(x) := √βj·ϕ_j(x):

  k_j^{βj}(x, x') ≡ ⟨ϕ_j^{βj}(x), ϕ_j^{βj}(x')⟩_{H_j} = ⟨√βj·ϕ_j(x), √βj·ϕ_j(x')⟩_{H_j} = βj·⟨ϕ_j(x), ϕ_j(x')⟩_{H_j} = βj·kj(x, x').

The linear combination k̂ := Σ_{j=1..K} βj·kj induces the product space Ĥ := ⊕_{j=1..K} H_j and the product mapping ϕ̂(x) := (ϕ_1^{β1}(x), ..., ϕ_K^{βK}(x))ᵀ:

  k̂(x, x') ≡ ⟨ϕ̂(x), ϕ̂(x')⟩_Ĥ = Σ_{j=1..K} ⟨ϕ_j^{βj}(x), ϕ_j^{βj}(x')⟩_{H_j} = Σ_{j=1..K} βj·kj(x, x').
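The simple combination rules above are easy to apply to precomputed kernel matrices. The sketch below is illustrative only (names and the single validation split are my choices, not the lecture's); the full procedure would cross-validate over β as stated on the slide.

```python
# Minimal sketch: product, average and convex combination of two precomputed
# kernel matrices K1, K2, plus selection of beta on a held-out validation split.
import numpy as np
from sklearn.svm import SVC

def product_kernel(K1, K2):
    return K1 * K2                          # elementwise product: again a kernel

def average_kernel(K1, K2):
    # if K1 and K2 live on different scales, rescale them first (e.g. by their mean)
    return 0.5 * (K1 + K2)

def convex_combination(K1, K2, beta):
    return (1.0 - beta) * K1 + beta * K2

def select_beta(K1, K2, y, train, val, C=1.0, betas=np.linspace(0.0, 1.0, 11)):
    """Pick beta in {0, 0.1, ..., 1} by accuracy on the validation indices."""
    scores = {}
    for beta in betas:
        K = convex_combination(K1, K2, beta)
        clf = SVC(kernel="precomputed", C=C)
        clf.fit(K[np.ix_(train, train)], y[train])
        scores[beta] = clf.score(K[np.ix_(val, train)], y[val])
    return max(scores, key=scores.get)
```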
Feature Space View of Kernel Combination
Implicit representation of a dataset using two kernels:
- kernel k1, feature representation ϕ_1(x_1), ..., ϕ_1(x_n) ∈ H_1,
- kernel k2, feature representation ϕ_2(x_1), ..., ϕ_2(x_n) ∈ H_2.
Kernel selection would most likely pick k2.

[Figure sequence: the dataset embedded in the product space H_1 × H_2 as (√β·ϕ_1(x_i), √(1−β)·ϕ_2(x_i)), i.e. β is the weight on k1, together with the maximum-margin hyperplane, as β is swept from 0 to 1. The resulting margins:

  β      margin        β      margin
  0.00   0.0000        0.60   0.7751
  0.01   0.1686        0.65   0.7770
  0.02   0.2363        0.70   0.7699
  0.03   0.2870        0.80   0.7194
  0.10   0.4928        0.90   0.5839
  0.20   0.6363        0.95   0.4515
  0.30   0.7073        0.97   0.3809
  0.40   0.7365        0.98   0.3278
  0.50   0.7566        0.99   0.2460
                       1.00   0.1000

The margin at either endpoint (a single kernel) is much smaller than at intermediate β; it is maximized around β ≈ 0.65, i.e. a proper combination of both kernels beats either kernel alone.]
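Such a sweep is easy to reproduce numerically. The sketch below is not from the lecture: it assumes precomputed kernel matrices K1, K2 and labels y, approximates the hard margin with a large C, and reads off the geometric margin as 1/‖w‖; the value is only meaningful where the data are separable in the combined space.

```python
# Sketch: margin of a (nearly) hard-margin SVM on K_beta = beta*K1 + (1-beta)*K2.
import numpy as np
from sklearn.svm import SVC

def margin_for_beta(K1, K2, y, beta, C=1e6):
    K = beta * K1 + (1.0 - beta) * K2
    clf = SVC(kernel="precomputed", C=C).fit(K, y)   # large C ~ hard margin
    a = clf.dual_coef_.ravel()                       # entries are y_i * alpha_i
    sv = clf.support_                                # indices of the support vectors
    w_norm_sq = a @ K[np.ix_(sv, sv)] @ a            # ||w||^2 in the combined feature space
    return 1.0 / np.sqrt(w_norm_sq)

# betas = np.linspace(0.0, 1.0, 21)
# margins = [margin_for_beta(K1, K2, y, b) for b in betas]
```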
Multiple Kernel Learning
Determine the coefficients βj that realize the largest margin.
First, how does the margin depend on the βj? Remember the standard SVM (here without slack variables):

  min_{w ∈ H}  ‖w‖²_H
  subject to   y_i·⟨w, ϕ(x_i)⟩_H ≥ 1   for i = 1, ..., n,

where H and ϕ are induced by the kernel k. New samples are classified by f(x) = ⟨w, ϕ(x)⟩_H.

Multiple Kernel Learning
Insert

  k(x, x') = Σ_{j=1..K} βj·kj(x, x')                                   (1)

with
- Hilbert space H = ⊕_j H_j,
- feature map ϕ(x) = (√β1·ϕ_1(x), ..., √βK·ϕ_K(x))ᵀ,
- weight vector w = (w_1, ..., w_K)ᵀ,
such that

  ‖w‖²_H = Σ_j ‖w_j‖²_{H_j},                                           (2)
  ⟨w, ϕ(x_i)⟩_H = Σ_j √βj·⟨w_j, ϕ_j(x_i)⟩_{H_j}.                       (3)

Multiple Kernel Learning
For fixed βj, the largest-margin hyperplane is given by

  min_{w_j ∈ H_j}  Σ_j ‖w_j‖²_{H_j}
  subject to       y_i·Σ_j √βj·⟨w_j, ϕ_j(x_i)⟩_{H_j} ≥ 1   for i = 1, ..., n.

Renaming v_j := √βj·w_j (and defining 0/0 := 0):

  min_{v_j ∈ H_j}  Σ_j (1/βj)·‖v_j‖²_{H_j}
  subject to       y_i·Σ_j ⟨v_j, ϕ_j(x_i)⟩_{H_j} ≥ 1   for i = 1, ..., n.

Multiple Kernel Learning
Therefore, the best hyperplane for variable βj is given by

  min_{v_j ∈ H_j, Σ_j βj = 1, βj ≥ 0}  Σ_j (1/βj)·‖v_j‖²_{H_j}                            (4)
  subject to  y_i·Σ_j ⟨v_j, ϕ_j(x_i)⟩_{H_j} ≥ 1   for i = 1, ..., n.                      (5)

This optimization problem is jointly convex in v_j and βj.
There is a unique global minimum, and we can find it efficiently!

Multiple Kernel Learning
The same works for the soft margin with slack variables:

  min_{v_j ∈ H_j, Σ_j βj = 1, βj ≥ 0, ξ_i ∈ R₊}  Σ_j (1/βj)·‖v_j‖²_{H_j} + C·Σ_i ξ_i      (6)
  subject to  y_i·Σ_j ⟨v_j, ϕ_j(x_i)⟩_{H_j} ≥ 1 − ξ_i   for i = 1, ..., n.                (7)

This optimization problem is jointly convex in v_j and βj.
There is a unique global minimum, and we can find it efficiently!

Flower Classification: Dataset
- 17 types of flowers, 80 images per class
- 7 different precomputed kernels
- Data from Nilsback & Zisserman, CVPR 2006

Combining Good Kernels
Observation: if all kernels are reasonable, simple combination methods work as well as sophisticated ones (and are much faster):

  Single features                    Combination methods
  Method    Accuracy     Time        Method        Accuracy     Time
  Colour    60.9 ± 2.1   3 s         product       85.5 ± 1.2   2 s
  Shape     70.2 ± 1.3   4 s         averaging     84.9 ± 1.9   10 s
  Texture   63.7 ± 2.7   3 s         CG-Boost      84.8 ± 2.2   1225 s
  HOG       58.5 ± 4.5   4 s         MKL (SILP)    85.2 ± 1.5   97 s
  HSV       61.3 ± 0.7   3 s         MKL (Simple)  85.2 ± 1.5   152 s
  siftint   70.6 ± 1.6   4 s         LP-β          85.5 ± 3.0   80 s
  siftbdy   59.4 ± 3.3   5 s         LP-B          85.4 ± 2.4   98 s

Mean accuracy and total runtime (model selection, training, testing) on the Oxford Flowers dataset [Gehler, Nowozin: ICCV 2009].
Message: never use MKL without comparing to simpler baselines!

Combining Good and Bad Kernels
Observation: if some kernels are helpful but others are not, the smarter techniques are better.

[Figure: accuracy on the Oxford Flowers dataset as up to 50 noise features are added, for product, average, CG-Boost, MKL (SILP or Simple), LP-β and LP-B (accuracy axis from 45% to 90%). The simple product and average rules lose accuracy as noise kernels are added, while the learned weightings degrade much more gracefully.] [Gehler, Nowozin: ICCV 2009]

MKL Toy Example
Support-vector regression to learn samples of f(t) = sin(ωt), with kernels

  kj(x, x') = exp( −‖x − x'‖² / (2σj²) ),   2σj² ∈ {0.005, 0.05, 0.5, 1, 10}.

Multiple kernel learning correctly identifies the right bandwidth.

Software for Multiple Kernel Learning
Existing toolboxes allow multiple-kernel SVM training:
- Shogun (C++ with bindings to Matlab, Python, etc.)
  http://www.fml.tuebingen.mpg.de/raetsch/projects/shogun
- SimpleMKL (Matlab)
  http://asi.insa-rouen.fr/enseignants/~arakotom/code/mklindex.html
- SKMsmo (Matlab)
  http://www.di.ens.fr/~fbach/  (older and slower than the others)
Typically, one only has to specify the set of kernels to select from and the regularization parameter C.
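To make the optimization problem (4)–(7) concrete without a toolbox, here is a naive alternating sketch, not the SILP or SimpleMKL algorithms used by the toolboxes above: for fixed β the problem is a standard SVM with kernel K(β) = Σ_j βj·K_j, and for fixed v_j the optimal simplex weights are βj ∝ ‖v_j‖ (by Cauchy–Schwarz). All names and the iteration count are illustrative.

```python
# Naive alternating sketch of the jointly convex MKL problem (illustrative only).
import numpy as np
from sklearn.svm import SVC

def mkl_alternating(Ks, y, C=1.0, n_iter=20):
    """Ks: list of precomputed kernel matrices; y: binary labels. Returns (beta, clf)."""
    beta = np.ones(len(Ks)) / len(Ks)
    clf = None
    for _ in range(n_iter):
        K = sum(b * Kj for b, Kj in zip(beta, Ks))        # K(beta) = sum_j beta_j K_j
        clf = SVC(kernel="precomputed", C=C).fit(K, y)    # SVM step for fixed beta
        a = clf.dual_coef_.ravel()                        # y_i * alpha_i on support vectors
        sv = clf.support_
        # block norms of the solution: ||v_j|| = beta_j * sqrt(a^T K_j[sv, sv] a)
        norms = np.array([b * np.sqrt(max(a @ Kj[np.ix_(sv, sv)] @ a, 0.0))
                          for b, Kj in zip(beta, Ks)])
        if norms.sum() > 0:
            beta = norms / norms.sum()                    # minimizer of sum_j ||v_j||^2 / beta_j
    return beta, clf
```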
Summary: Kernel Selection and Combination
- Model selection is important to achieve the highest accuracy.
- Combining several kernels is often superior to selecting a single one.
- Simple techniques often work as well as sophisticated ones.
- Always try single best, averaging and product first.

Learning Structured Outputs

From Arbitrary Inputs to Arbitrary Outputs
With kernels, we can handle "arbitrary" input spaces: we only need a pairwise similarity measure for objects:
- images,
- gene sequences, e.g. string kernels,
- graphs, e.g. random-walk kernels.
We can learn mappings f : X → {−1, +1} or f : X → R.
What about arbitrary output spaces?
We know: kernels correspond to feature maps ϕ : X → H.
But: we cannot invert ϕ; there is no ϕ⁻¹ : H → X.
So kernels do not readily help to construct f : X → Y with Y ≠ R.

"True" Multiclass SVM

Multiclass Classification
When are we interested in f : X → Y? For example, multi-class classification: f : X → {ω_1, ..., ω_K}.
Common solution: one-vs-rest training.
- For each class, train a separate f_c : X → R:
  - positive examples: {x_i : y_i = ω_c},
  - negative examples: {x_i : y_i ≠ ω_c} (i.e. the rest).
- Final decision: f(x) = argmax_{c} f_c(x).
Problem: the f_c know nothing of each other. For example, the scales of their outputs could be vastly different:
- f_1 : X → [−2, 2],  f_2 : X → [−100, 100].
Task: learn a real multi-class SVM that knows about the argmax decision.

Multiclass Classification
Express the K learning problems as one. Assume a kernel k : X × X → R with feature map ϕ : X → H.
Define class-dependent feature maps:

  ϕ_1(x) := (ϕ(x), 0, 0, ..., 0),
  ϕ_2(x) := (0, ϕ(x), 0, ..., 0),
  ...
  ϕ_K(x) := (0, 0, ..., 0, ϕ(x)),

so that a training sample x_i with label y_i = c_j is represented by ϕ_j(x_i).
Combined weight vector: w = (w_1, ..., w_K). Combined Hilbert space: H_mc := ⊕_{j=1..K} H.
Per-class weight vector and per-class feature map are equivalent views:

  ⟨ϕ(x_i), w_j⟩_H = ⟨ϕ_j(x_i), w⟩_{H_mc}.

Multiclass Classification
We add up all one-vs-rest SVM problems:

  min_{w_j ∈ H}  Σ_j ‖w_j‖²_H
  subject to    ⟨w_j, ϕ(x_i)⟩_H ≥ 1    for i = 1, ..., n with y_i = c_j,
               −⟨w_j, ϕ(x_i)⟩_H ≥ 1    for i = 1, ..., n with y_i ≠ c_j,
  for all j = 1, ..., K.

The constraints are uncoupled: the solutions w_j are the same as for separate SVMs.
Same decision as before: classify new samples using f(x) = argmax_{j=1..K} ⟨w_j, ϕ(x)⟩_H.

Multiclass Classification
Rewrite using ⟨ϕ(x_i), w_j⟩_H = ⟨ϕ_j(x_i), w⟩_{H_mc} and Σ_j ‖w_j‖²_H = ‖w‖²_{H_mc}:

  min_{w ∈ H_mc}  ‖w‖²_{H_mc}
  subject to    ⟨w, ϕ_j(x_i)⟩_{H_mc} ≥ 1    for i = 1, ..., n with y_i = c_j,
               −⟨w, ϕ_j(x_i)⟩_{H_mc} ≥ 1    for i = 1, ..., n with y_i ≠ c_j.

The solution w = (w_1, ..., w_K) and the classification rule f(x) = argmax_{j=1..K} ⟨w, ϕ_j(x)⟩_{H_mc} are the same as before.

Multiclass Classification
Now introduce coupling constraints for better decisions:

  min_{w ∈ H_mc}  ‖w‖²_{H_mc}
  subject to  ⟨w, ϕ_j(x_i)⟩ − ⟨w, ϕ_k(x_i)⟩ ≥ 1   with y_i = c_j, for all k ≠ j, for i = 1, ..., n.

Before: the correct class had a margin of 1 compared to 0.
Now: the correct class has a margin of 1 compared to all other classes.
The classification rule stays the same: f(x) = argmax_{j=1..K} ⟨w, ϕ_j(x)⟩_{H_mc}.
This is called the Crammer–Singer multiclass SVM.

Joint Feature Map
We have defined one feature map ϕ_j per output class c_j ∈ Y: ϕ_j : X → H_mc.
Instead, we can say we have defined one joint feature map Φ that depends on the sample x and on the class label y:

  Φ : X × Y → H_mc,    Φ(x, y) := ϕ_j(x)  for y = c_j.

Multiclass Classification with a Joint Feature Map
Multiclass SVM with joint feature map Φ : X × Y → H_mc:

  min_{w ∈ H_mc}  ‖w‖²_{H_mc}
  subject to  ⟨w, Φ(x_i, y_i)⟩ − ⟨w, Φ(x_i, y)⟩ ≥ 1   for all y ≠ y_i, for i = 1, ..., n.

Classify new samples using f(x) = argmax_{y ∈ Y} ⟨w, Φ(x, y)⟩_{H_mc}.
Φ(x, y) occurs only inside scalar products: kernelize!

Joint Kernel
Joint kernel function: k_joint : (X × Y) × (X × Y) → R, a similarity between two (sample, label) pairs.
Example: the multiclass kernel induced by Φ,

  k_mc( (x, y), (x', y') ) = k(x, x') · δ_{y = y'}.

Check: k_mc( (x, y), (x', y') ) = ⟨Φ(x, y), Φ(x', y')⟩_{H_mc}.
Y can have more structure than just being a set {1, ..., K}. Example: a multiclass kernel with class similarities,

  k_joint( (x, y), (x', y') ) = k(x, x') · k_class(y, y'),

where k_class(y, y') measures similarity, e.g. in a label hierarchy.
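For intuition, here is a small sketch of the multiclass joint feature map and joint kernel above, using an explicit finite-dimensional ϕ(x) (in the fully kernelized case one works with k directly); all names are illustrative.

```python
# Sketch: Phi(x, y) places phi(x) into the y-th block; its inner product
# reproduces k_mc((x, y), (x', y')) = k(x, x') * [y == y'].
import numpy as np

def joint_feature_map(phi_x, y, n_classes):
    """Phi(x, y): copy of phi(x) in block y (0-based), zeros elsewhere."""
    Phi = np.zeros(n_classes * len(phi_x))
    Phi[y * len(phi_x):(y + 1) * len(phi_x)] = phi_x
    return Phi

def k_mc(phi_x, y, phi_xp, yp):
    """Multiclass joint kernel: k(x, x') * delta(y, y')."""
    return float(phi_x @ phi_xp) * (1.0 if y == yp else 0.0)

def predict(w, phi_x, n_classes):
    """f(x) = argmax_y <w, Phi(x, y)>, with w stacked as (w_1, ..., w_K)."""
    scores = [w @ joint_feature_map(phi_x, y, n_classes) for y in range(n_classes)]
    return int(np.argmax(scores))

# quick check that the joint feature map reproduces the joint kernel
phi_a, phi_b = np.array([1.0, 2.0]), np.array([0.5, -1.0])
lhs = joint_feature_map(phi_a, 1, 3) @ joint_feature_map(phi_b, 1, 3)
assert np.isclose(lhs, k_mc(phi_a, 1, phi_b, 1))
```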
What would we like to predict?
Natural language processing:
- automatic translation (output: sentences),
- sentence parsing (output: parse trees).
Bioinformatics:
- secondary structure prediction (output: bipartite graphs),
- enzyme function prediction (output: path in a tree).
Robotics:
- planning (output: sequence of actions).
Computer vision:
- image segmentation (output: segmentation mask),
- human pose estimation (output: positions of body parts),
- image retrieval (output: ranking of images in a database).

Computer Vision Example: Semantic Image Segmentation
Input: images; output: segmentation masks.
- input space X = {images} ≅ [0, 255]^{3·M·N}
- output space Y = {segmentation masks} ≅ {0, 1}^{M·N}   (structured output)
- prediction function f : X → Y,  f(x) := argmin_{y ∈ Y} E(x, y)
- energy function E(x, y) = Σ_i w_iᵀ ϕ_u(x_i, y_i) + Σ_{i,j} w_ijᵀ ϕ_p(y_i, y_j)
Images: [M. Everingham et al.: "The PASCAL Visual Object Classes (VOC) Challenge", IJCV 2010]

Computer Vision Example: Human Pose Estimation
Input: image and body model; output: model fit.
- input space X = {images}
- output space Y = {positions/angles of K body parts} ≅ R^{4K}
- prediction function f : X → Y,  f(x) := argmin_{y ∈ Y} E(x, y)
- energy E(x, y) = Σ_i w_iᵀ ϕ_fit(x_i, y_i) + Σ_{i,j} w_ijᵀ ϕ_pose(y_i, y_j)
Images: [Ferrari, Marin-Jimenez, Zisserman: "Progressive Search Space Reduction for Human Pose Estimation", CVPR 2008]

Computer Vision Example: Point Matching
Input: image pairs; output: a mapping y : x_i ↔ y(x_i).
- prediction function f : X → Y,  f(x) := argmax_{y ∈ Y} F(x, y)
- scoring function F(x, y) = Σ_i w_iᵀ ϕ_sim(x_i, y(x_i)) + Σ_{i,j} w_ijᵀ ϕ_dist(x_i, x_j, y(x_i), y(x_j)) + Σ_{i,j,k} w_ijkᵀ ϕ_angle(x_i, x_j, x_k, y(x_i), y(x_j), y(x_k))
[J. McAuley et al.: "Robust Near-Isometric Matching via Structured Learning of Graphical Models", NIPS 2008]

Computer Vision Example: Object Localization
Input: image; output: object position (left, top, right, bottom).
- input space X = {images}
- output space Y = R⁴   (bounding-box coordinates)
- prediction function f : X → Y,  f(x) := argmax_{y ∈ Y} F(x, y)
- scoring function F(x, y) = wᵀ ϕ(x, y), where ϕ(x, y) = h(x|_y) is a feature vector for the image region y, e.g. a bag of visual words.
[M. Blaschko, C. Lampert: "Learning to Localize Objects with Structured Output Regression", ECCV 2008]

Computer Vision Examples: Summary
- Image segmentation:  y = argmin_{y ∈ {0,1}^N} E(x, y),   E(x, y) = Σ_i w_iᵀ ϕ(x_i, y_i) + Σ_{i,j} w_ijᵀ ϕ(y_i, y_j)
- Pose estimation:     y = argmin_{y ∈ R^{4K}} E(x, y),    E(x, y) = Σ_i w_iᵀ ϕ(x_i, y_i) + Σ_{i,j} w_ijᵀ ϕ(y_i, y_j)
- Point matching:      y = argmax_{y ∈ Π_n} F(x, y),       F(x, y) = Σ_i w_iᵀ ϕ(x_i, y_i) + Σ_{i,j} w_ijᵀ ϕ(y_i, y_j) + Σ_{i,j,k} w_ijkᵀ ϕ(y_i, y_j, y_k)
- Object localization: y = argmax_{y ∈ R⁴} F(x, y),        F(x, y) = wᵀ ϕ(x, y)

Grand Unified View
Predict a structured output by maximization

  y = argmax_{y ∈ Y} F(x, y)

of a compatibility function F(x, y) = ⟨w, ϕ(x, y)⟩ that is linear in a parameter vector w.
A generic structured prediction problem:
- X: arbitrary input domain,
- Y: structured output domain; decompose y = (y_1, ..., y_k),
- prediction function f : X → Y given by f(x) = argmax_{y ∈ Y} F(x, y),
- compatibility function (or negative "energy")

  F(x, y) = ⟨w, ϕ(x, y)⟩ = Σ_{i=1..k} w_iᵀ ϕ_i(y_i, x)             (unary terms)
                          + Σ_{i,j=1..k} w_ijᵀ ϕ_ij(y_i, y_j, x)    (binary terms)
                          + ...                                      (higher-order terms, sometimes)
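As a concrete sketch of this decomposition (not from the slides, and anticipating the handwriting example below), consider a chain-structured output with unary and pairwise terms; the maximization can then be done exactly by dynamic programming (Viterbi). Here 'unary[t, l]' stands for w_tᵀϕ_t(l, x) and 'pairwise[a, b]' for a shared pairwise term.

```python
# Sketch: unary + pairwise compatibility on a chain, and its exact argmax.
import numpy as np

def score(unary, pairwise, y):
    """F(x, y) = sum_t unary[t, y_t] + sum_t pairwise[y_t, y_{t+1}]."""
    return (sum(unary[t, y[t]] for t in range(len(y)))
            + sum(pairwise[y[t], y[t + 1]] for t in range(len(y) - 1)))

def viterbi_argmax(unary, pairwise):
    """argmax_y F(x, y) over all label sequences, in O(T * L^2) time."""
    T, L = unary.shape
    best = unary[0].copy()                 # best[l]: best score of a prefix ending in label l
    back = np.zeros((T, L), dtype=int)     # back-pointers for reconstruction
    for t in range(1, T):
        cand = best[:, None] + pairwise    # cand[a, b]: extend a prefix ending in a by label b
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + unary[t]
    y = [int(np.argmax(best))]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t, y[-1]]))
    return y[::-1]
```

For small T and L the result can be checked against a brute-force maximization over all label sequences, which is exactly the combinatorial search that Viterbi avoids.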
Example: Sequence Prediction – Handwriting Recognition
X = {5-letter word images}, x = (x_1, ..., x_5), x_j ∈ {0, 1}^{300×80};
Y = {ASCII transcriptions}, y = (y_1, ..., y_5), y_j ∈ {A, ..., Z}.

Variant 1: a feature function with only unary terms,

  ϕ(x, y) = ( ϕ_1(x, y_1), ..., ϕ_5(x, y_5) ),
  F(x, y) = ⟨w_1, ϕ_1(x, y_1)⟩ + ... + ⟨w_5, ϕ_5(x, y_5)⟩.

[Figure: per-letter prediction of an image of the word QUEST; output "Q V E S T".]
- Advantage: computing y* = argmax_y F(x, y) is easy; each y_i* can be found independently, checking 5 · 26 = 130 values.
- Problem: only local information is used; we cannot correct errors.

Example: Sequence Prediction – Handwriting Recognition
Variant 2: one global feature function,

  ϕ(x, y) = (0, ..., 0, Φ(x), 0, ..., 0)   with Φ(x) in the y-th position, if y ∈ D (a dictionary),
  ϕ(x, y) = (0, ..., 0, 0, 0, ..., 0)      otherwise.

[Figure: whole-word prediction; output "QUEST".]
- Advantage: access to global information, e.g. from the dictionary D.
- Problem: argmax_y ⟨w, ϕ(x, y)⟩ has to check 26⁵ = 11,881,376 values, and we need separate training data for each word.

Example: Sequence Prediction – Handwriting Recognition
Variant 3: a feature function with unary and pairwise terms,

  ϕ(x, y) = ( ϕ_1(y_1, x), ϕ_2(y_2, x), ..., ϕ_5(y_5, x), ϕ_{1,2}(y_1, y_2), ..., ϕ_{4,5}(y_4, y_5) ).

[Figure: chain-structured prediction; output "Q U E S T".]
- Compromise: computing y* is still efficient (Viterbi best path).
- Compromise: neighbor information allows correction of local errors.

During the last lectures we learned how to evaluate argmax_y F(x, y) (for some of these models):
- chains and trees (loop-free graphs): Viterbi decoding / dynamic programming,
- grids and arbitrary graphs (loopy graphs): approximate inference (e.g. loopy BP).
Today: how to learn a good function F(x, y) from training data.

Parameter Learning in Structured Models
Given: a parametric model (family) F(x, y) = ⟨w, ϕ(x, y)⟩.
Given: the prediction method f(x) = argmax_{y ∈ Y} F(x, y).
Not given: the parameter vector w (high-dimensional).
Supervised training: given example pairs {(x¹, y¹), ..., (xⁿ, yⁿ)} ⊂ X × Y, i.e. typical inputs with "the right" outputs for them.
Task: determine a "good" w.

Structured Output SVM
Two criteria for the decision function f:
- Correctness: ensure f(x_i) = y_i for i = 1, ..., n.
- Robustness: f should also work if the x_i are perturbed.
Translated to structured prediction with f(x) = argmax_{y ∈ Y} ⟨w, ϕ(x, y)⟩:
- Ensure, for i = 1, ..., n,
  argmax_{y ∈ Y} ⟨w, ϕ(x_i, y)⟩ = y_i  ⇔  ⟨w, ϕ(x_i, y_i)⟩ > ⟨w, ϕ(x_i, y)⟩ for all y ∈ Y \ {y_i}.
- Minimize ‖w‖².

Structured Output SVM
Optimization problem:

  min_{w ∈ R^d, ξ ∈ R^n₊}  ½‖w‖² + (C/n)·Σ_{i=1..n} ξ_i
  subject to, for i = 1, ..., n,
    ⟨w, ϕ(x_i, y_i)⟩ ≥ Δ(y_i, y) + ⟨w, ϕ(x_i, y)⟩ − ξ_i   for all y ∈ Y.

Δ(y_i, y) ≥ 0 is a loss function ("predict y when the correct output would be y_i").
The optimization problem is very similar to the usual (soft-margin) SVM:
- quadratic in w, linear in ξ,
- constraints linear in w and ξ,
- convex.
But there are n(|Y| − 1) constraints!
- Numeric optimization needs some tricks.
- It is computationally expensive.

Example: A "True" Multiclass SVM
Y = {1, 2, ..., K},  Δ(y, y') = 1 for y ≠ y' and 0 otherwise.
ϕ(x, y) = ( [y = 1]·Φ(x), [y = 2]·Φ(x), ..., [y = K]·Φ(x) ) = Φ(x)·e_yᵀ, with e_y the y-th unit vector.
Solve:

  min_{w, ξ}  ½‖w‖² + (C/n)·Σ_{i=1..n} ξ_i
  subject to, for i = 1, ..., n,
    ⟨w, ϕ(x_i, y_i)⟩ ≥ 1 + ⟨w, ϕ(x_i, y)⟩ − ξ_i   for all y ∈ Y \ {y_i}.

Classification (MAP): f(x) = argmax_{y ∈ Y} ⟨w, ϕ(x, y)⟩.
This is the Crammer–Singer multiclass SVM.
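The lecture does not prescribe a particular solver. One simple (if not the fastest) way to handle the n(|Y| − 1) constraints is to minimize the equivalent unconstrained form ½‖w‖² + (C/n)·Σ_i max_y [Δ(y_i, y) + ⟨w, ϕ(x_i, y)⟩ − ⟨w, ϕ(x_i, y_i)⟩] by subgradient descent, where each step only needs the single most violated output ("loss-augmented inference"). The sketch below specializes this to the multiclass instance above, where that argmax is a plain maximum over K classes; step size and iteration count are arbitrary illustrative choices.

```python
# Sketch: subgradient training of the multiclass S-SVM (illustrative only).
import numpy as np

def train_multiclass_ssvm(X, y, K, C=1.0, lr=0.01, n_iter=500):
    """X: (n, d) feature array, y: integer labels in {0, ..., K-1}."""
    n, d = X.shape
    w = np.zeros((K, d))                      # row y is the weight block w_y
    delta = 1.0 - np.eye(K)                   # 0/1 loss Delta(y, y')
    for _ in range(n_iter):
        grad = w.copy()                       # gradient of the 1/2 ||w||^2 term
        for i in range(n):
            scores = w @ X[i]                 # <w, phi(x_i, y)> for every y
            y_hat = int(np.argmax(delta[y[i]] + scores - scores[y[i]]))
            if y_hat != y[i]:                 # subgradient of the hinge/max term
                grad[y_hat] += (C / n) * X[i]
                grad[y[i]] -= (C / n) * X[i]
        w -= lr * grad
    return w

def predict(w, x):
    """f(x) = argmax_y <w, phi(x, y)> = argmax_y <w_y, x>."""
    return int(np.argmax(w @ x))
```

For structured outputs such as chains, only the loss-augmented argmax changes; it is then computed by Viterbi-style dynamic programming instead of a maximum over K classes.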
Example: Hierarchical Multiclass Classification
The loss function can reflect a class hierarchy, e.g. a tree with leaves cat, dog, car, bus:

  Δ(y, y') := ½ · (distance in the tree),

so Δ(cat, cat) = 0, Δ(cat, dog) = 1, Δ(cat, bus) = 2, etc.
Solve:

  min_{w, ξ}  ½‖w‖² + (C/n)·Σ_{i=1..n} ξ_i
  subject to, for i = 1, ..., n,
    ⟨w, ϕ(x_i, y_i)⟩ ≥ Δ(y_i, y) + ⟨w, ϕ(x_i, y)⟩ − ξ_i   for all y ∈ Y \ {y_i}.

Example: Object Localization
Input: image; output: object position (left, top, right, bottom).
- ϕ(x, y) = Φ(x|_y), a feature vector of the image region inside box y.
- Δ(y, y') := 1 − area(y ∩ y') / area(y ∪ y')   (one minus the box overlap).
- F(x, y) = ⟨w, ϕ(x, y)⟩: a quality score for region y in image x.
Constraints: ⟨w, ϕ(x_i, y_i)⟩ ≥ Δ(y_i, y) + ⟨w, ϕ(x_i, y)⟩ − ξ_i.
Interpretation:
- the correct location must have the largest score of all regions,
- highly overlapping regions may have similar scores,
- non-overlapping regions must have clearly lower scores.
[M. Blaschko, C. Lampert: "Learning to Localize Objects with Structured Output Regression", ECCV 2008]

Example: Object Localization – Results
Experiments on the PASCAL VOC 2006 dataset:
- compare the S-SVM with conventional training for sliding windows,
- identical setup: same features, same image kernel, etc.
[Figure: precision–recall curves for the VOC 2006 classes bicycle, bus and cat.]
Structured prediction training improved both precision and recall.

Summary: Structured-Output SVM
Task: predict f : X → Y instead of f : X → R.
Key idea:
- learn F : X × Y → R with F(x, y) = ⟨w, ϕ(x, y)⟩,
- predict via f(x) := argmax_{y ∈ Y} F(x, y).
Convex optimization problem, similar to the SVM,
- but with very many constraints: computationally expensive.
Field of active research, many open questions:
- How to speed up training?
- How to handle complicated ϕ ("higher-order terms")?
- How to combine the S-SVM with approximate inference?
- ...
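As a final concrete detail for the object localization example above, here is a minimal, illustrative sketch of the box-overlap loss Δ(y, y') = 1 − area(y ∩ y') / area(y ∪ y'), assuming boxes given as (left, top, right, bottom) in image coordinates (top < bottom).

```python
# Sketch: loss Delta(y, y') = 1 - intersection-over-union of two boxes.
def box_overlap_loss(y, y_prime):
    l1, t1, r1, b1 = y
    l2, t2, r2, b2 = y_prime
    iw = max(0.0, min(r1, r2) - max(l1, l2))        # width of the intersection
    ih = max(0.0, min(b1, b2) - max(t1, t2))        # height of the intersection
    inter = iw * ih
    union = (r1 - l1) * (b1 - t1) + (r2 - l2) * (b2 - t2) - inter
    return 1.0 - inter / union if union > 0 else 1.0

# e.g. box_overlap_loss((0, 0, 10, 10), (0, 0, 10, 10)) == 0.0   (perfect overlap)
#      box_overlap_loss((0, 0, 10, 10), (20, 20, 30, 30)) == 1.0 (no overlap)
```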