Learning Structured Prediction Models: A Large Margin Approach
Ben Taskar (U.C. Berkeley), with Vassil Chatalbashev, Carlos Guestrin, Daphne Koller, Michael Collins, Dan Klein, Chris Manning
"Don't worry, Howard. The big questions are multiple choice."

Structured prediction examples
- Handwriting recognition: x = pixel images, y = "brace" (sequential structure)
- Object segmentation: x = 3D scan, y = per-point labels (spatial structure)
- Natural language parsing: x = "The screen was a sea of red", y = parse tree (recursive structure)
- Disulfide connectivity prediction: x = RSCCPCYWGGCPWGQNCYPEGCSGPKV, y = bonding pattern (combinatorial structure)

Outline
- Structured prediction models
  - Geometric view
  - Sequences (CRFs), trees (CFGs), associative Markov networks (special MRFs), matchings
- Structured model polytopes
  - Linear programming inference
- Structured large margin estimation
  - Min-max formulation; application: 3D object segmentation
  - Certificate formulation; application: disulfide connectivity prediction

Structured models
- Scoring function over a space of feasible outputs
- Mild assumption: the score is a linear combination of features, w^T f(x, y)

Chain Markov Net (aka CRF*)
- P(y|x) ∝ ∏_i φ(x_i, y_i) ∏_i φ(y_i, y_{i+1})
- φ(x_i, y_i) = exp{ w · f(x_i, y_i) },  φ(y_i, y_{i+1}) = exp{ w · f(y_i, y_{i+1}) }
- Example features: f(y, y') = I(y = 'z', y' = 'a'),  f(x, y) = I(x_p = 1, y = 'z')
- *Lafferty et al. '01

Chain Markov Net (aka CRF*), cont.
- P(y|x) ∝ ∏_i φ(x_i, y_i) ∏_i φ(y_i, y_{i+1}) = exp{ w^T f(x, y) }
- w = [ ..., w_k, ... ],  f(x, y) = [ ..., Σ_i f_k(x_i, y_i), ..., Σ_i f_k(y_i, y_{i+1}), ... ]
- Local indicator features become global counts: f(x, y) = #(y = 'z', y' = 'a'),  f(x, y) = #(x_p = 1, y = 'z')

Associative Markov Nets
- Node potentials φ_i(y_i), edge potentials φ_ij(y_i, y_j)
- Point features: spin-images, point height
- Edge features: length of edge, edge orientation
- "Associative" restriction on the edge potentials

PCFG
- Features are rule and lexical counts: #(NP -> DT NN), ..., #(PP -> IN NP), ..., #(NN -> 'sea')

Disulfide bonds: non-bipartite matching
- RSCCPCYWGGCPWGQNCYPEGCSGPKV
- [figure: the six cysteines (1-6) as graph nodes, with the disulfide bonds forming a perfect matching]
- Fariselli & Casadio '01, Baldi et al. '04

Scoring function
- Each candidate bond (pair of cysteines) gets a score
- String features: residues, physical properties

Structured models, cont.
- Mild assumption: linear scoring function
- Another mild assumption: MAP inference can be written as a linear program
- LP inference for chains, trees, associative Markov nets, bipartite matchings, ...

Markov Net Inference LP
- [figure: the inference LP over node and edge marginal variables, with an example 0/1 solution vector]
- Has integral solutions y for chains and trees
- Gives an upper bound for general networks

Associative MN Inference LP
- "Associative" restriction
- For K = 2 labels, solutions are always integral (optimal)
- For K > 2, within a factor of 2 of optimal
- Constraint matrix A is linear in the number of nodes and edges, regardless of the tree-width

Other Inference LPs
- Context-free parsing: dynamic programs
- Bipartite matching: network flow
- Many other combinatorial problems

Outline (recap): next, structured large margin estimation (min-max formulation; certificate formulation)

Learning w
- Training example (x, y*)
- Probabilistic approach: maximize conditional likelihood
- Problem: computing the partition function Z_w(x) is #P-complete

Geometric example
- Training data (see figure)
- Goal: learn w such that w^T f(x, y*) points the "right" way

OCR example
- We want: argmax_word w^T f(x, word) = "brace"
- Equivalently:
  w^T f(x, "brace") > w^T f(x, "aaaaa")
  w^T f(x, "brace") > w^T f(x, "aaaab")
  ...
  w^T f(x, "brace") > w^T f(x, "zzzzz")
- That is exponentially many constraints!
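The argmax itself never requires enumerating the exponentially many words: because the chain score w^T f(x, y) decomposes into per-position and per-transition terms, it can be maximized by dynamic programming (Viterbi). A minimal sketch, where the node and transition score arrays stand in for the dot products w · f and are filled with made-up values:

```python
import numpy as np

def viterbi(node_scores, edge_scores):
    """Maximize sum_i node_scores[i, y_i] + sum_i edge_scores[y_i, y_{i+1}]
    over all label sequences y, by dynamic programming.

    node_scores: (n, K) array, e.g. w . f(x_i, y_i) for each position/label
    edge_scores: (K, K) array, e.g. w . f(y_i, y_{i+1}) for each label pair
    Returns (best_score, best_sequence).
    """
    n, K = node_scores.shape
    score = node_scores[0].copy()           # best score of a prefix ending in each label
    backptr = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        # candidate[prev, cur] = best prefix ending in prev, extended by transition + node term
        candidate = score[:, None] + edge_scores + node_scores[i][None, :]
        backptr[i] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # trace back the argmax sequence
    y = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(backptr[i][y[-1]]))
    return float(score.max()), y[::-1]

# Toy usage: 5 positions, 26 letter labels, random scores standing in for w . f
rng = np.random.default_rng(0)
print(viterbi(rng.normal(size=(5, 26)), rng.normal(size=(26, 26))))
```

The learning problem, however, still has to cope with one margin constraint per alternative output, which is what the next slides address.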
Large margin estimation
- Given a training example (x, y*), we want: w^T f(x, y*) ≥ w^T f(x, y) + ℓ(y*, y) for all y
- Maximize the margin
- Mistake-weighted margin: ℓ(y*, y) = number of mistakes in y
- *Taskar et al. '03

Large margin estimation: two routes
- Brute force enumeration of all constraints, or
- Min-max formulation: 'plug in' the linear program used for inference

Min-max formulation
- Assume linear (Hamming) loss, so ℓ(y*, y) decomposes over positions
- The inner maximization max_y [ w^T f(x, y) + ℓ(y*, y) ] becomes the inference LP with loss-augmented coefficients

Min-max formulation, cont.
- By strong LP duality, replace the inner LP by its dual
- Minimize jointly over w and the dual variables z: a single compact QP

Min-max formulation: scope
- Produces a compact QP for:
  - Low-treewidth Markov networks
  - Associative Markov networks
  - Context-free grammars
  - Bipartite matchings
  - Any problem with compact LP inference

3D Mapping
- Data provided by Michael Montemerlo & Sebastian Thrun
- Sensors: laser range finder, GPS, IMU
- Labels: ground, building, tree, shrub
- Training: 30 thousand points; testing: 3 million points

Segmentation results
- Hand-labeled 180K test points
  Model    Accuracy
  SVM      68%
  V-SVM    73%
  M^3N     93%

Fly-through [video]

Certificate formulation
- Non-bipartite matchings: O(n^3) combinatorial algorithm, but no polynomial-size inference LP known
- Spanning trees: likewise, no polynomial-size LP known, yet a simple certificate of optimality exists
- Intuition: verifying optimality is easier than optimizing
- Use a compact optimality condition of y* with respect to the scoring function instead of an inference LP

Certificate for non-bipartite matching
- Alternating cycle: every other edge is in the matching
- Augmenting alternating cycle: the edges not in the matching score more, in total, than the edges in the matching
- Negate the scores of edges not in the matching; an augmenting alternating cycle is then exactly a negative-length alternating cycle
- Matching is optimal ⇔ no negative alternating cycles (Edmonds '65)

Certificate for non-bipartite matching, cont.
- Pick any node r as root; let d_j = length of the shortest alternating path from r to j
- Triangle inequality: d_j ≤ d_i + length(i, j)
- Theorem: no negative-length alternating cycle ⇔ such a distance function d exists
- This can be expressed as linear constraints: O(n) distance variables, O(n^2) constraints

Certificate formulation: scope
- Produces a compact QP for:
  - Spanning trees
  - Non-bipartite matchings
  - Any problem with a compact optimality condition
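The fact the certificate leans on, that a graph has no negative cycle exactly when there is a distance function satisfying pairwise linear inequalities, is the same one behind Bellman-Ford potentials. A minimal sketch on an ordinary directed graph (the talk specializes it to alternating paths in the matching graph); the toy edges and lengths are made up for illustration:

```python
def negative_cycle_certificate(n, edges):
    """Bellman-Ford style check: return a distance function d with
    d[v] <= d[u] + length(u, v) for every edge if one exists
    (certifying there is no negative cycle), else None.

    edges: list of (u, v, length) with nodes 0..n-1.
    """
    d = [0.0] * n                      # start from 0 everywhere (a "virtual source")
    for _ in range(n - 1):             # n-1 relaxation rounds suffice if no negative cycle
        changed = False
        for u, v, length in edges:
            if d[u] + length < d[v]:
                d[v] = d[u] + length
                changed = True
        if not changed:
            break
    # one more pass: if anything still improves, a negative cycle exists
    for u, v, length in edges:
        if d[u] + length < d[v]:
            return None
    return d

# Toy usage: a 4-node graph with no negative cycle, so a certificate d is returned
edges = [(0, 1, 2.0), (1, 2, -1.0), (2, 3, 1.0), (3, 0, 0.5), (0, 2, 3.0)]
print(negative_cycle_certificate(4, edges))
```

In the learning QP, the analogous O(n^2) linear inequalities on the distance variables are added as constraints, so feasibility of d directly certifies optimality of the target matching y*.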
Disulfide connectivity prediction
- Dataset: Swiss-Prot protein database, release 39; 446 sequences (4-50 cysteines)
- Features: window profiles (size 9) around each candidate pair
- Two modes: bonded state known / unknown
- Comparison (Fariselli & Casadio '01, Baldi et al. '04): SVM-trained weights (ignoring matching constraints during learning) and the DAG recursive neural network [Baldi et al. '04]
- Our model: max-margin matching using an RBF kernel
- Training: off-the-shelf LP/QP solver, CPLEX (~1 hour)

Known bonded state (precision / accuracy, 4-fold cross-validation)
  Bonds   SVM           DAG RNN [Baldi et al. '04]   Max-margin matching
  2       0.63 / 0.63   0.74 / 0.74                  0.77 / 0.77
  3       0.51 / 0.38   0.61 / 0.51                  0.62 / 0.52
  4       0.34 / 0.12   0.44 / 0.27                  0.51 / 0.36
  5       0.31 / 0.07   0.41 / 0.11                  0.43 / 0.16

Unknown bonded state (precision / recall / accuracy, 4-fold cross-validation)
  Bonds   DAG RNN [Baldi et al. '04]   Max-margin matching
  2       0.49 / 0.59 / 0.40           0.57 / 0.59 / 0.44
  3       0.45 / 0.50 / 0.32           0.48 / 0.52 / 0.28
  4       0.37 / 0.36 / 0.15           0.39 / 0.40 / 0.14
  5       0.31 / 0.28 / 0.03           0.31 / 0.33 / 0.07

Formulation summary
- Brute force enumeration
- Min-max formulation: 'plug in' a convex program for inference
- Certificate formulation: directly guarantee optimality of y*

Estimation
- Generative P(x, y) vs. discriminative P(y|x) vs. margin-based
- Local models, P(z) = ∏_i P(z_i | z_{pa(i)}): HMMs, PCFGs (generative); MEMMs (discriminative)
- Global models, P(z) = (1/Z) ∏_c ψ(z_c): MRFs (generative); CRFs (discriminative)
- Margin-based: the large-margin estimation in this talk

Omissions
- Formulation details: kernels, multiple examples, slacks for the non-separable case
- Approximate learning of intractable models: general MRFs, learning to cluster
- Structured generalization bounds
- Scalable algorithms (no QP solver needed):
  - Structured SMO (works for chains, trees)
  - Structured EG (works for chains, trees)
  - Structured PG (works for chains, matchings, AMNs, ...)

Current work
- Learning approximate energy functions: protein folding, physical processes
- Semi-supervised learning: hidden variables, mixing labeled and unlabeled data
- Discriminative structure learning: using sparsifying priors

Conclusion
- Two general techniques for structured large-margin estimation
- Exact, compact, convex formulations
- Allow efficient use of kernels
- Tractable when other estimation methods are not
- Structured generalization bounds
- Efficient learning algorithms
- Empirical success on many domains
- Papers at http://www.cs.berkeley.edu/~taskar

Extra slides

Duals and kernels
- The kernel trick works: scoring functions (log-potentials) can use kernels
- The same holds for the certificate formulation

Handwriting recognition
- Length: ~8 characters; letters: 16x8 pixels
- 10-fold train/test: 5000/50000 letters, 600/6000 words
- Models: multiclass SVMs (*Crammer & Singer '01), CRFs, M^3 nets, with raw pixels, quadratic kernel, and cubic kernel
- [bar chart: average per-character test error for each model/kernel]
- 45% error reduction over linear CRFs; 33% error reduction over multiclass SVMs

Hypertext classification
- WebKB dataset: four CS department websites, 1300 pages / 3500 links
- Classify each page: faculty, course, student, project, other
- Train on three universities, test on the fourth
- [bar chart: test error for SVMs, RMNs (*Taskar et al. '02), and M^3Ns (relaxed dual, loopy belief propagation)]
- 53% error reduction over SVMs; 38% error reduction over RMNs

Projected gradient
- [figure: sequence of iterates y^k projected onto the feasible set]
- Projecting y' onto the constraints: a min-cost convex flow problem for Markov nets and matchings
- Convergence: same as steepest gradient
- Conjugate gradient also possible (two-metric projection)

Min-cost flow for Markov chains
- [figure: source s and sink t connected to a chain of a-z label nodes at each position]
- Capacities = C; edges from node s and to node t have cost 0

Min-cost flow for bipartite matchings
- [figure: source s, sink t, and the bipartite edge set]
- Capacities = C; edges from node s and to node t have cost 0

CFG chart
- A CNF tree is a set of two types of parts: constituents (A, s, e) and CF-rules (A -> B C, s, m, e)

CFG inference LP
- [figure: inference LP over chart parts, with inside and outside constraints]
- Has integral solutions y for trees
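Since a CNF parse is just a set of constituent and CF-rule parts, the highest-scoring tree can be found by a CKY-style dynamic program over the chart, which is the combinatorial problem the inference LP above encodes. A minimal sketch, with made-up rule and lexical scores standing in for the weighted feature counts w^T f:

```python
from collections import defaultdict

def cky_max_score(words, unary_scores, binary_scores, root="S"):
    """Score of the best CNF parse of `words`.

    unary_scores:  dict (A, word)  -> score of the lexical part A -> word
    binary_scores: dict (A, B, C)  -> score of the rule part A -> B C
    A tree's score is the sum of its parts' scores, matching the chart
    decomposition into constituent and CF-rule parts.
    """
    n = len(words)
    chart = defaultdict(lambda: float("-inf"))   # (start, end, symbol) -> best score
    for i, w in enumerate(words):                # length-1 spans from lexical parts
        for (A, word), s in unary_scores.items():
            if word == w:
                chart[i, i + 1, A] = max(chart[i, i + 1, A], s)
    for span in range(2, n + 1):                 # longer spans from binary rules
        for start in range(0, n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for (A, B, C), s in binary_scores.items():
                    left, right = chart[start, mid, B], chart[mid, end, C]
                    if left > float("-inf") and right > float("-inf"):
                        total = s + left + right
                        if total > chart[start, end, A]:
                            chart[start, end, A] = total
    return chart[0, n, root]

# Toy usage with made-up scores standing in for w . f counts
unary = {("DT", "the"): 1.0, ("NN", "screen"): 1.0, ("VB", "glows"): 1.0}
binary = {("NP", "DT", "NN"): 2.0, ("S", "NP", "VB"): 1.5}
print(cky_max_score(["the", "screen", "glows"], unary, binary))   # 6.5
```

The same chart quantities are what become the LP variables, which is why the parsing LP has integral optima for trees.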