Structured Prediction Cascades
Ben Taskar, David Weiss, Ben Sapp, Alex Toshev
University of Pennsylvania

Supervised Learning
Learn from data {(x, y)}:
• Regression: y is a real value
• Binary Classification: y ∈ {−1, +1}
• Multiclass Classification: y ∈ {1, …, k}
• Structured Prediction: y is a combinatorial object (sequence, tree, matching, …)

Handwriting Recognition
x: image of a handwritten word → y: 'structured'

Machine Translation
x: 'Ce n'est pas un autre problème de classification.'
y: 'This is not another classification problem.'

Pose Estimation
x: image (© Arthur Gretton) → y: pose

Structured Models
Scoring function decomposes over "parts": s(x, y) = Σ_c θ·f_c(x, y_c)
Complexity of inference depends on the space of feasible outputs.
"Part" structure: parts = cliques, productions.

Supervised Structured Prediction
Model → Data → Learning → Prediction
Discriminative estimation of θ: intractable/impractical for complex models.
• 3rd-order OCR model: 1 million states × length
• Berkeley English grammar: 4 million productions × length³
• Tree-based pose model: 100 million states × # joints

Approximation vs. Computation
• Usual trade-off: approximation vs. estimation
  – Complex models need more data
  – Error from over-fitting
• New trade-off: approximation vs. computation
  – Complex models need more time/memory
  – Error from inexact inference
• This talk: enable complex models via cascades

An Inspiration: The Viola-Jones Face Detector
Scanning window at every location, scale, and orientation.
Classifier cascade: C1 → C2 → C3 → … → Cn, each stage rejecting non-faces and the final stage accepting a face.
• Most patches are non-face: filter out easy cases fast!
• Simple features first
• Low precision, high recall at each stage
• Learned layer by layer
• Each layer more complex than the last

Related Work
• Global Thresholding and Multiple-Pass Parsing. J. Goodman, 1997
• A Maximum-Entropy-Inspired Parser. E. Charniak, 2000
• Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking. E. Charniak & M. Johnson, 2005
• TAG, Dynamic Programming, and the Perceptron for Efficient, Feature-Rich Parsing. X. Carreras, M. Collins & T. Koo, 2008
• Coarse-to-Fine Natural Language Processing. S. Petrov, 2009
• Coarse-to-Fine Face Detection. F. Fleuret & D. Geman, 2001
• Robust Real-Time Object Detection. P. Viola & M. Jones, 2002
• Progressive Search Space Reduction for Human Pose Estimation. V. Ferrari, M. Marin-Jimenez & A. Zisserman, 2008

What's a Structured Cascade?
• What to filter?
  – Clique assignments
• What are the layers? (diagram: filters F1 → F2 → … → Fn, each passing a shrinking set of assignments toward the 'structured' output)
  – Higher-order models
  – Higher-resolution models
• How to learn filters?
  – Novel convex loss
  – Simple online algorithm
  – Generalization bounds

Trade-off in Learning Cascades
(diagram: Filter 1 → Filter 2 → … → Filter D → Predict)
• Accuracy: minimize the number of errors incurred by each level
• Efficiency: maximize the number of filtered assignments at each level

Max-marginals (Sequences)
(trellis diagram: sequence positions × states {a, b, c, d})
Score of an output: sum of clique (edge) scores along its path.
Max-marginal of a clique assignment, e.g. the edge bc: the maximum score over all outputs passing through that assignment (compute max over *bc*).

Filtering with Max-marginals
• Set a threshold
• Filter a clique assignment if its max-marginal falls below the threshold
(trellis diagram: edge bc is removed)

Why Max-marginals?
• Valid path guarantee
  – Filtering leaves at least one valid global assignment
• Faster inference
  – No exponentiation; O(k) vs. O(k log k) in some cases
• Convex estimation
  – Simple stochastic subgradient algorithm
• Generalization bounds for error/efficiency
  – Guarantees on the expected trade-off

Choosing a Threshold
• Threshold must be specific to input x
  – Max-marginal scores are not normalized
• Keep top-K max-marginal assignments?
  – Hard(er) to handle and analyze
• Convex alternative: the max-mean-max function
  t_α(x) = α · (max score) + (1 − α) · (mean max-marginal), the mean taken over all m clique assignments (m = # edges)

Choosing a Threshold (cont.)
(diagram: the max-marginal scores span a range from the mean max-marginal up to the max score, with the score of the truth in between; α picks a threshold within this range)
α = efficiency level: α = 0 thresholds at the mean max-marginal, α = 1 at the max score.

Example OCR Cascade
(trellis figures, condensed)
Level 1 (unigram scores): a 144, b 269, c −137, d 80, e 52, f −42, g 49, h 360, i −27, j −774, k 368, l −24, …
Mean ≈ 0, Max = 368; α = 0.55 gives Threshold ≈ 204 → only b, h, k survive at this position.
Level 2 (bigrams over surviving letters, ▪ marking the word boundary): e.g. ▪b 1440, ▪h 1558, ▪k 1575, ba 1440, ha 1558, ka 1575, be 1413, he 1553, …
Mean ≈ 1412, Max = 1575; α = 0.44 gives Threshold ≈ 1502 → e.g. ▪h, ▪k, ha, ka, he, ke, kn, ad, ed, nd, do, ow, om survive.
Level 3 (trigrams): surviving bigrams expand, e.g. ▪ha, ▪ka, ▪he, ▪ke, had, kad, hed, ked, ado, edo, ndo, dow, dom, …

Example: Pictorial Structure Cascade
Parts: head (H), torso (T), upper/lower right arm (URA, LRA), upper/lower left arm (ULA, LLA).
Per-part state space k ≈ 320,000; pairwise k² ≈ 100 million.
Cascade over increasing resolutions: 10×10×12 → 10×10×24 → 40×40×24 → 80×80×24 → 110×122×24 (locations × angles).

Quantifying Loss
• Filter loss: filtering out a true state.
  If score(truth) > threshold, all true states are safe.
• Efficiency loss: proportion of unfiltered clique assignments.

Learning One Cascade Level
• Fix α, solve a convex problem for θ:
  – No filter mistakes: require the score of the truth to clear the threshold t_α(x)
  – Margin with slack
  – Minimizes filter mistakes at efficiency level α

An Online Algorithm
• Stochastic subgradient update: when the margin constraint is violated, move θ toward the features of the truth and away from a convex combination of the features of the best guess and the average features of the max-marginal "witnesses".
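The stochastic subgradient update just described can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes inference has already produced the feature vectors of the truth, the best guess (argmax), and the average of the max-marginal witnesses, and the names `scp_update`, `f_best`, `f_witness_avg`, and the unit margin are ours.

```python
import numpy as np

def scp_update(theta, f_truth, f_best, f_witness_avg, alpha, lr, margin=1.0):
    """One stochastic subgradient step for a single cascade level (sketch).

    The threshold t_alpha(x) = alpha * (max score) + (1 - alpha) * (mean
    max-marginal) is linear in theta once the argmax and witnesses are fixed,
    so its subgradient is the convex combination of their features.
    """
    score = lambda f: float(theta @ f)
    threshold = alpha * score(f_best) + (1 - alpha) * score(f_witness_avg)
    if score(f_truth) < threshold + margin:  # filter mistake (within margin)
        # Move toward the truth, away from the threshold's subgradient.
        theta = theta + lr * (f_truth - (alpha * f_best + (1 - alpha) * f_witness_avg))
    return theta
```

In practice this step would be interleaved with inference (recomputing the argmax and witnesses under the current θ) and averaged or regularized as in standard subgradient methods.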
Generalization Bounds
(consolidated from several animation frames)
• W.h.p. (1 − δ), the filtering and efficiency loss observed on the training set generalize to new data: the expected loss on the true distribution is bounded by an empirical γ-ramp upper bound (a ramp surrogate decreasing from 1 to 0).
• The bound depends on n (number of examples), m (number of clique assignments), the number of cliques, and B.
• A similar bound holds for the efficiency loss L_e and all α ∈ [0, 1].

OCR Experiments
(bar chart: character and word error for 1st- through 4th-order models; both errors drop steadily with model order, with values shown ranging from 73.35% down to 7.75%)
Dataset: http://www.cis.upenn.edu/~taskar/ocr

Efficiency Experiments
• POS tagging (WSJ + CoNLL datasets)
  – English, Bulgarian, Portuguese
• Compare 2nd-order model filters:
  – Structured perceptron (max-sum)
  – SCP with α ∈ {0, 0.2, 0.4, 0.6, 0.8} (max-sum)
  – CRF log-likelihood (sum-product marginals)
• Tightly controlled:
  – Use a random 40% of the training set
  – Use the development set to fit the regularization parameter λ and α for all methods

POS Tagging Results
(plots for English, Portuguese, and Bulgarian: efficiency vs. error cap (%) on the development set, comparing Max-sum (SCP), Max-sum (0-1), and Sum-product (CRF))

English POS Cascade (2nd order)
Example tagging: The/DT cascade/NN is/VRB efficient/ADJ.

                    Full      SCP      CRF      Taglist
Accuracy (%)        96.83     96.82    96.84    ---
Filter Loss (%)     0         0.12     0.024    0.118
Test Time (ms)      173.28    1.56     4.16     10.6
Avg. Num States     1935.7    3.93     11.845   95.39

Pose Estimation (Buffy Dataset)
"Ferrari score": percent of parts correct within a radius.

method               torso     head      upper arms  lower arms  total
Ferrari et al. 08    --        --        --          --          61.4%
Ferrari et al. 09    --        --        --          --          74.5%
Andriluka et al. 09  98.3%     95.7%     86.8%       51.7%       78.8%
Eichner et al. 09    98.72%    97.87%    92.8%       59.79%      80.1%
CPS (ours)           100.00%   99.57%    96.81%      62.34%      86.31%

Pose Estimation Cascade
(figures: efficiency vs. accuracy; features include shape/segmentation match and contour continuation)

Conclusions
• Novel loss significantly improves efficiency
• Principled learning of accurate cascades
• Deep cascades focus structured inference and allow rich models
• Open questions:
  – How to learn the cascade structure
  – Dealing with intractable models
  – Other applications: factored dynamic models, grammars

Thanks!
Structured Prediction Cascades. D. Weiss & B. Taskar, AISTATS 2010.
Cascaded Models for Articulated Pose Estimation. B. Sapp, A. Toshev & B. Taskar, ECCV 2010.
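To make the cascade's core step concrete, here is a sketch of max-marginal filtering for a linear-chain model, as in the "Filtering with Max-marginals" and "Choosing a Threshold" slides. It is an illustrative simplification, not code from the papers: it assumes dense NumPy score tables, a position-independent edge score matrix, and for brevity it filters and averages over node (rather than edge) assignments.

```python
import numpy as np

def max_marginals(node_scores, edge_scores):
    """Max-marginal of each (position, state): the best total score of any
    sequence passing through that state, via max-sum forward/backward.
    node_scores: (T, K) array; edge_scores: (K, K) array."""
    T, K = node_scores.shape
    fwd = np.zeros((T, K))
    bwd = np.zeros((T, K))
    fwd[0] = node_scores[0]
    for t in range(1, T):  # best prefix score ending in each state
        fwd[t] = node_scores[t] + np.max(fwd[t - 1][:, None] + edge_scores, axis=0)
    for t in range(T - 2, -1, -1):  # best suffix score leaving each state
        bwd[t] = np.max(edge_scores + (node_scores[t + 1] + bwd[t + 1])[None, :], axis=1)
    return fwd + bwd

def cascade_filter(node_scores, edge_scores, alpha):
    """Prune states whose max-marginal falls below the convex threshold
    t_alpha = alpha * (max score) + (1 - alpha) * (mean max-marginal)."""
    mm = max_marginals(node_scores, edge_scores)
    threshold = alpha * mm.max() + (1 - alpha) * mm.mean()
    keep = mm >= threshold
    # Valid-path guarantee: the Viterbi path's states attain the max, so at
    # least one state survives at every position for any alpha in [0, 1].
    assert keep.any(axis=1).all()
    return keep, threshold
```

Raising α pushes the threshold from the mean max-marginal toward the max score, trading safety (fewer filter mistakes) for efficiency (more pruning), which is exactly the knob the learning procedure fixes per level.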