A coarse-to-fine approach for fast deformable object detection
Marco Pedersoli, Andrea Vedaldi, Jordi Gonzàlez

Object detection
[VOC 2010]
[Fischler Elschlager 1973] [Vedaldi Zisserman 2009] [Felzenszwalb et al 08] [Zhu et al 10]
• Addressing the computational bottleneck
  - branch-and-bound [Blaschko Lampert 08, Lehmann et al. 09]
  - cascades [Viola Jones 01, Vedaldi et al. 09, Felzenszwalb et al 10, Weiss Taskar 10]
  - jumping windows [Chum 07]
  - sampling windows [Gualdi et al. 10]
  - coarse-to-fine [Fleuret Geman 01, Zhang et al 07, Pedersoli et al. 10]

Analysis of the cost of pictorial structures

The cost of pictorial structures
• cost of inference
  - one part: L
  - two parts: L^2
  - …
  - P parts: L^P
• with a tree, using dynamic programming: P L^2
  - polynomial, but still too slow in practice
  - L = number of part locations ~ number of pixels ~ millions
• with a tree and quadratic springs, using the distance transform [Felzenszwalb and Huttenlocher 05]: P L
  - in principle, millions of times faster than dynamic programming!

A notable case: deformable part models
• Deformable part model [Felzenszwalb et al. 08]
  - locations are discrete
    number of possible part locations: L → L / δ^2
  - deformations are bounded
    cost of placing two parts: L^2 → L C, with C << L
    C = max. deformation size
  - total geometric cost: C P L / δ^2

A notable case: deformable part models
• With deformable part models
  - finding the optimal parts configuration is cheap
  - distance transform speed-up is limited
  - geometric cost: C P L / δ^2
• Standard analysis does not account for filtering
  - filtering cost: F P L / δ^2
  - F = size of filter
  - total cost: (F + C) P L / δ^2
• Typical example
  - filter size: F = 6 × 6 × 32
  - deformation size: C = 6 × 6
• Filtering dominates the cost of finding the optimal part configuration!

Accelerating deformable part models
• deformable part model cost: (F + C) P L / δ^2
  - the key is reducing the filter evaluations
• Cascade of deformable parts [Felzenszwalb et al. 2010]
  - detect parts sequentially
  - stop when confidence is below a threshold
• Coarse-to-fine localization [Pedersoli et al. 2010]
  - multi-resolution search
  - we extend this idea to deformable part models

Our contribution: Coarse-to-fine for deformable models

Our model
• Multi-resolution deformable parts
  - each part is a HOG filter
  - recursive arrangement
  - resolution doubles
  - bounded deformation
• Score of a configuration S(y)
  - HOG filter score
  - parent-child deformation score

Coarse-to-Fine search

Quantify the saving
• 1D view (circle = part location)
• 2D view
  - exponentially larger saving
  - # filter evaluations at successive resolution levels:

    exact   CTF
    L       L
    4L      L
    16L     L

  - overall speed-up: 4^R

Lateral constraints
• Geometry in deformable part models is cheap
  - can afford additional constraints
• Lateral constraints
  - connect sibling parts
• Inference
  - use dynamic programming within each level
  - open the cycle by conditioning one node

Lateral constraints
• Why are lateral constraints useful?
• Encourage consistent local deformations
  - without lateral constraints siblings move independently
  - no way to make their motion coherent
• Example
  - without lateral constraints y and y' have the same geometric cost
  - with lateral constraints y can be encouraged

Experiments

Effect of deformation size
• INRIA pedestrian dataset
  - C = deformation size (HOG cells)
  - AP = average precision (%)
  - Coarse-to-fine (CTF) inference

    C      3×3     5×5     7×7
    AP     83.5    83.2    83.6
    time   0.33s   2.0s    9.3s

• Remarks
  - large C slows down inference but does not improve precision
  - small C already implies substantial part deformation, due to the multiple resolutions

Effect of the lateral constraints
• Exact vs Coarse-to-fine (CTF) inference

    inference         tree      tree + lateral conn.
    exact inference   83.0 AP   83.4 AP
    CTF inference     80.7 AP   83.5 AP

• CTF ~ exact inference scores
  - CTF ≤ exact
  - bound is tighter with lateral constraints
  [Figure: CTF score vs exact score, for tree and for tree + lat.]
• Effect is significant on training as well
  - additional coherence avoids spurious solutions
  - Example: learning the head model
  [Figure: CTF learning and tree; CTF learning and tree + lat.]

Training speed
• Structured latent SVM [Felzenszwalb et al. 08, Vedaldi et al. 09]
  - deformations of training objects are unknown
  - estimated as latent variables
• Algorithm
  - Initialization: no negative examples, no deformations
  - Outer loop
    ▪ Inner loop
      Collect hard negative examples (CTF inference)
      Learn the model parameters (SGD)
    ▪ Estimate the deformations (CTF inference)
• The training speed is dominated by the cost of inference!

    time              training   testing
    exact inference   ≈20h       2h (10s per image)
    CTF inference     ≈2h        4m (0.33s per image)

  - > 10× speed-up!

PASCAL VOC 2007
• Evaluate on the detection of 20 different object categories
  - ~5,000 images for training, ~5,000 images for testing

    class      MKL BOW   PS      Hierarc.   Cascade   OUR
    plane      37.6      29.0    29.4       22.8      27.7
    bike       47.8      54.6    55.8       49.4      54.0
    bird       15.3       0.6     9.4       10.6       6.6
    boat       15.3      13.4    14.3       12.9      15.1
    bottle     21.9      26.2    28.6       27.1      14.8
    bus        50.7      39.4    44.0       47.4      44.2
    car        50.6      46.4    51.3       50.2      47.3
    cat        30.0      16.1    21.3       18.8      14.6
    chair      17.3      16.3    20.0       15.7      12.5
    cow        33.0      16.5    19.3       23.6      22.0
    table      22.5      24.5    25.2       10.3      24.2
    dog        21.5       5.0    12.5       12.1      12.0
    horse      51.2      43.6    50.4       36.4      52.0
    mbike      45.5      37.8    38.4       37.1      42.0
    person     23.3      35.0    36.6       37.2      31.2
    plant      12.4       8.8    15.1       13.2      10.6
    sheep      23.9      17.3    19.7       22.6      22.9
    sofa       28.5      21.6    25.1       22.9      18.8
    train      45.3      34.0    36.8       34.7      35.3
    tv         48.5      39.0    39.3       40.0      31.1
    mean       32.1      26.8    29.6       27.3      26.9
    Time (s)   ~70       ~10     ~8         <1        <1

• Remarks
  - very good for aeroplane, bicycle, boat, table, horse, motorbike, sheep
  - less good for bottle, sofa, tv
• Speed-accuracy trade-off
  - time is drastically reduced
  - hit on AP is small

Comparison to the cascade of parts
• Cascade of parts [Felzenszwalb et al. 10]
  - test parts sequentially, reject when score falls below threshold
  - saving at unpromising locations (content dependent)
  - difficult to use in training (thresholds must be learned)
• Coarse-to-fine inference
  - saving is uniform (content independent)
  - can be used during training

Coarse-to-fine cascade of parts
• Cascade and CTF use orthogonal principles
  - easily combined
  - speed-up multiplies!
  [Figure: CTF → cascade score > τ1? (else reject) → CTF → cascade score > τ2? (else reject)]
• Example
  - apply a threshold at the root
  - plot AP vs speed-up
  - in some cases a 100× speed-up can be achieved

Summary
• Analysis of deformable part models
  - filtering dominates the geometric configuration cost
  - speed-up requires reducing filtering
• Coarse-to-fine search for deformable models
  - lower resolutions can drive the search at higher resolutions
  - lateral constraints add coherence to the search
  - exponential saving independent of the image content
  - can be used for training too
• Practical results
  - 10× speed-up on VOC and INRIA with minimal AP loss
  - can be combined with cascade of parts for multiplied speed-up
• Future
  - more complex models with rotation, foreshortening, …

Thank you!
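
Worked check of the cost analysis
The two numerical claims in the talk (filtering dominates the per-location cost (F + C), and the exact search evaluates L, 4L, 16L filters per level where CTF evaluates L, L, L) can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: the function names are hypothetical and not taken from the authors' code, L is normalized to 1, and the counts are per part, as in the "Quantify the saving" table.

```python
import math


def dpm_cost_terms(filter_shape=(6, 6, 32), deformation_shape=(6, 6)):
    """Return (F, C, filtering_share) for the per-location cost (F + C).

    F is the filter size and C the deformation size, using the
    "typical example" values from the talk as defaults.
    """
    F = math.prod(filter_shape)        # 6 * 6 * 32
    C = math.prod(deformation_shape)   # 6 * 6
    return F, C, F / (F + C)


def filter_evaluations(num_levels, L=1):
    """Per-level filter evaluations (per part) in 2D.

    Exact search: the number of locations quadruples each time the
    resolution doubles. CTF search: assumed constant at L per level,
    as in the table on the slide.
    """
    exact = [L * 4 ** r for r in range(num_levels)]
    ctf = [L for _ in range(num_levels)]
    return exact, ctf


if __name__ == "__main__":
    F, C, share = dpm_cost_terms()
    print(F, C, round(share, 3))   # 1152 36 0.97

    exact, ctf = filter_evaluations(3)
    print(exact, ctf)              # [1, 4, 16] [1, 1, 1]
    # Per-level saving of CTF over exact search: 4^r at level r.
    print([e // c for e, c in zip(exact, ctf)])   # [1, 4, 16]
```

With the typical values, filtering accounts for about 97% of the per-location cost, which is why the talk targets the number of filter evaluations rather than the geometric term.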