Visual Element Discovery as Discriminative Mode Seeking
Carl Doersch (CMU), Abhinav Gupta (CMU), Alexei A. Efros (UCB)

The need for mid-level representations
• 6 billion images
• 70 billion images
• 1 billion images served daily
• 10 billion images
• 60 hours uploaded per minute
From : Almost 90% of web traffic is visual!

Discriminative patches
• Visual words are too simple
• Objects are too difficult
• Something in the middle?
(Felzenszwalb et al. 2008) (Singh et al. 2012)

Mid-level “Visual Elements” (Singh et al. 2012) (Doersch et al. 2012)
• Simple enough to be detected easily
• Complex enough to be meaningful
  – “Meaningful” as measured by weak labels

Mid-level “Visual Elements” (Singh et al. 2012) (Doersch et al. 2012)
• Doersch et al. 2012
• Singh et al. 2012
• Jain et al. 2013
• Endres et al. 2013
• Juneja et al. 2013
• Li et al. 2013
• Sun et al. 2013
• Wang et al. 2013
• Fouhey et al. 2013
• Lee et al. 2013

Our goal
• Provide a mathematical optimization for visual elements
• Improve performance of mid-level representations

Elements as Patch Classifiers

What if the labels are weak?
• E.g. image has horse/no-horse
• (Or even weaker, like Paris/not-Paris)
• Idea: Label these all as “horse”
• Problem: 10,000 patches per image, most of which are unclassifiable.

The weaker the label, the bigger the problem.
Task: Learn to classify Paris from Not-Paris
[Figure: example images labeled “Paris” and “Also Paris”]

Other approaches
• Latent SVM:
  – Assumes we have one instance per positive image
• Multiple instance learning
  – Not clear how to define the bags

What if the labels are weak? (Singh et al. 2012) (Doersch et al.
2012)
• Negatives are negatives, positives might not be positive
• Most of our data can be ignored
• First: how to cluster without clustering everything

Mean shift

Patch distances
[Figure: input patches and their nearest neighbors; Min distance: 2.59e-4, Max distance: 1.22e-4]

Mean shift
[Figure: negative set; legend: Paris, Not Paris]

Density Ratios
[Figure legend: Paris, Not Paris]

Adaptive Bandwidth
[Figure labels: Positive, Negative, Bandwidth]

Discriminative Mode Seeking
• Find local optima of an estimate of the density ratio
• Allow an adaptive bandwidth
• Be extremely fast
  – Minimize the number of passes through the data

Discriminative Mode Seeking
• Mean shift: maximize (w.r.t. w)
[Equation not preserved; labels: Bandwidth, Patch Feature, Distance, Centroid; symbols: w, b]

Discriminative Mode Seeking
B(w) is the value of b satisfying:
[Equation not preserved]

Discriminative Mode Seeking
optimize [objective not preserved] s.t. [constraint not preserved]
• Distance metric: Normalized Correlation

Discriminative Mode Seeking
optimize [objective not preserved] s.t. [constraint not preserved]
[Figure labels: Positive, Negative, w]

Optimization
s.t. [constraint not preserved]
• Initialization is straightforward
• For each element, just keep around ~500 patches where w^T x - b > 0
• Trivially parallelizable in MapReduce.
• Optimization is piecewise quadratic

Evaluation via Purity-Coverage Plot
• Analogous to Precision-Recall Plot
[Figure: Elements 1 to 5 shown twice, illustrating “Low Purity” and “High purity, Low Coverage”]

Purity-Coverage Curve
[Plot: Purity (0 to 1) vs. Coverage (0 to 10, x1e4 pixels); legend: Paris, Not Paris]

Purity-Coverage Curve
• Coverage for multiple elements is simply the union.

Methods compared (plot legend):
• This work
• This work, no inter-element
• SVM Retrained 5x (Doersch et al. 2012)
• LDA Retrained 5x
• LDA Retrained
• Exemplar LDA (Hariharan et al.
2012)

Purity-Coverage
[Plots: “Top 25 Elements” (Coverage 0 to 0.5) and “Top 200 Elements” (Coverage 0 to 0.8); y-axis: Purity (0.8 to 1); x-axis: Coverage (fraction of positive dataset)]

Results on Indoor 67 Scenes
[Figure: example elements for Kitchen, Elevator, Grocery, Bakery, Bowling, Bathroom]

Results on Indoor 67 Scenes

Method                                Accuracy (%)
ROI+Gist (Quattoni et al.)            26.05
MM-Scene (Zhu et al.)                 28.00
Scene-DPM (Pandey et al.)             30.40
CENTRIST (Wu et al.)                  36.90
Object Bank (Li et al.)               37.60
RBoW (Parizi et al.)                  37.93
Discr. Patches (Singh et al.)         38.10
Latent Pyramid. (Sadeghi et al.)      44.84
Bag of Parts (Juneja et al.)          46.10
miSVM (Li et al.)                     46.40
D. Patches (full) (Singh et al.)      49.40
MMDL (Wang et al.)                    50.15
Discr. Parts (Sun et al.)             51.40
IFV (Juneja et al.)                   60.77
Bag of Parts+IFV (Juneja et al.)      63.10
Ours (no inter-element)               63.36
Ours                                  64.03
Ours+IFV                              66.87

Qualitative Indoor67 Results

Indoor67: Error Analysis
• Ground Truth (GT): deli; Guess: grocery store
• GT: corridor; Guess: staircase
• GT: museum; Guess: garage
• GT: laundromat; Guess: closet

Thank you!
More results at http://graphics.cs.cmu.edu/projects/discriminativeModeSeeking/
• Paris Elements
• Indoor 67 Elements
• Indoor 67 Heatmaps
• Source code (soon)

Some New Paris Elements
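
The purity-coverage evaluation from the earlier slides can be illustrated with a small sketch. It is not the authors' evaluation code, and it rests on assumptions the slides do not spell out: purity is taken to be the fraction of the top-scoring detections that come from the positive set, coverage is the number of distinct positive-set pixels covered, and the `Detection` record and pixel ids are hypothetical. The one detail taken directly from the slides is that coverage across several elements is the union of what each covers.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical record for one patch detection (not from the slides):
# (detector score, True if the patch is from the positive set, ids of pixels it covers)
Detection = Tuple[float, bool, Set[int]]


def purity_coverage_curve(detections: Iterable[Detection]) -> List[Tuple[int, float]]:
    """Return one (coverage, purity) point per detection, in descending score order.

    Assumed definitions: purity = positives among the top-k detections / k;
    coverage = size of the union of pixels covered by positive detections so far.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    covered: Set[int] = set()   # union over all elements' positive detections
    n_positive = 0
    curve: List[Tuple[int, float]] = []
    for k, (_score, is_positive, pixels) in enumerate(ranked, start=1):
        if is_positive:
            n_positive += 1
            covered |= pixels   # overlapping patches are counted once
        curve.append((len(covered), n_positive / k))
    return curve


# Toy detections pooled from several elements; pixel ids are made up.
detections = [
    (0.9, True, {1, 2, 3}),
    (0.8, True, {3, 4}),
    (0.7, False, {10, 11}),
    (0.6, True, {5}),
]
curve = purity_coverage_curve(detections)
```

Pooling the detections of all elements before ranking is one possible design choice; ranking per element and then merging would trace a different curve, and the slides do not say which was used.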