Learning and Inference in Vision: from Features to Scene Understanding
Jonathan Huang, Tomasz Malisiewicz
MLD Student Research Symposium, 2009

[Image labels: Sky, Bridge, Sign, Trees, Car, Road]

Huge datasets
PASCAL Visual Object Classes (VOC) Challenge dataset: ~15,000 annotated images, ~35,000 annotated object instances, 20 object classes with segmentations and bounding boxes

Huge datasets
LabelMe dataset: ~11,845 static images, >100,000 labeled polygons

Outline
I. Recognizing single object classes (Jon)
II. Scene understanding with multiple classes (Tomasz)

Recognition task #1: Find all markers

Recognition task #2: Find all cats

Object recognition is often hard due to:
Geometric Variability
Variation within an object class
Viewpoint/Scales/Illumination Variability
(Images from Flickr)

From Pixels to Visual features
[Diagram labels: Imaging, Scene, Pixels, Inference, Low level features, "car", Higher level inference]

Local Visual Features
Images are high dimensional! (640 width) * (480 height) = (307200 pixels)
Compute image statistics in a region (e.g., estimate the distribution of image gradient orientations)

Key ideas in feature design
Be invariant to stuff you don’t care about… while not being too invariant

Object classification
Let’s look at a simpler example first…
Cow or Horse??
Inference: What object class is this?
Learning: What does each object class look like?

Document classification analogy
???
John Terry scored on a header to lift Chelsea to a 1-0 victory over Manchester United and extend the Blues’ Premier League lead to 5 points. Chelsea had been frustrated by Manchester United for 76 minutes, but took advantage of a free kick awarded when Darren Fletcher fouled Ashley Cole. Brian Ching scored six minutes into overtime and the Houston Dynamo advanced to Major League Soccer’s Western ...
???
In the Senate, where proposals differ substantially from the House-passed measure on issues like a government-run plan and how to pay for coverage, the bill is stalled while budget analysts assess its overall costs.
The slim margin in the House — the bill passed with just two votes to spare, and 39 Democrats opposed it — suggests even greater challenges in the Senate, where the majority leader, ...
Classify each document as sports or politics

Bag-of-words models for text classification
[Diagram labels: bag, words] (Sue Ann)
“Much of the meaning behind written language is preserved even when the ordering of the individual words is lost.” [El-Arini et al., ’09]

Document classification analogy
???
but to on Darren awarded Fletcher advanced Ashley lift over to 1-0 scored advantage Major for lead 76 Chelsea Premier to Terry League John Houston the kick Chelsea took United points. free minutes fouled United been frustrated overtime Manchester six a when League a extend victory Ching 5 and to and Western Manchester Brian Cole. Dynamo Soccer’s by a minutes, Blues’ the had header into of scored ...
???
the margin how In on majority 39 costs. with measure slim overall — to like opposed suggests challenges pay even substantially stalled government-run where the issues votes it the where bill for spare, from bill and a Senate, analysts coverage, in — the Democrats greater differ two proposals budget its House assess while Senate, to in just the leader and the plan passed the is House-passed The ...

Visual words (discretization)
Words are discrete, visual features are typically continuous…
Discretization via clustering/vector quantization

Visual words [Sivic et al., ’05]

Object classification with bag of words [Sivic et al., ’05]

Object classification with bag of words
Performance on Caltech 101 dataset with linear SVM on bag-of-word vectors: Faces, Airplanes, Cars [Csurka et al., ’04]

Object Detection problem
Detection: Locate all the faces in this image.
Classification: Is this a face, or not a face?

Face detection via a series of classifications (a.k.a. sliding window brain damage)

Sliding window detection results
[Image labels: False Detection, Missed Faces]

The need for… capturing spatial relationships

One Approach
Create a more descriptive (complicated) feature

Histograms of Oriented Gradients (HOG) features [Dalal & Triggs, ’06]
[Diagram labels: Original Image; Estimated Image Gradients (gradient magnitudes, gradient orientations); Subdivided Image cells; Histogrammed gradients in each cell]

People Tracking with HOG features

Modeling Spatial Relationships with Deformable Part Based Models
Spring-based models: Parts prefer low-energy configurations
[Fischler & Elschlager, ’73], [Ramanan et al., ’07], [Felzenszwalb et al., ’05, ’09], [Kumar et al., ’09]

Parts Based Model
Goal: Assign model parts to image regions preserving both local appearance and spatial relationships
Vertices – Local Appearance
Edges – Spatial Relationship

Parts based models - Inference Problem
Inference problem: What is the best scoring assignment f?
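The scoring function itself did not survive in the text of this slide. A common way to write a parts-based score with one appearance term per vertex and one spatial term per edge is sketched below. The symbols are assumptions for illustration, not the authors' notation: V is the set of parts, E the set of connected part pairs, x the image, f(i) the image location assigned to part i, and \phi_i, \psi_{ij} the appearance and spatial scores.

```latex
% Hypothetical notation (not taken from the slides); \text needs amsmath.
\[
  f^{*} \;=\; \arg\max_{f}\;
  \underbrace{\sum_{i \in V} \phi_i\bigl(f(i);\, x\bigr)}_{\text{local appearance}}
  \;+\;
  \underbrace{\sum_{(i,j) \in E} \psi_{ij}\bigl(f(i),\, f(j)\bigr)}_{\text{pairwise spatial relationship}}
\]
```

Negating the score turns the same expression into an energy to minimize, which matches the spring-model view in which parts prefer low-energy configurations.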
For trees, belief propagation gives an exact solution in polynomial time.
Inference is NP-hard for general graphs.
[Equation labels: Local Appearance term, Pairwise Spatial Relationship term]

Parts based models - Learning Problem
Linear models: [Equation labels: Local Appearance term, Pairwise Spatial Relationship term]
Learning linear models: Find weight vectors that best separate positive and negative examples.
E.g., convex max-margin objective s.t. positive examples on one side, negative examples on the other [Kumar et al., ’09]

Person deformable part model [Felzenszwalb et al., ’09]
Root filter (8x8 resolution)
Part filter (4x4 resolution)
Quadratic spatial configuration model
[Felzenszwalb et al., ’09] [Ramanan et al., ’09]

Part II: Scene Understanding with Multiple Classes

Goal: Predict Many Different Objects in a Single Image
[Image labels: Tree, Building, Car, Fence, Fire Hydrant, Sidewalk]

Wait...
• What’s wrong with just learning a different sliding window classifier for each object type in the world?

The image as seen from an object detector’s point of view

Relationships between objects make recognition possible
Antonio Torralba. The Context Challenge. http://web.mit.edu/torralba/www/carsAndFacesInContext.html

Objects as the “Parts” of a Scene
[Diagram labels: Deformable Part Model, Scene Model]
Key Challenge in Scene Understanding: Modeling relationships between objects from different categories

Fixed Extent “Things” vs Free-form “Stuff”
Things have a well-defined shape. A part of a car is not a car.
Stuff is free-form and mostly defined by color/texture. A part of a building is still a building.
[Image labels: Tree, Building, Car, Fence, Fire Hydrant, Sidewalk]

3 Types of Scene Models
Pixel-based
Window-based
Segment-based

Pixel-based Scene Understanding
Unable to reason about instances
Produces Segmentation
Only limited notion of context
Works well on “stuff”
TextonBoost: Joint Appearance, Shape and Context Modeling for Multiclass Object Recognition and Segmentation. Shotton et al. ECCV 2006

Pixel-wise Conditional Random Fields (TextonBoost)
• Inference: y^* = argmax_y p(y|x)
• Training: Use boosting to learn unary potential
• Future Direction: Higher-Order Cliques
[Shotton et al., ECCV 2006]

Window-based Scene Understanding
Often not possible to model “stuff” using windows.
Window assumption also questionable for some “things.”
Possible to model interactions between object instances.
Object Recognition by Scene Alignment. Russell et al. NIPS 2007
Discriminative models for multi-class object layout. Desai et al. ICCV 2009

Discriminative models for multiclass object layout
• Inference via Greedy Forward Search
• Training

Window-based results

Region-Based Scene Understanding
Use Segmentation algorithm to extract stable regions
Use CRF to label those segments
Problem: Hard to get object-segments.
Problem: Inference difficult for fully connected models.

Region-Based CRF
Spatial Relations
• Training: Bag of Words with Nearest Neighbor classifier
• Maximum Likelihood training of pairwise potentials
Object Categorization using Co-Occurrence, Location and Appearance. Galleguillos et al. CVPR 2008.

Segmentation-Based Results
[Panel labels: Input image, No context, w/ context]
[Galleguillos et al., CVPR 2008]

Model Granularity vs. Object Type Granularity
Object Type \ Model Granularity    Pixels   Windows   Regions
Things (car, cow, person)          :-(      :-)       :-/
Stuff (road, sky, tree)            :-)      :-(       :-)

Scene Understanding Recap
• Rich object-object interactions are important for scene understanding.
• Different underlying assumptions (pixel vs. window vs. region) are better suited for different types of objects (“stuff” vs. “things”).
• Many of the techniques for single class object recognition (e.g., part based models) are relevant for scene understanding.

Thanks!
Image Classification
Sliding Window based Object Detection
Modeling Spatial Relationships between objects
Modeling Spatial Relationships between parts
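
As a companion to the image-classification item in the summary above, the "discretization via clustering/vector quantization" step behind visual words can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the method of Sivic et al. or Csurka et al.: the function names (`init_farthest`, `kmeans`, `bag_of_words`), the farthest-point initialization, and the toy 2-D "features" are all hypothetical choices made here to keep the example small and deterministic.

```python
import math


def nearest(codebook, x):
    """Index of the codeword closest to feature vector x (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda j: math.dist(codebook[j], x))


def init_farthest(features, k):
    """Deterministic init (an assumption of this sketch): start from the first
    feature, then repeatedly add the feature farthest from all chosen ones."""
    codebook = [list(features[0])]
    while len(codebook) < k:
        far = max(features, key=lambda f: min(math.dist(f, c) for c in codebook))
        codebook.append(list(far))
    return codebook


def kmeans(features, k, iters=20):
    """Plain Lloyd's k-means; the k centroids play the role of visual words."""
    codebook = init_farthest(features, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for f in features:
            buckets[nearest(codebook, f)].append(f)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster is empty
                codebook[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return codebook


def bag_of_words(codebook, features):
    """Normalized histogram of visual-word counts for one image's features."""
    hist = [0.0] * len(codebook)
    for f in features:
        hist[nearest(codebook, f)] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]


# Toy data: two well-separated groups of 2-D "local features".
blob_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
blob_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]

codebook = kmeans(blob_a + blob_b, k=2)
# An "image" containing four features near word 0 and two near word 1.
hist = bag_of_words(codebook, blob_a + blob_b[:2])
print(codebook, hist)
```

In a real pipeline the feature vectors would be local descriptors computed from image regions, and the resulting histograms would be fed to a classifier such as the linear SVM mentioned on the Caltech 101 slide.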