Recognizing objects and actions in images and video
Jitendra Malik, U.C. Berkeley
University of California Berkeley Computer Vision Group

Collaborators
• Grouping: Jianbo Shi, Serge Belongie, Thomas Leung
• Ecological statistics: Charless Fowlkes, David Martin, Xiaofeng Ren
• Recognition: Serge Belongie, Jan Puzicha, Greg Mori

Motivation
• Interaction with multimedia needs to be in terms that are significant to humans: objects, actions, events, …
• Feature-based CBIR systems (e.g. color histograms) are a failure for this purpose
• Presently, the best solutions use text / speech recognition / human annotation
• Goal: auto-annotation

From images/video to objects
[Figure: labeled sets such as "tiger", "grass", etc.]

Consistency
[Figure: a segmentation tree rooted at the image, with children BG, grass, bush, far, L-bird, R-bird; each bird node refines into beak, body, eye, head]
A perceptual organization forms a tree:
• A, C are refinements of B
• A, C are mutual refinements
• A, B, C represent the same percept
• Attention accounts for differences
Two segmentations are consistent when they can be explained by the same segmentation tree (i.e. they could be derived from a single perceptual organization).

What enables us to parse a scene?
– Low-level cues
• Color/texture
• Contours
• Motion
– Mid-level cues
• T-junctions
• Convexity
– High-level cues
• Familiar object
• Familiar motion

Outline
• Finding boundaries
• Recognizing objects
• Recognizing actions

Finding boundaries: Is texture a problem or a solution?
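Before moving on to boundary finding, the consistency criterion above (two segmentations explained by the same segmentation tree) can be operationalized as a nesting test: every pair of overlapping regions must be such that one contains the other. This is a minimal sketch under that interpretation; the function name and toy labelings are mine, not from the talk.

```python
# Consistency test sketch: two label maps are taken as consistent if every
# overlapping pair of regions is nested (one refines the other), so both
# could hang off a single segmentation tree. Hypothetical helper name.
import numpy as np

def consistent(seg_a, seg_b):
    """True if every overlapping region pair of seg_a, seg_b is nested."""
    for a in np.unique(seg_a):
        mask_a = seg_a == a
        for b in np.unique(seg_b[mask_a]):     # regions of B overlapping a
            mask_b = seg_b == b
            overlap = np.logical_and(mask_a, mask_b).sum()
            # nested iff the overlap exhausts one of the two regions
            if overlap not in (mask_a.sum(), mask_b.sum()):
                return False
    return True

# B splits A's single region in two: a refinement, hence consistent.
seg_a = np.array([[0, 0], [0, 0]])
seg_b = np.array([[0, 1], [0, 1]])
print(consistent(seg_a, seg_b))   # True

# Overlapping but non-nested regions: inconsistent.
seg_c = np.array([[0, 0], [1, 1]])
print(consistent(seg_b, seg_c))   # False
```

Note the test is symmetric in the refinement direction: either segmentation may be the finer one, matching the "mutual refinements" bullet above.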
[Figure: image and its orientation energy]

Statistically optimal contour detection
• Use humans to segment a large collection of natural images.
• Train a classifier for the contour/non-contour classification using orientation energy and texture gradient as features.

Orientation Energy
• Gaussian 2nd derivative f_even and its Hilbert pair f_odd
• OE = (I ∗ f_odd)² + (I ∗ f_even)²
• Can detect combinations of bar and edge features; also insensitive to linear shading [Perona & Malik 90]
• Multiple scales

Texture gradient
• Chi-square distance between texton histograms in half-disks across the edge:
  χ²(h_i, h_j) = (1/2) Σ_{m=1..K} [h_i(m) − h_j(m)]² / [h_i(m) + h_j(m)]

[Figure: ROC curve for local boundary detection]

Outline
• Finding boundaries
• Recognizing objects
• Recognizing actions

Biological Shape
• D'Arcy Thompson: On Growth and Form, 1917
– studied transformations between shapes of organisms

Deformable Templates: Related Work
• Fischler & Elschlager (1973)
• Grenander et al. (1991)
• von der Malsburg (1993)

Matching Framework
[Figure: model and target point sets]
• Find correspondences between points on shape
• Fast pruning
• Estimate transformation & measure similarity

Comparing Pointsets
[Figure]

Shape Context
Count the number of points inside each bin, e.g.: Count = 4,
Count = 10
• Compact representation of the distribution of points relative to each point

Shape Context
[Figure]

Shape Contexts
• Invariant under translation and scale
• Can be made invariant to rotation by using a local tangent orientation frame
• Tolerant to small affine distortion
– Log-polar bins make spatial blur proportional to r
• Cf. spin images (Johnson & Hebert) for range-image registration

Comparing Shape Contexts
• Compute matching costs C_ij using the chi-square distance
• Recover correspondences by solving the linear assignment problem with costs C_ij [Jonker & Volgenant 1987]

Matching Framework
• Find correspondences between points on shape
• Fast pruning
• Estimate transformation & measure similarity

Fast pruning
• Find the best match for the shape context at only a few (r) random query points and add up the costs:
  dist(S_query, S_i) = Σ_{j=1..r} χ²(SC_query^j, SC_i^{u*}), where u* = argmin_u χ²(SC_query^j, SC_i^u)

Snodgrass Results
[Figure]

Results
[Figure]
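The shape-context pipeline above (log-polar histograms per point, chi-square matching costs, correspondences via linear assignment) can be sketched as follows. Bin counts and point sizes are my assumptions, and scipy's Hungarian-style solver stands in for the Jonker & Volgenant code cited on the slide.

```python
# Shape-context matching sketch: log-polar histograms, chi-square costs,
# and a linear assignment. A translated copy matches with zero cost,
# illustrating the translation invariance claimed above.
import numpy as np
from scipy.optimize import linear_sum_assignment

def shape_contexts(pts, n_r=5, n_theta=12):
    """One log-polar histogram of relative point positions per point."""
    d = pts[:, None, :] - pts[None, :, :]           # pairwise offsets
    r = np.linalg.norm(d, axis=2)
    r_norm = r / (r.mean() + 1e-9)                  # scale invariance
    theta = np.arctan2(d[..., 1], d[..., 0])
    r_edges = np.logspace(-1.0, 0.5, n_r + 1)       # log-spaced radii
    r_bin = np.clip(np.searchsorted(r_edges, r_norm) - 1, 0, n_r - 1)
    t_bin = ((theta + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    n = len(pts)
    H = np.zeros((n, n_r * n_theta))
    for i in range(n):
        for j in range(n):
            if i != j:
                H[i, r_bin[i, j] * n_theta + t_bin[i, j]] += 1
    return H / H.sum(axis=1, keepdims=True)

def chi2_costs(H1, H2):
    """Chi-square distance between every pair of histograms."""
    num = (H1[:, None, :] - H2[None, :, :]) ** 2
    den = H1[:, None, :] + H2[None, :, :] + 1e-9
    return 0.5 * (num / den).sum(axis=2)

pts = np.random.RandomState(0).rand(30, 2)
shifted = pts + [5.0, -2.0]                          # translated copy
C = chi2_costs(shape_contexts(pts), shape_contexts(shifted))
rows, cols = linear_sum_assignment(C)                # recover correspondences
print(C[rows, cols].sum())                           # 0.0: perfect match
```

Rotation invariance would require measuring theta relative to the local tangent frame, as the slide notes; this sketch uses a fixed frame.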
Matching Framework
• Find correspondences between points on shape
• Fast pruning
• Estimate transformation & measure similarity

Thin Plate Spline Model
• 2D counterpart of the cubic spline
• Minimizes the bending energy
• Solve by inverting a linear system
• Can be regularized when data is inexact
• Duchon (1977), Meinguet (1979), Wahba (1991)

Matching Example
[Figure: model and target]

Outlier Test Example
[Figure]

Synthetic Test Results
[Figure: fish with deformation + noise, and fish with deformation + outliers; ICP vs. Shape Context vs. Chui & Rangarajan]

Terms in Similarity Score
• Shape context difference
• Local image appearance difference
– orientation
– gray-level correlation in a Gaussian window
– … (many more possible)
• Bending energy

Object Recognition Experiments
• Handwritten digits
• COIL 3D objects (Nayar-Murase)
• Human body configurations
• Trademarks

Handwritten Digit Recognition
• MNIST 60 000:
– linear: 12.0%
– 40 PCA + quad: 3.3%
– 1000 RBF + linear: 3.6%
– K-NN: 5%
– K-NN (deskewed): 2.4%
– K-NN (tangent dist.): 1.1%
– SVM: 1.1%
– LeNet 5: 0.95%
• MNIST 600 000 (distortions):
– LeNet 5: 0.8%
– SVM: 0.8%
– Boosted LeNet 4: 0.7%
• MNIST 20 000:
– K-NN, shape context matching: 0.63%

Results: Digit Recognition
1-NN classifier using: shape context + 0.3 × bending + 1.6 × image appearance

COIL Object Database
[Figure]

Error vs.
Number of Views
[Figure: error vs. number of views]

Prototypes Selected for 2 Categories
Details in Belongie, Malik & Puzicha (NIPS 2000)

Editing: K-medoids
• Input: similarity matrix
• Select: K prototypes
• Minimize: mean distance to nearest prototype
• Algorithm:
– iterative
– split the cluster with the most errors
• Result: adaptive distribution of resources (cf. aspect graphs)

Error vs. Number of Views
[Figure]

Human body configurations
[Figure]

Deformable Matching
• Kinematic-chain-based deformation model
• Use iterations of correspondence and deformation
• Keypoints on exemplars are deformed to locations on the query image

Results
[Figure]

Trademark Similarity
[Figure]

Recognizing objects in scenes
[Figure]

Shape matching using multi-scale scanning
• Shape context computation (10 Mops)
– scales × key-points × contour-points (10 × 100 × 10 000)
• Multi-scale coarse matching (100 Gops)
– scales × objects × views × samples × key-points × dim-sc (10 × 1000 × 10 × 100 × 100 × 100)
• Deform into alignment (1 Gops)
– image-objects × shortlist × (samples)² × dim-sc (10 × 100 × 10 000 × 100)

Shape matching using grouping
• Complexity-determining step: find approximate nearest neighbors of 10^2 query points in a set of 10^6 stored points in the 100-dimensional space of shape contexts.
• The naïve bound of 10^9 operations can be much improved using ideas from theoretical CS (Johnson-Lindenstrauss, Indyk-Motwani, etc.)

Putting grouping/segmentation on a sound foundation
• Construct a dataset of human-segmented images
• Measure the conditional probability distributions of various Gestalt grouping factors
• Incorporate these in an inference algorithm

Outline
• Finding boundaries
• Recognizing objects
• Recognizing actions

Examples of Actions
• Movement and posture change
– run, walk, crawl, jump, hop, swim, skate, sit, stand, kneel, lie, dance (various), …
• Object manipulation
– pick, carry, hold, lift, throw, catch, push, pull, write, type, touch, hit, press, stroke, shake, stir, turn, eat, drink, cut, stab, kick, point, drive, bike, insert, extract, juggle, play musical instrument (various), …
• Conversational gesture
– point, …
• Sign language

Activities and Situation Classification
• Example: withdrawing money from an ATM
• Activities are constructed by composing actions. Partial-order plans may be a good model.
• Activities may involve multiple agents
• Detecting unusual situations or activity patterns is facilitated by the video activity transform

On the difficulty of action detection
• Number of categories
• Intra-category variation
– clothing
– illumination
– person performing the action
• Inter-category variation

Objects in space | Actions in spacetime
• Segment / region-of-interest | Segment / volume-of-interest
• Features (points, curves, wavelet coefficients, …) | Features (points, curves, wavelets, motion vectors, …)
• Correspondence and deform into alignment | Correspondence and deform into alignment
• Recover parameters of generative model | Recover parameters of generative model
• Discriminative classifier | Discriminative classifier

Key cues for action recognition
• "Morpho-kinesics" of the action (shape and movement of the body)
• Identity of the object(s)
• Activity context

Image/Video → Stick figure → Action
• Stick figures can be specified in a variety of ways or at various resolutions (degrees of freedom)
– 2D joint positions
– 3D joint positions
– Joint angles
• Complete representation
• Evidence that it is effectively computable

Tracking by Repeated Finding
[Figure]

Achievable goals in 3 years
• Reasonable competence at object recognition at a crude category level (~1000 categories)
• Detection/tracking of humans as kinematic chains, assuming adequate resolution
• Recognition of ~10-100 actions and compositions thereof
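The Johnson-Lindenstrauss speedup mentioned under "Shape matching using grouping" can be illustrated with a random Gaussian projection: pairwise distances in the 100-dimensional shape-context space are approximately preserved after projecting to a much lower dimension, making each nearest-neighbor comparison cheaper. A toy sketch; the database size, target dimension, and data here are small stand-ins, not the 10^6-point setting on the slide.

```python
# Johnson-Lindenstrauss sketch: nearest-neighbor search on randomly
# projected 100-D descriptors. The neighbor found in the low-dimensional
# space is close to optimal in the original space.
import numpy as np

rng = np.random.RandomState(1)
stored = rng.rand(2000, 100)          # stand-in descriptor database
query = rng.rand(100)

k = 20                                # target dimension (assumption)
P = rng.randn(100, k) / np.sqrt(k)    # random projection, JL-style scaling

def nearest(db, q):
    """Index of the database row closest to q in Euclidean distance."""
    return int(np.argmin(((db - q) ** 2).sum(axis=1)))

exact = nearest(stored, query)              # brute force in 100-D
approx = nearest(stored @ P, query @ P)     # brute force in 20-D

d_exact = np.linalg.norm(stored[exact] - query)
d_approx = np.linalg.norm(stored[approx] - query)
print(d_approx / d_exact)                   # near 1: almost-optimal neighbor
```

In practice this projection step is combined with hashing or tree structures (as in the Indyk-Motwani line of work) so that the low-dimensional search itself is sublinear, rather than the brute-force scan used in this sketch.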