Recognizing objects and actions in
images and video
Jitendra Malik
U.C. Berkeley
University of California
Berkeley
Computer Vision Group
Collaborators
• Grouping: Jianbo Shi, Serge Belongie, Thomas
Leung
• Ecological Statistics: Charless Fowlkes, David
Martin, Xiaofeng Ren
• Recognition: Serge Belongie, Jan Puzicha,
Greg Mori
Motivation
• Interaction with multimedia needs to be in terms that are significant to humans: objects, actions, events, …
• Feature-based CBIR systems (e.g. color histograms) are a failure for this purpose
• Presently, the best solutions use text/speech recognition or human annotation
• Goal: auto-annotation
From images/video to objects
Labeled sets: tiger, grass, etc.
Consistency

Perceptual organization forms a tree.

(figure: a segmentation tree with root Image; children BG, L-bird, R-bird; BG → grass, bush, far; each bird → body, head; head → beak, eye; three segmentations A, B, C of this scene)

• A, C are refinements of B
• A, C are mutual refinements
• A, B, C represent the same percept
• Attention accounts for differences

Two segmentations are consistent when they can be explained by the same segmentation tree (i.e. they could be derived from a single perceptual organization).
What enables us to parse a scene?
– Low-level cues
• Color/texture
• Contours
• Motion
– Mid-level cues
• T-junctions
• Convexity
– High-level cues
• Familiar object
• Familiar motion
Outline
• Finding boundaries
• Recognizing objects
• Recognizing actions
Finding boundaries:
Is texture a problem or a solution?
(figure: an image and its orientation energy)
Statistically optimal contour detection
• Use humans to segment a large collection of
natural images.
• Train a classifier for the contour/non-contour
classification using orientation energy and
texture gradient as features.
Orientation Energy
• Gaussian 2nd derivative and its Hilbert pair
• OE  ( I  f odd ) 2  ( I  f even ) 2
• Can detect combination of bar and edge features; also
insensitive to linear shading [Perona&Malik 90]
• Multiple scales
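A rough sketch of this quadrature-pair computation (NumPy/SciPy; a single horizontal orientation and an illustrative σ and filter support, not the talk's exact filter bank):

```python
import numpy as np
from scipy.ndimage import convolve
from scipy.signal import hilbert

def oriented_energy(image, sigma=2.0):
    """OE = (I * f_even)^2 + (I * f_odd)^2 at one orientation, where
    f_even is a 1D Gaussian second derivative (applied across columns)
    and f_odd is its Hilbert pair; Gaussian smoothing along rows."""
    size = int(6 * sigma) | 1                      # odd filter support
    x = np.arange(size) - size // 2
    g = np.exp(-x**2 / (2 * sigma**2))
    f_even = (x**2 / sigma**4 - 1 / sigma**2) * g  # 2nd deriv of Gaussian
    f_odd = np.imag(hilbert(f_even))               # Hilbert pair (quadrature)
    g_smooth = g / g.sum()
    even = convolve(convolve(image, f_even[None, :]), g_smooth[:, None])
    odd = convolve(convolve(image, f_odd[None, :]), g_smooth[:, None])
    return even**2 + odd**2
```

Because the two filters form a quadrature pair, the energy responds to both step edges and bars regardless of local phase.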
Texture gradient = Chi square distance between
texton histograms in half disks across edge
χ²(h_i, h_j) = (1/2) Σ_{m=1..K} [h_i(m) − h_j(m)]² / [h_i(m) + h_j(m)]

(figure: texton histograms for patches i, j, k; example χ² distances 0.1 and 0.8)
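A minimal sketch of this histogram comparison (NumPy; skipping empty bins to avoid 0/0 is an implementation choice):

```python
import numpy as np

def chi_square(h_i, h_j):
    """Chi-square distance between two histograms:
    0.5 * sum_m (h_i[m] - h_j[m])^2 / (h_i[m] + h_j[m])."""
    h_i = np.asarray(h_i, dtype=float)
    h_j = np.asarray(h_j, dtype=float)
    denom = h_i + h_j
    mask = denom > 0                  # empty bins contribute zero
    return 0.5 * np.sum((h_i[mask] - h_j[mask]) ** 2 / denom[mask])
```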
ROC curve for local boundary detection
Outline
• Finding boundaries
• Recognizing objects
• Recognizing actions
Biological Shape
• D’Arcy Thompson: On Growth and Form, 1917
– studied transformations between shapes of organisms
Deformable Templates: Related Work
• Fischler & Elschlager (1973)
• Grenander et al. (1991)
• von der Malsburg (1993)
Matching Framework
(figure: model and target)
• Find correspondences between points on shape
• Fast pruning
• Estimate transformation & measure similarity
Comparing Pointsets
Shape Context
Count the number of points inside each log-polar bin, e.g. Count = 4, …, Count = 10
⇒ Compact representation of the distribution of points relative to each point
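A small illustrative implementation (NumPy; the 5 × 12 bin layout and the radial range [0.125, 2] after mean-distance normalization are common choices, not necessarily the talk's exact parameters):

```python
import numpy as np

def shape_contexts(points, n_r=5, n_theta=12):
    """For each point, a log-polar histogram of the positions of all
    other points relative to it."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[None, :, :] - points[:, None, :]   # pairwise offsets
    r = np.hypot(diff[..., 0], diff[..., 1])
    theta = np.arctan2(diff[..., 1], diff[..., 0]) % (2 * np.pi)
    r = r / r[r > 0].mean()                          # scale invariance
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    descriptors = np.zeros((n, n_r, n_theta))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            rb = np.searchsorted(r_edges, r[i, j]) - 1
            if 0 <= rb < n_r:                        # ignore points outside range
                tb = int(theta[i, j] / (2 * np.pi) * n_theta) % n_theta
                descriptors[i, rb, tb] += 1
    return descriptors.reshape(n, -1)
```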
Shape Contexts
• Invariant under translation and scale
• Can be made invariant to rotation by using
local tangent orientation frame
• Tolerant to small affine distortion
– Log-polar bins make spatial blur proportional to r
Cf. Spin Images (Johnson & Hebert) - range image registration
Comparing Shape Contexts
Compute matching costs C_ij using the chi-square distance.
Recover correspondences by solving the linear assignment problem with costs C_ij [Jonker & Volgenant 1987].
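One way to implement this step (SciPy's `linear_sum_assignment` is a modern stand-in for the Jonker-Volgenant solver; the cost matrix is the chi-square distance between shape-context histograms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_shape_contexts(sc_a, sc_b):
    """Pairwise chi-square costs between two sets of shape-context
    histograms, then an optimal one-to-one assignment."""
    a = np.asarray(sc_a, dtype=float)[:, None, :]
    b = np.asarray(sc_b, dtype=float)[None, :, :]
    denom = a + b
    denom[denom == 0] = 1.0                 # empty bins contribute zero
    cost = 0.5 * np.sum((a - b) ** 2 / denom, axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return rows, cols, cost[rows, cols].sum()
```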
Fast pruning

• Find the best match for the shape context at only a few random points and add up the cost:

dist(S_query, S_i) = Σ_{j=1..r} χ²(SC_query^j, SC_i*),  where SC_i* = argmin_u χ²(SC_query^j, SC_i^u)
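A sketch of this pruning score (NumPy; sampling `r` random query points follows the slide, the rest is illustrative):

```python
import numpy as np

def chi2_to_all(q, stored):
    """Chi-square distance from one histogram to each row of `stored`."""
    denom = q[None, :] + stored
    denom[denom == 0] = 1.0
    return 0.5 * np.sum((q[None, :] - stored) ** 2 / denom, axis=1)

def pruning_distance(query_scs, stored_scs, r=3, seed=0):
    """dist(S_query, S_i): for r random query shape contexts, add the
    cost of each one's best-matching shape context on the stored shape."""
    rng = np.random.default_rng(seed)
    picks = rng.choice(len(query_scs), size=r, replace=False)
    return sum(chi2_to_all(query_scs[j], stored_scs).min() for j in picks)
```

Because only a few shape contexts are compared per stored shape, this gives a cheap score for building a shortlist before full matching.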
Snodgrass Results
Results
Thin Plate Spline Model
• 2D counterpart to the cubic spline
• Minimizes the bending energy: ∫∫ (f_xx² + 2 f_xy² + f_yy²) dx dy
• Solve by inverting linear system
• Can be regularized when data is inexact
Duchon (1977), Meinguet (1979), Wahba (1991)
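One way to fit such a warp in practice (SciPy's `RBFInterpolator` supports a thin-plate-spline kernel and a `smoothing` term for regularized, inexact data; the landmark pairs below are toy data, not from the talk):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# matched landmarks: model points and their target positions (toy data)
model = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [.5, .5]])
target = model @ np.array([[1.0, 0.1], [-0.1, 1.0]]) + [0.2, 0.0]

# one thin-plate spline per output coordinate; smoothing=0 interpolates
# exactly, smoothing > 0 regularizes noisy correspondences
tps = RBFInterpolator(model, target, kernel='thin_plate_spline', smoothing=0.0)
warped = tps(model)
```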
Matching Example
(figure: model and target)
Outlier Test Example
Synthetic Test Results
(figure: ICP vs. Chui & Rangarajan vs. shape context matching on fish with deformation + noise and deformation + outliers)
Terms in Similarity Score
• Shape Context difference
• Local Image appearance difference
– orientation
– gray-level correlation in Gaussian window
– … (many more possible)
• Bending energy
Object Recognition Experiments
• Handwritten digits
• COIL 3D objects (Nayar-Murase)
• Human body configurations
• Trademarks
Handwritten Digit Recognition
• MNIST 60,000:
– linear: 12.0%
– 40 PCA + quad: 3.3%
– 1000 RBF + linear: 3.6%
– K-NN: 5%
– K-NN (deskewed): 2.4%
– K-NN (tangent dist.): 1.1%
– SVM: 1.1%
– LeNet 5: 0.95%
• MNIST 600,000 (distortions):
– LeNet 5: 0.8%
– SVM: 0.8%
– Boosted LeNet 4: 0.7%
• MNIST 20,000:
– K-NN, shape context matching: 0.63%
Results: Digit Recognition
1-NN classifier using:
Shape context + 0.3 * bending + 1.6 * image appearance
COIL Object Database
Error vs. Number of Views
Prototypes Selected for 2 Categories
Details in Belongie, Malik & Puzicha (NIPS2000)
Editing: K-medoids
• Input: similarity matrix
• Select: K prototypes
• Minimize: mean distance to nearest prototype
• Algorithm:
– iterative
– split cluster with most errors
• Result: Adaptive distribution of resources (cf. aspect graphs)
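A plain alternating K-medoids sketch (NumPy, on a precomputed distance matrix). The slide's variant grows K by splitting the cluster with the most errors; this simplified version uses a fixed K and random initialization:

```python
import numpy as np

def k_medoids(dist, k, n_iter=20, seed=0):
    """Basic K-medoids: assign each item to its nearest prototype,
    then re-pick each cluster's medoid; repeat until stable."""
    rng = np.random.default_rng(seed)
    n = len(dist)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)   # nearest prototype
        new = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members):
                # medoid = member minimizing total distance to its cluster
                within = dist[np.ix_(members, members)].sum(axis=1)
                new[c] = members[np.argmin(within)]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids, np.argmin(dist[:, medoids], axis=1)
```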
Error vs. Number of Views
Human body configurations
Deformable Matching
• Kinematic chain-based
deformation model
• Use iterations of
correspondence and
deformation
• Keypoints on exemplars
are deformed to locations
on query image
Results
Trademark Similarity
Recognizing objects in scenes
Shape matching using multi-scale
scanning
• Shape context computation (10 Mops)
– Scales * key-points * contour-points (10*100*10000)
• Multi scale coarse matching (100 Gops)
– Scales * objects * views * samples * key-points* dim-sc
(10*1000*10*100*100*100)
• Deform into alignment (1 Gops)
– Image-objects * shortlist * (samples)^2 *dim-sc
(10*100*10000*100)
Shape matching using grouping
• Complexity determining step: find approx.
nearest neighbors of 10^2 query points in a set
of 10^6 stored points in the 100 dimensional
space of shape contexts.
• The naïve bound of 10^9 operations can be much improved using ideas from theoretical CS (Johnson-Lindenstrauss, Indyk-Motwani, etc.)
Putting grouping/segmentation on a
sound foundation
• Construct a dataset of human segmented
images
• Measure the conditional probability distribution
of various Gestalt grouping factors
• Incorporate these in an inference algorithm
Outline
• Finding boundaries
• Recognizing objects
• Recognizing actions
Examples of Actions
• Movement and posture change
– run, walk, crawl, jump, hop, swim, skate, sit, stand, kneel, lie, dance
(various), …
• Object manipulation
– pick, carry, hold, lift, throw, catch, push, pull, write, type, touch, hit,
press, stroke, shake, stir, turn, eat, drink, cut, stab, kick, point, drive,
bike, insert, extract, juggle, play musical instrument (various)…
• Conversational gesture
– point, …
• Sign Language
Activities and Situation Classification
• Example: Withdrawing money from an ATM
• Activities constructed by composing actions.
Partial order plans may be a good model.
• Activities may involve multiple agents
• Detecting unusual situations or activity patterns is facilitated by the video → activity transform
On the difficulty of action detection
• Number of categories
• Intra-Category variation
– Clothing
– Illumination
– Person performing the action
• Inter-Category variation
Objects in space vs. actions in spacetime

Objects in space:
• Segment / region-of-interest
• Features (points, curves, wavelet coefficients, …)
• Correspondence and deform into alignment
• Recover parameters of generative model
• Discriminative classifier

Actions in spacetime:
• Segment / volume-of-interest
• Features (points, curves, wavelets, motion vectors, …)
• Correspondence and deform into alignment
• Recover parameters of generative model
• Discriminative classifier
Key cues for action recognition
• “Morpho-kinesics” of action (shape and
movement of the body)
• Identity of the object(s)
• Activity context
Image/Video  Stick figure  Action
• Stick figures can be specified in a variety of
ways or at various resolutions (deg of freedom)
– 2D joint positions
– 3D joint positions
– Joint angles
• Complete representation
• Evidence that it is effectively computable
Tracking by Repeated Finding
Achievable goals in 3 years
• Reasonable competence at object recognition at
crude category level (~1000)
• Detection/Tracking of humans as kinematic
chains, assuming adequate resolution.
• Recognition of ~10-100 actions and
compositions thereof.