Recognizing Action at a Distance
A.A. Efros, A.C. Berg, G. Mori, J. Malik
UC Berkeley
Computer Vision Group
ICCV 2003
Looking at People
Near field
• 300-pixel man
• Limb tracking
– e.g. Yacoob & Black, Rao & Shah, etc.
Far field
• 3-pixel man
• Blob tracking
– vast surveillance literature
Medium-field Recognition
The 30-Pixel Man
Appearance vs. Motion
Jackson Pollock
Number 21 (detail)
Goals
• Recognize human actions at a distance
– Low resolution, noisy data
– Moving camera, occlusions
– Wide range of actions (including non-periodic)
Our Approach
• Motion-based approach
– Non-parametric; use large amount of data
– Classify a novel motion by finding the most similar motion from the training set
• Related Work
– Periodicity analysis
• Polana & Nelson; Seitz & Dyer; Bobick et al.; Cutler & Davis; Collins et al.
– Model-free
• Temporal Templates [Bobick & Davis]
• Orientation histograms [Freeman et al.; Zelnik & Irani]
• Using MoCap data [Zhao & Nevatia, Ramanan & Forsyth]
Gathering action data
• Tracking
– Simple correlation-based tracker
– User-initialized
Figure-centric Representation
• Stabilized spatio-temporal volume
– No translation information
– All motion caused by person’s limbs
• Good news: indifferent to camera motion
• Bad news: hard!
• Good test to see if actions, not just translation, are being captured
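The stabilized volume described on this slide can be sketched as a fixed-size crop around the tracked figure in every frame. This is a minimal sketch, not the authors' code: the function name `figure_centric_volume`, the `centers` list of tracked (row, col) positions, and the `half_size` window parameter are all assumptions made for illustration.

```python
import numpy as np

def figure_centric_volume(frames, centers, half_size):
    """Crop a (2*half_size+1)-pixel square around the tracked figure in each frame.

    frames:  array of shape (n_frames, height, width), grayscale (assumed).
    centers: one (row, col) integer track position per frame (hypothetical
             tracker output).
    """
    frames = np.asarray(frames)
    # Pad spatially so windows near the image border stay in bounds.
    padded = np.pad(
        frames,
        ((0, 0), (half_size, half_size), (half_size, half_size)),
        mode="edge",
    )
    size = 2 * half_size + 1
    volume = np.empty((len(frames), size, size), dtype=frames.dtype)
    for t, (row, col) in enumerate(centers):
        # After padding, original pixel (row, col) sits at
        # (row + half_size, col + half_size), so the window starts at (row, col).
        volume[t] = padded[t, row:row + size, col:col + size]
    return volume
```

Because every window is centered on the track, a figure that only translates produces identical frames in the volume, which is the sense in which translation information is removed.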
Remembrance of Things Past
• “Explain” novel motion sequence by matching to previously seen video clips
– For each frame, match based on some temporal extent
[Diagram: input sequence, motion analysis, database of clips labeled run, walk left, swing, walk right, jog]
Challenge: how to compare motions?
How to describe motion?
• Appearance
– Not preserved across different clothing
• Gradients (spatial, temporal)
– Same problem (e.g. contrast reversal)
• Edges/Silhouettes
– Too unreliable
• Optical flow
– Explicitly encodes motion
– Least affected by appearance
– …but too noisy
Spatial Motion Descriptor
Image frame → optical flow $F_{x,y}$ → components $F_x$, $F_y$ → half-wave rectified channels $F_x^+$, $F_x^-$, $F_y^+$, $F_y^-$ → blurred $F_x^+$, $F_x^-$, $F_y^+$, $F_y^-$
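The descriptor pipeline on this slide can be sketched as follows, assuming the four channels are the positive and negative parts of each optical-flow component, each blurred with a Gaussian. The function names, the `sigma` value, and the separable blur built from `np.convolve` are illustrative assumptions; the flow fields themselves are taken as given inputs.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def blur(channel, sigma):
    """Separable Gaussian blur (zero-padded borders).

    Assumes each image side is at least as long as the kernel.
    """
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, channel)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def spatial_motion_descriptor(flow_x, flow_y, sigma=1.5):
    """Half-wave rectify the flow into four non-negative channels, then blur."""
    channels = [
        np.maximum(flow_x, 0.0),   # F_x^+
        np.maximum(-flow_x, 0.0),  # F_x^-
        np.maximum(flow_y, 0.0),   # F_y^+
        np.maximum(-flow_y, 0.0),  # F_y^-
    ]
    return np.stack([blur(c, sigma) for c in channels])
```

Rectifying before blurring keeps opposite motions in separate channels, so nearby positive and negative flow vectors do not cancel when the channels are smoothed.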
Spatio-temporal Motion Descriptor
Temporal extent E
[Figure: Sequence A and Sequence B over time t are compared frame by frame to give the frame-to-frame similarity matrix; combining it with a blurry identity matrix I over the temporal extent gives the motion-to-motion similarity matrix]
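A minimal sketch of the two matrices named on this slide, under stated assumptions: frame-to-frame similarity is taken to be the inner product of two frames' descriptors, and motion-to-motion similarity sums those values along the diagonal over a temporal window. The function names, `half_extent`, and `weights` are hypothetical. For simplicity the sketch uses a plain (optionally weighted) identity kernel and leaves out the off-diagonal spread of the "blurry I".

```python
import numpy as np

def frame_to_frame_similarity(desc_a, desc_b):
    """Inner product between every frame descriptor of A and of B.

    desc_a, desc_b: arrays of shape (n_frames, channels, height, width).
    Returns an (n_a, n_b) matrix.
    """
    a = desc_a.reshape(len(desc_a), -1)
    b = desc_b.reshape(len(desc_b), -1)
    return a @ b.T

def motion_to_motion_similarity(s_ff, half_extent, weights=None):
    """Sum frame-to-frame similarities along the diagonal.

    s_mm[i, j] = sum over t in [-half_extent, half_extent] of
                 weights[t] * s_ff[i + t, j + t], with zeros outside the matrix.
    """
    if weights is None:
        weights = np.ones(2 * half_extent + 1)
    n_a, n_b = s_ff.shape
    padded = np.pad(s_ff, half_extent, mode="constant")
    s_mm = np.zeros((n_a, n_b), dtype=float)
    for w, t in zip(weights, range(-half_extent, half_extent + 1)):
        start = half_extent + t
        # padded[start + i, start + j] equals s_ff[i + t, j + t]
        s_mm += w * padded[start:start + n_a, start:start + n_b]
    return s_mm
```

With `half_extent = 6` the window spans 13 frames; passing a tapered `weights` vector would down-weight frames far from the center of the window.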
Football Actions: matching
[Figure: input sequence and matched frames]
Football Actions: classification
10 actions; 4500 total frames; 13-frame motion descriptor
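The classification step follows the approach stated earlier in the deck: a novel motion is labeled by finding the most similar motion in the training set. Below is a minimal sketch, assuming a precomputed motion-to-motion similarity matrix between query and training frames. The name `classify_frames` and the `k` parameter are illustrative. `k=1` corresponds to taking the single best match, and larger `k` gives a simple majority vote.

```python
import numpy as np
from collections import Counter

def classify_frames(s_mm, train_labels, k=1):
    """Label each query frame by a vote among its k most similar training frames.

    s_mm:         (n_query, n_train) similarity matrix, higher means more similar.
    train_labels: one action label per training frame.
    """
    predictions = []
    for row in s_mm:
        nearest = np.argsort(row)[::-1][:k]  # indices of the k largest similarities
        votes = Counter(train_labels[j] for j in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```

For example, with `train_labels = ["run", "walk left", "jog"]`, a query row whose largest similarity is in the third column is labeled `"jog"`.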
Classifying Ballet Actions
16 actions; 24800 total frames; 51-frame motion descriptor
Sequences of male dancers used as training data to classify female dancers, and vice versa.
Classifying Tennis Actions
6 actions; 4600 frames; 7-frame motion descriptor
Woman player used for training, man for testing.
Classifying Tennis
• Red bars show classification results
Querying the Database
[Diagram: input sequence matched against database clips labeled run, walk left, swing, walk right, jog]
Action Recognition:
Joint Positions:
2D Skeleton Transfer
• We annotate database with 2D joint positions
• After matching, transfer data to novel sequence
– Adjust the match for best fit
Input sequence:
Transferred 2D skeletons:
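The transfer step can be sketched as copying the annotated joints of the best-matching database frame onto each query frame. This is a sketch under assumptions, not the authors' procedure: joints are assumed to be stored as (row, col) positions inside a figure-centric window, `query_centers` and `half_size` are hypothetical tracker outputs, and the "adjust for best fit" refinement is not modeled.

```python
import numpy as np

def transfer_skeletons(s_mm, db_joints, query_centers, half_size):
    """Copy 2D joints from the best-matching database frame to each query frame.

    s_mm:          (n_query, n_db) motion-to-motion similarity matrix.
    db_joints:     (n_db, n_joints, 2) joint positions in window coordinates.
    query_centers: (n_query, 2) tracked figure centers in image coordinates.
    Returns joints in image coordinates and the index of each matched frame.
    """
    best = np.argmax(s_mm, axis=1)
    window_joints = np.asarray(db_joints)[best]
    # Top-left corner of each query window in image coordinates.
    top_left = np.asarray(query_centers) - half_size
    return window_joints + top_left[:, None, :], best
```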
3D Skeleton Transfer
• We populate database with rendered stick figures from 3D Motion Capture data
• Matching as before, we get 3D joint positions (kind of)!
Input sequence:
Transferred 3D skeletons:
“Do as I Do” Motion Synthesis
[Diagram: input sequence, synthetic sequence]
• Matching two things:
– Motion similarity across sequences
– Appearance similarity within sequence (like VideoTextures)
• Dynamic Programming
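The two matching criteria above can be combined in a Viterbi-style dynamic program that picks one source frame per target frame. This is a sketch under assumptions: `motion_sim[t, j]` scores how well source frame `j` matches the motion at target frame `t`, `transition_score[i, j]` scores how natural it looks to show source frame `j` right after source frame `i`, and `smooth_weight` is a hypothetical knob that balances the two terms.

```python
import numpy as np

def synthesize_path(motion_sim, transition_score, smooth_weight=1.0):
    """Choose one source frame per target frame, maximizing total score.

    motion_sim:       (n_target, n_source) motion similarity across sequences.
    transition_score: (n_source, n_source) appearance-based score for i -> j.
    Returns a list of source frame indices, one per target frame.
    """
    n_target, n_source = motion_sim.shape
    score = np.empty((n_target, n_source))
    back = np.zeros((n_target, n_source), dtype=int)
    score[0] = motion_sim[0]
    for t in range(1, n_target):
        # candidate[i, j]: best score ending at frame i, then moving i -> j
        candidate = score[t - 1][:, None] + smooth_weight * transition_score
        back[t] = np.argmax(candidate, axis=0)
        score[t] = candidate[back[t], np.arange(n_source)] + motion_sim[t]
    # Trace the best path backwards from the best final frame.
    path = [int(np.argmax(score[-1]))]
    for t in range(n_target - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Setting `smooth_weight` to 0 reduces this to picking the best motion match for each frame independently. Larger values favor runs of source frames that look continuous.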
“Do as I Do”
[Figure panels: Source Motion; Source Appearance (3400 frames); Result]
“Do as I Say” Synthesis
[Diagram: action labels (run, walk left, swing, walk right, jog) drive the synthetic sequence]
• Synthesize given action labels
– e.g. video game control
“Do as I Say”
• Red box shows when constraint is applied
Actor Replacement
SHOW VIDEO
(GregWorldCup.avi, DivX)
Conclusions
• In the medium field, action is about motion
• What we propose:
– A way of matching motions at coarse scale
• What we get out:
– Action recognition
– Skeleton transfer
– Synthesis: “Do as I Do” & “Do as I Say”
• What we learned:
– A lot to be said for the “little guy”!