ICCV03.ppt

Recognizing Action at a Distance
A.A. Efros, A.C. Berg, G. Mori, J. Malik
UC Berkeley
Looking at People
Near field
• 300-pixel man
• Limb tracking
– e.g. Yacoob & Black, Rao & Shah, etc.
Far field
• 3-pixel man
• Blob tracking
– vast surveillance literature
Medium-field Recognition
The 30-Pixel Man
Appearance vs. Motion
Jackson Pollock
Number 21 (detail)
Goals
• Recognize human actions at a distance
– Low resolution, noisy data
– Moving camera, occlusions
– Wide range of actions (including non-periodic)
Our Approach
• Motion-based approach
– Non-parametric; use large amount of data
– Classify a novel motion by finding the most similar motion from the training set
• Related Work
– Periodicity analysis
• Polana & Nelson; Seitz & Dyer; Bobick et al.; Cutler & Davis; Collins et al.
– Model-free
• Temporal Templates [Bobick & Davis]
• Orientation histograms [Freeman et al; Zelnik & Irani]
• Using MoCap data [Zhao & Nevatia, Ramanan & Forsyth]
Gathering action data
• Tracking
– Simple correlation-based tracker
– User-initialized
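A correlation-based tracker of this kind could be sketched as follows. This is only an illustration under assumptions, not the authors' tracker: it uses normalized cross-correlation against a fixed template taken from a user-supplied box in the first frame, and the names `ncc`, `track_step`, `track` and the `search_radius` parameter are hypothetical.

```python
import numpy as np

def ncc(patch, template):
    """Normalized cross-correlation between two equally sized grayscale patches."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return float((p * t).sum() / denom) if denom > 0 else 0.0

def track_step(frame, template, prev_top_left, search_radius=8):
    """Search a window around the previous position for the best-matching patch."""
    th, tw = template.shape
    height, width = frame.shape
    r0, c0 = prev_top_left
    best_score, best_pos = -np.inf, prev_top_left
    for r in range(max(0, r0 - search_radius), min(height - th, r0 + search_radius) + 1):
        for c in range(max(0, c0 - search_radius), min(width - tw, c0 + search_radius) + 1):
            score = ncc(frame[r:r + th, c:c + tw], template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

def track(frames, init_top_left, size, search_radius=8):
    """Track a user-initialized box of (height, width) = size through all frames."""
    h, w = size
    r, c = init_top_left
    # Assumption: the template is fixed from the first frame and never updated.
    template = frames[0][r:r + h, c:c + w].astype(float)
    positions = [init_top_left]
    for frame in frames[1:]:
        pos, _ = track_step(frame.astype(float), template, positions[-1], search_radius)
        positions.append(pos)
    return positions
```

The exhaustive local search is slow but easy to follow; a fixed template is the simplest choice and would drift less than a per-frame update, at the cost of adapting poorly to appearance change.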
Figure-centric Representation
• Stabilized spatio-temporal volume
– No translation information
– All motion caused by person’s limbs
• Good news: indifferent to camera motion
• Bad news: hard!
• Good test to see if actions, not just translation, are being captured
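One way to picture the stabilized volume is to crop a fixed-size window around the tracked figure in every frame and stack the crops. The sketch below assumes per-frame figure centers are already available from a tracker; the function name `figure_centric_volume`, the square window, and the edge padding are illustrative choices, not details from the talk.

```python
import numpy as np

def figure_centric_volume(frames, centers, half_size):
    """Stack square windows centered on the tracked figure, one per frame.

    frames: list of 2D grayscale arrays; centers: (row, col) per frame.
    The window side is 2 * half_size + 1 pixels.
    """
    win = 2 * half_size + 1
    volume = np.zeros((len(frames), win, win), dtype=float)
    for t, (frame, (r, c)) in enumerate(zip(frames, centers)):
        # Pad so that windows near the image border stay the same size.
        padded = np.pad(frame.astype(float), half_size, mode="edge")
        # Original pixel (r, c) now sits at (r + half_size, c + half_size),
        # so the window's top-left corner in padded coordinates is (r, c).
        volume[t] = padded[r:r + win, c:c + win]
    return volume
```

Because every window is re-centered on the figure, whole-body translation disappears from the volume and only the relative motion of the limbs remains.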
Remembrance of Things Past
• “Explain” novel motion sequence by matching to previously seen video clips
– For each frame, match based on some temporal extent
[Figure: input sequence; motion analysis; database clips labeled run, swing, walk left, jog, walk right]
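The matching idea on this slide amounts to a nearest-neighbor lookup: each novel frame takes the label of its most similar database frame. A minimal sketch, assuming a precomputed similarity matrix (rows are novel frames, columns are database frames); the function name `classify_by_nearest_match` is hypothetical.

```python
import numpy as np

def classify_by_nearest_match(similarity, db_labels):
    """Label each novel frame with the label of its best-matching database frame.

    similarity[i, j] is the similarity between novel frame i and database frame j.
    Returns the best database index per novel frame and the transferred labels.
    """
    best = np.argmax(similarity, axis=1)
    return best, [db_labels[j] for j in best]

# Usage with a toy 2 x 3 similarity matrix.
sim = np.array([[0.1, 0.9, 0.3],
                [0.7, 0.2, 0.4]])
indices, labels = classify_by_nearest_match(sim, ["run", "swing", "jog"])
```

A k-nearest-neighbor vote over several matches would be a natural, more robust variant of the same idea.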
Challenge: how to compare motions?
How to describe motion?
• Appearance
– Not preserved across different clothing
• Gradients (spatial, temporal)
– Same problem (e.g. contrast reversal)
• Edges/Silhouettes
– Too unreliable
• Optical flow
– Explicitly encodes motion
– Least affected by appearance
– …but too noisy
Spatial Motion Descriptor
[Figure: image frame; optical flow F_{x,y}; components F_x, F_y; half-wave rectified channels F_x^+, F_x^-, F_y^+, F_y^-; blurred F_x^+, F_x^-, F_y^+, F_y^-]
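The pipeline in this figure can be sketched in a few lines, assuming the optical flow components `fx` and `fy` have already been computed by some flow method (not shown here). The Gaussian blur, the `sigma` value, and all function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1D Gaussian kernel, truncated at about three standard deviations."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(channel, sigma):
    """Separable Gaussian blur with zero padding.

    Assumes the image is at least as large as the kernel in both dimensions.
    """
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, channel)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, rows)

def spatial_motion_descriptor(fx, fy, sigma=1.5):
    """Split flow into four non-negative channels, then blur each one."""
    channels = [
        np.maximum(fx, 0),   # F_x^+
        np.maximum(-fx, 0),  # F_x^-
        np.maximum(fy, 0),   # F_y^+
        np.maximum(-fy, 0),  # F_y^-
    ]
    return np.stack([blur(ch, sigma) for ch in channels])
```

Rectifying before blurring matters: blurring the signed flow directly would let opposite motions cancel, whereas the four non-negative channels keep them apart.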
Spatio-temporal Motion Descriptor
[Figure: Sequence A and Sequence B over time t, compared across temporal extent E. Panels: frame-to-frame similarity matrix S; I matrix; blurry I; motion-to-motion similarity matrix]
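The panels above suggest aggregating frame-to-frame similarities along the diagonal of the similarity matrix over a temporal window. A rough sketch under assumptions: per-frame descriptors are flattened into vectors, frame similarity is a plain dot product, the "blurry I" is modeled as a Gaussian falloff around the diagonal, and the temporal extent is odd. All names are hypothetical.

```python
import numpy as np

def frame_similarity(desc_a, desc_b):
    """Frame-to-frame similarity: desc_a is (n_a, d), desc_b is (n_b, d)."""
    return desc_a @ desc_b.T

def blurry_identity(extent, sigma=1.0):
    """extent x extent kernel whose weight is concentrated near the diagonal."""
    idx = np.arange(extent)
    offset = idx[:, None] - idx[None, :]
    k = np.exp(-(offset ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def motion_similarity(s_ff, kernel):
    """Correlate the frame-to-frame matrix with the kernel (zero padded).

    Entry (i, j) aggregates similarities of frames around i in one sequence
    and around j in the other; an odd kernel size is assumed.
    """
    e = kernel.shape[0]
    half = e // 2
    padded = np.pad(s_ff, half, mode="constant")
    out = np.zeros_like(s_ff, dtype=float)
    for i in range(s_ff.shape[0]):
        for j in range(s_ff.shape[1]):
            out[i, j] = (padded[i:i + e, j:j + e] * kernel).sum()
    return out
```

With a pure identity kernel, two motions match only if they run at exactly the same speed; spreading weight off the diagonal tolerates small differences in timing.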
Football Actions: matching
[Video: input sequence and matched frames]
Football Actions: classification
10 actions; 4500 total frames; 13-frame motion descriptor
Classifying Ballet Actions
16 Actions; 24800 total frames; 51-frame motion descriptor.
Men's sequences used to classify women's, and vice versa.
Classifying Tennis Actions
6 actions; 4600 frames; 7-frame motion descriptor
Woman player used for training, man for testing.
Classifying Tennis
• Red bars show classification results
Querying the Database
[Figure: input sequence and database clips (run, swing, walk left, jog, walk right). Retrieved for the input: Action Recognition labels; Joint Positions]
2D Skeleton Transfer
• We annotate database with 2D joint positions
• After matching, transfer data to novel sequence
– Adjust the match for best fit
Input sequence:
Transferred 2D skeletons:
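The transfer step can be pictured as a simple lookup once each novel frame has a matched database frame. The sketch below assumes joints are stored relative to the center of the figure-centric window and that the figure's position in each novel frame is known; it does not implement the "adjust for best fit" refinement. The function name and array layout are hypothetical.

```python
import numpy as np

def transfer_skeletons(match_index, db_joints, novel_centers):
    """Copy 2D joints from matched database frames onto novel frames.

    match_index: (n,) database frame index matched to each novel frame.
    db_joints: (m, J, 2) joint positions relative to the figure center.
    novel_centers: (n, 2) figure center in each novel frame.
    Returns (n, J, 2) joint positions in novel-frame coordinates.
    """
    return db_joints[match_index] + novel_centers[:, None, :]
```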
3D Skeleton Transfer
• We populate database with rendered stick figures from
3D Motion Capture data
• Matching as before, we get 3D joint positions (kind of)!
Input sequence:
Transferred 3D skeletons:
“Do as I Do” Motion Synthesis
input sequence
synthetic sequence
• Matching two things:
– Motion similarity across sequences
– Appearance similarity within sequence (like VideoTextures)
• Dynamic Programming
“Do as I Do”
Source Motion
Source Appearance
3400 Frames
Result
“Do as I Say” Synthesis
[Figure: sequence of action labels (run, walk left, swing, walk right, jog) and the resulting synthetic sequence]
• Synthesize given action labels
– e.g. video game control
“Do as I Say”
• Red box shows when constraint is applied
Actor Replacement
SHOW VIDEO
Conclusions
• In the medium field, action is about motion
• What we propose:
– A way of matching motions at coarse scale
• What we get out:
– Action recognition
– Skeleton transfer
– Synthesis: “Do as I Do” & “Do as I Say”
• What we learned:
– A lot to be said for the “little guy”!
Thank You
Smoothness for Synthesis
• $W_{act}$ is the action similarity between source and target
• $W_{app}$ is the appearance similarity within target frames
• For every source frame $i$, find the best target frame $\pi_i$ by maximizing the following cost function:

$$\sum_{i=1}^{n} \alpha_{act}\, W_{act}(i, \pi_i) \;+\; \sum_{i=2}^{n} \alpha_{app}\, W_{app}(\pi_i, \pi_{i-1} + 1)$$

• Optimize using dynamic programming
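A dynamic-programming pass over this kind of objective could look like the sketch below (a Viterbi-style recursion). It assumes the action term is an `(n, m)` array indexed by source and target frame, the appearance term is an `(m, m)` array over target frames, and, as an extra assumption, that a transition out of the last target frame (which has no successor to compare against) earns no appearance score. The function name `synthesize_path` and the default weights are hypothetical.

```python
import numpy as np

def synthesize_path(w_act, w_app, alpha_act=1.0, alpha_app=1.0):
    """Pick one target frame per source frame, maximizing action + smoothness.

    w_act[i, q]: action similarity of source frame i and target frame q.
    w_app[q, r]: appearance similarity of target frames q and r.
    Returns (path, best_score), with path[i] the chosen target frame.
    """
    n, m = w_act.shape
    # trans[p, q]: appearance score for following target frame p with q,
    # i.e. how much q looks like the natural successor p + 1.
    trans = np.zeros((m, m))
    trans[:-1, :] = alpha_app * w_app[:, 1:].T  # assumption: 0 when p is the last frame
    score = alpha_act * w_act[0].astype(float)
    back = np.zeros((n, m), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + trans            # cand[p, q]
        back[i] = np.argmax(cand, axis=0)        # best predecessor for each q
        score = cand[back[i], np.arange(m)] + alpha_act * w_act[i]
    # Trace the best path backwards from the best final frame.
    path = [int(np.argmax(score))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i][path[-1]]))
    return path[::-1], float(score.max())
```

The recursion costs on the order of n·m² operations, compared with mⁿ candidate assignments for exhaustive search.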
The Database Analogy
Conclusions
• Action is about motion
• Purely motion-based descriptor for actions
• We treat optical flow
– Not as measurement of pixel displacement
– But as a set of noisy features that are carefully
smoothed and aggregated
• Can handle very poor, noisy data
Cool Video, Attempt II
Comparing motion descriptors
[Figure: two sequences over time t. Panels: frame-to-frame similarity matrix S; I matrix; blurry I; motion-to-motion similarity matrix]