Recognizing Action at a Distance
A.A. Efros, A.C. Berg, G. Mori, J. Malik
UC Berkeley

Looking at People
Near field
• 300-pixel man
• Limb tracking
  – e.g. Yacoob & Black, Rao & Shah, etc.
Far field
• 3-pixel man
• Blob tracking
  – vast surveillance literature

Medium-field Recognition
The 30-Pixel Man

Appearance vs. Motion
[Image: Jackson Pollock, Number 21 (detail)]

Goals
• Recognize human actions at a distance
  – Low resolution, noisy data
  – Moving camera, occlusions
  – Wide range of actions (including non-periodic)

Our Approach
• Motion-based approach
  – Non-parametric; use large amount of data
  – Classify a novel motion by finding the most similar motion from the training set
• Related Work
  – Periodicity analysis
    • Polana & Nelson; Seitz & Dyer; Bobick et al; Cutler & Davis; Collins et al.
  – Model-free
    • Temporal Templates [Bobick & Davis]
    • Orientation histograms [Freeman et al; Zelnik & Irani]
    • Using MoCap data [Zhao & Nevatia, Ramanan & Forsyth]

Gathering action data
• Tracking
  – Simple correlation-based tracker
  – User-initialized

Figure-centric Representation
• Stabilized spatio-temporal volume
  – No translation information
  – All motion caused by person’s limbs
• Good news: indifferent to camera motion
• Bad news: hard!
• Good test to see if actions, not just translation, are being captured

Remembrance of Things Past
• “Explain” novel motion sequence by matching to previously seen video clips
  – For each frame, match based on some temporal extent
[Figure: input sequence; motion analysis; database (run, swing, walk left, jog, walk right)]
Challenge: how to compare motions?

How to describe motion?
• Appearance
  – Not preserved across different clothing
• Gradients (spatial, temporal)
  – same (e.g.
    contrast reversal)
• Edges/Silhouettes
  – Too unreliable
• Optical flow
  – Explicitly encodes motion
  – Least affected by appearance
  – …but too noisy

Spatial Motion Descriptor
[Figure: image frame; optical flow F_{x,y}; components F_x, F_y; channels F_x^+, F_x^-, F_y^+, F_y^-; blurred F_x^+, F_x^-, F_y^+, F_y^-]

Spatio-temporal Motion Descriptor
[Figure: Sequence A and Sequence B over time t, with temporal extent E; panels: frame-to-frame similarity matrix, I matrix, blurry I, motion-to-motion similarity matrix]

Football Actions: matching
[Video: input sequence and matched frames]

Football Actions: classification
10 actions; 4500 total frames; 13-frame motion descriptor

Classifying Ballet Actions
16 actions; 24800 total frames; 51-frame motion descriptor.
Men used to classify women and vice versa.

Classifying Tennis Actions
6 actions; 4600 frames; 7-frame motion descriptor
Woman player used as training, man as testing.

Classifying Tennis
• Red bars show classification results

Querying the Database
[Figure: input sequence; database (run, swing, walk left, jog, walk right); outputs: Action Recognition, Joint Positions]

2D Skeleton Transfer
• We annotate database with 2D joint positions
• After matching, transfer data to novel sequence
  – Adjust the match for best fit
Input sequence:
Transferred 2D skeletons:

3D Skeleton Transfer
• We populate database with rendered stick figures from 3D Motion Capture data
• Matching as before, we get 3D joint positions (kind of)!
Input sequence:
Transferred 3D skeletons:

“Do as I Do” Motion Synthesis
[Figure: input sequence; synthetic sequence]
• Matching two things:
  – Motion similarity across sequences
  – Appearance similarity within sequence (like VideoTextures)
• Dynamic Programming

“Do as I Do”
[Video: Source Motion; Source Appearance; 3400 Frames; Result]

“Do as I Say” Synthesis
[Figure: action labels (run, walk left, swing, walk right); database (run, swing, walk left, jog, walk right); synthetic sequence]
• Synthesize given action labels
  – e.g.
    video game control

“Do as I Say”
• Red box shows when constraint is applied

Actor Replacement
SHOW VIDEO

Conclusions
• In the medium field, action is about motion
• What we propose:
  – A way of matching motions at coarse scale
• What we get out:
  – Action recognition
  – Skeleton transfer
  – Synthesis: “Do as I Do” & “Do as I Say”
• What we learned:
  – A lot to be said for the “little guy”!

Thank You

Smoothness for Synthesis
• W_act is action similarity between source and target
• W_app is appearance similarity within target frames
• For every source frame i, find the best target frame π_i by maximizing the following cost function:
  $$\sum_{i=1}^{n} \alpha_{act} W_{act}(i, \pi_i) + \sum_{i=2}^{n} \alpha_{app} W_{app}(\pi_i, \pi_{i-1} + 1)$$
• Optimize using dynamic programming

The Database Analogy

Conclusions
• Action is about motion
• Purely motion-based descriptor for actions
• We treat optical flow
  – Not as measurement of pixel displacement
  – But as a set of noisy features that are carefully smoothed and aggregated
• Can handle very poor, noisy data

Cool Video, Attempt II

Comparing motion descriptors
[Figure: frame-to-frame similarity matrix over time t; I matrix; blurry I; motion-to-motion similarity matrix]
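
A minimal sketch of how a frame-to-frame similarity matrix could be turned into a motion-to-motion similarity matrix by aggregating it along a blurry identity kernel. This is illustrative only, not the authors' implementation: the function names, the one-hot toy descriptors, the dot-product frame similarity, and the Gaussian falloff controlled by `sigma` are all assumptions.

```python
import math


def frame_similarity(desc_a, desc_b):
    """Frame-to-frame similarity: dot product of every pair of frame descriptors."""
    return [[sum(x * y for x, y in zip(fa, fb)) for fb in desc_b]
            for fa in desc_a]


def blurry_identity(half_extent, sigma):
    """Square kernel of side 2*half_extent+1, strongest on the diagonal.

    Assumption: Gaussian falloff away from the diagonal.
    sigma=0 gives a plain identity ("I matrix").
    """
    size = 2 * half_extent + 1
    kernel = []
    for a in range(size):
        row = []
        for b in range(size):
            if sigma > 0:
                row.append(math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)))
            else:
                row.append(1.0 if a == b else 0.0)
        kernel.append(row)
    return kernel


def motion_similarity(frame_sim, kernel):
    """Correlate the frame-to-frame matrix with the kernel (zero padding at borders)."""
    h = len(kernel) // 2
    n_a, n_b = len(frame_sim), len(frame_sim[0])
    out = [[0.0] * n_b for _ in range(n_a)]
    for i in range(n_a):
        for j in range(n_b):
            total = 0.0
            for a in range(-h, h + 1):
                for b in range(-h, h + 1):
                    ii, jj = i + a, j + b
                    if 0 <= ii < n_a and 0 <= jj < n_b:
                        total += kernel[a + h][b + h] * frame_sim[ii][jj]
            out[i][j] = total
    return out


# Hypothetical toy descriptors: three "poses" A, B, C as one-hot vectors.
A, B, C = [1, 0, 0], [0, 1, 0], [0, 0, 1]
query = [A, B, C]                 # novel sequence
database = [C, B, A, A, B, C]     # reversed clip followed by the same-order clip

frame_sim = frame_similarity(query, database)
motion_sim = motion_similarity(frame_sim, blurry_identity(1, sigma=0.5))

# Best database frame per query frame (first maximum wins on ties).
frame_best = [row.index(max(row)) for row in frame_sim]     # [2, 1, 0]
best_match = [row.index(max(row)) for row in motion_sim]    # [3, 4, 5]
```

In this toy example the per-frame match lands on the reversed clip (frames 2, 1, 0), while the motion-level match follows the clip played in the same order (frames 3, 4, 5), which is the point of comparing over a temporal extent.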