Action recognition in videos
Cordelia Schmid, INRIA Grenoble

Action recognition - problem
• Short actions, e.g. drinking, sitting down
(Coffee & Cigarettes dataset; Hollywood dataset)

Action recognition - problem
• Short actions, e.g. drinking, sitting down
• Activities/events, e.g. making a sandwich, depositing a suspicious object
(TRECVID Multimedia Event Detection)

TRECVID - Multimedia Event Detection
Example events: attempting a board trick, feeding an animal, wedding ceremony, getting a vehicle unstuck

Action recognition
• Action recognition is person-centric
• Vision is person-centric: we mostly care about things which are important
[Figure: movies 35%, TV 34%, YouTube 40% — source: I. Laptev]

Action recognition from still images
• Description of the human pose
  – Silhouette description [Sullivan & Carlsson, 2002]
  – Histograms of oriented gradients (HOG) [Dalal & Triggs, 2005]
  – Human body part layout [Felzenszwalb & Huttenlocher, 2000]

Action recognition from still images
• Supervised modeling of the interaction between human and object [Gupta et al., 2009; Yao & Fei-Fei, 2009]
• Weakly supervised learning of objects [Prest, Schmid & Ferrari, 2011]
  – results on the PASCAL VOC 2010 human action classification dataset

Importance of action objects
• The human pose alone is often not sufficient
• Objects define the actions

Importance of temporal information
• Video/temporal information is necessary to disambiguate actions
• Temporal context describes the action/activity
• Key frames provide significantly less information

Action recognition in videos
• Temporal information allows human and object detection to be stabilized by tracking
  – J. Malik: tracking by detection is difficult?
• Very large and fast-growing amount of data
  – H. Sawhney: large amount of data, but not often well explored
• Often comes with some form of supervision: scripts, subtitles
  – similar in spirit to M.
Hebert’s comment on the large amount of data collected by a robot

Action recognition in videos
• Motion history image [Bobick & Davis, 2001]
• Learning dynamic priors [Blake et al., 1998]
• Spatial motion descriptor [Efros et al., ICCV 2003]
• Sign language recognition [Zisserman et al., 2009]

Action recognition in videos
• Bag of space-time features [Laptev ’03, Schuldt ’04, Niebles ’06, Zhang ’07]
  – pipeline: collect space-time patches → describe each patch with HOG & HOF descriptors → build a histogram of visual words → classify with an SVM

Action recognition in videos
• Bag of space-time features
  – Many recent extensions: new features/tracklets, temporal structuring, etc.
  – Advantages:
    • very useful as a baseline
    • captures spatial and temporal context (see Efros’ comment on image classification)
  – Disadvantages:
    • no interpretation of the action
    • not sufficient for localization and description

Action recognition in videos
• Localization by 3D HOG/HOF, interaction with objects
  – detection and tracking; interaction with objects; space-time description

Action recognition in videos
• HOG 3D tracks + description
  – Detection and tracking of humans with part-based models works well
    • move towards more flexible models, similar to P.
Felzenszwalb’s
    • integrate motion information
    • a very good baseline
  – Towards more flexible descriptions based on body parts; a lot of recent work on finding human body parts
  – Interaction with objects is important, but hard

Discussion
• Need for more challenging datasets
  – need for realistic datasets (KTH dataset vs. Hollywood dataset)
  – scale up the number of classes (today ~10 actions per dataset)
  – increase the number of examples per class, possibly with weakly supervised learning (the number of examples per video is low)
  – define a taxonomy; use redundancy between action classes to improve training
  – manual exhaustive labeling of all actions is impossible

Discussion
• Make better use of the large amount of information inherent in videos
  – automatic collection of additional examples
  – incremental improvement of models
  – use of weak labels from associated data (text, sound, subtitles)
• Many existing techniques are straightforward extensions of methods for images
  – almost no use of 3D information
  – learn better interaction and temporal models
  – design activity models by decomposition into simple actions
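The motion history image [Bobick & Davis, 2001] mentioned above is simple enough to sketch directly: every pixel where frame-to-frame motion is detected is stamped with a maximal value tau, and all other pixels decay by one step per frame, so recent motion appears brighter than older motion. A minimal NumPy sketch, assuming grayscale frames given as arrays; the function name, motion threshold, and linear decay step are illustrative choices, not the authors' exact formulation.

```python
import numpy as np

def motion_history_image(frames, tau=20, motion_thresh=30):
    """Compute a motion history image (MHI) over a grayscale frame sequence.

    Pixels whose frame difference exceeds motion_thresh are set to tau;
    everywhere else the previous MHI value decays by 1 (floored at 0).
    """
    frames = [np.asarray(f, dtype=np.int16) for f in frames]
    mhi = np.zeros_like(frames[0], dtype=np.float32)
    for prev, cur in zip(frames, frames[1:]):
        motion = np.abs(cur - prev) > motion_thresh
        mhi = np.where(motion, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi
```

The resulting single image summarizes where and how recently motion occurred, which is what makes it usable as a compact action descriptor.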
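Likewise, the quantization step at the heart of the bag-of-space-time-features pipeline (space-time patches → HOG/HOF descriptors → histogram of visual words → SVM) can be sketched. Assumptions here: the descriptors and a precomputed visual vocabulary are given as NumPy arrays (the codebook would typically come from k-means over training descriptors), bow_histogram is an illustrative name, and the SVM classification of the resulting histograms is omitted.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local space-time descriptors (e.g. HOG/HOF) against a
    visual vocabulary and return an L1-normalized word histogram.

    descriptors: (n, d) array of patch descriptors from one video
    codebook:    (k, d) array of visual words (e.g. k-means centers)
    """
    # squared Euclidean distance from every descriptor to every word
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)  # nearest visual word for each patch
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)  # normalize so videos are comparable
```

Each video thus becomes a fixed-length vector regardless of its duration or the number of detected patches, which is exactly why the representation works as a drop-in input to a standard SVM.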