Some Recent Works on Human Activity Recognition
Xinxiao Wu (吴心筱), wuxinxiao@bit.edu.cn

Outline
- Action Description
- Action, Object and Scene
- Multi-View Action Recognition
- Action Detection
- Complex Activity Recognition
- Multimedia Event Detection

Action Description
- Extension of Interest Points
- Extension of Bag-of-Words
- Mid-level Attribute Feature
- Dense Trajectory
- Action Bank

Extension of Interest Points: Bregonzio et al., CVPR 2009
- Clouds of interest points accumulated over multiple temporal scales.
- Holistic features of the clouds capture the spatio-temporal information of the interest points.
Matteo Bregonzio, Shaogang Gong and Tao Xiang. Recognising Action as Clouds of Space-Time Interest Points. CVPR, 2009.

Extension of Interest Points: Wu et al., CVPR 2011
- Multi-scale spatio-temporal (ST) context distribution feature: characterizes the spatial and temporal context distributions of interest points over multiple space-time scales.
- A set of XYT relative coordinates between the central interest point and the other interest points in a local region.
- Multi-scale local regions across multiple space-time scales.
Xinxiao Wu, Dong Xu, Lixin Duan and Jiebo Luo. Action Recognition Using Context and Appearance Distribution Features. CVPR, 2011.

Extension of Bag-of-Words: Wu et al., CVPR 2011 (GMM vs. Bag-of-Words)
- A global GMM is trained using all local features from all the training videos.
- The video-specific GMM for a given video is generated from the global GMM via Maximum A Posteriori (MAP) adaptation.
Xinxiao Wu, Dong Xu, Lixin Duan and Jiebo Luo. Action Recognition Using Context and Appearance Distribution Features. CVPR, 2011.

Extension of Bag-of-Words: Kovashka and Grauman, CVPR 2010
- Exploit multiple bag-of-words models to represent a hierarchy of space-time configurations at different scales.
A. Kovashka and K. Grauman. Learning a Hierarchy of Discriminative Space-Time Neighborhood Features for Human Action Recognition. CVPR, 2010.

Extension of Bag-of-Words: Savarese et al., WMVC 2008
- Use a local histogram to capture co-occurrences of words in a local region.
S. Savarese, A. DelPozo, J. C. Niebles and L. Fei-Fei. Spatial-Temporal Correlatons for Unsupervised Action Classification. WMVC, 2008.

Extension of Bag-of-Words: Ryoo and Aggarwal, ICCV 2009
- Propose a "feature type x feature type x relationship" histogram to capture both the appearance of and the relationships between pairwise visual words.
M. Ryoo and J. Aggarwal. Spatio-Temporal Relationship Match: Video Structure Comparison for Recognition of Complex Human Activities. ICCV, 2009.

Mid-level Attribute Feature: Liu et al., CVPR 2011
- Action attributes: a set of intermediate concepts between low-level features and action classes.
- A unified framework in which action attributes are selected in a discriminative fashion.
- Data-driven attributes in addition to manually defined ones.
Jingen Liu, Benjamin Kuipers and Silvio Savarese. Recognizing Human Actions by Attributes. CVPR, 2011.
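To make the attribute idea concrete, here is a minimal two-stage sketch of a generic attribute-based action classifier: per-attribute classifiers are trained on low-level video descriptors, and the action category is then predicted from the vector of attribute scores. This illustrates only the general pipeline, not the exact formulation of Liu et al.; the data, shapes, and names (X, A, y) are all hypothetical placeholders.

```python
# A minimal sketch of a generic attribute-based pipeline (hypothetical data).
import numpy as np
from sklearn.svm import LinearSVC

# X: low-level video descriptors (n_videos x n_dims), e.g. bag-of-words histograms.
# A: binary attribute annotations (n_videos x n_attributes), e.g. "arm motion".
# y: action category labels (n_videos,).
rng = np.random.default_rng(0)
X = rng.random((200, 500))
A = (rng.random((200, 10)) > 0.5).astype(int)
y = rng.integers(0, 5, size=200)

# Stage 1: one binary classifier per attribute, trained on low-level features.
attribute_models = [LinearSVC(dual=False).fit(X, A[:, k]) for k in range(A.shape[1])]

# Stage 2: represent each video by its vector of attribute scores, then train
# an action classifier in this intermediate attribute space.
scores = np.column_stack([m.decision_function(X) for m in attribute_models])
action_model = LinearSVC(dual=False).fit(scores, y)

def predict_action(x_new):
    """Map a new low-level descriptor to attribute scores, then to an action."""
    s = np.array([[m.decision_function(x_new.reshape(1, -1))[0]
                   for m in attribute_models]])
    return action_model.predict(s)[0]
```

The intermediate attribute space is what allows such models to describe, and potentially recognize, action classes that share attributes with the training classes.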
Dense Trajectory: Wang et al., CVPR 2011
- Sample dense points from each frame and track them based on displacement information from a dense optical flow field.
- Four descriptors computed along each trajectory: trajectory shape, HOG, HOF and MBH.
Heng Wang, Alexander Klaser, Cordelia Schmid and Cheng-Lin Liu. Action Recognition by Dense Trajectories. CVPR, 2011.

Action Bank: Sadanand and Corso, CVPR 2012
- Action Bank: a large set of action detectors, analogous to Object Bank for images.
Sreemanananth Sadanand and Jason J. Corso. Action Bank: A High-Level Representation of Activity in Video. CVPR, 2012.

Action, Object and Scene: Ikizler-Cinbis and Sclaroff, ECCV 2010
- Combine information from the person, objects and scene.
- Multiple instance learning + multiple kernel learning: a bag contains all the instances extracted from a video for a particular feature channel, and different feature channels receive different kernel weights.
Nazli Ikizler-Cinbis and Stan Sclaroff. Object, Scene and Actions: Combining Multiple Features for Human Action Recognition. ECCV, 2010.

Action, Object and Scene: Marszalek et al., CVPR 2009
- Automatically discover the relation between scene classes and human actions using movie scripts.
- Develop a joint framework for action and scene recognition in natural video.
Marcin Marszalek, Ivan Laptev and Cordelia Schmid. Actions in Context. CVPR, 2009.

Multi-View Action Recognition (Multiple Views)
- View-invariant recognition
- Cross-view recognition

View-invariant: Weinland et al., ICCV 2007
- A 3D visual hull represents an action exemplar, computed with a system of 5 calibrated cameras.
- A 3D exemplar-based HMM is used for classification.
Daniel Weinland, Edmond Boyer and Remi Ronfard. Action Recognition from Arbitrary Views Using 3D Exemplars. ICCV, 2007.

View-invariant: Yan et al., CVPR 2008
- 4D action feature: 3D shapes over time.
Pingkun Yan, Saad M. Khan and Mubarak Shah. Learning 4D Action Feature Models for Arbitrary View Action Recognition. CVPR, 2008.

View-invariant: Junejo et al., IEEE T-PAMI
- A novel view-invariant feature: the self-similarity descriptor, built from frame-to-frame similarities (a small sketch follows at the end of this section).
Imran N. Junejo, Emilie Dexter, Ivan Laptev and Patrick Perez. View-Independent Action Recognition from Temporal Self-Similarities. IEEE T-PAMI, 2008.

View-invariant: Lewandowski et al., ECCV 2010
- View-independent manifold representation: a style-invariant embedded manifold describes the action for each view, and all view-dependent manifolds are automatically combined into a unified manifold.
Michal Lewandowski, Dimitrios Makris and Jean-Christophe Nebel. View and Style-Independent Action Manifolds for Human Activity Recognition. ECCV, 2010.

View-invariant: Wu and Jia, ECCV 2012
- Propose a latent kernelized structural SVM: the view index is treated as a latent variable and inferred during both training and testing.
Xinxiao Wu and Yunde Jia. View-Invariant Action Recognition Using Latent Kernelized Structural SVM. ECCV, 2012.
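As a toy illustration of the latent-view inference in the last slide: prediction maximizes a compatibility score jointly over the action label and the unobserved view index. The sketch below uses a plain linear score with random weights standing in for a learned model; the kernelization and the max-margin training of Wu and Jia are not reproduced here.

```python
# Toy latent-view inference: maximize the score over (action, latent view).
import numpy as np

n_actions, n_views, n_dims = 5, 4, 128
# w[y, v] is one weight vector per (action, view) pair, assumed already learned;
# random values here stand in for the trained model.
w = np.random.default_rng(1).standard_normal((n_actions, n_views, n_dims))

def predict(phi):
    """phi: feature vector of a test video; returns (action, inferred view)."""
    scores = w @ phi                      # (n_actions, n_views) score table
    y, v = np.unravel_index(np.argmax(scores), scores.shape)
    return y, v
```

Treating the view as latent means no view label is needed at test time; the model simply picks the view under which the action hypothesis scores highest.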
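Looking back at the self-similarity descriptor slide above, a temporal self-similarity matrix (SSM) is simple to compute: entry (i, j) is the distance between descriptors of frames i and j, and the pattern of this matrix is largely stable across viewpoints. The per-frame descriptor below (raw grayscale intensities) is a placeholder assumption; Junejo et al. build the matrix from richer cues such as tracked point positions or optical flow.

```python
# A minimal temporal self-similarity matrix, with a placeholder frame descriptor.
import numpy as np

def self_similarity_matrix(frames):
    """frames: (n_frames, height, width) grayscale video -> (n_frames, n_frames) SSM."""
    f = frames.reshape(len(frames), -1).astype(float)
    sq = (f ** 2).sum(axis=1)
    # Pairwise squared Euclidean distances via |a - b|^2 = |a|^2 + |b|^2 - 2 a.b,
    # clipped at zero to absorb small negative values from floating-point error.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (f @ f.T)
    return np.sqrt(np.maximum(d2, 0.0))
```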
Cross-view: Liu et al., CVPR 2011
- Learn bilingual words from both the source view and the target view.
- Transfer action models between the two views via a bag-of-bilingual-words model.
Jingen Liu, Mubarak Shah, Benjamin Kuipers and Silvio Savarese. Cross-View Action Recognition via View Knowledge Transfer. CVPR, 2011.

Cross-view: Li and Zickler, CVPR 2012
- Propose "virtual views" to connect action descriptors from the source view and the target view.
- Each virtual view is associated with a linear transformation of the action descriptor, and the sequence of transformations arising from the sequence of virtual views bridges the source and target views.
Ruonan Li and Todd Zickler. Discriminative Virtual Views for Cross-View Action Recognition. CVPR, 2012.

Cross-view: Wu et al., PCM 2012
- Transfer Discriminant-Analysis of Canonical Correlations (Transfer DCC).
- Minimize the mismatch between the data distributions of the source and target views.
Xinxiao Wu, Cuiwei Liu and Yunde Jia. Transfer Discriminant-Analysis of Canonical Correlations for View-Transfer Action Recognition. PCM, 2012.

Action Detection: Yuan et al., IEEE T-PAMI 2012
- A discriminative pattern-matching criterion for action classification: naive-Bayes mutual information maximization (NBMIM).
- An efficient search algorithm: spatio-temporal branch-and-bound (STBB) search.
Junsong Yuan, Zicheng Liu and Ying Wu. Discriminative Video Pattern Search for Efficient Action Detection. IEEE T-PAMI, 2012.

Action Detection: Hu et al., ICCV 2009
- The candidate regions of an action are treated as a bag of instances.
- A novel multiple-instance learning framework, SMILE-SVM (Simulated annealing Multiple Instance LEarning Support Vector Machines), is proposed for learning a human action detector.
Yuxiao Hu, Liangliang Cao, Fengjun Lv, Shuicheng Yan, Yihong Gong and Thomas S. Huang. Action Detection in Complex Scenes with Spatial and Temporal Ambiguities. ICCV, 2009.

Complex Activity Recognition: Gaidon et al., CVPR 2011
- Actom Sequence Model: represent an activity as a sequence of atomic-action-anchored visual features.
- Automatically detect the atomic actions from an input activity video.
A. Gaidon, Z. Harchaoui and C. Schmid. Actom Sequence Models for Efficient Action Detection. CVPR, 2011.

Complex Activity Recognition: Hoai et al., CVPR 2011
- Jointly perform video segmentation and action recognition.
M. Hoai, Z. Lan and F. De la Torre. Joint Segmentation and Classification of Human Actions in Video. CVPR, 2011.

Complex Activity Recognition: Tang et al., CVPR 2012
- Each activity is modeled by a set of latent state variables and duration variables; the states are the cluster centers obtained by clustering all the fixed-length video clips from the training data.
- A max-margin discriminative model is introduced to learn the temporal structure of complex events.
K. Tang, F.-F. Li and D. Koller. Learning Latent Temporal Structure for Complex Event Detection. CVPR, 2012.

Multimedia Event Detection: Izadinia and Shah, ECCV 2012
- A latent discriminative model recognizes complex events by modeling the co-occurrence relationships between different low-level events in a graph.
- Each video is divided into short clips, and each clip is manually annotated with a low-level event label; these annotations are used for training the low-level event detectors.
H. Izadinia and M. Shah. Recognizing Complex Events Using Large Margin Joint Low-Level Event Model. ECCV, 2012.

Thanks for your attention! Q & A?