Lecture 1: Human Activity Analysis

Some Recent Works on Human Activity Recognition
Xinxiao Wu (吴心筱)
wuxinxiao@bit.edu.cn
Action Description
Action, Object and Scene
Multi-View Action Recognition
Action Detection
Complex Activity Recognition
Multimedia Event Detection
Action Description
Extension of Interest Points
Extension of Bag-of-Words
Mid-level Attribute Feature
Dense Trajectory
Action Bank
Extension of Interest Points
Bregonzio et al., CVPR, 2009
Clouds of interest points accumulated over multiple temporal scales.
Matteo Bregonzio, Shaogang Gong and Tao Xiang. Recognising Action as
Clouds of Space-Time Interest Points. CVPR 2009.
Extension of Interest Points
Holistic features of the clouds capture the spatio-temporal distribution of the interest points.
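A rough sketch of the idea in code (the helper below and its window sizes are illustrative, not the authors' implementation): accumulate detected (x, y, t) interest points over several temporal windows and summarize each resulting cloud with simple holistic shape statistics.

import numpy as np

def cloud_features(points, t0, window_sizes=(10, 20, 40)):
    # points: (N, 3) array of (x, y, t) space-time interest points
    # window_sizes: temporal scales over which clouds are accumulated
    feats = []
    for w in window_sizes:
        cloud = points[(points[:, 2] >= t0) & (points[:, 2] < t0 + w)]
        if len(cloud) == 0:
            feats += [0.0] * 4
            continue
        feats += [len(cloud) / w,                                        # point density over time
                  np.ptp(cloud[:, 1]) / max(np.ptp(cloud[:, 0]), 1.0),   # cloud aspect ratio
                  cloud[:, 0].std(),                                     # horizontal spread
                  cloud[:, 1].std()]                                     # vertical spread
    return np.asarray(feats)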
Extension of Interest Points
Wu et al., CVPR, 2011
Multi-scale spatio-temporal (ST) context distribution feature:
Characterize the spatial and temporal context distributions of interest points over multiple space-time scales.
Xinxiao Wu, Dong Xu, Lixin Duan and Jiebo Luo. Action recognition using context and appearance distribution features. CVPR, 2011.
Extension of Interest Points
A set of XYT relative coordinates between the center interest point and other interest points in a local region.
Multi-scale local regions across multiple space-time scales.
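A minimal sketch of the context feature (the region sizes are made up for illustration): for one interest point, gather the relative XYT offsets of its neighbors inside local regions at several space-time scales; histograms of these offsets give the multi-scale context distribution.

import numpy as np

def st_context(points, center, scales=((20, 20, 10), (40, 40, 20))):
    # points: (N, 3) array of (x, y, t); center: one interest point, shape (3,)
    contexts = []
    for sx, sy, st in scales:
        d = points - center                     # relative XYT coordinates
        inside = (np.abs(d[:, 0]) <= sx) & (np.abs(d[:, 1]) <= sy) \
                 & (np.abs(d[:, 2]) <= st)
        contexts.append(d[inside])              # context set at this scale
    return contexts                             # one offset set per scale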
Extension of Bag-of-Words
Wu et al., CVPR, 2011
A global GMM is trained using all local features from all the training videos.
The video-specific GMM for a given video is generated from the global GMM via a Maximum A Posteriori adaptation process.
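The adaptation step can be sketched as follows (a standard mean-only MAP update with relevance factor r; the function name and the value of r are assumptions, not the paper's code):

import numpy as np
from sklearn.mixture import GaussianMixture

def map_adapt_means(global_gmm, X, r=16.0):
    # X: all local features of one video; global_gmm: trained on all videos
    gamma = global_gmm.predict_proba(X)          # soft assignments (N, K)
    n_k = gamma.sum(axis=0)                      # soft count per component
    E_k = gamma.T @ X / np.maximum(n_k, 1e-8)[:, None]   # per-component mean
    alpha = (n_k / (n_k + r))[:, None]           # adaptation strength
    # video-specific means interpolate between the video's statistics and the
    # global means: components with little evidence stay near the global GMM
    return alpha * E_k + (1.0 - alpha) * global_gmm.means_

Unlike hard bag-of-words quantization, every local feature here contributes softly to every component, which is the contrast the next slide draws.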
GMM vs Bag-of-Words
Extension of Bag-of-Words
Kovashka and Grauman, CVPR, 2010
Exploit multiple "bag-of-words" models to represent a hierarchy of space-time configurations at different scales.
A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. CVPR, 2010.
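One way to read this in code (a two-level sketch; the radius and vocabulary size are invented for illustration, and space and time are treated isotropically for simplicity): describe each point's neighborhood by a histogram of base visual words, then quantize those histograms into a second-level vocabulary; repeating this over several radii yields the hierarchy.

import numpy as np
from sklearn.cluster import KMeans

def neighborhood_words(points, words, n_vocab=100, radius=30.0):
    # points: (N, 3) XYT positions; words: (N,) base visual-word ids
    K = int(words.max()) + 1
    hists = []
    for p in points:                   # one space-time neighborhood per point
        near = np.linalg.norm(points - p, axis=1) <= radius
        hists.append(np.bincount(words[near], minlength=K))
    hists = np.asarray(hists, dtype=float)
    # quantize the neighborhood histograms into second-level words
    # (assumes there are more points than clusters)
    return KMeans(n_clusters=n_vocab, n_init=10).fit_predict(hists)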
Extension of Bag-of-Words
Savarese et al., WMVC, 2008
Use a local histogram to capture co-occurrences of words in a local region.
S. Savarese, A. Delpozo, J.C. Niebles and L. Fei-Fei. Spatial-temporal correlatons for unsupervised action classification. WMVC, 2008.
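A bare-bones version of the idea (the radius is invented for illustration): count which visual-word pairs fall inside the same local region, giving a co-occurrence histogram rather than a plain word histogram.

import numpy as np

def local_cooccurrence(points, words, n_words, radius=25.0):
    # points: (N, 3) XYT positions; words: (N,) visual-word ids
    H = np.zeros((n_words, n_words))
    for i, p in enumerate(points):
        near = np.linalg.norm(points - p, axis=1) <= radius
        for w in words[near]:
            H[words[i], w] += 1        # word i co-occurs with word w locally
    return H / max(H.sum(), 1.0)       # normalized co-occurrence histogram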
Extension of Bag-of-Words
Ryoo and Aggarwal, ICCV, 2009
Propose a "feature-type × feature-type × relationship" histogram to capture both appearance and relationship information between pairwise visual words.
M. Ryoo and J. Aggarwal. Spatio-temporal relationship match: video structure comparison for recognition of complex human activities. ICCV, 2009.
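A sketch of such a histogram with a crude 3-way temporal relationship (the binning and threshold simplify the paper's richer spatio-temporal predicates):

import numpy as np

def pairwise_relation_histogram(words, times, n_words, tau=10.0):
    # 3D histogram over (word_i, word_j, relationship);
    # relationships here: 0 = i before j, 1 = co-occurring, 2 = i after j
    H = np.zeros((n_words, n_words, 3))
    for i in range(len(words)):
        for j in range(len(words)):
            if i == j:
                continue
            dt = times[j] - times[i]
            rel = 1 if abs(dt) <= tau else (0 if dt > 0 else 2)
            H[words[i], words[j], rel] += 1
    return H.ravel()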
Mid-level Attribute Feature
Liu et al., CVPR, 2011
Action attributes: a set of intermediate concepts.
A unified framework: action attributes are effectively selected in a discriminative fashion.
Data-driven attributes.
Jingen Liu, Benjamin Kuipers and Silvio Savarese. Recognizing Human
Actions by Attributes. CVPR, 2011.
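The attribute pipeline in rough form (the classifier choice is an assumption, and the paper's discriminative selection step is omitted): train one detector per attribute, re-represent each video by its vector of attribute scores, and classify actions in that mid-level space.

import numpy as np
from sklearn.svm import LinearSVC

def attribute_scores(X_train, A_train, X_test):
    # X_train: (N, D) low-level features; A_train: (N, M) binary attributes
    scores = []
    for m in range(A_train.shape[1]):
        clf = LinearSVC().fit(X_train, A_train[:, m])  # one attribute detector
        scores.append(clf.decision_function(X_test))
    return np.stack(scores, axis=1)    # (n_test, M) mid-level attribute feature

# an ordinary classifier trained on these scores then predicts the action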
Dense Trajectory
Wang et al., CVPR, 2011
Sample dense points from each frame and track them based on displacement information from a dense optical flow field.
Heng Wang, Alexander Klaser, Cordelia Schmid and Cheng-Lin Liu. Action Recognition by Dense Trajectories. CVPR, 2011.
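The tracking step might look like this (a simplified sketch using OpenCV's Farneback flow; the grid step, filter size and trajectory length are illustrative choices):

import cv2
import numpy as np

def track_dense_points(frames, step=5, max_len=15):
    # frames: list of grayscale uint8 images
    h, w = frames[0].shape
    ys, xs = np.mgrid[0:h:step, 0:w:step]       # dense grid in the first frame
    tracks = [[(float(x), float(y))] for x, y in zip(xs.ravel(), ys.ravel())]
    for prev, curr in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # median-filter the flow so each point moves with its neighborhood
        fx = cv2.medianBlur(np.ascontiguousarray(flow[..., 0]), 5)
        fy = cv2.medianBlur(np.ascontiguousarray(flow[..., 1]), 5)
        for tr in tracks:
            if len(tr) >= max_len:               # trajectories are length-limited
                continue
            x, y = tr[-1]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < w and 0 <= yi < h:
                tr.append((x + float(fx[yi, xi]), y + float(fy[yi, xi])))
    return tracks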
Wang et al., CVPR, 2011.
Four descriptors along each trajectory: trajectory shape, HOG, HOF, and MBH.
Action Bank
Sadanand and Corso, CVPR, 2012
Object Bank → Action Bank
Action Bank: a large set of action detectors.
Sreemanananth Sadanand and Jason J. Corso. Action Bank: A High-Level
Representation of Activity in Video, CVPR, 2012.
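Conceptually, each video is run through every detector in the bank and the responses are max-pooled into one long vector; a 1-D sketch of the pooling (the real system pools volumetrically in space-time):

import numpy as np

def action_bank_feature(detector_responses, n_levels=3):
    # detector_responses: one 1-D array of correlation scores per bank
    # detector (assumed precomputed and long enough to split)
    feats = []
    for resp in detector_responses:
        for level in range(n_levels):          # pool at several resolutions
            feats += [p.max() for p in np.array_split(resp, 2 ** level)]
    return np.asarray(feats)                   # fed to an SVM for recognition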
Action, Object and Scene
Nazli Ikizler-Cinbis and Stan Sclaroff, ECCV, 2010
Combine the information from person, object and scene.
Multiple instance learning + multiple kernel learning.
A bag contains all the instances extracted from a video for a particular feature channel.
Different features have different kernel weights.
Nazli Ikizler-Cinbis and Stan Sclaroff, Object, Scene and Actions: Combining Multiple
Features for Human Action Recognition, ECCV, 2010.
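A loose sketch of how the two ideas combine (a simple set kernel stands in for the MIL machinery; the RBF kernel and the weights are assumptions): each feature channel contributes its own bag-level kernel, and the channels are mixed by learned weights.

import numpy as np

def combined_bag_kernel(bags_a, bags_b, weights, gamma=1.0):
    # bags_x[c]: (n_instances, D_c) bag for feature channel c
    # weights[c]: kernel weight of channel c, learned by MKL
    k = 0.0
    for c, beta in enumerate(weights):
        A, B = bags_a[c], bags_b[c]
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        k += beta * np.exp(-gamma * d).mean()  # average over instance pairs
    return k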
Marszalek et al., CVPR, 2009
Automatically discover the relation between scene classes and human actions using movie scripts.
Develop a joint framework for action and scene recognition in natural video.
Marcin Marszalek, Ivan Laptev and Cordelia Schmid. Actions in Context. CVPR, 2009.
Multi-View Action Recognition
Multiple Views
View-invariant Recognition
Cross-view Recognition
View-invariant
Weinland et al., ICCV, 2007
A 3D visual hull, built from a system of 5 calibrated cameras, represents each action exemplar.
Daniel Weinland, Edmond Boyer and Remi Ronfard. Action recognition from arbitrary views using 3D exemplars. ICCV, 2007.
Weinland et al., ICCV, 2007
3D exemplar-based HMM for classification
View-invariant
Yan et al., CVPR, 2008.
4D action feature: 3D shapes over time (4D)
Pingkun Yan, Saad M. Khan, Mubarak Shah. Learning 4D Action
Feature Models for Arbitrary View Action Recognition. CVPR, 2008.
View-invariant
Junejo et al., IEEE T-PAMI, 2011
A novel view-invariant feature: the temporal self-similarity descriptor.
Frame-to-frame similarity.
Imran N. Junejo, Emilie Dexter, Ivan Laptev and Patrick Perez. View-independent action recognition from temporal self-similarities. IEEE T-PAMI, 2011.
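The descriptor starts from a temporal self-similarity matrix, which is easy to state in code (the choice of per-frame descriptor is left open):

import numpy as np

def self_similarity_matrix(frame_descs):
    # frame_descs: (T, D) per-frame descriptors, e.g. of the tracked person
    d = frame_descs[:, None, :] - frame_descs[None, :, :]
    return np.sqrt((d ** 2).sum(-1))   # (T, T) frame-to-frame distance matrix

# local patterns of this matrix are largely stable under viewpoint changes,
# which is what makes the resulting descriptor view-invariant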
View-invariant
Lewandowski et al., ECCV, 2010
View-independent manifold representation.
A style-invariant embedded manifold is produced to describe an action for each view.
All view-dependent manifolds are automatically combined to generate a unified manifold.
Michal Lewandowski, Dimitrios Makris, and Jean-Christophe Nebel. View and style-independent action manifolds for human activity recognition. ECCV, 2010.
View-invariant
Wu and Jia, ECCV, 2012.
Propose a latent kernelized structural SVM.
The view index is treated as a latent variable
and inferred during both training and testing.
Xinxiao Wu and Yunde Jia. View-Invariant action recognition using
latent kernelized structural SVM. ECCV, 2012.
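Inference with a latent view can be sketched with a linear stand-in for the kernelized model (the structure is illustrative): score every (class, view) pair and maximize over the unobserved view.

import numpy as np

def predict_with_latent_view(W, x):
    # W[y][v]: weight vector for action class y under latent view v
    scores = np.array([[w_v @ x for w_v in W_y] for W_y in W])
    y_hat, v_hat = np.unravel_index(scores.argmax(), scores.shape)
    return y_hat, v_hat                # predicted action, inferred view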
Cross-view
Liu et al., CVPR, 2011.
Learn bilingual words from both the source view and the target view.
Transfer action models between the two views via the bag-of-bilingual-words model.
Jingen Liu, Mubarak Shah, Benjamin Kuipers and Silvio Savarese. Cross-View Action Recognition via View Knowledge Transfer. CVPR, 2011.
Cross-view
Li and Zickler, CVPR, 2012
Propose "virtual views" to connect action descriptors from the source view and the target view.
Each virtual view is associated with a linear transformation of the action descriptor, and the sequence of transformations arising from the sequence of virtual views bridges the source and target views.
Ruonan Li and Todd Zickler. Discriminative virtual views for cross-view action recognition. CVPR, 2012.
Cross-view
Wu et al., PCM, 2012.
Transfer Discriminant-Analysis of Canonical Correlations (Transfer DCC).
Minimize the mismatch between the data distributions of the source and target views.
Xinxiao Wu, Cuiwei Liu, and Yunde Jia. Transfer discriminant-analysis of canonical correlations for view-transfer action recognition. PCM, 2012.
Action Detection
Yuan et al., IEEE T-PAMI, 2011.
A discriminative pattern matching criterion for
action classification: naïve-Bayes mutual
information maximization (NBMIM)
An efficient search algorithm: spatio-temporal
branch-and-bound (STBB) search algorithm
Junsong Yuan, Zicheng Liu, and Ying Wu. Discriminative video pattern search for efficient action detection. IEEE T-PAMI, 2011.
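The NBMIM score of a candidate subvolume is just the sum of pointwise log-likelihood-ratio votes inside it; a sketch (the box format is an assumption):

import numpy as np

def nbmim_score(points, votes, box):
    # points: (N, 3) locations of local features; votes[i]: pointwise
    # log-likelihood ratio log P(d_i | action) / P(d_i | background)
    x0, y0, t0, x1, y1, t1 = box
    inside = ((points[:, 0] >= x0) & (points[:, 0] < x1) &
              (points[:, 1] >= y0) & (points[:, 1] < y1) &
              (points[:, 2] >= t0) & (points[:, 2] < t1))
    return votes[inside].sum()

# detection maximizes this score over all subvolumes; the STBB search
# finds the maximum without exhaustively scanning every box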
Hu et al., ICCV, 2009.
The candidate regions of an action are treated as a bag of instances.
A novel multiple-instance learning framework, named SMILE-SVM (Simulated annealing Multiple Instance Learning Support Vector Machines), is proposed for learning the human action detector.
Yuxiao Hu, Liangliang Cao, Fengjun Lv, Shuicheng Yan, Yihong Gong and Thomas S. Huang. Action detection in complex scenes with spatial and temporal ambiguities. ICCV, 2009.
Complex Activity Recognition
Gaidon et al., CVPR, 2011
Actom Sequence Model: represent an activity as a sequence of atomic-action-anchored visual features.
Automatically detect atomic actions from an input activity video.
A. Gaidon, Z. Harchaoui, and C. Schmid. Actom sequence models
for efficient action detection. CVPR, 2011.
Hoai et al., CVPR, 2011.
Jointly perform video segmentation and action
recognition.
M. Hoai, Z.-Z. Lan, and F. De la Torre. Joint segmentation and classification of human actions in video. CVPR, 2011.
Tang et al., CVPR, 2012.
Each activity is modeled by a set of latent
state variables and duration variables.
The states are the cluster centers obtained by clustering all fixed-length video clips from the training data.
A max-margin discriminative model is introduced to learn the temporal structure of complex events.
K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. CVPR, 2012.
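The state-learning step can be sketched as follows (the clip length, pooling and state count are illustrative choices):

import numpy as np
from sklearn.cluster import KMeans

def learn_states(videos, clip_len=30, n_states=10):
    # videos: list of (T, D) per-frame feature arrays
    clips = []
    for v in videos:                   # fixed-length, non-overlapping clips
        for s in range(0, len(v) - clip_len + 1, clip_len):
            clips.append(v[s:s + clip_len].mean(axis=0))   # mean-pool a clip
    # the cluster centers become the latent states of the temporal model
    return KMeans(n_clusters=n_states, n_init=10).fit(np.asarray(clips))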
Multimedia Event Detection
Izadinia and Shah, ECCV, 2012.
A latent discriminative model is proposed to detect the low-level events by modeling the co-occurrence relationships between different low-level events in a graph.
Each video is divided into short clips, and each clip is manually annotated with one low-level event label; these annotations are used for training the low-level detectors.
H. Izadinia and M. Shah. Recognizing complex events using large
margin joint low-level event model. ECCV, 2012.
Thanks for your
attention!
Q & A?