[Human-Computer Interaction : From Theory to Applications]
Final Report (Paper Study)
A survey on vision-based human action recognition
Image and Vision Computing 28 (2010) 976–990
Student: Husan-Pei Wang (王瑄珮)
Student ID: P96994020
Instructor: Jenn-Jier Lien, Ph.D.
Outline

1. Introduction
   1.1 Challenges and characteristics of the domain
   1.2 Common datasets
2. Image Representation
   2.1 Global representations
       2.1.1 Space–time volumes
   2.2 Local representations
       2.2.1 Local descriptors
       2.2.2 Correlations between local descriptors
   2.3 Application-specific representations
3. Action classification
   3.1 Direct classification
   3.2 Temporal state-space models
   3.3 Action detection
4. Discussion
5. References
1. Introduction (1/2)

This paper considers the task of labeling videos containing human motion with action classes.

Challenges:
- Variations in motion performance
- Recording settings
- Inter-personal differences

This paper provides a detailed overview of current advances in the field that address these challenges.
1. Introduction (2/2)

The recognition of movement can be performed at various levels of abstraction. This paper adopts the hierarchy used by Moeslund et al. [1]: activity > action > action primitive.
- Action primitive: a single elementary movement, e.g., moving the left leg forward.
- Action: a sustained single movement, e.g., running.
- Activity: composed of several actions, e.g., a hurdle race consists of running and jumping.
1.1 Challenges and characteristics of the domain

Intra- and inter-class variations
- A good human action recognition approach should be able to generalize over variations within one class and distinguish between actions of different classes.

Environment and recording settings
- The environment in which the action performance takes place is an important source of variation in the recording.
- The same action, observed from different viewpoints, can lead to very different image observations.

Temporal variations
- Actions are assumed to be readily segmented in time.
- The rate at which the action is recorded has an important effect on the temporal extent of an action.

Obtaining and labeling training data
- Using publicly available datasets for training provides a sound mechanism for comparison.
- When no labels are available, an unsupervised approach needs to be pursued, but there is no guarantee that the discovered classes are semantically meaningful.
1.2 Common datasets

Widely used sets:
- KTH human motion dataset
- Weizmann human action dataset
- INRIA XMAS multi-view dataset
- UCF sports action dataset
- Hollywood human action dataset
Dataset characteristics (recovered from the comparison table):
- KTH: 6 actions, 25 actors, 4 scenarios; fixed viewpoint, relatively static camera; static backgrounds.
- Weizmann: 10 actions, 9 actors; static camera, static background and illumination; includes foreground silhouettes.
- INRIA XMAS: 14 actions, 11 actors; 5 camera views; static background and illumination; includes silhouettes and volumetric voxel data.
- UCF sports: 150 sequences of sport motions; no limit on the number of actors; considerable variation in human appearance, camera movement, viewpoint, illumination and background.
- Hollywood: 8 actions; huge variety in action performance, occlusions, camera movements and dynamic backgrounds.
2. Image Representation

This section discusses the features that are extracted from the image sequences. This paper divides image representations into two categories:

- Global representations: obtained in a top-down fashion. The person is first localized through background subtraction and tracking; the resulting region of interest is then encoded as a whole to produce the image descriptor.
- Local representations: proceed in a bottom-up fashion. Interest points are first detected in the spatio-temporal domain; patches around these interest points are then computed, and the patches are combined into the final representation.
2.1 Global Representations

Global representations encode the region of interest (ROI) of a person as a whole.
- The ROI is usually obtained through background subtraction or tracking (a sketch of this step follows below).
- They are sensitive to noise, partial occlusions and variations in viewpoint.
- To partly overcome these issues, grid-based approaches spatially divide the observation into cells, each of which encodes part of the observation locally.
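A minimal sketch of ROI extraction via background subtraction, using OpenCV's MOG2 background model. The video filename and all parameter values are illustrative assumptions, not taken from the survey:

```python
import cv2

cap = cv2.VideoCapture("walking.avi")  # hypothetical input sequence
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                   # foreground mask
    mask = cv2.morphologyEx(                         # suppress speckle noise
        mask, cv2.MORPH_OPEN,
        cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        person = max(contours, key=cv2.contourArea)  # assume largest blob is the person
        x, y, w, h = cv2.boundingRect(person)
        roi = frame[y:y + h, x:x + w]                # region of interest to encode
cap.release()
```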
2.1.1 Space–time volumes

A 3D spatio-temporal volume (STV) is formed by stacking frames over a given sequence.
- Requires accurate localization, alignment and possibly background subtraction.
- Blank et al. [2,3] first stack silhouettes over a given sequence to form an STV (figure from [2] omitted here); a minimal stacking sketch follows.
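The stacking step itself is simple, assuming per-frame binary silhouettes are already available (e.g., from background subtraction); the shapes below are toy values:

```python
import numpy as np

def build_stv(silhouettes):
    """Stack per-frame binary silhouettes (H x W) into a 3D
    space-time volume of shape (T, H, W), as in Blank et al. [2,3]."""
    return np.stack(silhouettes, axis=0)

# Toy usage: 20 frames of a 64x48 silhouette sequence.
frames = [np.zeros((64, 48), dtype=np.uint8) for _ in range(20)]
stv = build_stv(frames)
print(stv.shape)  # (20, 64, 48)
```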
2.2 Local Representations (1/2)

Local representations describe the observation as a collection of local descriptors or patches.
- They are somewhat invariant to changes in viewpoint, person appearance and partial occlusions.

Space–time interest points are the locations in space and time where sudden changes of movement occur in the video.
- Laptev and Lindeberg [4] extended the Harris corner detector [5] to 3D. Space–time interest points are those points where the local neighborhood has a significant variation in both the spatial and the temporal domain. The work is extended to compensate for relative camera motions in [6].
- Drawback: the relatively small number of stable interest points.
2.2 Local Representations (2/2)

Improvement: Dollár et al. [7] apply Gabor filtering on the spatial and temporal dimensions individually (see the response-function sketch after this slide).
- The number of interest points is adjusted by changing the spatial and temporal size of the neighborhood in which local maxima of the response are selected.

Instead of detecting interest points over the entire volume:
- Wong and Cipolla [8] first detect subspaces of correlated movement. These subspaces correspond to large movements such as an arm wave.
- Within these subspaces, a sparse set of interest points is detected.
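A sketch of the response function of the detector of Dollár et al. [7], pairing spatial Gaussian smoothing with a temporal quadrature pair of 1D Gabor filters; the parameter values here are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def cuboid_response(video, sigma=2.0, tau=1.5, omega=0.25):
    """Response map over a video of shape (T, H, W); interest points
    are taken at local maxima of the returned map."""
    # Spatial smoothing of each frame with a 2D Gaussian (no temporal blur).
    smoothed = gaussian_filter(video.astype(float), sigma=(0, sigma, sigma))
    # Temporal quadrature pair of 1D Gabor filters.
    t = np.arange(-4 * tau, 4 * tau + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    even = convolve1d(smoothed, h_ev, axis=0)
    odd = convolve1d(smoothed, h_od, axis=0)
    return even**2 + odd**2
```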
2.2.1 Local descriptors (1/2)

Local descriptors summarize an image or video patch in a representation that is ideally invariant to background clutter, appearance and occlusions, and possibly to rotation and scale.
- The spatial and temporal size of a patch is usually determined by the scale of the interest point.

[Figure: extraction of space–time cuboids at interest points from similar actions performed by different persons [6].]
2.2.1 Local descriptors (2/2)

Challenge: the varying number and the usually high dimensionality of the descriptors make it hard to compare sets of local descriptors.

To overcome this (a minimal sketch follows below):
- A codebook is generated by clustering patches and selecting either cluster centers or the closest patches as codewords.
- A local descriptor is then described by its codeword contribution.
- A frame or sequence can be represented as a bag-of-words: a histogram of codeword frequencies.
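A minimal bag-of-words sketch under these assumptions: descriptors are synthetic stand-ins for real patch descriptors, the codebook is built with k-means, and cluster centers serve as codewords:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 128))   # e.g. 128-D patch descriptors

# Cluster centers act as the codewords of a 200-word codebook.
codebook = KMeans(n_clusters=200, n_init=10, random_state=0).fit(train_descriptors)

def bag_of_words(descriptors, codebook):
    words = codebook.predict(descriptors)                     # assign each descriptor to a codeword
    hist = np.bincount(words, minlength=codebook.n_clusters)  # codeword frequencies
    return hist / max(hist.sum(), 1)                          # normalize for sequence length

sequence_repr = bag_of_words(rng.normal(size=(300, 128)), codebook)
```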
2.2.2 Correlations between local descriptors

This section describes approaches that exploit correlations between local descriptors, for selection or for the construction of higher-level descriptors.
- Scovanner et al. [11] construct a word co-occurrence matrix and iteratively merge words with similar co-occurrences until the difference between all pairs of words is above a specified threshold. This leads to a reduced codebook size, and similar actions are likely to generate more similar distributions of codewords.

Correlations between descriptors can also be obtained by tracking features.
- Sun et al. [12] calculate SIFT (scale-invariant feature transform) descriptors around interest points in each frame and use Markov chaining to determine tracks of these features (the per-frame SIFT step is sketched below).
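The per-frame SIFT step might look like the following OpenCV sketch (requires a recent opencv-python build); the Markov-chaining tracking step of [12] is not reproduced here:

```python
import cv2

sift = cv2.SIFT_create()

def frame_sift(frame):
    """Detect SIFT keypoints and compute 128-D descriptors in one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors  # descriptors: N x 128
```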
2.3 Application-specific representations

This section discusses works that use representations directly motivated by the domain of human action recognition.
- Smith et al. [13] use a number of specifically selected features: low-level features deal with color and movement; higher-level features are obtained from head and hand regions; a boosting scheme takes into account the history of the action performance.
- Vitaladevuni et al. [14] are inspired by the observation that human actions differ in accelerating and decelerating force. They identify reach, yank and throw types. Temporal segmentation into atomic movements, each described by movement type, spatial location and direction of movement, is performed first.
3. Action classification

When an image representation is available for an observed frame or sequence, human action recognition becomes a classification problem. Three approaches are distinguished:
- Direct classification
- Temporal state-space models
- Action detection
3.1 Direct classification (1/2)

Direct classification does not pay special attention to the temporal domain.
- All frames of an observed sequence are summarized into a single representation, or action recognition is performed for each frame individually.

Dimensionality reduction (a PCA sketch follows below):
- Dimensionality-reduction methods analyze the data to find an embedding that maps it from the original high-dimensional space to a low-dimensional space.
- This lowers the computational complexity,
- yields a more intrinsically meaningful data representation,
- and makes high-dimensional data easier to visualize.
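A PCA-based sketch of this reduction step on synthetic data; PCA stands in here for whichever embedding method a given system actually uses:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))   # 500 frames, 200-D representations (synthetic)

pca = PCA(n_components=10).fit(X)
X_low = pca.transform(X)          # 500 x 10 low-dimensional embedding
print(pca.explained_variance_ratio_.sum())  # variance retained by the embedding
```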
3.1 Direct classification (2/2)

Nearest neighbor classification
- To classify an unknown sample, find the closest sample of known class; the label of that nearest neighbor determines the class of the unknown sample.
- Pros: simple, with reasonable accuracy.
- Cons: computation time and memory requirements grow with the number of prototype points or feature variables.

Discriminative classifiers
- These separate the data into two or more classes, rather than modeling how each class is generated.
- They learn the decision boundary between classes directly.

A sketch of both classifier types follows below.
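A sketch of both classifier types on synthetic per-sequence representations, using scikit-learn; the feature dimension and number of action classes are arbitrary assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(300, 50)), rng.integers(0, 6, 300)
X_test = rng.normal(size=(20, 50))

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)  # 1-NN: label of the closest prototype
svm = SVC(kernel="rbf").fit(X_train, y_train)                    # discriminative decision boundary

print(knn.predict(X_test))
print(svm.predict(X_test))
```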
3.2 Temporal state-space models (1/6)

State-space models consist of states connected by edges.
- These edges model probabilities between states, and between states and observations.
- Model: each state corresponds to a phase of the action performance; each observation is the image representation at a given time.

Dynamic time warping (DTW)
- DTW computes the distance between an input feature vector sequence and reference sequences in a database (originally formulated for pitch vectors in audio matching), using e.g. Euclidean frame distances while aligning the two sequences in time (a sketch follows below).
- It takes longer to compute, but achieves higher recognition rates.
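A straightforward DTW sketch over sequences of feature vectors, using Euclidean frame distances as in the description above:

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of feature
    vectors (shapes Ta x D and Tb x D): frames are aligned in time
    before their Euclidean distances are accumulated."""
    Ta, Tb = len(a), len(b)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            # Best of match, insertion, deletion.
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[Ta, Tb]
```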
3.2 Temporal state-space models (2/6)

Generative models

Hidden Markov models (HMM):
- Build a statistical (dynamic) probability model for each class.
- Particularly suitable for input sequences of variable length.
- The number of states is not known in advance and must be chosen empirically.
- Three components (used in the forward-algorithm sketch below):
  - Observation probabilities: the probability that a given observation is emitted from a particular hidden state.
  - Transition probabilities: the probabilities of moving between hidden states.
  - Initial probabilities: the probability of starting in a particular hidden state.
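A toy forward-algorithm sketch showing how these three components combine into a sequence likelihood; the model values are invented for illustration, and classification picks the class whose HMM assigns the highest likelihood:

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: likelihood of a discrete observation sequence
    under an HMM with initial probabilities pi (N), transition matrix
    A (N x N) and observation matrix B (N x M)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
    return alpha.sum()

# Toy 2-state, 3-symbol model.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(forward_likelihood(pi, A, B, [0, 1, 2, 2]))
```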
3.2 Temporal state-space models (3/6)

Applications of generative models:
- Feng and Perona [15] use a static HMM where key poses correspond to states.
- Weinland et al. [16] construct a codebook by discriminatively selecting templates. In the HMM, they condition the observation on the viewpoint.
- Lv and Nevatia [17] use an Action Net, which is constructed by considering key poses and viewpoints. Transitions between views and poses are encoded explicitly.
- Ahmad and Lee [18] take into account multiple viewpoints and use a multi-dimensional HMM to deal with the different observations.
3.2 Temporal state-space models (4/6)

Generative models
- Instead of modeling the human body as a single observation, one HMM can be used for each body part.
- This makes training easier, because:
  - the combinatorial complexity is reduced to learning a dynamical model for each limb individually;
  - composite movements that are not in the training set can be recognized.
3.2 Temporal state-space models (5/6)

Discriminative models
- Maximize the conditional probability of the labels given the observations over a training set.
- HMMs assume that observations in time are independent, which is often not the case.
- Discriminative models overcome this issue by modeling a conditional distribution over action labels given the observations.
- Discriminative models are suitable for classification of related actions.
- Discriminative graphical models require many training sequences to robustly determine all parameters.
3.2 Temporal state-space models (6/6)

Discriminative models
- Conditional random fields (CRF) are discriminative models that can use multiple overlapping features.
- CRFs combine advantages of finite-state HMMs and of SVM-style techniques, such as handling dependent features and optimizing over the complete sequence (a minimal scoring sketch follows below).
- Variants of CRFs have also been proposed. Shi et al. [19] use a semi-Markov model (SMM), which is suitable for both action segmentation and action recognition.
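A minimal linear-chain CRF scoring sketch, assuming the log-potentials are already given. It computes the conditional probability of a labeling, the quantity a discriminative model optimizes; unlike an HMM, the emission scores may be built from overlapping features of the whole observation sequence:

```python
import numpy as np
from scipy.special import logsumexp

def crf_log_prob(emissions, transitions, labels):
    """Conditional log-probability of a label sequence under a linear-chain
    CRF. `emissions` (T x N) and `transitions` (N x N) are unnormalized
    log-potentials."""
    # Score of the given labeling.
    score = emissions[0, labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1], labels[t]] + emissions[t, labels[t]]
    # Log-partition function via the forward recursion in log space.
    alpha = emissions[0]
    for t in range(1, len(emissions)):
        alpha = logsumexp(alpha[:, None] + transitions, axis=0) + emissions[t]
    return score - logsumexp(alpha)

# Toy usage with random potentials: 4 time steps, 3 labels.
rng = np.random.default_rng(0)
em, tr = rng.normal(size=(4, 3)), rng.normal(size=(3, 3))
print(np.exp(crf_log_prob(em, tr, [0, 2, 1, 1])))  # a probability in (0, 1)
```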
3.3 Action detection

Some works assume motion periodicity, which allows for temporal segmentation by analyzing the self-similarity matrix.
- Seitz and Dyer [20] introduce a periodicity detection algorithm that is able to cope with small variations in the temporal extent of a motion.
- Cutler and Davis [21] perform a frequency transform on the self-similarity matrix of a tracked object. Peaks in the spectrum correspond to the frequency of the motion, and the type of action is determined by analyzing the matrix structure (a sketch of the frequency step follows below).
- Polana and Nelson [22] use Fourier transforms to find the periodicity and temporally segment the video. They match motion features to labeled 2D motion templates.
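A sketch in the spirit of Cutler and Davis [21], on a synthetic oscillating feature track: build a self-similarity matrix, take a frequency transform, and read off the dominant motion frequency. Collapsing the matrix to a single row is a simplification of the structural analysis in [21]:

```python
import numpy as np

def motion_frequency(features, fps):
    """Dominant frequency of a tracked object's motion from the
    self-similarity matrix of its per-frame features (T x D)."""
    diffs = features[:, None, :] - features[None, :, :]
    ssm = np.linalg.norm(diffs, axis=-1)       # T x T self-similarity matrix
    signal = ssm[0] - ssm[0].mean()            # similarity of frame 0 to all frames
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    return freqs[1:][np.argmax(spectrum[1:])]  # skip the DC bin

# Toy usage: a feature track oscillating at 2 Hz, sampled at 25 fps.
t = np.arange(100) / 25.0
feats = np.stack([np.sin(2 * np.pi * 2 * t), np.cos(2 * np.pi * 2 * t)], axis=1)
print(motion_frequency(feats, fps=25))  # ~2.0 Hz
```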
4. Discussion (1/5)

Image representation

Global image representations
- Pros: good results; they can usually be extracted at low cost.
- Cons: limited to scenarios where ROIs can be determined reliably; cannot deal with occlusions.

Local representations
- Take into account spatial and temporal correlations between patches.
- However, the handling of occlusions has largely been ignored.
4. Discussion (2/5)

About viewpoints
- Most of the reported work is restricted to fixed viewpoints.
- Multiple view-dependent action models address this issue, but at the cost of increased training complexity.

About classification
- Temporal variations are often not explicitly modeled, which has proved to be a reasonable approach in many cases. For more complex motions, however, it is questionable whether this approach is suitable.
- Generative state-space models such as HMMs can model temporal variations, but have difficulties distinguishing between related actions. Discriminative graphical approaches are more suitable here.
4. Discussion (3/5)

About action detection
- Many approaches assume that:
  - the video is readily segmented into sequences;
  - each sequence contains one instance of a known set of action labels;
  - the location and approximate scale of the person in the video is known or can easily be estimated.
- Thus, the action detection task is ignored, which limits the applicability to situations where segmentation in space and time is possible.
- It remains a challenge to perform action detection for online applications.
4. Discussion (4/5)

The HOHA dataset [23] targets action recognition in movies, whereas the UCF sports dataset [24] contains sport footage.
- The use of application-specific datasets allows for evaluation metrics that go beyond precision and recall, such as speed of processing or detection accuracy.
- The compilation or recording of datasets that contain sufficient variation in movements, recording settings and environmental settings remains challenging and should continue to be a topic of discussion.
4. Discussion (5/5)

The problem of labeling data
- For increasingly large and complex datasets, manual labeling will become prohibitive.
- A multi-modal approach could improve recognition in some domains, for example in movie analysis. Also, context such as background, camera motion, interaction between persons and person identity provides informative cues [25].
- This would be a big step towards the fulfillment of the longstanding promise to achieve robust automatic recognition and interpretation of human action.
5. References (1/4)

[1] Thomas B. Moeslund, Adrian Hilton, Volker Kruger, A survey of advances in vision-based human motion capture and analysis, Computer Vision and Image Understanding (CVIU) 104 (2–3) (2006) 90–126.
[2] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, Ronen Basri, Actions as space–time shapes, in: Proceedings of the International Conference on Computer Vision (ICCV’05), vol. 2, Beijing, China, October 2005, pp. 1395–1402.
[3] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, Ronen Basri, Actions as space–time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 29 (12) (2007) 2247–2253.
[4] Ivan Laptev, Tony Lindeberg, Space–time interest points, in: Proceedings of the International Conference on Computer Vision (ICCV’03), vol. 1, Nice, France, October 2003, pp. 432–439.
[5] Chris Harris, Mike Stephens, A combined corner and edge detector, in: Proceedings of the Alvey Vision Conference, Manchester, United Kingdom, August 1988, pp. 147–151.
[6] Ivan Laptev, Barbara Caputo, Christian Schuldt, Tony Lindeberg, Local velocity-adapted motion events for spatio-temporal recognition, Computer Vision and Image Understanding (CVIU) 108 (3) (2007) 207–229.
[7] Piotr Dollár, Vincent Rabaud, Garrison Cottrell, Serge Belongie, Behavior recognition via sparse spatio-temporal features, in: Proceedings of the International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS’05), Beijing, China, October 2005, pp. 65–72.
[8] Shu-Fai Wong, Roberto Cipolla, Extracting spatiotemporal interest points using global information, in: Proceedings of the International Conference on Computer Vision (ICCV’07), Rio de Janeiro, Brazil, October 2007, pp. 1–8.
5. References (2/4)

[9] Juan Carlos Niebles, Hongcheng Wang, Li Fei-Fei, Unsupervised learning of human action categories using spatial–temporal words, International Journal of Computer Vision (IJCV) 79 (3) (2008) 299–318.
[10] Christian Schuldt, Ivan Laptev, Barbara Caputo, Recognizing human actions: a local SVM approach, in: Proceedings of the International Conference on Pattern Recognition (ICPR’04), vol. 3, Cambridge, United Kingdom, 2004, pp. 32–36.
[11] Paul Scovanner, Saad Ali, Mubarak Shah, A 3-dimensional SIFT descriptor and its application to action recognition, in: Proceedings of the International Conference on Multimedia (MultiMedia’07), Augsburg, Germany, September 2007, pp. 357–360.
[12] Ju Sun, Xiao Wu, Shuicheng Yan, Loong-Fah Cheong, Tat-Seng Chua, Jintao Li, Hierarchical spatio-temporal context modeling for action recognition, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’09), Miami, FL, June 2009, pp. 1–8.
[13] Paul Smith, Niels da Vitoria Lobo, Mubarak Shah, TemporalBoost for event recognition, in: Proceedings of the International Conference on Computer Vision (ICCV’05), vol. 1, Beijing, China, October 2005, pp. 733–740.
[14] Shiv N. Vitaladevuni, Vili Kellokumpu, Larry S. Davis, Action recognition using ballistic dynamics, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, June 2008, pp. 1–8.
5. References (3/4)

[15] Xiaolin Feng, Pietro Perona, Human action recognition by sequence of movelet codewords, in: Proceedings of the International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT’02), Padova, Italy, June 2002, pp. 717–721.
[16] Daniel Weinland, Edmond Boyer, Remi Ronfard, Action recognition from arbitrary views using 3D exemplars, in: Proceedings of the International Conference on Computer Vision (ICCV’07), Rio de Janeiro, Brazil, October 2007, pp. 1–8.
[17] Fengjun Lv, Ram Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’07), Minneapolis, MN, June 2007, pp. 1–8.
[18] Mohiuddin Ahmad, Seong-Whan Lee, Human action recognition using shape and CLG-motion flow from multi-view image sequences, Pattern Recognition 41 (7) (2008) 2237–2252.
[19] Qinfeng Shi, Li Wang, Li Cheng, Alex Smola, Discriminative human action segmentation and recognition using semi-Markov model, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, June 2008, pp. 1–8.
[20] Steven M. Seitz, Charles R. Dyer, View-invariant analysis of cyclic motion, International Journal of Computer Vision (IJCV) 25 (3) (1997) 231–251.
5. References (4/4)

[21] Ross Cutler, Larry S. Davis, Robust real-time periodic motion detection, analysis, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22 (8) (2000) 781–796.
[22] Ramprasad Polana, Randal C. Nelson, Detection and recognition of periodic, nonrigid motion, International Journal of Computer Vision (IJCV) 23 (3) (1997) 261–282.
[23] Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld, Learning realistic human actions from movies, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, June 2008, pp. 1–8.
[24] Mikel D. Rodriguez, Javed Ahmed, Mubarak Shah, Action MACH: a spatiotemporal maximum average correlation height filter for action recognition, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK, June 2008, pp. 1–8.
[25] Marcin Marszałek, Ivan Laptev, Cordelia Schmid, Actions in context, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’09), Miami, FL, June 2009, pp. 1–8.