National Cheng Kung University
Institute of Manufacturing Information Systems
Human-Computer Interaction:
From Theory to Applications
Final Report (Paper Study)
**********************************************
A survey on vision-based human
action recognition
Image and Vision Computing 28 (2010) 976–990
**********************************************
Instructor: Jenn-Jier Lien, Ph.D.
Student: Husan-Pei Wang (王瑄珮)
Student ID: P96994020
2011/01/14
Table of Contents
1. Introduction
  1.1 Definition of the action hierarchy
  1.2 Definition of the feature domain
  1.3 Introduction to datasets
2. Image Representation
  2.1 Global Representations
    2.1.1 Space–time volumes
  2.2 Local Representations
    2.2.1 Local descriptors
    2.2.2 Correlations between local descriptors
    2.2.3 Application-specific representations
3. Action classification
  3.1 Direct classification
  3.2 Temporal state-space models
  3.3 Action detection
4. Discussion
  4.1 Image representation
  4.2 About viewpoints
  4.3 About classification
  4.4 About action detection
  4.5 The problem of labeling data
5. References
6. Reflections
1. Introduction
This paper argues that video annotation should include human action recognition, and automated visual recognition has gradually been applied in many areas, such as elderly care and the monitoring of unstaffed stores. Recognizing human actions correctly from video, however, still faces the following challenges:
1. Variety of actions: human actions come in countless forms, so it is difficult to provide a complete database from which the type of action can be reliably identified.
2. Recording settings: the same action can look very different depending on the environment and the viewpoint from which it is recorded.
3. Individual differences: every person has their own habitual way of moving, so a single action can be performed in many different ways.
This paper surveys the literature broadly and provides a detailed overview of the current state of the field.
1.1 Definition of the action hierarchy
Human action recognition is divided into several levels; this paper adopts the hierarchy proposed by Moeslund [1], as shown in Figure 1.
Figure 1: Moeslund's action hierarchy (activity, action, action primitive)
1.2 Definition of the feature domain
Feature extraction requires the definition of a feature domain. The relevant factors are: the classes must be correctly distinguishable, the influence of the environment and recording settings, temporal variations, and the acquisition and labeling of training data. Each factor is explained below (Figure 2):

Intra- and inter-class variations
• A good human action recognition approach should be able to generalize over variations within one class and distinguish between actions of different classes.

Environment and recording settings
• The environment in which the action performance takes place is an important source of variation in the recording.
• The same action, observed from different viewpoints, can lead to very different image observations.

Temporal variations
• Actions are assumed to be readily segmented in time.
• The rate at which the action is recorded has an important effect on the temporal extent of an action.

Obtaining and labeling training data
• Using publicly available datasets for training provides a sound mechanism for comparison.
• When no labels are available, an unsupervised approach needs to be pursued, but there is no guarantee that the discovered classes are semantically meaningful.

Figure 2: Factors that influence the feature domain
1.3 Introduction to datasets
Different methods can only be compared when they are tested on the same dataset. Several commonly used datasets are introduced here:
 KTH human motion dataset
 Weizmann human action dataset
 INRIA XMAS multi-view dataset
 UCF sports action dataset
 Hollywood human action dataset
2. Image Representation
This part discusses how features are extracted from image sequences. The paper divides image representations into two classes:
1. Global representations, obtained in a top-down fashion:
  (1) The person is first localized through background subtraction and tracking.
  (2) The resulting region of interest (ROI) is encoded as a whole.
  (3) This produces the image descriptor.
2. Local representations, obtained in a bottom-up fashion:
  (1) Interest points are first detected in the spatio-temporal domain.
  (2) Patches around the interest points are computed.
  (3) These patches are combined into the final representation.
2.1 Global Representations
Global representations encode the region of interest (ROI) of a person as a
whole. The ROI is usually obtained through background subtraction or tracking.
They are sensitive to noise, partial occlusions and variations in viewpoint. To partly overcome these issues, grid-based approaches spatially divide the observation into cells, each of which encodes part of the observation locally.
2.1.1 Space–time volumes
A 3D spatio-temporal volume (STV) is formed by stacking frames over a
given sequence. It requires:
1. Accurate localization
2. Alignment and possibly background subtraction.
Blank et al. [2,3] first stack silhouettes over a given sequence to form an STV (Figure 3).
Figure 3: Space–time volume of stacked silhouettes [2]
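As a minimal sketch of this stacking step (my own illustration, not code from [2,3]), the volume can be built with NumPy; the list `silhouettes` of per-frame binary masks is an assumed input from background subtraction:

```python
import numpy as np

def build_stv(silhouettes):
    """Stack per-frame binary silhouettes (H x W) into a T x H x W volume.

    `silhouettes` is assumed to come from background subtraction; the frames
    must share the same spatial size (i.e., the person is localized/aligned).
    """
    return np.stack(silhouettes, axis=0)

# Toy usage: 10 frames of a 64x48 mask.
frames = [np.zeros((64, 48), dtype=np.uint8) for _ in range(10)]
stv = build_stv(frames)
print(stv.shape)  # (10, 64, 48)
```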
2.2 Local Representations
Local representations describe the observation as a collection of local
descriptors or patches. Accurate localization and background subtraction are not
required and local representations are somewhat invariant to changes in viewpoint,
person appearance and partial occlusions.
Space–time interest points are the locations in space and time where sudden
changes of movement occur in the video.
Laptev and Lindeberg [4] extended the Harris corner detector [5] to 3D.
Space–time interest points are those points where the local neighborhood has a
significant variation in both the spatial and the temporal domain. The work is
extended to compensate for relative camera motions in [6].
 Drawback: the relatively small number of stable interest points.
 Improvement: Dollár et al. [7] apply Gabor filtering on the spatial and temporal dimensions individually. The number of interest points is adjusted by changing the spatial and temporal size of the neighborhood in which local maxima are selected.
Instead of detecting interest points over the entire volume, Wong and Cipolla [8] first detect subspaces of correlated movement. These subspaces correspond to large movements such as an arm wave. Within these subspaces, a sparse set of interest points is detected.
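The survey describes Dollár et al.'s detector [7] only in words. Below is a rough reconstruction of its separable-filter response function (a 2D Gaussian spatial smoothing combined with a temporal quadrature pair of 1D Gabor filters), assuming a grayscale T×H×W video array; the parameter values `sigma` and `tau` and the threshold are my own arbitrary choices, and this is illustrative rather than the authors' implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d, maximum_filter

def cuboid_responses(video, sigma=2.0, tau=1.5):
    """Response volume of a Dollar-style periodic motion detector (a sketch).

    video: T x H x W float array. The response is the sum of squares of the
    video filtered with an even and an odd temporal Gabor filter, after
    spatial Gaussian smoothing.
    """
    t = np.arange(-3 * tau, 3 * tau + 1)
    omega = 4.0 / tau
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)

    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))  # spatial only
    r_ev = convolve1d(smoothed, h_ev, axis=0)                   # temporal
    r_od = convolve1d(smoothed, h_od, axis=0)
    return r_ev**2 + r_od**2

def detect_interest_points(video, size=5, threshold=1e-4):
    """Return (t, y, x) locations where the response is a local maximum."""
    r = cuboid_responses(video)
    peaks = (r == maximum_filter(r, size=size)) & (r > threshold)
    return np.argwhere(peaks)
```

Enlarging `size` (the local-maximum neighborhood) reduces the number of detected points, which matches the adjustment mechanism described above.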
2.2.1 Local descriptors
Local descriptors summarize an image or video patch in a representation that
is ideally invariant to background clutter, appearance and occlusions, and
possibly to rotation and scale. The spatial and temporal size of a patch is usually determined by the scale of the interest point (Figure 4).
Figure 4: Extraction of space–time cuboids at interest points from similar actions performed by different persons [6]
 Challenge:
The number of descriptors varies between observations, and the descriptors are usually high-dimensional, which makes sets of local descriptors hard to compare.
 Overcome (the bag-of-words pipeline, sketched below):
 A codebook is generated by clustering patches.
 Either cluster centers or the closest patches are selected as code words.
 A local descriptor is described as a codeword contribution.
 A frame or sequence can then be represented as a bag-of-words: a histogram of codeword frequencies.
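To make this pipeline concrete, here is a minimal sketch assuming scikit-learn is available (`k=200` is an arbitrary codebook size): cluster the descriptors, use the cluster centers as code words, then represent each sequence as a normalized histogram of codeword frequencies:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=200):
    """Cluster local descriptors (N x D) into k code words (cluster centers)."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

def bag_of_words(codebook, descriptors):
    """Histogram of codeword frequencies for one sequence's descriptors."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # normalize so sequences are comparable
```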
2.2.2 Correlations between local descriptors
This section describes approaches that exploit correlations between local descriptors, either for selection or for the construction of higher-level descriptors. Scovanner et al. [11] construct a word co-occurrence matrix and iteratively merge words with similar co-occurrences until the difference between all pairs of words is above a specified threshold. This leads to a reduced codebook size, and similar actions are likely to generate more similar distributions of code words. Correlations between descriptors can also be obtained by tracking features. Sun et al. [12] calculate SIFT (scale-invariant feature transform) descriptors around interest points in each frame and use Markov chaining to determine tracks of these features.
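As an illustration of the first step of such an approach (a simplified reconstruction, not Scovanner et al.'s exact procedure), a word co-occurrence matrix can be accumulated over per-video codeword lists:

```python
import numpy as np

def cooccurrence_matrix(sequences, vocab_size):
    """Count how often pairs of code words occur in the same sequence.

    sequences: list of 1D arrays of codeword indices (one array per video).
    """
    C = np.zeros((vocab_size, vocab_size))
    for words in sequences:
        present = np.unique(words)
        for i in present:
            for j in present:
                if i != j:
                    C[i, j] += 1  # words i and j co-occur in this video
    return C
```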
2.2.3 Application-specific representations
This section discusses works that use representations directly motivated by the domain of human action recognition.
 Smith et al. [13] use a number of specifically selected features:
 low-level features deal with color and movement;
 higher-level features are obtained from head and hand regions;
 a boosting scheme takes into account the history of the action performance.
 Vitaladevuni et al. [14] are inspired by the observation that human actions differ in accelerating and decelerating force:
 they identify reach, yank and throw types;
 temporal segmentation into atomic movements, described by movement type, spatial location and direction of movement, is performed first.
3. Action classification
When an image representation is available for an observed frame or sequence,
human action recognition becomes a classification problem. An action label or
distribution over labels is given for each frame or sequence.
 Direct classification
 Temporal state-space models
 Action detection
3.1 Direct classification
These approaches do not pay special attention to the temporal domain. They summarize all frames of an observed sequence into a single representation, or perform action recognition for each frame individually.
 Dimensionality reduction (see the sketch below)
 Dimensionality reduction methods analyze the data to find an embedding that maps it from the original high-dimensional space into a lower-dimensional space.
 This reduces computational complexity,
 yields a representation that better captures the essential structure of the data,
 and makes high-dimensional data easier to visualize.
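A minimal sketch of such an embedding, using PCA from scikit-learn as one common choice (the data `X` here is a random placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA

# X: one row per frame/sequence representation, in a high-dimensional space.
X = np.random.rand(100, 512)        # placeholder data
pca = PCA(n_components=20).fit(X)   # learn a linear embedding
X_low = pca.transform(X)            # map to the low-dimensional space
print(X_low.shape)                  # (100, 20)
```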
 Nearest neighbor classification (see the sketch below)
 To classify an unknown sample, one simply finds the closest labeled sample and assigns its class to the unknown sample.
 Advantages: simple, with reasonable accuracy.
 Disadvantages: computation time and memory requirements grow with the number of prototype data points and feature variables.
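A minimal nearest-neighbor sketch over sequence representations (all data here are hypothetical placeholders); note that the distance is computed against every prototype, which is exactly the cost issue mentioned above:

```python
import numpy as np

def nn_classify(query, prototypes, labels):
    """Assign the label of the nearest prototype (Euclidean distance)."""
    d = np.linalg.norm(prototypes - query, axis=1)
    return labels[np.argmin(d)]

# prototypes: known representations (e.g., bag-of-words histograms), one per row.
prototypes = np.random.rand(50, 200)
labels = np.random.randint(0, 6, size=50)  # 6 hypothetical action classes
print(nn_classify(np.random.rand(200), prototypes, labels))
```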
 Discriminative classifiers (see the sketch below)
 These classifiers focus on separating the data into two or more classes, rather than modeling each class itself.
 In the end this can yield one very large classification problem in which each individual class is small.
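As one concrete example of a discriminative classifier (my own illustration with placeholder data, assuming scikit-learn), a linear SVM learns decision boundaries between classes without building a model of each class:

```python
import numpy as np
from sklearn.svm import LinearSVC

X = np.random.rand(120, 200)           # sequence representations (placeholder)
y = np.random.randint(0, 6, size=120)  # hypothetical action labels
clf = LinearSVC().fit(X, y)            # learns class boundaries only
print(clf.predict(X[:3]))
```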
3.2 Temporal state-space models
State-space models consist of states connected by edges. These edges model probabilities between states, and between states and observations.
 Model:
 State: an action performance (one state corresponds to one action performance).
 Observation: the image representation at a given time.
 Dynamic time warping (DTW), see the sketch below
 DTW computes the (Euclidean) distance between an input feature sequence and reference sequences in a database (for example, between pitch vectors in melody matching), while allowing the sequences to be aligned in time.
 It takes longer to compute, but achieves higher recognition rates.
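A minimal sketch of the classic DTW dynamic program over two feature sequences of possibly different lengths (my own illustration, not from any cited work):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a: m x d array, b: n x d array. Classic O(m*n) dynamic program that
    allows sequences of different lengths or rates to be aligned.
    """
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # Euclidean step cost
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[m, n]
```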
 Generative models
 Hidden Markov models (HMM):
 model each class statistically with a (dynamic) probabilistic model;
 are particularly suitable for input sequences of variable length;
 the number of states is not known in advance and must be assumed from experience.
 Three components (see the sketch below):
 Observation probabilities: the probability that a given observation was generated by a particular hidden state.
 Transition probabilities: the probabilities of transitions between hidden states.
 Initial probabilities: the probability of starting in a particular hidden state.
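To make the three components concrete, here is a minimal sketch of the forward algorithm for a discrete-observation HMM, which computes the likelihood of an observation sequence given a model; classification then picks the action model with the highest likelihood (illustrative code, not from any cited work):

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """P(observations | model) via the forward algorithm.

    pi : initial state probabilities, shape (S,)
    A  : transition probabilities, A[i, j] = P(state j | state i), shape (S, S)
    B  : observation probabilities, B[i, o] = P(symbol o | state i), shape (S, O)
    obs: sequence of discrete observation symbols (e.g., codeword indices)
    """
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate and weight by observation
    return alpha.sum()

# Classification: pick the action model with the highest likelihood, e.g.
# label = max(models, key=lambda m: forward_likelihood(m.pi, m.A, m.B, obs))
```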
Applications of generative models: Feng and Perona [15] use a static HMM where key poses correspond to states. Weinland et al. [16] construct a codebook by discriminatively selecting templates; in the HMM, they condition the observation on the viewpoint. Lv and Nevatia [17] use an Action Net, which is constructed by considering key poses and viewpoints; transitions between views and poses are encoded explicitly. Ahmad and Lee [18] take multiple viewpoints into account and use a multi-dimensional HMM to deal with the different observations.
Instead of modeling the human body as a single observation, one HMM can be used for each body part. This makes training easier, because the combinatorial complexity is reduced to learning a dynamical model for each limb individually, and composite movements that are not in the training set can still be recognized.
 Discriminative models: maximize the (conditional) output over a training set.
HMMs assume that observations in time are independent, which is often not the case. Discriminative models overcome this issue by modeling a conditional distribution over action labels given the observations, which makes them suitable for classifying related actions. However, discriminative graphical models require many training sequences to robustly determine all parameters. Conditional random fields (CRF) are discriminative models that can use multiple overlapping features, and several variants of CRFs have been proposed. Shi et al. [19] use a semi-Markov model (SMM), which is suitable for both action segmentation and action recognition.
3.3 Action detection
Some works assume motion periodicity, which allows for temporal segmentation by analyzing the self-similarity matrix. Seitz and Dyer [20] introduce a periodicity detection algorithm that is able to cope with small variations in the temporal extent of a motion. Cutler and Davis [21] perform a frequency transform on the self-similarity matrix of a tracked object; peaks in the spectrum correspond to the frequency of the motion, and the type of action is determined by analyzing the matrix structure. Polana and Nelson [22] use Fourier transforms to find the periodicity and temporally segment the video. They match motion features to labeled 2D motion templates.
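A rough sketch in the spirit of the self-similarity approach (a simplified reconstruction, not Cutler and Davis's method [21]): build the self-similarity matrix of per-frame features of a tracked object, then locate the dominant peak in its frequency spectrum:

```python
import numpy as np

def self_similarity(features):
    """Self-similarity matrix: entry (i, j) is the distance between frames i, j.

    features: T x d array of per-frame features of a tracked object.
    """
    diff = features[:, None, :] - features[None, :, :]
    return np.linalg.norm(diff, axis=2)

def dominant_frequency(ssm, fps):
    """Estimate the motion frequency from a row-averaged spectrum of the SSM."""
    signal = ssm.mean(axis=0) - ssm.mean()     # zero-mean 1D similarity profile
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    return freqs[1:][np.argmax(spectrum[1:])]  # skip the DC component
```

A strong peak indicates periodic motion (e.g., walking), and its frequency gives the period used for temporal segmentation.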
4. Discussion
4.1 Image representation
 Global image representations
 Advantages:
 Good results.
 They can usually be extracted at low cost.
 Disadvantages:
 Limited to scenarios where ROIs can be determined reliably.
 Cannot deal with occlusions.
 Local representations
 Take into account spatial and temporal correlations between patches.
 Occlusions have largely been ignored.
4.2 About viewpoints
 Most of the reported work is restricted to fixed viewpoints.
 Multiple view-dependent action models address this issue,
 but at the cost of increased training complexity.
4.3 About classification
 Temporal variations are often not explicitly modeled, which has proved to be a reasonable approach in many cases.
 But for more complex motions, it is questionable whether this approach is suitable.
 Generative state-space models such as HMMs can model temporal variations,
 but have difficulties distinguishing between related actions.
 Discriminative graphical approaches are more suitable for this.
4.4 About action detection
 Many approaches assume that:
 the video is readily segmented into sequences,
 each containing one instance of a known set of action labels;
 the location and approximate scale of the person in the video is known or can easily be estimated.
 Thus, the action detection task is ignored, which limits the applicability to situations where segmentation in space and time is possible.
 It remains a challenge to perform action detection for online applications.
The HOHA dataset [23] targets action recognition in movies, whereas the UCF sports dataset [24] contains sport footage. The use of application-specific datasets allows for evaluation metrics that go beyond precision and recall, such as processing speed or detection accuracy. The compilation or recording of datasets that contain sufficient variation in movements, recording settings and environmental settings remains challenging and should continue to be a topic of discussion.
4.5 The problem of labeling data
 For increasingly large and complex datasets, manual labeling will become prohibitive.
 A multi-modal approach could improve recognition in some domains,
 for example in movie analysis. Also, context such as background, camera motion, interaction between persons and person identity provides informative cues [25].
This would be a big step towards fulfilling the longstanding promise of robust automatic recognition and interpretation of human action.
5. References
[1] Thomas B. Moeslund, Adrian Hilton, Volker Kruger, A survey of advances in
vision-based human motion capture and analysis, Computer Vision and Image
Understanding (CVIU) 104 (2–3) (2006) 90–126.
[2] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, Ronen Basri, Actions as
space–time shapes, in: Proceedings of the International Conference On Computer
Vision (ICCV’05), vol. 2, Beijing, China, October 2005, pp. 1395–1402.
[3] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, Ronen Basri, Actions as
space–time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence
(PAMI) 29 (12) (2007) 2247–2253.
[4] Ivan Laptev, Tony Lindeberg, Space–time interest points, in: Proceedings of the
International Conference on Computer Vision (ICCV’03), vol. 1, Nice, France, October
2003, pp. 432–439.
[5] Chris Harris, Mike Stephens, A combined corner and edge detector, in:
Proceedings of the Alvey Vision Conference, Manchester, United Kingdom, August
1988, pp. 147–151.
[6] Ivan Laptev, Barbara Caputo, Christian Schuldt, Tony Lindeberg, Local
velocity-adapted motion events for spatio-temporal recognition, Computer Vision
and Image Understanding (CVIU) 108 (3) (2007) 207–229.
[7] Piotr Dollar, Vincent Rabaud, Garrison Cottrell, Serge Belongie, Behavior
recognition via sparse spatio-temporal features, in: Proceedings of the International
Workshop on Visual Surveillance and Performance Evaluation of Tracking and
Surveillance (VS-PETS’05), Beijing, China, October 2005, pp. 65–72.
[8] Shu-Fai Wong, Roberto Cipolla, Extracting spatiotemporal interest points using
global information, in: Proceedings of the International Conference On Computer
Vision (ICCV’07), Rio de Janeiro, Brazil, October 2007, pp. 1–8.
[9] Juan Carlos Niebles, Hongcheng Wang, Li Fei-Fei, Unsupervised learning of human
action categories using spatial–temporal words, International Journal of Computer
Vision (IJCV) 79 (3) (2008) 299–318.
[10] Christian Schuldt, Ivan Laptev, Barbara Caputo, Recognizing human actions: a
local SVM approach, in: Proceedings of the International Conference on Pattern
Recognition (ICPR’04), vol. 3, Cambridge, United Kingdom, 2004, pp. 32–36.
[11] Paul Scovanner, Saad Ali, Mubarak Shah, A 3-dimensional SIFT descriptor and its
application to action recognition, in: Proceedings of the International Conference on
Multimedia (MultiMedia’07), Augsburg, Germany, September 2007, pp. 357–360.
[12] Ju Sun, Xiao Wu, Shuicheng Yan, Loong-Fah Cheong, Tat-Seng Chua, Jintao Li,
Hierarchical spatio-temporal context modeling for action recognition, in: Proceedings
of the Conference on Computer Vision and Pattern Recognition (CVPR’09), Miami, FL,
June 2009, pp. 1–8.
[13] Paul Smith, Niels da Vitoria Lobo, Mubarak Shah, TemporalBoost for event
recognition, in: Proceedings of the International Conference On Computer Vision
(ICCV’05), vol. 1, Beijing, China, October 2005, pp. 733–740.
[14] Shiv N. Vitaladevuni, Vili Kellokumpu, Larry S. Davis, Action recognition using
ballistic dynamics, in: Proceedings of the Conference on Computer Vision and Pattern
Recognition (CVPR’08), Anchorage, AK, June 2008, pp. 1–8.
[15] Xiaolin Feng, Pietro Perona, Human action recognition by sequence of movelet
codewords, in: Proceedings of the International Symposium on 3D Data Processing,
Visualization, and Transmission (3DPVT’02), Padova, Italy, June 2002, pp. 717–721.
[16] Daniel Weinland, Edmond Boyer, Remi Ronfard, Action recognition from
arbitrary views using 3D exemplars, in: Proceedings of the International Conference
On Computer Vision (ICCV’07), Rio de Janeiro, Brazil, October 2007, pp. 1–8.
[17] Fengjun Lv, Ram Nevatia, Single view human action recognition using key pose
matching and Viterbi path searching, in: Proceedings of the Conference on Computer
Vision and Pattern Recognition (CVPR’07), Minneapolis, MN, June 2007, pp. 1–8.
[18] Mohiuddin Ahmad, Seong-Whan Lee, Human action recognition using shape and
CLG-motion flow from multi-view image sequences, Pattern Recognition 41 (7) (2008)
2237–2252.
[19] Qinfeng Shi, Li Wang, Li Cheng, Alex Smola, Discriminative human action
segmentation and recognition using semi-Markov model, in: Proceedings of the
Conference on Computer Vision and Pattern Recognition (CVPR’08), Anchorage, AK,
June 2008, pp. 1–8.
[20] Steven M. Seitz, Charles R. Dyer, View-invariant analysis of cyclic motion,
International Journal of Computer Vision (IJCV) 25 (3) (1997) 231–251.
6. Reflections
This paper surveys and discusses the existing methods for action recognition. Although it covers many approaches, most are only mentioned briefly without detailed explanations, which differs somewhat from what I originally expected. I think I will still need to spend time tracing the original papers in the references in order to fully understand the methods. Many of the methods mentioned were also taught in class. I thank the instructor for the careful guidance in this course, from which I have benefited greatly.