[Human-Computer Interaction: From Theory to Applications] Final Report (Paper Study)
A survey on vision-based human action recognition
Image and Vision Computing 28 (2010) 976–990
Student: Husan-Pei Wang (王瑄珮)
Student ID: P96994020
Instructor: Jenn-Jier Lien, Ph.D.

Outline
1. Introduction
  1.1 Challenges and characteristics of the domain
  1.2 Common datasets
2. Image Representation
  2.1 Global Representations
    2.1.1 Space–time volumes
  2.2 Local Representations
    2.2.1 Local descriptors
    2.2.2 Correlations between local descriptors
  2.3 Application-specific representations
3. Action classification
  3.1 Direct classification
  3.2 Temporal state-space models
  3.3 Action detection
4. Discussion
5. References

1. Introduction (1/2)
This paper considers the task of labeling videos containing human motion with action classes. The task is challenging because of:
• Variations in motion performance
• Recording settings
• Inter-personal differences
The paper provides a detailed overview of current advances in the field that address these challenges.

1. Introduction (2/2)
The recognition of movement can be performed at various levels of abstraction. The paper adopts the hierarchy used by Moeslund et al. [1]: activity, action, and action primitive.
• Action primitive: a single atomic movement, e.g., moving the left leg forward.
• Action: a sustained single movement, e.g., running.
• Activity: composed of several actions, e.g., a hurdle race consists of running and jumping.

1.1 Challenges and characteristics of the domain
Intra- and inter-class variations
• A good human action recognition approach should generalize over variations within one class and distinguish between actions of different classes.
Environment and recording settings
• The environment in which the action takes place is an important source of variation in the recording.
• The same action, observed from different viewpoints, can lead to very different image observations.
Temporal variations
• Actions are often assumed to be readily segmented in time.
• The rate at which the action is recorded has an important effect on the temporal extent of an action.
Obtaining and labeling training data
• Using publicly available datasets for training provides a sound mechanism for comparison.
• When no labels are available, an unsupervised approach must be pursued, but there is no guarantee that the discovered classes are semantically meaningful.

1.2 Common datasets
Widely used sets (the original comparison table did not survive extraction; the recoverable entries are listed per dataset):
• KTH human motion dataset: 6 actions, 25 actors, 4 scenarios; static backgrounds, static illumination and a relatively static camera.
• Weizmann human action dataset: 10 actions performed by 9 actors; static camera and background; foreground silhouettes are included.
• INRIA XMAS multi-view dataset: 14 actions, 11 actors, 5 fixed viewpoints; silhouettes and volumetric voxel data are included.
• UCF sports action dataset: 150 sequences of sport motions; huge variety in human appearance, camera movement, viewpoint, illumination and background.
• Hollywood human action dataset: 8 actions; no limit on actors; considerable variation in action performance, occlusions, camera movements and dynamic backgrounds.

2. Image Representation
This section discusses the features that are extracted from the image sequences. The paper divides image representations into two categories:
Global representations (obtained in a top-down fashion):
• The person is first localized through background subtraction or tracking; the resulting region of interest (ROI) is then encoded as a whole, producing the image descriptor.
Local representations (obtained in a bottom-up fashion):
• Spatio-temporal interest points are detected first; patches are then computed around these interest points and combined into the final representation.

2.1 Global Representations
Global representations encode the region of interest (ROI) of a person as a whole. The ROI is usually obtained through background subtraction or tracking (a code sketch follows below).
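To make this pipeline concrete, below is a minimal sketch of obtaining a per-frame ROI through background subtraction. It is a generic illustration rather than the method of any cited paper; it assumes OpenCV 4.x, and the video path, area threshold and kernel size are placeholders.

```python
import cv2

# Minimal sketch: extract a person ROI per frame via background subtraction.
# "video.avi" and the thresholds below are placeholders.
cap = cv2.VideoCapture("video.avi")
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=25)

rois = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                    # foreground mask
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # suppress speckle noise
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        person = max(contours, key=cv2.contourArea)   # assume largest blob is the person
        if cv2.contourArea(person) > 500:             # ignore tiny regions
            x, y, w, h = cv2.boundingRect(person)
            rois.append((x, y, w, h))                 # ROI to be encoded as a whole
cap.release()
```

A grid-based variant would additionally divide each ROI into spatial cells and encode every cell separately, as described next.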
Global representations are sensitive to noise, partial occlusions and variations in viewpoint. To partly overcome these issues, grid-based approaches spatially divide the observation into cells, each of which encodes part of the observation locally.

2.1.1 Space–time volumes
A 3D spatio-temporal volume (STV) is formed by stacking frames over a given sequence.
Requirements: accurate localization, alignment and possibly background subtraction.
Blank et al. [2,3] stack silhouettes over a given sequence to form an STV.

2.2 Local Representations (1/2)
Local representations describe the observation as a collection of local descriptors or patches. They are somewhat invariant to changes in viewpoint, person appearance and partial occlusions.
Space–time interest points are the locations in space and time where sudden changes of movement occur in the video.
Laptev and Lindeberg [4] extended the Harris corner detector [5] to 3D: space–time interest points are those points whose local neighborhood has a significant variation in both the spatial and the temporal domain. The work is extended to compensate for relative camera motions in [6].
Drawback: the relatively small number of stable interest points.

2.2 Local Representations (2/2)
To improve on this, Dollár et al. [7] apply Gabor filtering on the spatial and temporal dimensions individually (see the first sketch after Section 2.2.2). The number of interest points is adjusted by changing the spatial and temporal size of the neighborhood in which local maxima are selected.
Instead of detecting interest points over the entire volume, Wong and Cipolla [8] first detect subspaces of correlated movement. These subspaces correspond to large movements such as an arm wave. Within these subspaces, a sparse set of interest points is detected.

2.2.1 Local descriptors (1/2)
Local descriptors summarize an image or video patch in a representation that is ideally invariant to background clutter, appearance and occlusions, and possibly to rotation and scale. The spatial and temporal size of a patch is usually determined by the scale of the interest point.
(Figure: extraction of space–time cuboids at interest points from similar actions performed by different persons [6].)

2.2.1 Local descriptors (2/2)
Challenge: the varying number and the usually high dimensionality of the descriptors make it hard to compare sets of local descriptors.
To overcome this, a codebook is generated by clustering patches and selecting either the cluster centers or the closest patches as codewords. A local descriptor is then described by its contribution to the codewords. A frame or sequence can be represented as a bag-of-words: a histogram of codeword frequencies (see the second sketch after Section 2.2.2).

2.2.2 Correlations between local descriptors
This section describes approaches that exploit correlations between local descriptors, either for selection or for the construction of higher-level descriptors.
Scovanner et al. [11] construct a word co-occurrence matrix and iteratively merge words with similar co-occurrences until the difference between all pairs of words is above a specified threshold. This leads to a reduced codebook size, and similar actions are likely to generate more similar distributions of codewords.
Correlations between descriptors can also be obtained by tracking features. Sun et al. [12] calculate SIFT (scale-invariant feature transform) descriptors around interest points in each frame and use Markov chaining to determine tracks of these features.
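Returning to the detector of Section 2.2: the following is a rough sketch of the separable-filter response behind Dollár et al. [7], written from the general description in the survey rather than from the authors' code. The scales sigma and tau and the peak threshold are placeholder choices; NumPy and SciPy are assumed.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def cuboid_response(video, sigma=2.0, tau=1.5):
    """Response in the style of Dollar et al. [7]: spatial Gaussian smoothing
    combined with a temporal quadrature pair of 1D Gabor filters.
    `video` is a (T, H, W) float array; sigma and tau are placeholder scales."""
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))  # spatial only
    t = np.arange(-2 * int(tau) - 1, 2 * int(tau) + 2)
    omega = 4.0 / tau
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    even = np.apply_along_axis(np.convolve, 0, smoothed, h_ev, mode="same")
    odd = np.apply_along_axis(np.convolve, 0, smoothed, h_od, mode="same")
    return even**2 + odd**2

def interest_points(response, size=5):
    """Interest points are local maxima of the response volume."""
    peaks = (response == maximum_filter(response, size=size))
    peaks &= response > response.mean() + 3 * response.std()  # crude threshold
    return np.argwhere(peaks)  # (t, y, x) coordinates
```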
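And a minimal sketch of the codebook and bag-of-words construction from Section 2.2.1, assuming scikit-learn for the clustering. The descriptor dimensionality (100) and codebook size (200) are arbitrary placeholders, and the random data stands in for real local descriptors.

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster training descriptors into a codebook, then represent each sequence
# as a normalized histogram of codeword frequencies (the bag-of-words).
rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 100))   # stand-in for real descriptors

codebook = KMeans(n_clusters=200, n_init=10, random_state=0).fit(train_descriptors)

def bag_of_words(descriptors, codebook, k=200):
    """Assign each descriptor to its nearest codeword and histogram the counts."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()                       # normalized codeword frequencies

sequence_descriptors = rng.normal(size=(120, 100)) # descriptors from one sequence
representation = bag_of_words(sequence_descriptors, codebook)
```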
2.3 Application-specific representations
This section discusses works that use representations directly motivated by the domain of human action recognition.
Smith et al. [13] use a number of specifically selected features:
• Low-level features deal with color and movement.
• Higher-level features are obtained from head and hand regions.
• A boosting scheme takes into account the history of the action performance.
Vitaladevuni et al. [14] are inspired by the observation that human actions differ in accelerating and decelerating force. They identify reach, yank and throw movement types. A temporal segmentation into atomic movements, each described by movement type, spatial location and direction of movement, is performed first.

3. Action classification
When an image representation is available for an observed frame or sequence, human action recognition becomes a classification problem. Three families of approaches are discussed:
• Direct classification
• Temporal state-space models
• Action detection

3.1 Direct classification (1/2)
Direct classification pays no special attention to the temporal domain: all frames of an observed sequence are summarized into a single representation, or action recognition is performed for each frame individually.
Dimensionality reduction: the data are analyzed to find an embedding that maps them from the original high-dimensional space into a lower-dimensional one. This
• lowers the computational complexity,
• yields a more essential representation of the data, and
• makes high-dimensional data easier to visualize.

3.1 Direct classification (2/2)
Nearest neighbor classification: to label an unknown sample, find the nearest already-labeled samples and decide the class from their labels (a combined sketch with dimensionality reduction follows Section 3.2).
• Advantages: simple, with reasonable accuracy.
• Disadvantages: computation time and memory requirements grow with the number of prototype points and feature variables.
Discriminative classifiers: these divide the data into two or more classes rather than modeling each class; the focus is on the differences between classes rather than on the classes themselves.

3.2 Temporal state-space models (1/6)
State-space models consist of states connected by edges. These edges model probabilities between states, and between states and observations.
Model:
• State: an action performance (one state per action performance).
• Observation: the image representation at a given time.
Dynamic time warping (DTW) computes the (e.g., Euclidean) distance between an input feature sequence and stored reference sequences while allowing for temporal alignment. It is comparatively slow but achieves high recognition rates (a sketch follows Section 3.2).

3.2 Temporal state-space models (2/6)
Generative models
Hidden Markov models (HMM):
• Build a (dynamic) probabilistic model of each class in a statistical way.
• Especially suited for input sequences of variable length.
• The number of states is not known in advance and has to be assumed from experience.
Three components (a sketch follows Section 3.2):
• Observation probabilities: the probability that an observation was generated from a given hidden state.
• Transition probabilities: the probabilities of moving between hidden states.
• Initial probabilities: the probability of starting in a given hidden state.

3.2 Temporal state-space models (3/6)
Generative models: applications
• Feng and Perona [15] use a static HMM where key poses correspond to states.
• Weinland et al. [16] construct a codebook by discriminatively selecting templates. In the HMM, they condition the observation on the viewpoint.
• Lv and Nevatia [17] use an Action Net, constructed by considering key poses and viewpoints. Transitions between views and poses are encoded explicitly.
• Ahmad and Lee [18] take into account multiple viewpoints and use a multi-dimensional HMM to deal with the different observations.

3.2 Temporal state-space models (4/6)
Generative models
Instead of modeling the human body as a single observation, one HMM can be used for each body part. This makes training easier because:
• the combinatorial complexity is reduced to learning dynamical models for each limb individually, and
• composite movements that are not in the training set can still be recognized.

3.2 Temporal state-space models (5/6)
Discriminative models are trained to maximize the output probability over a training set.
HMMs assume that observations in time are independent, which is often not the case. Discriminative models overcome this issue by modeling a conditional distribution over action labels given the observations. They are suitable for classifying related actions, but discriminative graphical models require many training sequences to robustly determine all parameters.

3.2 Temporal state-space models (6/6)
Discriminative models
Conditional random fields (CRF) are discriminative models that can use multiple overlapping features. A CRF combines advantages of finite-state HMMs and SVM-style techniques, such as handling dependent features and taking the complete sequence into account.
Variants of CRFs have also been proposed. Shi et al. [19] use a semi-Markov model (SMM), which is suitable for both action segmentation and action recognition.
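The following three sketches illustrate the classifiers of Sections 3.1 and 3.2 on toy data. First, direct classification (Section 3.1): dimensionality reduction followed by nearest neighbor classification, here with scikit-learn's PCA and k-NN as stand-ins for the various concrete methods the survey covers; all shapes and parameters are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Direct classification: reduce the per-sequence representation, then label a
# query by its nearest neighbors. The data here is synthetic.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 200))      # e.g. bag-of-words histograms
y_train = rng.integers(0, 6, size=300)     # six action classes, as in KTH

pca = PCA(n_components=20).fit(X_train)    # dimensionality reduction step
knn = KNeighborsClassifier(n_neighbors=5).fit(pca.transform(X_train), y_train)

X_test = rng.normal(size=(10, 200))
predicted_actions = knn.predict(pca.transform(X_test))
```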
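Next, a textbook dynamic time warping distance (Section 3.2). The Euclidean frame distance is an assumption, and practical systems would typically add constraints such as a warping window.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping: align two sequences of feature vectors that may
    differ in temporal extent and return the total matching cost.
    `a` and `b` are (T, D) arrays."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # frame-to-frame distance
            # Best of insertion, deletion, or match.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A query sequence can then be labeled with the class of the reference
# sequence that has the smallest DTW distance to it.
```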
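Finally, a toy forward algorithm for a discrete HMM (Section 3.2), showing how the three components (initial, transition and observation probabilities) combine into a sequence likelihood. All numbers are invented for illustration; in recognition, each action class gets its own HMM and the model with the highest likelihood wins.

```python
import numpy as np

def forward(pi, A, B, observations):
    """Likelihood of an observation sequence under one action's HMM.
    pi: (S,) initial probabilities; A: (S, S) transition probabilities;
    B: (S, V) observation probabilities; observations: sequence of symbol ids
    (e.g. codeword indices from the bag-of-words representation)."""
    alpha = pi * B[:, observations[0]]
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by the emission
    return alpha.sum()

# Toy example: 2 hidden states (key poses), 3 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
likelihood = forward(pi, A, B, [0, 1, 2, 2])
```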
3.3 Action detection
Some works assume motion periodicity, which allows for temporal segmentation by analyzing the self-similarity matrix.
• Seitz and Dyer [20] introduce a periodicity detection algorithm that can cope with small variations in the temporal extent of a motion.
• Cutler and Davis [21] perform a frequency transform on the self-similarity matrix of a tracked object. Peaks in the spectrum correspond to the frequency of the motion, and the type of action is determined by analyzing the matrix structure (a sketch of this idea appears at the end of Section 4).
• Polana and Nelson [22] use Fourier transforms to find the periodicity and temporally segment the video. They match motion features to labeled 2D motion templates.

4. Discussion (1/5)
Image representation
Global image representations:
• Advantages: good results, and they can usually be extracted at low cost.
• Disadvantages: limited to scenarios where ROIs can be determined reliably; cannot deal with occlusions.
Local representations:
• Take into account spatial and temporal correlations between patches.
• However, occlusions have largely been ignored.

4. Discussion (2/5)
Viewpoints
• Most of the reported work is restricted to fixed viewpoints.
• Multiple view-dependent action models solve this issue, but at the price of increased training complexity.
Classification
• Not modeling temporal variations explicitly proved to be a reasonable approach in many cases; for more complex motions, however, it is questionable whether this remains suitable.
• Generative state-space models such as HMMs can model temporal variations, but have difficulties distinguishing between related actions. Discriminative graphical approaches are more suitable here.

4. Discussion (3/5)
Action detection
Many approaches assume that:
• the video is readily segmented into sequences, each containing one instance of a known set of action labels, and
• the location and approximate scale of the person in the video is known or can easily be estimated.
Thus the action detection task is ignored, which limits the applicability to situations where segmentation in space and time is possible. It remains a challenge to perform action detection for online applications.

4. Discussion (4/5)
The HOHA dataset [23] targets action recognition in movies, whereas the UCF sports dataset [24] contains sport footage. The use of application-specific datasets allows for evaluation metrics that go beyond precision and recall, such as speed of processing or detection accuracy.
The compilation or recording of datasets that contain sufficient variation in movements, recording settings and environmental settings remains challenging and should continue to be a topic of discussion.

4. Discussion (5/5)
The problem of labeling data: for increasingly large and complex datasets, manual labeling will become prohibitive.
A multi-modal approach could improve recognition in some domains, for example in movie analysis. Also, context such as background, camera motion, interaction between persons and person identity provides informative cues [25].
This would be a big step towards fulfilling the longstanding promise of robust automatic recognition and interpretation of human action.
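As a closing illustration of the action detection discussion, here is a toy sketch loosely in the spirit of the self-similarity analysis of Cutler and Davis [21] (Section 3.3): build a self-similarity matrix over per-frame features of a tracked object and read the dominant motion frequency from its spectrum. The feature choice, the 1D reduction of the matrix and the frame rate are simplifying assumptions, not the authors' actual procedure.

```python
import numpy as np

def motion_frequency(features, fps=25.0):
    """Estimate the dominant motion frequency from per-frame features.
    `features` is a (T, D) array; fps is a placeholder frame rate."""
    diffs = features[:, None, :] - features[None, :, :]
    S = np.linalg.norm(diffs, axis=2)           # self-similarity (distance) matrix
    signal = S.mean(axis=0) - S.mean()          # 1D similarity profile, zero-mean
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fps)
    peak = np.argmax(spectrum[1:]) + 1          # skip the DC component
    return freqs[peak]                          # dominant motion frequency in Hz
```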
5. References (1/4)
[1] Thomas B. Moeslund, Adrian Hilton, Volker Kruger, A survey of advances in vision-based human motion capture and analysis, Computer Vision and Image Understanding (CVIU) 104 (2–3) (2006) 90–126.
[2] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, Ronen Basri, Actions as space–time shapes, in: Proceedings of the International Conference on Computer Vision (ICCV'05), vol. 2, Beijing, China, October 2005, pp. 1395–1402.
[3] Lena Gorelick, Moshe Blank, Eli Shechtman, Michal Irani, Ronen Basri, Actions as space–time shapes, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 29 (12) (2007) 2247–2253.
[4] Ivan Laptev, Tony Lindeberg, Space–time interest points, in: Proceedings of the International Conference on Computer Vision (ICCV'03), vol. 1, Nice, France, October 2003, pp. 432–439.
[5] Chris Harris, Mike Stephens, A combined corner and edge detector, in: Proceedings of the Alvey Vision Conference, Manchester, United Kingdom, August 1988, pp. 147–151.
[6] Ivan Laptev, Barbara Caputo, Christian Schuldt, Tony Lindeberg, Local velocity-adapted motion events for spatio-temporal recognition, Computer Vision and Image Understanding (CVIU) 108 (3) (2007) 207–229.
[7] Piotr Dollar, Vincent Rabaud, Garrison Cottrell, Serge Belongie, Behavior recognition via sparse spatio-temporal features, in: Proceedings of the International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS'05), Beijing, China, October 2005, pp. 65–72.
[8] Shu-Fai Wong, Roberto Cipolla, Extracting spatiotemporal interest points using global information, in: Proceedings of the International Conference on Computer Vision (ICCV'07), Rio de Janeiro, Brazil, October 2007, pp. 1–8.

5. References (2/4)
[9] Juan Carlos Niebles, Hongcheng Wang, Li Fei-Fei, Unsupervised learning of human action categories using spatial–temporal words, International Journal of Computer Vision (IJCV) 79 (3) (2008) 299–318.
[10] Christian Schuldt, Ivan Laptev, Barbara Caputo, Recognizing human actions: a local SVM approach, in: Proceedings of the International Conference on Pattern Recognition (ICPR'04), vol. 3, Cambridge, United Kingdom, August 2004, pp. 32–36.
[11] Paul Scovanner, Saad Ali, Mubarak Shah, A 3-dimensional SIFT descriptor and its application to action recognition, in: Proceedings of the International Conference on Multimedia (MultiMedia'07), Augsburg, Germany, September 2007, pp. 357–360.
[12] Ju Sun, Xiao Wu, Shuicheng Yan, Loong-Fah Cheong, Tat-Seng Chua, Jintao Li, Hierarchical spatio-temporal context modeling for action recognition, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'09), Miami, FL, June 2009, pp. 1–8.
[13] Paul Smith, Niels da Vitoria Lobo, Mubarak Shah, TemporalBoost for event recognition, in: Proceedings of the International Conference on Computer Vision (ICCV'05), vol. 1, Beijing, China, October 2005, pp. 733–740.
[14] Shiv N. Vitaladevuni, Vili Kellokumpu, Larry S. Davis, Action recognition using ballistic dynamics, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'08), Anchorage, AK, June 2008, pp. 1–8.

5. References (3/4)
[15] Xiaolin Feng, Pietro Perona, Human action recognition by sequence of movelet codewords, in: Proceedings of the International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT'02), Padova, Italy, June 2002, pp. 717–721.
[16] Daniel Weinland, Edmond Boyer, Remi Ronfard, Action recognition from arbitrary views using 3D exemplars, in: Proceedings of the International Conference on Computer Vision (ICCV'07), Rio de Janeiro, Brazil, October 2007, pp. 1–8.
[17] Fengjun Lv, Ram Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'07), Minneapolis, MN, June 2007, pp. 1–8.
[18] Mohiuddin Ahmad, Seong-Whan Lee, Human action recognition using shape and CLG-motion flow from multi-view image sequences, Pattern Recognition 41 (7) (2008) 2237–2252.
[19] Qinfeng Shi, Li Wang, Li Cheng, Alex Smola, Discriminative human action segmentation and recognition using semi-Markov model, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'08), Anchorage, AK, June 2008, pp. 1–8.
[20] Steven M. Seitz, Charles R. Dyer, View-invariant analysis of cyclic motion, International Journal of Computer Vision (IJCV) 25 (3) (1997) 231–251.

5. References (4/4)
[21] Ross Cutler, Larry S. Davis, Robust real-time periodic motion detection, analysis, and applications, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22 (8) (2000) 781–796.
[22] Ramprasad Polana, Randal C. Nelson, Detection and recognition of periodic, nonrigid motion, International Journal of Computer Vision (IJCV) 23 (3) (1997) 261–282.
[23] Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld, Learning realistic human actions from movies, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'08), Anchorage, AK, June 2008, pp. 1–8.
[24] Mikel D. Rodriguez, Javed Ahmed, Mubarak Shah, Action MACH: a spatio-temporal maximum average correlation height filter for action recognition, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'08), Anchorage, AK, June 2008, pp. 1–8.
[25] Marcin Marszałek, Ivan Laptev, Cordelia Schmid, Actions in context, in: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR'09), Miami, FL, June 2009, pp. 1–8.