CVPR 2009 Quick Review: Action Recognition
Presenter: Li Zhe (李哲), JDL, Institute of Computing Technology, Chinese Academy of Sciences
September 18, 2009
2016/3/15

# Paper Title
Recognizing Realistic Actions from Videos “in the Wild”
Authors: Jingen Liu, Jiebo Luo, Mubarak Shah
Paper ID: 0598

# Outline
• About the authors
• Abstract
• Background
• Paper structure
• Problem statement
• Algorithm
• Experimental results
• Conclusions

# First Author
Jingen Liu
PhD student, School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, USA
Research Interests: scene understanding and recognition, action recognition, video content analysis and retrieval, object recognition, and crowd tracking.
Papers
• 2009: CVPR (2), ICCV (2), ICASSP (1)
• 2008: ICPR (1), CVPR (2), TRECVID (1)
Background
• Ph.D. (in progress): University of Central Florida, Orlando, FL, USA
• B.S. and M.S.: Huazhong University of Science and Technology, Wuhan, China
Homepage: http://www.cs.ucf.edu/~liujg/

# Second Author
Jiebo Luo
Senior Principal Scientist, Kodak Research Laboratories, Rochester, NY
Research Interests: image processing, pattern recognition, computer vision, computational photography, medical imaging, and multimedia communication.
Academic Contributions
• Fellow, IEEE (2009)
• 120+ papers, 40+ granted U.S. patents
Background
• Kodak Research Laboratories: Senior Research Scientist (1995-1996), Principal Research Scientist (1996-1999), Senior Principal Scientist (1999-present)
• Ph.D. (1995): Electrical Engineering, University of Rochester
• B.S. (1989) and M.S. (1992): Electrical Engineering, University of Science and Technology of China
Homepage: http://sites.google.com/site/jieboluo/Home

# Third Author
Mubarak Shah
Agere Chair Professor, School of Electrical Engineering & Computer Science, University of Central Florida, Orlando, FL
Research Interests: image processing, pattern recognition, computer vision, computational photography, medical imaging, and multimedia communication.
Academic Contributions
• Fellow, IEEE (2003)
• Books (2), book chapters (10), journal papers (60), conference papers (130) … before 2006
Background
• M.S. (1982) and Ph.D. (1986): Wayne State University, Detroit, Michigan (major: Computer Engineering; minor: Mathematics)
• E.D.E. (1980): postgraduate diploma, Philips International Institute of Technological Studies, Eindhoven, The Netherlands (major: Speech Recognition)
• B.S. (1979): National College of Engineering & Technology, Karachi, Pakistan (major: Electronics)
Homepage: http://server.cs.ucf.edu/~vision/faculty/shah.html (not available ??)
CV (2006): http://unjobs.org/authors/mubarak-shah

# Abstract
In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild.” Such unconstrained videos are abundant in personal collections as well as on the web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance, and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization.

# Background (1/2): Interest Point Detection
• Blob detection: aims to detect points and/or regions in the image that are either brighter or darker than their surroundings.
• Corner detection: aims to detect “corner” points in the image. A “corner” can be defined as the intersection of two edges, or as a point with two dominant and different edge directions in its local neighborhood.
Fig. Input and output of a typical corner detection algorithm

# Background (2/2): Bag of Video-Words
Fig. (from ??)

# Problem
Recognizing Realistic Action from Video “in the Wild”
• Realistic action video “in the wild” (YouTube dataset): basket shooting, swing, biking, tennis shooting
VS
• Template-based action video in a lab environment (KTH dataset): boxing, clapping, jogging, waving
Challenges in realistic videos: large variation in
• Camera motion
• Cluttered background
• Viewpoint
• Object scale
• Illumination condition
• Object appearance and pose

# Framework
Input Videos → Motion & Static Features Extraction → Motion & Static Features Pruning → Motion & Static Vocabularies Learning → Histogram-based Video Representation → Boosted Learning & Localization
Contributions: the feature pruning and vocabulary learning stages

# Motivation (1/3): Static Features
Why static features?
• In realistic video, motion features are unreliable because of unpredictable and often unintended camera motion (camera shake).
• Correlated objects help action recognition: the “ball” in “Soccer Juggling”, the “horse” in “Horseback Riding”, etc.
• Static features are complementary to motion features.
How to get static features?
• Interest point detectors: corner features and blob features

# Motivation (2/3): Feature Pruning
Why feature pruning?
• Motion feature pruning: discard the motion features caused by camera movement or shake.
• Static feature pruning: select the significant static features.
How to prune features?
• Motion feature pruning: use feature statistics and the distribution of spatial locations.
• Static feature pruning: PageRank.

# Motivation (3/3): Vocabulary Learning
Why vocabulary learning?
• Obtain compact yet discriminative visual vocabularies for motion and static features.
• A large visual vocabulary performs better, but over-specific visual words may eventually over-fit the data.
• The two feature types may be more useful in combination than when used individually.
How to learn vocabularies?
• Use an information-theoretic measure to refine the initial vocabularies by feature grouping.

# Algorithm: Motion Feature Detection
Input: video
Spatiotemporal interest point detector [P. Dollar et al., VS-PETS 2005]
• Filters: 2-D Gaussian filter in space, 1-D Gabor filter in time
• Point: local maximum of the filter response
• Area: 3-D cuboid around each point
Descriptor
• Flatten the gradient vector of each area
• Reduce the dimensions with PCA
Output: motion features

# Algorithm: Motion Feature Pruning
Camera shake affects motion feature detection in only a few frames, so the affected frames can simply be discarded.
Rules
• Rule 1: if a frame contains too many features, discard the whole frame (removes abrupt camera motion).
• Rule 2: filter the remaining features, keeping a fixed proportion of those closest to the mean position of all features in the frame (selects good features).
About 8 percentage points improvement in average accuracy

# Algorithm: Static Feature Detection
Interest point detectors
• Harris-Laplacian (HAR) detector: corner features
• Hessian-Laplacian (HES) detector and MSER detector: blob features
Pruning using context information
• Detect regions of interest by motion statistics
• Use PageRank to preserve consistent features

# Algorithm: Static Feature Pruning (1/2)
Motivation: foreground features have motion-consistent matches throughout the entire video sequence. Background features, however, are unstable across the sequence.
Using PageRank to preserve consistent (significant) features
• Construct a feature network G = (V, E) for a given video, with an n×n weight matrix W
  V: set of vertices (static features, i.e. image patches)
  E: set of weighted edges (feature similarity) [24]
• Rank the features based on their persistence (importance):
  $PR = \alpha\, PR\, W + (\alpha\, PR\, b + 1 - \alpha)\, v^{\top}$
  $\alpha$: scaling factor (0.85 in the experiments)
  b: indicator vector identifying the vertices with zero out-degree
  W: weight matrix
  v: n×1 transport vector with a uniform probability distribution over the vertices
• The initial PR value of each vertex is 1/n.

# PageRank
PageRank is a variant of eigenvector centrality, which measures the importance of a node in a given network.
• It ranks vertices by their relative importance.
• A vertex that neighbors an important vertex should rank higher.
Fig. PageRank (from Wikipedia)

# Algorithm: Static Feature Pruning (2/2)
Fig. Two examples from riding (top) and cycling (bottom) demonstrate the effects of feature acquisition. The first row shows the selected features. The top 10% of features by PR value are retained.
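The PageRank-based pruning described on the Static Feature Pruning slides can be sketched as a short power iteration. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the row normalisation of the similarity matrix, the convergence tolerance, and the toy similarity matrix are all hypothetical; only the 0.85 scaling factor, the 1/n initial value, the uniform transport vector, and the top-10% cut-off come from the slides.

```python
import numpy as np

def pagerank_scores(similarity, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for PR = alpha*PR*W + (alpha*PR.b + 1 - alpha)*v."""
    S = np.asarray(similarity, dtype=float)
    n = S.shape[0]
    out_degree = S.sum(axis=1)
    # b marks vertices with zero out-degree (features that match nothing).
    b = (out_degree == 0).astype(float)
    # Assumption: W is the row-normalised similarity; zero rows stay zero.
    safe_degree = np.where(out_degree > 0, out_degree, 1.0)
    W = S / safe_degree[:, None]
    v = np.full(n, 1.0 / n)    # uniform transport vector
    pr = np.full(n, 1.0 / n)   # initial PR value of 1/n per vertex
    for _ in range(max_iter):
        new_pr = alpha * (pr @ W) + (alpha * (pr @ b) + 1.0 - alpha) * v
        if np.abs(new_pr - pr).sum() < tol:
            return new_pr
        pr = new_pr
    return pr

def keep_top_fraction(scores, fraction=0.10):
    """Indices of the highest-ranked features (at least one is kept)."""
    k = max(1, int(np.ceil(fraction * len(scores))))
    return np.argsort(scores)[::-1][:k]

# Toy network: features 0-2 match each other, feature 3 matches nothing.
toy_similarity = np.array([
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])
toy_scores = pagerank_scores(toy_similarity)
toy_kept = keep_top_fraction(toy_scores, fraction=0.5)
```

In the toy network the isolated feature receives the lowest score, which mirrors the intuition on the slides: features without consistent matches across the sequence are the ones pruned away.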
# Algorithm: Learning Semantic Vocabulary
Information-theoretic divisive algorithm
• Input: initial visual words $X$ and the distribution $p(C \mid X)$
• Output: visual word clusters $\hat{X}$ with $p(C \mid \hat{x})$
• Initialize $\hat{X}$ by randomly assigning the cluster members
This is similar to k-means.
Two major steps
• For each cluster $\hat{x}_i$, compute the prior and the “center”, where $\pi_t$ is the prior of visual word $x_t$:
  $\pi(\hat{x}_i) = \sum_{x_t \in \hat{x}_i} \pi_t$
  $p(C \mid \hat{x}_i) = \sum_{x_t \in \hat{x}_i} \frac{\pi_t}{\pi(\hat{x}_i)}\, p(C \mid x_t)$
• Update the clusters: for each $x_t$, find the new cluster:
  $i^{*}(x_t) = \arg\min_j KL\big(p(C \mid x_t),\, p(C \mid \hat{x}_j)\big)$

# Experiments: KTH dataset
Average accuracy
• Static features: 82.3%
• Motion features: 87.1%
• Hybrid features: 91.8%
Static feature: shape information

# YouTube dataset
• 11 categories: b_shooting, g_walking, cycling, t_swing, t_jumping, s_juggling, t_swinging, v_spiking, diving, swinging, r_riding
• About 1600 videos

# Experiments: YouTube dataset (1/3)
Figure A: Performance comparison between the system with and without motion feature pruning.
• Average accuracy before pruning: 57%
• Average accuracy after pruning: 65.4%
Figure B: Performance comparison between the system with and without static feature pruning.
• Average accuracy before pruning: 58.1%
• Average accuracy after pruning: 63.0%

# Experiments: YouTube dataset (2/3)
Fig. Comparison of classification performance using motion, static and hybrid features.
Average accuracy: motion 65.4%; static 63.1%; hybrid 71.2%

# Experiments: YouTube dataset (3/3)
Fig. The confusion table for classification using hybrid features.

# Conclusions
Points of interest
• This paper presents a systematic framework for recognizing realistic actions from videos “in the wild”.
• Static features are complementary to motion features.
• Using motion cues to prune motion and static features is helpful.
• Information-theoretic divisive clustering constructs compact yet discriminative semantic visual vocabularies.

Thank you!
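As a supplement to the Learning Semantic Vocabulary slide, the two alternating steps (recompute cluster priors and “centers”, then reassign each visual word to the cluster with the smallest KL divergence) can be sketched as follows. This is a minimal sketch under assumptions rather than the paper's code: the function names, the fixed iteration cap, the epsilon used to avoid division by zero, the re-seeding of empty clusters, and the toy distributions are all hypothetical choices made for illustration.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions; terms with p = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def divisive_clustering(p_c_given_x, priors, k, n_iter=20, seed=0):
    """Group visual words by the similarity of their class distributions."""
    P = np.asarray(p_c_given_x, dtype=float)   # shape (n_words, n_classes)
    pi = np.asarray(priors, dtype=float)       # prior of each visual word
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    labels = rng.integers(0, k, size=n)        # random initial assignment
    for _ in range(n_iter):
        # Step 1: prior-weighted "center" p(C | cluster) of every cluster.
        centers = np.zeros((k, P.shape[1]))
        for i in range(k):
            members = labels == i
            mass = pi[members].sum()
            if mass > 0:
                centers[i] = (pi[members, None] * P[members]).sum(axis=0) / mass
            else:
                # Assumption: an empty cluster is re-seeded with a random word.
                centers[i] = P[rng.integers(0, n)]
        # Step 2: move each word to the cluster with the smallest KL divergence.
        new_labels = np.array([
            np.argmin([kl_divergence(P[t], centers[j]) for j in range(k)])
            for t in range(n)
        ])
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy example: words 0-1 favour class 0, words 2-3 favour class 1.
toy_p = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
toy_labels = divisive_clustering(toy_p, priors=[0.25] * 4, k=2)
```

Words whose class distributions are close end up in the same cluster, which is how the vocabulary shrinks while staying discriminative; like k-means, the procedure only reaches a local optimum that depends on the random initial assignment.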