CVPR 2009 Quick Review:
Action Recognition
Presenter: Li Zhe
JDL, Institute of Computing Technology, Chinese Academy of Sciences
September 18, 2009
# Paper
Title: Recognizing Realistic Actions from Videos “in the Wild”
Authors: Jingen Liu, Jiebo Luo, Mubarak Shah
Paper ID: 0598
# Outline
Authors
Abstract
Background
Paper structure
Problem statement
Algorithm
Experimental results
Conclusions
# First Author
Jingen Liu
PhD student
School of Electrical Engineering and Computer Science
University of Central Florida, Orlando, FL, USA
Research Interests
scene understanding and recognition, action recognition, video content analysis and retrieval, object recognition, and crowd tracking
Papers
2009: CVPR (2), ICCV (2), ICASSP (1)
2008: ICPR (1), CVPR (2), TRECVID (1)
Background
Ph.D. (in progress): University of Central Florida, Orlando, FL, USA
B.S., M.S.: Huazhong University of Science and Technology, Wuhan, China
Homepage
http://www.cs.ucf.edu/~liujg/
# Second Author
Jiebo Luo
Senior Principal Scientist
Kodak Research Laboratories in Rochester, NY.
Research Interests

image processing, pattern recognition, computer vision, computational photography, medical imaging, and multimedia communication
Academic Contributions
Fellow, IEEE (2009)
120+ papers, 40+ granted U.S. patents
Background
Senior Principal Scientist (1999-present), Principal Research Scientist (1996-1999), Senior Research Scientist (1995-1996): Kodak Research Laboratories
Ph.D. (1995): Electrical Engineering, University of Rochester
B.S. (1989), M.S. (1992): Electrical Engineering, University of Science and Technology of China
Homepage
http://sites.google.com/site/jieboluo/Home
# Third Author
Mubarak Shah
Agere Chair Professor
School of Electrical Engineering & Computer Science
University of Central Florida, Orlando, FL
Research Interests

image processing, pattern recognition, computer vision, computational photography, medical imaging, and multimedia communication
Academic Contributions
Fellow, IEEE (2003)
Books (2), book chapters (10), journal papers (60), conference papers (130) ... before 2006
Background
M.S. (1982) & Ph.D. (1986): Wayne State University, Detroit, Michigan (major: Computer Engineering; minor: Mathematics)
E.D.E. (1980): a postgraduate diploma, Philips International Institute of Technological Studies, Eindhoven, The Netherlands (major: Speech Recognition)
B.S. (1979): National College of Engineering & Technology, Karachi, Pakistan (major: Electronics)
Homepage
http://server.cs.ucf.edu/~vision/faculty/shah.html (link not available?)
http://unjobs.org/authors/mubarak-shah (CV2006)
# Abstract
In this paper, we present a systematic framework for recognizing realistic actions
from videos “in the wild.”
Such unconstrained videos are abundant in personal collections as well as on the
web. Recognizing action from such videos has not been addressed extensively,
primarily due to the tremendous variations that result from camera motion,
background clutter, changes in object appearance, and scale, etc. The main
challenge is how to extract reliable and informative features from the
unconstrained videos.
We extract both motion and static features from the videos. Since the raw features
of both types are dense yet noisy, we propose strategies to prune these features.
We use motion statistics to acquire stable motion features and clean static
features. Furthermore, PageRank is used to mine the most informative static
features. In order to further construct compact yet discriminative visual
vocabularies, a divisive information-theoretic algorithm is employed to group
semantically related features. Finally, AdaBoost is chosen to integrate all the
heterogeneous yet complementary features for recognition.
We have tested the framework on the KTH dataset and our own dataset consisting
of 11 categories of actions collected from YouTube and personal videos, and have
obtained impressive results for action recognition and action localization.
# Abstract (Translation)
This paper proposes a systematic framework for recognizing realistic actions in videos “in the wild”.
Unconstrained videos are abundant in personal collections and on the web, but action recognition in such videos remains largely unsolved because of the great uncertainty introduced by camera motion, background clutter, and changes in object appearance and scale. The main challenge is how to extract reliable and informative features from these unconstrained videos.
We extract two kinds of features from the videos: motion and static. Since both kinds of raw features are dense yet noisy, we propose strategies to prune them.
We use motion statistics to acquire reliable motion features and clean static features; PageRank is used to mine the most informative static features; a divisive information-theoretic algorithm groups semantically related features to build compact yet discriminative visual vocabularies; finally, AdaBoost integrates all the heterogeneous yet complementary features.
Action localization and recognition were tested on the standard KTH dataset and on our own dataset (11 categories of actions from YouTube and personal videos), with impressive results in both cases.
# Background(1/2)——Interest Point Detection
Blob detection
Aimed at detecting points and/or regions in the image that are either brighter or darker than their surroundings.
Corner detection
Aimed at detecting “corner” points in the image. A “corner” can
be defined as the intersection of two edges, or a point for which
there are two dominant and different edge directions in a local
neighborhood of the point.
Fig. Input and output of a typical corner detection algorithm
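To make this concrete, here is a minimal Harris corner detection sketch using OpenCV. It is an illustration only, not the paper's detector configuration; the input file name and thresholds are assumptions.

```python
# Minimal Harris corner detection sketch (illustration only).
# Assumes OpenCV is installed and "frame.jpg" is a hypothetical input image.
import cv2
import numpy as np

img = cv2.imread("frame.jpg")
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# Harris response: blockSize=2 (neighborhood), ksize=3 (Sobel aperture),
# k=0.04 (Harris free parameter). All values are common defaults.
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep points whose response exceeds 1% of the maximum response.
corners = np.argwhere(response > 0.01 * response.max())
print(f"{len(corners)} corner points detected")
```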
# Background(2/2)——Bag of Video-Words
[Figure: illustration of the bag-of-video-words pipeline; source not identified in the slides. See the sketch below.]
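A minimal sketch of the bag-of-(video-)words representation, assuming scikit-learn and random stand-in descriptors; the vocabulary size and descriptor dimension are illustrative, not the paper's.

```python
# Sketch of the bag-of-(video-)words representation: cluster local
# descriptors into a visual vocabulary, then describe each video as a
# histogram of word occurrences. Sizes and data are illustrative only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_descriptors = rng.normal(size=(5000, 100))  # e.g. PCA-reduced cuboid features
video_descriptors = rng.normal(size=(300, 100))   # descriptors from one video

# Learn a vocabulary of 1000 visual words by k-means.
vocab = KMeans(n_clusters=1000, n_init=1, random_state=0).fit(train_descriptors)

# Quantize the video's descriptors and build a normalized histogram.
words = vocab.predict(video_descriptors)
hist = np.bincount(words, minlength=1000).astype(float)
hist /= hist.sum()
```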
# Problem
Recognizing Realistic Action from Video “in the Wild”
Realistic actions in videos “in the wild” (YouTube dataset):
basketball shooting, swinging, biking, tennis swinging
VS
Template-based actions in videos from a lab environment (KTH dataset):
boxing, clapping, jogging, waving
Challenges in realistic videos: large variation in
• camera motion
• cluttered background
• viewpoint
• object scale
• illumination conditions
• object appearance and pose
# Framework
Input videos
→ Motion & static feature extraction
→ Motion & static feature pruning
→ Motion & static vocabulary learning
→ Histogram-based video representation
→ Boosted learning & localization (see the sketch below)
(Contributions: the feature pruning and vocabulary learning stages)
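For the final boosted-learning stage, here is an illustrative sketch that fuses the motion and static histograms by simple concatenation and trains scikit-learn's AdaBoostClassifier; the paper's exact weak-learner setup may differ, and all sizes and data below are placeholders.

```python
# Illustrative sketch of the boosted-learning stage: fuse the motion and
# static histograms of each video and train AdaBoost. Concatenation is a
# simplification; the data here are random placeholders.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
n_videos = 200
motion_hists = rng.dirichlet(np.ones(400), size=n_videos)  # 400 motion words
static_hists = rng.dirichlet(np.ones(600), size=n_videos)  # 600 static words
labels = rng.integers(0, 11, size=n_videos)                # 11 action classes

X = np.hstack([motion_hists, static_hists])  # hybrid representation
clf = AdaBoostClassifier(n_estimators=100).fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```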
# Motivation(1/3)——Static Features
Why Static Features?
In realistic videos, motion features are unreliable due to unpredictable and often unintended camera motion (camera shake).
Correlated objects are helpful to action recognition.
“Ball” in “Soccer Juggling”, “Horse” in “Horseback Riding”, etc.
Static features are complementary to motion features.
How to get Static Features?
Interest point detectors: corner features & blob features
# Motivation(2/3)——Feature Pruning
Why Feature Pruning?
Motion feature pruning: discard the motion features caused by
camera moving or shaking.
Static feature pruning: select the significant static features.
How to prune features?
Motion feature Pruning: use feature statistics and the distribution
of spatial locations.
Static feature pruning: PageRank.
# Motivation(3/3)——Vocabulary Learning
Why vocabulary learning?
Obtain compact yet discriminative visual vocabularies for motion
and static features.
A large visual vocabulary performs better, but over-specific visual words may eventually over-fit the data.
The two feature types combined may be more useful than either used individually.
How to learn vocabularies?
Information-theoretic measure to refine the initial vocabularies by
feature grouping.
# Algorithm——Motion Feature Detection
Input: video
Spatiotemporal interest point detector [P. Dollar et al., VS-PETS 2005]
Filters: a 2-D Gaussian filter in space and a 1-D Gabor filter in time
Points: local maxima of the filter response
Areas: 3-D cuboids around the points
Flatten the gradient vectors of the cuboids; PCA reduces their dimensionality
Output: motion features (see the sketch below)
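A rough sketch of the detector's response, assuming the commonly cited quadrature Gabor formulation of Dollar et al.; the parameter values and filter support are illustrative.

```python
# Sketch of the Dollar et al. periodic-motion detector response: smooth each
# frame with a 2-D Gaussian, filter along time with a quadrature pair of 1-D
# Gabor filters, and take the energy. Parameters are common defaults, not
# necessarily those used in this paper.
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def detector_response(video, sigma=2.0, tau=2.5):
    """video: array of shape (T, H, W), grayscale frames."""
    smoothed = np.stack([gaussian_filter(f, sigma) for f in video])
    t = np.arange(-2 * int(tau), 2 * int(tau) + 1)
    w = 4.0 / tau
    h_ev = -np.cos(2 * np.pi * t * w) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * w) * np.exp(-t**2 / tau**2)
    # 1-D temporal convolution along axis 0 (time).
    even = convolve1d(smoothed, h_ev, axis=0)
    odd = convolve1d(smoothed, h_od, axis=0)
    # Interest points are local maxima of this energy; cuboids are then cut
    # around them, their gradients flattened, and PCA applied.
    return even**2 + odd**2
```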
# Algorithm——Motion Feature Pruning
Camera shake affects the detection of motion features in only a few frames, so the affected frames can simply be discarded.
Rules
Rule 1: if a frame contains too many features, discard the whole frame (removes abrupt camera motion)
Rule 2: filter the remaining frames, keeping a fixed fraction of the features closest to the mean position of all features in the frame (selects good features); see the sketch below
About 8% improvement in average accuracy
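A minimal sketch of the two pruning rules; the thresholds are hypothetical, since the slides do not give the paper's actual values.

```python
# Hedged sketch of the two pruning rules (thresholds are illustrative).
import numpy as np

def prune_frame(feats, max_feats=50, keep_ratio=0.7):
    """feats: list of (x, y) feature positions detected in one frame."""
    feats = np.asarray(feats, dtype=float)
    if len(feats) == 0:
        return feats
    # Rule 1: an excessive number of features usually indicates abrupt
    # camera motion; discard the whole frame (signalled by None).
    if len(feats) > max_feats:
        return None
    # Rule 2: keep the fraction of features closest to the mean position
    # of all features in the frame.
    center = feats.mean(axis=0)
    dists = np.linalg.norm(feats - center, axis=1)
    keep = np.argsort(dists)[: int(keep_ratio * len(feats))]
    return feats[keep]
```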
# Algorithm——Static Feature Detection
Interest point detectors
Harris-Laplacian (HAR) and Hessian-Laplacian (HES) detectors: corner features
MSER detector: blob features (see the detection sketch below)
Pruning using context information
Detecting regions of interest by motion statistics
Using PageRank to preserve consistent features
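An illustrative sketch of blob-style static feature extraction with OpenCV's MSER followed by SIFT description; the tooling is an assumption, not the paper's exact implementation (which also uses the Harris-Laplace and Hessian-Laplace detectors).

```python
# Illustrative static-feature extraction with one of the listed detectors
# (MSER, via OpenCV). "frame.jpg" is a hypothetical input frame.
import cv2

gray = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)

mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)
print(f"{len(regions)} blob regions detected")

# Describe each region, e.g. with SIFT computed at the region locations.
sift = cv2.SIFT_create()
keypoints = [cv2.KeyPoint(float(x + w / 2), float(y + h / 2), float(max(w, h)))
             for (x, y, w, h) in bboxes]
keypoints, descriptors = sift.compute(gray, keypoints)
```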
# Algorithm——Static Feature Pruning(1/2)
Motivation:
Foreground features have motion-consistent matches throughout the entire video sequence. Background features, however, are unstable across the video sequence.
Using PageRank to preserve consistent (significant) features
Construct a feature network G = (V, E) for a given video, with an n×n weight matrix W
V: set of vertices (static features, i.e. image patches)
E: set of weighted edges (feature similarity) [24]
Rank the features by their persistence (importance):
Pr = α (Pr·W + (Pr·b)·vᵀ) + (1 − α)·vᵀ
α: scaling factor (0.85 in the experiments)
b: indicator vector identifying the vertices with zero out-degree
W: weight matrix
v: n×1 transport vector with a uniform probability distribution over the vertices
The initial PR value for each vertex is 1/n (see the power-iteration sketch below)
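A minimal power-iteration sketch of this PageRank formulation, assuming the n×n similarity matrix W has already been built from the feature matches and row-normalized.

```python
# Power-iteration sketch of the PageRank used to rank static features.
# W is a hypothetical row-normalized feature-similarity matrix.
import numpy as np

def pagerank(W, alpha=0.85, iters=100):
    n = W.shape[0]
    v = np.full(n, 1.0 / n)                 # uniform transport vector
    b = (W.sum(axis=1) == 0).astype(float)  # vertices with zero out-degree
    pr = v.copy()                           # initial PR value 1/n per vertex
    for _ in range(iters):
        pr = alpha * (pr @ W + (pr @ b) * v) + (1 - alpha) * v
    return pr

# Keep e.g. the top 10% of features by PR value, as in the slides.
```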
# PageRank
PageRank is a variant of Eigenvector Centrality, which
measures the importance of a node in a given network.
Ranks vertices by their relative importance
A vertex adjacent to an important vertex should itself rank higher
Fig. PageRank illustration (from Wikipedia)
# Algorithm——Static Feature Pruning(2/2)
Fig. Two examples from riding (top) and cycling (bottom) demonstrate the effects of feature acquisition. The first row shows the selected features. The top 10% of features by PR value are retained.
# Algorithm——Learning Semantic Vocabulary
Information-theoretic divisive algorithm
Input: X, the initial visual words, and the distribution p(C | X)
Output: X̂, the visual-word clusters, with p(C | x̂)
Initialization: randomly assign the words to clusters in X̂
(This is similar to k-means.)
Two major steps, iterated until convergence:
1. For each cluster x̂ᵢ, compute its prior and “center”:
π(x̂ᵢ) = Σ_{xₜ ∈ x̂ᵢ} π(xₜ)
p(C | x̂ᵢ) = Σ_{xₜ ∈ x̂ᵢ} (π(xₜ) / π(x̂ᵢ)) · p(C | xₜ)
2. Update the clusters: for each word xₜ, find the new cluster
i*(xₜ) = argminⱼ KL( p(C | xₜ), p(C | x̂ⱼ) )
(See the sketch below.)
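A small sketch of this iteration with placeholder inputs; the fixed iteration count and random initialization details are assumptions.

```python
# Sketch of divisive information-theoretic clustering: each word x_t carries
# a class distribution p(C|x_t) and a prior pi(x_t); clusters are updated
# k-means-style under KL divergence. Inputs here are illustrative.
import numpy as np

def kl(p, q, eps=1e-12):
    return np.sum(p * np.log((p + eps) / (q + eps)))

def cluster_words(p_c_given_x, pi_x, n_clusters, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, n_clusters, size=len(pi_x))  # random init
    for _ in range(iters):
        centers = np.zeros((n_clusters, p_c_given_x.shape[1]))
        for j in range(n_clusters):
            members = assign == j
            prior = pi_x[members].sum()  # pi(x_hat_j)
            if prior > 0:                # prior-weighted mean of p(C|x_t)
                centers[j] = (pi_x[members, None]
                              * p_c_given_x[members]).sum(0) / prior
        # Reassign each word to the cluster minimizing KL(p(C|x_t), p(C|x_hat_j)).
        assign = np.array([np.argmin([kl(p, c) for c in centers])
                           for p in p_c_given_x])
    return assign
```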
# Experiments——KTH dataset
Average accuracy on KTH:
static features: 82.3%
motion features: 87.1%
hybrid features: 91.8%
(Static features contribute shape information.)
# YouTube dataset
11 categories, about 1600 videos:
b_shooting, g_walking, cycling, g_swinging, t_jumping, s_juggling, t_swinging, v_spiking, diving, swinging, r_riding
# Experiments——YouTube dataset(1/3)
Figure A: performance comparison between the system with and without motion feature pruning.
Average accuracy: before pruning 57%, after pruning 65.4%
Figure B: performance comparison between the system with and without static feature pruning.
Average accuracy: before pruning 58.1%, after pruning 63.0%
# Experiments——YouTube dataset(2/3)
Fig. Comparison of classification performance using motion, static, and hybrid features.
Average accuracy:
Motion: 65.4%; Static: 63.1%; Hybrid:71.2%
# Experiments——YouTube dataset(3/3)
Fig. The confusion table for classification using hybrid features.
# Conclusions
This paper presents a systematic framework for recognizing realistic actions from videos “in the wild”.
Static features are complementary to motion features.
Using motion cues to prune both motion and static features is helpful.
Information-theoretic divisive clustering constructs compact yet discriminative semantic visual vocabularies.
Thank you!