FCV_Scene_Schmid - Frontiers in Computer Vision

Action recognition in videos
Cordelia Schmid
INRIA Grenoble
Action recognition - problem
• Short actions, e.g. drinking, sitting down
  (Coffee & Cigarettes dataset, Hollywood dataset)
• Activities/events, e.g. making a sandwich, depositing a
  suspicious object (TRECVID Multimedia Event Detection)
TRECVID - Multimedia Event Detection
Attempting a board trick
Feeding an animal
Wedding ceremony
Getting a vehicle unstuck
Action recognition
• Action recognition is person-centric
• Vision is person-centric: we mostly care about things
  which are important
• People appear in a large fraction of video material:
  Movies 35%, TV 34%, YouTube 40%
Source: I. Laptev
Action recognition from still images
• Description of the human pose
– Silhouette description [Sullivan & Carlsson, 2002]
– Histogram of gradients (HOG) [Dalal & Triggs 2005]
– Human body part layout [Felzenszwalb & Huttenlocher, 2000]
Action recognition from still images
• Supervised modeling interaction between human & object
[Gupta et al. 2009, Yao & Fei-Fei 2009]
• Weakly-supervised learning of objects [Prest, Schmid & Ferrari 2011]
Results on PASCAL VOC 2010 Human action classification dataset
Importance of action objects
• Human pose often not sufficient by itself
• Objects define the actions
Importance of temporal information
• Video/temporal information necessary to disambiguate
actions
• Temporal context describes the action/activity
• Key frames provide significantly less information
Action recognition in videos
• Temporal information makes it possible to stabilize human and
  object detection by tracking – J. Malik: tracking by detection is
  difficult?
• Large amount of data, growing very fast – H. Sawhney:
  large amount of data, not often well explored
• Often comes with some sort of supervision (scripts,
  subtitles) – similar in spirit to M. Hebert’s comment on the
  large amount of data collected by a robot
Action recognition in videos
Motion history image
[Bobick & Davis, 2001]
Learning dynamic prior
[Blake et al. 1998]
Spatial motion descriptor
[Efros et al. ICCV 2003]
Sign language recognition
[Zisserman et al. 2009]
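A motion history image in the spirit of Bobick & Davis can be sketched in a few lines of numpy: pixels touched by recent motion are bright and older motion decays away. The frame-differencing threshold and the decay constant `tau` below are illustrative choices, not the paper's exact parameters:

```python
import numpy as np

def motion_history_image(frames, tau=10, threshold=30):
    """Motion history image: pixels with recent motion are set to tau,
    older motion decays linearly toward zero (Bobick & Davis-style)."""
    h, w = frames[0].shape
    mhi = np.zeros((h, w), dtype=np.float64)
    for prev, curr in zip(frames[:-1], frames[1:]):
        # crude motion mask from thresholded frame differencing
        motion = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > threshold
        mhi = np.where(motion, tau, np.maximum(mhi - 1, 0))
    return mhi / tau  # normalize to [0, 1]

# toy example: a bright square moving right across a dark background
frames = []
for t in range(5):
    f = np.zeros((32, 32), dtype=np.uint8)
    f[10:20, 5 + 4 * t:15 + 4 * t] = 255
    frames.append(f)
mhi = motion_history_image(frames, tau=4)
```

In the resulting image, the square's most recent positions appear brightest while its earlier positions have faded, which is exactly the temporal signature such templates exploit for recognition.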
Action recognition in videos
• Bag of space-time features [Laptev’03, Schuldt’04, Niebles’06, Zhang’07]
  Pipeline: collection of space-time patches → HOG & HOF patch
  descriptors → histogram of visual words → SVM classifier
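The core of this pipeline, quantizing local descriptors against a codebook and pooling them into a normalized histogram, can be sketched in plain numpy. Random vectors stand in for real HOG/HOF descriptors and a learned codebook, and the final SVM step is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def assign_words(descriptors, codebook):
    """Map each descriptor to its nearest codebook center (Euclidean)."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

def bow_histogram(descriptors, codebook):
    """L1-normalized histogram of visual words for one video."""
    words = assign_words(descriptors, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(np.float64)
    return hist / hist.sum()

# toy setup: 64-d "HOG/HOF-like" descriptors, a 20-word codebook
codebook = rng.normal(size=(20, 64))          # in practice: k-means centers
video_descriptors = rng.normal(size=(150, 64))
h = bow_histogram(video_descriptors, codebook)
```

Each video thus becomes a single fixed-length vector `h`, which is what makes the subsequent SVM classification straightforward regardless of video length.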
Action recognition in videos
• Bag of space-time features
– Many recent extensions: new features / tracklets, temporal
structuring etc.
– Advantages
• Very useful as a baseline
• Captures spatial and temporal context – see A. Efros’ comment on
image classification
– Disadvantages
• No interpretation of the action
• Not sufficient for localization & description
Action recognition in videos
• Localization by 3D HOG/HOF, interaction with objects
Tracking by detection and tracking
Interaction with objects
Space-time description
Action recognition in videos
• HOG 3D tracks + description
– Detection & tracking of humans with part-based models works well
• move towards more flexible models, similar to P. Felzenszwalb’s
• integrate motion information
• very good baseline
– Towards more flexible descriptions based on body parts; there is a
lot of recent work on finding human body parts
– Interaction with object important, but hard
Discussion
• Need for more challenging datasets
– Need for realistic datasets
KTH dataset
Hollywood dataset
– Scale up number of classes (today ~10 actions per dataset)
– Increase number of examples per class, possibly with weakly
supervised learning (the number of examples per video is low)
– Define a taxonomy, use redundancy between action classes to
improve training
– Manual exhaustive labeling of all actions impossible
Discussion
• Make better use of the large amount of information inherent
in videos
– automatic collection of additional examples
– improve models incrementally
– use weak labels from associated data (text, sound, subtitles)
• Many existing techniques are straightforward extensions of
methods for images
– almost no use of 3D information
– learn better interaction and temporal models
– design activity models by decomposition into simple actions