1
Low-Level Features and
Descriptors in Video
Jason J. Corso
SUNY at Buffalo
jcorso@buffalo.edu
http://www.cse.buffalo.edu/~jcorso
UCLA IPAM GSS 2013: Computer Vision July-August 2013
2
Sources: video made by J. B. Maas (1971) with Johansson; downloaded from YouTube.
3
Human Activity Perception and Cognition
• Summary of findings (to be presented)
– Humans can perceive biological motion, such as walking and dancing, from as few as 10-12 moving point-light stimuli.
– Biological motion activates the superior temporal sulcus (STS).
– Mammalian recognition of biological motion is viewpoint
specific.
– The STS has cells tuned to specific hand gestures, such as
tearing and pointing.
– The STS has specific cells tuned to interactions of objects and
mammalian agents in the environment, e.g., hands and food.
4
Johansson: Perception of Biological Motion
Sources: Johansson, G. “Visual perception of biological motion and a model for its analysis.” Perception & Psychophysics. 14(2):201-211. 1973.
Videos were made by J. B. Maas in 1971 (released via Houghton Mifflin and now available on YouTube).
5
Mammalian Activity Recognition is Viewpoint Specific
Source: M. A. Giese. Neural model for the recognition of biological motion, In Dynamische Perzeption, 105–110. Infix Verlag, 2000.
• The paper explores the basic question of viewpoint invariance of the mechanisms within the STS that recognize complex biological motion.
• Not a human study.
• Builds a simple model of the known visual pathway.
– Two parts: form and motion.
– The motion pathway encodes:
• optical flow,
• max-pooling (invariance),
• non-linear templates on max-pooled outputs.
• Results
– Biological motion recognition can be achieved independently by each pathway.
– Shift and scale invariance (up to 1.5 octaves) observed.
– The model predicts view dependence, with recognition performance degrading as a function of the viewpoint angle.
– The motion pathway is less tolerant of temporal distortion than the form pathway.
– Individual motions form a basis for the predicted smooth motion fields (linear combinations of models look biologically plausible).
6
Pelphrey Study: STS Prefers Biological Motion
Source: Pelphrey et al. “Brain Activity Evoked by the Perception of Human Walking.” Journal of Neuroscience 23(17):6819-6825, 2003.
• Explores whether the superior temporal sulcus is specifically attuned to biological motion.
• Key Findings:
– The STS is sensitive specifically to biological motion.
– The STS responds more strongly to biological motion than to nonmeaningful but complex nonbiological motion.
– The STS responds more strongly to biological motion than to complex and meaningful nonbiological motion.
[Figures: experiment setup and peak results. Image source: Wikimedia Commons.]
7
Select STS Cells Tuned to Object-Agent Interactions
Source: Perrett et al. Frameworks of analysis for the neural representation of animate objects and actions. J. Experimental Biology, 146:87–113, 1989.
• Studied a population of cells tuned to hands (in macaque monkeys).
– The studied cells were unresponsive to simple bars/gratings as well as to complex body movements, faces, or food.
• Findings
– Different actions of the hand activated different subpopulations of cells.
– Isolated cells for reach for, retrieve, manipulate, pick, tear, present to the monkey, and hold.
– Selectivity is clearly detected for hand-object interactions over object-object interactions.
8
Activity Recognition Overview
9
Activity Recognition Has Many Faces
[Diagram: activity recognition spans input modalities (multiview video, multiview images, single video, RGBD, single image), tasks (description, summarization, segmentation, localization, classification, detection), and activity granularities (events, group actions, interactions, actions, gestures).]
Videos on this slide are sourced from my work, YouTube, and other standard data sets.
10
Applications of Activity Recognition
• Automated surveillance systems.
• Real-time monitoring of patients and elderly persons.
• Gesture- and action-based interfaces (e.g., Kinect).
• Anomaly detection in video.
• Sports-video analysis.
• Semantic video retrieval.
• Video-to-text.
(Slides 11 and 12 repeat the list above, illustrated with example videos. Source: http://www.youtube.com/)
13
Inherent Complexity in Activity
Source: http://www.youtube.com/watch?v=6H0D8VaIli0
14
Humans are Highly Articulated
Source: http://www.youtube.com/
15
Motion of the Camera and/or of the Scene
Source: Goodfellas (copyright Columbia Pictures) used under fair use; Video trimmed from GaTech Video Segmentation data set.
16
Action is View Dependent
Source: UCF 50 data set from UCF.
17
Action is Subject Dependent
Source: UCF 50 data set from UCF.
18
Occlusion
Source: GaTech Video Segmentation Data Set.
19
Johansson: Perception of Biological Motion
Sources: Johansson, G. “Visual perception of biological motion and a model for its analysis.” Perception & Psychophysics. 14(2):201-211. 1973.
Videos were made by J. B. Maas in 1971 (released via Houghton Mifflin and now available on YouTube).
20
High-level Representations are a Challenge
Pose computed using the state-of-the-art Yang and Ramanan method independently on each frame. Notice the jittery character of the pose due to local variation.
21
Current Performance: Activity Recognition
[Chart: accuracy of published methods (2004-2012) plotted against the number of classes in the data set: KTH (6), UCF Sports (9), UCF50 (50), and HMDB (51); annotated accuracies of 76% (low-level features) and 71% (pose-based activity recognition) on UCF50.]
Findings:
• These results depict performance circa 2012.
• Performance decreases with the number of classes.
• Performance increases with time.
• All methods are based on low-level features.
• The state-of-the-art pose method performs worse.
22
The (Very Common) Bag-of-Features Pipeline
Source: materials adapted from Laptev’s CVPR 2008 slides.
Pipeline: space-time features → space-time patch descriptors → histogram of visual words → multi-channel classifier.
• Examples include Schüldt et al. ICPR 2004, Niebles et al.
IJCV 2008, and many works building on this basic idea.
23
Typical Elements of the Bag-of-Features Pipeline
• Feature quantization:
– Feature descriptions are max-pooled according to various spatial domains.
– Histograms are computed.
• The most typical classifier is a support vector machine.
• E.g., Laptev et al. 2008 use a multi-channel chi-square kernel:
K(H_i, H_j) = exp( -Σ_{c∈C} (1/A_c) D_c(H_i, H_j) ),
where C is the set of channels (chosen by greedy selection), A_c is the mean of the training-sample distances for channel c, and D_c is the chi-square distance between the channel-c histograms.
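A minimal sketch of this kernel, assuming L1-normalized histograms stored per channel in dictionaries and precomputed channel normalizers A_c (the names chi2_distance and multichannel_kernel are illustrative, not from the paper):

```python
import numpy as np

def chi2_distance(h1, h2, eps=1e-10):
    # Chi-square distance between two (L1-normalized) histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def multichannel_kernel(Hi, Hj, A):
    # Hi, Hj: dicts mapping channel name -> histogram (numpy array).
    # A: dict mapping channel name -> mean training-sample distance A_c.
    s = sum(chi2_distance(Hi[c], Hj[c]) / A[c] for c in A)
    return np.exp(-s)
```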
24
Outline
• Features
– Local spatiotemporal features
• STIP
• Cuboids
• Dense sampling
– Trajectories
• Keypoint tracking
• Dense trajectories
• Descriptors
– Local
• HOG/HOF
• MBH: Motion Boundary Histograms
• HOG3D
• MIP: Motion Interchange Patterns
– Global space-time energy
26
Detectors / Features
Features 27
STIP: Space-Time Interest Points
Source: Laptev. “On Space-Time Interest Points.” Intl Journal of Computer Vision. 64(2/3):107-123. 2005.
• The basic idea is to detect points in the video that have significant local variations in both space and time.
• Builds on the Harris corner detector and incorporates a scale parameter.
• The original work incorporates a scale-selection term; most subsequent works densely sample scale. A response-function sketch follows.
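As a rough illustration, here is a minimal numpy/scipy sketch of the space-time Harris response underlying STIP, assuming a grayscale video array; the scale parameters and the constant k = 0.005 are assumptions, not Laptev's exact settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def stip_response(video, sigma=2.0, tau=1.5, s=2.0, k=0.005):
    # video: (T, H, W) float array; sigma/tau: spatial/temporal scales;
    # s: ratio of integration scale to differentiation scale.
    L = gaussian_filter(video, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)  # temporal and spatial derivatives
    g = lambda f: gaussian_filter(f, sigma=(s * tau, s * sigma, s * sigma))
    D = (Lt, Ly, Lx)
    # 3x3 second-moment matrix mu at every space-time point.
    mu = [[g(D[a] * D[b]) for b in range(3)] for a in range(3)]
    det = (mu[0][0] * (mu[1][1] * mu[2][2] - mu[1][2] * mu[2][1])
         - mu[0][1] * (mu[1][0] * mu[2][2] - mu[1][2] * mu[2][0])
         + mu[0][2] * (mu[1][0] * mu[2][1] - mu[1][1] * mu[2][0]))
    tr = mu[0][0] + mu[1][1] + mu[2][2]
    return det - k * tr ** 3  # STIPs are local maxima of this volume
```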
Features 28
STIP: Space-Time Interest Points
Source: Laptev. “On Space-Time Interest Points.” Intl Journal of Computer Vision. 64(2/3):107-123. 2005.
Features 29
STIP: Space-Time Interest Points
Source: Laptev. “On Space-Time Interest Points.” Intl Journal of Computer Vision. 64(2/3):107-123. 2005.
Video from Laptev’s CVPR 2008 slides.
Features 30
Dollár’s Cuboids
Source: Dollár et al. “Behavior Recognition via Sparse Spatio-Temporal Features.” ICCV VS-PETS Workshop, 2005.
• The detector fires when local image intensities contain periodic frequency components.
• It will fire more frequently than STIP.
• Based on a temporal Gabor quadrature-pair filter with response function
R = (I * g * h_ev)^2 + (I * g * h_od)^2,
where g(x, y; σ) is a 2D spatial Gaussian and h_ev, h_od are a quadrature pair of 1D temporal Gabor filters.
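A minimal sketch of this response, assuming a (T, H, W) grayscale video and the paper's coupling ω = 4/τ between the Gabor frequency and the temporal scale (the σ and τ defaults here are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def cuboid_response(video, sigma=2.0, tau=2.5):
    omega = 4.0 / tau                       # frequency tied to scale
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    env = np.exp(-t ** 2 / tau ** 2)
    h_ev = -np.cos(2 * np.pi * t * omega) * env   # even Gabor
    h_od = -np.sin(2 * np.pi * t * omega) * env   # odd Gabor
    S = gaussian_filter(video, sigma=(0, sigma, sigma))  # spatial smoothing
    R_ev = convolve1d(S, h_ev, axis=0)
    R_od = convolve1d(S, h_od, axis=0)
    return R_ev ** 2 + R_od ** 2            # detections = local maxima of R
```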
Features 31
Dense Sampling of Locations
• Motivated by successes in object recognition, where densely sampled features outperformed sparse ones, it has become common to sample densely for activity recognition too.
• Example videos below use:
– 7x7x7 non-overlapping samples,
– a simple temporal derivative (much simpler than HOF and HOG3D),
– k-means quantization into 128 visual words.
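A toy sketch of this sampling scheme; the flattened temporal derivative mirrors the deliberately simple descriptor above, while the array shapes and the scikit-learn quantization step are assumptions:

```python
import numpy as np

def dense_features(video, patch=7):
    # video: (T, H, W) float array. Tile it into non-overlapping
    # patch x patch x patch cuboids; describe each by the flattened
    # temporal derivative (a deliberately simple descriptor).
    dt = np.diff(video, axis=0)
    T, H, W = dt.shape
    feats = []
    for t in range(0, T - patch + 1, patch):
        for y in range(0, H - patch + 1, patch):
            for x in range(0, W - patch + 1, patch):
                feats.append(dt[t:t+patch, y:y+patch, x:x+patch].ravel())
    return np.array(feats)

# Quantization into a 128-word vocabulary, e.g. with scikit-learn:
# from sklearn.cluster import KMeans
# words = KMeans(n_clusters=128).fit_predict(dense_features(video))
```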
Features 32
Discussion: Local Spatiotemporal Features
• Benefits of local feature methods:
– Robustness to viewpoint changes and occlusion.
– Relatively computationally inexpensive.
– Do not need to detect and track the agent.
– Implicitly incorporate motion, form, and context.
• But they may be too limited for comprehensive activity recognition:
– Temporal structure is diminished or lost.
– Human performance suggests a broader spatial and temporal range may be needed for good activity recognition.
– They typically do not incorporate any inter-relationships among the extracted features or points.
Features 33
Trajectories by Local Keypoint Tracking
Source: Messing et al. “Activity Recognition using velocity histories of tracked keypoints.” ICCV 2009.
• Detects corners in the image and tracks them using a KLT tracker (see the sketch below).
– 500 points at a time, with replacement.
– Mean track duration is 150 frames.
• Represents trajectories by quantized trajectory velocities.
• Learns a mixture model over velocity Markov chains.
• Each action has a distribution over the mixture components.
• A joint model over action and observations is learned via EM.
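A minimal OpenCV sketch of the corner-detection-plus-KLT front end; the file name, corner-quality settings, and re-detection policy are assumptions, and the velocity quantization and Markov-chain model are not shown:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("video.avi")          # hypothetical input
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(prev, maxCorners=500,
                              qualityLevel=0.01, minDistance=5)
tracks = [[tuple(p.ravel())] for p in pts]
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None)
    keep = status.ravel() == 1
    tracks = [t for t, k in zip(tracks, keep) if k]
    pts = nxt[keep].reshape(-1, 1, 2)
    for t, p in zip(tracks, pts):            # extend surviving tracks
        t.append(tuple(p.ravel()))
    if len(pts) < 500:                       # replace lost points
        extra = cv2.goodFeaturesToTrack(gray, 500 - len(pts), 0.01, 5)
        if extra is not None:
            pts = np.vstack([pts, extra]).astype(np.float32)
            tracks += [[tuple(p.ravel())] for p in extra]
    prev = gray
# Each entry of `tracks` is a keypoint trajectory; velocity histories
# are the frame-to-frame differences along each trajectory.
```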
Features 34
Trajectories by Local Keypoint Tracking
Source: Messing et al. “Activity Recognition using velocity histories of tracked keypoints.” ICCV 2009.
Features 35
Dense Trajectories
Source: Wang et al. “Action Recognition by Dense Trajectories.” CVPR 2011.
• Dense sampling improves object recognition and action recognition; why not use it for trajectories?
• Matching features across frames is very expensive.
• Proposes a method to track the trajectories densely using a single dense optical flow field calculation per frame pair.
– Global smoothness enforced.
• Computes the descriptors aligned with the trajectories using HOG/HOF/MBH.
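A minimal sketch of one tracking step, assuming OpenCV's Farneback flow as the dense estimator and a median-filtered flow lookup as in the paper (the specific parameter values are assumptions):

```python
import cv2
import numpy as np
from scipy.ndimage import median_filter

def step_trajectories(prev_gray, gray, points):
    # Propagate dense trajectory points through one frame using a
    # single dense optical flow field, median-filtered for robustness.
    # points: (N, 2) float array of (x, y) positions.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx = median_filter(flow[..., 0], size=3)
    fy = median_filter(flow[..., 1], size=3)
    xi = np.clip(points[:, 0].round().astype(int), 0, gray.shape[1] - 1)
    yi = np.clip(points[:, 1].round().astype(int), 0, gray.shape[0] - 1)
    return points + np.stack([fx[yi, xi], fy[yi, xi]], axis=1)
```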
Features 36
Dense Trajectories: Convincing Improvements
Source: Wang et al. “Action Recognition by Dense Trajectories.” CVPR 2011.
37
Descriptors
Descriptors 38
Local Descriptors: HOG/HOF
Source: materials adapted from Laptev’s CVPR 2008 slides.
Description (sparse or dense) in space-time patches:
• Histogram of oriented spatial gradients (HOG): a 3x3x2 space-time grid with 4 orientation bins per cell, giving a 3x3x2x4-bin descriptor.
• Histogram of optical flow (HOF): a 3x3x2 space-time grid with 5 bins per cell, giving a 3x3x2x5-bin descriptor.
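As an illustration, a toy HOG computation over a space-time patch with the 3x3x2 grid and 4 unsigned orientation bins described above; block normalization and interpolation are omitted, so this is a sketch rather than Laptev's implementation:

```python
import numpy as np

def hog_descriptor(patch, grid=(3, 3, 2), nbins=4):
    # patch: (T, H, W) space-time patch. Histogram spatial gradient
    # orientations per grid cell, weighted by gradient magnitude.
    gy, gx = np.gradient(patch.astype(float), axis=(1, 2))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned orientation
    bins = np.minimum((ang / np.pi * nbins).astype(int), nbins - 1)
    T, H, W = patch.shape
    desc = np.zeros((grid[0], grid[1], grid[2], nbins))
    for t in range(T):
        for y in range(H):
            for x in range(W):
                cell = (y * grid[0] // H, x * grid[1] // W, t * grid[2] // T)
                desc[cell][bins[t, y, x]] += mag[t, y, x]
    d = desc.ravel()
    return d / (np.linalg.norm(d) + 1e-10)
```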
39
Motion Boundary Histograms
Source: Dalal et al. “Human Detection Using Oriented Histograms of Flow and Appearance.” ECCV 2006.
• Rather than histogramming the optical flow directly (HOF), MBH focuses on histograms of differential optical flow.
– Descriptive of motion articulation yet resistant to background and camera motion.
• Compute optical flow and take differentials separately over the dx and dy components; use separate histograms over the resulting dx and dy images.
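A compact sketch for a single frame pair, assuming Farneback flow and whole-image histograms; real MBH pools over cells and blocks, as in HOG:

```python
import cv2
import numpy as np

def mbh(prev_gray, gray, nbins=8):
    # Histogram the gradients of each optical flow component
    # separately (flow_x and flow_y treated as images).
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    def grad_hist(channel):
        gy, gx = np.gradient(channel)
        mag = np.hypot(gx, gy)
        ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
        bins = np.minimum((ang / (2 * np.pi) * nbins).astype(int), nbins - 1)
        h = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=nbins)
        return h / (np.linalg.norm(h) + 1e-10)
    return np.concatenate([grad_hist(flow[..., 0]), grad_hist(flow[..., 1])])
```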
Descriptors 40
Local Descriptors: HOG3D
Source: Kläser et al. “A Spatio-Temporal Descriptor Based on 3-D Gradients.” BMVC 2008. And the provided poster.
Descriptors 41
Local Descriptors: HOG3D
Source: Kläser et al. “A Spatio-Temporal Descriptor Based on 3-D Gradients.” BMVC 2008. And the provided poster.
Descriptors 42
Local Descriptors: Motion Interchange Patterns
Source: Kliper-Gross et al. “Motion Interchange Patterns for Action Recognition in Unconstrained Videos.” ECCV 2012. And the provided slides.
• A local-binary-patterns-based video descriptor.
– Dense characterization of motion changes.
– Captures the shape of moving edges.
– The methodology incorporates a stabilization mechanism.
• Incorporates a per-pixel encoding using binary/trinary digits.
• The descriptor is the frequency of the binary/trinary strings.
Slide from O. Kliper-Gross.
Descriptors 43
Local Descriptors: Motion Interchange Patterns
Source: Kliper-Gross et al. “Motion Interchange Patterns for Action Recognition in Unconstrained Videos.” ECCV 2012. And the provided slides.
[Figure: for a pixel x in frame t, two SSD patch comparisons are computed, one against a patch offset by direction i in frame t-1 and one against a patch offset by direction j in frame t+1, with α the angle between the offsets; the comparison is encoded as a trinary digit in {-1, 0, 1}.]
Slide from O. Kliper-Gross.
Descriptors 44
Local Descriptors: Motion Interchange Patterns
Source: Kliper-Gross et al. “Motion Interchange Patterns for Action Recognition in Unconstrained Videos.” ECCV 2012. And the provided slides.
[Figure: an example with α = 0; the patch at x in frame t is compared by SSD to patches along collinear offsets in frames t-1 and t+1, yielding a digit in {-1, 0, 1}.]
Slide from O. Kliper-Gross.
Descriptors 45
Local Descriptors: Motion Interchange Patterns
Source: Kliper-Gross et al. “Motion Interchange Patterns for Action Recognition in Unconstrained Videos.” ECCV 2012. And the provided slides.
[Figure: different values of α give different channels (diagonals); collecting all SSD comparisons yields a 64-digit trinary code per pixel, each digit in {-1, 0, 1}.]
Slide from O. Kliper-Gross.
Descriptors 46
Local Descriptors: Motion Interchange Patterns
Source: Kliper-Gross et al. “Motion Interchange Patterns for Action Recognition in Unconstrained Videos.” ECCV 2012. And the provided slides.
Each α defines a channel → 8 channels.
[Figure: each pixel receives a 64-digit trinary code (8 digits in each of the 8 channels). Per channel, the 8 trinary digits are packed into 2 integers per pixel, each in 0-255: one 8-bit string marking the +1 digits and one marking the -1 digits.]
Slide from O. Kliper-Gross.
Descriptors 47
Local Descriptors: Motion Interchange Patterns
Source: Kliper-Gross et al. “Motion Interchange Patterns for Action Recognition in Unconstrained Videos.” ECCV 2012. And the provided slides.
An example of one channel's basic coding:
• Vote for next frame.
• Vote for previous frame.
• Static edges.
MIP captures motion, motion changes, and shape.
Slide from O. Kliper-Gross.
Descriptors 48
Local Descriptors: Motion Interchange Patterns
Source: Kliper-Gross et al. “Motion Interchange Patterns for Action Recognition in Unconstrained Videos.” ECCV 2012. And the provided slides.
Suppressing background structure and noise:
• If the original coding gives 1 but the coding with switched patch locations gives -1, the code is suppressed (switched patch suppression).
• Two ways to look at this: no motion, or contradicted motion voting, i.e., the original coding is voted down while the switched patches are voted up, so the code is suppressed.
Slide from O. Kliper-Gross.
Descriptors 49
Local Descriptors: Motion Interchange Patterns
Source: Kliper-Gross et al. “Motion Interchange Patterns for Action Recognition in Unconstrained Videos.” ECCV 2012. And the provided slides.
An example of MIP suppression.
[Videos: the original clip, the coding without suppression, and the coding with suppression.]
Slide from O. Kliper-Gross.
Descriptors 50
Global Descriptors: Templates
Source: Derpanis et al. “Efficient action spotting based on a spacetime oriented structure representation.” CVPR 2010. And supporting videos.
[Videos: action templates (Squat, Spin-Left, Jump-Right) and the corresponding spotted actions.]
Descriptors 51
Action Spotting: Space-Time Oriented Energy
Source: Derpanis et al. “Efficient action spotting based on a spacetime oriented structure representation.” CVPR 2010. And supporting videos.
• Templates are decomposed into space-time oriented energy by broadly tuned third-order Gaussian derivative filters (a classical steerable construction; see the sketch below).
• This sets up a basis that makes it possible to compute the energy along any spatiotemporal orientation.
• The directions are specified according to the application: leftward, rightward, upward, downward, flicker, and so on.
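As a rough sketch of the idea: the directional third derivative of a Gaussian can be steered from the ten separable third-order partials, and squaring and pooling the response gives an oriented energy. This is a generic construction under assumed parameters, not Derpanis et al.'s exact filter bank:

```python
import numpy as np
from itertools import combinations_with_replacement
from math import factorial
from scipy.ndimage import gaussian_filter

def oriented_energy(video, direction, sigma=2.0):
    # video: (T, H, W) array; direction: 3-vector in (t, y, x).
    d = np.asarray(direction, float)
    d /= np.linalg.norm(d)
    resp = np.zeros_like(video, dtype=float)
    # (d . grad)^3 expands into the 10 third-order partial derivatives
    # with multinomial weights; each partial is a separable Gaussian
    # derivative filter.
    for idx in combinations_with_replacement(range(3), 3):
        order = [idx.count(a) for a in range(3)]
        coef = factorial(3) / np.prod([factorial(o) for o in order])
        coef *= np.prod([d[a] ** o for a, o in enumerate(order)])
        resp += coef * gaussian_filter(video, sigma, order=order)
    return gaussian_filter(resp ** 2, 2 * sigma)  # rectify and pool
```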
Descriptors 52
Action Spotting: Space-Time Oriented Energy
Source: Derpanis et al. “Efficient action spotting based on a spacetime oriented structure representation.” CVPR 2010. And supporting videos.
[Videos: an input video and its space-time oriented energies in the left, right, up, down, static, and flicker channels.]
Descriptors 53
Pure Motion Energy
Descriptors 54
Pure Motion Energy
[Videos: raw space-time oriented energies (left, right, up, down, static, flicker).]
The raw energies capture structure as well as motion, so differences are taken (removing the static structure component) to obtain pure motion energies.
[Videos: pure space-time oriented energies (left, right, up, down, static, flicker).]
Descriptors 55
Spotting – Template Matching
• Standard Bhattacharyya matching is used to correlate the templates against a query video.
– The space-time oriented energy vector-videos are normalized.
– Correlation is computed in each energy channel separately and summed, producing an output correlation video:
m(x) = Σ_u Σ_c sqrt( T_c(u) S_c(x + u) ),
where T is the template, S is the query video, c indexes the energy channels, and u ranges over the support of the template; m is the correlation video.
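A compact sketch of this matching, assuming per-pixel-normalized energy volumes of shape (channels, T, H, W); scipy's correlate is used as a stand-in for an efficient frequency-domain implementation:

```python
import numpy as np
from scipy.ndimage import correlate

def spot_actions(template, query):
    # template, query: (C, T, H, W) space-time oriented energies,
    # normalized per pixel so the channels sum to 1.
    # Bhattacharyya matching = cross-correlation of sqrt energies.
    score = np.zeros(query.shape[1:])
    for c in range(template.shape[0]):
        score += correlate(np.sqrt(query[c]), np.sqrt(template[c]),
                           mode='constant')
    return score   # peaks indicate spotted actions
```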
56
Descriptors 57
Basic Idea in Action Bank
[Diagram: an input video is run through a bank of action detectors (e.g., biking, javelin, jump rope, fencing), each with multiple views (view 1, view 2, ..., view n). The resulting detection volumes are volumetrically max-pooled to form the action bank representation, which feeds SVM classifiers. Semantics transfer via the high-level representation, e.g., positive: jumping, throwing, running, ...; negative: biking, fencing, drumming, ....]
[Sadanand and Corso, CVPR 2012]
58
Action Bank Representation
• Each correlation video is max-pooled to form the action bank vector.
– Hierarchical (3 layers).
– Volumetric/space-time.
• We have used a standard SVM as the final classifier on this representation.
– Tried L1 regularization (no significant change).
– Tried random forests (no significant change).
• Depending on the data set, we use one or two scales of the detectors.
• All action bank vectors for the major data sets are available on our website in Python and Matlab formats.
[Sadanand and Corso, CVPR 2012]
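A minimal sketch of 3-layer volumetric max pooling for a single detector's correlation volume; the octree-style 1 + 8 + 64 split is an assumption consistent with the three layers noted above:

```python
import numpy as np

def volumetric_max_pool(corr, levels=3):
    # corr: (T, H, W) correlation volume. At level l the volume is
    # split into 2^l x 2^l x 2^l blocks and the max of each is kept,
    # giving 1 + 8 + 64 = 73 values for levels=3.
    feats = []
    T, H, W = corr.shape
    for l in range(levels):
        n = 2 ** l
        for t in range(n):
            for y in range(n):
                for x in range(n):
                    block = corr[t*T//n:(t+1)*T//n,
                                 y*H//n:(y+1)*H//n,
                                 x*W//n:(x+1)*W//n]
                    feats.append(block.max())
    return np.array(feats)  # one detector's chunk of the bank vector
```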
59
Building the Bank
• The current bank has 205 templates:
– examples from all 50 UCF50 actions and all 6 KTH actions, as well as the digging actions from visint.org;
– 3-6 examples per action, covering various viewpoints and styles of the particular action, with tempo variation when possible;
– an average spatial resolution of 50x120 pixels and lengths of 40-50 frames.
• Template preparation:
– Each template was selected as soon as a plausible new viewpoint/tempo/style was found.
– Each template was manually cropped to the full space-time extent of the human within it.
– No optimization of the bank was done at any point.
60
Performance Results of Action Bank
KTH Results
UCF Sports
[Sadanand and Corso, CVPR 2012]
61
Large-Scale Results of Action Bank
HMDB51-V: video-wise 10-fold CV
HMDB51-B: three benchmark splits
UCF50-V: video-wise 10-fold CV
UCF50-G: group-wise 5-fold CV
[Sadanand and Corso, CVPR 2012]
62
[Figure: per-class results, ranked from worst to best.]
63
Discussion: Semantics Transfer
• Can action bank permit a high-level transfer of semantics from the bank templates through to the final classifier?
The KTH data set is used for this study.
[Sadanand and Corso, CVPR 2012]
64
Discussion: Does Size Matter?
The UCF Sports data set is used for this study.
[Sadanand and Corso, CVPR 2012]
65
Closing Remarks
• Summary
– Overview of motion perception and activity recognition.
– Low-level features
• STIP
• Cuboids
• Trajectories
– Low-level descriptors
• HOG/HOF
• HOG3D
• MBH
• Motion Interchange Patterns
• Challenges
– Space-time or “space and time”?
– Building intuition and understanding for the low-level features and
descriptors. Where are they good? Where do they fail?