Activity Recognition
Ram Nevatia
Presents work of
F. Lv, P. Natarajan and V. Singh
Institute of Robotics and Intelligent Systems
Computer Science Department
Viterbi School of Engineering
University of Southern California
Activity Recognition: Motivation
• Is the key content of a video (along with scene
description)
• Useful for
  • Monitoring (alerts)
  • Indexing (forensic, deep analysis, entertainment…)
  • HCI
  • …
Issues in Activity Recognition
• Inherent ambiguities of 2-D videos
• Variations in image/video appearance due to changes in
viewpoint, illumination, clothing (texture)….
• Variations in style: different actors or even the same
actor at different times
• Reliable detection and tracking of objects, especially
those directly involved in activities
• Temporal segmentation
• Most work assumes single activity in a given clip
• “Recognition” of novel events
Possible Approaches
• Match video signals directly
• Dynamic time warping
• Extract spatio-temporal features, classify based on
them
• Bag of words, histograms, “clouds”…
• Work of Laptev et al
• Most earlier work assumes action segmentation (detection vs
classification)
• Andrew’s talk on use of localization and tracking
• Structural Approach
• Based on detection of objects, their tracks and relationships
• Requires ability to perform above operations
Event Hierarchy
• Composite Events
• Compositions of other, simpler events.
• Composition is usually, but not necessarily, a sequence
operation, e.g. getting out of a car, opening a door and entering
a building.
• Form a natural hierarchy (or lattice)
• Primitive events: those we choose not to decompose,
e.g. walking
• Recognized directly from observations
• Graphical models, such as HMMs and CRFs, are natural tools for recognition of composite events.
Key Ideas
– Only need a few primitive actions in any domain.
– Sign Language: Moves and Holds.
– Human Pose Articulation: Rotate, Flex and Pause.
– Rigid Objects (cars, people): Translate, Rotate, Scale.
– Can be represented symbolically using formal rules.
– Composite Actions can be represented as combinations
of the primitive actions.
– Handle uncertainty and error in video by mapping rules to graphical models:
– HMM.
– DBN.
– CRF.
Graphical Models
• A network, normally used to represent the temporal evolution of a state
• The next state depends only on the previous state; the observation depends only on the current state (a single state variable)
• A typical task is to estimate the most likely state sequence given an observation sequence, via the Viterbi algorithm (a sketch follows)
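For concreteness, a minimal NumPy sketch of Viterbi decoding for an HMM; the log-space representation and variable names are illustrative choices, not from the talk.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden-state sequence of an HMM (log-space).

    log_pi: (S,)   log initial state probabilities
    log_A:  (S, S) log transition probabilities, log_A[i, j] = log P(j | i)
    log_B:  (S, O) log observation probabilities
    obs:    length-T list of observation indices
    """
    T, S = len(obs), len(log_pi)
    delta = np.empty((T, S))             # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers for path recovery
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # (prev, cur) score pairs
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]     # best final state, then trace back
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```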
An HMM
A CRF
Mid vs Near Range
• Mid-range
• Limbs of human body, particularly the arms, are not
distinguishable
• Common approach is to detect and track moving objects and
make inferences based on trajectories
• Near-range
• Hands/arms are visible; activities are defined by pose
transitions, not just the position transitions
• Pose tracking is difficult; top-down methods are commonly
used
Mid-Range Example
• Example of abandoned luggage detection
• Based on trajectory analysis and simple object detection/recognition
• Uses a simple Bayesian classifier and logical reasoning about the order of sub-events (a sketch of such a trajectory-level rule follows)
• Tested on PETS, ETISEO and TRECVID data
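The slides do not spell out the rule, so the following is only a plausible trajectory-level check; the Track structure, the units (ground-plane meters) and the thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Track:
    positions: list   # per-frame (x, y) ground-plane centroids, in meters
    label: str        # object class from the detector, e.g. "person", "luggage"

def dist(p, q):
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def abandoned_luggage(luggage: Track, owner: Track,
                      still_frames=750, leave_dist=3.0):
    """Flag luggage left behind: the bag stays put while its owner walks away.

    still_frames and leave_dist are illustrative values (30 s at 25 fps, 3 m),
    not the paper's.
    """
    recent = luggage.positions[-still_frames:]
    stationary = all(dist(p, recent[0]) < 0.2 for p in recent)
    owner_gone = dist(owner.positions[-1], recent[0]) > leave_dist
    return stationary and owner_gone
```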
Top-Down Approaches
• Bottom-up methods remain slow and are not robust; many methods are based on the use of multiple video streams
• An alternative is top-down approaches, where processing is driven by event models
• Simultaneous Tracking and Action Recognition
(STAR)
• In analogy with SLAM in robotics
• Provides action segmentation, in addition to recognition
• Closed-world assumption
• Current work limited to single actor actions
Activity Recognition w/o Tracking
[Figure: an input sequence is segmented into action segments (check watch, punch, kick, pick up, throw), each paired with recovered 3D body pose]
Difficulties
• Viewpoint change & pose ambiguity (with a single camera view)
• Spatial and temporal variations (style, speed)
Key Poses and Action Nets
• Key poses are determined from MoCap data by an
automatic method that computes large changes in energy;
key poses may be shared among different actions
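A sketch of one plausible reading of this step: compute per-frame motion energy from the MoCap joint velocities and keep pronounced local minima (momentary pauses between large energy changes) as key poses. The criterion and threshold are assumptions; the slides only say the method computes large changes in energy.

```python
import numpy as np

def key_pose_indices(joints, threshold=0.5):
    """joints: (T, J, 3) MoCap joint positions over T frames.

    Returns frame indices of candidate key poses: local minima of motion
    energy sitting between noticeably higher-energy neighbors.
    """
    vel = np.diff(joints, axis=0)           # per-frame joint velocities
    energy = (vel ** 2).sum(axis=(1, 2))    # scalar motion energy per frame
    return [t for t in range(1, len(energy) - 1)
            if energy[t] <= min(energy[t - 1], energy[t + 1])
            and max(energy[t - 1], energy[t + 1]) - energy[t] > threshold]
```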
Experiments: Training Set
• 15 action models
• 177 key poses
• 6372 nodes in Action Net
Action Net: Apply constraints
[Figure: action net with viewpoint constraints sampled at 0°, 10°, …]
Experiments: Test Set
• 50 clips, average length 1165 frames
• 5 viewpoints
• 10 actors (5 men, 5 women)
A Video Result
[Video frames: original frame; extracted blob & ground truth; tracking results without and with the action net]
Working with Natural Environments
• Reduce reliance on good foreground segmentation
• Key poses may not be discriminative enough w/o
accurate segmentation; include models for motion
between key poses
• More general graphical models that include
  • Hierarchy
  • Transition probabilities that may depend on observations
  • Observations that may depend on multiple states
  • Duration models (HMMs imply an exponential decay)
• Remove need for MoCap data to acquire models
Composite Event Representation
CE: Sequence(P1, P2)
P1: Rotate(RightArm, 90°, z-axis)
P2: Rotate(RightArm, 90°, -z-axis)
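One way this symbolic representation might look in code; the class and field names are illustrative, not the authors' notation.

```python
from dataclasses import dataclass

@dataclass
class Rotate:
    part: str      # body part, e.g. "RightArm"
    angle: float   # rotation angle in degrees
    axis: str      # rotation axis, e.g. "z" or "-z"

@dataclass
class Sequence:
    steps: list    # ordered sub-events

# The composite event above: raise the right arm, then lower it.
CE = Sequence([Rotate("RightArm", 90.0, "z"),
               Rotate("RightArm", 90.0, "-z")])
```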
Learning Event Models
Primitive Event P1
Primitive Event P2
Composite Event = Sequence(P1,P2)
Dynamic Bayesian Action Network
– Map action models to a Dynamic Bayesian Network
– Decompose a composite action into a sequence of primitive actions
– Each primitive is expressed in a function form f_pe(s, s', N)
  – Maps the current state s to the next state s' given parameters N
– Assume a known, finite set of functions f for primitives
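A toy sketch of the function form f_pe(s, s', N), assuming the state is a dictionary of joint angles and N supplies the rotating part plus a per-frame angular increment; the real state also carries action and duration variables.

```python
import copy

def f_pe(s, N):
    """Map the current state s to the predicted next state s'.

    s: dict of joint angles in degrees (a simplifying assumption)
    N: (part, omega) - the rotating part and its per-frame increment
    """
    part, omega = N
    s_next = copy.deepcopy(s)
    s_next[part] += omega          # advance the rotation by one frame
    return s_next

# One prediction step of P1, raising the right arm about the z-axis:
state = f_pe({"RightArm": 0.0}, ("RightArm", 5.0))   # 5 deg/frame, illustrative
```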
Inference Overview
• Given a video, obtain initial state distribution with
start key pose for all composite actions
• For each current state:
• Predict the primitive based on the current duration
• Predict a 3D pose given the primitive and current duration
• Collect the observation potential of the pose using
foreground overlap and difference image
• Obtain the best state sequence using dynamic
programming (Viterbi Algorithm)
• Features used to match models with observations
• If “foreground” can be extracted reliably, then we can use
blob shape properties; otherwise, use edge and motion
flow matching
Pose Tracking & Action Recognition
• Obtain state distributions by matching poses sampled
from action models
• Infer the action by finding the maximum likelihood state sequence
Inference Algorithm
Observations
• Foreground overlap with the full-body model
• Difference-image overlap with the body parts moving in the action
• Grid of centroids to match the foreground blob with the pose (a sketch combining these features follows)
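A sketch of how these features could be combined into a single observation potential; the normalization and weighting are assumptions, not the paper's exact formulation.

```python
import numpy as np

def observation_potential(fg_mask, diff_img, pose_mask, part_masks, w):
    """Score a hypothesized pose against the image evidence.

    fg_mask:    boolean foreground mask from background subtraction
    diff_img:   boolean frame-difference image (moving pixels)
    pose_mask:  boolean rendering of the full-body model at the pose
    part_masks: boolean masks of the body parts moving in this action
    w:          (w1, w2) feature weights
    """
    overlap = (fg_mask & pose_mask).sum() / max(pose_mask.sum(), 1)
    motion = sum((diff_img & m).sum() / max(m.sum(), 1) for m in part_masks)
    return w[0] * overlap + w[1] * motion
```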
Results
• From CVPR08 paper
Action Learning
• Involves two problems
• Model Learning: Learning parameters N in the primitive event definition f_pe(s, s', N).
– Key Pose Annotation and Lifting.
– Pose Interpolation
• Feature Weight Learning: Learning the weights w_k of the different potentials.
Key Pose Annotation and 3D Lifting
Pose Interpolation
• All limb motions can be expressed in terms of Rotate(part, axis, θ).
• We need to learn axis and θ.
• Simple to do given the start and end joints of the part, as the sketch below shows.
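Given 3D positions of the part's proximal and distal joints in the two key poses, axis and θ follow from the cross product and the angle between the limb vectors; a minimal sketch (the function and argument names are mine):

```python
import numpy as np

def learn_rotation(joint, distal_start, distal_end):
    """Recover (axis, theta) for Rotate(part, axis, theta).

    joint:        3D position of the part's fixed (proximal) joint
    distal_start: 3D position of the distal joint in the start key pose
    distal_end:   3D position of the distal joint in the end key pose
    """
    u = distal_start - joint
    v = distal_end - joint
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    axis = np.cross(u, v)              # rotation axis (undefined if u || v)
    axis /= np.linalg.norm(axis)
    theta = np.degrees(np.arccos(np.clip(u @ v, -1.0, 1.0)))
    return axis, theta
```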
Feature Weight Learning
• Feature weight estimation as minimization of a log-likelihood error function
• Learn the weights using the Voted Perceptron algorithm
  • Requires fully labeled training data -> not available
  • We propose an extension, the Latent State Voted Perceptron, to deal with partial annotations
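For reference, a minimal sketch of the standard averaged (voted) perceptron for structured weights in the fully supervised case; the latent-state extension for partial annotations is not shown. decode and features are assumed callbacks.

```python
import numpy as np

def voted_perceptron(examples, decode, features, dim, epochs=5):
    """examples: list of (obs, gold_state_sequence) pairs
    decode:   inference (e.g. Viterbi) returning the best sequence under w
    features: maps (obs, state_sequence) to a length-dim feature vector
    """
    w = np.zeros(dim)
    w_sum = np.zeros(dim)              # running sum -> averaged weights
    for _ in range(epochs):
        for obs, gold in examples:
            pred = decode(obs, w)
            if pred != gold:           # perceptron update on a mistake
                w += features(obs, gold) - features(obs, pred)
            w_sum += w
    return w_sum / (epochs * len(examples))
```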
Experiments
• Tested method on 3 datasets
• Weizmann dataset
• Gesture set with arm gestures
• Grocery Store set with full body actions
Dataset        Train:Test   Action Recognition   2D Tracking   Speed
               Ratio        (% accuracy)         (% error)     (fps)
Weizmann       3:6          99.5                 --            --
Gesture        3:5          90.18                5.25          8
Grocery Store  1:7          100.0                11.88         1.6
Weizmann Dataset
• Popular dataset for action recognition
• 10 full-body actions from 9 actors
• Each video has multiple instances of one action
Method                  Train:Test   Recognition Accuracy
Jhuang et al [9]        6:3          98.8
Space-Time Shapes [6]   8:1          100.0
Fathi et al [5]         8:1          100.0
Sun et al [20]          3:6          87.3
DBAN                    1:8          96.7
DBAN                    3:6          99.5
Gesture Dataset
• 5 instances of 12 gestures from 8 actors
• Indoor lab setting
• 500 instances of all actions
• 852x480 pixel resolution; person height 200-250 pixels
Grocery Store Dataset
• Videos of 3 actions collected from a static camera.
• 16 videos from 8 actors, performed at varying pan angles.
• Actor height varies from 200-375 pixels, in 852x480 resolution videos.
Incorporating Better Descriptors
• Previous work based on weak lower-level analysis
• We can also evaluate 2D part models
Dynamic Bayesian Action Network
with Part Model
Experiments
• Hand gesture dataset in an indoor lab
  • 5 instances of 12 gestures from 8 actors, total of 500 action segments
• Evaluation metrics
  • Recognition rate over all action segments
  • 2D pose tracking as average 2D part accuracy over 48 randomly selected instances

Method       Train:Test   Recognition    2D Tracking
             Ratio        (% accuracy)   (% accuracy)
DBAN-FGM     1:7          78.6           75.67 (89.94)
DBAN-Parts   1:7          84.52          91.76 (92.66)
Summary and Conclusions
• Structural approach to activity recognition offers
many attractions and challenges
• Results are descriptive but detecting and tracking objects is
challenging
• Hierarchical representation is natural and can be
used to reduce complexity
• Good bottom-up analysis remains a key to improved
robustness
• Concept of “novel” or “anomalous” events remains
difficult to formalize