Lecture 1: Human Activity Analysis

Human Activity Analysis
吴心筱 (Xinxiao Wu)
wuxinxiao@bit.edu.cn
Based on the CVPR 2011 Tutorial: Human Activity Analysis.
http://cvrc.ece.utexas.edu/mryoo/cvpr2011tutorial/
Introduction
Understanding People in Video
Goal
•Find the people
•Infer their poses
•Recognize what they do
Level of People Understanding
Object-Level Understanding
• Locations of persons and objects
Tracking-Level Understanding
• Object trajectories (correspondence across frames)
Level of People Understanding
Pose-Level Understanding
• Human body parts
Activity-Level Understanding
• Recognition of human activities and
events
Object Detection
Pedestrian (i.e. human) detection
• Detect all the persons in the video
Object Tracking
Person tracking
• Tracking the person in every frame
Pose Estimation
Human Pose
• Joint locations or angles
of a person measured per
frame
Video as a sequence of poses
Human Activity Recognition
Human Activity
• A collection of human/object movements
with a particular semantic meaning
Activity Recognition
• Finding video segments containing such movements
Levels of Human Activities
Categorized based on their complexity
(forming a hierarchy by # of participants)
 Gestures
 Actions
 Interactions
 Group Activities
Gestures
 Atomic components
 Single body-part movements
Actions
Single actor movement
Composed of multiple gestures
organized temporally
Interactions
Human-human interaction
Human-object interaction
Group Activities
Multiple persons or objects
Applications
Surveillance
Monitor suspicious activities
for real-time reactions.
(e.g., ‘fighting’, ‘stealing’)
Currently, surveillance
systems are mainly for
recording.
Activity recognition is essential for
surveillance and other monitoring
systems in public places
Intelligent Environments (HCI)
Intelligent home, office, and workspace
Monitoring of elderly people and children.
Recognition of ongoing activities and
understanding of current context is
essential.
Sports Play Analysis
Web-based video retrieval
YouTube
20 hours of videos uploaded every minute
Content-based search
Search based on contents of the video,
instead of user-attached keywords
Example: search ‘kiss’ from long movies
Challenges
Robustness
Environment variations
Background
Moving backgrounds
Pedestrians
Occlusions
Viewpoints – moving camera
Motion Style
Each person has his/her own style of
executing an activity
Who stretches his hand first?
How long does one keep his hand stretched?
Various Activities
There are various types of activities
The ultimate goal is to make computers
recognize all of them reliably.
Learning
Insufficient amount of training videos
Traditional setting: Supervised learning
Human efforts are expensive!
Unsupervised learning
Interactive learning
Overview
Activity Classification
Simple task of identifying videos
• Classify given videos into their types.
Known, limited number of classes
Assumes that each video contains a single activity
Activity Detection
Search for the particular time interval
• <starting time, ending time>
•Video segment containing the activity
Activity Detection by Classification
Binary classifier
(output: “Yes, punch!”)
Sliding window technique
• Classify all possible time intervals
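The sliding-window idea above can be sketched in a few lines. This is a minimal illustration, not the tutorial's implementation: the per-frame scores and the threshold are made-up values, and a real detector would score whole space-time sub-volumes with a trained binary classifier.

```python
import numpy as np

def sliding_window_detect(frame_scores, win_len, threshold):
    """Return (start, end) frame intervals whose mean classifier
    score exceeds the threshold. `frame_scores` stands in for the
    output of a per-frame binary activity classifier."""
    detections = []
    for start in range(len(frame_scores) - win_len + 1):
        end = start + win_len
        if np.mean(frame_scores[start:end]) > threshold:
            detections.append((start, end))
    return detections

# Toy example: frames 3-6 contain the activity.
scores = np.array([0.1, 0.2, 0.1, 0.9, 0.95, 0.9, 0.85, 0.2, 0.1])
dets = sliding_window_detect(scores, win_len=4, threshold=0.8)
```

In practice all window lengths (not just one) are scanned, and overlapping detections are merged by non-maximum suppression.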
Recognition Process
Represent videos in terms of features
• Captures properties of activity videos
Classify activities
by comparing video
representations
• Decision boundary
Approaches
Approach-based taxonomy
Single-layered vs. hierarchical
Single-layered approaches
Hierarchical approaches
Single-layered approaches
Two different types
Space-time approaches (data-oriented)
•Activities as video observations
•3D space-time volume (3D XYT volume)
•A set of features extracted from the volume
Two different types
Sequential approaches (semantic-oriented)
• Activities as human movements
• A sequence of particular observations (feature vectors)
Space-time approaches
Space-time volumes
Space-time local features
Space-time trajectories
Space-time volumes
Problem: matching between two volumes
Motion history images
Bobick and Davis, 2001
Motion history images
(MHIs)
Weighted projection of
an XYT foreground
volume
Template matching
[Bobick, A. and Davis, J., The recognition of human movement using temporal
templates, IEEE T-PAMI 23(3), 2001]
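A minimal sketch of the MHI update, assuming binary foreground masks are already available (e.g., from background subtraction): pixels that moved most recently take the value τ, and older motion fades linearly toward zero, giving the weighted projection of the XYT volume.

```python
import numpy as np

def motion_history_image(fg_masks, tau):
    """Motion history image in the spirit of Bobick & Davis 2001:
    recent motion is bright (value tau), older motion decays by 1
    per frame, static pixels fall to 0."""
    mhi = np.zeros(fg_masks[0].shape, dtype=float)
    for mask in fg_masks:
        mhi = np.where(mask, float(tau), np.maximum(mhi - 1.0, 0.0))
    return mhi

# Toy 1x3-pixel video: motion sweeps left to right over 3 frames.
masks = [np.array([[True, False, False]]),
         np.array([[False, True, False]]),
         np.array([[False, False, True]])]
mhi = motion_history_image(masks, tau=3)
```

The resulting gradient of intensities encodes the direction of motion, which is what template matching against stored MHIs exploits.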
Segments
Ke, Sukthankar, Hebert 2007
Volume
matching based
on its segments.
Segment
matching scores
are combined.
[ Ke, Y., Sukthankar, R., and Hebert, M., Spatio-temporal shape and flow
correlation for action recognition. CVPR 2007]
Space-time local features
Local descriptors / interest points
From 2D to 3D; sparse
• Low-level: which local features to
extract
• Mid-level: how to represent the activity
using local features
• High-level: what method to use to classify
activities
Cuboid
Dollar et al., Cuboid, VS-PETS 2005
2D Gaussian smoothing kernel g(x, y; σ) applied spatially
1D Gabor filters applied temporally:
h_ev(t; τ, ω) = -cos(2πtω) e^(-t²/τ²), h_od(t; τ, ω) = -sin(2πtω) e^(-t²/τ²)
Response function: R = (I * g * h_ev)² + (I * g * h_od)²
[Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S., Behavior recognition via
sparse spatio-temporal features, VS-PETS 2005]
Cuboid
Dollar et al., Cuboid, VS-PETS 2005
Appearances of local 3-D XYT
volumes
Raw appearance
Gradients
Optical flows
Captures salient periodic motion.
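The detector's temporal part can be sketched as follows. This is a simplified illustration of the quadrature Gabor-pair response (the spatial Gaussian smoothing is omitted for brevity, and the toy video is invented): the response R fires where a pixel's intensity varies periodically in time.

```python
import numpy as np

def gabor_pair(tau, omega):
    # Quadrature pair of 1D Gabor filters (even / odd phase).
    t = np.arange(-2 * int(tau), 2 * int(tau) + 1, dtype=float)
    env = np.exp(-t ** 2 / tau ** 2)
    return (-np.cos(2 * np.pi * t * omega) * env,
            -np.sin(2 * np.pi * t * omega) * env)

def cuboid_response(video, tau=2.0, omega=0.25):
    """Sum of squared temporal Gabor responses at every pixel of a
    (T, H, W) video; peaks indicate salient periodic motion."""
    h_ev, h_od = gabor_pair(tau, omega)
    r_ev = np.apply_along_axis(
        lambda s: np.convolve(s, h_ev, mode="same"), 0, video)
    r_od = np.apply_along_axis(
        lambda s: np.convolve(s, h_od, mode="same"), 0, video)
    return r_ev ** 2 + r_od ** 2

# Toy video: pixel (0, 0) oscillates at the filter's frequency,
# pixel (0, 1) stays constant; R should fire only on the first.
video = np.zeros((32, 1, 2))
video[:, 0, 0] = np.cos(2 * np.pi * 0.25 * np.arange(32))
R = cuboid_response(video)
```

Local maxima of R are then kept as interest points, and the cuboids around them are described by raw appearance, gradients, or optical flow.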
STIP interest points
Laptev and Lindeberg, 2003
Introduced the KTH dataset
Space-time interest point detector
extending the Harris corner detector
Simple periodic actions
•[Schuldt, C., Laptev, I., and Caputo, B., Recognizing human actions: A local SVM
approach, ICPR 2004]
STIP interest points
Laptev and Lindeberg, 2003
Bag-of-words Representation
SVM classification
SVM classifier
(output: “Shake hands!”)
Multiple kernel learning
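The bag-of-words step between local feature extraction and the SVM can be sketched in a few lines. The two-word codebook and the descriptors below are toy values; in practice the codebook comes from k-means over training descriptors, and the resulting histograms are what the SVM classifies.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize each local space-time descriptor to its nearest
    codeword and return the normalized word-count histogram."""
    # Squared Euclidean distance of every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy data: three descriptors near codeword 0, one near codeword 1.
codebook = np.array([[0.0, 0.0], [10.0, 10.0]])
descriptors = np.array([[0.1, 0.0], [0.2, 0.1], [-0.1, 0.3], [9.8, 10.2]])
hist = bow_histogram(descriptors, codebook)
```

Note that the histogram discards the spatio-temporal layout of the features, which is part of why these methods struggle with complex activities.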
pLSA models
pLSA from text/document analysis
Probabilistic latent semantic analysis
Reasons about the probability that observed
features originated from a particular action class.
•[Niebles, J. C., Wang, H., and Fei-Fei, L., Unsupervised learning of human
action categories using spatial-temporal words, BMVC 2006]
pLSA models
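The EM updates behind pLSA can be sketched as follows; the counts matrix and topic number below are toy values, with latent topics z standing in for action categories as in the unsupervised setting of Niebles et al.

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Plain EM for pLSA. counts[d, w] = occurrences of
    spatio-temporal word w in video d. Returns P(w|z) and P(z|d)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z | d, w), shape (docs, words, topics).
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post = joint / (joint.sum(-1, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d) from weighted counts.
        weighted = counts[:, :, None] * post
        p_w_z = weighted.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(1)
        p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# Toy corpus: two videos with disjoint word usage.
counts = np.array([[10.0, 10.0, 0.0, 0.0],
                   [0.0, 0.0, 10.0, 10.0]])
p_w_z, p_z_d = plsa(counts, n_topics=2)
```

After fitting, a video is labeled with the topic maximizing P(z|d); no category labels are needed during training.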
Space-time trajectories
Trajectory patterns
Yilmaz and Shah, 2005 – UCF
Joint trajectories in 3-D XYT space.
• [Yilmaz, A. and Shah, M., Recognizing human actions in videos acquired by
uncalibrated moving cameras, ICCV 2005]
Space-time trajectories
Trajectory patterns
Yilmaz and Shah, 2005 – UCF
Compared trajectory shapes to classify
human actions.
Space-time approaches Summary
Space-time volumes
• A straightforward solution
• Difficulty in handling speed and motion variations
Space-time approaches Summary
Space-time local features
•Robust to noise and illumination changes
•Recognize multiple activities without
background subtraction or body-part
modeling
•Difficult to model more complex activities
Space-time approaches Summary
Space-time trajectories
•Perform detailed-level analysis
•View-invariant in most cases
•Difficult to extract the trajectories
Two different types
Sequential approaches (semantic-oriented)
• Activities as human movements
• A sequence of particular observations (feature vectors)
Sequential approaches
Exemplar-based approaches
State model-based approaches
Exemplar-based approaches
• Matching between the input sequence
of feature vectors and the template
sequences
• Problem: how to match two sequences
with different styles and at different rates
Dynamic Time Warping
Match two sequences with variations
Find an optimal nonlinear match
• [Yilmaz, A. and Shah, M., Recognizing human actions in videos acquired by
uncalibrated moving cameras, ICCV 2005]
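The standard DTW recurrence makes the nonlinear matching concrete; the sketch below uses 1-D feature sequences and absolute difference as the frame cost, both simplifications.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping cost between two 1-D feature sequences;
    the nonlinear alignment absorbs differences in execution rate."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: repeat a frame of b, repeat a frame of a, advance both.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# A slow execution aligns perfectly with a fast one of the same action.
d_same = dtw_distance([0, 0, 1, 1, 2, 2], [0, 1, 2])
d_diff = dtw_distance([0, 1, 2], [0, 1, 3])
```

An input video is classified by the exemplar template with the smallest warped distance.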
State model-based approaches
A human activity as a model composed of a
set of states
Each class has a corresponding model
Measure the likelihood between the model
and the input image sequence
HMMs
Given observations V (a sequence of
poses), find the HMM Mi that maximizes
P(V|Mi).
Transition probabilities a_ij and observation
probabilities b_ik are pre-trained using training
data.
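P(V|M_i) is computed with the forward algorithm, sketched below; the toy two-state model is invented to keep the numbers checkable by hand.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(V | M) for an HMM M = (pi, A, B) and an
    observation sequence V, as used to pick argmax_i P(V | M_i).
    A[i, j] = a_ij (transition), B[i, k] = b_ik (emission)."""
    # alpha[j] = P(o_1..o_t, state_t = j | M)
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Toy model: states alternate deterministically; state 0 always emits
# symbol 0 and state 1 always emits symbol 1.
pi = np.array([1.0, 0.0])
A = np.array([[0.0, 1.0], [1.0, 0.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0]])
p_match = forward_likelihood(pi, A, B, [0, 1, 0])
p_clash = forward_likelihood(pi, A, B, [0, 0, 0])
```

With one HMM per action class, the class whose model gives the largest forward likelihood wins; long sequences would use log-space or scaling to avoid underflow.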
HMMs for Actions
Each hidden state is trained to generate a
particular body posture.
Each HMM generates the pose sequences of
one action.
HMMs for Hand Gestures
HMMs for gesture recognition
American Sign Language (ASL)
Sequential HMMs
Shapes and positions of the hands
•[Starner, T. and Pentland, A., Real-time American Sign Language recognition
from video using hidden Markov models. International Symposium on
Computer Vision, 1995.]
Sequential approaches summary
Designed for modeling sequential dynamics
Markov process
Motion features are extracted per frame
Limitations
Feature extraction
Assumes good observation models
Complex human activities?
Large amount of training data
Exemplar-based vs. state model-based
Exemplar-based
Provide more flexibility for the recognition
system: multiple sample sequences
Less training data
State model-based
Make a probabilistic analysis of the activity
Model more complex activities
Hierarchical approaches
Hierarchy
Hierarchy implies decomposition into subparts
Hierarchical approaches
Statistical approaches
Syntactic approaches
Description-based approaches
Statistical approaches
Statistical state-based models for recognition
multiple layers of state-based models
low-level: a sequence of feature vectors
mid-level: a sequence of atomic action labels
high-level: label of activity
Layered hidden Markov models (LHMMs)
Oliver et al. 2002
Bottom layer HMMs recognize atomic
actions of a single person
Upper layer HMMs treat recognized
atomic actions as observations
……
• [Oliver, N., Horvitz, E., and Garg, A., Layered representations for
human activity recognition. In Proceedings of the IEEE International
Conference on Multimodal Interfaces (ICMI), 2002, 3-8.]
Statistical approaches summary
Reliably recognize activities from noisy
inputs, given enough training data
Inability to recognize activities with complex
temporal structures (e.g., concurrent subevents)
(Figure: subevents A and B occurring concurrently, overlapping in time)
Syntactic approaches
Model an activity as a string of symbols, where
each symbol corresponds to an atomic-level action
Parsing techniques from the field of
programming languages
Context-free grammars (CFG) and stochastic
context-free grammars (SCFGs)
CFG for human activities
Given the recognized atomic actions
Production rules
Parse tree
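A toy CFG recognizer makes the idea concrete: given the string of recognized atomic actions, check whether it derives from the activity's start symbol under the production rules. All symbols and rules below are invented for illustration.

```python
def parses(rules, symbol, tokens):
    """List of prefix lengths of `tokens` derivable from `symbol`
    under a toy CFG (no left recursion). Symbols absent from `rules`
    are terminals, i.e., atomic action labels."""
    if symbol not in rules:                       # terminal symbol
        return [1] if tokens and tokens[0] == symbol else []
    ends = []
    for production in rules[symbol]:
        spans = [0]                               # lengths matched so far
        for sym in production:
            spans = [s + n for s in spans
                     for n in parses(rules, sym, tokens[s:])]
        ends.extend(spans)
    return ends

def recognize(rules, start, tokens):
    """True if the whole token string derives from `start`."""
    return len(tokens) in parses(rules, start, tokens)

# Hypothetical grammar for a hand-shake activity.
rules = {
    "HandShake": [["Approach", "ShakeCore", "Withdraw"]],
    "ShakeCore": [["stretch", "shake"], ["stretch", "shake", "shake"]],
    "Approach":  [["approach"]],
    "Withdraw":  [["withdraw"]],
}
tokens = ["approach", "stretch", "shake", "shake", "withdraw"]
ok = recognize(rules, "HandShake", tokens)
bad = recognize(rules, "HandShake", tokens[:3])
```

A stochastic CFG would additionally attach probabilities to productions, letting the parser rank competing derivations of noisy atomic-action streams.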
Gesture analysis with CFGs
Primitive recognition with HMMs
Parse Tree
Syntactic approaches summary
Difficulty in recognizing concurrent
activities
Require a set of production rules for all
possible events
Directions: how to learn grammar rules from
observations automatically?
Description-based approaches
Represents a high-level human activity in
terms of simpler activities
Describe various relationships between
subevents (atomic actions)
temporal relationship
spatial relationship
logical relationship
Description-based approaches
Recognize activities using semantic matching.
Hand shake = “two persons do shake-action
(stretch, stay stretched, withdraw)
simultaneously, while touching”.
Recognition by finding observations satisfying the
definition.
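The hand-shake definition above reduces to a conjunction of interval predicates over detected subevents, in the spirit of Allen's temporal relations. The predicate names and the specific definition below are illustrative, not the tutorial's formalism; intervals are (start, end) frame pairs.

```python
def meets(a, b):
    """Interval a ends exactly where interval b begins."""
    return a[1] == b[0]

def during(a, b):
    """Interval a lies entirely inside interval b."""
    return b[0] <= a[0] and a[1] <= b[1]

def is_handshake(stretch, stay, withdraw, touching):
    """Toy description-based definition: the stretch phase meets the
    stay phase, the stay phase meets the withdraw phase, and the
    hands touch during the stay phase."""
    return (meets(stretch, stay) and meets(stay, withdraw)
            and during(touching, stay))

# Subevent intervals that satisfy the definition...
yes = is_handshake((0, 2), (2, 6), (6, 8), (3, 5))
# ...and a case where the touching happens too late.
no = is_handshake((0, 2), (2, 6), (6, 8), (7, 9))
```

Recognition then amounts to searching the detected subevents for an assignment that makes the definition true, which naturally handles concurrent subevents that statistical and syntactic approaches struggle with.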
Hierarchical approaches summary