9.913 Pattern Recognition for Vision
Class XIII, Motion and Gesture
Yuri Ivanov
Fall 2004
TOC
• Movement – Activity – Action
• View-based representation
• Sequence comparison
• Hidden Markov Models
• Hierarchical representations
From Tracking to Classification
How do we describe that?
How do we classify that?
Figure by MIT OCW.
Sequence Analysis
• We might want to ask:
– Is <it> doing something meaningful?
– What exactly?
– How does it do it?
• How fast – e.g. conducting
• How accurately – e.g. dance instruction
• What style?
• That leads us to sequence analysis
Motion Taxonomy
• Movement
– Primitive motion
– Self-evident, is “what it looks like”
• Activity
– Requires explicit sequence model
• Action
– Requires contextual information
– Requires relational information
– And many other things…
Images removed due to copyright considerations. Please see: Bobick, A.
"Movement, Activity, and Action: The Role of Knowledge in Perception
of Motion." Philosophical Transactions of the Royal Society B 352 (1997).
Basic Problem
Object trajectory in the feature space: each sample is a feature vector
\[ \mathbf{x}^k_t = [\, x^k_t,\; y^k_t,\; \theta^k_t,\; \ldots \,]^T \]
and each sequence is a string of such vectors:
\[ s_1 = \mathbf{x}^1_1 \ldots \mathbf{x}^1_M, \qquad s_2 = \mathbf{x}^2_1 \ldots \mathbf{x}^2_N \]
s_1 <> s_2 ?
Motion Energy Image
First idea – implicit representation of time
$D(x, y, t)$ - frame differencing
Sum the differences over the last τ frames:
\[ E_\tau(x, y, t) = \bigcup_{i=0}^{\tau-1} D(x, y, t-i) \]
- WHERE motion happened
Photographs and figures from: Bobick, A., and J. Davis. "The Representation and Recognition of Action Using Temporal Templates."
IEEE Transactions on Pattern Analysis and Machine Intelligence 23, no. 3 (2002). Courtesy of IEEE, A. Bobick, and J. Davis. Copyright 2002
IEEE. Used with Permission.
Motion History Image
Step two: include temporal information
\[ H_\tau(x, y, t) = \begin{cases} \tau & \text{if } D(x, y, t) = 1 \\ \max(0,\; H_\tau(x, y, t-1) - 1) & \text{otherwise} \end{cases} \]
- HOW motion happened
Aside - you can compute a similar measure recursively:
\[ H_\tau(x, y, t) = H_\tau(x, y, t-1) + \alpha \, \big( D(x, y, t) - H_\tau(x, y, t-1) \big) \]
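The MEI and MHI updates can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `update_mhi`, the grayscale input, and the fixed `diff_threshold` used to binarize the frame difference are all assumptions made for the example.

```python
import numpy as np

def update_mhi(mhi, frame_prev, frame_curr, tau, diff_threshold=25):
    """One update step; returns (new_mhi, mei) for the current frame."""
    # D(x, y, t): binary frame difference (thresholding is an assumed choice)
    diff = np.abs(frame_curr.astype(np.int32) - frame_prev.astype(np.int32))
    d = diff > diff_threshold
    # H_tau = tau where motion occurred, otherwise decay by one, floored at zero
    new_mhi = np.where(d, tau, np.maximum(0, mhi - 1))
    # E_tau is the union of D over the last tau frames, i.e. wherever H_tau > 0
    mei = new_mhi > 0
    return new_mhi, mei

# Usage: feed consecutive frames, carrying the MHI from step to step.
frames = [np.zeros((4, 4), dtype=np.uint8) for _ in range(3)]
mhi = np.zeros((4, 4), dtype=int)
for prev, curr in zip(frames, frames[1:]):
    mhi, mei = update_mhi(mhi, prev, curr, tau=10)
```

Because the MHI decays by one per frame from τ, thresholding it above zero recovers the MEI, so only one buffer has to be maintained.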
Photographs and figures from: Bobick, A., and J. Davis. "The Representation and Recognition of Action Using Temporal
Templates." IEEE Transactions on Pattern Analysis and Machine Intelligence 23, no. 3 (2002). Courtesy of IEEE, A. Bobick,
and J. Davis. Copyright 2002 IEEE. Used with Permission.
Illustration
OpenCV – Intel Open source Computer Vision Library
Classification
Feature vector:
x = [7 Hu moments for MEI + 7 Hu moments for MHI]
RTS invariant shape descriptors (see the end of notes)
With the usual Gaussian assumption on distribution of x:
\[ \mu_\omega = E[x_\omega]; \qquad \Sigma_\omega = E[(x_\omega - \mu_\omega)(x_\omega - \mu_\omega)^T] \]
Then the class, ω:
\[ \omega = \arg\min_{\omega} \left[ (x - \mu_\omega)^T \, \Sigma_\omega^{-1} \, (x - \mu_\omega) \right] \]
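The classification rule above is a nearest-class decision under the Mahalanobis distance. The sketch below assumes the feature vectors (e.g. the 14 Hu moments) have already been computed; the names `fit_class_models` and `classify` and the small `ridge` term added to keep the covariance invertible are assumptions for illustration.

```python
import numpy as np

def fit_class_models(features_by_class, ridge=1e-6):
    """Estimate (mean, inverse covariance) for every class label."""
    models = {}
    for label, samples in features_by_class.items():
        x = np.asarray(samples, dtype=float)          # shape (n_samples, dim)
        mu = x.mean(axis=0)
        # ridge is an assumed regularizer so that the inverse exists
        sigma = np.cov(x, rowvar=False) + ridge * np.eye(x.shape[1])
        models[label] = (mu, np.linalg.inv(sigma))
    return models

def classify(x, models):
    """Return the label with the smallest Mahalanobis distance to x."""
    def distance(label):
        mu, sigma_inv = models[label]
        d = np.asarray(x, dtype=float) - mu
        return float(d @ sigma_inv @ d)
    return min(models, key=distance)
```

For the multi-view case one such model would be kept per class and per view, with the minimum over views taken before the minimum over classes.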
Multi-View Recognition
The model is replicated for a discrete number of views:
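For θ = {0°...90°}:
\[ V(\omega_i) = \min_{\theta} \left[ (x - \mu^{\theta}_{\omega_i})^T \, (\Sigma^{\theta}_{\omega_i})^{-1} \, (x - \mu^{\theta}_{\omega_i}) \right] \]
- closest member of each class
\[ \omega = \arg\min_{\omega_i} \left[ V(\omega_i) \right] \]
- nearest class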
Photographs and figures from: Bobick, A., and J. Davis. "The Representation and Recognition of Action Using Temporal
Templates." IEEE Transactions on Pattern Analysis and Machine Intelligence 23, no. 3 (2002). Courtesy of IEEE,
A. Bobick, and J. Davis. Copyright 2002 IEEE. Used with Permission.
Example Application
KidsRoom
• Interactive story
• Autonomous system
• Narration is controlled
• Input from cameras and microphone
• Visual events:
• position
• motion energy
• motion direction
• gross body motion
Image removed due to copyright considerations. See:
http://whitechapel.media.mit.edu/vismod/demos/kidsroom/kidsroom.html
Motion Energy
Reference object
MEI
Move right
Move left
Move straight
Vismod Tech Report # 398
Movement Classification
“Flap”
MEI/MHI
“Spin”
Last game sequence
Temporal Alignment
Another idea – temporal alignment
X <> Y ?
If sequences are aligned to a common time axis, then we can treat
them as vectors
Temporal Alignment
Find re-indexing sequences ix and iy that align X and Y to a common
time axis k while minimizing dissimilarity.
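\[ i_x = f_x(k), \qquad i_y = f_y(k), \qquad k = 1, \ldots, T, \quad i_x \in \{1, \ldots, T_x\}, \quad i_y \in \{1, \ldots, T_y\} \]
One solution – Dynamic Time Warp algorithm:
\[ E = \frac{1}{M_f} \sum_{n=1}^{T} m(n) \, \big( s_x[i_x(n)] - s_y[i_y(n)] \big)^2 \]
where $1/M_f$ is the global normalization and $m(n)$ is the local weighting.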
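A minimal dynamic-programming sketch of the Dynamic Time Warp follows. It makes several simplifying assumptions that are not in the slides: the local weighting m(n) is 1, the allowed steps are the three standard ones (diagonal, up, left), and the global normalization M_f is taken to be the length of the warping path. The function name `dtw` is illustrative.

```python
import numpy as np

def dtw(sx, sy):
    """Return (normalized distortion E, warping path) for two sequences."""
    sx = np.asarray(sx, dtype=float).reshape(len(sx), -1)   # (Tx, dim)
    sy = np.asarray(sy, dtype=float).reshape(len(sy), -1)   # (Ty, dim)
    tx, ty = len(sx), len(sy)
    # local[i, j] = squared distance between sx[i] and sy[j]
    local = ((sx[:, None, :] - sy[None, :, :]) ** 2).sum(axis=2)
    acc = np.full((tx + 1, ty + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, tx + 1):
        for j in range(1, ty + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack to recover the re-indexing sequences (i_x(n), i_y(n))
    path, i, j = [], tx, ty
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((acc[i - 1, j - 1], i - 1, j - 1),
                      (acc[i - 1, j], i - 1, j),
                      (acc[i, j - 1], i, j - 1))
    path.reverse()
    return acc[tx, ty] / len(path), path

distortion, path = dtw([0, 1, 2, 3], [0, 1, 1, 2, 3])
```

The returned path lists the index pairs that are matched on the common time axis, so either sequence can be resampled along it before being treated as a fixed-length vector.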
Example: Utterance Classification
Time normalization:
1. Find the least-distortion prototype in each class (class prototypes)
2. Pick the longest one (master prototype)
3. Warp all data to it
4. Train classifier
[Figure: utterances “sit” and “roll over”; original sequences have 30-70 samples, normalized sequences have 59 samples]
Example: Utterance Classification
Alternative - pair-wise alignment:
\[ \text{SVM:} \quad f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \]
1. Compute the symmetric DTW between all pairs
\[ d_{ij} = \frac{D(s_i, s_j) + D(s_j, s_i)}{2} \]
2. Compute an RBF Kernel
\[ K(s_i, s_j) = \exp(-\gamma \, d_{ij}) \]
Danger: K might not be a proper (positive semidefinite) kernel matrix – need to regularize
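The two steps and the regularization warning can be sketched as follows. The slides do not say which regularization was used; clipping negative eigenvalues to zero, shown here, is one common choice and is an assumption of this example, as are the helper names `dtw_cost` and `dtw_rbf_kernel`.

```python
import numpy as np

def dtw_cost(sx, sy):
    """Accumulated DTW cost D(sx, sy) with unit weights (illustrative helper)."""
    sx = np.asarray(sx, dtype=float).reshape(len(sx), -1)
    sy = np.asarray(sy, dtype=float).reshape(len(sy), -1)
    local = ((sx[:, None, :] - sy[None, :, :]) ** 2).sum(axis=2)
    acc = np.full((len(sx) + 1, len(sy) + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, len(sx) + 1):
        for j in range(1, len(sy) + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    return acc[-1, -1]

def dtw_rbf_kernel(seqs, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * d_ij), projected to be PSD."""
    n = len(seqs)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # symmetric DTW distance between sequences i and j
            d[i, j] = d[j, i] = 0.5 * (dtw_cost(seqs[i], seqs[j])
                                       + dtw_cost(seqs[j], seqs[i]))
    k = np.exp(-gamma * d)
    # Assumed regularization: drop the negative part of the spectrum
    w, v = np.linalg.eigh(k)
    return (v * np.maximum(w, 0.0)) @ v.T
```

The resulting matrix can be passed to any SVM implementation that accepts a precomputed kernel.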
Example: Utterance Classification
Japanese Vowel Set (UCI Machine Learning Repository):
• Speaker identification task
• 9 speakers
• saying the same Japanese vowel
• features 12 cepstral coefficients
• each utterance – 7-30 samples
• 340 training examples
• 240 testing examples
Method    Accuracy
KNN       94.60%
MCC       94.10%
HMM       96.20%
SVM       98.20%
DynSVM    98.20%
Hidden Markov Model (HM&M)
Yet another idea:
Figure by MIT OCW.
HMMs
Another view – “Graphical Model”:
Figure by MIT OCW.
Components of an HMM
\[ \lambda = \{\pi, A, B\} \]
1) π - probability of starting from a particular state:
\[ \pi_i = p(q_1 = i), \qquad \sum_{i=1}^{N} \pi_i = 1 \]
2) A - probability of moving to a state, given the history:
\[ a_{ij} = p(q_t = j \mid q_{t-1} = i) \;\; \text{(Markov assumption)}, \qquad \sum_{j=1}^{N} a_{ij} = 1 \]
3) B - probability of outputting a particular observation from a given state:
\[ b_i(o) = p(o_t = o \mid q_t = i), \qquad \int b_i(x) \, dx = 1 \]
HMM in Pictures
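[Figure: trellis of N states unrolled over T transitions, starting from π at q_1 and running through q_{t-1}, q_t, …, q_T, with observations o_1, … emitted along the way]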
• A = p(qt| qt-1,…, q1)= p(qt| qt-1) – Markov assumption
• p(qt| qt-1) is independent of t – stationary => a matrix
HMM Example
[Figure panels: Input; Alphabet; Output distributions; Initial state; Transition matrix; How HMM sees it]
Three Tasks of HMM
1. Given a sequence of observations, find its probability given the model, P(O|λ)
2. Given a sequence of observations, recover the sequence of states, P(q|O,λ)
3. Given a sequence, estimate the parameters of the model
Problem I – Probability Calculation
Take I – brute force:
Given: $O = (o_1, \ldots, o_T)$
Calculate: $P(O \mid \lambda)$
Marginalize:
\[ P(O \mid \lambda) = \sum_{\forall q} P(O, q \mid \lambda) = \sum_{\forall q} P(O \mid q, \lambda) \, P(q \mid \lambda) \]
\[ P(O \mid q, \lambda) = b_{q_1}(o_1) \, b_{q_2}(o_2) \ldots b_{q_T}(o_T) \]
\[ P(q \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \ldots a_{q_{T-1} q_T} \]
\[ P(O \mid q, \lambda) \, P(q \mid \lambda) = \pi_{q_1} b_{q_1}(o_1) \, a_{q_1 q_2} b_{q_2}(o_2) \, a_{q_2 q_3} b_{q_3}(o_3) \ldots a_{q_{T-1} q_T} b_{q_T}(o_T) \]
N states, T transitions => $|q| = N^T$ !!!!
N=5, T=100 => $2TN^T = 2 \cdot 100 \cdot 5^{100} \sim 10^{72}$ computations
$65536 \cdot 10^{72}$ particles in the universe
Try Again
\[ P(O \mid \lambda) = \sum_{\forall q} P(O, q \mid \lambda) \]
\[ = \sum_{q = q_1}^{q_{10^{72}}} \pi_{q(1)} b_{q(1)}(o_1) \, a_{q(1)q(2)} b_{q(2)}(o_2) \, a_{q(2)q(3)} b_{q(3)}(o_3) \ldots a_{q(T-1)q(T)} b_{q(T)}(o_T) \qquad \approx 2TN^T \]
\[ = \sum_{m} \sum_{l} \cdots \sum_{j} \sum_{i} \pi_i b_i(o_1) \, a_{ij} b_j(o_2) \, a_{jk} b_k(o_3) \ldots a_{lm} b_m(o_T) \qquad \approx N^2 T \]
Problem I – Probability Calculation
Take II – forward procedure:
Define a “forward variable”, α
\[ \alpha_t(i) = P(o_1 o_2 \ldots o_t, \, q_t = i \mid \lambda) \]
- probability of seeing the string up to t and ending up in state i
1. Initialize
\[ \alpha_1(i) = \pi_i \, b_i(o_1) \]
2. Induce
\[ \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(o_{t+1}) \]
3. Terminate
\[ P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i) \]
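The forward procedure can be checked against brute-force marginalization on a toy model. The sketch assumes a discrete-output HMM with `A[i, j]` holding the probability of moving from state i to state j and `B[i, k]` the probability of emitting symbol k in state i; the parameter values and function names are invented for the example.

```python
import itertools
import numpy as np

def forward(pi, A, B, obs):
    """Return (alpha, P(O | lambda)); alpha[t, i] is the forward variable."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # initialize
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # induce
    return alpha, alpha[-1].sum()                      # terminate

def brute_force(pi, A, B, obs):
    """Sum P(O, q | lambda) over all N**T state paths (tiny models only)."""
    total = 0.0
    for q in itertools.product(range(len(pi)), repeat=len(obs)):
        p = pi[q[0]] * B[q[0], obs[0]]
        for t in range(1, len(obs)):
            p *= A[q[t - 1], q[t]] * B[q[t], obs[t]]
        total += p
    return total

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2, 0]
alpha, likelihood = forward(pi, A, B, obs)
```

Both routines return the same likelihood, but the forward pass needs on the order of N²T operations instead of enumerating N^T paths.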
While We Are At It…
Define a “backward variable”, β
\[ \beta_t(i) = P(o_{t+1} o_{t+2} \ldots o_T \mid q_t = i, \lambda) \]
- probability of seeing the rest of the string after t and after visiting state i at t
1. Initialize
\[ \beta_T(i) = 1 \]
2. Induce
\[ \beta_t(i) = \sum_{j=1}^{N} a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j) \]
3. Terminate
Task II – Optimal State Sequence
“Optimality” - maximum probability of being in a state i at time t.
Given: $O = (o_1, \ldots, o_T)$
Find: $q_t = \arg\max_{q} P(q_t \mid O, \lambda)$
By Bayes rule:
\[ P(q_t = i \mid O) = \frac{P(O, q_t = i)}{\sum_{i=1}^{N} P(O, q_t = i)} \]
\[ P(O, q_t) = P(o_1 \ldots o_t, \, o_{t+1} \ldots o_T, \, q_t) = P(o_1 \ldots o_t, q_t) \, P(o_{t+1} \ldots o_T \mid o_1 \ldots o_t, q_t) \]
\[ = P(o_1 \ldots o_t, q_t) \, P(o_{t+1} \ldots o_T \mid q_t) = \alpha_t \beta_t \]
State Posterior
So,
\[ P(q_t = i \mid O) = \frac{P(O, q_t = i)}{\sum_{j=1}^{N} P(O, q_t = j)} = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j) \, \beta_t(j)} = \gamma_t(i) \]
1. Forward pass – compute α matrix ($\approx N^2 T$)
2. Backward pass – compute β matrix ($\approx N^2 T$)
3. Multiply element-by-element ($NT$)
4. Normalize columns ($\approx N^2 T$)
What’s the problem?
Inconsistent paths – some might not even be allowed
But not entirely useless! We will need it later.
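The four steps of the state posterior computation can be sketched as below for a discrete-output HMM. The arrays are stored time-major (`alpha[t, i]`), so the "columns" of the slides correspond to rows here; that layout, the function name and the toy parameters are assumptions of the example.

```python
import numpy as np

def state_posterior(pi, A, B, obs):
    """Return (alpha, beta, gamma) with gamma[t, i] = P(q_t = i | O)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))                             # beta_T(i) = 1
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                              # 1. forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):                     # 2. backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    joint = alpha * beta                               # 3. element-wise product
    gamma = joint / joint.sum(axis=1, keepdims=True)   # 4. normalize per time step
    return alpha, beta, gamma

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
alpha, beta, gamma = state_posterior(pi, A, B, [0, 1, 2, 2, 0])
per_step_best = gamma.argmax(axis=1)   # individually most likely states
```

Taking the arg max of γ independently at every step is exactly what can produce inconsistent paths, since nothing forces consecutive choices to be joined by an allowed transition.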
Task II – Viterbi Algorithm
“Optimality” – single maximum probability path.
Given: $O = (o_1, \ldots, o_T)$
Find: $\arg\max_{q} P(q \mid O, \lambda)$
Define:
\[ \delta_t(i) = \max_{q_1 q_2 \ldots q_{t-1}} P(q_1 \ldots q_{t-1}, \, q_t = i, \, o_1 \ldots o_t) \]
- max prob. path so far
By the optimality principle (Bellman, ’57):
\[ \delta_{t+1}(j) = \left[ \max_{i} \delta_t(i) \, a_{ij} \right] b_j(o_{t+1}) \]
Just need to keep track of max probability states along the way
Task II – Viterbi Algorithm (cont.)
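1. Initialize
\[ \delta_1(i) = \pi_i \, b_i(o_1), \qquad \psi_1(i) = 0, \qquad 1 \le i \le N \]
(ψ is a housekeeping variable)
2. Recurse
\[ \delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t), \qquad 2 \le t \le T, \quad 1 \le j \le N \]
\[ \psi_t(j) = \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i) \, a_{ij} \right], \qquad 2 \le t \le T, \quad 1 \le j \le N \]
3. Terminate
\[ P^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i) \]
4. Backtrack
\[ q_t^* = \psi_{t+1}(q_{t+1}^*), \qquad t = (T-1), \ldots, 1 \]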
Viterbi Illustration
• Similar to the forward procedure
• Typically, you’ll do it in log space, for speed and to avoid underflow:
- replace all parameters with their logarithms
- replace all multiplications with additions
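A log-space Viterbi pass can be sketched as follows for a discrete-output HMM. The function name, the time-major array layout and the toy parameters are assumptions of this example; zero probabilities become -inf after the logarithm, which the max operations handle naturally.

```python
import numpy as np

def viterbi_log(pi, A, B, obs):
    """Return (best state path, log probability of that path)."""
    with np.errstate(divide="ignore"):        # log(0) = -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)         # housekeeping: best predecessor
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A    # scores[i, j] = delta(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()             # terminate
    for t in range(T - 2, -1, -1):            # backtrack
        path[t] = psi[t + 1][path[t + 1]]
    return path.tolist(), float(delta[-1].max())

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
best_path, best_logp = viterbi_log(pi, A, B, [0, 1, 2])
```

The structure mirrors the forward procedure, with the sum over predecessors replaced by a max and the products replaced by additions of logarithms.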
Task III – Parameter Estimation
Baum-Welch algorithm (EM for HMMs)
Given: $O = (o_1, \ldots, o_T)$
Find: $\pi, A, B$
First, introduce another Greek letter:
\[ \xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O) = \frac{P(q_t = i, q_{t+1} = j, O)}{P(O)} = \frac{\alpha_t(i) \, a_{ij} \, b_j(o_{t+1}) \, \beta_{t+1}(j)}{P(O)} \]
[Figure: α_t(i) arrives at state i, the transition contributes a_{ij} b_j(o_{t+1}), and β_{t+1}(j) covers the rest of the string from state j]
Transition Probability
This leads to:
\[ a_{ij} = \frac{E[\#(i \to j)]}{E[\#(i \to \cdot)]} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)} \]
The numerator is the expected # of transitions from i to j.
The rest is easy
Priors and Outputs
Prior distribution:
\[ \pi_i = E[\#(i, t = 1)] = \gamma_1(i) \]
Output distribution (discrete):
\[ b_i(k) = \frac{E[\#(i, v_k)]}{E[\#(i)]} = \frac{\sum_{t=1,\, o_t = v_k}^{T} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)} \]
Numerator: sum probabilities of being in state i while seeing symbol v_k
Denominator: normalize
Figure by MIT OCW.
Continuous Output Case
Output distribution (continuous, Gaussian):
\[ b_i(o) = N(\mu_i, \Sigma_i) \]
\[ \mu_i = \frac{\sum_{t=1}^{T} \gamma_t(i) \, o_t}{\sum_{t=1}^{T} \gamma_t(i)} \]
- observation at time t weighted by the probability of being in the state at that time
\[ \Sigma_i = \frac{\sum_{t=1}^{T} \gamma_t(i) \, (o_t - \mu_i)(o_t - \mu_i)^T}{\sum_{t=1}^{T} \gamma_t(i)} \]
These should look VERY familiar
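One Baum-Welch iteration built from the re-estimation formulas of the preceding slides can be sketched as follows. To stay short, the sketch covers the discrete output case only, uses a single training sequence, and applies no scaling (so it is suitable for short sequences only); the function name and array layout are assumptions of the example.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM update; returns (new_pi, new_A, new_B, P(O | old model))."""
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    p_obs = alpha[-1].sum()
    gamma = alpha * beta / p_obs                       # gamma[t, i]
    # xi[t, i, j] = alpha_t(i) a_ij b_j(o_{t+1}) beta_{t+1}(j) / P(O)
    emit_next = (B[:, obs[1:]].T * beta[1:])[:, None, :]
    xi = alpha[:-1, :, None] * A[None, :, :] * emit_next / p_obs
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # numerator: gamma summed over the time steps where symbol k was observed
    new_B = np.stack([gamma[obs == k].sum(axis=0)
                      for k in range(B.shape[1])], axis=1)
    new_B = new_B / gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, p_obs
```

Calling the function repeatedly, feeding each result back in, gives the usual EM loop; the likelihood returned on successive calls should not decrease.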
Semi-Continuous HMM Example
[Figure panels: Input; Initial state; Output distributions; Transition matrix; How HMM sees it]
Gesture Recognition – Trajectory Model
Modeling a tracked hand trajectory.
HMM Classifier
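Nothing unusual: the input sequence is scored by a bank of HMMs ($\lambda_1, \lambda_2, \lambda_3, \lambda_4$), and an ArgMax over their outputs selects the model λ.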
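The bank-of-HMMs classifier amounts to scoring the input under every model and taking the arg max. The sketch below assumes discrete-output models stored as `(pi, A, B)` tuples and uses a scaled forward pass to obtain log-likelihoods; the gesture names and parameter values are invented for the example.

```python
import numpy as np

def log_likelihood(model, obs):
    """log P(O | lambda) via the forward procedure with per-step scaling."""
    pi, A, B = model
    alpha = pi * B[:, obs[0]]
    total = 0.0
    for o in obs[1:]:
        scale = alpha.sum()
        total += np.log(scale)                 # accumulate the scale factors
        alpha = ((alpha / scale) @ A) * B[:, o]
    return total + np.log(alpha.sum())

def classify(obs, bank):
    """Return the name of the model with the highest likelihood."""
    return max(bank, key=lambda name: log_likelihood(bank[name], obs))

pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
bank = {
    "wave": (pi, A, np.array([[0.9, 0.1], [0.8, 0.2]])),    # mostly symbol 0
    "point": (pi, A, np.array([[0.1, 0.9], [0.2, 0.8]])),   # mostly symbol 1
}
label = classify([0, 0, 1, 0, 0], bank)
```

Scaling keeps the forward variable from underflowing on long sequences while leaving the arg max unchanged.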
Applications – American Sign Language
Task: Recognition of sentences of American Sign Language
• Single camera
• No special markings on hands
• Real-time
40 word lexicon:
part of speech   vocabulary
pronoun          I, you, he, we, you(pl), they
verb             want, like, lose, dontwant, dontlike, love, pack, hit, loan
noun             box, car, book, table, paper, pants, bicycle, bottle, can, wristwatch, umbrella, coat, pencil, shoes, food, magazine, fish, mouse, pill, bowl
adjective        red, brown, black, gray, yellow
Table from: Starner, T., et al. "Real-Time American Sign Language Recognition Using Desk and Wearable Computer Based Video."
IEEE Transactions on Pattern Analysis and Machine Intelligence (1998). Courtesy of IEEE. Copyright 1998 IEEE. Used with Permission.
ASL – Features and Model
“Word” model – a 4-state L-R HMM with a single skip transition
Features (from skin model):
\[ o = [\, (x, y, dx, dy, \text{area}, \theta, \lambda_{max}, \lambda_{max} / \lambda_{min})_{right}, \; (\ldots)_{left} \,]^T \]
System 1: Second person
System 2: First person
Photo marked with black box due to copyright consideration.
Nose could be used for initializing the skin model
Courtesy of Thad Starner. Used with permission.
ASL – In Action
Courtesy of Thad Starner. Used with permission.
Applications – American Sign Language
500 sentences (400 training, 100 testing)
System 1 (part-of-speech grammar):
experiment                             training set                                 test set
all features                           94.10%                                       91.90%
relative features                      89.60%                                       87.20%
all features & unrestricted grammar    81.0% (87%) (D=31, S=287, I=137, N=2390)     74.5% (83%) (D=3, S=76, I=41, N=470)

System 2:
grammar            training set                                  test set
part-of-speech     99.30%                                        97.80%
5-word sentence    98.2% (98.4%) (D=5, S=36, I=5, N=2500)        97.80%
unrestricted       96.4% (97.8%) (D=24, S=32, I=35, N=2500)      96.8% (98.0%) (D=4, S=6, I=6, N=500)

Figures in parentheses: % words recognized correctly
Word accuracy: $1 - \frac{D + S + I}{N}$
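The word accuracy measure can be reproduced from the error counts reported in the tables, where D, S and I are deletions, substitutions and insertions and N is the number of words. The formula for "% words recognized correctly", (N - D - S) / N, is an assumption here: it is the usual definition and it matches the parenthesized figures, but the slides do not spell it out.

```python
def word_accuracy(d, s, i, n):
    """Word accuracy: 1 - (D + S + I) / N."""
    return 1.0 - (d + s + i) / n

def percent_correct(d, s, n):
    """Fraction of words recognized correctly (assumed: insertions ignored)."""
    return (n - d - s) / n

# System 1, unrestricted grammar, training set: D=31, S=287, I=137, N=2390
acc = round(100 * word_accuracy(31, 287, 137, 2390), 1)
cor = round(100 * percent_correct(31, 287, 2390))
```

Insertions lower the accuracy but not the percentage correct, which is why the parenthesized figures are consistently higher.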
Beyond HMM
Where can we go if HMM is not sufficient?
Ideas:
• Hierarchical HMM
• More complex models - SCFG
Explicit representation of structure
Capable of generating only a regular
language
Compact representation
More expressive, may include memory,
but harder to deal with
Figures from: Ivanov, Y., and A. Bobick. "Recognition of Visual Activities and Interactions." IEEE Transactions of Pattern Analysis
and Machine Intelligence (2000). Courtesy of IEEE. Copyright 2000 IEEE. Used with Permission.
Structured Gesture
Problem:
2 directions = 2 models
WHY???
Solution – split the model in two:
• Components (trajectories)
• Structure (events)
Heterogeneous Representation
• Many high-level activities are sequences of primitives
– Pitching, cooking, dancing, stealing a car from a parking lot
• Components
– Signal level model
– Variability in performance
– Hidden state representation (HMM, etc.)
• Structure
– Event-level model
– Uncertainty in component detections
– State is NOT hidden (SRG, SCFG, etc)
• Right tool for the right task!
Two-tier Recognition Architecture
Application: Conducting Music
“Dictionary” gestures
Application: Conducting Music
Jean Sibelius, Second Symphony, Opus 43, D Major
[Figure: numbered beat patterns of the conducting gestures]
Grammar: [shown in figure]
              Correct
Individual    ~70%
Component     ~85%
Bar           ~95%
Component Detection
Input sequence
HMM Likelihoods
Peaks of likelihood
Single event
Temporal Consistency
Grammar:
A -> ab | abA
Input: a b a b b a b
Temporal Consistency
Grammar:
A -> ab | abA
Terminals have temporal extent!
Input: a b a b b a b
[Figure: an inconsistent parse and a consistent parse of the same input]
Figures from: Ivanov, Y., and A. Bobick. "Recognition of Visual Activities and Interactions." IEEE Transactions of Pattern
Analysis and Machine Intelligence (2000). Courtesy of IEEE. Copyright 2000 IEEE. Used with Permission.
Parsing
The idea is that the top level parse will filter out mistakes in
low level detections
[Figure: Detection of events a, b, c, d, e feeds the Parser; event string: adecb]
Stochastic Context-Free Grammar
Example Grammar:
TRACK       -> CAR-TRACK           [0.50]
             | PERSON-TRACK        [0.50]
CAR-TRACK   -> CAR-THROUGH         [0.25]
             | CAR-PICKUP          [0.25]
             | CAR-OUT             [0.25]
             | CAR-DROP            [0.25]
CAR-THROUGH . . .
. . .
CAR-EXIT    -> car-exit            [0.70]
             | SKIP car-exit       [0.30]
Non-terminals - semantically significant groups of events
Terminals - individual events
SKIP-rule - noise symbol
Numbers in brackets - rule probabilities
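The example grammar can be held in a small data structure to make the role of the rule probabilities concrete. This is only a sketch: the dictionary layout and the helper names are assumptions, and only the rules that are actually shown on the slide are included (the elided expansions are left out).

```python
import math

# left-hand side -> list of (right-hand side symbols, rule probability)
GRAMMAR = {
    "TRACK": [(("CAR-TRACK",), 0.50), (("PERSON-TRACK",), 0.50)],
    "CAR-TRACK": [(("CAR-THROUGH",), 0.25), (("CAR-PICKUP",), 0.25),
                  (("CAR-OUT",), 0.25), (("CAR-DROP",), 0.25)],
    "CAR-EXIT": [(("car-exit",), 0.70), (("SKIP", "car-exit"), 0.30)],
}

def is_normalized(grammar, tol=1e-9):
    """Rule probabilities for each non-terminal must sum to one."""
    return all(abs(sum(p for _, p in rules) - 1.0) <= tol
               for rules in grammar.values())

def rule_probability(grammar, lhs, rhs):
    """Look up the probability of the production lhs -> rhs."""
    for symbols, p in grammar[lhs]:
        if symbols == tuple(rhs):
            return p
    raise KeyError(f"no rule {lhs} -> {' '.join(rhs)}")

def derivation_probability(grammar, productions):
    """Probability of a derivation: product of the probabilities of its rules."""
    return math.prod(rule_probability(grammar, lhs, rhs)
                     for lhs, rhs in productions)

p = derivation_probability(GRAMMAR, [("TRACK", ["CAR-TRACK"]),
                                     ("CAR-TRACK", ["CAR-THROUGH"])])
```

A parser ranks competing parses by this product, so a parse that has to invoke the SKIP rule pays for it through the lower rule probability.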
Event Parsing
X -> Zc, Z -> ab - production rules (states)
X - target non-terminal (label)
Z - intermediate non-terminal
a x y b p q r c - input stream (tracking events)
SKIP - noise rules
For the production X, events a, b and c should be consistent
Application: Musical Conducting
Courtesy of Teresa Marrin-Nakra. Used with permission.
              Correct
Individual    ~70%
Component     ~85%
Bar           ~95%
Application: Surveillance System
• Outdoor environment - occlusions and lighting
changes
• Static cameras
• Real-time performance
• Labeling activities and person-vehicle interactions in a
parking lot
• Handling simultaneous events
Monitoring System
Photos and figures from: Stauffer, Chris, and Eric Grimson, "Learning Patterns of Activity Using Real-Time Tracking." IEEE Transactions on Pattern
Recognition and Machine Intelligence (TPAMI) 22, no. 8 (2000): 747-757. Courtesy of IEEE, Chris Stauffer, and Eric Grimson. Copyright 2000 IEEE. Used with Permission.
• Tracker (Stauffer, Grimson)
– assigns identity to the moving objects
– collects the trajectory data into partial tracks
• Event Generator
– maps partial tracks onto a set of events
• Parser
– labels sequences of events according to a grammar
– enforces spatial and temporal constraints
Tracker
• Adaptive to slow lighting changes:
– Each pixel is modeled by a mixture of K Gaussians:
\[ P(X_t) = \sum_{i=1}^{K} w_{i,t} \, \eta(X_t, \mu_{i,t}, \Sigma_{i,t}) \]
• Foreground regions are found by connected
components algorithm
• Object dynamics is modeled in 2D by a set of
Kalman filters
• Details - (Stauffer, Grimson CVPR 99)
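Evaluating the per-pixel mixture is straightforward once η is taken to be the multivariate Gaussian density. The sketch below only evaluates P(X_t) for given parameters; the online update of the weights, means and covariances and the foreground decision rule are described in the cited paper and are not reproduced here. Function names and the example values are assumptions.

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Multivariate normal density eta(x, mu, cov)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.atleast_1d(np.asarray(mu, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    diff = x - mu
    norm = np.sqrt((2.0 * np.pi) ** x.size * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def mixture_probability(x, weights, means, covs):
    """P(X_t) = sum_i w_i * eta(X_t, mu_i, Sigma_i)."""
    return sum(w * gaussian_density(x, m, c)
               for w, m, c in zip(weights, means, covs))

# Example: an RGB pixel modeled by K = 2 components with diagonal covariances
weights = [0.7, 0.3]
means = [[100.0, 100.0, 100.0], [30.0, 30.0, 30.0]]
covs = [np.eye(3) * 25.0, np.eye(3) * 100.0]
p = mixture_probability([102.0, 99.0, 101.0], weights, means, covs)
```

A pixel value that is poorly explained by the dominant components receives a low probability, which is the cue used to separate foreground from background.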
Tracker
Camera view
Trajectories over time
Connected components
An object
Photos and figures from: Stauffer, Chris, and Eric Grimson, "Learning Patterns of Activity Using Real-Time Tracking." IEEE
Transactions on PatternRecognition and Machine Intelligence (TPAMI) 22, no. 8 (2000): 747-757. Courtesy of IEEE, Chris
Stauffer, and Eric Grimson. Copyright 2000 IEEE. Used with Permission.
Event Generator
Photos and figures from: Stauffer, Chris, and Eric Grimson, "Learning Patterns of Activity Using Real-Time Tracking." IEEE Transactions on Pattern
Recognition and Machine Intelligence (TPAMI) 22, no. 8 (2000): 747-757. Courtesy of IEEE, Chris Stauffer, and Eric Grimson. Copyright 2000 IEEE.
Used with permission.
Map tracks onto events: car-enter, person-enter, car-found,
person-found, car-lost, person-lost, stopped
• Events along with class likelihoods are posted at the endpoints of each track
(car-appear [0.5], car-disappear [1.0])
• Action label is assigned to each event in accordance with the environment map
(car-enter [0.5], car-exit [1.0])
• Each event is complemented if the label probability is < 1
(car-enter [0.5], person-enter [0.5], car-exit [1.0])
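The complementing step can be sketched as a small transformation of the event list. The sketch assumes exactly two object classes (car and person), event labels of the form `<class>-<action>`, and that the complementary event receives the remaining probability mass 1 - p; these details are inferred from the example on the slide rather than stated in it.

```python
# Assumed two-class world: the complement of "car" is "person" and vice versa
COMPLEMENT = {"car": "person", "person": "car"}

def complement_events(events):
    """Add the complementary event whenever the label probability is < 1."""
    result = []
    for label, prob in events:
        result.append((label, prob))
        if prob < 1.0:
            obj, action = label.split("-", 1)
            result.append((f"{COMPLEMENT[obj]}-{action}", 1.0 - prob))
    return result

stream = complement_events([("car-enter", 0.5), ("car-exit", 1.0)])
```

Posting both alternatives lets the parser, rather than the event generator, decide which object class is consistent with the rest of the track.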
Parking Lot Grammar (Partial)
Photos and figures from: Stauffer, Chris, and Eric Grimson, "Learning Patterns of Activity Using Real-Time Tracking." IEEE Transactions on
Pattern Recognition and Machine Intelligence (TPAMI) 22, no. 8 (2000): 747-757. Courtesy of IEEE, Chris Stauffer, and Eric Grimson.
Copyright 2000 IEEE. Used with Permission.
Consistency
• Temporal
– Events should happen in particular order
– Temporally close events are more likely to be related
– Tracks overlapping in time are definitely not related to the
same object
• Spatial
– Spatially close events are more likely to be related
• Other
– Objects don’t change identity within a track
Spatio-Temporal Consistency
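r = (x, y), dr = (dx, dy)
Predict new position:
\[ r_p = r_1 + dr_1 \, (t_2 - t_1) \]
Penalize:
\[ f(r_p, r_2) = \begin{cases} 0 & \text{if } (t_2 - t_1) < 0 \\ \exp\!\left( -\dfrac{(r_2 - r_p)^T (r_2 - r_p)}{\theta} \right) & \text{otherwise} \end{cases} \]
[Figure: candidate events X and Z along the time axis t are rejected, penalized, or accepted]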
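The spatio-temporal consistency check can be sketched as a single function: predict where the first track would be at the time of the second event, then penalize the distance to the observed position. The sketch assumes the penalty decays with squared distance (negative exponent) and treats θ as a free scale parameter; the function name and the default `theta` are illustrative.

```python
import numpy as np

def consistency_penalty(r1, dr1, t1, r2, t2, theta=1.0):
    """Return f(r_p, r_2) in [0, 1]; 0 rejects, 1 fully accepts."""
    if t2 - t1 < 0:
        return 0.0                       # events in the wrong temporal order
    r1, dr1, r2 = (np.asarray(v, dtype=float) for v in (r1, dr1, r2))
    r_p = r1 + dr1 * (t2 - t1)           # predicted position at time t2
    diff = r2 - r_p
    return float(np.exp(-(diff @ diff) / theta))

# An object at the origin moving with velocity (1, 2), seen again 2 time units later
score = consistency_penalty([0.0, 0.0], [1.0, 2.0], 0.0, [2.5, 4.0], 2.0)
```

Multiplying the probability of a candidate parse by this factor favors pairings of events that are close to where the tracked object was expected to be.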
Input Data
Event Generator: interleaved events in the input stream
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions."
IEEE Workshop on Visual Surveillance (ICCV 2001) (1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and
W. E. L. Grimson. Copyright 1999 IEEE. Used with Permission.
Parse 1: Person-Pass-Through
Action label
Component labels
Object track
Temporal extent
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions." IEEE Workshop
on Visual Surveillance (ICCV 2001) (1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and W. E. L. Grimson.
Copyright 1999 IEEE. Used with Permission.
Parse 2: Drive-In
Action label
Component labels
Object tracks
Temporal extent
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions." IEEE Workshop
on Visual Surveillance (ICCV 2001) (1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and W. E. L. Grimson.
Copyright 1999 IEEE. Used with Permission.
Parse 3: Drop-off
Action label
Component labels
Object track
Temporal extent
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions." IEEE Workshop
on Visual Surveillance (ICCV 2001) (1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and W. E. L. Grimson.
Copyright 1999 IEEE. Used with Permission.
Parse 4: Car-Pass-Through
Action label
Component labels
Object track
Temporal extent
Figures from Ivanov, Yuri, Chris Stauffer, Aaron Bobick, W. E. L. Grimson. "Video Surveillance of Interactions." IEEE Workshop
on Visual Surveillance (ICCV 2001) (1999). Courtesy of IEEE, Yuri Ivanov, Chris Stauffer, Aaron Bobick, and W. E. L. Grimson.
Copyright 1999 IEEE. Used with Permission.
Summary
• Real-time system
• First-of-a-kind end-to-end system
• Extended robust parsing algorithm
• Events are staged in real environment with other cars and people
• ~10-15 events per minute
• Staged events - 100% detected
• Accidental events - ~80% detected
Automatic Surveillance System
• Outdoor environment - occlusions and lighting changes
• Static cameras
• Real-time performance
• Labeling activities and person-vehicle interactions in a parking lot
• Handling simultaneous events
Appendix: Hu Moments
Full appendix from: Bobick, A., and J. Davis. "The Representation and Recognition of Action Using Temporal Templates." IEEE Transactions on Pattern
Analysis and Machine Intelligence 23, no. 3 (2002). Courtesy of IEEE. Copyright 2002 IEEE. Used with Permission.