Mubarak Shah
Computer Vision Lab
University of Central Florida
shah@cs.ucf.edu
Human action is purposeful behavior.
Ludwig von Mises

All human actions have one or more of these seven causes: chance, nature, compulsion, habit, reason, passion and desire.
Aristotle

Change does not come from the sky. It comes from human action.
Tenzin Gyatso, 14th Dalai Lama

Human beings must have action; and they will make it if they cannot find it.
Albert Einstein






Terminology: Action, Event, Movement, Activity, Interaction, Verb.


Weizmann dataset: 10 actions, 9 actors per action.
Bend, Sidestep, Jack, Hop, Wave 2 Hands, Walk, Wave 1 Hand, Skip, Jump in Place, Run.

KTH dataset: six categories, 25 actors, 4 instances each, nearly 600 clips in total.
Boxing, Hand Clapping, Jogging, Running, Hand Waving, Walking.

9 actions, 142 videos:
Kick, Swing, Dive, Bench Swing, Lift, Ride, Golf Swing, Run, Skate.

13 action categories, 4 camera views, 10 actors, 3 instances.
[Example frames from Views 1–4: Check Watch, Get Up.]
YouTube action dataset: Cycling, Juggling, Volleyball Spiking, Diving, Golf Swinging, Riding, Basketball Shooting, Swinging, Tennis Swinging, Trampoline Jumping, Walking Dog.
Outline:
• Feature Tree: labeling feature points with the video class
• Visual Vocabulary Using Diffusion Maps
• What Is Missing?
Chaotic Invariants for Human Action Recognition
Saad Ali, Arslan Basharat and Mubarak Shah
ICCV 2007

Treat the trajectories of human joints as the representation of the non-linear dynamical system that generates the particular action.

The system evolves by a rule $f$ acting on the true state-space variables $(\theta_1^t, \theta_2^t, \ldots, \theta_N^t)$:

$$(\theta_1^t, \ldots, \theta_N^t) \xrightarrow{f} (\theta_1^{t+1}, \ldots, \theta_N^{t+1}) \xrightarrow{f} (\theta_1^{t+2}, \ldots, \theta_N^{t+2}) \xrightarrow{f} (\theta_1^{t+3}, \ldots, \theta_N^{t+3})$$

We have access to the data generated by the dynamical system controlling this action!



Experimental data: trajectories of body joints. From this data, construct the phase space. That is, let the data speak to you and tell you what mechanisms are generating the chaotic behavior.

Pipeline: Action Video → Extract Joint Trajectories → Time Series (X, Y) → Feature Vector.
For each time series, perform phase-space embedding to obtain the reconstructed phase space, then compute the chaotic invariants.

[Figure: for each joint time series, the reconstructed phase space and three diagnostic plots: the correlation sum (ln C(r) vs. ln r), the scaling of the correlation dimension D2 (ld C(r) vs. ld r), and the prediction error. The chaotic invariants extracted per time series are the correlation integral, the correlation dimension, and the maximal Lyapunov exponent.]
• Six body joints: two hands, two feet, head, belly.
• Trajectories are normalized with respect to the belly point.
• This results in 5 trajectories per action.
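A minimal NumPy sketch of this normalization step; the array layout and joint ordering are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def normalize_joints(joints, belly_idx=5):
    """Normalize joint trajectories with respect to the belly point.

    joints: array of shape (T, 6, 2) holding (x, y) positions of the six
    tracked joints over T frames; belly assumed last (illustrative order).
    Returns the 5 remaining belly-centered trajectories, shape (T, 5, 2).
    """
    centered = joints - joints[:, belly_idx:belly_idx + 1, :]  # subtract belly
    return np.delete(centered, belly_idx, axis=1)              # drop the belly itself

# Example with random stand-in data for one action clip of 100 frames.
traj = normalize_joints(np.random.default_rng(0).normal(size=(100, 6, 2)))
print(traj.shape)  # (100, 5, 2): 5 trajectories per action
```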

Each dimension of a trajectory is considered as a univariate time series $z_1, z_2, z_3, \ldots, z_t$.

Underlying idea: all the variables of the dynamical system influence each other, so every point of the series results from the intricate combination of influences of all the true state variables. Therefore, the delayed samples $z_{i+\tau}, z_{i+2\tau}, \ldots, z_{i+(m-1)\tau}$ can be considered substitute variables which carry the influence of all the system's variables over the time interval $\tau$. Using this reasoning, introduce a series of substitute variables and obtain the whole m-dimensional space.
m3
z1 , z2 , z3 , z4 , z5 , z6 , z7 ,...., zt
 2
 z1
X  z 2
 .
z3
z4
.
z5 
z6 
. 
m-dimensional reconstructed phase space
Copyrights Mubarak Shah, UCF
Each row is a point
i a m-dimensional
in
di
i
l
phase space.
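A minimal sketch of this delay-embedding step in NumPy (the helper name and the toy series are assumptions; in practice m and τ are chosen with standard heuristics such as mutual information and false nearest neighbors):

```python
import numpy as np

def delay_embed(z, m=3, tau=2):
    """Delay-embed a univariate time series z into an m-dimensional
    reconstructed phase space with delay tau; row i of the result is
    the point (z[i], z[i+tau], ..., z[i+(m-1)*tau])."""
    z = np.asarray(z, dtype=float)
    n = len(z) - (m - 1) * tau                    # number of embedded points
    cols = [z[k : k + n] for k in range(0, m * tau, tau)]
    return np.stack(cols, axis=1)

# Example: embed one coordinate of a joint trajectory.
t = np.linspace(0, 10, 200)
z = np.sin(2 * np.pi * t)                         # stand-in for a joint coordinate
X = delay_embed(z, m=3, tau=2)
print(X.shape)                                    # (196, 3)
```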
[Figure: reconstructed phase spaces for the head, right hand, left hand, right foot, and left foot trajectories.]

• Maximal Lyapunov exponent: measures the exponential divergence of nearby trajectories in phase space. A maximal Lyapunov exponent > 0 implies that the dynamics of the underlying system are chaotic.
• Correlation integral: measures the number of points within a neighborhood of some radius, averaged over the entire attractor.
• Correlation dimension: measures the change in the density of the phase space with respect to the neighborhood radius.
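For concreteness, a sketch of estimating the correlation sum and correlation dimension from an embedded phase space (an illustrative implementation, not the authors' code; the Lyapunov exponent is typically estimated with dedicated routines such as Rosenstein's method):

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_sum(X, radii):
    """C(r): fraction of pairs of phase-space points (rows of X) whose
    distance is below r, averaged over the entire attractor."""
    d = pdist(X)                                  # all pairwise distances
    return np.array([np.mean(d < r) for r in radii])

radii = np.logspace(-2, 0, 20)
C = correlation_sum(X, radii)                     # X from the embedding sketch
# Correlation dimension: slope of ln C(r) vs. ln r in the scaling region.
mask = C > 0
slope, _ = np.polyfit(np.log(radii[mask]), np.log(C[mask]), 1)
print("estimated correlation dimension:", slope)
```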
Motion capture data. Dataset sizes: Dance: 19, Run: 26, Walk: 46, Sit: 14, Jump: 33.
Leave-one-out cross-validation using a K-means classifier.
[Example motion-capture sequences: Walking, Running, Jumping, Ballet, Sitting.]
[Confusion table over Dance, Run, Walk, Jump, Sit.]
Mean Accuracy: 89.7%


Weizmann Action Dataset: nine actions performed by nine different actors: Bend, Jumping Jack, Jump Forward, Jump in Place, Run, Side Gallop, Walk, Wave One Hand, Wave Two Hands. 81 videos.
Experiments: Weizmann Action Dataset
[Confusion table over actions A1–A9; most classes are recognized 9/9.]
A1: Bend, A2: Jumping Jack, A3: Jump in Place, A4: Run, A5: Side Gallop, A6: Walk, A7: Wave1, A8: Wave2.
Mean Accuracy: 92.6%

UCF Sports Dataset
• Moving camera
• 118 videos
• 14 Diving, 18 Golf-Swings, 20 Kicking, 12 Horse-Riding, 12 Running, 13 Skateboarding, 12 Swing-Bench, 13 Swing-Side
[Example frames: Diving, Golf-Swing-Back, Golf-Swing-Side, Kicking, Riding-Horse, Run, Skateboarding, Swing-Bench, Swing-Side.]
[Confusion table over the UCF Sports classes (Diving, Golf-Swing-Back/Front/Side, Kicking, Riding-Horse, Run, Skateboarding, Swing-Bench, Swing-Side).]
Mean Accuracy: 83%

Limitations:
• Assumes availability of joint trajectories.
• Requires robust detection and robust tracking.




• Consider an action as a bag of video-words.
• Represent the action as a histogram of video-words.
• Perform recognition using a classifier (SVM, KNN).
Advantages (see the sketch below):
• No object detection
• No object tracking
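A minimal end-to-end sketch of this bag-of-video-words pipeline with scikit-learn, using random stand-in descriptors (all sizes and data here are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-ins: each video is a set of spatiotemporal descriptors (one per row).
videos = [rng.normal(size=(rng.integers(30, 60), 100)) for _ in range(20)]
labels = rng.integers(0, 2, size=20)              # stand-in action labels

k = 50                                            # vocabulary size
kmeans = KMeans(n_clusters=k, n_init=10).fit(np.vstack(videos))

def bow_histogram(video):
    """Quantize descriptors to video-words and build a normalized histogram."""
    words = kmeans.predict(video)
    h = np.bincount(words, minlength=k).astype(float)
    return h / h.sum()

H = np.array([bow_histogram(v) for v in videos])
clf = SVC(kernel="rbf").fit(H, labels)            # recognition with an SVM
print(clf.predict(H[:3]))
```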
[Figure: interest point detection; cuboids are quantized into video-words A, B, C.]

• What is the right (vocabulary) codebook?
• The vocabulary should be compact and discriminative.
• Video-words may not be semantically meaningful.
Learning Semantic Visual Vocabularies Using Diffusion Distance
Jingen Liu, Yang Yang and Mubarak Shah
CVPR 2009



• Make use of the co-occurrence of the visual words.
• Embed video-words into a meaningful low-dimensional space.
• Use video-word clusters (high-level features) instead of video-words.

K-means vs. manifold approaches*
[Figure: K-means clustering vs. spectral clustering of embedded points.]
*A. Y. Ng et al., "On spectral clustering: analysis and an algorithm".
Pipeline: Raw Feature Extraction (low-level features: Dollar interest points) → Feature Quantization (k-means; mid-level features: video-words) → Mid-level Feature Embedding (Diffusion Maps) → Vocabulary Construction (k-means; high-level features).

Advantages:
• Provides an explicit metric that reflects the geometric structure of the feature space.
• Discovers the semantic relationships between the feature points.
• Ability to analyze data at multiple scales:
  – using different diffusion times;
  – analysis can be performed at different levels, e.g. Sports > (Football, Baseball), Baseball > (Pitching, Score).

Use PMI to represent each video-word: $x_i = [m_{1,i}, m_{2,i}, \ldots, m_{N_c,i}]$, with

$$m_{c,w} = \log\frac{f_{cw}}{\left(\sum_{w}^{N_w} f_{cw}\right)\left(\sum_{c}^{N_c} f_{cw}\right)},$$

where $f_{cw} = n_{cw}/N_w$ and $n_{cw}$ is the number of times word $w$ appears in clip $c$.

Construct a weighted graph over the video-words,

$$W = \begin{bmatrix} w_{11} & w_{12} & \cdots & w_{1N_w} \\ w_{21} & w_{22} & \cdots & w_{2N_w} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N_w 1} & w_{N_w 2} & \cdots & w_{N_w N_w} \end{bmatrix}, \qquad w_{ij}(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),$$

where $\sigma$ is one of the scale parameters. W is symmetric ($w_{ij} = w_{ji}$) and positive ($w_{ij} > 0$).
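A sketch of computing these PMI weights and the Gaussian affinity matrix in NumPy (the count matrix is a random stand-in; the small epsilon guarding the logarithm is my addition):

```python
import numpy as np

rng = np.random.default_rng(1)
counts = rng.integers(0, 5, size=(30, 50)).astype(float)  # n_cw: word w in clip c

eps = 1e-12
p_cw = counts / counts.sum()                  # joint frequency f_cw
p_c = p_cw.sum(axis=1, keepdims=True)         # marginal over words
p_w = p_cw.sum(axis=0, keepdims=True)         # marginal over clips
M = np.log((p_cw + eps) / (p_c @ p_w + eps))  # PMI weights m_{c,w}

X = M.T                                       # x_i: PMI profile of video-word i
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
sigma = 1.0                                   # first scale parameter
W = np.exp(-sq / (2 * sigma ** 2))            # symmetric, positive affinities
```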

Markov transition matrix:

$$P = \begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1N_w} \\ p_{21} & p_{22} & \cdots & p_{2N_w} \\ \vdots & \vdots & \ddots & \vdots \\ p_{N_w 1} & p_{N_w 2} & \cdots & p_{N_w N_w} \end{bmatrix}, \qquad p_{ij}(x_i, x_j) = \frac{w_{ij}(x_i, x_j)}{d_i}, \quad d_i = \sum_{j=1}^{N_w} w_{ij},$$

where $d_i$ is the row sum. $p(x_i, x_j)$ is the transition probability in one time step; $P^t$ gives the probability of transition from $x_i$ to $x_j$ in $t$ steps, where $t$ is the second scale parameter.

Goal: relate the spectral properties of the Markov chain to the geometry of the data.

$$[D^{(t)}(x_i, x_j)]^2 = \sum_q \frac{\left(p^{(t)}_{iq} - p^{(t)}_{jq}\right)^2}{\phi_0(x_q)}, \qquad \phi_0(x_q) = \lim_{t \to \infty} p^{(t)}_{\cdot q} = \frac{d_q}{\sum_j d_j}, \quad d_q = \sum_j w_{qj}.$$

$\phi_0$ is the stationary probability distribution: the probability of landing at location $q$ after taking infinitely many steps of the random walk. Diffusion distances can be computed using the eigenvectors and eigenvalues of $P$, via its spectral decomposition

$$P = \sum_{l \ge 0} \lambda_l \psi_l \phi_l^T, \qquad 1 = \lambda_0 \ge |\lambda_1| \ge |\lambda_2| \ge \cdots,$$

where $\psi_l$ and $\phi_l$ are the right and left eigenvectors of $P$.

The distance may be approximated with the first $\alpha$ eigenvalues:

$$[D^{(t)}(x_i, x_j)]^2 \approx \sum_{l=1}^{\alpha} \lambda_l^{2t} \left(\psi_l(x_i) - \psi_l(x_j)\right)^2.$$

The diffusion-map embedding is

$$\Psi_t(x) = \left(\lambda_1^t \psi_1(x), \; \lambda_2^t \psi_2(x), \; \ldots, \; \lambda_\alpha^t \psi_\alpha(x)\right),$$

and the Euclidean distance in the embedded space equals the diffusion distance:

$$[D^{(t)}(x_i, x_j)]^2 = \left\|\Psi_t(x_i) - \Psi_t(x_j)\right\|^2.$$
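Putting the pieces together, a sketch of the diffusion-map embedding itself (illustrative; sigma, t, and alpha are the scale parameters above, and a dense eigensolver is used for simplicity):

```python
import numpy as np

def diffusion_map(W, t=2, alpha=3):
    """Embed points with affinity matrix W so that Euclidean distances
    approximate diffusion distances at diffusion time t."""
    P = W / W.sum(axis=1, keepdims=True)      # Markov transition matrix
    vals, vecs = np.linalg.eig(P)             # right eigenvectors psi_l
    order = np.argsort(-vals.real)            # lambda_0 = 1 comes first
    vals, vecs = vals.real[order], vecs.real[:, order]
    # Skip the trivial constant eigenvector; scale each psi_l by lambda_l^t.
    return (vals[1:alpha + 1] ** t) * vecs[:, 1:alpha + 1]

emb = diffusion_map(W)                        # W from the PMI sketch above
# Rows of `emb` are the embedded video-words Psi_t(x_i).
print(emb.shape)                              # (50, 3)
```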
[Figure: diffusion distance vs. geodesic distance, illustrated on spiral points and on the KTH dataset.]
Pipeline: Raw Feature Extraction (low-level features: Dollar) → Feature Quantization (k-means; mid-level features: video-words) → Mid-level Feature Embedding (DM; embedded mid-level features) → Semantic Vocabulary Construction (k-means; high-level features), as sketched below.
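The final step clusters the embedded video-words into the semantic (high-level) vocabulary; a short sketch continuing the snippets above:

```python
from sklearn.cluster import KMeans

# Group embedded video-words into high-level semantic words; each original
# video-word is mapped to the cluster id of its embedding.
semantic_word = KMeans(n_clusters=10, n_init=10).fit_predict(emb)
```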
[Example KTH frames: Boxing, Clapping, Walking, Jogging, Waving, Running.]
• Recognition rate with and without embedding.
[Bar chart: embedded vs. original vocabulary; accuracy axis from 80 to 94.]
• Comparison of recognition rates between different manifold learning schemes.
Average Accuracy = 92.3%
mc , w  log(
f cw
 f    f 
w
94
)
c
PMI
92
Freq.
90
88
86
84
82
50
100
150
200
250
59
Average Accuracy = 76.1%
• Comparison of recognition rates using different manifold learning schemes on the YouTube dataset.
Fifteen-scene dataset: Bedroom (216), Kitchen (210), Living Room (289), Forest (328), Suburb (241), Coast (360), Highway (260), Industry (311), Inside of City (308), Store (315), Mountain (374), Street (292), Open Country (410), Office (215), Tall Building (356).
[Figure: three mid-level visual words (M1–M3) with their corresponding real image patches, and six high-level features (H1–H6) with theirs, e.g. part of the building, part of the foliage, part of the window.]
• Comparison of recognition rates using different manifold learning schemes on the fifteen-scene dataset.
[Bar chart: average accuracy for DM, ISOMAP, PCA, and EigenMap; accuracy axis from 72 to 75.5.]
Incremental Action Recognition Using Feature-Tree
K. Reddy, J. Liu and M. Shah
ICCV 2009
Limitations of the standard pipeline (Feature Detection → K-means → Histogram → SVM):
• Intensive training stage needed to obtain good performance
• Sensitive to the vocabulary size
• Unable to cope with incremental recognition problems
• Simultaneous multiple actions cannot be recognized
• Cannot perform recognition frame by frame



• Detect local features: spatiotemporal interest points.
• Index them using a tree.
Advantages:
• Effective integration of indexing and recognition.
• No vocabulary construction and no category model training.
• Incremental action recognition.
• The tree provides a disk-based data structure, scalable for large-scale datasets.
• Recognition can be performed nearly in real time.



Feature Tree
Tree Construction
[Diagram: Input Video / Data Set → Feature Detection and Extraction → Labeling Feature Points with the Video Class (L1, L2, L3, …) → Constructing the Tree with the Labeled Feature Points.]
• Training videos V = {v1, v2, …, vM} with corresponding class labels li ∈ {1, 2, …, C}.
• We extract n spatiotemporal features dij (1 ≤ j ≤ n) from each video vi.
• Each feature is associated with its class label, giving a two-element feature tuple xij = [dij, li].
• The collection of labeled features T = {xij} is used to construct the tree.



• The SR-tree is a multi-dimensional indexing method (data partitioning using predefined shapes).
• The SR-tree organizes the data by hierarchical regions.
• Each region is defined by the intersection of a bounding sphere and a bounding rectangle.
[Diagram: root node, node level, and leaf level of the tree; a query descends to the leaf to be searched among the labeled data.]
[Diagram, tree construction: Input Video → Feature Detection → Labeling Feature Points with the Video Class (L1, L2, L3, …) → Constructing the Tree.]
[Diagram, query: Query Video → Feature Detection and Extraction → Retrieve the top K features for each feature query using the tree → Feature Voting → Label the Query Video.]
Objective: given a query video Q, assign a class label to it.
Feature extraction:
1. For a given video Q, detect the spatiotemporal interest points using the Dollar detector and extract the cuboids around them.
2. Compute the gradient for each cuboid q and use PCA to reduce the dimension; the cuboid is then represented by a descriptor dq.
3. Represent Q as a set of features {dq}.
Action recognition, given the query features {dq} (see the sketch below):
1. For each query feature dq in Q, retrieve its nearest neighbors using the tree.
2. The class label of Q is decided by summing the votes of the labels assigned to the features {dq}.
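A sketch of the retrieval-and-voting stage, using SciPy's cKDTree as a stand-in for the SR-tree (the SR-tree itself is not available in common libraries; data and sizes are illustrative):

```python
import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_desc = rng.normal(size=(1000, 32))      # labeled training descriptors d_ij
train_label = rng.integers(0, 6, size=1000)   # class label stored with each feature

tree = cKDTree(train_desc)                    # build the index once

def classify(query_desc, K=5):
    """Label a query video from its descriptor set {d_q} by K-NN voting."""
    _, idx = tree.query(query_desc, k=K)      # top-K neighbors per query feature
    votes = Counter()
    for neighbors in np.atleast_2d(idx):
        votes.update(train_label[neighbors].tolist())  # each neighbor casts a vote
    return votes.most_common(1)[0][0]

print(classify(rng.normal(size=(40, 32))))    # predicted class of the query video
```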



• 5 persons and 6 actions used to construct the tree:
  – SR-tree average performance: 87.77%
  – Vocabulary-based (standard bag-of-words) approach: 84.4%
• 10 persons and 6 actions used to construct the tree:
  – SR-tree performance increases to 90.78%
[Bar chart: per-camera accuracy (Camera-1 to Camera-4), training on four views and testing on one view; accuracy axis from 58 to 72.]
• Average accuracy, training on four views and testing on one view: 66.4%
• Training on four views and testing on four views: 72%
[Bar chart: training on 4 views and testing on one view vs. training on 3 views and testing on the 4th view.]

Confusion table (four views for training and testing).

KTH dataset: incremental recognition experiment.



• The initial tree is constructed using videos of Boxing, Clapping, Waving and Jogging performed by 5 persons.
• The tree is then expanded with videos of a new action, Running, incrementally adding videos from one person at a time.
• Testing is done on all 5 actions using videos of 20 people not used in constructing the tree.
[Charts, KTH dataset: classification accuracy vs. number of frames, ground truth vs. classification over time, and recognition in a cluttered environment.]
Jingen Liu, Saad Ali, Arslan Basharat, Kishore Reddy, Yang Yang
http://www.cs.ucf.edu/~vision




What is missing?
• Complex actions
• Explanation/Understanding
• Unsupervised learning
• Other sources of information, e.g. text (semantic similarity)
Example: a basketball going through the hoop.



