Improving Human Action Recognition
using Score Distribution and Ranking
Minh Hoai Nguyen
Joint work with Andrew Zisserman
Inherent Ambiguity:
When does an action begin and end?
Precise Starting Moment?
- Hands are being extended?
- Hands are in contact?
When Does the Action End?
- Action extends over multiple shots
- Camera shows a third person in the middle
Action Location as Latent Information
[Pipeline diagram] Given a video clip, the location of the action is treated as latent: all subsequences of the clip are considered, each is scored by a HandShake classifier, and the maximum of these scores is taken. The max is used as the recognition score in testing and to update the classifier in training.
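The max-pooling step can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: `score_subsequence` is a stand-in for the learned HandShake classifier, and the window length and step are arbitrary assumed values.

```python
def clip_score_max(frames, score_subsequence, min_len=10, step=5):
    """Score a clip as the max classifier score over its subsequences.

    `score_subsequence` is assumed to map a slice of frames to a real
    score (it stands in for the learned HandShake classifier).
    `min_len` and `step` are illustrative window settings.
    """
    n = len(frames)
    best = float("-inf")
    for start in range(0, n - min_len + 1, step):
        for end in range(start + min_len, n + 1, step):
            best = max(best, score_subsequence(frames[start:end]))
    return best
```

With a toy scorer that returns the mean frame label of a window, the max over windows picks out the window that best covers the action.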
Poor Performance of Max
Mean Average Precision (higher is better):

  Dataset      Whole   Max
  Hollywood2   66.7    64.8
  TVHID        66.6    65.0
Possible reasons:
- Action recognition is a hard problem
- The learned action classifier is far from perfect
- The output scores are noisy
- The maximum score is not robust
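The non-robustness of the maximum can be illustrated with a small simulation (toy numbers, not from the deck): over many noisy subsequence scores, the max on a clip with no action at all can exceed a clean action score.

```python
import random

random.seed(0)

def max_pooled(n_subseq, base, noise):
    """Max over n_subseq noisy subsequence scores centered at `base`."""
    return max(base + random.gauss(0.0, noise) for _ in range(n_subseq))

# A negative clip: 1000 subsequences, all truly non-action (level -1).
neg_clip = max_pooled(1000, base=-1.0, noise=1.0)

# A clean action score on a positive clip (noise-free, for illustration).
pos_action = 1.0

# With enough noisy samples, the max on the negative clip exceeds the
# clean action score, so ranking clips by Max becomes unreliable.
```

The effect grows with the number of subsequences considered, which is exactly the regime of dense subsequence scanning.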
Can We Use Mean Instead?
[Pipeline diagram] As before, the considered subsequences of the clip are scored by the HandShake classifier, but the mean of the scores is used instead of the max.
On Hollywood2, Mean is generally better than Max, but not always:

                         Whole   Max    Mean
  Hollywood2-Handshake   48.0    57.1   50.3
Another HandShake Example
- The proportion of the clip occupied by the HandShake is small
- For Whole and Mean, the signal-to-noise ratio is therefore small
Proposed Method: Use the Distribution
[Pipeline diagram] Subsequences are sampled from the video clip and scored by a base HandShake classifier. The scores are sorted, and a distribution-based classifier applied to the sorted scores yields an improved HandShake score.
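The proposed pipeline can be sketched as follows. This is a minimal sketch under assumptions: windows are enumerated rather than sampled as in the paper, `base_score` stands in for the base HandShake classifier, and `w`, `b` are the learned distribution weights and bias (the length of `w` must match the number of subsequences per clip).

```python
def improved_score(frames, base_score, w, b=0.0, min_len=10, step=5):
    """Distribution-based score: sort the subsequence scores in
    descending order, then apply a linear classifier (w, b) to the
    sorted score vector."""
    n = len(frames)
    scores = []
    for start in range(0, n - min_len + 1, step):
        for end in range(start + min_len, n + 1, step):
            scores.append(base_score(frames[start:end]))
    scores.sort(reverse=True)
    # One weight per sorted score; len(w) is assumed to equal len(scores).
    return sum(wi * si for wi, si in zip(w, scores)) + b
```

Putting all the weight on the first (largest) sorted score recovers max pooling; spreading it uniformly recovers mean pooling.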
Learning Formulation

Let s_i be the sorted subsequence-score distribution of clip i, y_i ∈ {−1, +1} its video label, w the weights, and b the bias. Train by minimizing the regularized hinge loss:

  minimize over w, b:   (λ/2)‖w‖² + Σ_i max(0, 1 − y_i (wᵀ s_i + b))
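The learning step amounts to training a linear classifier on sorted-score vectors with a hinge loss. Below is a minimal numpy sketch using plain subgradient descent; the toy data, learning rate, and epoch count are assumptions, not the paper's optimizer or settings.

```python
import numpy as np

def train_ssd_classifier(S, y, lam=0.01, lr=0.1, epochs=500):
    """Learn weights w and bias b on sorted-score vectors S (n x d),
    labels y in {-1, +1}, by subgradient descent on
        (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i*(w @ S_i + b)).
    """
    n, d = S.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        margins = y * (S @ w + b)
        active = margins < 1.0  # examples inside the hinge region
        grad_w = lam * w - (y[active][:, None] * S[active]).sum(axis=0) / n
        grad_b = -y[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical sorted-score vectors: positives have a high top score.
S = np.array([[2.0, 1.0, 0.0],
              [1.5, 1.0, 0.5],
              [0.2, 0.1, 0.0],
              [0.3, 0.2, 0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_ssd_classifier(S, y)
```

Any off-the-shelf linear SVM solver would serve the same role; the sketch only shows that the inputs are sorted score vectors, not raw features.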
Weights for Distribution
The weights w set the relative importance of the sorted classifier scores.

Special cases:
- Case 1: uniform weights (equivalent to using Mean)
- Case 2: all weight on the top score (equivalent to using Max)
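Both special cases can be checked directly on toy scores (the numbers below are illustrative): with scores sorted in descending order, uniform weights reproduce the Mean and a one-hot weight on the first entry reproduces the Max.

```python
def distribution_score(scores, w, b=0.0):
    """Weighted combination of descending-sorted classifier scores."""
    s = sorted(scores, reverse=True)
    return sum(wi * si for wi, si in zip(w, s)) + b

scores = [0.2, -1.0, 3.5, 0.7]  # toy subsequence scores
n = len(scores)

mean_w = [1.0 / n] * n              # Case 1: uniform weights -> Mean
max_w = [1.0] + [0.0] * (n - 1)     # Case 2: top score only   -> Max

assert abs(distribution_score(scores, mean_w) - sum(scores) / n) < 1e-9
assert distribution_score(scores, max_w) == max(scores)
```

Learning w thus searches a family of pooling operators that contains Mean and Max as endpoints.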
Controlled Experiments
[Figure] Synthetic videos with a random action location; non-action and action frame features are drawn from two overlapping distributions.

Two controlled parameters:
- the action percentage
- the separation between non-action and action features
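The synthetic setup can be sketched as follows. The specifics are assumptions for illustration: 1-D Gaussian frame features with unit variance, the action percentage and the feature separation exposed as the two controlled parameters.

```python
import random

def synth_clip(n_frames, action_pct, separation, rng):
    """Generate 1-D frame features for a synthetic clip:
    non-action frames ~ N(0, 1), action frames ~ N(separation, 1),
    with the action placed at a random location."""
    n_action = max(1, int(round(action_pct * n_frames)))
    start = rng.randrange(n_frames - n_action + 1)
    feats = [rng.gauss(0.0, 1.0) for _ in range(n_frames)]
    for i in range(start, start + n_action):
        feats[i] += separation
    return feats, (start, start + n_action)

rng = random.Random(0)
feats, (s, e) = synth_clip(100, action_pct=0.2, separation=3.0, rng=rng)
```

Sweeping `action_pct` and `separation` over a grid reproduces the kind of controlled comparison between Whole, Max, Mean, and the distribution-based score described on these slides.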
Controlled Experiments: Results
[Results figure]
Hollywood2 – Progress over Time
[Bar chart: Mean Average Precision (higher is better) of the best published result per year, 2009–2013, versus ours; annotated gains of 8.6% and 9.3%.]
Hollywood2 – State-of-the-art Methods
[Bar chart: Mean Average Precision (higher is better). Methods compared:]
- Marszalek, Laptev, Schmid, CVPR09 (dataset introduction: STIP + scene context)
- Han, Bo, Sminchisescu, ICCV09
- Taylor, Fergus, LeCun, Bregler, ECCV10 (deep learning features)
- Wang, Ullah, Klaser et al., BMVC09
- Gilbert, Illingworth, Bowden, PAMI11 (mined compound features)
- Le, Zou, Yeung, Ng, CVPR11
- Ullah, Parizi, Laptev, BMVC10
- Wang, Klaser, Schmid, CVPR11 (Dense Trajectory Descriptor, DTD)
- Jiang, Dai, Xue, Liu, Ngo, ECCV12
- Vig, Dorr, Cox, ECCV12 (DTD + saliency)
- Mathe & Sminchisescu, ECCV12 (DTD + saliency)
- Jain, Jegou, Bouthemy, CVPR13 (improved DTD: better motion estimation)
- Wang & Schmid, ICCV13 (improved DTD)
- Our method (same features)
Results on TVHI Dataset
[Bar chart: Mean Average Precision (higher is better); annotated gain: 14.8%. Methods compared:]
- Patron, Marszalek, Zisserman, Reid, BMVC10
- Marin, Yeguas, Blanca, PRL13
- Patron, Marszalek, Reid, Zisserman, PAMI12
- Gaidon, Harchaoui, Schmid, BMVC12
- Yu, Yuan, Liu, ECCV12
- Hoai & Zisserman, CVPR14
- Our method
Weights for SSD (subsequence-score distribution) classifiers
AnswerPhone Example 1
AnswerPhone Example 2
The End