Improving Human Action Recognition using Score Distribution and Ranking
Minh Hoai Nguyen
Joint work with Andrew Zisserman

Inherent Ambiguity: When Does an Action Begin and End?

Precise Starting Moment?
- When the hands are being extended?
- When the hands are in contact?

When Does the Action End?
- The action extends over multiple shots
- The camera shows a third person in the middle

Action Location as Latent Information
- Pipeline: video clip -> considered subsequences (the latent location of the action) -> HandShake classifier -> HandShake scores -> Max
- The maximum score is used as the recognition score (in testing) and to update the classifier (in training)
- (An illustrative pooling sketch appears after the slides.)

Poor Performance of Max
Mean Average Precision (higher is better):

    Dataset      Whole   Max
    Hollywood2   66.7    64.8
    TVHID        66.6    65.0

Possible reasons:
- Action recognition is a hard problem
- The learned action classifier is far from perfect
- The output scores are noisy
- The maximum score is not robust

Can We Use Mean Instead?
- Same pipeline (video clip -> considered subsequences -> HandShake classifier -> HandShake scores), but the scores are averaged (Mean) instead of taking the Max
- On Hollywood2, Mean is generally better than Max, but not always:

    Class                   Whole   Max    Mean
    Hollywood2-HandShake    48.0    57.1   50.3

Another HandShake Example
- The proportion of HandShake frames is small
- For Whole and Mean, the signal-to-noise ratio is small

Proposed Method: Use the Distribution
- Pipeline: video clip -> sampled subsequences (the latent location of the action) -> base HandShake classifier -> HandShake scores -> sort -> distribution-based classification -> improved HandShake score
- (A scoring sketch appears after the slides.)

Learning Formulation
- Ingredients: the subsequence-score distribution, the video label, weights, a bias, and the hinge loss
- (A hedged reconstruction of the objective and a training sketch appear after the slides.)

Weights for Distribution
- The weights emphasize the relative importance of the sorted classifier scores
- Special cases (see the numerical check after the slides):
  - Case 1: uniform weights, equivalent to using Mean
  - Case 2: all weight on the top score, equivalent to using Max

Controlled Experiments
- Synthetic videos with a randomly placed action segment
- [Figure: distributions of the non-action and action features]
- Two controlled parameters: the action percentage and the separation between non-action and action features
- (A data-generation sketch appears after the slides.)

Controlled Experiments (results)
[Figure: results of the controlled experiments]

Hollywood2 - Progress over Time
[Bar chart: best published results per year (2009-2013) versus ours; axis: Mean Average Precision (higher is better); annotated improvements of 8.6% and 9.3%]

Hollywood2 - State-of-the-art Methods
[Bar chart; axis: Mean Average Precision (higher is better)]
- Marszalek, Laptev, Schmid, CVPR09 (dataset introduction; STIP + scene context)
- Han, Bo, Sminchisescu, ICCV09
- Wang, Ullah, Klaser, et al., BMVC09
- Taylor, Fergus, LeCun, Bregler, ECCV10 (deep learning features)
- Ullah, Parizi, Laptev, BMVC10
- Gilbert, Illingworth, Bowden, PAMI11 (mined compound features)
- Le, Zou, Yeung, Ng, CVPR11
- Wang, Klaser, Schmid, CVPR11 (Dense Trajectory Descriptor, DTD)
- Jiang, Dai, Xue, Liu, Ngo, ECCV12
- Vig, Dorr, Cox, ECCV12 (DTD + saliency)
- Mathe & Sminchisescu, ECCV12 (DTD + saliency)
- Jain, Jegou, Bouthemy, CVPR13 (improved DTD: better motion estimation)
- Wang & Schmid, ICCV13 (improved DTD)
- Our method

Results on TVHI Dataset
[Bar chart; axis: Mean Average Precision (higher is better); annotated improvement of 14.8%]
- Patron, Marszalek, Zisserman, Reid, BMVC10
- Marin, Yeguas, Blanca, PRL13
- Patron, Marszalek, Reid, Zisserman, PAMI12
- Gaidon, Harchaoui, Schmid, BMVC12
- Yu, Yuan, Liu, ECCV12
- Hoai & Zisserman, CVPR14
- Our method

Weights for SSD (Subsequence-Score Distribution) Classifiers

AnswerPhone Example 1

AnswerPhone Example 2

The End
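
Illustrative sketches (referenced from the slides above; hedged reconstructions, not the authors' released code).

Max vs. Mean pooling (slides "Action Location as Latent Information" and "Can We Use Mean Instead?"): the sketch below scores a video by applying a base per-subsequence classifier and pooling the results. The names base_score, min_len, and stride are hypothetical placeholders; the real system uses a trained HandShake classifier on video descriptors.

    import numpy as np

    def subsequence_scores(frame_features, base_score, min_len=10, stride=5):
        # base_score: hypothetical callable mapping a slice of frame features
        # to a real-valued classifier score (stand-in for the HandShake classifier).
        n = len(frame_features)
        scores = []
        for start in range(0, n - min_len + 1, stride):
            for end in range(start + min_len, n + 1, stride):
                scores.append(base_score(frame_features[start:end]))
        return np.array(scores)

    def max_pool(scores):
        # Treat the best-scoring subsequence as the latent action location.
        # Sensitive to noise: a single spurious high score dominates.
        return float(scores.max())

    def mean_pool(scores):
        # Average over all candidates: more robust to noisy scores, but the
        # signal is diluted when the action covers a small fraction of the clip.
        return float(scores.mean())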
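
Distribution-based scoring (slide "Proposed Method: Use the Distribution"): a minimal sketch of sorting the subsequence scores and combining them with learned rank weights. Resampling to a fixed number of ranks k is my assumption to keep the representation fixed-length; the paper may handle variable numbers of subsequences differently.

    import numpy as np

    def sorted_score_vector(scores, k=20):
        # Sort descending and resample to k ranks so every video has the
        # same-length summary of its score distribution.
        s = np.sort(np.asarray(scores))[::-1]
        idx = np.linspace(0, len(s) - 1, k).round().astype(int)
        return s[idx]

    def distribution_score(scores, w, b, k=20):
        # Improved HandShake score: weighted sum of the sorted scores plus a bias.
        return float(np.dot(w, sorted_score_vector(scores, k)) + b)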
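
Learning formulation (slide "Learning Formulation"): the slide lists the ingredients (the sorted subsequence-score vector s_i, the video label y_i in {-1, +1}, weights w, a bias b, and the hinge loss). Below is a hedged reconstruction of a standard SVM-style objective built from exactly these ingredients; the paper's regularization and any constraints on the weights may differ.

    \min_{\mathbf{w},\,b}\;
      \frac{\lambda}{2}\,\lVert \mathbf{w} \rVert^{2}
      \;+\; \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(\mathbf{w}^{\top}\mathbf{s}_i + b)\bigr)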
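
A corresponding training sketch, using scikit-learn's LinearSVC with the hinge loss as a stand-in solver; the rank count k and the resampling scheme are assumptions for illustration, not details from the paper.

    import numpy as np
    from sklearn.svm import LinearSVC

    def _sorted_ranks(scores, k):
        s = np.sort(np.asarray(scores))[::-1]                    # descending
        idx = np.linspace(0, len(s) - 1, k).round().astype(int)  # resample to k ranks
        return s[idx]

    def fit_distribution_classifier(score_lists, labels, k=20):
        # score_lists: per-video lists of base-classifier subsequence scores
        # labels: +1 / -1 video labels (e.g., HandShake vs. not)
        X = np.stack([_sorted_ranks(s, k) for s in score_lists])
        clf = LinearSVC(loss="hinge", C=1.0, dual=True).fit(X, np.asarray(labels))
        return clf.coef_.ravel(), float(clf.intercept_[0])       # rank weights w, bias b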
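
Special-case weights (slide "Weights for Distribution"): a small numerical check that uniform rank weights reproduce Mean pooling, and a one-hot weight on the top rank reproduces Max pooling (zero bias assumed).

    import numpy as np

    k = 8
    scores = np.random.randn(k)
    s = np.sort(scores)[::-1]               # sorted descending

    w_mean = np.full(k, 1.0 / k)            # Case 1: uniform weights
    w_max = np.zeros(k); w_max[0] = 1.0     # Case 2: all weight on the top rank

    assert np.isclose(np.dot(w_mean, s), scores.mean())
    assert np.isclose(np.dot(w_max, s), scores.max())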
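
Controlled experiments (slide "Controlled Experiments"): a sketch of the kind of synthetic video described, with one randomly placed action segment and two controlled parameters, the action percentage and the separation between the non-action and action feature distributions. The Gaussian model and parameter names are my assumptions for illustration.

    import numpy as np

    def synthetic_video(n_frames=200, action_pct=0.1, separation=2.0, rng=None):
        # Non-action frames draw a 1-D feature from N(0, 1); action frames from
        # N(separation, 1). The action is one contiguous segment at a random location.
        rng = rng or np.random.default_rng()
        features = rng.normal(0.0, 1.0, n_frames)
        length = max(1, int(round(action_pct * n_frames)))
        start = int(rng.integers(0, n_frames - length + 1))
        features[start:start + length] = rng.normal(separation, 1.0, length)
        return features, (start, start + length)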