Action recognition with improved trajectories
Heng Wang and Cordelia Schmid
LEAR Team, INRIA

Action recognition in realistic videos

Challenges:
- Severe camera motion
- Variation in human appearance and pose
- Cluttered background and occlusion
- Viewpoint and illumination changes

Current state of the art:
- Local space-time features + bag-of-features model
- Dense trajectories perform the best on a large variety of datasets (Wang et al., IJCV'13)

Dense trajectories revisited

Three major steps:
- Dense sampling
- Feature tracking
- Trajectory-aligned descriptors

Advantages:
- Capture the intrinsic dynamic structures in video
- MBH is robust to camera motion

Disadvantages:
- Irrelevant trajectories are generated in the background due to camera motion
- Motion descriptors (e.g., HOF, MBH) are corrupted by camera motion

Improved dense trajectories

Contributions:
- Improve dense trajectories by explicit camera motion estimation
- Detect humans to remove outlier matches for homography estimation
- Stabilize the optical flow to eliminate camera motion
- Remove trajectories caused by camera motion

Camera motion estimation

Find correspondences between two consecutive frames:
- Extract and match SURF features (robust to motion blur)
- Sample good-features-to-track interest points and match them through the optical flow

Combining SURF (green) and optical flow (red) matches yields a more balanced distribution. Use RANSAC to estimate a homography from all feature matches (a code sketch is given below).
[Figure: inlier matches of the homography]

Remove inconsistent matches due to humans

Human motion is not constrained by camera motion and thus generates outlier matches:
- Apply a human detector in each frame, and track each bounding box forward and backward to join detections
- Remove feature matches inside the human bounding boxes during homography estimation (see the sketch below)
[Figure: inlier matches and warped flow, without and with human detection]

Warp optical flow

Warp the second of two consecutive frames with the homography and re-compute the optical flow (sketched below):
- For HOF, the warped flow removes irrelevant camera motion and thus encodes only foreground motion
- For MBH, it also helps, as the motion boundaries are enhanced
[Figures: two images overlaid; original optical flow; warped version]

Remove background trajectories

Remove trajectories by thresholding the maximal magnitude of the stabilized motion vectors in the warped optical flow (sketched below). Our method works well under various camera motions, such as pan, zoom, and tilt.
[Figure: successful examples and failure cases; removed trajectories in white, foreground trajectories in green]
Failure cases are due to severe motion blur: the homography is not correctly estimated because the feature matches are unreliable.

Demo: warped flow and trajectory removal

- The warped optical flow eliminates background camera motion
- Removing background trajectories makes the feature representation focus more on human motion

Experimental setting

- Apply "RootSIFT" normalization to each descriptor, then PCA to reduce its dimension by a factor of two
- Use Fisher vectors to encode each descriptor type separately, with the number of Gaussians set to K = 256
- Use power + L2 normalization for the FV, and a linear SVM with one-against-rest for multi-class classification (a sketch of this encoding pipeline is given below)

Datasets:
- Hollywood2: 12 classes from 69 movies; report mAP
- HMDB51: 51 classes; report average accuracy over three splits
- Olympic Sports: 16 sport actions; report mAP
- UCF50: 50 classes; report average accuracy over 25 groups
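The camera motion estimation step above can be illustrated with OpenCV. This is a minimal sketch, not the authors' released code: it assumes an opencv-contrib build for SURF (SIFT is a drop-in substitute), Farnebäck flow to propagate the good-features-to-track points, and a 0.75 ratio test and 1-pixel RANSAC threshold as illustrative parameters.

```python
import cv2
import numpy as np

def estimate_homography(prev_gray, curr_gray):
    """Estimate camera motion between two consecutive grayscale frames
    as a homography, pooling SURF matches with good-features-to-track
    points propagated by dense optical flow, then running RANSAC."""
    # 1) SURF keypoint matches (requires an opencv-contrib build).
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    kp1, des1 = surf.detectAndCompute(prev_gray, None)
    kp2, des2 = surf.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    pairs = matcher.knnMatch(des1, des2, k=2)
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]  # ratio test
    src = np.float32([kp1[m.queryIdx].pt for m in good])
    dst = np.float32([kp2[m.trainIdx].pt for m in good])

    # 2) Good-features-to-track points, matched through the dense flow.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=1000,
                                  qualityLevel=0.01, minDistance=8)
    pts = pts.reshape(-1, 2)
    disp = flow[pts[:, 1].astype(int), pts[:, 0].astype(int)]
    src = np.vstack([src, pts])
    dst = np.vstack([dst, pts + disp])

    # 3) Robust homography from the pooled correspondences.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 1.0)
    return H, inlier_mask
```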
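Removing the human-induced outlier matches then reduces to masking correspondences before RANSAC. A hypothetical helper, not the authors' code: the (x1, y1, x2, y2) box format and the choice to test only the source point are assumptions.

```python
import numpy as np

def drop_human_matches(src, dst, human_boxes):
    """Remove correspondences whose source point falls inside a tracked
    human bounding box (x1, y1, x2, y2) before homography estimation,
    so human motion does not bias the camera-motion estimate."""
    keep = np.ones(len(src), dtype=bool)
    for x1, y1, x2, y2 in human_boxes:
        inside = ((src[:, 0] >= x1) & (src[:, 0] <= x2) &
                  (src[:, 1] >= y1) & (src[:, 1] <= y2))
        keep &= ~inside
    return src[keep], dst[keep]
```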
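Warping the optical flow can be sketched as follows, assuming H maps coordinates of the first frame to the second (as returned by the homography sketch above), so the second frame is pulled back into the first frame's coordinate system before the flow is recomputed.

```python
import cv2

def warp_flow(prev_gray, curr_gray, H):
    """Stabilize the second frame with the estimated homography H
    (first frame -> second frame), then recompute dense optical flow;
    the resulting flow encodes foreground motion only."""
    h, w = prev_gray.shape
    curr_stab = cv2.warpPerspective(curr_gray, H, (w, h),
                                    flags=cv2.INTER_LINEAR | cv2.WARP_INVERSE_MAP)
    return cv2.calcOpticalFlowFarneback(prev_gray, curr_stab, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```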
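Background-trajectory pruning then thresholds the stabilized motion vectors along each track, as on the "Remove background trajectories" slide. The 1-pixel default below is an illustrative choice, not a value quoted on the slide.

```python
import numpy as np

def is_camera_trajectory(points, warped_flows, thresh=1.0):
    """Label a trajectory as camera-induced if the maximal magnitude of
    its stabilized motion vectors, read from the warped flow fields,
    stays below `thresh` pixels. `points` is the list of (x, y)
    positions of the track; `warped_flows[i]` is the warped flow
    between frames i and i+1 (an H x W x 2 array)."""
    max_mag = 0.0
    for (x, y), flow in zip(points[:-1], warped_flows):
        dx, dy = flow[int(y), int(x)]
        max_mag = max(max_mag, float(np.hypot(dx, dy)))
    return max_mag < thresh
```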
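The encoding pipeline of the Experimental setting slide (RootSIFT normalization, PCA, Fisher vectors with power + L2 normalization) can be sketched with NumPy and scikit-learn. The Fisher vector below follows the standard mean/variance-gradient formulation; it is a simplified illustration, not the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def root_normalize(desc):
    """'RootSIFT'-style normalization: L1-normalize each descriptor,
    then take a (sign-preserving) element-wise square root."""
    desc = desc / (np.abs(desc).sum(axis=1, keepdims=True) + 1e-12)
    return np.sign(desc) * np.sqrt(np.abs(desc))

def fisher_vector(x, gmm):
    """Fisher vector of local descriptors x (N x D) under a
    diagonal-covariance GMM, using the standard gradients w.r.t. means
    and variances, followed by power + L2 normalization."""
    q = gmm.predict_proba(x)                                  # N x K soft assignments
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_  # K, KxD, KxD
    diff = (x[:, None, :] - mu[None]) / np.sqrt(var)[None]    # N x K x D
    g_mu = (q[..., None] * diff).sum(0) / (len(x) * np.sqrt(pi)[:, None])
    g_var = (q[..., None] * (diff ** 2 - 1)).sum(0) / (len(x) * np.sqrt(2 * pi)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                    # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                  # L2 normalization

# Assumed overall flow: PCA halves the descriptor dimension before the GMM,
# and a linear one-vs-rest SVM is trained on the video-level Fisher vectors:
#   pca = sklearn.decomposition.PCA(n_components=D // 2).fit(train_desc)
#   gmm = GaussianMixture(n_components=256, covariance_type="diag").fit(pca.transform(train_desc))
#   clf = sklearn.svm.LinearSVC().fit(train_fvs, labels)  # one-vs-rest by default
```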
Evaluation of the intermediate steps

Results on HMDB51 using Fisher vectors:

| Descriptor  | DTF   | WarpFlow | RmTrack | ITF   |
|-------------|-------|----------|---------|-------|
| Trajectory  | 25.4% | 31.0%    | 26.9%   | 32.4% |
| HOG         | 38.4% | 38.7%    | 39.6%   | 40.2% |
| HOF         | 39.5% | 48.5%    | 41.6%   | 48.9% |
| MBH         | 49.1% | 50.9%    | 50.8%   | 52.1% |
| HOF+MBH     | 49.8% | 53.5%    | 51.0%   | 54.7% |
| Combined    | 52.2% | 55.6%    | 53.9%   | 57.2% |

Baseline and variants:
- DTF = "dense trajectory feature" (baseline)
- WarpFlow = "warp the optical flow"
- RmTrack = "remove background trajectories"
- ITF = "improved trajectory feature", combining WarpFlow and RmTrack

Observations:
- Both Trajectory and HOF are significantly improved; MBH is also better, as motion boundaries are clearer; HOG does not change much
- HOF and MBH are complementary, as they represent zero- and first-order motion information
- Both RmTrack and WarpFlow help; WarpFlow contributes more; combining them (ITF) works best

Impact of feature encoding on improved trajectories

| Dataset        | BOF, DTF | BOF, ITF | FV, DTF | FV, ITF |
|----------------|----------|----------|---------|---------|
| Hollywood2     | 58.5%    | 62.2%    | 60.1%   | 64.3%   |
| HMDB51         | 47.2%    | 52.1%    | 52.2%   | 57.2%   |
| Olympic Sports | 75.4%    | 83.3%    | 84.7%   | 91.1%   |
| UCF50          | 84.8%    | 87.2%    | 88.6%   | 91.2%   |

Comparison of DTF and ITF using different feature encodings:
- Standard bag of features (sketched below): train a codebook of 4,000 visual words with k-means for each descriptor type; RBF-kernel SVM for classification
- We observe a similar improvement of ITF over DTF whether BOF or FV is used for feature encoding
- The improvement of FV over BOF varies across datasets, from 2% to 7%

Impact of human detection and state of the art

| Dataset        | State of the art      | With HD | Without HD |
|----------------|-----------------------|---------|------------|
| Hollywood2     | 62.5% (Jain, CVPR'13) | 64.3%   | 63.0%      |
| HMDB51         | 52.1% (Jain, CVPR'13) | 57.2%   | 55.9%      |
| Olympic Sports | 83.2% (Jain, CVPR'13) | 91.1%   | 90.2%      |
| UCF50          | 83.3% (Shi, CVPR'13)  | 91.2%   | 90.5%      |

HD stands for human detection.
- Human detection always helps. For Hollywood2 and HMDB51 the difference is more significant, as more humans are present
- Our method significantly outperforms the state of the art on all four datasets
- Source code: http://lear.inrialpes.fr/~wang/improved_trajectories

THUMOS'13 Action Recognition Challenge

- Dataset: three train-test splits from UCF101
- We follow exactly the same framework: improved trajectory features + Fisher vectors
- We do not apply human detection, as it is computationally expensive to run on large datasets
- We use spatio-temporal pyramids to embed structure information in the final representation (sketched below)

| Descriptors  | None  | T2    | H3    | Combined |
|--------------|-------|-------|-------|----------|
| HOG          | 72.4% | 72.8% | 73.2% | 74.6%    |
| HOF          | 76.0% | 76.1% | 77.3% | 78.3%    |
| MBH          | 80.8% | 81.1% | 80.5% | 82.1%    |
| HOG+HOF      | 82.9% | 82.7% | 82.7% | 83.9%    |
| HOG+MBH      | 83.3% | 83.3% | 83.4% | 84.4%    |
| HOF+MBH      | 82.2% | 82.2% | 82.0% | 83.3%    |
| HOG+HOF+MBH  | 84.8% | 84.8% | 84.6% | 85.9%    |

None = whole video (no pyramid); T2 = two temporal segments; H3 = three horizontal spatial stripes; Combined = the three configurations combined.

- We do not include the Trajectory descriptor, as adding it does not improve the final performance
- For a single descriptor, MBH > HOF > HOG; among pairs of descriptors, HOG+MBH works best, as they are the most complementary
- Spatio-temporal pyramids always help; the improvement is more significant for single descriptors
- Combining everything gives the best performance, 85.9%, which is the result we submitted to THUMOS
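For the bag-of-features baseline above, a minimal sketch; the MiniBatchKMeans clustering implementation and the L1 histogram normalization are assumptions, not details from the slide.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def bof_histogram(desc, codebook):
    """Quantize local descriptors onto the nearest visual word and
    return an L1-normalized word histogram for the video."""
    words = codebook.predict(desc)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)

# One 4000-word codebook per descriptor type, as on the slide:
#   codebook = MiniBatchKMeans(n_clusters=4000).fit(subsampled_train_descriptors)
```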
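The spatio-temporal pyramid variants can be sketched as cell assignments over trajectory positions. Here T2 and H3 are implemented as two temporal halves and three horizontal stripes, matching the table headers above; each cell is encoded separately and the cell encodings are concatenated with the whole-video encoding.

```python
import numpy as np

def stp_cell(ys, ts, height, length, scheme):
    """Assign each trajectory (by its mean y position `ys` and mean
    frame index `ts`) to a pyramid cell. "T2" splits the video into two
    temporal halves, "H3" into three horizontal spatial stripes."""
    ys, ts = np.asarray(ys, dtype=float), np.asarray(ts, dtype=float)
    if scheme == "T2":
        return np.clip(ts * 2 // length, 0, 1).astype(int)
    if scheme == "H3":
        return np.clip(ys * 3 // height, 0, 2).astype(int)
    raise ValueError(f"unknown scheme: {scheme}")
```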
TRECVID'13 Multimedia Event Detection

- Large-scale video classification: 4,500 hours, over 100,000 videos
- ITF is the best video descriptor and very fast to compute: our whole pipeline (ITF+FV) is only about 10 times slower than real time
- For the visual channel, we combine ITF and SIFT (a fusion sketch is given below)

Top performance on MED ad-hoc:

| Channel     | AXES  | CMU   | BBNVISER | Sesame | MediaMill | NII   | SRIAURORA | Genie |
|-------------|-------|-------|----------|--------|-----------|-------|-----------|-------|
| Full system | 36.6% | 36.3% | 32.2%    | 25.7%  | 25.3%     | 24.9% | 24.2%     | 20.2% |
| ASR         | 1.0%  | 5.7%  | 8.0%     | 3.9%   | --        | --    | 3.9%      | 4.3%  |
| OCR         | 1.1%  | 3.7%  | 5.3%     | 0.2%   | --        | --    | 4.3%      | --    |
| Audio       | 12.4% | 16.1% | 15.1%    | 5.6%   | 5.6%      | 8.8%  | 9.6%      | 10.1% |
| Visual      | 29.4% | 28.4% | 23.4%    | 23.2%  | 23.8%     | 19.9% | 20.4%     | 16.9% |
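The slide does not say how the ITF and SIFT channels are combined; a common choice is weighted late fusion of per-channel classifier scores, sketched here purely as an assumed illustration.

```python
import numpy as np

def late_fuse(channel_scores, weights=None):
    """Weighted average of per-channel classifier scores (e.g., ITF and
    SIFT). channel_scores is a list of (n_videos x n_classes) arrays,
    one per channel; uniform weights are used when none are given."""
    scores = np.stack([np.asarray(s, dtype=float) for s in channel_scores])
    w = np.ones(len(scores)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, scores, axes=1)
```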