Speeded Up Event Recognition (SUPER): Towards Real-time Event Recognition in Internet Videos
Yu-Gang Jiang
School of Computer Science, Fudan University, Shanghai, China
ygj@fudan.edu.cn
ACM International Conference on Multimedia Retrieval (ICMR), Hong Kong, China, June 2012

The Problem
• Recognize high-level events in videos; we are particularly interested in Internet consumer videos.
• Applications: video search, personal video collection management, smart advertising, intelligence analysis, ...

Our Objective
• Improve efficiency.
• Maintain accuracy.

The Baseline Recognition Framework
• Best-performing approach in the TRECVID 2010 Multimedia Event Detection (MED) task.
• Features: SIFT, spatio-temporal interest points (STIP), and MFCC audio features.
• Classifier: χ² kernel SVM, followed by late average fusion.
• Reference: Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, "Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching," NIST TRECVID Workshop, 2010.

Three Audio-Visual Features
• SIFT (visual) – D. Lowe, IJCV 2004.
• STIP (visual) – I. Laptev, IJCV 2005.
• MFCC (audio) – computed over short (16 ms) audio frames.

Bag-of-Words Representation
• SIFT / STIP / MFCC words, quantized with soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007).
• [Figure: bag-of-SIFT pipeline – keypoint extraction (DoG, Hessian-Affine), vocabulary generation, and BoW histograms using soft weighting.]
• Bag of audio words / bag of frames: K. Lee and D. Ellis, "Audio-Based Semantic Concept Classification for Consumer Video," IEEE Transactions on Audio, Speech, and Language Processing, 2010.

Baseline Speed
• Four factors affect speed: features, classifier, fusion, and frame sampling.
• Per-component cost, in seconds needed to process an 80-second video (SIFT sampled at 0.5 fps); classification time is for running the classifiers of all 20 categories on one video:
  – SIFT: 82.0
  – Spatio-temporal interest points (STIP): 916.8
  – MFCC audio feature: 2.36
  – χ² kernel SVM classification: ~2.00
  – Late average fusion: << 1
• Total: 1003 seconds per video!

Dataset: Columbia Consumer Videos (CCV)
• 20 categories: Basketball, Skiing, Dog, Baseball, Swimming, Bird, Soccer, Biking, Ice Skating, Cat, Graduation, Birthday, Wedding Reception, Non-music Performance, Wedding Ceremony, Wedding Dance, Music Performance, Parade, Beach, Playground.
• Reference: Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, "Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance," ACM ICMR 2011.

Feature Options
• (Sparse) SIFT
• STIP
• MFCC
• Dense SIFT (DIFT)
• Dense SURF (DURF)
• Self-Similarities (SSIM)
• Color Moments (CM)
• GIST
• LBP
• TINY
• Uijlings, Smeulders, Scha, "Real-time Bag of Words, Approximately," ACM CIVR 2009.
• Suggested feature combinations: see the fusion comparison below.

Classifier Kernels
• Chi-square (χ²) kernel.
• Histogram intersection (HI) kernel.
• Fast HI kernel (fastHI): Maji, Berg, Malik, "Classification Using Intersection Kernel Support Vector Machines is Efficient," CVPR 2008.

Multi-modality Fusion
• Early fusion: feature concatenation.
• Kernel fusion: K_f = K_1 + K_2 + ...
• Late fusion: fusion of classification scores.
• [Results compare two feature combinations: MFCC + DURF + SSIM + CM + GIST + LBP, and MFCC + DURF.]

Frame Sampling
• K. Schindler and L. van Gool, "Action Snippets: How Many Frames Does Human Action Recognition Require?", CVPR 2008.
• DURF: uniformly sampling 16 frames per video seems sufficient (a code sketch of such uniform sampling follows below).
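A minimal sketch of the uniform visual frame sampling discussed above, assuming OpenCV (cv2) and NumPy are available. The function name sample_frames_uniformly is a hypothetical helper; only the choice of 16 frames per video comes from the slide, the rest is illustrative and not part of the original SUPER implementation.

```python
# Illustrative sketch (not the original SUPER code): uniformly sample
# a fixed number of frames from a video, assuming OpenCV and NumPy.
import cv2
import numpy as np

def sample_frames_uniformly(video_path, num_frames=16):
    """Return up to `num_frames` frames taken at evenly spaced positions."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    # Evenly spaced frame indices spanning the whole video.
    indices = np.linspace(0, total - 1, num=min(num_frames, total)).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Dense SURF (or any other per-frame descriptor) would then be extracted
# only from these sampled frames rather than from every frame.
```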
Frame Sampling (cont.)
• MFCC: sampling audio frames is always harmful.

Summary
• Features: Dense SURF (DURF), MFCC, plus some global features.
• Classifier: fast HI kernel SVM.
• Fusion: early.
• Frame selection: audio – no; visual – yes.
• 220-fold speed-up! (An illustrative code sketch of this configuration is given at the end.)

Demo…

email: ygj@fudan.edu.cn
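To make the summarized configuration concrete, here is an illustrative sketch of early fusion (concatenated bag-of-words histograms) with a histogram intersection kernel SVM. This is a sketch under stated assumptions, not the paper's implementation: the vocabulary sizes and random stand-in data are hypothetical, and it uses the exact intersection kernel rather than the fast approximation of Maji et al. (CVPR 2008).

```python
# Illustrative sketch: early fusion of per-modality BoW histograms and an
# SVM with a histogram intersection kernel. Data and sizes are stand-ins.
import numpy as np
from sklearn.svm import SVC

def intersection_kernel(X, Y):
    """Exact HI kernel: K[i, j] = sum_k min(X[i, k], Y[j, k])."""
    K = np.zeros((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        K[i] = np.minimum(x, Y).sum(axis=1)
    return K

def early_fusion(histograms_per_modality):
    """Concatenate L1-normalized histograms from each modality."""
    normalized = [h / np.maximum(h.sum(axis=1, keepdims=True), 1e-12)
                  for h in histograms_per_modality]
    return np.hstack(normalized)

# Random stand-ins for MFCC and DURF bag-of-words histograms (hypothetical
# vocabulary sizes) and binary event labels, just to exercise the pipeline.
rng = np.random.default_rng(0)
X_mfcc = rng.random((40, 4000))
X_durf = rng.random((40, 5000))
X = early_fusion([X_mfcc, X_durf])
y = rng.integers(0, 2, size=40)

clf = SVC(kernel=intersection_kernel)  # callable kernel computes the Gram matrix
clf.fit(X, y)
print(clf.predict(X[:5]))
```

The fastHI evaluation cited above keeps the same kernel but speeds up the test-time decision function by precomputing per-dimension sorted support-vector values and cumulative sums, which is where the classification speed-up comes from.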