Concept Detection: Convergence to Local Features and Speeded Up Event Recognition
SUPER: Towards Real-time Event
Recognition in Internet Videos
Yu-Gang Jiang
School of Computer Science
Fudan University
Shanghai, China
ygj@fudan.edu.cn
ACM International Conference on Multimedia
Retrieval (ICMR), Hong Kong, China, Jun. 2012.
The Problem
• Recognize high-level events in videos
  – We are particularly interested in Internet consumer videos
• Applications
  – Video Search
  – Personal Video Collection Management
  – Smart Advertising
  – Intelligence Analysis
  – …
Our Objective
Improve Efficiency
Maintain Accuracy
The Baseline Recognition Framework
Best-performing approach in the TRECVID 2010 Multimedia Event Detection (MED) task
Pipeline: feature extraction (SIFT, spatial-temporal interest points, MFCC audio features) → χ² kernel SVM per feature → late average fusion
Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang,
Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual
Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.
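To make the baseline concrete, here is a minimal sketch of per-feature χ²-kernel SVMs combined by late average fusion, assuming precomputed bag-of-words histograms and using scikit-learn; it is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_chi2_svm(X_train, y_train, gamma=1.0):
    """Train a chi-square-kernel SVM on one feature's BoW histograms
    (binary event-vs-rest labels assumed)."""
    K_train = chi2_kernel(X_train, gamma=gamma)
    clf = SVC(kernel="precomputed", probability=True)
    clf.fit(K_train, y_train)
    return clf

def late_average_fusion(test_feats, train_feats, classifiers, gamma=1.0):
    """Average the per-feature classifier scores (late average fusion)."""
    scores = []
    for X_test, X_train, clf in zip(test_feats, train_feats, classifiers):
        K_test = chi2_kernel(X_test, X_train, gamma=gamma)  # test-vs-train kernel
        scores.append(clf.predict_proba(K_test)[:, 1])      # P(event) per video
    return np.mean(scores, axis=0)
```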
Three Audio-Visual Features…
• SIFT (visual)
– D. Lowe, IJCV ‘04
• STIP (visual)
– I. Laptev, IJCV ‘05
• MFCC (audio)
– computed on short (16 ms) audio frames
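As a rough illustration of the audio feature, the sketch below computes frame-level MFCCs with librosa; the 16 ms hop follows the slide, while the file name, window length, sample rate, and 13 coefficients are assumptions rather than the authors' exact settings.

```python
import librosa

# "video_audio.wav" is a hypothetical file: the soundtrack extracted from a video.
y, sr = librosa.load("video_audio.wav", sr=22050)
hop = int(0.016 * sr)          # ~16 ms hop between consecutive audio frames
win = 2 * hop                  # ~32 ms analysis window (assumed)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win, hop_length=hop)
print(mfcc.shape)              # (13, n_frames): one 13-dim descriptor per frame
```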
Bag-of-words Representation
• SIFT / STIP / MFCC words
• Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)
Pipeline (figure): keypoint extraction (DoG / Hessian-Affine detectors) → vocabulary generation over the SIFT feature space → BoW histograms using soft weighting
Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans. on Audio, Speech, and Language Processing, 2010.
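A minimal sketch of vocabulary generation and soft-weighted BoW quantization, assuming k-means vocabularies and the common soft-weighting variant in which the i-th nearest word receives weight 1/2^(i-1); the vocabulary size and top-N are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, k=500, seed=0):
    """Cluster local descriptors (SIFT / STIP / MFCC) into k words."""
    return KMeans(n_clusters=k, random_state=seed).fit(descriptors)

def soft_bow(descriptors, vocab, top_n=4):
    """Soft-weighted BoW histogram: the i-th nearest word of each
    descriptor receives weight 1/2**(i-1)."""
    hist = np.zeros(vocab.n_clusters)
    dists = vocab.transform(descriptors)          # (n_desc, k) distances to words
    nearest = np.argsort(dists, axis=1)[:, :top_n]
    for ranks in nearest:
        for i, word in enumerate(ranks):
            hist[word] += 1.0 / (2 ** i)
    return hist / max(hist.sum(), 1e-12)          # L1-normalised histogram
```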
Baseline Speed…
• Four factors determine speed: features, classifier, fusion, and frame sampling
Total: 1,003 seconds per video!
Per-component cost (seconds per video):
– SIFT: 82.0
– Spatial-temporal interest points (STIP): 916.8
– MFCC audio features: 2.36
– χ² kernel SVM classification: ~2.00
– Late average fusion: << 1
Feature extraction time is measured in seconds needed to process an 80-second video (SIFT extracted at 0.5 fps); classification time covers applying the classifiers of all 20 categories to one video.
Dataset: Columbia Consumer Videos (CCV)
Basketball
Skiing
Dog
Baseball
Swimming
Bird
Soccer
Biking
Ice Skating
Cat
Graduation
Birthday
Celebration
Wedding Reception
Non-music Performance
Wedding Ceremony
Wedding Dance
Music Performance
Parade
Beach
Playground
Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.
Feature Options
• (Sparse) SIFT
• STIP
• MFCC
• Dense SIFT (DIFT)
• Dense SURF (DURF)
• Self-Similarities (SSIM)
• Color Moments (CM)
• GIST
• LBP
• TINY
Uijlings, Smeulders, Scha, Real-time
bag of words, approximately, in
ACM CIVR 2009.
Suggested feature combinations: MFCC, DURF, SSIM, CM, GIST, LBP; or MFCC, DURF.
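The dense variants (DIFT, DURF) drop the interest-point detector and describe a regular grid of locations instead. Below is a hedged sketch of dense SURF using OpenCV; the grid step is an illustrative choice, and SURF_create needs an opencv-contrib build with the non-free modules enabled.

```python
import cv2
import numpy as np

def dense_surf(gray, step=6):
    """Describe a regular grid of keypoints with SURF (dense SURF / DURF).
    Requires an opencv-contrib build with non-free modules enabled."""
    ys, xs = np.mgrid[step // 2:gray.shape[0]:step,
                      step // 2:gray.shape[1]:step]
    keypoints = [cv2.KeyPoint(float(x), float(y), float(step))
                 for x, y in zip(xs.ravel(), ys.ravel())]
    surf = cv2.xfeatures2d.SURF_create()            # 64-dim descriptors by default
    keypoints, descriptors = surf.compute(gray, keypoints)
    return descriptors                              # one row per grid point
```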
Classifier Kernels
• Chi Square Kernel
• Histogram Intersection Kernel (HI)
• Fast HI Kernel (fastHI)
Maji, Berg, Malik, Classification Using
Intersection Kernel Support Vector
Machines is Efficient, in CVPR 2008.
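For two L1-normalised BoW histograms x and y, the two kernels compare bins as in the plain NumPy sketch below; Maji et al.'s fastHI speed-up, which evaluates the HI decision function per dimension using sorted support-vector values (or a piecewise approximation), is not implemented here.

```python
import numpy as np

def chi_square_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel between two L1-normalised histograms."""
    d = np.sum((x - y) ** 2 / (x + y + 1e-12))
    return np.exp(-gamma * d)

def histogram_intersection_kernel(x, y):
    """HI kernel: sum of bin-wise minima of the two histograms."""
    return np.sum(np.minimum(x, y))
```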
Multi-modality Fusion
• Early fusion: feature concatenation
• Kernel fusion: K_f = K_1 + K_2 + …
• Late fusion: fusion of classification scores
Feature sets fused: MFCC, DURF, SSIM, CM, GIST, LBP; and MFCC, DURF.
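The three strategies differ only in where the modalities are combined; the sketch below illustrates this under the assumption of per-modality feature matrices, kernel matrices, and score vectors (all hypothetical inputs).

```python
import numpy as np

def early_fusion(features):
    """Early fusion: concatenate per-modality feature matrices (dict: name -> (n, d_m))."""
    return np.hstack(list(features.values()))

def kernel_fusion(kernels):
    """Kernel fusion: K_f = K_1 + K_2 + ... over per-modality kernel matrices."""
    return sum(kernels)

def late_fusion(scores):
    """Late fusion: average the per-modality classifier score vectors."""
    return np.mean(np.stack(scores), axis=0)
```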
Frame Sampling
K. Schindler and L. van Gool, Action snippets: How many frames does human action
recognition require?, in CVPR 2008.
• DURF: uniformly sampling 16 frames per video seems sufficient.
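A sketch of uniform frame sampling with OpenCV; the 16-frame budget follows the slide, while seeking by frame index is an implementation choice, not necessarily how the authors sample.

```python
import cv2
import numpy as np

def sample_frames(video_path, n_frames=16):
    """Uniformly sample n_frames frames from a video with OpenCV."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in np.linspace(0, max(total - 1, 0), n_frames).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```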
Frame Sampling
• MFCC: subsampling audio frames consistently hurts accuracy.
Summary
• Features: Dense SURF (DURF), MFCC, plus some global features
• Classifier: Fast HI kernel SVM
• Fusion: Early
• Frame Selection: Audio - No; Visual - Yes
220-fold speed-up!
Demo…
email: ygj@fudan.edu.cn