Columbia University TRECVID-2006 High-Level Feature Extraction
Shih-Fu Chang, Winston Hsu, Wei Jiang, Lyndon Kennedy, Dong Xu, Akira Yanagawa, and Eric Zavesky
Digital Video and Multimedia Lab, Columbia University
http://www.ee.columbia.edu/dvmm

Overview – 5 methods & 6 submitted runs
5 methods: (1) baseline, (2) context-based concept fusion, (3) lexicon-spatial pyramid matching, (4) text features, (5) event detection
6 runs: baseline, context, LSPM, visual_concept adaptive (visual-based); text, multi-model_concept adaptive

Overview – performance
[Bar chart: MAP of the six submitted runs A_CL1_1 – A_CL6_6, grouped into visual-based, visual-text, best-visual, and best-all runs]
Every method contributes incrementally to the final detection:
• context > baseline: context-based concept fusion (CBCF) improves the baseline
• LSPM > context: lexicon-spatial pyramid matching (LSPM) further improves detection
• text > LSPM: text features improve the visual-only results
• visual_concept adaptive > LSPM (also > context > baseline): best-of-visual selection works
• text > multi-model_concept adaptive: best-of-all selection does not work well, probably due to overfitting in the text component

Outline – New Algorithms
• Baseline
• Context-based concept fusion (CBCF)
• Lexicon-spatial pyramid matching (LSPM)
• Text features
• Event detection

Individual Methods: (1) Baseline
Average fusion of two SVM baseline classification results, based on 3 visual features:
• color moments over 5x5 fixed grid partitions
• Gabor texture
• edge direction histogram from the whole image
The features capture coarse local features, layout, and global appearance; each feature set is classified with Support Vector Machines (SVM) and the two models are combined in an ensemble classifier.
Features and models will be available for download soon.
Yanagawa et al., Tech. Rep., Columbia Univ., 2006, http://www.ee.columbia.edu/dvmm/newPublication.htm
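To make the baseline concrete, here is a minimal sketch, not the authors' code: grid color moments and a global edge-direction histogram stand in for the full color-moment/Gabor/edge feature set, scikit-learn SVMs stand in for the trained detectors, and all helper names are illustrative assumptions.

```python
# Minimal sketch of the baseline idea: per-feature SVMs with probabilistic
# outputs, average-fused. Feature details are simplified assumptions.
import numpy as np
from sklearn.svm import SVC

def grid_color_moments(img, grid=5):
    """Mean and std of each color channel over a grid x grid partition."""
    h, w, _ = img.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = img[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid]
            feats.extend(cell.mean(axis=(0, 1)))
            feats.extend(cell.std(axis=(0, 1)))
    return np.array(feats)

def edge_direction_histogram(img, bins=36):
    """Histogram of gradient orientations over the whole (grayscale) image."""
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    hist, _ = np.histogram(np.arctan2(gy, gx), bins=bins, range=(-np.pi, np.pi))
    return hist / (hist.sum() + 1e-8)

def train_detector(features, labels):
    """One SVM baseline detector with posterior-probability outputs."""
    return SVC(kernel="rbf", probability=True).fit(features, labels)

def fused_score(svm_a, feats_a, svm_b, feats_b):
    """Average fusion of the two SVM posterior estimates."""
    pa = svm_a.predict_proba(feats_a)[:, 1]
    pb = svm_b.predict_proba(feats_b)[:, 1]
    return 0.5 * (pa + pb)
```

In use, one detector would be trained per feature type per concept, and the fused score would rank the test shots for that concept.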
Individual Methods: (2) CBCF – Background on Context Fusion
A hard, specific concept such as "Government-Leader" shows large variance in appearance (different views, different persons), so a dedicated detector is unreliable. Generic concept detectors such as "Face" and "Outdoor" provide context information (e.g., Outdoor + Face) that a context-based model can exploit.

Individual Methods: (2) CBCF – Formulation
Independent detectors produce P(outdoor|image), P(government-leader|image), and P(face|image); a context-based model (Naphade et al., 2002) combines them to refine each concept's posterior.

Individual Methods: (2) CBCF – Our approach: Discriminative + Generative
We use a Conditional Random Field (Jiang, Chang, et al., ICIP 2006) whose nodes C_i are the concepts (e.g., outdoor, government-leader, face), whose observations x_i are the independent detector outputs, and whose outputs are the updated posteriors p(y_i = 1 | X). The model is trained by minimizing

\[ J = -\prod_{I}\prod_{C_i} p(y_i = 1 \mid X)^{(1+y_i)/2}\; p(y_i = -1 \mid X)^{(1-y_i)/2}, \]

which is iteratively minimized by boosting. During each iteration t, two SVM classifiers are trained for each concept:
1. one using the independent detection results as input;
2. one using the updated posteriors from iteration t-1.
Classifier 2 keeps updating through the iterations and captures inter-conceptual influences; without classifier 2, the procedure reduces to traditional AdaBoost.

Individual Methods: (2) CBCF – Database & lexicon for context
• Predefined lexicon to provide context (observations): 374 concepts from the LSCOM ontology (airplane, building, car, boat, person, outdoor, sports, etc.)
• Independent detectors: our baseline
• Test concepts (updated posteriors): the 39 concepts defined by NIST

Individual Methods: (2) CBCF – Experimental results over the TRECVID 2005 development set
[Bar chart: AP of the independent detector vs. context-based Boosted CRF fusion for the 39 concepts]
24 concepts improve, 15 degrade.

Selective Application of Context
• Not every concept classification benefits from context-based fusion. This is consistent with previous context-based fusion results: IBM reported that no more than 8 out of 17 concepts gained performance [Amir et al., TRECVID Workshop, 2003]; MediaMill reported 80 out of 101 concepts [Snoek et al., TRECVID Workshop, 2005].
• Is there a way to predict when it works?
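A minimal sketch of the iterative context-fusion idea described above, under loose assumptions: logistic regression stands in for the per-iteration SVMs, plain re-fitting stands in for the boosting procedure, and all names are illustrative rather than the authors' implementation.

```python
# Sketch: each concept's posterior is repeatedly re-estimated from its own
# independent score plus the current posteriors of all other concepts.
import numpy as np
from sklearn.linear_model import LogisticRegression

def context_fusion(indep_posteriors, labels, n_iters=3):
    """
    indep_posteriors: (n_shots, n_concepts) scores from independent detectors.
    labels:           (n_shots, n_concepts) binary ground truth (training set).
    Returns updated posteriors after iterative context fusion.
    """
    posteriors = indep_posteriors.copy()
    for _ in range(n_iters):
        updated = np.empty_like(posteriors)
        for i in range(posteriors.shape[1]):
            # Input: independent score of concept i plus the current
            # posteriors of all other concepts (the "context").
            X = np.hstack([indep_posteriors[:, [i]],
                           np.delete(posteriors, i, axis=1)])
            clf = LogisticRegression(max_iter=1000).fit(X, labels[:, i])
            updated[:, i] = clf.predict_proba(X)[:, 1]
        posteriors = updated
    return posteriors
```

At test time the per-iteration classifiers fitted on the training set would be applied, in the same order, to the test shots' independent scores.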
Predict When Context Helps
Why may CBCF not help every concept?
• Complex inter-conceptual relationships vs. limited training samples.
• Strong classifiers may suffer from fusion with weak context.
Rule: use CBCF for concept C_i if C_i is weak or its context is strong; avoid CBCF for C_i if C_i is strong and its context is weak. Let $I(C_i; C_j)$ denote the mutual information between $C_i$ and $C_j$, and $E(C_i)$ the error rate of the independent detector for $C_i$. CBCF is applied when

\[ \frac{\sum_{j \neq i} I(C_j; C_i)\, E(C_j)}{\sum_{j \neq i} I(C_j; C_i)} < \beta \ \ \text{(strong context)} \qquad \text{or} \qquad E(C_i) > \lambda \ \ \text{(weak concept).} \]

Predict When Context Helps – Changing the parameters predicts different numbers of concepts:

  # predicted   # concepts improved   precision of prediction   MAP gain
  39            24                    62%                       3.0%
  20            15                    75%                       9.5%
  16            14                    88%                       14%
   9             9                    100%                      7.2%

Example
[Key-frame examples for the concepts Military, Fighter_Combat, and House: ranked lists from the independent detector vs. context-based concept fusion]
For House, positive frames are moved forward in the ranking with the help of the related concept Fighter_Combat.

Context-Based Fusion + Baseline – TRECVID 2005 development set
[Bar chart: AP of baseline vs. context for the 16 predicted concepts]
All 16 get improved; MAP gain: 14%.

Context-Based Fusion + Baseline – TRECVID 2006 evaluation (4 concepts)
[Bar chart: AP of baseline vs. context for the 4 evaluated concepts]
Similar to the results over the TRECVID 2005 set.

Discussion
Quality of context: $\sum_{j \neq i} I(C_j; C_i)\, E(C_j) \,/\, \sum_{j \neq i} I(C_j; C_i)$ — the smaller the better.
• Concepts with performance improved: 3.23
• Concepts with performance degraded: 4.17
Adding context helps when the related concepts have strong relationships and robust detectors.

Individual Methods: (3) LSPM
Both local features (SIFT) and their spatial layout (e.g., sky, tree, water) matter. Spatial Pyramid Matching (SPM) [Lazebnik et al., CVPR 2006] performs multi-resolution histogram matching of bags-of-features in the spatial domain. But what is the appropriate size for the visual lexicon? Lexicon-Spatial Pyramid Matching (LSPM) is SPM matching guided by multi-resolution lexicons.

Individual Methods: (3) LSPM – Kernel construction
SIFT features are quantized with a hierarchy of lexicons: lexicon level 0 contains words t_1, ..., t_n; lexicon level 1 splits each word (t_1_1, t_1_2, ..., t_n_1, t_n_2); and so on. For each lexicon level, an SPM kernel is computed over spatial levels 0, 1, and 2, capturing both the local features and their spatial layout. The SPM kernels of all lexicon levels are combined into the LSPM kernel, which is fed to an SVM classifier.

Individual Methods: (3) LSPM – Results
We apply LSPM to 13 concepts, including flag-us, building, maps, waterscape-waterfront, car, charts, urban, road, boat-ship, vegetation, court, and government-leader. LSPM complements the baseline by considering local features.
[Bar chart: AP with LSPM vs. without LSPM for the 6 concepts evaluated by NIST]
Almost all get improved.
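A rough sketch of how such an LSPM kernel could be assembled, under assumptions not stated in the slides: histogram-intersection SPM kernels, a simplified level weighting, and externally provided lexicon quantizers (e.g., k-means codebooks of increasing size); all function names are illustrative.

```python
# Sketch: an SPM histogram-intersection kernel per lexicon level, summed
# into a single LSPM kernel value for two images.
import numpy as np

def _grid_histogram(words, pos, n_words, cells):
    """Per-cell bag-of-words histogram over a cells x cells partition."""
    hist = np.zeros((cells, cells, n_words))
    cx = np.minimum((pos[:, 0] * cells).astype(int), cells - 1)
    cy = np.minimum((pos[:, 1] * cells).astype(int), cells - 1)
    for w, i, j in zip(words, cx, cy):
        hist[i, j, w] += 1
    return hist / max(len(words), 1)

def spm_kernel(words_a, pos_a, words_b, pos_b, n_words, levels=3):
    """Spatial pyramid match score between two images.
    words_*: visual-word index per local feature; pos_*: (x, y) in [0, 1)."""
    score = 0.0
    for level in range(levels):
        cells = 2 ** level
        weight = 1.0 / 2 ** (levels - 1 - level)   # coarser levels weigh less
        hist_a = _grid_histogram(words_a, pos_a, n_words, cells)
        hist_b = _grid_histogram(words_b, pos_b, n_words, cells)
        score += weight * np.minimum(hist_a, hist_b).sum()  # intersection
    return score

def lspm_kernel(feat_a, feat_b, lexicons):
    """Sum of SPM kernels over multi-resolution lexicons.
    feat_*: (descriptors, positions); lexicons: list of (assign_fn, n_words),
    where assign_fn maps descriptors to word indices at that lexicon level."""
    desc_a, pos_a = feat_a
    desc_b, pos_b = feat_b
    return sum(
        spm_kernel(assign(desc_a), pos_a, assign(desc_b), pos_b, n_words)
        for assign, n_words in lexicons
    )
```

The pairwise LSPM kernel matrix would then be passed to an SVM trained with a precomputed kernel.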
Individual Methods: (4) Text
Problem: asynchrony between the words being spoken and the visual concepts appearing in the shot.
Solution: incorporate the associated text from the entire story, using automatically detected story boundaries [Hsu et al., ADVENT Technical Report, Columbia Univ., 2005].
• Story-level bag-of-words features (term frequency–inverse document frequency), with dimension reduction by frequency: keep the top k most frequent words.
• Training data: bag-of-words features of stories; ground-truth label: a story is positive if at least one of its shots is positive.
• Classifier: SVM.

Individual Methods: (4) Text – Results
[Bar chart: AP of "0.2 text + 0.8 visual" vs. visual only for 20 concepts]
Fusing text with visual (0.2 text + 0.8 visual) yields a MAP gain of 4.5% over visual only.

Individual Methods: (5) Event
Event detection: key frame vs. multiple frames. Representing an event by a single key frame discards temporal information, so we match sets of frames between two clips P and Q using the Earth Mover's Distance (EMD): frames p_1, ..., p_m (supply) are matched to frames q_1, ..., q_n (demand) through a correspondence flow f_ij with ground distances d_ij, and the minimum weighted distance is found by linear programming. The resulting distance is used in an SVM. EMD handles temporal shift (a frame at the beginning of P can map to a frame at the end of Q) and scale variations (a frame from P can map to multiple frames in Q).

Individual Methods: (5) Event – Experimental results
Performance over the TRECVID 2005 development set, on 11 events: airplane_flying, people_marching, car_crash, exiting_car, demonstration_or_protest, election_campaign_greeting, parade, riot, running, shooting, walking.
[Bar chart: AP of key-frame-based detection vs. EMD-based detection for the 11 events]

Conclusion
• TRECVID 2006 offers a mature opportunity for evaluating concept interaction
  — We have built 374 concept detectors
  — Models and features will be released soon
• Context-Based Fusion
  — We propose a systematic framework for predicting the effect of context fusion
  — (TRECVID 2005) 14 out of 16 predicted concepts show performance gain
  — (TRECVID 2006) 3 out of 4 predicted concepts show performance gain
  — A promising methodology for scaling up to large-scale systems (374 models)
• Results from the parts-based model (LSPM) are mixed
  — But show consistent improvement when fused with the SVM baseline
  — 3 out of 6 concepts improve by more than 10%
• Temporal event modeling
  — We propose a novel matching and detection method based on EMD + SVM
  — Shows consistent gains on the 2005 data set
  — Results in 2006 are incomplete and lower than expected

• More information at http://www.ee.columbia.edu
• Features and models for the baseline detectors for the 374 LSCOM concepts coming soon
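As a closing illustration of the EMD-based event matching described in the event-detection section, a minimal sketch under assumed details: equal frame weights, Euclidean ground distances, and scipy's linear-programming solver; the kernel wrapper and parameter names are illustrative, not the authors' implementation.

```python
# Sketch: EMD between two clips (sets of frame feature vectors) via the
# transportation linear program, plus a Gaussian kernel over pairwise EMDs.
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def emd(frames_p, frames_q):
    """Earth Mover's Distance between two frame sets (rows = frame features)."""
    m, n = len(frames_p), len(frames_q)
    d = cdist(frames_p, frames_q)      # ground distances d_ij
    w_p = np.full(m, 1.0 / m)          # equal frame weights (supply)
    w_q = np.full(n, 1.0 / n)          # equal frame weights (demand)

    # Flow variables f_ij flattened row-major; minimize sum d_ij * f_ij.
    c = d.ravel()
    a_ub, b_ub = [], []
    for i in range(m):                 # row sums <= supply
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        a_ub.append(row); b_ub.append(w_p[i])
    for j in range(n):                 # column sums <= demand
        col = np.zeros(m * n); col[j::n] = 1
        a_ub.append(col); b_ub.append(w_q[j])
    a_eq = np.ones((1, m * n))         # total flow = min total weight (= 1 here)
    b_eq = [min(w_p.sum(), w_q.sum())]

    res = linprog(c, A_ub=np.array(a_ub), b_ub=np.array(b_ub),
                  A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun / b_eq[0]

def emd_kernel(clips_a, clips_b, gamma=1.0):
    """Gaussian kernel over pairwise EMDs, usable with SVC(kernel='precomputed')."""
    k = np.array([[emd(p, q) for q in clips_b] for p in clips_a])
    return np.exp(-gamma * k)
```

Exponentiating the EMD is a common way to plug it into an SVM, although such a kernel is not guaranteed to be positive definite.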