Columbia University TRECVID-2006
High-Level Feature Extraction
Shih-Fu Chang, Winston Hsu, Wei Jiang,
Lyndon Kennedy, Dong Xu,
Akira Yanagawa, and Eric Zavesky
Digital Video and Multimedia Lab, Columbia University
http://www.ee.columbia.edu/dvmm
Overview – 5 methods & 6 submitted runs
[Diagram: the 5 methods (baseline, context-based concept fusion, lexicon-spatial pyramid matching, text feature, event detection) feed the 6 submitted runs: baseline, context, LSPM, text, visual_concept adaptive, and multi-model_concept adaptive.]
Overview – performance
[Chart: MAP (0 to 0.16) for the six submitted runs A_CL6_6 (baseline), A_CL5_5 (context), A_CL4_4 (LSPM), A_CL3_3 (text), A_CL2_2 (visual_concept adaptive), A_CL1_1 (multi-model_concept adaptive), grouped as visual-based, visual-text, best visual, and best all.]

Every method contributes incrementally to the final detection:
• context > baseline: context-based concept fusion (CBCF) improves the baseline
• LSPM > context: lexicon-spatial pyramid matching (LSPM) further improves detection
• text > LSPM: text features improve the visual runs
Overview – performance
[Chart: the same MAP comparison of runs A_CL1_1 through A_CL6_6, highlighting the best-all, best-visual, visual-text, and visual-based runs.]

• visual_concept adaptive > LSPM (also > context > baseline): the "best of visual" selection works
• text > multi-model_concept adaptive: the "best of all" selection does not work well, probably due to overfitting of the text tool
Outline – New Algorithms
• Baseline
• Context-based concept fusion (CBCF)
• Lexicon-spatial pyramid matching (LSPM)
• Text features
• Event detection
Individual Methods: (1) Baseline
Average fusion of two SVM baseline classification results, based on 3 visual features:
• color moments over 5x5 fixed grid partitions
• Gabor texture
• edge direction histogram from the whole image
The ensemble classifier captures coarse local features, layout, and global appearance. (A minimal code sketch follows.)
Features and models available for download soon!
Yanagawa et al., Tech. Rep., Columbia Univ., 2006, http://www.ee.columbia.edu/dvmm/newPublication.htm
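To make the baseline concrete, here is a minimal Python sketch of one of the three features (color moments over a 5x5 fixed grid) and of average late fusion of two SVM probability outputs. The function names and the use of scikit-learn are illustrative assumptions, not taken from the Columbia release.

```python
# Hedged sketch of the baseline: grid color moments as one of the three visual
# features, plus average (late) fusion of two SVM classifiers' probabilities.
import numpy as np
from sklearn.svm import SVC  # SVMs are assumed trained with probability=True

def grid_color_moments(image, grid=(5, 5)):
    """First three color moments (mean, std, skewness) per channel for each
    cell of a fixed 5x5 grid; image is an HxWx3 float array."""
    h, w, _ = image.shape
    feats = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            cell = image[gy * h // grid[0]:(gy + 1) * h // grid[0],
                         gx * w // grid[1]:(gx + 1) * w // grid[1]]
            pixels = cell.reshape(-1, 3)
            mean = pixels.mean(axis=0)
            std = pixels.std(axis=0)
            skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
            feats.extend(np.concatenate([mean, std, skew]))
    return np.array(feats)  # 5 * 5 * 9 = 225 dimensions

def average_fusion(svm_a, svm_b, feats_a, feats_b):
    """Average fusion of two SVM classification results (positive-class
    probabilities); svm_a and svm_b are pre-trained SVC models."""
    p_a = svm_a.predict_proba(feats_a)[:, 1]
    p_b = svm_b.predict_proba(feats_b)[:, 1]
    return 0.5 * (p_a + p_b)
```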
Outline – New Algorithms
• Baseline
• Context-based concept fusion (CBCF)
• Lexicon-spatial pyramid matching (LSPM)
• Text features
• Event detection
Individual Methods: (2) CBCF
Background on Context Fusion
• Hard/specific concept: a dedicated "Government-Leader" detector faces large variance in appearance (different views, different persons).
• Generic concepts: the "Face" and "Outdoor" detectors provide context information.
• Context-based model: Government-Leader is inferred from the Outdoor + Face context.
Individual Methods: (2) CBCF
Formulation
The independent detectors (outdoor, government-leader, face) produce P(outdoor|image), P(government-leader|image), and P(face|image). A context-based model (Naphade et al., 2002) takes these posteriors as input and outputs updated posteriors for the same concepts.
Individual Methods: (2) CBCF
Our approach: Discriminative + Generative
[Diagram: a Conditional Random Field (Jiang, Chang, et al., ICIP 2006) over concept nodes C1 (outdoor), C2 (government-leader), C3 (face); the observations x1, x2, x3 are the independent detector outputs P(outdoor|image), P(government-leader|image), P(face|image), and the model produces updated posteriors p(y1 = 1 | X), p(y2 = 1 | X), p(y3 = 1 | X).]

$$\min \; J = -\prod_{I}\prod_{C_i} p(y_i = 1 \mid X)^{(1+y_i)/2}\, p(y_i = -1 \mid X)^{(1-y_i)/2}$$

iteratively minimized by boosting
Individual Methods: (2) CBCF
During each iteration t, two SVM classifiers are trained for each concept:
1. Classifier 1 uses the input independent detection results.
2. Classifier 2 uses the updated posteriors from iteration t-1; it keeps updating through the iterations and captures inter-conceptual influences.

$$\min \; J = -\prod_{I}\prod_{C_i} p(y_i = 1 \mid X)^{(1+y_i)/2}\, p(y_i = -1 \mid X)^{(1-y_i)/2}$$

iteratively minimized by boosting. Without classifier 2, this reduces to traditional AdaBoost. (A simplified code sketch follows.)
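The following is a highly simplified sketch of the two-classifier iteration described above, assuming SVMs and a plain averaging step in place of the boosting-style combination of the actual Boosted CRF (Jiang, Chang, et al., ICIP 2006); the function name, the averaging, and the fixed iteration count are illustrative assumptions.

```python
# Simplified sketch of the CBCF iteration: NOT the exact Boosted CRF training.
import numpy as np
from sklearn.svm import SVC

def train_cbcf(indep_scores, labels, n_iters=5):
    """indep_scores: (n_samples, n_concepts) independent detector outputs.
    labels: (n_samples, n_concepts) ground truth in {-1, +1}.
    Returns updated posteriors for every concept."""
    n_concepts = indep_scores.shape[1]
    # Classifier 1 per concept: trained once on the independent detection results
    clf1 = [SVC(probability=True).fit(indep_scores, labels[:, i])
            for i in range(n_concepts)]
    posteriors = indep_scores.copy()              # iteration 0: independent detectors
    for t in range(n_iters):
        updated = np.empty_like(posteriors)
        for i in range(n_concepts):
            # Classifier 2: retrained on posteriors from iteration t-1; it keeps
            # updating through iterations and captures inter-concept influence
            clf2 = SVC(probability=True).fit(posteriors, labels[:, i])
            p1 = clf1[i].predict_proba(indep_scores)[:, 1]
            p2 = clf2.predict_proba(posteriors)[:, 1]
            updated[:, i] = 0.5 * (p1 + p2)       # plain average instead of boosting weights
        posteriors = updated
    return posteriors
```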
Individual Methods: (2) CBCF
Database & lexicon for context
• Predefined lexicon to provide context: 374 concepts from the LSCOM ontology (observations) – airplane, building, car, boat, person, outdoor, sports, etc.
• Independent detectors: our baseline
• Test concepts: the 39 concepts defined by NIST (updated posteriors)
Individual Methods: (2) CBCF
Experimental results over the TRECVID 2005 development set
[Chart: AP of the independent detector vs. context-based Boosted CRF fusion for the 39 concepts; 24 concepts improve, 15 degrade.]
Selective Application of Context
• Not every concept classification benefits from context-based fusion.
Consistent with previous context-based fusion results:
- IBM: no more than 8 out of 17 concepts gained performance [Amir et al., TRECVID Workshop, 2003]
- MediaMill: 80 out of 101 concepts [Snoek et al., TRECVID Workshop, 2005]
• Is there a way to predict when it works?
Predict When Context Helps
Why may CBCF not help every concept?
• Complex inter-conceptual relationships vs. limited training samples
• Strong classifiers may suffer from fusion with weak context
Avoid using CBCF for $C_i$ if $C_i$ is strong and its context is weak; use CBCF for concept $C_i$ if $C_i$ is weak or its context is strong (see the sketch after this rule):

$$\frac{\sum_{C_j,\, j \neq i} I(C_j; C_i)\, E(C_j)}{\sum_{C_j,\, j \neq i} I(C_j; C_i)} < \beta \;\text{(strong context)} \quad\text{or}\quad E(C_i) > \lambda \;\text{(weak concept)}$$

where $I(C_i; C_j)$ is the mutual information between $C_i$ and $C_j$, and $E(C_i)$ is the error rate of the independent detector for $C_i$.
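A minimal sketch of this selection rule, assuming a precomputed mutual-information matrix and per-concept error rates; the function name and the way the thresholds are passed are assumptions, and β and λ would be tuned on a validation set.

```python
# Decision rule: apply CBCF to concept i only if its context is strong
# (mutual-information-weighted error of related detectors below beta)
# or the concept itself is weak (its own error rate above lmbda).
import numpy as np

def use_cbcf(i, mutual_info, error, beta, lmbda):
    """mutual_info: (K, K) matrix with I(C_j; C_i); error: length-K vector E(C_j)."""
    j = np.arange(len(error)) != i                     # all related concepts j != i
    context_quality = (mutual_info[j, i] * error[j]).sum() / mutual_info[j, i].sum()
    return context_quality < beta or error[i] > lmbda
```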
Predict When Context Helps
Change the parameters to predict different numbers of concepts:

# predicted    # concepts improved    precision of prediction    MAP gain
39             24                     62%                        3.0%
20             15                     75%                        9.5%
16             14                     88%                        14%
9              9                      100%                       7.2%
Example
Concepts: Military, Individual, Fighter_Combat, House, ...
[Screenshots: top-ranked shots for "House" from the independent detector vs. context-based concept fusion. With context-based concept fusion, positive frames are moved forward with the help of Fighter_Combat.]
Context-Based Fusion + Baseline
TRECVID 2005 development set
[Chart: AP of baseline vs. context fusion for the 16 predicted concepts. All get improved! MAP gain: 14%]
Context-Based Fusion + Baseline
TRECVID 2006 evaluation
[Chart: AP of baseline vs. context for the 4 predicted concepts. Similar to the results over the TRECVID 2005 set!]
Discussion
Quality of context (the smaller, the better):

$$\frac{\sum_{C_j,\, j \neq i} I(C_j; C_i)\, E(C_j)}{\sum_{C_j,\, j \neq i} I(C_j; C_i)}$$

Concepts with performance improved: 3.23
Concepts with performance degraded: 4.17
Adding context works best when the context concepts are strongly related and robustly detected.
Outline – New Algorithms
• Baseline
• Context-based concept fusion (CBCF)
• Lexicon-spatial pyramid matching (LSPM)
• Text features
• Event detection
Individual Methods: (3) LSPM
Local features (SIFT) + spatial layout (e.g., sky, tree, water)
Spatial Pyramid Matching (SPM) [Lazebnik et al., CVPR 2006]: multi-resolution histogram matching of bags of features in the spatial domain.
What is the appropriate size for the visual lexicon?
Lexicon-Spatial Pyramid Matching (LSPM): SPM matching guided by multi-resolution lexicons.
Individual Methods: (3) LSPM
[Diagram: SIFT features are quantized into a visual lexicon at level 0 (t1, t2, ..., tn); each word is further split at lexicon level 1 (t1_1, t1_2, ..., tn_1, tn_2).]
Individual Methods: (3) LSPM
[Diagram: with the level-0 lexicon (t1, t2, ..., tn), histograms of two images are matched at spatial levels 0, 1, and 2 and combined into an SPM kernel, capturing both the local features and their spatial layout.]
Individual Methods: (3) LSPM
[Diagram: the SPM kernel computed with lexicon level 0 (t1, ..., tn) and the SPM kernel computed with lexicon level 1 (t1_1, t1_2, ..., tn_1, tn_2) are summed to form the LSPM kernel, which feeds an SVM classifier. A code sketch of this kernel follows.]
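A hedged sketch of the kernel construction above: a histogram-intersection spatial pyramid kernel per lexicon level (in the spirit of Lazebnik et al., CVPR 2006), summed over lexicon levels to form the LSPM kernel. The grid sizes, level weights, and function names are conventional SPM choices assumed for illustration, not necessarily the submission's exact settings.

```python
# SPM kernel per lexicon level, summed over lexicon levels (LSPM-style).
import numpy as np

def spatial_histograms(points, words, n_words, level):
    """Histogram of visual words in each cell of a 2^level x 2^level grid.
    points: (N, 2) coordinates normalized to [0, 1); words: (N,) word indices."""
    g = 2 ** level
    cells = np.minimum((points * g).astype(int), g - 1)
    hist = np.zeros((g, g, n_words))
    for (cx, cy), w in zip(cells, words):
        hist[cy, cx, w] += 1
    return hist.ravel()

def spm_kernel(p1, w1, p2, w2, n_words, levels=3):
    """Weighted histogram-intersection match over spatial levels 0..levels-1;
    coarser levels receive the usual smaller SPM weights."""
    k = 0.0
    for level in range(levels):
        weight = 1.0 / 2 ** (levels - level) if level > 0 else 1.0 / 2 ** (levels - 1)
        h1 = spatial_histograms(p1, w1, n_words, level)
        h2 = spatial_histograms(p2, w2, n_words, level)
        k += weight * np.minimum(h1, h2).sum()     # histogram intersection
    return k

def lspm_kernel(p1, words1_per_level, p2, words2_per_level, lexicon_sizes):
    """Sum of SPM kernels computed with lexicons of increasing resolution."""
    return sum(spm_kernel(p1, w1, p2, w2, n)
               for w1, w2, n in zip(words1_per_level, words2_per_level, lexicon_sizes))
```

The resulting Gram matrix can then be handed to an SVM with a precomputed kernel.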
Individual Methods: (3) LSPM
We apply LSPM to 13 concepts:
flag-us, building, maps, waterscape-waterfront, car, charts, urban, road, boat-ship, vegetation, court, government-leader
LSPM complements the baseline by considering local features.
[Chart: AP with vs. without LSPM for the 6 concepts evaluated by NIST; almost all get improved!]
Outline – New Algorithms
• Baseline
• Context-based concept fusion (CBCF)
• Lexicon-spatial pyramid matching (LSPM)
• Text features
• Event detection
Individual Methods: (4) Text
Problem: asynchrony between the words being spoken and the visual concepts appearing in the shot.
Solution: incorporate associated text from the entire story, using automatically detected story boundaries [Hsu et al., ADVENT Technical Report, Columbia Univ., 2005].
Pipeline (sketched below): story → bag-of-words (term frequency-inverse document frequency) → dimension reduction (top-k most frequent words) → SVM.
Training data: bag-of-words features of stories; ground-truth label: a story is positive if one of its shots is positive.
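The following is a minimal sketch of this text pipeline using scikit-learn, under the assumption that tf-idf vectors restricted to the top-k most frequent words approximate the dimension reduction described above; the helper names and the top_k value are illustrative.

```python
# Story-level text detector: tf-idf bag-of-words over the top-k frequent words,
# fed to an SVM; each shot then inherits the score of its containing story.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def train_text_detector(story_texts, story_labels, top_k=2000):
    """story_texts: one ASR string per story; story_labels: 1 if any shot in
    the story is positive for the concept, else 0."""
    vectorizer = TfidfVectorizer(max_features=top_k)   # top-k most frequent words
    features = vectorizer.fit_transform(story_texts)
    clf = SVC(probability=True).fit(features, story_labels)
    return vectorizer, clf

def score_shots(vectorizer, clf, shot_story_texts):
    """shot_story_texts: for each shot, the text of the story containing it."""
    feats = vectorizer.transform(shot_story_texts)
    return clf.predict_proba(feats)[:, 1]
```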
Individual Methods: (4) Text
Late fusion: 0.2 · text + 0.8 · visual
[Chart: AP per concept (20 concepts), visual only vs. text + visual; MAP gain: 4.5%]
Outline – New Algorithms
• Baseline
• Context-based concept fusion (CBCF)
• Lexicon-spatial pyramid matching (LSPM)
• Text features
• Event detection
Individual Methods: (5) Event
Event detection: key frame vs. multiple frames
Individual Methods: (5) Event
Event detection: key frame vs. multiple frames
[Diagram: frames p1...pm of clip P (supply) are matched to frames q1...qn of clip Q (demand) with correspondence flows f_ij and distances d_ij; the resulting distance feeds an SVM.]
Earth Mover's Distance (EMD): the minimum weighted distance, computed by linear programming (a code sketch follows).
• Handles temporal shift: a frame at the beginning of P can map to a frame at the end of Q.
• Handles scale variations: a frame from P can map to multiple frames in Q.
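A minimal sketch of the EMD computation posed as the balanced transportation linear program with uniform frame weights, using scipy's linprog; the frame features, the Euclidean ground distance, and the kernel choice hinted at in the comments are assumptions for illustration, not necessarily the submission's exact setup.

```python
# EMD between two clips, each represented by per-frame feature vectors.
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def emd(frames_p, frames_q):
    """frames_p: (m, d) features of clip P; frames_q: (n, d) features of clip Q.
    Returns the minimum weighted distance (sum_ij f_ij * d_ij)."""
    m, n = len(frames_p), len(frames_q)
    d = cdist(frames_p, frames_q)                    # pairwise ground distances d_ij
    c = d.ravel()                                    # objective coefficients
    a_eq = np.zeros((m + n, m * n))
    for i in range(m):                               # each frame of P ships supply 1/m
        a_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                               # each frame of Q receives demand 1/n
        a_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([np.full(m, 1.0 / m), np.full(n, 1.0 / n)])
    res = linprog(c, A_eq=a_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# One common (assumed) way to feed this distance into an SVM is a kernel such
# as exp(-emd(P, Q) / gamma) over all clip pairs, used as a precomputed kernel.
```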
Individual Methods: (5) Event
Experimental results: performance over the TRECVID 2005 development set
11 events: airplane_flying, people_marching, car_crash, exiting_car, demonstration_or_protest, election_campaign_greeting, parade, riot, running, shooting, walking
[Chart: AP per event, key frame vs. EMD.]
Conclusion
• TRECVID 2006 offers a mature opportunity for evaluating concept interaction
  – We have built 374 concept detectors
  – Models and features will be released soon
• Context-Based Fusion
  – We propose a systematic framework for predicting the effect of context fusion
  – (TRECVID 2005) 14 out of 16 predicted concepts show performance gain
  – (TRECVID 2006) 3 out of 4 predicted concepts show performance gain
  – Promising methodology for scaling up to large-scale systems (374 models)
• Results from the parts-based model (LSPM) are mixed
  – But they show consistent improvement when fused with the SVM baseline
  – 3 out of 6 concepts improve by more than 10%
• Temporal event modeling
  – We propose a novel matching and detection method based on EMD+SVM
  – It shows consistent gains on the 2005 data set
  – Results in 2006 are incomplete and lower than expected
Download
• More information at
  – http://www.ee.columbia.edu
• Features and models for baseline detectors for 374 LSCOM concepts coming soon