© 2007 IBM Research

advertisement
Dynamic Multimodal Fusion
in Video Search
Lexing Xie, Apostol Natsev, Jelena Tesic
IBM T. J. Watson Research Center
July 5th, ICME 2007, Beijing
© 2007 IBM Research
The Multimedia Search Problem
multi-modal query
media repository
produced video
corpra
Online media shares
“Find shots of flying through
snow capped mountains”
Personal media
Digital life records
Wide impact
– consumer, media, enterprise, web, science …
Multiple Tipping Points:
– Multimedia semantics learning,
– Multimedia ontology,
– Systematic benchmark and evaluations
2
© 2007 IBM Research
The Challenge of Choices in Multimodal
Search
queries
retrieval
experts
Find shots of Condoleezza
Rice
Text
Image
similarity
Semantic
concepts
3
~
Find shots of soccer game
with goalpost visible
Find scenes of snow
capped mountains
?
??
??
??
?
??
??
??
?
© 2007 IBM Research
Outline
The problem
Dynamic multimodal fusion
– Query feature extraction
– Query matching
Experiments and evaluation
Conclusion
4
© 2007 IBM Research
Prior Work on Query Dependent Fusion
Query-classes: manually
defined
– [Chua et. a., NUS]
– [Yan et. al., CMU]
Joint clustering of queries in feature +
performance space [Kennedy et. al.]
Our approach
– Can we fuse without pre-defined “query classes” or “query clusters”?
– Deep text analysis for query features and query matching
5
© 2007 IBM Research
Extract Semantic Query Features
Input Query
Text
“people with”
computer display
Semantic
Tagging
Semantic
Categories
Query
Feature
Vector
Person:CATEGORY
BodyPart:UNKNOWN
Furniture:UNKNOWN
…
[IBM UIMA and PIQUANT Analysis Engine]
6
© 2007 IBM Research
Weighted linear fusion
Query
Broadcast Video
“Snow capped mountains”
Text
Visual
Concepts
Multimodal
retrieval experts
7
© 2007 IBM Research
Three Query-dependent Fusion Strategies
8
© 2007 IBM Research
Experimental setup
TRECVID benchmark corpora
– Multi-lingual news from 6 channels
– Training/test split:
news videos
ENG
CHN
ARB
• 2005 24 queries, 110 hours
• 2006 24 queries, 160 hours
Measure AP (average precision)
on each query
Average precision
Precision
AP
Recall
9
© 2007 IBM Research
IBM Multimedia Search System Overview
Query
Textual query
formulation
“Find shots of an
airplane taking off”
TEXT
Text-Based
Retrieval
Query topic examples
LSCOM-lite
Semantic
Models
Model-Based
Retrieval
VISUAL FEATURES
Semantic
Based
Retrieval
Visual-Based
Retrieval
Fusion
Multi-modal results
(Text + Model + Semantic + Visual)
10
Approaches:
1. Text-based: storybased retrieval with
automatic query
refinement/re-ranking
2. Model-based: automatic
query-to-model
mapping based on
query topic text
3. Semantic-based:
cluster-based semantic
space modeling and
data selection
4. Visual-based: lightweight learning © 2007 IBM Research
Automatic/Manual Search Overall Performance (Mean AP)
T R E C V ID 06 Au to m atic an d M an u a l S earch
(Ag g r e g a te d Pe r fo m a n c e o f IB M v s . O th e r s )
IBM Runs
0.10
Mean Average Precision
0.08
IB M Autom atic S earc h
O ther Autom atic S earc h
O ther Manual S earc h
0.06
IBM Official Runs:
Text (baseline):
Text (story-based):
0.041
0.052
Multimodal Fusion:
Query independent:
0.076
Query classes (soft): 0.086
Query classes (hard): 0.087
0.04
0.02
0.00
1 3
5
7
9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 5 5 57 59 61 63 6 5 67 69 71 73 7 5 77 79 81 83 8 5 87
S u b m itte d Ru n s
Multi-modal fusion doubles baseline performance.
Text-based query matching has the best performance among TRECVID runs
Improves upon query-independent fusion by 14%
11
© 2007 IBM Research
Query-dependent Fusion Breakdown
Observation:
– Concept-related queries improved the most:
“tall building”, “prisoner”, “helicopters in flight”, “soccer”
– Named-entity queries improved slightly:
“Dick Cheney”, “Saddam Hussein”, “Bush Jr.”
– Generic people category deteriorated the most:
“people in formation”, “at computer display”, “w/ newspaper”, “w/ books”
Relative improvement (%): Qcomp vs. Qind
200
150
100
50
12
ow
sn
io
n
r.
wa
so
lk i
ldi
ng
er
s
or
po
wa
l ic
te
e
rb
oa
t- s
co
hip
m
pu
te
re
rd
ad
i sp
in
g
la
a
y
ne
ws
p
ap
a
na
er
tu
ra
ls
he
ce
l ic
ne
op
te
r
s
bu
in
rn
fli g
in
fo
g
ht
ur
wi
su
th
f la
ite
d
m
es
pe
op
pe
l
e
rs
w.
on
fla
an
g
d
10
bo
ok
ad
s
ul
ta
n
d
k is
ch
s
il d
on
th
e
ch
ee
sm
k
ok
es
Co
ta
ck
nd
s
ol
ee
za
so
Ri
cc
ce
er
go
alp
os
t
,J
sh
le
op
pe
Bu
in
fo
rm
ss
ei
at
n
+1
ey
en
Hu
m
da
Sa
d
i ld
Ch
in
Di
ck
t&
pr
ot
es
g
tin
co
r
gs
r
ne
bu
pr
a
&
le
es
-150
is o
i cl
ve
h
ui
op
pe
em
es
s
ng
ldi
i cl
ll b
ta
nc
y
ge
er
-100
ve
h
-50
es
AV
G
0
© 2007 IBM Research
Dynamic Query Fusion Results
Qdyn further
improves MAP of
Qclass by 8%, to
0.094
The improvement
is mainly in
generic people
queries where
Qclass failed
previously
13
© 2007 IBM Research
Demo: multimodal interactive search
14
© 2007 IBM Research
15
© 2007 IBM Research
16
© 2007 IBM Research
17
© 2007 IBM Research
18
© 2007 IBM Research
Conclusion
Presented a solution to query-dependent
multimodal fusion problem for video search
– Query features based on deep semantic parsing
– Dynamic query matching
– Improvement on TRECVID queries
Future work
– More retrieval modality, more diverse queries
19
© 2007 IBM Research
Thank You
For more information:
IBM Multimedia Retrieval Demo
http://mp7.watson.ibm.com/
“Data modeling strategies for imbalanced learning
in video search”, J. Tesic, A. Natsev, L. Xie, J.
Smith, ICME’07 next poster session
“Semantic Concept-Based Query Expansion and
Re-ranking for Multimedia Retrieval: A
Comparative Review and New Approaches”, A. P.
Natsev, A. Haubold, J Tesic, L. Xie, R. Yan, ACM
Multimedia 2007, to appear
20
© 2007 IBM Research
Related documents
Download