Dynamic Multimodal Fusion in Video Search
Lexing Xie, Apostol Natsev, Jelena Tesic
IBM T. J. Watson Research Center
July 5th, ICME 2007, Beijing
© 2007 IBM Research

The Multimedia Search Problem
– A multi-modal query is posed against a media repository: produced video corpora, online media shares, personal media, digital life records
– Example: "Find shots of flying through snow capped mountains"
– Wide impact: consumer, media, enterprise, web, science, ...
– Multiple tipping points: multimedia semantics learning, multimedia ontology, systematic benchmarks and evaluations

The Challenge of Choices in Multimodal Search
– Each query can be served by several retrieval experts: text, image similarity, semantic concepts
– Which expert, or which combination, works best for "Find shots of Condoleezza Rice", "Find shots of a soccer game with the goalpost visible", or "Find scenes of snow capped mountains"?

Outline
– The problem
– Dynamic multimodal fusion: query feature extraction, query matching
– Experiments and evaluation
– Conclusion

Prior Work on Query-Dependent Fusion
– Query classes, manually defined: [Chua et al., NUS], [Yan et al., CMU]
– Joint clustering of queries in feature + performance space: [Kennedy et al.]
– Our approach: can we fuse without pre-defined "query classes" or "query clusters"? Deep text analysis provides query features and query matching

Extract Semantic Query Features
– The input query text (e.g., "people with computer display") is passed through semantic tagging into semantic categories, yielding a query feature vector such as Person:CATEGORY, BodyPart:UNKNOWN, Furniture:UNKNOWN, ...
– Uses the IBM UIMA and PIQUANT Analysis Engine

Weighted Linear Fusion
– For a query such as "snow capped mountains" over broadcast video, the result lists of the multimodal retrieval experts (text, visual, concepts) are combined by weighted linear fusion (a minimal illustrative sketch follows the performance overview below)

Three Query-dependent Fusion Strategies

Experimental Setup
– TRECVID benchmark corpora: multi-lingual news from 6 channels (English, Chinese, Arabic), with a training/test split over the news videos
– TRECVID 2005: 24 queries, 110 hours; TRECVID 2006: 24 queries, 160 hours
– Measure average precision (AP), a summary of the precision-recall curve, on each query

IBM Multimedia Search System Overview
– A textual query formulation (e.g., "Find shots of an airplane taking off") and query topic examples drive four retrieval approaches over text, LSCOM-lite semantic models, and visual features; their results are fused into multi-modal results (text + model + semantic + visual):
1. Text-based: story-based retrieval with automatic query refinement/re-ranking
2. Model-based: automatic query-to-model mapping based on query topic text
3. Semantic-based: cluster-based semantic space modeling and data selection
4. Visual-based: lightweight learning

Automatic/Manual Search Overall Performance (Mean AP)
[figure: TRECVID 2006 automatic and manual search, aggregated mean average precision of IBM vs. other submitted runs; IBM automatic search runs sit at the top of the field]
– IBM official runs (MAP): text baseline 0.041; text (story-based) 0.052; query-independent multimodal fusion 0.076; query classes (soft) 0.086; query classes (hard) 0.087
– Multi-modal fusion doubles the text baseline performance
– Fusion driven by text-based query matching gives the best performance among TRECVID automatic runs, improving upon query-independent fusion by 14%
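To make the weighted linear fusion step concrete, here is a minimal sketch assuming each retrieval expert returns a relevance score per shot. The expert names, the min-max score normalization, and the example weights are assumptions for illustration, not the exact IBM implementation.

```python
from collections import defaultdict

def weighted_linear_fusion(expert_scores, weights):
    """Combine per-shot scores from several retrieval experts.

    expert_scores: dict mapping expert name -> {shot_id: score}
    weights:       dict mapping expert name -> query-dependent weight
    Returns shots ranked by the weighted sum of min-max normalized scores.
    """
    fused = defaultdict(float)
    for expert, scores in expert_scores.items():
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid division by zero for constant scores
        for shot_id, s in scores.items():
            fused[shot_id] += weights.get(expert, 0.0) * (s - lo) / span
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Illustrative query-dependent weights for "snow capped mountains":
# a visually oriented query might lean on the concept and visual experts.
weights = {"text": 0.2, "visual": 0.4, "concepts": 0.4}
expert_scores = {
    "text":     {"shot_12": 0.9, "shot_40": 0.1},
    "visual":   {"shot_12": 0.6, "shot_77": 0.8},
    "concepts": {"shot_77": 0.7, "shot_40": 0.5},
}
print(weighted_linear_fusion(expert_scores, weights))
```

Query-independent fusion would use one fixed weight vector for every query; the query-class and dynamic strategies in this talk instead choose the weights per query.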
Query-dependent Fusion Breakdown
– Observations:
  – Concept-related queries improved the most: "tall building", "prisoner", "helicopters in flight", "soccer"
  – Named-entity queries improved slightly: "Dick Cheney", "Saddam Hussein", "Bush Jr."
  – The generic people category deteriorated the most: "people in formation", "at computer display", "w/ newspaper", "w/ books"
[figure: relative improvement (%) of Qcomp vs. Qind per query, ranging from roughly -150% to +200%, with the average over all queries (AVG) shown]

Dynamic Query Fusion Results
– Qdyn further improves the MAP of Qclass by 8%, to 0.094
– The improvement comes mainly from the generic people queries where Qclass previously failed

Demo: multimodal interactive search

Conclusion
– Presented a solution to the query-dependent multimodal fusion problem for video search:
  – Query features based on deep semantic parsing
  – Dynamic query matching (a speculative sketch follows the references below)
  – Improvements on TRECVID queries
– Future work: more retrieval modalities, more diverse queries

Thank You
For more information:
– IBM Multimedia Retrieval Demo: http://mp7.watson.ibm.com/
– "Data modeling strategies for imbalanced learning in video search", J. Tesic, A. Natsev, L. Xie, J. Smith, ICME'07, next poster session
– "Semantic Concept-Based Query Expansion and Re-ranking for Multimedia Retrieval: A Comparative Review and New Approaches", A. P. Natsev, A. Haubold, J. Tesic, L. Xie, R. Yan, ACM Multimedia 2007, to appear
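The slides describe dynamic query matching without spelling out the matching function, so the following is a speculative sketch under stated assumptions rather than the authors' implementation. It assumes the query feature vectors are the semantic-category vectors from the tagging step, that per-expert fusion weights are available for the TRECVID training queries, and that a new query borrows a similarity-weighted average of the weights of its k most similar training queries; NumPy, the name dynamic_fusion_weights, cosine similarity, and the choice of k are all illustrative.

```python
import numpy as np

def dynamic_fusion_weights(query_vec, train_vecs, train_weights, k=3):
    """Hedged sketch: derive per-query fusion weights by matching a new
    query's semantic feature vector against training queries.

    query_vec:     (d,) semantic-category feature vector of the new query
    train_vecs:    (n, d) feature vectors of the training queries
    train_weights: (n, m) per-expert fusion weights tuned on training data
    Returns an (m,) weight vector: the similarity-weighted average of the
    weights of the k most similar training queries, normalized to sum to 1.
    """
    # Cosine similarity between the new query and each training query.
    qn = query_vec / (np.linalg.norm(query_vec) + 1e-9)
    tn = train_vecs / (np.linalg.norm(train_vecs, axis=1, keepdims=True) + 1e-9)
    sims = tn @ qn
    top = np.argsort(sims)[::-1][:k]            # indices of k nearest queries
    w = np.clip(sims[top], 0.0, None) + 1e-9    # non-negative mixing weights
    weights = (train_weights[top] * w[:, None]).sum(axis=0) / w.sum()
    return weights / weights.sum()

# Illustrative use: 4 training queries over a 6-dim category space, 3 experts
rng = np.random.default_rng(0)
train_vecs = rng.random((4, 6))
train_weights = np.array([[0.6, 0.2, 0.2],   # text, visual, concepts
                          [0.2, 0.5, 0.3],
                          [0.1, 0.4, 0.5],
                          [0.3, 0.3, 0.4]])
print(dynamic_fusion_weights(rng.random(6), train_vecs, train_weights, k=2))
```

Compared with hard query classes, this kind of soft, per-query matching is one plausible way to avoid the degradation seen on the generic people queries under class-based fusion.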