Multimedia Event Detection – Strong by Multi-Modality Integration
SPEAKER
Mr ZHANG Hao
PhD Student
Department of Computer Science
City University of Hong Kong
Hong Kong
DATE 7 January 2016 (Thursday)
TIME 3:00 pm - 3:30 pm
VENUE CS Seminar Room, Y6405, 6th Floor
Yellow Zone, Academic 1
City University of Hong Kong
83 Tat Chee Avenue
Kowloon Tong
ABSTRACT
We will present our Multimedia Event Detection system with positive video exemplars (an event query is
defined by 10 or 100 positive videos), which achieves state-of-the-art performance by designing
different fusion strategies for different modalities. First, in the visual system, the standard fusion
strategy is to average the probability scores obtained from different features. This strategy achieves
reasonable results, e.g., a relative mAP improvement of 5% in the 100-positive-videos case when fusing
improved dense trajectory and high-level concept features. However, we show that using an inverse joint
probability instead of the standard strategy to combine the concept feature and the improved dense
trajectory improves performance further, with a relative mAP improvement of 7% and a better AP in 16
of the 20 pre-specified multimedia events on the MED15 full evaluation set. The main reason is that
classifiers trained on the high-level concept feature and the improved dense trajectory can be
complementary, and with the average fusion method a low score from one type of classifier downgrades a
possibly relevant video. With the inverse joint probability, only videos that receive a low score
from both classifiers are put at the bottom of the list. Besides combining visual information, we also
combine speech (ASR) and textual (OCR) information. Our ASR and OCR systems are tuned for
high precision and retrieve only those videos that almost certainly contain the event. These results
are used to rank the relevant videos higher in the list than before. Our results show that our
OCR system improves performance by 5.8% and 2% relative mAP in the 10- and 100-positive-videos
cases, respectively. The ASR system, on the other hand, is not precise enough, as overall performance
does not increase when ASR results are added. This indicates that both ASR and OCR may be useful for
some events but not for others. Our individual systems are thus not the best, but by combining
multiple sources of information we can outperform systems that use only one source.
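As an illustration, the contrast between average fusion and inverse joint probability fusion described
above can be sketched as follows. This is a minimal sketch, not the authors' actual code: the function
names and example scores are hypothetical, and it assumes each classifier outputs a relevance
probability in [0, 1].

```python
# Sketch of two late-fusion strategies for combining per-classifier
# probability scores. Function names and example scores are illustrative.

def average_fusion(p_idt, p_concept):
    """Average the two classifiers' probability scores.
    A single low score drags down a possibly relevant video."""
    return (p_idt + p_concept) / 2.0

def inverse_joint_fusion(p_idt, p_concept):
    """1 - P(both classifiers say 'not relevant').
    Only videos scored low by BOTH classifiers sink to the bottom."""
    return 1.0 - (1.0 - p_idt) * (1.0 - p_concept)

# A video the trajectory classifier likes but the concept classifier misses:
average_fusion(0.9, 0.1)        # 0.5    -> mid-ranked
inverse_joint_fusion(0.9, 0.1)  # ~0.91  -> stays near the top

# A video both classifiers score low:
inverse_joint_fusion(0.1, 0.1)  # ~0.19  -> bottom of the list
```

The inverse joint probability acts like a probabilistic OR over complementary classifiers, which
matches the behaviour described in the abstract: disagreement between classifiers no longer penalises
a video, while agreement on a low score still pushes it down the ranked list.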
This paper was presented at TRECVID 2015 Workshop, November 16-18, 2015, Gaithersburg, MD, USA.
Supervisor: Prof C W Ngo
Research Interests: Multimedia Event Detection; Multimedia Content Analysis; Semantic Concept
Indexing
All are welcome!
In case of questions, please contact Prof NGO Chong Wah at Tel: 3442 4390, E-mail: cscwngo@cityu.edu.hk, or visit
the CS Departmental Seminar Web at http://www.cs.cityu.edu.hk/news/seminars/seminars.html.