Multimedia Event Detection – Strong by Multi-Modality Integration

SPEAKER
Mr ZHANG Hao
PhD Student
Department of Computer Science
City University of Hong Kong
Hong Kong

DATE
7 January 2016 (Thursday)

TIME
3:00 pm - 3:30 pm

VENUE
CS Seminar Room, Y6405
6th Floor, Yellow Zone, Academic 1
City University of Hong Kong
83 Tat Chee Avenue
Kowloon Tong

ABSTRACT

We will present our Multimedia Event Detection system with positive video exemplars (an event query is defined by 10 or 100 positive videos), which achieves state-of-the-art performance by designing different fusion strategies for different modalities.

First, in the visual system, the standard fusion strategy is to average the probability scores obtained from different features. This strategy achieves reasonable results, e.g., a relative mAP improvement of 5% in the 100-positive-video case when fusing improved dense trajectory and high-level concept features. However, we show that using an inverse joint probability instead of the standard strategy to combine the concept feature and improved dense trajectory improves performance further, with a relative mAP improvement of 7% and a better AP on 16 of the 20 pre-specified multimedia events on the MED15 full evaluation set. The main reason is that classifiers trained on the high-level concept feature and improved dense trajectory can be complementary: with average fusion, a low score from one classifier downgrades a possibly relevant video, whereas with the inverse joint probability only videos that receive a low score from both classifiers are put at the bottom of the list (a small numerical sketch of the two rules appears at the end of this announcement).

Besides combining visual information, we also incorporate speech (ASR) and textual (OCR) information. Our ASR and OCR systems are tuned for high precision and retrieve only those videos that almost certainly contain the event; these results are used to promote the relevant videos in the ranked list. Our results show that the OCR system improves performance by 5.8% and 2% relative mAP in the 10- and 100-positive-video cases, respectively. The ASR system, on the other hand, is not precise enough, as adding ASR results does not increase overall performance. This indicates that both ASR and OCR may be useful for some events but not for others. Our individual systems are thus not the best on their own, but by combining multiple sources of information we can outperform systems that use only one source.

This paper was presented at the TRECVID 2015 Workshop, November 16-18, 2015, Gaithersburg, MD, USA.

Supervisor: Prof C W Ngo

Research Interests: Multimedia Event Detection; Multimedia Content Analysis; Semantic Concept Indexing

All are welcome!

In case of questions, please contact Prof NGO Chong Wah at Tel: 3442 4390, E-mail: cscwngo@cityu.edu.hk, or visit the CS Departmental Seminar Web at http://www.cs.cityu.edu.hk/news/seminars/seminars.html.
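
A note on the two fusion rules contrasted in the abstract: the exact formulation is not spelled out above, so the following is a minimal sketch under the assumption that each classifier outputs a probability score in [0, 1] and that "inverse joint probability" denotes the noisy-OR style combination 1 - (1 - p1)(1 - p2). Under that reading, a video sinks to the bottom of the ranked list only when every classifier scores it low, which matches the behaviour described in the abstract. The example scores below are hypothetical.

    import numpy as np

    def average_fusion(scores):
        # Standard strategy: mean of the per-feature probability scores.
        return np.mean(np.asarray(scores), axis=0)

    def inverse_joint_fusion(scores):
        # Assumed reading of "inverse joint probability": 1 - prod(1 - p_i).
        # Only a video scored low by ALL classifiers stays near the bottom.
        return 1.0 - np.prod(1.0 - np.asarray(scores), axis=0)

    # Hypothetical video: missed by the improved-dense-trajectory classifier
    # (score 0.1) but flagged as relevant by the concept classifier (0.9).
    p = [[0.1], [0.9]]
    print(average_fusion(p))        # [0.5]  -- dragged down by the low score
    print(inverse_joint_fusion(p))  # [0.91] -- the video stays highly ranked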