Speaker: Prof. Qiang Ji, Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute (RPI).
Title: Complex Activity Modeling and Recognition with Interval Temporal Bayesian Networks.
Complex activities typically consist of multiple primitive events happening in parallel or sequentially over a period of time. Understanding such activities requires recognizing not only each individual event but, more importantly, capturing their spatiotemporal dependencies over different time intervals. Current graphical-model-based approaches are mostly based on points in time and can therefore capture only three temporal relations: precedes, follows, and equals. Existing syntactic and description-based methods, while rich in modeling temporal relationships, lack the expressive power to capture uncertainty. To address these issues, we introduce the Interval Temporal Bayesian Network (ITBN), a novel graphical model that combines the Bayesian Network with the Interval Algebra to explicitly model a large variety of temporal dependencies while remaining fully probabilistic and expressive of uncertainty (an illustrative sketch of interval relations appears at the end of this document). Advanced machine learning methods are introduced to learn the ITBN model structure and parameters. Experimental results on benchmark real videos show that, by reasoning with spatiotemporal dependencies, ITBN can significantly outperform state-of-the-art dynamic models in recognizing complex activities.
Bio: Qiang Ji received his Ph.D. degree in Electrical Engineering from the University of Washington. He is currently a Professor with the Department of Electrical, Computer, and Systems Engineering at Rensselaer Polytechnic Institute (RPI). He recently served as a program director at the National Science Foundation (NSF), where he managed NSF's computer vision and machine learning programs. He has also held teaching and research positions with the Beckman Institute at the University of Illinois at Urbana-Champaign, the Robotics Institute at Carnegie Mellon University, the Department of Computer Science at the University of Nevada, Reno, and the US Air Force Research Laboratory. Prof. Ji currently serves as the director of the Intelligent Systems Laboratory (ISL) at RPI. His research interests are in computer vision, probabilistic graphical models, information fusion, and their applications in various fields. He has published over 190 papers in peer-reviewed journals and conferences. His research has been supported by major government agencies including NSF, NIH, DARPA, ONR, ARO, and AFOSR, as well as by major companies including Honda and Boeing. Prof. Ji is an editor of several related IEEE and international journals, and he has served as a general chair, program chair, technical area chair, and program committee member for numerous international conferences and workshops. He is a fellow of IAPR.

Speaker: Prof. Jason Corso, Department of Computer Science and Engineering, SUNY Buffalo.
Title: Can Language Play a Role in Large Scale Video Search?
Large-scale video search and mining is dominated by low-level features and classifiers. Although these methods have shown strong promise on problems such as video search by event type, they have limited ability to support rich, semantic queries. In contrast, if the underlying video representation is at a higher level, say attributes or language, such rich queries may be more plausible.
To that end, I will discuss my recent work on jointly modeling video and language and on converting video into language, in an effort to motivate semantically rich, large-scale video search.
Bio: Corso is an associate professor in the Computer Science and Engineering Department at SUNY Buffalo. He received his Ph.D. in Computer Science from The Johns Hopkins University in 2005. From 2005 to 2007, he was a post-doctoral research fellow in neuroimaging and statistics at the University of California, Los Angeles. He is the recipient of the Army Research Office Young Investigator Award (2010), the NSF CAREER Award (2009), the SUNY Buffalo Young Investigator Award (2011), and the Link Foundation Fellowship in Advanced Simulation and Training (2004), and was a member of the 2009 DARPA Computer Science Study Group. He has served as an Associate Editor of Computer Methods and Programs in Biomedicine since 2009. Corso has authored more than eighty papers spanning his research interests, including computer vision, robot perception, data mining, and medical imaging. He is PI on more than $5 million in research funding from major federal agencies, including NSF, NIH, DARPA, ARO, and IARPA.

Speaker: Dr. Josef Sivic, INRIA / Ecole Normale Supérieure, Paris, France.
Title: Towards Mid-level Representations of Video.
In this talk I will describe our recent work towards developing mid-level representations of video. First, I will discuss a joint model of actors and actions in movies that can localize individual actors in video and recognize their actions. The model is learnt only from weak textual supervision provided by the movie shooting script. We validate the model in the challenging setting of localizing and recognizing characters and their actions in the feature-length movies Casablanca and American Beauty. Second, motivated by the increasing availability of 3D films, we develop a mid-level representation of stereoscopic video that combines person detection, pose estimation, and pixel-wise segmentation of multiple people in video. We formulate the problem as an energy minimization that explicitly models depth ordering and occlusion of people. We demonstrate results on challenging indoor and outdoor scenes from the 3D feature-length movies Street Dance and Pina. Finally, we investigate a transfer learning approach based on convolutional neural networks (CNNs). We demonstrate that a mid-level image representation learnt using a CNN on a task with a large amount of fully labelled image data (ImageNet) can significantly improve visual recognition performance on related tasks where supervision is limited (a brief illustrative sketch of this transfer setup appears at the end of this document). The proposed method achieves state-of-the-art results on the Pascal VOC object classification and (still image) action recognition challenges. Applying the model to video seems within reach. Joint work with: K. Alahari, F. Bach, P. Bojanowski, L. Bottou, J. Ponce, I. Laptev, M. Oquab, G. Seguin, and C. Schmid.
Bio: Josef Sivic received a degree from the Czech Technical University, Prague, in 2002 and the PhD degree from the University of Oxford in 2006. His thesis, on efficient visual search of images and videos, was awarded the British Machine Vision Association 2007 Sullivan Thesis Prize and was shortlisted for the British Computer Society 2007 Distinguished Dissertation Award. His research interests include visual search and object recognition applied to large image and video collections.
After spending six months as a postdoctoral researcher in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology, he now holds a permanent position as an INRIA researcher at the Departement d'Informatique, Ecole Normale Superieure, Paris. He has published over 40 scientific publications and serves as an Associate Editor of the International Journal of Computer Vision. He was awarded an ERC Starting Grant in 2013.

Speaker: Prof. Yu-Gang Jiang, Fudan University, China.
Title: Recognizing actions and complex events in unconstrained videos.
Nowadays people produce a huge number of videos, many of which are uploaded to the Internet on social media sites such as YouTube and Vimeo. There is a strong need for automatic solutions for recognizing the contents of these videos. Potential applications of such techniques include effective video content management and retrieval, open-source intelligence analysis, etc. In this talk, I will introduce our recent work on human action and event recognition. I will start with a recently constructed Internet consumer video dataset, on which we measure human recognition performance on video events and compare it with popular automatic machine recognition solutions. After that, I will introduce an approach to constructing effective features for human action recognition. Finally, I will discuss the speed efficiency of popular techniques and suggest component-level options for "real-time" recognition.
Bio: Yu-Gang Jiang is an associate professor in the School of Computer Science at Fudan University, Shanghai. He directs the Lab for Big Visual Data Analytics, working on problems related to large-scale image and video data analysis, sponsored broadly by both government agencies and industrial partners. He is an active participant in several international benchmark evaluations and is one of the task organizers of the annual European MediaEval evaluations. At the U.S. NIST TREC video retrieval evaluation, systems designed by him achieved top performance in the 2008 video concept detection task and the 2010 multimedia event detection task. His work has led to a best demo award from ACM Hong Kong (2009), the second prize of the ACM Multimedia Grand Challenge (2011), recognition by IBM Watson Research as an "emerging leader in multimedia" (2009), and an award from Intel to outstanding young CS faculty in China (2013). He is a guest editor for IEEE Transactions on Multimedia's special issue on Socio-Mobile Media Analysis and Retrieval, and for Machine Vision and Applications' special issue on Multimedia Event Detection. He is a program co-chair of the ICCV 2013 workshop on Action Recognition with a Large Number of Classes and will serve as a Program Chair for ACM ICMR 2015. He received his PhD in Computer Science from the City University of Hong Kong. Before joining Fudan, he was a postdoctoral research scientist at Columbia University, New York.

Speaker: Prof. Junsong Yuan, Nanyang Technological University, Singapore.
Title: Discovering Visual Patterns in Video Data.
Motivated by previous success in mining structured data (e.g., transaction data) and semi-structured data (e.g., text), we are curious whether meaningful patterns can also be discovered in more complex data such as images and videos. However, unlike transaction and text data, which are composed of discrete elements without much ambiguity (i.e.,
predefined items and vocabularies), visual patterns generally exhibit large variability in their visual appearance, thus challenging existing data mining and pattern discovery algorithms. This talk will discuss my recent work on discovering visual patterns in videos, as well as its applications in video scene understanding, summarization, and anomaly detection.
Bio: Junsong Yuan received his Ph.D. from Northwestern University. He is currently a Nanyang Assistant Professor at Nanyang Technological University (NTU), Singapore, leading the video analytics program at the School of EEE. His PhD thesis, "Mining Image and Video Data," received the Outstanding EECS Ph.D. Thesis award from Northwestern University. He also received the Best Doctoral Spotlight Award from the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR'09). He has co-chaired workshops at CVPR'12, CVPR'13, and ICCV'13, and serves as an Area Chair for the IEEE Winter Conf. on Computer Vision (WACV'14) and the IEEE Conf. on Multimedia and Expo (ICME'14). He is an Organizing Chair and Area Chair for the Asian Conf. on Computer Vision (ACCV'14). He has recently given tutorials at IEEE ICIP'13, FG'13, ICME'12, SIGGRAPH VRCAI'12, and PCM'12.
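
Illustrative sketch 1 (interval relations, referenced in Prof. Ji's abstract). The ITBN abstract contrasts point-based temporal relations (precedes, follows, equals) with the richer set of relations available in the Interval Algebra. The following minimal sketch, not taken from the talk, shows how the thirteen interval relations of Allen's Interval Algebra can be classified for a pair of (start, end) intervals; the function name and the example events are assumptions made purely for illustration.

```python
def allen_relation(a, b):
    """Classify the Allen interval relation between intervals a=(a1, a2) and b=(b1, b2),
    assuming valid intervals with start < end."""
    a1, a2 = a
    b1, b2 = b
    # Seven "forward" relations; the remaining six are their inverses.
    forward = {
        "before":   a2 < b1,
        "meets":    a2 == b1,
        "overlaps": a1 < b1 < a2 < b2,
        "starts":   a1 == b1 and a2 < b2,
        "during":   b1 < a1 and a2 < b2,
        "finishes": b1 < a1 and a2 == b2,
        "equals":   a1 == b1 and a2 == b2,
    }
    for name, holds in forward.items():
        if holds:
            return name
    # Otherwise the relation is the inverse of the relation from b to a.
    inverse = {"before": "after", "meets": "met-by", "overlaps": "overlapped-by",
               "starts": "started-by", "during": "contains", "finishes": "finished-by"}
    return inverse[allen_relation(b, a)]

if __name__ == "__main__":
    # Two hypothetical primitive events of a complex activity, as (start, end) times.
    reach = (1.0, 3.0)
    grasp = (2.5, 4.0)
    print(allen_relation(reach, grasp))  # -> "overlaps"
    print(allen_relation(grasp, reach))  # -> "overlapped-by"
```

A point-based model would collapse the two example events into a single "precedes" relation between their onset times; the interval view retains the fact that they overlap.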
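Illustrative sketch 2 (CNN transfer learning, referenced in Dr. Sivic's abstract). The abstract describes reusing a mid-level image representation learnt by a CNN on ImageNet to improve recognition on tasks with limited supervision. Below is a minimal sketch of that general transfer-learning recipe, assuming PyTorch/torchvision and a ResNet-18 backbone purely for illustration; the original work used a different network and training pipeline.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_TARGET_CLASSES = 20  # e.g., the Pascal VOC object classes (assumption)

# 1. Load an ImageNet-pretrained backbone and freeze its parameters, so it acts
#    as a fixed mid-level feature extractor.
backbone = models.resnet18(pretrained=True)
for p in backbone.parameters():
    p.requires_grad = False

# 2. Replace the ImageNet classification head with a new head for the target
#    task; only this layer is trained on the small, fully supervised dataset.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_TARGET_CLASSES)

# 3. Optimize only the new head's parameters.
optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=0.01, momentum=0.9)
criterion = nn.BCEWithLogitsLoss()  # multi-label setting, as in VOC classification

def train_step(images, labels):
    """One gradient step on a batch from the small target-task training set."""
    logits = backbone(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The key point the sketch illustrates is that the expensive representation learning happens once on the large labelled source dataset, while the target task only fits a small classifier on top of the transferred features.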