Full Proposal

DeCoRe
Deep Convolutional and Recurrent Networks for Image, Speech, and Text
Action-team proposal, LabEx PERSYVAL, section Advanced Data Mining
March 30, 2016
Contents
1 Synopsis
2 Methodology
  2.1 Participating research groups
      2.1.1 THOTH team, INRIA/LJK
      2.1.2 GETALP team UGA/CNRS/LIG
      2.1.3 MRIM team UGA/CNRS/LIG
      2.1.4 AGPIG team UGA/CNRS/GIPSA-LAB
      2.1.5 AMA team UGA/CNRS/LIG
  2.2 Challenges and research directions
      2.2.1 Object recognition and localization
      2.2.2 Speech recognition
      2.2.3 Distributed representations for texts and sequences
      2.2.4 Image caption generation
      2.2.5 Selecting and evolving model structures
      2.2.6 Higher-order potentials for dense prediction tasks
3 Expected results
4 Detailed research plan for PhD scholarships and PostDoc
  4.1 PhD Thesis 1: encoder/decoder approaches for multilingual image captioning
  4.2 PhD Thesis 2: incremental learning for visual recognition
  4.3 PostDoc: representation learning for sequences
5 Positioning and aligned actions
  5.1 Positioning in LabEx Persyval
  5.2 Aligned actions outside LabEx Persyval
6 Requested resources
A CV of principal investigators
  A.1 Laurent Besacier
  A.2 Denis Pellerin
  A.3 Georges Quénot
  A.4 Jakob Verbeek
1 Synopsis
Scientific context. Recently, deep convolutional neural networks (CNNs) and recurrent neural networks
(RNNs) have yielded breakthroughs in different areas [29], including object recognition, machine translation,
and speech recognition. One of the key distinguishing properties of these approaches, across different application
domains, is that they are end-to-end trainable. That is: whereas conventional methods typically rely on a signal
pre-processing stage in which features are extracted, such as MFCC [3] for speech or SIFT [33] for images, in
deep end-to-end trainable systems each processing layer (from the raw input signal upwards) involves trainable
parameters which allow the system to learn the most appropriate features.
DeCoRe gathers experts from LJK, GIPSA-LAB and LIG in computer vision, machine learning, speech, natural language processing, and information retrieval, to foster collaborative interdisciplinary research in this rapidly evolving area, which is likely to underpin future advances in these fields for the next decade. We believe the DeCoRe project is a remarkable opportunity to bring together research groups in the Grenoble area with a critical mass in deep learning. It is also a chance to foster exciting research spanning different fields, such as computer vision and natural language processing.
Challenges and research directions. Within the broader scope of DeCoRe, funding and effort will be focused on several specific areas, including:
• Object recognition and localization. While neural networks have long been used in image/object recognition, first in character recognition [6] and face detection [10], they were only recently shown to be effective for general object recognition [26]. This was due to advances in effective training algorithms [38], the availability of very powerful parallel GPU hardware, and the availability of huge quantities of cleanly annotated data [5]. Open challenges that will be addressed in DeCoRe include efficiently detecting and localizing very large sets of categories, weakly supervised learning for object localization and semantic segmentation, as well as developing structured models that capture co-occurrence and spatial relation patterns to improve object localization. These themes will be studied for applications in both images and videos.
• Speech recognition. Neural networks have been used as feature extractors in HMM-based speech recognition systems [2, 18]. Recently, neural networks have started to replace larger parts of the speech processing chain previously dominated by HMMs [16]. There is also an increasing number of studies addressing speech processing tasks (notably speech recognition) with CNN-based systems using only spectrograms as input [9, 41]. The objectives of DeCoRe in this area are (i) to propose and benchmark end-to-end neural speech recognition pipelines, (ii) to better understand the information captured by CNNs or RNNs in acoustic speech modelling (as recently done for CNN-based image recognition [57]), and (iii) to investigate the potential of multi-task learning for deep neural network (DNN) based speech recognition (the possibility to exploit multi-genre training data to train a single system dedicated to several tasks or to several languages).
• Distributed representations for text. There has been a growing interest in distributed representations for text, largely due to [36], who proposed simple neural network architectures which can be trained on huge amounts of text (in the order of 100 billion words). A number of contributions have extended this work to phrases [37], text sequences [24, 28], and bilingual distributed representations [35]. These representations, also called word embeddings, can capture similarities between words or phrases at different levels (morphological, semantic). Bilingual word embeddings (a common representation for two languages) open avenues for new tasks such as cross-lingual image captioning (train in English, caption in French).
• Image caption generation. Recently, RNNs [7, 19] have proven effective at producing natural language descriptions of images [21, 54]. Although these results are impressive, a number of challenges in this area will be addressed in DeCoRe. These include the scalability needed to use such models for natural-language-based image search, and generalization to words that were not seen in the training data. Another challenge is to develop methods that associate words in the caption with image regions [23], with the goal to improve generalization by exploiting visual scene compositionality. A final challenge is to infer basic spatial relations among objects from the image and to report these in the generated descriptions (“A man on a bike” vs. “A man on the left of a large bike”).
Caption generation will play a central role, integrating image understanding and language generation models.
Positioning. DeCoRe fits excellently in Persyval’s research action Advanced Data Mining (ADM), and directly addresses one of its three main challenges: “Mining multi-modal data”. The understanding of speech, visual content, and text is among the core topics of modern data mining. None of the existing Persyval-funded actions has a direct overlap with DeCoRe.
Figure 1: Schematic comparison of conventional hand-crafted feature approaches and the deep-learning end-to-end trainable approach. Figure credit: Yann LeCun.
2 Methodology
Deep convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have yielded breakthroughs in several areas [29], including object recognition [26], machine translation [52], and speech recognition [16]. These approaches are end-to-end trainable: feature processing (MFCC [3], SIFT [33]) and mid-level feature extraction are replaced by neural machines in which each processing layer is trainable. This allows the system to learn the appropriate hierarchical features from raw data (signal, image, spectrogram).
The second key property is the “deep” layered hierarchical structure of these models. While 2-layer perceptrons have long been known to be universal function approximators [55], they may require an arbitrarily large number of units in the single hidden layer. The power of “deep” layered architectures lies in their efficiency, in terms of the number of parameters, to specify highly complex patterns [39]. Intuitively, this efficiency is a result of compositionality: each layer of the network extracts non-linear features which are defined in terms of the features of the previous layer. In this manner, after several layers, complex object-level patterns can be detected as a constellation of parts, which are detected as a constellation of sub-parts, etc. [13]. Visualizations of the activations of neurons in convolutional networks confirm this intuition [57]. Earlier state-of-the-art computer vision models were mostly based on hand-crafted single-level representations (e.g. color histograms [53]), unsupervised two-level representations (bag-of-words [50], Fisher vectors [43]), or three-level representations [8]. Deeper models only became a viable alternative to such shallow networks once the right regularization methods [51], large datasets [4], and massively parallel GPU compute hardware were all in place.
Taken together, the exceptional results obtained with deep end-to-end trainable systems underline the importance of learning the “feature” or “representation”, rather than just the “classifier”, as was the dominant approach before. See Figure 1 for a schematic illustration of how the deep learning approach compares to the conventional approach based on hand-crafted features with trainable linear classifiers. Interestingly, it has also recently been found that the activations in deep CNN models correlate better with activations in the inferior temporal (IT) cortex in primates than traditional feature-based approaches do [22].
DeCoRe brings together a critical mass of experts in information retrieval, computer vision, machine learning, natural language processing and speech recognition from five research groups hosted in Grenoble’s three computer science and applied mathematics laboratories. The main objective of DeCoRe is to foster collaborations in the Grenoble research community in the area of deep learning, which is rapidly evolving and likely to underpin future advances in the considered application areas for the next decade. The collaboration involves cross-institute research, training of PhD students and MSc interns, but also the organization of reading groups, workshops, and teaching of MSc-level courses.
2.1 Participating research groups
In this section we give a description of the five research groups that host most participating researchers. For each group we list the participating research staff, a description of the research directions, and the principal investigator.
The SigmaPhy team at GIPSA-LAB (http://www.gipsa-lab.fr/sigmaphy/accueil-sigmaphy) is also part of the network of teams in DeCoRe working on deep learning, but not part of the core organizing and funds-requesting teams. SigmaPhy studies image processing and wave physics for natural environment characterization and surveillance. This includes underwater acoustics (active and passive observation, localization in complex environments), optical and radar remote sensing, and transient signal imagery (seismic imagery, ultrasonic signals, fluorescence signals). In November 2015, M. Malfante started a PhD thesis supervised by J. Mars and M. Dalla Mura on deep learning for recognition problems in submarine acoustic signals.
2.1.1 THOTH team, INRIA/LJK
• Website: http://lear.inrialpes.fr
• Participants: Jakob Verbeek (coordinator, CR), Cordelia Schmid (DR), Julien Mairal (CR), Karteek
Alahari (CR).
• Team description: THOTH (formerly known as LEAR, renamed in March 2016) is focused on computer vision and machine learning. Its main long-term objective is to learn structured visual recognition models from little or no manual supervision. Research focuses on the design of deep convolutional and recurrent neural network architectures, in particular those that can be used as a general-purpose visual recognition engine suitable to support many different tasks (recognition of objects, faces, actions, localization of objects and parts, pose estimation, textual image description, etc.). A second research axis focuses specifically on learning such models from as little supervision as possible. The third research direction is large-scale machine learning, needed to deploy such models on large datasets with little or no supervision.
• Principal investigator: J. Verbeek currently supervises two PhD students: one in co-supervision with C. Couprie from Facebook AI Research (FAIR), on the topic of deep learning for weakly supervised semantic video segmentation; the other funded by a national ANR project on metric learning and CNN models for face recognition in unconstrained conditions, including non-cooperative and non-visible-spectrum images. He also supervises a PostDoc and an MSc intern on RNN models for image captioning. He is involved in a national ANR grant application which federates six research centers across France around the topic of low-power embedded applications of deep learning. J. Verbeek teaches the course Advanced Learning Models on (deep) neural networks in the Industrial and Applied Mathematics MSc program at the Univ. of Grenoble.
2.1.2 GETALP team UGA/CNRS/LIG
• Website: http://getalp.imag.fr
• Participants: Laurent Besacier (Prof., co-organizer), Benjamin Lecouteux (MC), Christophe Servan (PostDoc).
• Team description: GETALP (Study Group for Machine Translation and Automated Processing of Languages and Speech) was born in 2007 when LIG was created. Born from the virtuous union of researchers in spoken and written language processing, GETALP is a multidisciplinary group (computer scientists, linguists, phoneticians, translators and signal processing specialists) whose objective is to address all theoretical, methodological and practical aspects of multilingual communication and multilingual (written or spoken) information processing, with a focus on speech recognition and machine translation. GETALP’s methodology relies on continuous interplay between data collection, fundamental research, development of systems, applications and experimental evaluations.
• Principal investigator: L. Besacier became interested in deep learning approaches for spoken language processing three years ago, and has supervised a PhD student on automatic speech recognition for under-resourced languages using deep neural networks (Sarah Samson Juan, PhD defended in 2015). He currently supervises or co-supervises several PhDs on topics related to DeCoRe: deep and active learning for multimedia (Mateusz Budnik, with MRIM), recurrent neural networks for cross-lingual annotation propagation (Othman Zenaki, with CEA/LIST) and cross-language plagiarism detection using word embeddings (Jeremy Ferrero, with Compilatio S.A.). He currently supervises three MSc interns: on Long Short-Term Memory (LSTM) networks for speech recognition, DNN compression for speech transcription, and neural machine translation.
2.1.3 MRIM team UGA/CNRS/LIG
• Website: http://lig-mrim.imag.fr
• Participants: Georges Quénot (DR, co-organizer), Jean-Pierre Chevallet (MC), and Philippe Mulhem
(CR).
• Team description: The research carried out in MRIM targets the information retrieval and mobile computing domains. While studies in information retrieval are dedicated to satisfying users’ information needs from a huge corpus of documents, those conducted in mobile computing are dedicated to satisfying mobile users’ needs in terms of services taken from a corpus of services and then composed together. In both domains, users express their needs through queries, and the system returns relevant documents or personalised services, i.e., documents/services that match the users’ query.
• Principal investigator: Georges Quénot has worked for over 15 years on video content indexing and retrieval. He has been a co-organizer of TRECVid since its beginning in 2001. He started using deep learning in this context three years ago and is co-supervising a PhD student (Mateusz Budnik, with the GETALP group) and a Master student (Anuvabh Dutt, with the AGPIG group) on this subject. He obtained excellent results at the TRECVid semantic indexing task (ranking between second and fourth) using this approach. He also successfully applied the same method to still images, currently ranking first at the VOC 2012 object classification task (comp1, post-campaign).
2.1.4 AGPIG team UGA/CNRS/GIPSA-LAB
• Website: http://www.gipsa-lab.grenoble-inp.fr/agpig
• Participants: Denis Pellerin (Prof., co-organizer), Michèle Rombaut (Prof.)
• Team description: GIPSA-lab (Laboratoire Grenoble Images Parole Signal Automatique) is a research unit between CNRS, Grenoble-INP and University Grenoble Alpes. The Architecture Geometry Perception Image Gesture (AGPIG) team of GIPSA-lab has a long experience of image/video analysis and indexing. Its research interests include image/video classification, human action recognition, facial analysis, and audiovisual scene analysis for robot companions. It has expertise in visual attention modelling, data fusion with transferable belief models, dictionary learning, as well as joint architecture/algorithm exploration.
• Principal investigator: Denis Pellerin started to work on deep learning networks for image classification two years ago. With Georges Quénot, he co-supervised one master student (Efrain-Leonardo Gutierrez-Gomez in 2015) and is co-supervising another (Anuvabh Dutt in 2016) on this subject. His research interests include i) video analysis and indexing: image and video classification, human action recognition, video summarization, active vision for robots; ii) visual perception and modeling: visual salience, attention models, visual substitution.
2.1.5 AMA team UGA/CNRS/LIG
• Website: http://ama.liglab.fr
• Participants: Eric Gaussier (Prof.), Ahlame Douzal (MC).
• Team description: The research of the AMA team fits within the general framework of data science,
with a strong focus on data analysis, machine learning and information modeling. Within this framework,
the AMA team is interested in developing new theoretical tools, algorithms and systems for analyzing and
making decisions on complex data. The research of the team is organized in three main, complementary
axes: data analysis and learning theory, learning and perception systems, and modeling social systems.
• Principal investigator: Eric Gaussier started to work on deep learning for information access two years ago. He has been particularly interested in obtaining collection-independent representations that can be used for transfer learning. More recently, in collaboration with Ahlame Douzal, he has become interested in deep learning representations for time series, with applications to prediction and classification. This topic is the focus of the ANR project LOCUST (with LIP6, UPMC), which started in January 2016.
2.2 Challenges and research directions
Within the broader scope of DeCoRe, effort will be focused on several more specific topics, presented in the following sections. Some of these topics are oriented towards a specific application domain, others towards scientific challenges that cut across all the considered application domains.
2.2.1 Object recognition and localization
Neural networks have long been used in visual object recognition, first in character recognition [6] and face detection [10], but they were only recently shown to be effective for general object recognition [26]. This was due to advances in effective training algorithms [38], the availability of very powerful parallel GPU hardware, and the availability of huge quantities of cleanly annotated data [5]. Since then, many improvements have been brought, including the use of very deep (19 layers) [49] and even ultra-deep (152 layers) [17] architectures, and the localization of objects using CNNs [12, 14, 40, 46, 48]. In order to avoid complete re-training of large networks, incremental methods have recently been proposed for the dynamic inclusion of new categories [56].
The main objectives of DeCoRe in this area are the development of new methods for (i) efficiently detecting and localizing very large sets of categories, (ii) weakly supervised learning for object localization and semantic segmentation, (iii) developing structured models to capture co-occurrence and spatial relation patterns to improve object localization, and (iv) building models for dynamically evolving sets of categories using incremental learning.
Object recognition and localization is the main topic of one of the funded PhD scholarships, further described in Section 4.2.
2.2.2 Speech recognition
Neural networks have been used as feature extractors in HMM-based speech recognition systems [2, 18]. Recently, neural networks have started to replace larger parts of the speech processing chain previously dominated by HMMs [16]. There is also an increasing number of studies addressing speech processing tasks (notably speech recognition) with CNN-based systems using only spectrograms as input [9, 41]. Lately, recurrent neural networks (RNNs) have also been introduced for speech recognition because of their modelling capabilities for sequences. RNNs allow the model to store temporal contextual information directly, without explicitly defining the size of temporal contexts (e.g. the time convolution filter size in CNNs). Among several implementations of RNNs, Long Short-Term Memory (LSTM) [19] networks have the capability to memorize sequences with long-range temporal dependencies and are starting to be used for end-to-end speech recognition.
The main objectives of DeCoRe in this area are: (i) propose and benchmark an efficient end-to-end speech recognition pipeline for multiple languages including English and French; (ii) better understand the information captured by CNNs or RNNs in acoustic speech modelling (as recently done for CNN-based image recognition [57]); (iii) propose architectures which combine front-end deep CNN models (acting as trainable feature extractors) with LSTMs (modeling the context from the sequential acoustic signal); (iv) explore data augmentation techniques for speech recognition: data augmentation consists in increasing the quantity of training data and has been widely used in image processing, see e.g. [42], but hardly ever in speech processing; (v) exploit the ability of deep neural networks to benefit from transfer learning (transferring knowledge between tasks), which has been widely studied in the neural network literature. For instance, it is particularly useful to transfer knowledge from one language to another for cross-lingual speech modeling and rapid development of systems for new target languages. Encoder-decoder approaches [52] lend themselves extremely well to such an approach [34].
This research topic is studied in the GETALP group through several MSc internships and will be strengthened by collaborations within DeCoRe.
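To make concrete the gating mechanism that gives LSTMs their long-range memory (point (iii) above), here is a minimal NumPy sketch of a single LSTM step. All names and dimensions are illustrative, not taken from any of the cited systems.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step [19]. The cell state c is updated additively,
    which lets context (and gradients) survive over long time spans."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b      # all four gates in one product
    i = sigmoid(z[0*d:1*d])                      # input gate: what to write
    f = sigmoid(z[1*d:2*d])                      # forget gate: what to keep
    o = sigmoid(z[2*d:3*d])                      # output gate: what to expose
    g = np.tanh(z[3*d:4*d])                      # candidate cell update
    c = f * c_prev + i * g                       # long-range memory update
    h = o * np.tanh(c)
    return h, c

# Toy usage: encode a sequence of 8 spectrogram-like frames into a state.
rng = np.random.default_rng(0)
d_in, d = 40, 16                                 # e.g. 40 filter-bank coefficients
W = rng.normal(0.0, 0.1, (4 * d, d_in + d))
b = np.zeros(4 * d)
h, c = np.zeros(d), np.zeros(d)
for _ in range(8):
    h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
print(h.shape)                                   # (16,) fixed-size encoding
```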
2.2.3 Distributed representations for texts and sequences
There has been a growing interest in distributed representations for text, largely due to [36], who proposed simple neural network architectures which can be trained on huge amounts of text (in the order of 100 billion words). A number of contributions have extended this work to phrases [37], text sequences [28], and bilingual distributed representations [35]. These representations, also called word embeddings, can capture similarities between words or phrases at different levels (morphological, semantic). Bilingual word embeddings (a common representation for two languages) open avenues for new tasks such as cross-lingual image captioning (train in English, caption in French) and neural machine translation [34].
Beyond text, sequences of objects such as time series can also be embedded into representations that abstract away from the representation problems raised by multi-scale, multi-variate and multi-modal sequences. Deep learning here offers an integrated solution for sequences that can be used in a variety of contexts.
Bilingual word embedding is part of one of the funded PhD scholarships, further described in Section 4.1. Sequence embedding will also be studied by the requested post-doc, co-supervised by AMA, GETALP and THOTH.
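To make the idea of learning word embeddings concrete, the following is a minimal NumPy sketch of skip-gram training with negative sampling, in the spirit of [36, 37]. The toy corpus, dimensions and learning rate are illustrative only, not the configuration of any cited system.

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.1, (V, D))    # word embeddings (what we keep)
W_out = rng.normal(0, 0.1, (V, D))   # context ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):                 # epochs over the toy corpus
    for pos, w in enumerate(corpus):
        for cpos in range(max(0, pos - window), min(len(corpus), pos + window + 1)):
            if cpos == pos:
                continue
            # one positive (word, context) pair plus 3 random negatives
            for j, label in [(idx[corpus[cpos]], 1.0)] + \
                            [(int(rng.integers(V)), 0.0) for _ in range(3)]:
                v = W_in[idx[w]].copy()          # logistic-loss gradient step
                grad = sigmoid(v @ W_out[j]) - label
                W_in[idx[w]] -= lr * grad * W_out[j]
                W_out[j] -= lr * grad * v

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words used in similar contexts ("cat"/"dog") end up with similar vectors.
print(cos(W_in[idx["cat"]], W_in[idx["dog"]]))
```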
2.2.4 Image caption generation
Recently, RNNs [7, 19] have proven effective at producing natural language descriptions of images [21, 54]. Although these results are impressive, there are a number of open challenges in this area. These include the scalability needed to use such models for natural-language-based image search, and generalization to words that were not seen in the training data. Another challenge is to develop methods that associate words in the caption with image regions; to date, only very few works exist along these lines [21, 23]. The goal is to improve generalization by exploiting visual scene compositionality. Moreover, region-based visual modeling will also be key to inferring spatial relationships between objects, and for visual “grounding”, so that if multiple objects of the same category exist in a scene, the model is able to distinguish them and to associate properties with the individual instances.
Caption generation will play a central role in DeCoRe since it brings together image understanding models
and sequential language generation models. One of the two funded PhD scholarships will specifically address this
research area. More details are given in Section 4.1.
2.2.5 Selecting and evolving model structures
One of the main problems in applying deep neural networks is the choice of architecture. The space of architectures is large and discrete: a specific network is defined by the number of layers, the number of nodes per layer, the type of non-linearity (sigmoid, rectifiers, maxout [15]), filter sizes for CNNs, the type of pooling operations, the ordering of pooling and convolutional layers, etc. Naively testing different architectures one by one is a hopelessly intractable approach, and more systematic approaches are needed, for example using sparsity-inducing regularizers over the weight space [27], or hierarchical non-parametric approaches to learn the structure of probabilistic graphical models [1].
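As an illustration of the sparsity-regularization route, the sketch below applies a group-lasso proximal step that prunes entire hidden units of a layer during training; it illustrates the general idea, not the specific methods of [27] or [1], and all dimensions are hypothetical.

```python
import numpy as np

def group_l1_prox(W, lam):
    """Proximal operator of lam * sum_j ||W[:, j]||_2 (group lasso over
    the columns of W). Columns whose norm falls below lam are set to
    exactly zero, which removes the corresponding hidden unit and thus
    selects the effective layer width during training."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    shrink = np.maximum(0.0, 1.0 - lam / np.maximum(norms, 1e-12))
    return W * shrink

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (64, 32))   # one layer: 64 inputs, 32 hidden units
W = group_l1_prox(W, lam=0.8)        # would follow each gradient step
print(int((np.linalg.norm(W, axis=0) == 0).sum()), "of 32 units pruned")
```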
The design of efficient model selection approaches, for example based on (structured) regularization, is an important research topic today, regardless of the application domain. Moreover, adapting and expanding the network architecture over time (as more training data becomes available, or simply as more data has been seen by the model during training) will be important for future large-scale learning scenarios where training the model will not be a matter of hours or days, but rather weeks, months, or longer. Such scenarios are particularly important in the context of learning from very large minimally supervised datasets. Network adaptation will require methods to assess to what extent the current network capacity has been saturated by the training data, so as to determine whether the network needs to be expanded.
This research topic will be studied within the context of two submitted ANR projects by THOTH and MRIM.
2.2.6 Higher-order potentials for dense prediction tasks
Many tasks in computer vision require dense predictions at the pixel level. For example, in semantic segmentation the goal is to predict the semantic category label for each pixel (e.g. pedestrian, car, building, road, sign, bicycle, tree, sky, etc.). Other dense prediction tasks include optical flow estimation, depth estimation, image de-noising, super-resolution, colorization, deblurring, etc. These dense prediction tasks are typically solved using (conditional) Markov random fields [11], which include unary data terms for each pixel, and pairwise terms to ensure spatial regularity of the output predictions. Deep networks have been used for such tasks [32] to define data-dependent unary and pairwise terms [30]. Moreover, it has recently been shown that variational mean-field inference [20] in Markov random fields can be expressed as a special recurrent neural network [47, 58]. This allows the unary and pairwise potentials to be trained in a way that is coherent with the MRF structure, and optimal with respect to the approximate inference method used for prediction.
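To illustrate why mean-field inference unrolls naturally into a recurrent network, here is a minimal NumPy sketch of mean-field updates for a grid CRF with a simple Potts-style pairwise term; it is a schematic of the idea behind [47, 58], not their exact formulation, and all parameters are illustrative.

```python
import numpy as np

def mean_field(unary, n_iters=10, w_pair=1.0):
    """Mean-field inference for a grid CRF: unary has shape (H, W, L)
    (negative log-scores for L labels per pixel). Each iteration applies
    the same deterministic update, so unrolling n_iters iterations gives
    a recurrent network whose steps share parameters."""
    q = np.exp(-unary)
    q /= q.sum(axis=2, keepdims=True)        # initialize from the unaries
    for _ in range(n_iters):
        msg = np.zeros_like(q)               # 4-neighbour label marginals
        msg[1:, :] += q[:-1, :]
        msg[:-1, :] += q[1:, :]
        msg[:, 1:] += q[:, :-1]
        msg[:, :-1] += q[:, 1:]
        q = np.exp(-unary + w_pair * msg)    # reward agreeing with neighbours
        q /= q.sum(axis=2, keepdims=True)
    return q

rng = np.random.default_rng(0)
labels = mean_field(rng.normal(size=(16, 16, 5))).argmax(axis=2)
print(labels.shape)                          # (16, 16) spatially smoothed labels
```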
While higher-order potentials (which model interactions of more than two prediction variables at a time) have proven effective in the past for dense prediction tasks [25], efficient inference is only possible for a very small and specific class of higher-order potentials. An open question we will study in DeCoRe is how more general higher-order potentials can be formulated using deep convolutional networks over label fields, in a way that permits efficient approximate inference, for example building upon the recurrent convolutional model of Pinheiro and Collobert [44].
This research topic is studied in particular in the context of the PhD thesis between THOTH and Facebook
AI Research.
3 Expected results
The objective of DeCoRe is to generate the following outcomes.
• Scientific knowledge: disseminated mainly in the form of scientific conference and journal papers,
preferably in open-access venues.
• Transfer: particular research results may give rise to technology that can be protected or transferred to industry. Locally, Xerox Research Center Europe (Meylan), ST Microelectronics (Grenoble), and NVIDIA (Grenoble) are all active in deep learning for computer vision, and could therefore be logical partners for transfer.
• Infrastructure know-how: exchanges on the most effective and cost-efficient hardware setups to train deep neural networks. This also includes exchanges on multi-GPU and multi-machine implementations. The contact between INRIA and an NVIDIA researcher on computer vision and deep learning in Grenoble is extremely useful in this respect.
• Software: we will contribute our research results in the form of code to open-source tools that are essential in this fast-evolving area:
– Caffe: Convolutional architecture for fast feature embedding. See http://caffe.berkeleyvision.org
– Theano: general purpose (deep) neural network library, particularly suitable for recurrent networks.
See http://deeplearning.net/software/theano
– Kaldi: Open-source toolkit for automatic speech recognition http://kaldi.sourceforge.net
– MultiVec (partially developed by LIG in collaboration with LIFL lab.): a multilingual and multilevel
representation learning toolkit for NLP https://github.com/eske/multivec
• Training: funding and supervision of 2 PhD students and 6 MSc students, and structuring MSc teaching on deep learning in Grenoble.
• Interaction: invited researchers, organization of workshops, seminars, and cross-institute reading groups.
4 Detailed research plan for PhD scholarships and PostDoc
4.1 PhD Thesis 1: encoder/decoder approaches for multilingual image captioning
• Supervisors: L. Besacier and J. Verbeek
• Location: 50% between the GETALP and THOTH teams
• Topic: The focus of this PhD will be on recurrent encoder-decoder models and their application to several modalities (image, speech, text). Such models have been found effective for machine translation [52], and lend themselves well to image captioning [23]. The idea is to encode the input (image or sentence) into a continuous semantic space. The encoder can be a recurrent LSTM [19] network for a sentence, or a CNN model for an image. The decoder takes the input encoding and generates a sequential output of variable length (e.g. a sequence of words) in a step-by-step manner. See Figure 2 for several examples of images with automatically generated captions, and the sketch after the focus areas below for a schematic view of the decoding loop.
As a key application, we will consider multilingual image captioning, i.e. the generation of image descriptions in a target language, given training data which includes a collection of images and their descriptions in a different source language. The Multimodal Machine Translation Challenge provides excellent benchmark data for this problem, see http://www.statmt.org/wmt16/multimodal-task.html
• Focus areas:
– Text encoder architectures: since the input sentence is given at once (and not generated), there are many possibilities for the architecture of the input encoder. For example, bidirectional RNNs may be used [21] instead of uni-directional models. We will evaluate existing sequence encoding models for image captioning, and propose novel ones based on the results.
Figure 2: Example images with natural language descriptions automatically generated with an RNN model with LSTM units (“A cat sitting on top of a suitcase.”, “A group of people riding skis down a snow covered slope.”, “A close up of a plate of food on a table.”). The COCO dataset [31] was used to train the model, and the examples come from the test set.
– Learning from weak supervision: in current research, image captioning models are trained from supervised training data where images are annotated by hand with multiple very descriptive sentences, sometimes also localized in the image [45]. While this is good for initial research, it will not scale to real applications, where large and diverse training datasets are needed. Annotating such datasets is too costly, and hence weakly supervised learning is needed. We will develop latent variable models to infer object locations from image-sentence pairs, and learn models from internet data such as stock-photography websites which host many images with natural language descriptions, see e.g. http://www.shutterstock.com. We will also consider the use of aligned multilingual text corpora to pre-train text encoder-decoder models, which can be combined with image encoder models. In particular, we expect larger pure-text corpora to considerably improve the text generation (decoder) quality.
– Region-based image representation: a distributed region-based image representation is promising for at least three reasons: to improve generalization (combining a limited number of object categories in many different scenes), to enable relative geometrical statements (a is on the left of b), and to enable grounding of properties and attributes to individual object instances (there may be a tiny white horse and a large black one in the scene, and a good description will not mix properties of different objects even if they belong to the same category). Region-based encoder-decoder models for images, however, have hardly been proposed in the literature [21, 23]. We will develop new region-based image representations for this purpose, based on convolutional and recurrent network structures.
– Data augmentation: increasing the quantity of training data has been widely used in image processing, see e.g. [42]. For cross-lingual image captioning, several captions per image (instead of one) can easily be obtained using automatic paraphrasing (for a mono-lingual image captioning task) or machine translation (for a cross-lingual image captioning task). We will explore data augmentation scenarios for image captioning that operate jointly at the image level (image transformations) and text level (paraphrasing).
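The following NumPy sketch shows the control flow of the encoder-decoder scheme discussed in the topic above: a CNN image encoding initializes a recurrent decoder that greedily emits one word per step. The weights are random (there is no training loop) and all names, dimensions and the tiny vocabulary are hypothetical, so the output is meaningless; the point is only the shapes and the decoding loop.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<start>", "<end>", "a", "man", "on", "bike"]
V, D_img, D = len(vocab), 4096, 32        # e.g. 4096-d CNN (fc7-like) features

W_img = rng.normal(0, 0.05, (D, D_img))   # image encoding -> initial state
W_h = rng.normal(0, 0.05, (D, D + D))     # recurrent transition (plain RNN here;
                                          # an LSTM step would replace this)
W_emb = rng.normal(0, 0.05, (V, D))       # word embeddings
W_out = rng.normal(0, 0.05, (V, D))       # hidden state -> vocabulary scores

img_feat = rng.normal(size=D_img)         # stand-in for a CNN image encoding
h = np.tanh(W_img @ img_feat)             # decoder state conditioned on image
word = vocab.index("<start>")

caption = []
for _ in range(10):                        # greedy decoding, capped at 10 words
    h = np.tanh(W_h @ np.concatenate([W_emb[word], h]))
    word = int(np.argmax(W_out @ h))       # most likely next word
    if vocab[word] == "<end>":
        break
    caption.append(vocab[word])
print(" ".join(caption))
```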
4.2 PhD Thesis 2: incremental learning for visual recognition
• Supervisors: Georges Quénot and Denis Pellerin.
• Location: 50% between the MRIM and AGPIG teams
• Topic: This PhD will focus on the detection of visual categories in still images and videos. It will especially study the problem of the dynamic adaptation of CNN models to newly available training data, to newly needed target categories and/or to new or specific application domains (e.g. medical, satellite or life-log data). Effective architectures are now very deep (19 layers) [49] and even ultra-deep (152 layers) [17], and need very long training times: up to several weeks, even using very powerful multi-GPU hardware. It is not possible, or not efficient, to retrain a complete model for a particular set of new categories or for applying already trained categories to different domains. Incremental learning [56] is a way to adapt already trained networks for such needs at a low marginal cost (a schematic illustration of extending a trained classification layer is given after the focus areas below). Also, various forms of weakly supervised learning and active learning can be used in conjunction to further improve system performance. Localization of target categories [40] is also very important. First, knowing where objects are located in images helps to build better models, especially in a semi-supervised setting. Second, in the context of DeCoRe, it will be essential for providing elements for the generation of detailed textual descriptions.
• Focus areas:
– Incremental learning and evolving network architectures: new methods will be studied for building networks that operate in a “continuous learning” mode, permanently improving themselves. Improvements will be possible through the continuous inclusion of new target concepts (possibly including the full ImageNet set and even beyond), and through the adaptation of already trained concepts to new target domains (e.g. satellite images or life-logging content). Incremental learning methods will be considered, as well as network architecture evolution.
– Active learning and weakly supervised learning: various forms of these approaches, as well as of semi-supervised learning, have proven very effective and efficient for content-based indexing of images and videos, both at the image or shot level and at the region or even pixel level. These also fit very well with incremental learning. The goal here will be to integrate them efficiently in order to extract as much information as possible from all available annotated, non-annotated, and weakly annotated data. This will also involve classification using hierarchical sets of categories, and knowledge transfer between categories and between application domains. Data augmentation will also be considered, specifically in the context of active learning.
– Salience: salience is a very important prior in object detection. It can be considered from two perspectives, using either user gaze information or the localization of main categories. In both cases, salience can be learned using deep networks and later used for improving object detection and localization. We will explore how salience extraction and its use can be efficiently combined with incremental and active learning.
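As a schematic illustration of the incremental-learning idea mentioned in the topic above, the sketch below grows a trained classification head with new category rows at negligible cost, keeping the old weights intact. It shows the general principle, not the error-driven method of [56], and all dimensions are illustrative.

```python
import numpy as np

def add_categories(W_out, b_out, n_new, rng):
    """Grow a trained classification layer by n_new categories: rows for
    the old classes are kept as-is, new rows are freshly initialized and
    can be trained alone at first, so the marginal cost of adding
    categories stays low compared to retraining the whole network."""
    d = W_out.shape[1]
    W_new = rng.normal(0.0, 0.01, (n_new, d))
    return np.vstack([W_out, W_new]), np.concatenate([b_out, np.zeros(n_new)])

rng = np.random.default_rng(0)
W_out = rng.normal(0.0, 0.1, (1000, 512))   # e.g. a 1000-class ImageNet head
b_out = np.zeros(1000)
W_out, b_out = add_categories(W_out, b_out, 50, rng)  # 50 new target concepts
print(W_out.shape, b_out.shape)             # (1050, 512) (1050,)
```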
4.3 PostDoc: representation learning for sequences
• Supervisors: Laurent Besacier, Eric Gaussier and Jakob Verbeek
• Location: 30% between the AMA, GETALP and THOTH teams
• Topic: Encoding/decoding architectures such as the ones envisaged in Section 4.1 capture local and global dependencies, as well as ordering information. Such architectures are well suited for addressing several generic problems pertaining to sequence data (such as prediction, classification and clustering), and the goal of this postdoc will be to extend current encoding/decoding architectures to time series. In particular, we will (1) design a method to transform general time series into input vectors for encoding/decoding architectures, and (2) adapt the decoding module to output multi-modal, multi-variate time series.
• Focus areas:
– Advanced encoder models: Machine learning techniques for prediction, classification and clustering usually operate on vectors; it is thus important to find fixed-size representations of the examples considered. Such representations, for standard time series, can be obtained using RNN-based encoder models that assume a single input sequence sampled at a constant rate, without any missing values. The problem is however more complex for multi-scale, multi-modal and multi-variate time series, such as the ones we plan to study, inasmuch as (a) the sampling time of a given variable varies over time, and (b) several values may be missing, for example due to the unreliability of the associated sensors. We plan to investigate encoder models for such complex time series, in particular by making the recurrent updates dependent on the observation intervals (a minimal sketch of such a time-gap-dependent update follows the focus areas below).
– Complex multi-variate decoders: Complex time series also require specific outputs, in which one can
have several ordered sequences (instead of just one ordered sequence in the case of text). We will
study here the extension of standard decoding architectures to deal with several ordered sequences,
possibly sampled at different frequencies.
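A minimal sketch of a recurrent update that depends on the observation interval, as mentioned in the first focus area: the previous state is decayed according to the elapsed time before the usual update, so longer gaps weaken the carried-over context. This is an illustrative, hypothetical variant, not a finalized design, and all names and the decay form are assumptions.

```python
import numpy as np

def gap_aware_step(x, dt, h, W, decay=0.5):
    """One recurrent update for an irregularly sampled series: decay the
    previous state h according to the elapsed time dt, then apply a
    standard recurrent transition. Longer gaps thus carry over less
    context."""
    h = h * np.exp(-decay * dt)                  # time-dependent forgetting
    return np.tanh(W @ np.concatenate([x, h]))

rng = np.random.default_rng(0)
d_in, d = 3, 8                                   # 3 sensor channels, 8-d state
W = rng.normal(0.0, 0.2, (d, d_in + d))
h = np.zeros(d)
for dt in (0.1, 0.1, 2.5, 0.2):                  # uneven sampling intervals
    h = gap_aware_step(rng.normal(size=d_in), dt, h, W)
print(h.shape)                                   # (8,) fixed-size encoding
```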
5 Positioning and aligned actions
5.1 Positioning in LabEx Persyval
DeCoRe fits excellently in Persyval’s research action Advanced Data Mining (ADM), and directly addresses one
of its three main challenges: “Mining multi-modal data”. The understanding of speech, visual content, and text
are among the core topics of modern data mining.
Although no existing Persyval-funded actions have a direct overlap with DeCoRe, we list related ones for completeness. The exploratory project Phon&Stat deals with speech data, but its goal is to use statistical data analysis models and tools for experimental phonology and phonetics. The project-team Khronos focuses on theoretical analysis and statistical modeling of time-series data with non-i.i.d. data models. The project-team Persyvact2 aims at applying data science methods to medical data, specifically high-dimensional and large-scale data. None of these projects has a strong overlap with DeCoRe.
5.2 Aligned actions outside LabEx Persyval
The main objective of DeCoRe is to strengthen competences and collaborations in the Grenoble research community in the area of deep learning. The collaboration involves cross-institute research, training of PhD students
and MSc interns, but also the organization of reading groups, workshops, and teaching of MSc-level courses.
While DeCoRe is an important vehicle towards this goal (by financing two full PhDs and a number of other
expenses, see Section 6), alignment with other actions helps to ensure a bigger impact by building a critical
mass of involved non-permanent research staff.
Several related actions undertaken by the principal investigators of DeCoRe are, or will be, running in parallel. These include a PhD thesis at THOTH (J. Verbeek) funded by a Cifre grant with Facebook AI Research, Paris (started in January 2016) on weakly supervised semantic video segmentation with deep hybrid CNN and RNN models, and the ANR project LOCUST at AMA (with LIP6-UPMC, started in January 2016), which studies deep learning representations for time series, with applications to prediction and classification. Furthermore, two ANR projects are in submission (selected for the final evaluation phase): one by THOTH, and another by both MRIM and AGPIG. These projects would each fund an additional PhD student: one on model selection and one on incremental learning with deep convolutional models.
6 Requested resources
Table 1 gives an overview of the requested financial resources. The large majority (> 80%) of the requested funds will be spent on human resources: two full PhD scholarships, 6 months of PostDoc salary, and six MSc internships. The topics of the PhD scholarships and PostDoc are detailed in Section 4.
Learning deep convolutional and recurrent networks poses a formidable computational challenge. For large-scale experimentation on hard real-world problems and benchmarks, the use of GPU hardware is mandatory to be able to run experiments in a tractable amount of time. An ambitious research program on this topic should therefore be aligned with a suitable hardware platform to have a chance to succeed.
INRIA-Grenoble has recently entered the NVIDIA GPU research center program (coordinator J. Verbeek), which enables DeCoRe to use the latest hardware and to benefit from NVIDIA technical support thanks to the hosting of an NVIDIA researcher. Currently, THOTH has a cluster of 30 GPU boards (mostly TitanX class) at its disposal. LIG has also recently acquired several machines with GPUs, shared between the GETALP and MRIM research groups. To ensure a sufficient hardware platform for the proposed research, we reserve a part of the budget (11%) to acquire four servers that can host two GPUs each. In parallel, we have submitted a request to join the Facebook AI Research hardware donation program. If accepted, this is a supplementary path to ensure sufficient computational resources. Our goal is to integrate the GPU compute resources into a mutually accessible cluster structure that is available at least to all partners in DeCoRe, e.g. Grenoble's CIMENT high-performance compute center (https://ciment.ujf-grenoble.fr).
The remaining budget will be spent on travel (8%): conference attendance, visiting researchers, and invited
speakers. We will acquire external funding for workshop organization and other dissemination activities.
Expense                      Cost     Quantity   Budget
Full PhD scholarships        100 kE   2          200 kE
MSc Internships              4 kE     6          24 kE
PostDoc months (*)           4 kE     6          24 kE
Travel (conferences, etc.)   1.5 kE   16         24 kE
GPUs (Nvidia TitanX)         1 kE     8          8 kE
Servers (Dell R730)          6 kE     4          24 kE
Total                                            304 kE
Table 1: Breakdown of overall requested budget. (*) The 24 kE for 6 months PostDoc are conditioned on the
availability of additional funding over the 280 kE specified in the call.
References
[1] R. Adams, H. Wallach, and Z. Ghahramani. Learning the structure of deep sparse graphical models. In
AISTATS, 2010.
[2] Hervé Bourlard and Nelson Morgan. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers, 1994.
[3] S. Davis and P. Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):357–366, 1980.
[4] J. Deng, A. Berg, K. Li, and L. Fei-Fei. What does classifying more than 10,000 image categories tell us?
In ECCV, 2010.
[5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image
Database. In CVPR09, 2009.
[6] H. Drucker and Y. LeCun. Improving generalization performance in character recognition. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing, pages 198–207. IEEE Press, 1991.
[7] J. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
[8] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. IJCV, 59(2):167–181,
2004.
[9] Sriram Ganapathy, Kyu Han, Samuel Thomas, Mohamed Omar, Maarten Van Segbroeck, and Shrikanth S
Narayanan. Robust language identification using convolutional neural network features. In Proc. INTERSPEECH, 2014.
[10] Christophe Garcia and Manolis Delakis. Convolutional face finder: A neural architecture for fast and
robust face detection. IEEE Trans. Pattern Anal. Mach. Intell., 26(11):1408–1423, November 2004.
[11] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images.
PAMI, 6(6):712–741, 1984.
[12] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection
and semantic segmentation. In CVPR, 2014.
[13] R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks.
In CVPR, 2015.
[14] Ross Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.
[15] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
[16] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks.
In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China,
21-26 June 2014, pages 1764–1772, 2014.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
CoRR, abs/1512.03385, 2015.
[18] Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew
Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic
modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine,
IEEE, 29(6):82–97, 2012.
[19] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[20] M. Jordan, Z. Ghahramani, T. Jaakola, and L. Saul. An introduction to variational methods for graphical
models. Machine Learning, 37(2):183–233, 1999.
[21] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR,
2015.
[22] S.-M. Khaligh-Razavi and N. Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11):1–29, 2014.
[23] R. Kiros, R. Salakhutdinov, and R. Zemel. Unifying visual-semantic embeddings with multimodal neural
language models. TACL, 2015. to appear.
[24] R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler. Skip-thought
vectors. In NIPS, 2015.
[25] P. Kohli, L. Ladický, and P. Torr. Robust higher order potentials for enforcing label consistency. IJCV,
82(3):302–324, 2009.
[26] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks.
In NIPS, 2012.
[27] P. Kulkarni, J. Zepeda, F. Jurie, P. Pérez, and L. Chevallier. Learning the structure of deep architectures
using l1 regularization. In BMVC, 2015.
[28] Quoc V. Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. arXiv:1405.4053
[cs], 2014.
[29] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.
[30] G. Lin, C. Shen, I. Reid, and A. van den Hengel. Efficient piecewise training of deep structured models for semantic segmentation. arXiv preprint, 2015.
[31] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár,
and C. Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.
[32] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR,
2015.
[33] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[34] M.-T. Luong, Q. Le, I. Sutskever, O. Vinyals, and L. Kaiser. Multi-task sequence to sequence learning. In
ICLR, 2016.
[35] Thang Luong, Hieu Pham, and Christopher D. Manning. Bilingual word representations with monolingual
quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language
Processing, pages 151–159, 2015.
[36] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations
in Vector Space. arXiv:1301.3781 [cs], 2013.
[37] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed Representations
of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems
26, pages 3111–3119. 2013.
[38] G. Montavon, G.B. Orr, and K.R. Müller. Neural Networks: Tricks of the Trade. Number LNCS 7700 in Lecture Notes in Computer Science Series. Springer Verlag, 2012.
[39] G. Montufar, R. Pascanu, K. Cho, and Y. Bengio. On the number of linear regions of deep neural networks.
In NIPS, 2014.
[40] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? – weakly-supervised learning
with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2015.
[41] Dimitri Palaz, Ronan Collobert, et al. Analysis of CNN-based speech recognition system using raw speech as input. In Proc. INTERSPEECH, 2015.
[42] M. Paulin, J. Revaud, Z. Harchaoui, F. Perronnin, and C. Schmid. Transformation pursuit for image
classification. In CVPR, 2014.
[43] F. Perronnin and C. Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
[44] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[45] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities:
Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015.
[46] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region
proposal networks. CoRR, abs/1506.01497, 2015.
[47] A. Schwing and R. Urtasun. Fully connected deep structured networks. CoRR, abs/1503.02351, 2015.
[48] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition,
localization and detection using convolutional networks. In ICLR, 2014.
[49] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[50] J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In ICCV,
2003.
[51] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to
prevent neural networks from overfitting. JMLR, 2014.
[52] I. Sutskever, O. Vinyals, and Q. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[53] M. Swain and D. Ballard. Color indexing. IJCV, 1991.
[54] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In
CVPR, 2015.
[55] A.R. Webb. An approach to non-linear principal components analysis using radially symmetric kernel
functions. Statistics and Computing, 6:159–168, 1996.
[56] Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, Yuxin Peng, and Zheng Zhang. Error-driven incremental
learning in deep convolutional neural network for large-scale image classification. In Proceedings of the
22Nd ACM International Conference on Multimedia, MM ’14, pages 177–186, New York, NY, USA, 2014.
ACM.
[57] M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
[58] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional
random fields as recurrent neural networks. In ICCV, 2015.
A CV of principal investigators
A.1 Laurent Besacier
Married, 3 children
Professor (1st class) at Univ. Grenoble Alpes (UGA), HDR
Laboratory of Informatics of Grenoble (LIG), leader of GETALP group
Director of MSTII (Math-Info) Doctoral School of Grenoble
Laurent.Besacier@imag.fr
1. Short bio
Prof. Laurent Besacier defended his PhD thesis (Univ. Avignon, France) in Computer Science in 1998 on “A parallel model for automatic speaker recognition”. He then spent one and a half years at the Institute of Microengineering (EPFL, Neuchatel site, Switzerland) as an associate researcher working on multimodal person authentication (M2VTS European project). Since 1999 he has been an associate professor (full professor since 2009) in Computer Science at Univ. Grenoble Alpes (formerly U. Joseph Fourier). From September 2005 to October 2006, he was an invited scientist at IBM Watson Research Center (NY, USA), working on speech-to-speech translation.
His research interests are mainly related to multilingual speech recognition and machine translation. Laurent Besacier has published 200 papers in conferences and journals related to speech and language processing. He has supervised or co-supervised 20 PhDs and 30 Masters. He has been involved in several national and international projects as well as several evaluation campaigns. Since October 2012, Laurent Besacier has been a junior member of the “Institut Universitaire de France”, with a project entitled “From under-resourced languages processing to machine translation: an ecological approach”.
2. Diploma
• HDR (Habilitation to supervise research), specializing in Computer Science, University Joseph Fourier (January 2007). Thesis title: Rich transcription in a multilingual and multimodal world.
• PhD in Computer Science (1998), Université d'Avignon. Thesis title: A parallel model for speaker recognition, under the direction of Jean-François Bonastre and Henri Meloni.
• Master Degree at INPG (1995), specialty Signal-Image-Speech.
• Engineer from the school of Chemistry, Physics, and Electronics of Lyon (CPE, 1995), option electronics and information processing.
3. Scientific Activity
3.1 Prizes / Honors / Highlights
• Winner (best system) of the NIST 2002 evaluation of speaker segmentation systems (meeting task).
• Winner (best system) of the DARPA/TRANSTAC 2006 evaluation of Arabic-English spoken translation (during my stay at the IBM Watson Research Center).
• Best Paper Award in 2007 for: D. Istrate, E. Castelli, M. Vacher, L. Besacier, J.-F. Serignat. Information extraction from sound for medical telemonitoring. IEEE Trans. Inf. Technol. Biomed., 10(2):264-274, January 2006; selected for the IMIA Yearbook 2007, 21:72.
• Star Challenge 2008 finalist (content-based search in video documents) – top 5 among 50 participants.
• Chair of the conference JEP-TALN-RECITAL 2012 (300-350 participants).
• Keynote speaker at the IALP 2012 conference (International Conference on Asian Language Processing).
• Junior member of the "Institut Universitaire de France" (awarded in 2012).
• My paper "Automatic speech recognition for under-resourced languages: A survey", published in the Speech Communication journal (Elsevier), was among the top 3 most downloaded papers of 2014 as assessed by http://top25.sciencedirect.com/subject/computer-science/7/journal/speech-communication/01676393/archive/59/
3.2 Scientific Committees and Article Reviewing
• Editorial committee of the TAL journal (Traitement Automatique des Langues) since 2011.
• Reviewing for international journals: IEEE Transactions on Acoustics, Speech and Language Processing (IEEE ASL); Computer Speech and Language Journal; Speech Communication Journal; IEEE Transactions on Speech and Audio Processing; IEEE Signal Processing Letters; IEEE Transactions on Signal Processing; IEEE Transactions on Multimedia; IEEE Transactions on Information Forensics and Security; Pattern Recognition Letters; Machine Translation Journal; Language Resources and Evaluation Journal (LRE).
• Reviewing for national journals: Traitement du Signal; Acta Acustica; Revue I3; Traitement Automatique des Langues (TAL).
• International conference committees (non-exhaustive list): Interspeech (every year since 2005); IEEE ICASSP (every year since 2007); IEEE ASRU (Technical Review Committee, since 2009); EUSIPCO (2006-2011); Speaker Odyssey, Workshop on Speaker Identification and Verification (since 2004); International Workshop on Spoken Language Translation (since 2008); EAMT; NAACL-HLT 2012; Workshop on South and Southeast Asian Natural Languages Processing (WSSANLP); COLING 2008 and 2012; ACL 2013; SpeD (since 2004).
3.3 Expert Assessment
• Expert for project proposals to ACI (2005), ANR (2006-2016), the Microsoft Research PhD Scholarship programme (2009), and ANR-JST (Japan-France) in 2010.
• Expert for OSEO-Anvar (2008) and for the European Community (ERC Starting Grant, 3rd call, 2010).
• Selection committee for research grants of the Rhône-Alpes region, 2011-2014.
• Participation in the working group defining the scope of the future research call in the Rhône-Alpes region and board member of the action, November 2011.
• Regular member of ANR (National Research Agency) committees.
3.4 Projects
Participation in or coordination of 3 European projects, 10 French ANR projects, DGA projects, and several bilateral projects
with foreign countries (Singapore, Colombia, Brazil, Germany).
Industrial collaborations via CIFRE PhDs or projects (STMicroelectronics, Lingua&Machina, Voxygen, Compilatio).
3.5 International Collaborations
• Institute for Infocomm Research (Singapore): Franco-Singaporean project (Merlion) on multilingual speech recognition with Prof. Haizhou Li; mutual visits and exchanges of students and/or postdocs, 2009-2011.
• IBM Watson Research Center (NY, United States): collaboration with the spoken language translation group of Y. Gao (visiting scholar for 13 months in 2005/06; co-authored papers at IEEE ICASSP 2007, Interspeech 2007, IEEE/ACL SLT 2006, HLT 2006).
• Interactive Systems Lab (ISL) at CMU (United States) and Karlsruhe Institute of Technology (KIT, Germany): with T. Schultz on multilingual speech recognition (including co-authorship of a paper at IEEE ICASSP 2006); with S. Stüker on the unsupervised discovery of words from phonetic streams (paper at Interspeech 2009).
• European Commission - Joint Research Centre (JRC): with B. Pouliquen on automatic transliteration of named entities in a highly multilingual context (2008).
• MICA laboratory, Hanoi (Vietnam): co-supervision of PhD students and joint work on Vietnamese language processing with the international laboratory MICA (INPG/CNRS/HPI).
• ITC laboratory (Cambodia): co-supervision and joint work on Khmer language processing.
• Polytechnic Institute of Bucharest (Human-Computer Dialogue Group): scientific exchanges with Prof. Corneliu Burileanu; co-supervision of master's and PhD students.
• Universiti Sains Malaysia (Malaysia): hosting and supervision of two doctoral students on speech recognition (since 2005).
• University of Addis Ababa (Ethiopia): supervision of a PhD on machine translation of Amharic; hosting post-doctoral researchers from Ethiopia (since 2010).
• University of Cauca (Colombia): co-supervision of a PhD student and a project on the revitalization of an endangered language of southwestern Colombia (since 2011).
• UFRGS and UFSCar (Brazil): CNRS-FAP (France-Brazil) project on the analysis and integration of MultiWord Expressions (MWEs) in speech and translation (2014-2016).
• ITU and Ozyegin University (Turkey): joint work and joint papers in the framework of the CAMOMILE project (ERA-NET) on collaborative annotation of multi-modal, multi-lingual and multimedia documents.
4. Organization of Scientific Events
• Chair of the conference JEP-TALN-RECITAL 2012 (300-350 people).
• Responsible for the monthly keynotes of my lab (LIG), 2010-2014 (guests included Moshe Vardi, Sacha Krakowiak, P. Flajolet, G. Dowek, A. Colmerauer, A. Pentland, S. Abiteboul, W. Zadrozny, J. Sifakis, H. Hermanns, J. Hellerstein, etc.; see http://www.liglab.fr/spip.php?article884).
• Member of the organizing committee of Interspeech 2013 in Lyon (1500 people; Satellite Workshops Coordinator).
• Co-organizer of special sessions at Interspeech 2011 (Speech technology for under-resourced languages) and Interspeech 2016 (Sub-Saharan African languages: from speech fundamentals to applications).
• Invited editor for a special issue of the Speech Communication journal on speech technology for under-resourced languages (2014).
• Chairman and organizer of the first two and the fifth editions of the International Workshop SLTU (Spoken Language Technologies for Under-resourced Languages): Hanoi, Vietnam, May 2008; Penang, Malaysia, May 2010; Yogyakarta, Indonesia, 2016.
• Organizer of the AFCP seminar on spoken language processing for under-resourced languages, June 2007.
• Organizer of a special session on biometrics at the ISPA 2005 conference.
5. Publications
A complete list of my most recent publications can be found at https://cv.archives-ouvertes.fr/laurent-besacier and at
https://www.researchgate.net/profile/Laurent_Besacier
5 most significant (and recent) publications:
• Laurent Besacier, Etienne Barnard, Alexey Karpov, Tanja Schultz. Automatic speech recognition for under-resourced languages: A survey. Speech Communication Journal, vol. 56 - Special Issue on Processing Under-Resourced Languages:85-100, January 2014. (Impact factor 1.28, estimated in 2012.)
• Martha Tachbelie, Solomon Teferra Abate, Laurent Besacier. Using different acoustic, lexical and language modeling units for ASR of an under-resourced language – Amharic. Speech Communication Journal, vol. 56 - Special Issue on Processing Under-Resourced Languages:181-194, January 2014. (Impact factor 1.28, estimated in 2012.)
• Horia Cucu, Andi Buzo, Laurent Besacier, Corneliu Burileanu. SMT-Based ASR Domain Adaptation Methods for Under-Resourced Languages: Application to Romanian. Speech Communication Journal, vol. 56 - Special Issue on Processing Under-Resourced Languages:195-212, January 2014. (Impact factor 1.28, estimated in 2012.)
• Johann Poignant, Laurent Besacier, Georges Quénot. Unsupervised Speaker Identification in TV Broadcast Based on Written Names. IEEE Transactions on Audio, Speech and Language Processing, 23(1):57-68, 2015.
• Ngoc-Quang Luong, Laurent Besacier, Benjamin Lecouteux. Towards Accurate Predictors of Word Quality for Machine Translation: Lessons Learned on French-English and English-Spanish Systems. Data and Knowledge Engineering, Elsevier, 2015, 11 pp.
CURRICULUM VITAE
Denis PELLERIN
Professor (1st class) at University Grenoble Alpes (UGA), HDR
Grenoble Images Speech Signal Automatic laboratory (GIPSA-lab), UMR 5216
Denis.Pellerin@gipsa-lab.grenoble-inp.fr
Tel. 04 76 57 43 69
1. Short biography
Denis Pellerin is a professor at the University Grenoble Alpes (UGA). He received an engineering degree
in electrical engineering in 1984 and a Ph.D. in 1988 from the Institut National des Sciences
Appliquées (INSA-Lyon), France. Since 1989 he has been on the faculty in signal and image processing
at Univ. Grenoble Alpes (formerly Univ. Joseph Fourier Grenoble), first as an assistant professor and,
since 2006, as a full professor. He is with the AGPIG team (Architecture, Geometry, Perception, Images,
Gestures) at GIPSA-lab (Grenoble Images Speech Signal Automatic laboratory). His research interests
include (i) video analysis and indexing: image and video classification, human action recognition, video
summarization, active vision for robots; and (ii) visual perception and modelling: visual saliency, audio
saliency, attention models, visual substitution.
2. Education
• HDR (accreditation to supervise research) in Signal and Image Processing, Univ. Joseph Fourier Grenoble, France, 2001.
• Ph.D. in Electronic Systems, Institut National des Sciences Appliquées, Lyon, France, 1988.
• Engineer in Electrical Engineering (Honours), Institut National des Sciences Appliquées, Lyon, France, 1984.
3. Scientific Activity
Reviewer for international journals and conferences
• IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology,
IEEE Transactions on Image Processing, Computer Vision and Image Understanding, IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing.
• ACM International Conference on Multimedia Retrieval (ICMR), workshop Content-Based Multimedia
Indexing (CBMI), European Signal Processing Conference (EUSIPCO).
Main responsibilities
• 2011-2015: Member of the doctoral school EEATS (Electronics, Electrotechnics, Automatic, Signal Processing) of Grenoble.
• 2011-2015: Member of the research committee for the UFR IM2AG of UJF.
• Since 2007: Responsible for the research group "Perception and Analysis of Videos and Images" (seven researchers) of the AGPIG team at GIPSA-lab.
• Since 2006: Responsible for the organization of the 5th school year (M.Sc.) in the Industrial Computing and Instrumentation Department (30 students and 2 options) of the engineering school Polytech'Grenoble.
• 2003-2008: Assistant director (team of four persons) of the Master degree in "Signal Image Speech Telecommunication" (40 students).
Main projects and collaborations
• 2013-2015: Co-responsible for the exploratory project "Attentive" supported by the LabEx Persyval-lab. Collaboration with O. Aycard and C. Garbay of the Laboratory of Informatics of Grenoble (LIG) and M. Rombaut (GIPSA-lab). Development of a mobile robotics platform intended to participate in the surveillance of a set of people in situations of fragility.
• 2010-2013: Regional project "Plateforme de calcul parallèle pour des modèles de vision artificielle bio-inspirée" (parallel computing platform for bio-inspired artificial vision models). Collaboration with D. Houzet (GIPSA-lab) and A. Trémeau of the Laboratory Hubert Curien (LaHC, Univ. Jean Monnet, Saint-Etienne).
• 2007-2012: National project IRIM (Content-based Multimedia Information Retrieval) with the GDR ISIS (research association in Information, Signal, Image, viSion); participation in the annual international TRECVID video retrieval evaluation campaigns.
• 2007-2010: Regional project "LIMA" (Leisure and IMAges), participation in the task "video analysis and indexing".
• 2006-2009: Responsible for the project with the INA (National Audiovisual Institute) on image classification (PhD of H. Goeau, co-supervised with O. Buisson).
• 2003-2007: European Network of Excellence SIMILAR on multimodal interfaces efficiently responding to vision, gesture and voice. Collaboration with the Computer Science Department, University of Crete, Greece (C. Panagiotakis and G. Tziritas) on the human action recognition task.
• 2001-2003: Regional project "ACTIV II" (Colour, Image processing and Vision), participation in the task "video indexing".
• 1998-2001: European project "Art-live" (ARchitecture and authoring Tool prototype for Living Images and new Video Experiments), participation in the task "moving people detection".
Recent supervision of PhD students
• Since 2013: Q. Labourey, Development of an attentive robot for the surveillance of people in situations of fragility (co-supervised with O. Aycard).
• Since 2013: S. Chan Wai Tim, Image and video classification by dictionary learning (co-supervised with M. Rombaut).
• 2013: G. Song, Effect of sound in videos on gaze: Contribution to audio-visual saliency modeling.
• 2013: A. Rahman, Face perception in videos: Contributions to a visual saliency model and its implementation on GPUs (co-supervised with D. Houzet).
• 2010: S. Marat, Visual saliency models based on the fusion of luminance, motion and face information for predicting eye movements during video exploration (co-supervised with N. Guyader).
4. Publications
A complete list of my publications can be found at:
http://www.gipsa-lab.fr/~denis.pellerin/publications_en.html
Six most significant (and recent) publications:
[1] Budnik M., Gutierrez-Gomez E.-L., Safadi B., Pellerin D., Quénot G., Learned features versus
engineered features for multimedia indexing, Multimedia Tools and Applications, Springer Verlag,
To appear.
[2] Labourey Q., Aycard O., Pellerin D., Rombaut M., Garbay C., An evidential filter for indoor
navigation of a mobile robot in dynamic environment, International Conference on Information
Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU'2016),
Eindhoven, The Netherlands, June 2016.
[3] Chan Wai Tim S., Rombaut M., Pellerin D., Rejection-based classification for action recognition
using a spatio-temporal dictionary, European Signal Processing Conference (EUSIPCO'2015), Nice,
France, August 2015.
[4] Stoll C., Palluel-Germain R., Fristot V., Pellerin D., Alleysson D., Graff C., Navigating from a depth
image converted into sound, Applied Bionics and Biomechanics, volume 2015, article ID 543492,
2015.
[5] Rahman A., Pellerin D., Houzet D., Influence of number, location and size of faces on gaze in video,
Journal of Eye Movement Research, 7(2):5, 1-11, 2014
[6] Marat S., Rahman A., Pellerin D., Guyader N., Houzet D., Improving visual saliency by adding “face
feature map” and “center bias”, Cognitive Computation, 5(1): 63-75, 2013.
Curriculum Vitae
Georges QUÉNOT
Born: May 14, 1960. Married, 2 children.
Employment: Senior researcher (CNRS) at Laboratoire d’Informatique de Grenoble.
Professional address:
Laboratoire d'Informatique de Grenoble – CNRS UMR 5217
Bâtiment B, 41, rue des mathématiques, B.P. 53, 38041 Grenoble Cedex 9
Direct tel.: +33 (0)4 76 63 58 55
Fax: +33 (0)4 76 63 56 86
Mail: Georges.Quenot@imag.fr
Webpage: http://lig-membres.imag.fr/quenot/
1) BIOGRAPHY
Education:
• 1983: Engineer from École Polytechnique, Palaiseau.
• 1988: Ph.D. in Computer Science, University of Orsay – Paris XI.
• 1998: HDR in Computer Science, University of Orsay – Paris XI.
Research interests:
• Multimedia information indexing and retrieval;
• Concept indexing in image and video documents;
• Machine learning.
Current functions:
• Leader of the Multimedia Information Indexing and Retrieval group (MRIM) of the Laboratoire d'Informatique de Grenoble (LIG);
• Responsible for its activities on video indexing and retrieval.
Student-researcher advising:
• 10 former Ph.D. students and currently 1 PhD student.
Teaching:
• About 60 hours per year at M1/M2 level (M2R MOSIG, RICM, M2PGI) on multimedia information indexing and retrieval.
Participation in research projects:
• International projects:
o ICT ASIA project MoSAIC (2006-2008): Mobile Search and Annotation using Images in Context.
o ICT ASIA project ShootMyMind (2015-2016): Automatic Generation of Videos from Scenarios.
o CHIST-ERA Camomile (2012-2016): Collaborative Annotation of multi-MOdal, multI-Lingual and multi-mEdia documents.
• European project:
o STREP PENG (2004-2006): PErsonalised News content programminG.
• National French projects:
o TechnoVision ARGOS (2004-2006): evaluation campaign for video content monitoring tools.
o ANR AVEIR (2006-2009): automatic annotation and extraction of visual concepts for image retrieval.
o OSEO-AII Quaero (2007-2013): search and recognition of digital content.
o ANR Contint VideoSense (2010-2013): automatic video tagging by high-level concepts.
o ANR Repere QCompere (2012-2014): Quaero Consortium for Multimodal Person Recognition.
o FUI Guimuteic (2015-2018): Guide Multimédia de Tête, Informatif et Connecté (head-worn, informative and connected multimedia guide).
• Local project:
o APIMS (2009-2010): Apprentissage Parallèle pour l'Indexation Multimédia Sémantique (parallel learning for semantic multimedia indexing).
Professional activities:
• PC member or reviewer for many international conferences and journals, including: Proceedings of the IEEE, ACM Transactions on Multimedia Computing Communications and Applications, IEEE Transactions on Multimedia, Information Processing and Management, IEEE Transactions on Pattern Analysis and Machine Intelligence, Multimedia Tools and Applications, and Signal Processing: Image Communication.
• Organization of the first École d'Automne en Recherche d'Information et Application (EARIA'06).
• Organization of Content-Based Multimedia Indexing (CBMI) 2014.
• Expert for project proposals and evaluation: Technovision / ANR / Digiteo.
• Organization of the TRECVid semantic indexing (SIN) benchmark since 2010.
• Responsible for the IRIM (Indexation et Recherche d'Information Multimédia) action of the GDR ISIS since 2008.
• Member of associate professor recruitment committees (Bordeaux, Cergy-Pontoise).
Highlights:
• Star Challenge 2008 finalist (content-based search in video documents) – top 5 among 50 participants.
• Currently first at VOC 2012 Object Classification (comp1, post-campaign).
2) MOST SIGNIFICANT PUBLICATIONS
• George Awad, Cees G. M. Snoek, Alan F. Smeaton, Georges Quénot. TRECVid Semantic Indexing of Video: A 6-Year Retrospective. ITE Transactions on Media Technology and Applications. To appear.
• Mateusz Budnik, Efrain-Leonardo Gutierrez-Gomez, Bahjat Safadi, Denis Pellerin, Georges Quénot. Learned features versus engineered features for multimedia indexing. Multimedia Tools and Applications, Springer Verlag. To appear.
• Johann Poignant, Guillaume Fortier, Laurent Besacier, Georges Quénot. Naming multimodal clusters to identify persons in TV broadcast. Multimedia Tools and Applications, Springer Verlag, pp. 1-25, 2015.
• Johann Poignant, Laurent Besacier, Georges Quénot. Unsupervised Speaker Identification in TV Broadcast Based on Written Names. IEEE Transactions on Audio, Speech and Language Processing, 23(1):57-68, 2015.
• Bahjat Safadi, Nadia Derbas, Georges Quénot. Descriptor Optimization for Multimedia Indexing and Retrieval. Multimedia Tools and Applications, 74(4):1267-1290, 2015.
• Abdelkader Hamadi, Philippe Mulhem, Georges Quénot. Extended conceptual feedback for semantic multimedia indexing. Multimedia Tools and Applications, 23(1):57-68, 2015.
• Bogdan Ionescu, Jenny Benois-Pineau, Tomas Piatrik, Georges Quénot. Fusion in Computer Vision: Understanding Complex Visual Content. Springer International Publishing, 272 p., 2014.
• S. Tiberius Strat, A. Benoit, Hervé Bredin, Georges Quénot, P. Lambert. Hierarchical Late Fusion for Concept Detection in Videos. In Fusion in Computer Vision: Understanding Complex Visual Content, Springer International Publishing, pp. 53-78, 2014.
• Bahjat Safadi, Georges Quénot. Active learning with multiple classifiers for multimedia indexing. Multimedia Tools and Applications, 66(2):403-417, 2012.
• Émilie Dumont, Georges Quénot. Automatic Story Segmentation for TV News Video using Multiple Modalities. International Journal of Digital Multimedia Broadcasting, 2012:1-11, Article ID 732514, 2012.
• Georges Quénot, Tien-Ping Tan, Viet-Bac Le, Stéphane Ayache, Laurent Besacier, Philippe Mulhem. Content-based search in multilingual audiovisual documents using the International Phonetic Alphabet. Multimedia Tools and Applications (impact factor 1.01), 48(1):123-140, 2010.
• Stéphane Ayache, Georges Quénot. Image and Video Indexing using Networks of Operators. EURASIP Journal on Image and Video Processing, vol. 2007, Article ID 56928, 13 pages, 2007.
• Stéphane Ayache, Georges Quénot. Evaluation of active learning strategies for video indexing. Signal Processing: Image Communication, 22(7-8):692-704, August-September 2007.
• Philippe Joly, Jenny Benois-Pineau, Ewa Kijak, Georges Quénot. The ARGOS campaign: Evaluation of Video Analysis Tools. Signal Processing: Image Communication, 22(7-8):705-717, August-September 2007.
• Stéphane Ayache, Georges Quénot. Video Corpus Annotation using Active Learning. In 30th European Conference on Information Retrieval (ECIR'08), Glasgow, Scotland, March 30 - April 3, 2008.
• Stéphane Ayache, Georges Quénot, Jérôme Gensel, Shin'ichi Satoh. Using Topic Concepts for Semantic Video Shots Classification. In International Conference on Image and Video Retrieval (CIVR'06), Tempe, AZ, USA, July 13-15, 2006.
Curriculum Vitae – Jakob Verbeek
INRIA Rhône-Alpes, LEAR team
655 Avenue de l'Europe, 38330 Montbonnot, France
Tel. +33 4 76 61 52 33, Fax +33 4 76 61 54 54
Email: Jakob.Verbeek@inria.fr
Webpage: http://lear.inrialpes.fr/~verbeek
Citizenship: Dutch. Date of birth: December 21, 1975.
Academic Background
2004 • Doctorate in Computer Science (best thesis award), Informatics Institute, University of Amsterdam. Advisors: Prof. Dr. Ir. F. Groen, Dr. Ir. B. Kröse, and Dr. N. Vlassis. Thesis: Mixture models for clustering and dimension reduction.
2000 • Master of Science in Logic (with honours), Institute for Logic, Language and Computation, University of Amsterdam. Advisor: Prof. Dr. M. van Lambalgen. Thesis: An information theoretic approach to finding word groups for text classification.
1998 • Master of Science in Artificial Intelligence (with honours), Dutch National Research Institute for Mathematics and Computer Science & University of Amsterdam. Advisors: Prof. Dr. P. Vitányi, Dr. P. Grünwald, and Dr. R. de Wolf. Thesis: Overfitting using the minimum description length principle.
Awards
2011 • Outstanding Reviewer Award, IEEE Conference on Computer Vision and Pattern Recognition.
2009 • Outstanding Reviewer Award, IEEE Conference on Computer Vision and Pattern Recognition.
2006 • Biannual E.S. Gelsema Award of the Dutch Society for Pattern Recognition and Image Processing for best PhD thesis and associated international journal publications.
2000 • Regional winner of the yearly best MSc thesis award of the Dutch Society for Computer Science.
Employment
since 2007 • Researcher (CR1), LEAR project, INRIA Rhône-Alpes, Grenoble.
2005-2007 • Postdoc, LEAR project, INRIA Rhône-Alpes, Grenoble.
2004-2005 • Postdoc, Intelligent Autonomous Systems group, Informatics Institute, University of Amsterdam.
Professional Activities
Participation in Research Projects
2013-2016 • Physionomie: Physiognomic Recognition for Forensic Investigation, funded by the French national research agency (ANR).
2011-2015 • AXES: Access to Audiovisual Archives, European integrated project, 7th Framework Programme.
2010-2013 • Quaero Consortium for Multimodal Person Recognition, funded by the French national research agency (ANR).
2009-2012 • Modeling multi-media documents for cross-media access, funded by Xerox Research Centre Europe (XRCE) and the French national research agency (ANR).
2008-2010 • Interactive Image Search, funded by the French national research agency (ANR).
2006-2009 • Cognitive-Level Annotation using Latent Statistical Structure (CLASS), funded by the European Union Sixth Framework Programme.
2000-2005 • Tools for Non-linear Data Analysis, funded by the Dutch Technology Foundation (STW).
Teaching
2015 • Lecturer in the MSc course Kernel Methods for Statistical Learning, École Nationale Supérieure d'Informatique et de Mathématiques Appliquées (ENSIMAG), Grenoble, France.
2008-2015 • Lecturer in the MSc course Machine Learning and Category Representation, École Nationale Supérieure d'Informatique et de Mathématiques Appliquées (ENSIMAG), Grenoble, France.
2003-2005 • Lecturer in the MSc course Machine learning: pattern recognition, University of Amsterdam, The Netherlands.
2003-2005 • Lecturer in the graduate course Advanced issues in neurocomputing, Advanced School for Imaging and Computing, The Netherlands.
1997-2000 • Teaching assistant for MSc courses in Artificial Intelligence, University of Amsterdam, The Netherlands.
Supervision of MSc and PhD Students
2015 • Jerome Lesaint, MSc, Image and video captioning.
since 2013 • Shreyas Saxena, PhD, Recognizing people in the wild.
2013 • Shreyas Saxena, MSc, Metric learning for face verification.
2011-2015 • Dan Oneaţă, PhD, Large-scale machine learning for video analysis.
2010-2014 • Gokberk Cinbis, PhD, Fisher kernel based models for image classification and object localization; awarded the AFRIF best thesis award 2014.
2009-2012 • Thomas Mensink, PhD, Modeling multi-media documents for cross-media access; awarded the AFRIF best thesis award 2012.
2008-2011 • Josip Krapac, PhD, Image search using combined text and image content.
2006-2010 • Matthieu Guillaumin, PhD, Learning models for visual recognition from weak supervision.
2009 • Gaspard Jankowiak, intern, Decision tree quantization of image patches for image categorization.
2007-2008 • Thomas Mensink, intern, Finding people in captioned news images.
2005 • Markus Heukelom, MSc, Face detection and pose estimation using part-based models.
2003 • Jan Nunnink, MSc, Large scale mixture modelling using a greedy expectation-maximisation algorithm.
2003 • Noah Laith, MSc, A fast greedy k-means algorithm.
Associate Editor
since 2014 • International Journal of Computer Vision.
since 2011 • Image and Vision Computing Journal.
Area Chair for International Conferences
• IEEE Conference on Computer Vision and Pattern Recognition: 2015.
• European Conference on Computer Vision: 2012, 2014.
• British Machine Vision Conference: 2012, 2013, 2014.
Programme Committee Member for Conferences, including
• IEEE International Conference on Computer Vision: 2009, 2011, 2013, 2015.
• European Conference on Computer Vision: 2008, 2010.
• IEEE Conference on Computer Vision and Pattern Recognition: 2006–2014, 2016.
• Neural Information Processing Systems: 2006–2010, 2012–2013.
• Reconnaissance des Formes et l’Intelligence Artificielle: 2016.
Reviewer for International Journals, including
since 2008 • International Journal of Computer Vision.
since 2005 • IEEE Transactions on Neural Networks.
since 2004 • IEEE Transactions on Pattern Analysis and Machine Intelligence.
Reviewer of research grant proposals, including
2015 • Postdoctoral fellowship grant, Research Foundation Flanders (FWO).
2014 • Collaborative Research grant, Indo-French Centre for the Promotion of Advanced Research (IFCPAR).
2010 • VENI grant, Netherlands Organisation for Scientific Research (NWO).
Miscellaneous
Research Visits
2011 • Visiting researcher, Statistical Machine Learning group, NICTA, Canberra, Australia, May 2011.
2003 • Machine Learning group of Prof. Sam Roweis, University of Toronto, Canada, May-September 2003.
Summer Schools & Workshops
2015 • DGA workshop on Big Data in Multimedia Information Processing, invited speaker, Paris, France, October 22.
2015 • Physionomie workshop at the European Academy of Forensic Science conference, co-organizer and speaker, Prague, Czech Republic, September 9.
Miscellaneous (continued)
2015 • StatLearn workshop, invited speaker, April 13, 2015, Grenoble, France.
2014 • 3rd Croatian Computer Vision Workshop, Center of Excellence for Computer Vision, invited speaker, September 16, 2014, Zagreb, Croatia.
2014 • 2nd IST Workshop on Computer Vision and Machine Learning, Institute of Science and Technology, invited presentation, October 7, Vienna, Austria.
2011 • Workshop on 3D and 2D Face Analysis and Recognition, École Centrale de Lyon / Lyon University, invited presentation, January 28.
2010 • NIPS Workshop on Machine Learning for Next Generation Computer Vision Challenges, co-organizer, December 10, Whistler BC, Canada.
2010 • ECCV Workshop on Face Detection: Where are we, and what next?, invited presentation, September 10, Hersonissos, Greece.
2010 • INRIA Visual Recognition and Machine Learning Summer School, 1h lecture, July 26-30, Grenoble, France.
2009 • Workshop "Statistiques pour le traitement de l'image", Université Paris 1 Panthéon-Sorbonne, invited speaker, January 23.
2008 • International Workshop on Object Recognition, poster presentation, May 16-18, 2008, Moltrasio, Italy.
Seminars
2015 • Société Française de Statistique, Institut Henri Poincaré, Paris, France, Object detection with incomplete supervision, October 23.
2015 • Center for Machine Perception, Czech Technical University, Prague, Czech Republic, Object detection with incomplete supervision, September 8.
2015 • Dept. of Information Engineering and Computer Science, University of Trento, Italy, Object detection with incomplete supervision, March 16.
2015 • Computer Vision Center, Barcelona, Spain, Object detection with incomplete supervision, February 13.
2013 • Intelligent Systems Laboratory Amsterdam, University of Amsterdam, The Netherlands, Segmentation Driven Object Detection with Fisher Vectors, October 15.
2013 • Media Integration and Communication Center, University of Florence, Italy, Segmentation Driven Object Detection with Fisher Vectors, September 24.
2013 • DGA workshop on Multimedia Information Processing (TIM 2013), Paris, France, Face verification "in the wild", July 2.
2012 • Computer Vision and Machine Learning group, Institute of Science and Technology, Vienna, Austria, Image categorization using Fisher kernels of non-iid image models, June 11.
2012 • Computer Vision Center, Barcelona, Spain, Image categorization using Fisher kernels of non-iid image models, June 4.
2012 • TEXMEX Team, INRIA, Rennes, France, Image categorization using Fisher kernels of non-iid image models, April 20.
2011 • Statistical Machine Learning group, NICTA, Canberra, Australia, Modelling spatial layout for image classification, May 26.
2011 • Canon Information Systems Research Australia, Sydney, Australia, Learning structured prediction models for interactive image labeling, May 20.
2010 • Laboratoire TIMC-IMAG, Learning: Models and Algorithms team, Grenoble, Metric learning approaches for image annotation and face verification, October 7.
2010 • University of Oxford, Visual Geometry Group, Oxford, TagProp: a discriminatively trained nearest neighbor model for image auto-annotation, February 1.
2009 • Laboratoire Jean Kuntzmann, Grenoble, Machine learning for semantic image interpretation, June 11.
2009 • University of Amsterdam, Intelligent Systems Laboratory, Discriminative learning of nearest-neighbor models for image auto-annotation, April 28.
2009 • Université de Caen, Laboratoire GREYC, Improving People Search Using Query Expansions, February 5.
2008 • Computer Vision Center, Autonomous University of Barcelona, Improving People Search Using Query Expansions, September 26.
2008 • Computer Vision Lab, Max Planck Institute for Biological Cybernetics, Scene Segmentation with CRFs Learned from Partially Labeled Images, July 31.
2008 • Textual and Visual Pattern Analysis team, Xerox Research Centre Europe, Scene Segmentation with CRFs Learned from Partially Labeled Images, April 24.
2006 • Parole group, LORIA, Nancy, Unsupervised learning of low-dimensional structure in high-dimensional data.
2005 • Content Analysis group, Xerox Research Centre Europe, Manifold learning: unsupervised, correspondences, and semi-supervised.
2005 • Learning and Recognition in Vision group, INRIA Rhône-Alpes, Manifold learning & image segmentation.
2005 • Computer Engineering Group, Bielefeld University, Manifold learning with local linear models and Gaussian fields.
Miscellaneous (continued)
2004 • Algorithms and Complexity group, Dutch Center for Mathematics and Computer Science, Semi-supervised dimension reduction through smoothing on graphs.
2003 • Machine Learning team, Radboud University Nijmegen, Spectral methods for dimension reduction and non-linear CCA.
2002 • Information and Language Processing Systems group, University of Amsterdam, A generative model for the Self-Organizing Map.
Selected Publications
In peer reviewed international journals
• G. Cinbis, J. Verbeek, C. Schmid. Approximate Fisher kernels of non-iid image models for image categorization.
IEEE Transactions on Pattern Analysis and Machine Intelligence, to appear, 2015.
• H. Wang, D. Oneaţă, J. Verbeek, C. Schmid. A robust and efficient video representation for action recognition.
International Journal of Computer Vision, to appear, 2015.
• M. Douze, J. Revaud, J. Verbeek, H. Jégou, C. Schmid. Circulant temporal encoding for video retrieval and
temporal alignment. International Journal of Computer Vision, to appear, 2015.
• J. Sánchez, F. Perronnin, T. Mensink, J. Verbeek. Image classification with the Fisher vector: theory and practice.
International Journal of Computer Vision 105 (3), pp. 222–245, 2013.
• T. Mensink, J. Verbeek, F. Perronnin, G. Csurka. Distance-based image classification: generalizing to new classes
at near-zero cost. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11), pp. 2624–2637,
2013.
• T. Mensink, J. Verbeek, G. Csurka. Tree-structured CRF models for interactive image labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (2), pp. 476–489, 2013.
• M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid. Face recognition from caption-based supervision. International Journal of Computer Vision, 96(1), pp. 64–82, January 2012.
• H. Jégou, C. Schmid, H. Harzallah, and J. Verbeek. Accurate image search using the contextual dissimilarity
measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(1), pp. 2–11, January 2010.
• D. Larlus, J. Verbeek, F. Jurie. Category level object segmentation by combining bag-of-words models with Dirichlet processes and random fields. International Journal of Computer Vision 88(2), pp. 238–253, June 2010.
• J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus. Learning color names for real-world applications. IEEE
Transactions on Image Processing 18(7), pp. 1512–1523, July 2009.
• J. Verbeek, J. Nunnink, and N. Vlassis. Accelerated EM-based clustering of large data sets. Data Mining and
Knowledge Discovery 13(3), pp. 291–307, November 2006.
• J. Verbeek and N. Vlassis. Gaussian fields for semi-supervised regression and correspondence learning. Pattern
Recognition 39(10), pp. 1864–1875, October 2006.
• J. Verbeek. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Transactions on
Pattern Analysis and Machine Intelligence 28(8), pp. 1236–1250, August 2006.
• J. Porta, J. Verbeek, B. Kröse. Active appearance-based robot localization using stereo vision. Autonomous
Robots 18(1), pp. 59–80, January 2005.
• J. Verbeek, N. Vlassis, and B. Kröse. Self-organizing mixture models. Neurocomputing 63, pp. 99–123,
January, 2005.
• J. Verbeek, N. Vlassis, and B. Kröse. Efficient greedy learning of Gaussian mixture models. Neural Computation 15(2), pp. 469–485, February 2003.
• A. Likas, N. Vlassis, and J. Verbeek. The global k-means clustering algorithm. Pattern Recognition 36(2), pp.
451–461, February 2003.
• J. Verbeek, N. Vlassis, and B. Kröse. A k-segments algorithm for finding principal curves. Pattern Recognition
Letters 23(8), pp. 1009–1017, June 2002.
In peer reviewed international conferences
• D. Oneaţă, J. Revaud, J. Verbeek, C. Schmid. Spatio-Temporal Object Detection Proposals. Proceedings European Conference on Computer Vision, September 2014.
• G. Cinbis, J. Verbeek, C. Schmid. Multi-fold MIL Training for Weakly Supervised Object Localization. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2014.
• D. Oneaţă, J. Verbeek, C. Schmid. Efficient Action Localization with Approximately Normalized Fisher Vectors.
Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2014.
• G. Cinbis, J. Verbeek, C. Schmid. Segmentation Driven Object Detection with Fisher Vectors. Proceedings
IEEE International Conference on Computer Vision, December 2013.
• D. Oneaţă, J. Verbeek, C. Schmid. Action and Event Recognition with Fisher Vectors on a Compact Feature Set.
Proceedings IEEE International Conference on Computer Vision, December 2013.
Selected Publications (continued)
• T. Mensink, J. Verbeek, F. Perronnin, G. Csurka. Metric learning for large scale image classification: generalizing
to new classes at near-zero cost. Proceedings European Conference on Computer Vision, October 2012. (oral)
• G. Cinbis, J. Verbeek, C. Schmid. Image categorization using Fisher kernels of non-iid image models. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2012.
• J. Krapac, J. Verbeek, F. Jurie. Modeling spatial layout with Fisher vectors for image categorization. Proceedings
IEEE International Conference on Computer Vision, November 2011.
• G. Cinbis, J. Verbeek, C. Schmid. Unsupervised metric learning for face identification in TV video. Proceedings
IEEE International Conference on Computer Vision, November 2011.
• J. Krapac, J. Verbeek, F. Jurie. Learning tree-structured descriptor quantizers for image categorization. Proceedings British Machine Vision Conference, September 2011.
• T. Mensink, J. Verbeek, G. Csurka. Learning structured prediction models for interactive image labeling. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2011.
• M. Guillaumin, J. Verbeek, C. Schmid. Multiple instance metric learning from automatically labeled bags of
faces. Proceedings European Conference on Computer Vision, September 2010.
• M. Guillaumin, J. Verbeek, C. Schmid. Multimodal semi-supervised learning for image classication. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2010. (oral)
• J. Krapac, M. Allan, J. Verbeek, F. Jurie. Improving web image search results using query-relative classifiers.
Proceedings IEEE Conference on Computer Vision and Pattern Recognition, June 2010.
• T. Mensink, J. Verbeek, G. Csurka. Trans Media Relevance Feedback for Image Autoannotation.Proceedings
British Machine Vision Conference, September 2010.
• T. Mensink, J. Verbeek, H. Kappen. EP for efficient stochastic control with obstacles. Proceedings European
Conference on Artificial Intelligence, August 2010. (oral)
• J. Verbeek, M. Guillaumin, T. Mensink, C. Schmid. Image Annotation with TagProp on the MIRFLICKR set.
Proceedings ACM International Conference on Multimedia Information Retrieval, March 2010. (invited
paper)
• M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid. TagProp: Discriminative metric learning in nearest neighbor
models for image auto-annotation. Proceedings IEEE International Conference on Computer Vision, September 2009. (oral)
• M. Guillaumin, J. Verbeek, C. Schmid. Is that you? Metric learning approaches for face identification. Proceedings IEEE International Conference on Computer Vision, September 2009.
• M. Allan, J. Verbeek Ranking user-annotated images for multiple query terms. Proceedings British Machine
Vision Conference, September 2009.
• M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid. Automatic face naming with caption-based supervision.
Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2008.
• T. Mensink, and J. Verbeek. Improving people search using query expansions: How friends help to find people.
Proceedings European Conference on Computer Vision, pp. 86–99, October 2008. (oral)
• J. Verbeek and B. Triggs. Scene segmentation with CRFs learned from partially labeled images. Advances in
Neural Information Processing Systems 20, pp. 1553–1560, January 2008. (oral)
• H. Cevikalp, J. Verbeek, F. Jurie, and A. Kläser. Semi-supervised dimensionality reduction using pairwise equivalence constraints. Proceedings International Conference on Computer Vision Theory and Applications,
pp. 489–496, January 2008.
• J. van de Weijer, C. Schmid, and J. Verbeek. Learning color names from real-world images. Proceedings IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2007.
• J. Verbeek and B. Triggs. Region classification with Markov field aspect models. Proceedings IEEE Conference
on Computer Vision and Pattern Recognition, pp. 1–8, June 2007.
• J. van de Weijer, C. Schmid, and J. Verbeek. Using high-level visual information for color constancy. Proceedings IEEE International Conference on Computer Vision, pp. 1–8, October 2007.
• Z. Zivkovic and J. Verbeek. Transformation invariant component analysis for binary images. Proceedings IEEE
Conference on Computer Vision and Pattern Recognition, pp. 254–259, June 2006.
• J. Verbeek, S. Roweis, and N. Vlassis. Non-linear CCA and PCA by alignment of local models. Advances in
Neural Information Processing Systems 16, pp. 297–304, January 2004. (oral)
• J. Porta, J. Verbeek, and B. Kröse. Enhancing appearance-based robot localization using non-dense disparity maps.
Proceedings International Conference on Intelligent Robots and Systems, pp. 980–985, October 2003.
• J. Verbeek, N. Vlassis, and B. Kröse. Self-organization by optimizing free-energy. Proceedings 11th European
Symposium on Artificial Neural Networks, pp. 125–130, April 2003.
• J. Verbeek, N. Vlassis, and B. Kröse. Coordinating principal component analyzers. Proceedings International
Conference on Artificial Neural Networks, pp. 914–919, August 2002. (oral)
• J. Verbeek, N. Vlassis, and B. Kröse. Fast nonlinear dimensionality reduction with topology preserving networks.
Proceedings 10th European Symposium on Artificial Neural Networks, pp. 193–198, April 2002. (oral)
Selected Publications (continued)
• J. Verbeek, N. Vlassis, and B. Kröse. A soft k-segments algorithm for principal curves. Proceedings International Conference on Artificial Neural Networks, pp. 450–456, August 2001.
Book chapters
• T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Large scale metric learning for distance-based image
classification on open ended data sets. In: G. Farinella, S. Battiato, and R. Cipolla. Advances in Computer Vision
and Pattern Recognition, Springer, 2013.
• R. Benavente, J. van de Weijer, M. Vanrell, C. Schmid, R. Baldrich, J. Verbeek, and D. Larlus. Color Names.
In: T. Gevers, A. Gijsenij, J. van de Weijer, and J. Geusebroek. Color in Computer Vision, Wiley, 2012.
Workshops and regional conferences
• S. Saxena, and J. Verbeek. Coordinated Local Metric Learning. ICCV ChaLearn Looking at People workshop,
December 2015.
• V. Zadrija, J. Krapac, J. Verbeek, and S. Šegvić. Patch-level Spatial Layout for Classification and Weakly Supervised Localization. German Conference on Pattern Recognition, October 2015.
• M. Douze, D. Oneata, M. Paulin, C. Leray, N. Chesneau, D. Potapov, J. Verbeek, K. Alahari, Z. Harchaoui,
L. Lamel, J.-L. Gauvain, C. Schmidt, and C. Schmid. The INRIA-LIM-VocR and AXES submissions to Trecvid
2014 Multimedia Event Detection. TRECVID Workshop, November, 2014.
• R. Aly, R. Arandjelovic, K. Chatfield, M. Douze, B. Fernando, Z. Harchaoui, K. Mcguiness, N. O’Connor,
D. Oneaţă, O. Parkhi, D. Potapov, J. Revaud, C. Schmid, J.-L. Schwenninger, D. Scott, T. Tuytelaars, J. Verbeek, H. Wang, and A. Zisserman. The AXES submissions at TrecVid 2013. TRECVID Workshop, November,
2013.
• H. Bredin, J. Poignant, G. Fortier, M. Tapaswi, V.-B. Le, A. Roy, C. Barras, S. Rosset, A. Sarkar, Q. Yang, H.
Gao, A. Mignon, J. Verbeek, L. Besacier, G. Quénot, H. Ekenel, and R. Stiefelhagen. QCompere @ REPERE
2013. Workshop on Speech, Language and Audio for Multimedia, August 2013.
• D. Oneaţă, M. Douze, J. Revaud, J. Schwenninger, D. Potapov, H. Wang, Z. Harchaoui, J. Verbeek, C.
Schmid, R. Aly, K. Mcguiness S. Chen, N. O’Connor, K. Chatfield, O. Parkhi, and R. Arandjelovic, A.
Zisserman, F. Basura, and T. Tuytelaars. AXES at TRECVid 2012: KIS, INS, and MED. TRECVID Workshop,
November, 2012.
• H. Bredin, J. Poignant, M. Tapaswi, G. Fortier, V. Bac Le, T. Napoleon, H. Gao, C. Barras, S. Rosset, L. Besacier, J. Verbeek, G. Quénot, F. Jurie, H. Kemal Ekenel. Fusion of speech, faces and text for person identification
in TV broadcast. ECCV Workshop on Information fusion in Computer Vision for Concept Recognition, October, 2012.
• T. Mensink, J. Verbeek, and T. Caetano. Learning to Rank and Quadratic Assignment. NIPS Workshop on
Discrete Optimization in Machine Learning, December 2011.
• T. Mensink, G. Csurka, F. Perronnin, J. Sánchez, and J. Verbeek. LEAR and XRCEs participation to Visual
Concept Detection Task - ImageCLEF 2010. Working Notes for the CLEF 2010 Workshop, September 2010.
• M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Apprentissage de distance pour l’annotation d’images
par plus proches voisins. Reconnaissance des Formes et Intelligence Artificielle, January 2010.
• M. Douze, M. Guillaumin, T. Mensink, C. Schmid, and J. Verbeek. INRIA-LEARs participation to ImageCLEF
2009. Working Notes for the CLEF 2009 Workshop, September 2009.
• J. Nunnink, J. Verbeek, and N. Vlassis. Accelerated greedy mixture learning. Proceedings Annual Machine
Learning Conference of Belgium and the Netherlands, pp. 80–86, January 2004.
• J. Verbeek, N. Vlassis, and J. Nunnink. A variational EM algorithm for large-scale mixture modeling. Proceedings Conference of the Advanced School for Computing and Imaging, pp. 136–143, June 2003.
• J. Verbeek, N. Vlassis, and B. Kröse. Non-linear feature extraction by the coordination of mixture models. Proceedings Conference of the Advanced School for Computing and Imaging, pp. 287–293, June 2003.
• J. Verbeek, N. Vlassis, and B. Kröse. Locally linear generative topographic mapping. Proceedings Annual
Machine Learning Conference of Belgium and the Netherlands, pp. 79–86, December 2002.
• J. Verbeek, N. Vlassis, and B. Kröse. Efficient greedy learning of Gaussian mixtures. Proceedings 13th BelgianDutch Conference on Artificial Intelligence, pp. 251–258, October 2001.
• J. Verbeek, N. Vlassis, and B. Kröse. Greedy Gaussian mixture learning for texture segmentation. (oral) ICANN
Workshop on Kernel and Subspace Methods for Computer Vision, pp. 37–46, August 2001.
• J. Verbeek. Supervised feature extraction for text categorization. Proceedings Annual Machine Learning Conference of Belgium and the Netherlands, December 2000.
• J. Verbeek. Using a sample-dependent coding scheme for two-part MDL. Proceedings Machine Learning &
Applications (ACAI ’99), July 1999.
Selected Publications (continued)
Patents
• T. Mensink, J. Verbeek, G. Csurka, and F. Perronnin. Metric Learning for Nearest Class Mean Classifiers.
United States Patent Application 20140029839, Publication date: 01/30/2014, filing date: 07/30/2012,
XEROX Corporation.
• T. Mensink, J. Verbeek, and G. Csurka. Learning Structured prediction models for interactive image labeling.
United States Patent Application 20120269436, Publication date: 25/10/2012, filing date: 20/04/2011,
XEROX Corporation.
• T. Mensink, J. Verbeek, and G. Csurka. Retrieval systems and methods employing probabilistic cross-media
relevance feedback. United States Patent Application 20120054130, Publication date: 01/03/2012, filing
date: 31/08/2010, XEROX Corporation.
Technical Reports
• J. Sanchez, F. Perronnin, T. Mensink, J. Verbeek. Image classification with the Fisher vector: theory and practice. Technical Report RR-8209, INRIA, 2013.
• T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Large scale metric learning for distance-based image classification. Technical Report RR-8077, INRIA, 2012.
• O. Yakhnenko, J. Verbeek, and C. Schmid. Region-based image classification with a latent SVM model. Technical Report RR-7665, INRIA, 2011.
• J. Krapac, J. Verbeek, F. Jurie. Spatial Fisher vectors for image categorization. Technical Report RR-7680,
INRIA, 2011.
• T. Mensink, J. Verbeek, and G. Csurka. Weighted transmedia relevance feedback for image retrieval and autoannotation. Technical Report RT-0415, INRIA, 2011.
• M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. Face recognition from caption-based supervision.
Technical Report RT-392, INRIA, 2010.
• D. Larlus, J. Verbeek, and F. Jurie. Category level object segmentation by combining bag-of-words models and
Markov random fields. Technical Report RR-6668, INRIA, 2008.
• J. Verbeek, and N. Vlassis. Semi-supervised learning with Gaussian fields. Technical Report IAS-UVA-05-01,
University of Amsterdam, 2005.
• J. Verbeek. Rodent behavior annotation from video. Technical Report IAS-UVA-05-02, University of Amsterdam, 2005.
• J. Verbeek, and N. Vlassis. Gaussian mixture learning from noisy data. Technical Report IAS-UVA-04-01,
University of Amsterdam, 2004.
• J. Verbeek, N. Vlassis, and B. Kröse. The generative self-organizing map: a probabilistic generalization of Kohonen’s SOM. Technical Report IAS-UVA-02-03, University of Amsterdam, 2002.
• J. Verbeek, N. Vlassis, and B. Kröse. Procrustes analysis to coordinate mixtures of probabilistic principal component analyzers. Technical Report IAS-UVA-02-01, University of Amsterdam, 2002.
• A. Likas, N. Vlassis, and J. Verbeek. The global k-means clustering algorithm. Technical Report IAS-UVA-0102, University of Amsterdam, 2001.
• J. Verbeek, N. Vlassis, and B. Kröse. Efficient greedy learning of Gaussian mixtures. Technical Report IASUVA-01-10, University of Amsterdam, 2001.
• J. Verbeek, N. Vlassis, and B. Kröse. A k-segments algorithm for finding principal curves. Technical Report
IAS-UVA-00-11, University of Amsterdam, 2000.