Accepted Manuscript

Image Annotation: Then and Now

P K Bhagat, Prakash Choudhary

PII: S0262-8856(18)30162-8
DOI: doi:10.1016/j.imavis.2018.09.017
Reference: IMAVIS 3726
To appear in: Image and Vision Computing
Received date: 6 August 2018
Accepted date: 24 September 2018

Please cite this article as: P K Bhagat, Prakash Choudhary, Image Annotation: Then and Now. Imavis (2018), doi:10.1016/j.imavis.2018.09.017

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Image Annotation: Then and Now

P K Bhagat∗, Prakash Choudhary
National Institute of Technology Manipur, Imphal, India-795001
Abstract
Automatic image annotation (AIA) plays a vital role in dealing with the exponentially growing volume of digital images. Image annotation helps in the effective retrieval, organization, classification, and auto-illustration of images. It started in the early 1990s, and over the last three decades there has been extensive research in AIA, with various new approaches advanced. In this article, we review more than 200 references related to image annotation proposed in the last three decades. This paper discusses the predominant approaches, their constraints, and ways to deal with them. Each section of the article ends with a discussion that expounds the findings, future research directions, and their hurdles. This paper also presents performance evaluation measures along with relevant and influential image annotation databases.
Keywords: Image annotation, automatic image annotation, multi-label classification, image labeling, image tagging, annotation dataset, annotation performance evaluation, image features, image retrieval.
1. Introduction
Humans have a better capability to organize pictures, as we store and recall images by the objects present in them. Nowadays the number of digital images is proliferating, in fact faster than expected. Therefore, a coherent organization of digital images benefits dynamic retrieval.
∗Corresponding author.
Email addresses: pkbhagat@nitmanipur.ac.in (P K Bhagat), choudharyprakash@nitmanipur.ac.in (Prakash Choudhary)
Figure 1: Image and corresponding labels from the ESP Game database [1].
An unorganized collection of an extensive image database leads to very inefficient search and may also return irrelevant images. For example, searching for images containing a dog in such an unorganized collection becomes inefficient: the response time is long, and due to imprecise and superficial annotation the results may include images of other animals such as cats, foxes, and wolves. Image annotation is the process of labeling an image with keywords that represent its contents, which helps in the intelligent retrieval of relevant images through a simple query representation. An example of an image with its labels is shown in figure 1. The assignment of keywords can be performed manually or automatically; the latter is called automatic image annotation (AIA).
The retrieval of images is performed in two ways: content-based image retrieval and text-based image retrieval. In content-based image retrieval (CBIR), images are organized using visual features, and during the retrieval process the visual similarity between the query image and the database images is the key. Visual similarity refers to similarity based on visual features such as color, texture, and shape. The semantic gap is one of the most critical concerns associated with CBIR. The gap between low-level contents and high-level semantic concepts is known as the semantic gap [2]; it arises because CBIR relies on visual similarity for the retrieval of similar images. On the other hand, the text-based image retrieval (TBIR) technique uses metadata associated with the image for its retrieval. In TBIR, images are organized according to their contents, and these contents are associated with the image in the form of metadata. At the time of retrieval, this metadata is used to fetch the related images. In TBIR, images are stored using their semantic concepts, and text is used as a query to retrieve the images. The manual assignment of keywords to images is a time-consuming and very expensive process. Thus, researchers have tried to assign keywords to images automatically.
AIA techniques attempt to learn a model from the training data and use the trained model to assign semantic labels to new images automatically [3]. AIA assigns one or multiple labels to an image either based on visual features or by exploiting image metadata. Image retrieval can be performed based on many techniques such as classification and clustering, similarity measures, search paradigms, and visual similarity [4]. Image annotation can be considered a submodule of an image retrieval system. Image retrieval, which can be implemented using either CBIR or TBIR, has different working principles. CBIR analyzes the visual contents (color, texture, shape, etc.) of images without considering the image metadata (labels, surrounding texts, etc.) for retrieval; the query in CBIR can only be the image itself or its visual features. In contrast, TBIR exploits the image metadata for retrieval. Semantic concepts are used to represent the image in the database, and the query can be represented in the form of text or the image itself. When an image is presented as a query, it is interpreted in semantic (textual) form.
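To make the CBIR working principle concrete, the following sketch ranks database images by color-histogram similarity to a query image using OpenCV. It is only an illustration of visual-similarity retrieval, not a system from the cited literature; the file paths, bin counts, and similarity measure are illustrative choices.

```python
# A minimal illustration of CBIR by visual similarity: images are represented by
# normalized color histograms and ranked by histogram intersection with the
# query. Paths are placeholders; real systems use richer features and indexing.
import cv2


def color_histogram(path, bins=(8, 8, 8)):
    image = cv2.imread(path)                                  # BGR image
    hist = cv2.calcHist([image], [0, 1, 2], None, list(bins),
                        [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()


def rank_by_similarity(query_path, database_paths):
    q = color_histogram(query_path)
    scores = [(cv2.compareHist(q, color_histogram(p), cv2.HISTCMP_INTERSECT), p)
              for p in database_paths]
    # Higher intersection score means visually more similar to the query.
    return [p for _, p in sorted(scores, reverse=True)]

# ranked = rank_by_similarity("query_dog.jpg", ["img1.jpg", "img2.jpg"])
```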
AIA can be performed using visual features [5, 6, 7, 8, 9], visual features combined with textual features [10, 11, 12, 13, 14], or metadata associated with images [15, 16, 17, 18]. The visual features, which are based on pixel intensity values and their spatial relations and are extracted directly from the image, include color, texture, shape, etc. The textual features are derived from image metadata such as keywords, filename, URL, surrounding texts, and GPS information. Nevertheless, however annotation is achieved, the basic idea is to learn a model that annotates new images so that images can be efficiently organized for a simplified retrieval process. Recently, the focus has shifted to learning the model from weakly supervised [19, 20, 21, 22, 23, 24, 25, 26, 27] or unsupervised training data [28, 13, 15, 17, 16], since image databases are growing at very large scale and it is difficult to prepare fully labeled training data.
At present, various survey papers on AIA exist [29, 3, 30, 31, 32]. In [32], the authors present a comprehensive review of AIA with the primary focus on the efficiency of social tagging. The authors discuss in depth the issues of social image tagging and the refinement of noisy and missing social tags, and present a comparative study of various annotation methods under a common experimental protocol to check their efficiency for tag assignment, refinement, and retrieval. Although the paper covers a wide range of AIA methods, various important aspects of AIA (semi-supervised learning, unsupervised learning, deep learning, etc.) are not covered in detail.
A recent paper [31] presents a much more generalized categorization of AIA methods. The paper focuses on the presentation and explanation of core AIA techniques. As the implementation strategy varies from article to article, one also needs to pay attention to how an appropriate technique is implemented in the relevant research work. Also, AIA is a vast field; any attempt to explain these methods requires a well-defined categorization to cover all their aspects. In contrast to [31], we have adopted a different basis for the classification of AIA methods that is more specialized and informative. Notably, we pay much more attention to how efficiently a research work implements AIA methods to achieve a useful result. In this survey paper, we try to cover every aspect of AIA while explaining how a particular branch of AIA has expanded and the scope for further improvement. This paper discusses practical approaches, their constraints, and ways to deal with them. Each section of the article ends with a discussion that expounds the findings, future research directions, and their hurdles. This paper also presents performance evaluation measures along with relevant and predominant image annotation databases.
Figure 2: A broad categorization of image annotation methods.

In this paper, we comprehensively review more than 200 papers related to image annotation and its baselines and present the current progress and future trends. The categorization of image annotation methods presented in this survey paper is shown in figure 2. The rest of the article is organized as follows: Section 2 gives an overview of the history of image annotation methods. Section 3 describes some of the dominant visual features and extraction methods. Section 4 presents detailed descriptions of image annotation methods. Section 5 describes the most popularly used evaluation measures for verifying the performance of an annotation system. Section 6 presents the datasets most commonly used for training and evaluating annotation models, followed by the conclusion in section 7. Each section concludes with a discussion where we describe the challenges faced, limitations, and future scope of the methods of that section. Although there are slight differences between method and technique, we will ignore them and use the two terms interchangeably.
2. History of Image Annotation: An Overview
2.1. First Decade
Image annotation has existed since about 1990. The years 1990 to 2000 can be considered the very early years of image annotation. Especially after 1996, there was a sharp increase in the number of papers published about image annotation. A comprehensive review of these years is summarized in [2]. Most of the methods proposed in the first decade used manually extracted features followed by some classifier to annotate an image. The main concern of all the methods was to fill the semantic gap, the gap between how the user and the machine perceive an image: the user interprets an image in high-level form, while the computer interprets the same image at a low level, and this gap of interpretation is known as the semantic gap. In the early years, low-level features (texture, color, shape, etc.) were extracted using handcrafted feature extraction techniques, and there was no relationship between these low-level features and textual features; therefore, filling the semantic gap was a problem. All the proposed methods were supervised, and retrieval was based on CBIR. The paper [2] reviewed the work carried out in this early era of annotation and retrieval and set the goals for the next age. Most of the articles in the following decade followed the direction provided in [2].
2.2. Second Decade
In the early years, the problem of the semantic gap was realized, and in the medieval era researchers explored various methods to deal with it. From 2000 to 2008, there was extensive research on filling the semantic gap. A thorough study of these techniques is structured in [4]. Even though manually extracted features were used in most of the papers, the results produced were far better and dealt with the semantic gap more efficiently [33], and many state-of-the-art techniques and baselines were produced during this era [9, 34, 8]. The research of these years focused mainly on finding the correlation between visual and textual features; we explore these methods in the following sections. We call the time span 2002-2008 the medieval era of research and development on image annotation and retrieval techniques. In the medieval period, machine learning techniques were used extensively; this era was therefore dedicated to the use of machine learning for image annotation and retrieval [35, 36, 37, 19, 38, 33]. Later, [4] reviewed some of the most promising related work carried out during 2000 to 2008 and set the future direction in the image annotation and retrieval area. In the early years and the medieval era, almost all image annotation methods were based on supervised learning, in which the training dataset is provided with a complete set of manually annotated labels. Most of the methods proposed in this era followed CBIR-based image retrieval.
2.3. Third Decade
After 2010, deep learning techniques have been used extensively for the annotation and retrieval process [11, 39, 40, 27]. Convolutional neural network (CNN) based features [39, 11] and features extracted by the pre-trained CNN networks AlexNet and VGGNet [40, 27] are being utilized for AIA. Recently, a restricted Boltzmann machine (RBM) based model was designed for the automatic annotation of geographical images [13]. The limitations of CBIR shifted the research focus toward TBIR, in which semantic keywords are used for the retrieval of images [15, 41, 42]; TBIR requires the images to be annotated with semantic keywords. Also, due to the limitation of supervised learning, which requires a large amount of labeled training data, researchers are exploring semi-supervised and unsupervised learning methods. After 2010, researchers shifted their focus toward semi-supervised image annotation, where a model is trained using incompletely labeled training data for large-scale image annotation [43, 26, 24]. In semi-supervised learning (SSL) based large-scale image annotation methods, the real challenge is to deal with massive, noisy datasets in which the number of images is increasing more rapidly than ever anticipated. Recently, researchers have started exploring unsupervised image annotation techniques [15, 44, 45, 46], where the training dataset is not labeled at all and only metadata (URL, surrounding texts, filename, etc.) is provided with the training dataset. Unsupervised learning based annotation methods are in the early stage of their evolution.
Discussion: The semantic gap problem realized in the early era was minimized or almost solved in the medieval era within a supervised learning framework. Over the years, a shift can be observed from handcrafted feature extraction techniques, to the correlation between textual and visual features, to deep learning based features. Also, the focus has shifted from CBIR to TBIR. Then the problem of large-scale databases arose, and it continues to grow. Currently, researchers are exploring the possibility of SSL to deal with the increasing size of datasets. Recent advances in deep learning and the capabilities of high-performance computer systems have given us the opportunity to think differently and do away with handcrafted feature extraction techniques. Image annotation has also been applied to medical images [47, 48, 49, 50] and satellite images [51, 13], providing excellent results using current state-of-the-art methods. A highlight of image annotation methods in the last three decades is presented in table 1.

3. Image Processing Features
The contents of the image correspond to the objects present in it. The objects in images appear at different levels of abstraction and can be represented symbolically by moments, shapes, contours, etc., described by their location and relative relationships with other objects. The tags predicted by an annotation model are associated with the image for its semantic retrieval. The main idea behind AIA is to use semantic learning with high-level semantic concepts [30]. Localization of the objects is one of the essential aspects of describing the contents of an image [52, 12, 22, 7], although there have been various instances where annotation is performed efficiently without object localization [53, 5, 54, 38, 14]. However, whether or not the objects are segmented, low-level features such as texture, color, shape, and corners are extracted to describe the contents of the image. The challenge is to map these low-level features to high-level semantic concepts.
Table 1: Some of the highlights of image annotation methods in the last three decades.

                      Early Era             Medieval Era              Last Decade
Features extraction   handmade features     correlation of visual     correlation of visual and
                                            and textual features      textual features, deep
                                                                      learning features
Learning              supervised learning   supervised learning       semi-supervised learning∗,
                                                                      unsupervised learning+
Retrieval             CBIR                  CBIR                      TBIR
Problem faced         semantic gap          image based query         learning from noisy and
                                                                      unorganized texts

∗ Most of the methods followed this path only.
+ Few of the methods started following this path.
The content of an image can be described at various levels, each with its own worth. Image annotation does not rely entirely on the visual contents of the image; it may process metadata associated with the image and find associations between visual contents and metadata to annotate the image. However, considering only visual contents, an image can be represented at different levels, viz. the texture level, color level, object level, and salient point level.
Texture Features: Texture plays an important role in the human visual perception system. In the absence of a single definition of texture, the surface characteristics and appearance of an object can be considered its texture [55]. When describing texture, computer vision researchers often describe it as everything in an image that remains after extracting the color components and local shapes [2]. The various methods for extracting texture features include histogram-based first-order texture features [56], the gray-tone spatial dependence matrix [57] and its variants [58, 59], the local binary pattern (LBP) [60] and its variants [61], the local directional pattern [62] and its variants [63], wavelet features [64], etc. According to [65], if we get the same histogram twice, then the same class of spatial stimulus has been encountered; for a 512 × 512 image there are 26 × 10^6 possible permutations of pixels, so the probability of getting the same histogram twice for different textures is almost negligible [65]. Texture is an important property of the image; however, it alone cannot describe the image in its entirety. Although it has been asserted that texture pairs agreeing in their second-order statistics cannot be discriminated [66], texture features alone are not sufficient to annotate an image.
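As an illustration of the texture descriptors listed above, the following sketch computes a first-order intensity histogram and a local binary pattern (LBP) histogram with scikit-image; the bin counts, LBP parameters, and file name are assumptions for the example, not settings taken from the cited works.

```python
# A minimal sketch of two texture descriptors: a first-order gray-level
# histogram and a uniform LBP histogram, concatenated into one feature vector.
import numpy as np
from skimage import io, color
from skimage.feature import local_binary_pattern


def texture_features(image_path, lbp_points=8, lbp_radius=1, hist_bins=32):
    # Assumes an RGB input image; convert to 8-bit gray levels.
    gray = (color.rgb2gray(io.imread(image_path)) * 255).astype(np.uint8)

    # First-order texture feature: normalized gray-level histogram.
    first_order, _ = np.histogram(gray, bins=hist_bins, range=(0, 256), density=True)

    # Uniform LBP codes take values 0..P+1, giving a compact histogram.
    lbp = local_binary_pattern(gray, P=lbp_points, R=lbp_radius, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=lbp_points + 2,
                               range=(0, lbp_points + 2), density=True)

    return np.concatenate([first_order, lbp_hist])

# features = texture_features("example.jpg")   # hypothetical file name
```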
Color Features: Color-based features are among the most powerful representations of the contents of an image. An image is composed of RGB colors; hence, extracting RGB-based features has been a very active area of image content representation. Opponent color representations of RGB have also been reported in the literature [67]. The color histogram has been used by several authors to represent image contents [5, 10, 68, 6]. An excellent review of color image processing can be seen in [2]. Out of the various visual features of an image, color is the most straightforward. Color plays an indispensable role for humans in recognizing objects in an image, as human eyes are sensitive to colors. Color features are invariant to rotation and translation, which makes them one of the most dominant visual features; if normalization is used, color features are also invariant to scaling. A comparative analysis of color features can be seen in [69]. A small variation in color value is less noticeable to the human eye than a variation in gray-level values, which helps in designing more compact color-based features. The color space can be represented using RGB, HSL, HSI, HSV, etc., and these representations are invariant to rotation and translation. No matter which color space representation is used, color information can be represented in a 3-D space (one dimension for each color channel) [70]. An MPEG-7 based scalable color descriptor, represented in HSV color space and encoded by a Haar transform, is presented in [71].
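The sketch below computes a compact quantized HSV color histogram with OpenCV as a simple color descriptor. It is a plain histogram for illustration, not the MPEG-7 scalable color descriptor of [71]; the bin counts and file name are assumed.

```python
# A minimal sketch of a quantized HSV color histogram descriptor.
import cv2


def hsv_histogram(path, bins=(16, 4, 4)):
    bgr = cv2.imread(path)
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)           # hue, saturation, value
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])        # OpenCV hue range is 0-179
    hist = hist.flatten()
    return hist / (hist.sum() + 1e-9)                    # normalize to sum to 1

# descriptor = hsv_histogram("example.jpg")              # 16*4*4 = 256 dimensions
```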
Object Level Features: Recognition of the objects present in the image is one of the most important aspects of image annotation. The best way to recognize objects is to segment the objects present in the image and then extract features from those segmented regions. Recognition of the objects helps in the annotation of the image and in region-label matching as well. However, segmentation of objects is itself a complex task, and unsupervised segmentation in particular is brittle. Semantic segmentation requires each pixel to be labeled with an object class [22], which can be implemented in a supervised manner [72, 73], a weakly supervised manner [74, 75], or an unsupervised manner [76]. Deep neural networks are also being used extensively for the segmentation of objects [77, 78]. The accuracy of the segmentation algorithm plays a crucial role in the semantic annotation of the image. Various image annotation methods have been reported that use segmentation techniques for the recognition of objects [22, 79, 52, 80, 12]. Robust segmentation, in which one segmented region contains all the pixels of one object and no pixels of any other object, is essential for annotating images semantically but is hard to achieve; therefore, researchers have tried to implement weak segmentation. Alternatively, segmentation can be omitted and other types of features can be used for image annotation [81, 15, 82, 24, 7]. Although accurate, robust semantic segmentation is difficult to achieve, segmented regions are useful and powerful features for annotation.
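As a rough illustration of region-level features without full semantic segmentation, the following sketch over-segments an image into SLIC superpixels and computes a mean-color feature per region. This weak-segmentation stand-in and its parameters are assumptions for the example, not the segmentation used in the cited works.

```python
# A minimal sketch: SLIC superpixels as a weak substitute for object
# segmentation, with one mean-color feature per region.
import numpy as np
from skimage import io
from skimage.segmentation import slic


def region_color_means(image_path, n_segments=50):
    image = io.imread(image_path)                           # assumes an RGB image
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    features = []
    for label in np.unique(segments):
        mask = segments == label
        features.append(image[mask].mean(axis=0))           # mean RGB per region
    return np.array(features)                               # (n_regions, 3)

# region_features = region_color_means("example.jpg")       # hypothetical file
```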
Salient Image Point Features: Salient attributes present in the image, usually identified by color, texture, or local shapes, are used to produce salient features. Although color and texture features are commonly used to represent the contents of the image, salient points deliver more discriminative features. In the absence of object-level segmentation, salient points act as a weak segmentation and play an essential role in the representation of an image. Salient points may be present at different locations in the image and need not be corners, i.e., they can lie on smooth lines as well. Ref. [83] compared salient points extracted using wavelets with a corner detection algorithm. A salient point detection and scale estimation method based on local minima of fractional Brownian motion is proposed in [84]. Although salient points are a substitute for segmentation (as segmentation is a brittle task), if salient points are used along with segmentation they give much more discriminative features. Salient points have been used by various authors for image annotation [52].
Recently, scale invariant feature transform (SIFT) [85] based features have become much more popular. SIFT is a scale- and rotation-invariant local feature descriptor based on edge-orientation histograms, which extracts key points (interest points) and their descriptors. A similar method, which computes a histogram of gradient directions in localized portions of an image, called the histogram of oriented gradients (HOG) [86], is a robust feature descriptor. Later, a speeded-up version of SIFT called speeded-up robust features (SURF) was introduced in [87]. For real-time applications, a machine learning based corner detection algorithm called features from accelerated segment test (FAST) was introduced in [88]. SIFT and SURF descriptors are usually converted into binary strings to speed up the matching process; however, binary robust independent elementary features (BRIEF) [89] provides a shortcut for finding binary strings directly without computing the descriptors. It is worth mentioning that BRIEF is a feature descriptor, not a feature detector. SIFT uses a 128-dimensional descriptor and SURF a 64-dimensional descriptor, i.e., both have high computational complexity. An alternative to SIFT and SURF called oriented FAST and rotated BRIEF (ORB) is proposed in [90]. SIFT and SURF are patented algorithms, whereas ORB is not patented and has comparable performance with lower computational complexity than SIFT and SURF.
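The following sketch extracts salient points with ORB in OpenCV (chosen here because, as noted above, it is not patented); SIFT or SURF could be substituted where they are available in the OpenCV build. The image path and number of features are illustrative.

```python
# A minimal sketch of salient-point extraction with ORB.
import cv2


def orb_keypoints(path, n_features=500):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=n_features)
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    # Each descriptor is a 32-byte binary string; the keypoints carry location,
    # scale, and orientation, and can serve as a weak segmentation of the image.
    return keypoints, descriptors

# kp, des = orb_keypoints("example.jpg")
# print(len(kp), des.shape)   # e.g. up to 500 keypoints, (n, 32) descriptors
```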
Discussion: Feature extraction is one of the essential components of an image annotation system. Visual features have a significant role in the identification and recognition of objects and in image content representation. Different types of features (texture, color, SIFT, etc.) have different characteristics. Local features (SIFT, SURF, shape, etc.) describe image patches, whereas global features represent an image as a whole. Thus, local features are a specification of an image and are used for object recognition, while global features are a generalization of an image and are suitable for object detection. Recent feature extraction methods (SIFT, SURF, HOG, etc.) are compelling object representations and have become standard feature representation methods. However, global features like texture and color have their own significance and will continue to be used in practice. An example of features at different levels is shown in figure 3.

Figure 3: Image feature representation at different levels of abstraction. (a) Original image taken from [91]. (b) Manually segmented image for object level feature representation. (c) 100 SIFT keypoints with their size and orientation. (d) Texture representation of (a) using LBP. (e) BGR color histogram of (a).
4. Classification of Image Annotation Techniques
Humans have an incredible visual interpretation system. The human visual system interprets an image remarkably well while making associations between the subjects and objects present in the image [22]. A human being can effortlessly describe the objects and their associated attributes present in an image. Computer vision researchers have tried endlessly over the last few decades to make computer systems capable of imitating this human ability [22]. AIA is a step forward in this direction, where the aim is to detect each object present in the image and assign the corresponding tags to describe the image contents. We have seen intensive studies in the last two decades (although there had been a few instances before as well), and many techniques have been proposed for image annotation.
The proposed image annotation approaches can be categorized in many ways. We have classified the annotation approaches according to figure 2 and have followed the structure of this flow, presenting the details of each subpart of figure 2 in the following sections.

Figure 4: Example of visual correlation. A strong correlation is represented by a solid arrow, and a weak correlation is represented by a dotted arrow. (a) An instance airplane in the image can be either flying or parked. (b) If the airplane is flying, then it has a strong correlation with the objects sky and clouds and a weak correlation with ground and grass. (c) If the airplane is parked, then it has a strong correlation with the instances ground and grass and a weak correlation with the sky and clouds instances.
4.1. Model-Based Annotation Approach
The model-based approach trains an annotation model from the training set features (visual, textual, or both) to annotate unknown images. The model-based approach can be broadly categorized into generative models, discriminative models, graph-based models, and nearest neighbor based models. All of these models and the prominent methods they use are presented in figure 5. The primary objective of the model-based approach is to learn an annotation model from the training data through visual and textual correlation so that the learned model can accurately annotate unknown images. The visual and textual correlation is a relationship established between visual features and textual metadata (labels, surrounding texts, etc.) for a pair of instances or multiple instances. A relationship among instances is shown in figure 4. Multiple instances in an image may not be independent of each other [92]; there are dependencies between instances. A strong correlation shows high interdependence among instances, while a weak correlation indicates low dependence.
Figure 5: Categorization of the model based annotation approach.

4.1.1. Generative Model
The generative model aims at learning a joint distribution over visual and contextual features so that the learned model can predict the conditional probability of tags given the image features [117]. For the generative model to work correctly, it has to capture the dependency between visual features and associated labels accurately. Generative model based image annotation is presented in [28, 33, 80, 51, 95, 93, 37, 96, 52, 99, 12]. Generative models are usually based on topical models [95, 93], mixture models [96, 80], and relevance models [97, 99].
Several works based on topical models have been reported in the literature [28, 33, 94, 51, 95, 93]. A topic model, which uses contextual clues to connect words with similar meanings, can find meaning in large volumes of unlabeled text. It is a powerful unsupervised tool for analyzing text documents; a topic model can predict future texts and consists of topics that are distributed over words, assuming that each document can be described as a mixture of these topics [118]. In one of the early uses of topic models for image annotation, probabilistic latent semantic analysis (PLSA) was used to annotate Corel images [28]. The authors used both textual (distribution of textual labels) and visual (global and regional) features to annotate an unknown image [28, 12]. The work carried out in [33] uses Latent Dirichlet Allocation (LDA) to reduce the noise present in images for application in web image retrieval. Furthermore, the work in [93] extended LDA to correspondence LDA (cLDA), and later, in [94], LDA was extended to topic regression multi-modal Latent Dirichlet Allocation (tr-mmLDA) to capture the statistical association between image and text. LDA is also used to reduce the semantic gap in the annotation of satellite images [51]. In [95], the authors presented an extension of sparse topical coding (STC), which is the non-probabilistic formulation of the probabilistic topic model (PTM); this extended method is called sparse multi-modal topical coding (SMMTC) [95].
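To illustrate the topic-model idea on a toy scale, the sketch below fits LDA on tag "documents" only and uses the inferred topic mixture of a partially tagged image to rank additional tags. The cited models additionally couple topics with visual features; the tag lists and parameters here are hypothetical.

```python
# A minimal sketch of the topic-model idea used by PLSA/LDA-style annotation
# methods, applied to tag documents only (not a joint visual-textual model).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

train_tag_docs = [                       # each training image as a bag of tags
    "sky clouds airplane flying",
    "grass ground airplane parked",
    "dog grass park",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_tag_docs)              # image-by-tag counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# For a new image with incomplete tags, infer its topic mixture and rank the
# vocabulary by p(tag | topics) to propose additional annotations.
partial = vectorizer.transform(["airplane sky"])
theta = lda.transform(partial)                             # topic proportions
word_given_topic = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
scores = theta @ word_given_topic                          # p(tag | image topics)
ranked = np.array(vectorizer.get_feature_names_out())[np.argsort(-scores[0])]
print(ranked[:4])
```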
The mixture model is a parametric model whose basic idea is to learn the missing model parameters using expectation maximization methods [119]. If the image features and tags are given, the mixture model computes their joint distribution, and for an unknown image the conditional probability of tags is obtained based on the visual features of the image. The work in [80, 52, 96, 37] uses mixture models to annotate images. The work carried out in [52] is based on a finite mixture model (FMM) and proposes an adaptive expectation maximization algorithm used for the selection of the optimal model and for model parameter estimation. The authors propose multi-level annotation where the FMM is used at the concept level, and at the content level the segmented regions of the image are classified using a support vector machine (SVM) with an optimal model parameter search scheme. Later, [96] proposed a probabilistic semantic model having hidden layers and used an expectation maximization (EM) based model to determine the probabilities of visual features and words in a concept class. In [37], the authors used a Gaussian mixture model (GMM) for feature extraction and proposed a sparse coding framework for image annotation. The proposed GMM is inspired by discrete cosine transform (DCT) features and uses a subspace learning algorithm to efficiently utilize the multi-label information for feature extraction [37].
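The following sketch shows a generic generative mixture-model baseline: one Gaussian mixture fitted per tag on visual features, with a new image annotated by the tags maximizing the prior times the likelihood. It is not the specific FMM or GMM formulation of [52, 96, 37]; features, tag sets, and component counts are assumptions.

```python
# A minimal sketch of a generative mixture-model annotation baseline: fit one
# GMM per tag and rank tags for a new image by p(tag) * p(features | tag).
import numpy as np
from sklearn.mixture import GaussianMixture


def fit_tag_mixtures(features, tag_lists, n_components=3, seed=0):
    """features: (n_images, d) array; tag_lists: list of tag sets per image."""
    vocab = sorted({t for tags in tag_lists for t in tags})
    models, priors = {}, {}
    for tag in vocab:
        idx = [i for i, tags in enumerate(tag_lists) if tag in tags]
        priors[tag] = len(idx) / len(tag_lists)
        models[tag] = GaussianMixture(
            n_components=min(n_components, len(idx)), random_state=seed
        ).fit(features[idx])
    return models, priors


def annotate(x, models, priors, top_k=5):
    scores = {t: np.log(priors[t]) + m.score(x.reshape(1, -1))
              for t, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```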
The relevance model calculates the joint probability distribution over tags and feature vectors [98]. The idea is to find the relevance of visual and textual features in order to learn a non-parametric model for image annotation. A non-parametric annotation model uses a vocabulary of blobs to describe an image, where each image is generated using a certain number of blobs. Relevance models can integrate features of arbitrary dimensions and treat visual and textual information as different features to find relationships among them. Moreover, learning the co-occurrence among global, regional, and contextual features helps in tagging an image as a whole entity with semantic meaning. Cross-media relevance model (CMRM) based image annotation has been proposed in [98, 42, 99, 12]. First, Jeon et al. [98] proposed CMRM, which learns a joint distribution over blobs, where blobs are clustered regions that form a vocabulary. The proposed model can be used to rank images as well as to generate a fixed-length annotation from the ranked images. In [97], the multiple Bernoulli relevance model was proposed, which uses visual features to find the correspondence between the feature vectors and tags. Later, in [99], the authors extended CMRM to analyze word-to-word relations as well; this CMRM is called dual CMRM. The authors integrated word relations, image retrieval, and web search techniques together to solve the annotation problem [99]. In similar work [12], the idea of dual CMRM is used to propose an extended CMRM that integrates image patches, as a bag-of-visual-words representation, together with the textual representation.
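A compact sketch of CMRM-style scoring is given below, following the commonly described smoothed estimates of word and blob probabilities per training image; the smoothing constants and the toy blob/word data are illustrative and not taken from [98].

```python
# A compact sketch of CMRM-style scoring: the probability of a word w given a
# test image's blobs is estimated by summing, over training images J, the
# product of smoothed estimates P(w|J) and P(b|J).
import numpy as np


def cmrm_scores(test_blobs, train_words, train_blobs, alpha=0.1, beta=0.1):
    """train_words / train_blobs: lists (one entry per training image J) of
    word lists and blob-id lists; test_blobs: blob ids of the query image."""
    vocab = sorted({w for ws in train_words for w in ws})
    all_w = [w for ws in train_words for w in ws]
    all_b = [b for bs in train_blobs for b in bs]
    scores = {w: 0.0 for w in vocab}
    for ws, bs in zip(train_words, train_blobs):           # each training image J
        p_b = np.prod([(1 - beta) * bs.count(b) / len(bs)
                       + beta * all_b.count(b) / len(all_b) for b in test_blobs])
        for w in vocab:
            p_w = (1 - alpha) * ws.count(w) / len(ws) \
                  + alpha * all_w.count(w) / len(all_w)
            scores[w] += p_w * p_b / len(train_words)       # uniform prior P(J)
    return sorted(scores, key=scores.get, reverse=True)
```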
4.1.2. Discriminative Model
Discriminative models consider each tag as an independent class; thus, for image annotation, a classifier-based approach is followed. A separate classifier is trained for each tag using the visual features of the images, and for a test image the trained classifiers predict the particular tags for that image. The critical issues are the subset selection of visual features and the multiple labels per image [117]. The high dimensionality of the various visual features raises concerns about their organization and the selection of a potent discriminative subset from these heterogeneous features [117]. Different methods have been proposed in the literature for efficient feature selection [120, 121, 26].
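A minimal sketch of the per-tag classifier idea (binary relevance with one SVM per tag) using scikit-learn is shown below; the feature vectors, tag vocabulary, and kernel choice are placeholders rather than the configurations used in the cited works.

```python
# A minimal sketch of binary relevance: one SVM per tag via a one-vs-rest
# wrapper, with probability outputs used to rank tags for a test image.
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X_train = np.random.rand(100, 64)                  # placeholder visual features
tags_train = [["sky", "airplane"], ["dog", "grass"]] * 50

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(tags_train)            # one binary column per tag

clf = OneVsRestClassifier(SVC(kernel="rbf", probability=True, random_state=0))
clf.fit(X_train, Y_train)

x_test = np.random.rand(1, 64)
scores = clf.predict_proba(x_test)[0]
top_tags = [t for _, t in sorted(zip(scores, mlb.classes_), reverse=True)[:3]]
print(top_tags)
```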
As images are labeled with multiple tags, discriminative models can also be seen as multi-label classifiers. Image annotation frameworks based on discriminative models are presented in [35, 122, 50, 123, 54, 47, 124, 125, 14, 41, 126]. The discriminative model requires a large amount of training data with accurate and complete annotated keywords [101]. Generating a fully manually annotated training dataset is an expensive and very time-consuming process; recent research has therefore focused on training the classifier in a semi-supervised manner to alleviate this problem [101, 21, 100, 104, 81].
Most of the discriminative models are based on the support vector machine (SVM) or its variants [35, 122, 14, 124, 127, 50, 47]. Decision trees (DT) have also been used for image annotation [126, 41]. The use of multiclass SVMs (where several classifiers are trained and their results are combined) is reported in [35, 54] to classify images into one of the predefined classes. Later, in [122], for multiclass annotation, binary SVM (support vector classification) is used for semantic prediction, and one-class SVM (support vector regression) is used to predict the confidence factor of the predicted semantic tags. In multiple kernel learning SVM [14], several different kernels (a color histogram kernel, a wavelet filter bank kernel, and an interest point matching kernel) are used to capture certain types of visual properties of the image and to approximate the underlying visual similarity relationships between images more precisely. In [123], the authors proposed a kernel based classifier that can bypass the annotation task for text-based retrieval. Discriminative models are used extensively for medical image annotation [50, 47], where SVM is used as the classifier.
A rule-based approach using a decision tree (DT) and rule induction is reported in [126] for the automatic annotation of web images. Later, in [41], the decision tree is enhanced to perform both classification and regression in order to store semantic keywords and their corresponding ranks. In [54], annotation is performed using the idea of clustering based on visually similar and semantically related images. Artificial neural network (ANN) based discriminative models [125, 128, 129] have also been applied to image annotation. A five-layer convolutional neural network with five hundred thousand neurons was used on the ImageNet 2010 dataset to classify images into 1000 predefined classes [129]. A fusion method for multiple pretrained deep learning based models and its parameter tuning is presented in [128].
In the last decade, the focus of researchers has shifted toward semi-supervised training due to the unavailability of sufficiently many fully annotated datasets and their associated cost. To achieve SSL, Hessian regularization (HR) is used with SVM in [100]. HR can exploit the intrinsic local geometry of the data distribution, where the intrinsic structures of two datasets may roughly follow a similar structure globally but have quite different local structures, by fitting the data within the training domain and predicting data points beyond the boundary of the training data [130]. In [102], the authors extended one-versus-all SVM (OVA-SVM) to one-versus-all SVM with structured output learning capability (OVA-SSVM), which was further extended by [101] to adapt to the multi-label situation for image annotation. Although Laplacian regularization (LR) can be used with SVM [103], the advantage of Hessian regularization is that it is more suitable for the SSL framework. To deal with missing labels, a smoothing function, which is a similarity measured between images and between classes, is proposed, and linear logistic regression with re-weighted least squares is used for classification and label assignment [21]. Inspired by manifold regularization [103], an annotation framework applicable to the multiclass problem using manifold regularized kernel logistic regression (KLR) with a smooth loss function is proposed in [104]; here, for SSL, Laplacian regularization is used, which exploits the intrinsic structure of the data distribution [104]. Later, in [81], the authors proposed an extension of two-layer multiple kernel learning (MKL) for SSL. MKL learns the best linear combination of elementary kernels that fits a given distribution of data. The proposed method extends standard (two-layer) MKL to multi-layer MKL (deep kernel learning), and the learned kernel is used with SVM (as well as its Laplacian variants) [81].
4.1.3. Nearest Neighbor based Model
Nearest neighbor (NN) based models primarily focus on selecting similar neighbors and then propagating their labels to the test image. The similar neighbors can be defined by image-to-image similarity (visual similarity), image-to-label similarity, or both. A distance metric is used for selecting similar neighbors; the quality of the distance metric plays a vital role in choosing relevant and appropriate neighbors, and hence in the overall performance of the NN method. NN is a non-parametric classifier that can work directly on the data without any learned parameters [131].
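The following sketch shows nearest-neighbor label transfer in the spirit of the baseline in [9]: the K visually closest training images vote for tags, weighted by distance. The features, tags, and weighting scheme are illustrative assumptions.

```python
# A minimal sketch of nearest-neighbor label transfer: find the K closest
# training images and rank tags by their distance-weighted frequency.
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors


def knn_label_transfer(x_query, X_train, tags_train, k=5, n_tags=5):
    nn = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(X_train)
    dist, idx = nn.kneighbors(x_query.reshape(1, -1))
    votes = Counter()
    for d, i in zip(dist[0], idx[0]):
        for tag in tags_train[i]:
            votes[tag] += 1.0 / (1.0 + d)       # closer neighbors vote more
    return [t for t, _ in votes.most_common(n_tags)]


X_train = np.random.rand(200, 128)              # e.g. color + texture features
tags_train = [["sky", "clouds"] if i % 2 else ["grass", "dog"] for i in range(200)]
print(knn_label_transfer(np.random.rand(128), X_train, tags_train))
```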
Various image annotation works using NN based models are presented in [9, 106, 132, 133, 134, 108, 135, 136, 137, 11, 10, 40]. Image annotation by finding the neighborhood of the test image based on color and texture features is presented in [9], where the authors proposed a classical label transfer scheme for the query image. The nearest neighbors of the query image are computed by combining basic distance metrics, and then, by ranking the keywords based on their frequency, a fixed number of keywords (n) is assigned to the test image [9]. In 2009, a novel NN based model for AIA called tag propagation (TagProp) was designed [106]. The authors proposed tag propagation based on a weighted NN model, where the weights of the neighbors are assigned based on their rank or distance [106]. A comparative analysis of the variants of TagProp and their effectiveness is shown in [133]. Learning the distance metric is key to NN models, and various methods for learning the distance have been proposed in the literature [132, 138, 134, 107]. In [132, 138], learning a Mahalanobis distance matrix is proposed to design the large margin nearest neighbor (LMNN) classifier; later, LMNN was extended to multi-label annotation in [107]. In [134], a label-specific distance metric learning method is proposed to distinguish the multiple labels of an image, and neighbors are fetched based on the distance for each label. This approach helps in reducing false positive and false negative labels.
The main problem with NN models is that they require a completely manually annotated training set. Also, each label should have a sufficient number of images, with roughly the same number of images per label. A novel algorithm called two-pass KNN (2PKNN) was proposed [107, 11] to alleviate the problems of class imbalance and weak labeling. 2PKNN uses two types of similarity in two passes: in the first pass, image-to-label similarity is used, and image-to-image similarity is used in the second pass [107]. A combined feature set of TagProp [106] features, convolutional neural network (CNN) features, Fisher vectors, and the vector of locally aggregated descriptors (VLAD), with combinations created using canonical correlation analysis (CCA) and kernel CCA [109], is used with 2PKNN for image annotation in [11]. A KCCA [109] based image annotation method is presented in [137], where KCCA finds the correlation between visual and textual features and an NN based model is then used for annotation. A comparison between pre-trained CNN networks (AlexNet and VGG16) and 15 different manual features for the 2PKNN model is presented in [40]. In [108], an extension of kernel metric learning (KML) [109] called robust kernel metric learning (RKML), a regression based distance calculation technique, is used to find the visually similar neighbors of an image, and then majority based ranking is used to propagate the labels to the query image. A flexible neighborhood selection strategy is proposed in [136], where the selected neighbors lie within a predefined range (rather than being a fixed number of neighbors) and are tag dependent; the proposed method gives the flexibility to choose only relevant neighbors. A multi-label KNN, introduced in [105] and utilized in [10] for multi-label annotation, combines various features using a feature fusion technique and then uses multi-label KNN to annotate images automatically.
4.1.4. Graph based Model
The basic idea behind the graph-based model is to design a graph from the visual and textual features in such a way that the correlation between visual and textual features can be represented in the form of vertices and edges and their dependency can be explained. The data points (visual features of images) and the labels can be represented as separate subgraphs, and edges represent the correlation among the subgraphs [139]. The semantic correlation between labels can be represented using interconnected nodes, which helps in multi-label image annotation. The graphical model can also be used to find the correlation among labels; in such a case, vertices represent labels and edges represent the correlation among labels. Figure 6 shows an example of correlation representation among visual and textual features using a graphical model.

Figure 6: Example of a graphical model. The label subgraph and data subgraph represent label-to-label correlation and visual correlation, respectively. The correlations between labels and data instances are represented by the edges connecting the label and data subgraphs.
The graph-based model can be used in both supervised and semi-supervised frameworks to model the intrinsic structure of image data from both labeled and unlabeled images [23]. Significant works on graph-based models are presented in [112, 17, 114, 116, 140, 139, 115, 111, 141, 113, 79, 92, 23, 25]. Refining the results may improve the overall accuracy when the results of an image annotation method are unsatisfactory. A random walk based refinement process, where the obtained candidate tags are re-ranked, is presented in [112, 110]. The authors used a random walk with restart algorithm, a probability-based graphical model that either selects a particular edge or jumps to another node with some probability; it re-ranks the original candidate tags, and the top-ranked tags are chosen as the final annotation of the image [112]. Further, in [111], an adaptive random walk method is proposed, where the choice of the next route is determined not only by the probability but also by a confidence factor based on the number of connected neighbors. A bipartite graph based representation of candidate annotations, followed by the use of a reinforcement algorithm on the graph to re-rank the candidate tags, is presented in [17]; the obtained top-ranked tags are used as the final annotation. In a further application of bipartite graphs and random walks, the work carried out in [139] constructs two subgraphs (a data graph and a label graph) as parts of a graph called a bi-relation graph (BG), and then, using a random walk with restart on the BG, a class-to-class asymmetric relationship is calculated which improves the label prediction.
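A minimal sketch of the random walk with restart iteration used for tag re-ranking is given below, on a toy tag-affinity graph; the graph construction, restart probability, and scores are illustrative and do not reproduce the cited methods.

```python
# A minimal sketch of random walk with restart (RWR) over a tag-affinity graph:
# r = (1 - c) * W @ r + c * e is iterated until convergence, and the stationary
# scores r are used to re-rank the candidate tags.
import numpy as np


def random_walk_with_restart(W, e, restart=0.3, n_iter=100, tol=1e-8):
    W = W / W.sum(axis=0, keepdims=True)        # column-stochastic transitions
    e = e / e.sum()
    r = e.copy()
    for _ in range(n_iter):
        r_new = (1 - restart) * W @ r + restart * e
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r


tags = ["sky", "clouds", "airplane", "dog"]
W = np.array([[0, 3, 2, 0], [3, 0, 2, 0], [2, 2, 0, 0], [0, 0, 0, 1]], float)
e = np.array([0.4, 0.1, 0.4, 0.1])              # initial candidate-tag scores
scores = random_walk_with_restart(W, e)
print([tags[i] for i in np.argsort(-scores)])   # re-ranked tags
```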
A graph learning based annotation method is presented in [114], where image-based graph learning based on a nearest spanning chain (NSC) is used to capture the similarity and structural data distribution information. The proposed image-based graph learning produces candidate tags which are further refined using word-based graph learning [114]. Image annotation can be considered a kind of multi-label classification, and a graph-based multi-label classification approach is presented in [116]: a hypergraph is constructed where each instance is represented by a vertex, and corresponding labels are connected with edges to describe the correlation between the image and its various labels. In [140], the authors proposed a weighted graph approach, where weights are assigned based on neighborhood construction, in the semi-supervised framework for image annotation.
A sparse graph reconstruction method using only the K nearest neighbors is proposed in [115]. The proposed method uses only K neighbors (one-vs-KNN) instead of all the remaining images (one-vs-all) to construct a sparse graph that models the relationships among images and their concepts; it is implemented in a semi-supervised framework and efficiently deals with noisy tags. A hybrid of graph learning and KNN based annotation [141] uses the K nearest neighbors of the test image to propagate labels to the query image. The method uses the image-label similarity as a graph weight, and using the image-to-label distance a final list of tags is generated and assigned to the query image; it can be implemented efficiently even when the number of labels is substantial [141]. A technique based on records of common interest between users on a social networking site is proposed for AIA in [113]. The proposed method is called social diffusion analysis, where a diffusion graph is constructed based on common social diffusion records of users to represent their common interests. By utilizing this diffusion graph, a learning-to-rank technique is used to annotate images.
Discussion: The focus of model-based learning is to train a model using training data so that the trained model can tag unknown images. A shift from supervised models to semi-supervised models can be noticed in the last decade. The progress made in SSL based image annotation models is remarkable, but the performance of real-time semi-supervised annotation models is not yet up to the mark. However, these models have paved the way and given hope that a real-time annotation model in a semi-supervised framework is achievable. The simplicity and performance of the nearest neighbor model cannot be ignored while designing a model based annotation system. At the same time, the representation of correlation among images and labels by the generative model is at its best. The discriminative power of a classifier can play a very prominent role in the design of an annotation system, and graphical models can be used for the refinement of candidate tags. Hybrid models of generative and discriminative [142], generative and nearest neighbor [135], and graph learning and nearest neighbor [141] approaches have shown good results. No matter what kind of model is used, the focus of future research should be on designing semi-supervised and unsupervised learning based annotation models with satisfactory accuracy that can be implemented in real time.
4.2. Learning Based Annotation Approach
Figure 7: Categorization of the learning based annotation approach.

The visual features alone are not sufficient to annotate all the objects present in the image; we can say that visual elements do not represent the contents of the image in their entirety. To label the contents of the image accurately, the model should learn features from various available sources. The correlation among labels is one such source of features, and exploiting tag correlation also helps in dealing with noisy and incomplete tags. Learning based models primarily focus on learning a model from multiple sources in the training set. Most learning models exploit the input feature space and have the capability to deal with incomplete tags. Learning from various sources (visual features, textual features) improves the generalization capability of the model. When a model explores multiple views of features and stores the features of each view separately in a feature vector, these feature vectors need to be concatenated systematically so that the obtained single feature vector is physically meaningful and has more discriminative power than features from only one view. Learning based models can be classified into four different categories: multi-label learning, multi-instance multi-label learning, multi-view learning, and distance metric learning (figure 7). Figure 7 also shows all the prominent methods used by learning based models.
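As a simple illustration of systematic concatenation of multi-view features, the sketch below L2-normalizes each view before concatenation so that no single view dominates; the view names, dimensions, and weights are assumptions, and the cited methods use more elaborate fusion schemes.

```python
# A minimal sketch of early fusion of multi-view features (e.g. color, texture,
# SIFT bag-of-words): per-view L2 normalization followed by concatenation.
import numpy as np


def fuse_views(views, weights=None):
    """views: dict mapping view name -> 1-D feature array for one image."""
    weights = weights or {}
    fused = []
    for name, vec in views.items():
        vec = np.asarray(vec, dtype=float)
        norm = np.linalg.norm(vec)
        if norm > 0:
            vec = vec / norm                    # per-view L2 normalization
        fused.append(weights.get(name, 1.0) * vec)
    return np.concatenate(fused)


example = {
    "color_hist": np.random.rand(64),
    "lbp_hist": np.random.rand(10),
    "sift_bow": np.random.rand(500),
}
print(fuse_views(example).shape)                # (574,)
```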
4.2.1. Multi-label Learning
Most images contain more than one object; hence, a single label does not describe the image's content properly. Multi-label learning (MLL) means the assignment of multiple tags to an instance. A model based on multi-label learning deals with multiple classes of labels and assigns multiple labels to an image based on its contents [162]. Multi-class learning is different from multi-label learning: in the former, out of many classes only one class is assigned to an instance, whereas in the latter, multiple classes out of many can be assigned to an instance. MLL usually assumes that the training dataset is completely labeled [163]. However, recent progress in semi-supervised MLL is remarkable, and various methods have been proposed for semi-supervised MLL [162, 164, 165, 166], where an MLL model is trained from missing labels. Images downloaded from Flickr or other internet sources are not fully labeled, and semi-supervised learning methods are therefore extremely desirable due to the unavailability of large fully manually annotated datasets. A detailed review of MLL methods can be seen in [167]. MLL techniques can be implemented using discriminative models [157, 158], nearest neighbor based models [105, 10], graph-based models [159, 160], and various other techniques [20, 37, 53, 168, 27]. The discriminative models [157, 158] try to learn a classifier from incomplete training data, where the training data has incomplete label information along with some noisy labels. In [21], the missing labels are filled automatically using a smoothing function: the similarity measured between images and between labels in the smoothing function fills the missing labels. The smoothing function assumes smoothness at two levels, image level and class level. Image-level smoothness assumes that if two images have similar features, then the labels represented by these two images are close; class-level smoothness assumes that if two labels have close semantic meanings, then there exist similar instantiations. In [157], an image is represented by overlapping windows at different scales, and then, using parameter estimation and a cutting plane algorithm, all the objects are labeled simultaneously. A machine learning technique to handle densely correlated labels without incorporating one-to-one correlations is proposed in [158], which can deal with different training and testing label correlations.
Graph based models for the multi-label framework, called multi-label Gaussian random field (ML-GRF) and multi-label local and global consistency (ML-LGC), inspired by the single-label Gaussian random field (GRF) [169] and single-label local and global consistency (LGC) [170] respectively, are proposed in [159]. The proposed models exploit label correlations and label consistency over the graph. A bidirectional fully connected graph is introduced to tackle the MLL problem in [160]; the graph is based on a generalization of the conditional dependency network, which calculates the conditional joint distribution over output labels given the input features. Although nearest neighbor based binary classifiers have proved their effectiveness for binary classification, their multi-label version [105] paved the way for MLL.
Nearest neighbor based MLL using multi-label KNN [105, 10] is easy to implement with low complexity. The method provides very effective classification accuracy with various feature fusion techniques [10]. A subspace learning method to tackle the multi-label information and a sparse coding based image annotation method are proposed in [37]. An extension of sparse graph based SSL [115], called the multi-label sparse graph based SSL framework, is proposed in [20] with an optimal sample selection strategy to incorporate semantic correlation for multi-label annotation. A regression based structured feature selection method, which selects a subset of the extracted features, is proposed in [117]. The proposed method uses tree structured grouping sparsity to represent hierarchical correlation among output labels to speed up the annotation process. In [53], dictionary learning is used in the input feature space to enable MLL. The authors proposed multi-label dictionary learning (MLDC), which exploits the label information to represent an identical label set as a cluster and keep partially identical label sets connected [53]. An MLL model in a semi-supervised framework which can efficiently deal with noisy tags is proposed in [168]. The proposed method uses a graph Laplacian for unlabeled images, and trace norm regularization to reduce the model complexity and to find the co-occurrence of labels and their dependencies. Finding and utilizing the tag correlation is one of the leading aspects of the MLL framework. In [27], the authors exploit the tag correlations, where missing labels are dealt with using tag ranking under the regularization of tag correlation and sample similarity.
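To make the nearest neighbor flavour of MLL concrete, the following minimal Python/NumPy sketch aggregates the labels of the k visually closest training images into per-label confidence scores. It is an illustrative simplification in the spirit of the multi-label KNN methods discussed above [105, 10], not a reproduction of any particular reference; the toy features and labels are invented for the example.

import numpy as np

def knn_label_scores(train_feats, train_labels, query_feat, k=3):
    """Score each label for a query image by voting over its k nearest
    training images (Euclidean distance in feature space)."""
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    nn_idx = np.argsort(dists)[:k]                 # indices of the k nearest images
    # Average the binary label vectors of the neighbors -> per-label confidence
    return train_labels[nn_idx].mean(axis=0)

# Toy example: 4 training images, 3-D features, 4 possible labels
train_feats = np.array([[0.1, 0.2, 0.9],
                        [0.2, 0.1, 0.8],
                        [0.9, 0.8, 0.1],
                        [0.8, 0.9, 0.2]])
train_labels = np.array([[1, 1, 0, 0],   # e.g. {sky, sea}
                         [1, 0, 0, 1],   # e.g. {sky, boat}
                         [0, 0, 1, 1],
                         [0, 1, 1, 0]])
query = np.array([0.15, 0.15, 0.85])
print(knn_label_scores(train_feats, train_labels, query, k=2))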
4.2.2. Multi-instance Multi-label Learning
Multi-instance learning (MIL) is a way of dealing with objects which are described by numerous instances. The training set in MIL usually has incomplete labels. Multi-instance means an object has various instances with a single label. In complex applications, an object may have various instances, hence multiple feature vectors, and only one of these vectors may represent that object; learning in this setting is called MIL [171]. The ambiguity associated with the training dataset, where out of various instances at least one instance present in the bag represents that object, can be handled efficiently using MIL [172]. The presence of at least one such instance in the bag means that the bag is positive; otherwise the bag is labeled as negative [172]. On the other hand, MLL deals with the situation where one object is described by more than one class label [162]. In MLL, an instance can be classified in more than one class, which means classes are not mutually exclusive. Multi-instance multi-label (MIML) learning is a combination of MIL and MLL. MIML is essentially a supervised learning problem with an ambiguous training set where each object in the training set has multiple instances and belongs to multiple classes [151, 161]. One of the early works proposed in the MIML framework is the MIMLBoost and MIMLSVM methods [161] for scene classification.
Single instance single label supervised learning is a kind of degenerated version of multi-instance learning and multi-label learning, which in turn are degenerated versions of MIML learning [161]. In this process of degeneration, some of the important information in the training set may be lost. To tackle this problem, an extended version of [161], called direct MIMLSVM (D-MIMLSVM), is proposed in [151], which can also handle the class imbalance by utilizing either MIL or MLL as a bridge. Later in [154], the nearest neighbor based MIML method (MIML-KNN) is proposed to deal with the degeneration problem. The proposed method utilizes the K nearest neighbors and their citers to effectively deal with the degeneration problem. Finding the co-occurrence among labels, and between instances and labels, is key to MIML learning. To model the relationship between instances and labels, and between labels, an efficient MIML method using a Gaussian process prior is proposed in [155]. An annotation method that utilizes labeled data only at the bag level, instead of predicting labels for previously unseen bags, uses an optimized regularized rank-loss function and is called the rank-loss support instance machine [152]. An LDA based MIML learning method can be seen in [153], where the proposed method exploits both visual and textual features by combining MIML learning with a generative model. Finding the correlations between labels, and between instances and labels, simultaneously is a time consuming process. To reduce the complexity of the MIML system, MIMLfast [156] can be used, which is much faster than existing MIML learning methods. In [22], for given image level labels and object attributes, the image is first over-segmented into superpixels, and the object and its attribute association with the segmented object are obtained using a Bayesian model based on a non-parametric Indian Buffet Process (IBP), which is weakly supervised and hierarchical. The proposed method, called weakly supervised Markov Random Field Stacked Indian Buffet Process (WS-MRF-SIBP), can be deemed a MIML learning method where each superpixel can be considered as an instance and each image as a bag.

A multi-instance multi-label learning framework based image annotation method, called context aware MIML (CMIML) [92], works in three stages. In stage one, multiple graph structures are constructed for each bag to model the instance context. A multi-label classifier with a kernel is designed in stage two using a mapping of the graph to a Reproducing Kernel Hilbert Space. In the third stage, a test image is annotated using stage one and stage two. Recently, a MIML learning framework based medical image annotation method [49] utilizes CNN features. A CNN is used to extract region based features, and then the authors use a sparse Bayesian MIML framework which uses a basic learner and then learns the weights using a relevance vector machine (RVM).
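The bag/instance view used throughout MIML can be illustrated with a small sketch: each image is a bag of instance (e.g., segment) feature vectors, instance-level label scores are produced by some scoring function, and bag-level label scores are obtained by max-pooling over instances, reflecting the standard MIL assumption that a positive bag contains at least one positive instance. This is only a schematic illustration under that assumption; the instance scorer here is a random stand-in, not the classifier of any cited method.

import numpy as np

rng = np.random.default_rng(0)

def bag_label_scores(bag, instance_scorer):
    """bag: (n_instances, n_features) array. Returns per-label bag scores by
    max-pooling instance-level scores (MIL assumption: a bag is positive for a
    label if at least one of its instances is)."""
    inst_scores = np.vstack([instance_scorer(x) for x in bag])
    return inst_scores.max(axis=0)

# Stand-in instance scorer: a random linear model with sigmoid output over 5 labels
W = rng.normal(size=(5, 8))
instance_scorer = lambda x: 1.0 / (1.0 + np.exp(-(W @ x)))

# A bag with 3 instances (e.g., 3 superpixels), 8-D features each
bag = rng.normal(size=(3, 8))
print(bag_label_scores(bag, instance_scorer))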
4.2.3. Multi-view Learning
An object has various characteristics, and it can be represented using multiple feature sets. These feature sets are obtained from different views (texture, shape, color, etc.). The different feature sets are usually independent but complementary in nature, and one view alone is not sufficient to classify an object. A combination of all feature sets gives much more discriminative power to characterize an object. When we combine all the feature sets directly in a vector form (without following any systematic concatenation rule), the concatenation is not meaningful as each feature set has specific statistical properties. Multi-view learning is concerned with the machine learning problem of providing a systematic concatenation of features from different views (multiple distinct feature sets) in such a way that the obtained feature vector is physically meaningful. In multi-view learning, heterogeneous features from different views are integrated to exploit the complementary nature of the different views in the training set. The concatenation of features from two or more views can be implemented using various methods [173, 145, 143, 144, 147, 174, 175, 130]. The various methods explored in [173] are primarily based on machine learning techniques and cover supervised and semi-supervised multi-view learning. The complementary property of different views can be explored using [145] with a low-dimensional embedding where the integration of features from different views has an almost smooth distribution over all views. A probabilistic framework based multi-view learning method (m-SNE) [143] uses pairwise distances to obtain a probability distribution by learning the optimal combination coefficients over all the views. Later in [147], this pairwise distance is replaced with a high-order distance using a hypergraph, where each data sample is represented by a vertex, and a centroid and its K nearest neighbors are used to connect a vertex with other vertices using hyperedges. The proposed probabilistic framework based method (HD-MSL) [147] produced excellent classification accuracy. A detailed review of different multi-view learning approaches can be seen in [176, 177]. In [176], various subparts of multi-view learning are discussed in detail. Manifold regularization based multi-view learning approaches [146, 175] explore the intrinsic local geometry of different views and produce a low-dimensional embedding. Inspired by vector valued manifold regularization, a multi-view vector valued manifold regularization (MV3MR) is proposed in [146], which assumes that different views have different weightage and learns combination coefficients to integrate all the views for multi-label image classification. Semi-supervised multi-label learning methods [175, 174] can exploit the unlabeled set as well to improve the annotation performance. In [175], the correlations among labels, multiple features and data distributions are explored simultaneously using manifold regularization, and the proposed method is called manifold regularized multi-view feature selection (MRMVFS). A bipartite ranking framework to handle the class imbalance in semi-supervised multi-view learning based annotation [174] learns view specific rankers from labeled data, and then these rankers are improved iteratively. Multi-view learning can also be implemented using Hessian regularization [130, 178], which has the advantage of providing an unbiased classification function. The multi-view Hessian regularized (mHR) features can be used with any classifier for the classification (used with SVM in [130]). In [178], Hessian regularization with discriminative sparse coding, which has a maximum margin for class separability, is proposed for multi-view learning based AIA. In [23], multiple features from different views are concatenated and used with a graph based semi-supervised annotation model to annotate the images. To deal with the problem of the large volume of storage space required for a large number of images, the authors generate prototypes in feature space and concept space using a clustering algorithm. Then, using a feature fusion method, the best subset of features in both spaces is chosen. For any test image, its nearest cluster is chosen in both feature space and concept space.
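A very simple (and deliberately naive) baseline for combining views is weighted concatenation after per-view normalization, so that no single view dominates purely because of its scale; learned combination coefficients, as in m-SNE or MV3MR, would replace the fixed weights used here. The sketch below is a minimal illustration of this idea under those assumptions, not an implementation of any cited method, and the toy views are invented.

import numpy as np

def fuse_views(views, weights=None, eps=1e-12):
    """views: list of (n_samples, d_v) arrays, one per view (color, texture, ...).
    Each view is z-score normalized per dimension, scaled by a view weight,
    and the results are concatenated into a single feature matrix."""
    if weights is None:
        weights = np.ones(len(views)) / len(views)
    fused = []
    for X, w in zip(views, weights):
        Xn = (X - X.mean(axis=0)) / (X.std(axis=0) + eps)   # per-view normalization
        fused.append(w * Xn)
    return np.hstack(fused)

# Toy example: 10 images, a 6-D "color" view and a 4-D "texture" view on very different scales
rng = np.random.default_rng(1)
color, texture = rng.normal(size=(10, 6)), 100 * rng.normal(size=(10, 4))
X = fuse_views([color, texture], weights=[0.6, 0.4])
print(X.shape)   # (10, 10)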
4.2.4. Distance Metric Learning
When image annotation is performed based on visual features, the problem of the semantic gap arises. Another approach for annotation is retrieval based annotation. In retrieval based annotation, a set of similar images is retrieved, and an unknown image is tagged based on the tags of the retrieved images. To find the similar images in a training set, the distance between images has to be measured. Learning these distances among images is called distance metric learning. The accuracy and efficiency of the learned distance metric play an important role in neighbor selection, i.e., the selection of similar images. If the distance metric is inaccurate, the selected similar images may not be similar, and this will lead to the assignment of irrelevant tags to a new image. Various methods for distance metric learning have been proposed [148, 132, 107, 134, 106, 150, 149, 108, 153]. The distance among images is learned based on visual similarity and label specific similarity.
When the dataset is provided with contextual constraints, a linear data transformation function (discriminative component analysis [148]) is learned to optimize the Mahalanobis distance metric. For a non-linear distance metric, discriminative component analysis (DCA) is used with a kernel, and this extended method is called kernel DCA (KDCA) [148]. A linear metric is often unable to accurately capture the complexity of the task (multimodal data, nonlinear class boundaries), whereas a non-linear metric can represent complex multi-dimensional data and can capture nonlinear relationships between data instances with the contextual information. For the nearest neighbor classifier, the accurate calculation of distances among images or tags plays a significant role in the overall accuracy of the method. A classical method for learning a Mahalanobis distance metric for the KNN classifier is proposed in [132] and called large margin nearest neighbor (LMNN). The proposed LMNN uses a labeled training set, and the target nearest neighbors, initially determined using the Euclidean distance, are iteratively changed during the learning process. Later in [107], LMNN is extended to the multi-label problem and is used with the proposed two pass KNN (2PKNN) for AIA. Inspired by linear discriminant metric learning, which is based on the Mahalanobis distance, a label specific distance metric, where a distance is calculated for each specific label, is proposed in [134]. The proposed method is used with weighted KNN for multi-label annotation and, as claimed, reduces false positive and false negative labels. A weighted nearest neighbor based model (TagProp) [106] uses either ranking or a distance metric to update the weights. When distance is used to update the weights, a distance metric is learned by directly maximizing the log-likelihood using a gradient algorithm. An inductive metric learning method, which is inspired by [132], and a transductive metric learning method, where both textual and visual features are exploited to learn the distance, are unified together to learn the distance metric in [150]. This combined technique is called unified distance metric learning (UDML). In [138], two types of classifiers are used with distance metric learning, where LMNN [132] is used with KNN and multiclass logistic regression based distance metric learning is used with the nearest class mean (NCM). For multi-label classification, a class to image (C2I) distance learning method [149] uses a large margin constraint for the error term and the L1-norm for regularization in the objective function. While learning the distance metric, if a linear transformation is used (as in linear Mahalanobis distance metric learning [132]), the non-linear relations among images cannot be represented. To deal with non-linear relationships among images, usually a kernel is used with the linear function. A kernel metric learning (KML) based distance calculation technique is proposed in [108], where the authors propose a robust KML (RKML) to measure the distance. RKML is based on regression to deal with the high dimensionality problem of KML. An AIA approach using probability density function (PDR) based distance metric learning is proposed in [179]. The proposed distance learning technique can deal with some missing labels as well.
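The Mahalanobis distance underlying LMNN, DCA and related methods has the form $d_M(x_i, x_j) = \sqrt{(x_i - x_j)^\top M (x_i - x_j)}$ with a learned positive semi-definite matrix $M$; when $M = L^\top L$, this is simply the Euclidean distance after the linear map $L$. The sketch below only evaluates such a distance for a hand-picked $M$; learning $M$ from constraints or labels is the actual subject of the cited methods and is not shown.

import numpy as np

def mahalanobis(x, y, M):
    """Mahalanobis distance with PSD matrix M; equals the Euclidean distance
    in the space transformed by L when M = L.T @ L."""
    d = x - y
    return float(np.sqrt(d @ M @ d))

x = np.array([1.0, 2.0])
y = np.array([2.0, 0.0])
L = np.array([[2.0, 0.0],       # hand-picked linear transform (stand-in for a learned one)
              [0.0, 0.5]])
M = L.T @ L
print(mahalanobis(x, y, M))           # distance under M
print(np.linalg.norm(L @ x - L @ y))  # same value, computed in the transformed space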
Discussion: The ease and efficiency of learning based image annotation techniques and their good accuracy have paved the way for future research in this direction. On the one hand, multi-label learning is the desirable facet of image annotation. On the other hand, multi-view learning accords more discriminative feature selection capability. The power of multi-view learning can be best utilized if we incorporate multi-view features with MIML and distance metric learning techniques. The combination of visual and textual features in the nearest neighbor method is easy to implement and produces higher accuracy with excellent efficiency. Although distance metric learning based annotation has not been explored to its full extent, we believe it can produce excellent results for annotation. We hope that there will be a lot more work in AIA using learning based methods in the near future.
4.3. Image Annotation Based on Tag Length
When we consider the length of the tag for AIA, there can be a fixed number of tags or a variable number of tags associated with images. When an image is annotated, it is assigned labels based on the contents of the image. A fixed number of tags refers to the assignment of n labels to the images. Once n is fixed, it remains constant for all images. Here, the number of objects present in the image doesn't have much significance, although the contents of the image may be utilized for annotation. This type of annotation may result in a higher false positive rate, especially when the number of objects present in the image is less than n. In reality, an image may contain many objects, so while labeling this type of image, labels have to be assigned to the relevant objects. The labeling of all the relevant objects may result in a variable number of labels for different images. While it is comparatively easy to implement image annotation with a fixed number of tags, variable length tags represent the realistic contents of an image. Both fixed length tags and variable length tags are part of multi-label annotation as long as n > 1.
4.3.1. Fixed Length Tags
The conventional annotation approach follows fixed length annotation, where the number of tags is fixed for all the test images irrespective of their content. In the fixed length tag annotation approach, a set of candidate labels is obtained, and the final list of tags is fixed from the candidate labels. Once the candidate labels are obtained, selecting the final list of annotation tags is essentially a label transfer problem. The label transfer schemes deal with the decision making process where a list of n keywords is selected as the final tags for any query image. A probability value is assigned to each candidate label, and all the candidate labels are ranked according to their probability. This probability indicates the confidence score for the candidate annotation. Ranking based annotation approaches revolve around the following methods: (i) The candidate annotations are ranked according to their probabilities, and the top n labels with the highest probability are selected as the final annotation. (ii) A threshold based approach can be followed where all the candidate labels whose probability is greater than the threshold are selected as final labels. When the top n labels are selected as the final annotation, the annotation approach is easy to evaluate. The value of n is variable, but once it is fixed, the same number of labels is selected for all the images. Selecting the appropriate value of n is a point of concern here. If n is too small, the recall capability of the annotation will be degraded. If n is too large, the precision of the annotation method will fall. The size of n is usually fixed at 5 [12, 97, 180, 27], or sometimes it may be variable [181, 108, 113, 15].

In the threshold based label transfer approach, each candidate label is assigned a probability, and a threshold is used which acts as a boundary line. All the labels having a probability greater than the threshold are selected as the final annotation. The threshold based approach may produce a different number of tags for different images. Selecting the appropriate value for the threshold is again a matter of concern. The optimal value for the threshold can be decided by an expert or on expert advice, as it may vary from image to image. There are various works where the threshold based approach has been used [112, 110, 17]. A modified threshold based approach is proposed in [17] where the number of labels is decided according to the number of final labels in the candidate set. A greedy approach for label transfer is proposed in [9]. The authors propose a two-stage label transfer algorithm where neighbors are arranged according to their distance from the query image; if possible, the method first tries to create the final label set from the first neighbor, and if not, the proposed approach uses the other neighbors as well. The labels in the neighbors are ranked based on their frequency in the training set [9]. Both label transfer rules are illustrated in the sketch after this paragraph.
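The two label transfer rules described above (top-n and threshold) reduce to a few lines once per-label confidence scores are available; the threshold rule is also the basic mechanism behind the variable length tags discussed in the next subsection. A minimal sketch with hypothetical scores:

def top_n_labels(scores, n=5):
    """Keep the n labels with the highest confidence (fixed length annotation)."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

def threshold_labels(scores, t=0.5):
    """Keep every label whose confidence exceeds t (may yield variable length)."""
    return [label for label, s in scores.items() if s > t]

scores = {"sky": 0.92, "sea": 0.81, "boat": 0.55, "dog": 0.10, "tree": 0.48}
print(top_n_labels(scores, n=3))      # ['sky', 'sea', 'boat']
print(threshold_labels(scores, 0.5))  # ['sky', 'sea', 'boat']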
The popularity of fixed length tag annotation approaches is a result of easy and simple evaluation criteria. Also, the availability of various state-of-the-art methods for fixed length annotation, and the lack of baselines and state-of-the-art methods for arbitrary length annotation, are the reasons behind the inclination towards the fixed length annotation approach. The state-of-the-art methods help in the comparative evaluation of the proposed methods.
4.3.2. Variable Length Tags
All the relevant objects present in the image should be assigned a label. Thus, the length of the label set is determined by the contents of the image. Annotating an image with all the relevant labels is the realistic way of image annotation. Predicting the proper number of tags is the foremost task of any variable length tag annotation approach, so that appropriate tags can be assigned given a list of tags. Variable length tags can be predicted using a slightly modified threshold based label transfer approach [9, 17]. When a threshold value is fixed, all the labels having a confidence score greater than the threshold are selected as final labels. The number of labels having a confidence score greater than the threshold may vary from image to image, which results in variable length tags. But finding the optimal threshold is something that may cause worry. Recently, deep learning techniques [182, 183, 184] have been used for variable length label prediction. The basic idea behind deep learning based arbitrary length tag annotation methods is to use a neural model that can automatically generate a list of labels. Inspired by image caption generation methods, a recurrent neural network (RNN) based variable length annotation method is proposed in [183]. The proposed RNN model uses a long short term memory network (LSTM) as its sub-module, which has been used extensively for image caption generation [185]. As LSTM requires ordered sentences in the training set for caption generation, the tags in the training set of the annotation dataset have to be ordered. Four types of methods to order the tags have been suggested and implemented in [183]. The authors use tag to tag correlation, where the prediction of the next keyword is informed by the immediately previous prediction. CNN-RNN based multi-label classification [182] extracts visual features using a CNN, and the semantic label dependency is obtained using an RNN with LSTM as its extension. Then, a multilayer perceptron is used to predict the labels. Later in [184], an extension of [182, 183] is presented to make the learning more stable and faster. A semantic regularized layer is put in between the CNN and the RNN, which regularizes the CNN network and also provides the interface between them to produce more accurate results [184].
Discussion: Fixed length annotation methods have tried to solve the problems of class imbalance, noisy datasets, incomplete labels, etc. by utilizing the correlation among labels and visual features. Previous multi-label classification based methods use one-vs-all classifiers, hence multiple classifiers have to be trained for multi-label classification. Variable length tag based annotation is a realistic way of label assignment. It allows labeling all the relevant contents present in the image. However, arbitrary length tag annotation methods have not been explored in detail. The success of [183, 182] gives us hope that deep learning based models can be used to annotate images with variable length tags; this is still in its early years of research. Also, the time taken to train a deep learning based model is very high. However, the increasing capabilities of modern day computer systems have made it easier to train deep models. We believe that a lot more work will be carried out on the variable length tag annotation approach in the near future.
4.4. Image Annotation Based on Training Dataset
To design an annotation model, the first thing to consider is the training dataset. The type of training dataset has a significant influence on the final annotation model. The dataset can be categorized into three sets - (i) the training set has complete labels. This indicates that all the images in the training set are manually annotated and contain a complete set of tags. Training with this type of dataset is known as supervised learning. A fully manually annotated dataset is expensive and time consuming to obtain, but once it is obtained, a very efficient model can be designed using supervised learning. (ii) The training set has incomplete labels. This type of dataset is either manually annotated, or annotation is performed by users over social sites, but the images are not fully annotated. As the images are labeled by the users, the training images may contain noisy tags or have an incomplete set of labels. Training with this kind of dataset is known as semi-supervised, weakly supervised or active learning. SSL based annotation models are very impressive as they can adapt themselves to very large scale and expanding datasets. (iii) The training dataset is not annotated at all. The images in the training dataset are not labeled with any tags, but each image is equipped with some metadata like GPS data, URL, etc. In such cases, the candidate annotation tags have to be mined from the metadata associated with the images. Training with this kind of dataset is known as unsupervised learning. It is difficult to design an unsupervised learning based annotation model with high efficiency and accuracy.

The representational accuracy of the training dataset has a direct influence on the performance of any learning model [186]. If the dataset contains any noise, then it has to be removed before training a model. The refinement of a noisy or incomplete dataset leads to the active learning based approach.
4.4.1. Supervised Learning
When the training dataset is provided with corresponding output labels, the design of the model is somewhat straightforward. The supervised training dataset is given as $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_m, Y_m)\}$, where $X_i$ is the input set and $Y_i$ is the corresponding output set. In the multiclass scenario, the $i$th output label set is $Y_i = \{y_i^1, y_i^2, \ldots, y_i^P\}$, where $P$ is the number of output classes for the $i$th input data. The input data can also have multiple features, represented as $X_i = \{x_i^1, x_i^2, \ldots, x_i^N\}$, where $N$ is the number of features for the $i$th input data. For image annotation, an image may be annotated with multiple keywords, hence a single training data item is given as $(x_i^1, x_i^2, \ldots, x_i^N, y_i^1, y_i^2, \ldots, y_i^P)$, where $x_i^j$ is an element of the set $X_i$ of input image features and $y_i^j$ is an element of the set $Y_i$ of output keywords for the $i$th training image. The input features can be either features extracted from the input images, or correlated features extracted from the image and tags collectively. The size of the training dataset may also influence the overall performance of the model, as more training images mean less generalization error.
Obtaining a large, fully manually annotated dataset is burdensome, time consuming and expensive. Examples of supervised datasets for annotation are [187, 188, 189, 190]. The advantage of supervised learning is that the training dataset enables the model to learn the concept characteristics and classification rules. Thus, a large training set will enhance the capability of the trained model. The basic idea behind all the image annotation models based on supervised learning [180, 191, 18, 120, 192, 1, 193, 121, 92, 142, 110, 194] is to effectively exploit the labeled data in the training set. The discriminative supervised annotation models [192, 180, 194] train a classifier by exploiting visual and textual features of labeled images. Class imbalance is also an issue faced in fully annotated datasets. Each label should have a sufficient and almost similar number of training examples to avoid underfitting of the decision boundary. Thus, it poses a scalability problem. A supervised annotation model where both retrieval and annotation are considered as a classification problem defined by the database rather than the query is proposed in [180]. The dataset [187] used in [180] is fully manually annotated, and the proposed probabilistic model calculates the conditional probability density given the feature vector and class label. An extension of the regression model into a classification model [60], by exploring the annotations and class labels jointly, modified supervised LDA (sLDA) into multiclass sLDA for simultaneous classification and annotation. The multi-label annotation task, where the number of labels varies across images, further complicates the design and complexity of the supervised model. If there is a vast number of output labels, then the training time for the annotation model is very high.

In supervised learning, once the model is trained, the training dataset becomes obsolete. So, the training time of the model is usually considered as the computational complexity of the model. The supervised image annotation model [92] uses a multi-instance learning technique to model the instance context, and a kernel based classifier is used for multi-label annotation. The graph based model [110] utilizes the co-occurrence of semantic concepts of training images, and, using visual features and semantic concepts, the probabilities of the candidate tags are obtained.
4.4.2. Semi-supervised Learning
The main drawback of supervised learning is that it requires a large number of annotated training images, which is difficult to obtain. Also, for large scale datasets, the training time of supervised models is usually very high. To deal with these complications, a new way to train the model, called SSL, weakly supervised learning or active learning, has been proposed in the literature [195, 43]. SSL uses only a small number of labeled data and makes use of unlabeled data to train a model. SSL can deal with noisy, incomplete and imbalanced training datasets. The basic motive behind SSL based models is to reduce the size of the labeled training set. When the image dataset is annotated by users [188, 190], the tagged images are usually noisy: the tagged labels may not represent the contents of the images accurately, and user-tagged images are either incomplete or over-tagged. Over-tagging is a kind of noisy tagging which has to be removed. The noisy tags have to be replaced with relevant tags that accurately reflect the concepts of the image. Various methods for denoising the tags [26, 168, 24] have been proposed in the literature. To deal with incomplete tags, various techniques [21, 24] have been proposed. A statistical model that explores the word to word correlation [19] can be used directly with any active learning method to reduce the required number of annotated examples significantly. The use of a mixture model in a semi-supervised framework [80] showed good results in the early age of semi-supervised learning. Green's function is very effective for single label tagging in the semi-supervised framework, and its extended version [196], which exploits the label correlations, can be used for multi-label annotation in the semi-supervised framework.

The popularity of SSL based image annotation is because of its effectiveness over the growing size of image databases with noisy and incomplete tags. An SSL based training dataset contains only a small number of fully labeled images and a majority of unlabeled or noisily labeled images. An SSL based annotation method either first trains the model on a small set of labeled data and then refines the noisy labels using various correlations, or first refines the noisy labels and then trains a model on the whole dataset. Hessian regularization based annotation models [100, 130, 178], which operate in the semi-supervised framework, have produced good results. The multi-view Hessian regularization (mHR) [130] combines multiple kernels and HR obtained from various views. The sparse coding based SSL model [178], which uses mHR to annotate images, produced competitive results.

The use of graph based models in the SSL framework [159, 140, 6, 139, 174] has exploited both labeled and unlabeled images. Graph based models can also be used to find the consistency among labels and the label correlation [159]. Usually, SSL based annotation models can't handle a very large number of unlabeled images and have a limitation on the maximum number of unlabeled images. In [140], large scale multi-label propagation (LSMP) is proposed to handle a large number of unlabeled images in the SSL framework. A graph Laplacian based method for labeled and unlabeled images is proposed in [6]. The proposed method is called structural feature selection with sparsity (SFSS), where a graph Laplacian is constructed on the visual features of images to find the label consistency. Random walk with restart (RWR) used for SSL over a bi-relational graph (BG) [139] measures class to image correlation and is a very good example of a graph based model for AIA in the SSL framework.
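The core of many graph based SSL annotation models is label propagation over an image similarity graph. A minimal sketch of the local-and-global-consistency style update $F \leftarrow \alpha S F + (1 - \alpha) Y$ (in the spirit of LGC [170] and its multi-label extensions) is given below; the dense RBF similarity graph and the toy data are assumptions made for the example, not the construction used by any particular cited method.

import numpy as np

def propagate_labels(X, Y, alpha=0.9, sigma=1.0, iters=50):
    """X: (n, d) features for labeled + unlabeled images; Y: (n, L) initial label
    matrix (zero rows for unlabeled images). Iterates F <- alpha*S*F + (1-alpha)*Y
    with S the symmetrically normalized RBF affinity matrix."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                       # no self-loops
    Dinv = 1.0 / np.sqrt(W.sum(axis=1))
    S = Dinv[:, None] * W * Dinv[None, :]          # D^{-1/2} W D^{-1/2}
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F                                       # soft per-label scores

X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
Y = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])     # rows 2 and 4 are unlabeled
print(propagate_labels(X, Y).round(2))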
To learn an agreement between keywords and image regions, multiple instance learning (MIL) techniques have been exploited [197] with deep learning features in the SSL framework. A simple assumption about the keywords associated with the training set is that at least one keyword provided with the initial data is correct; then, using a deep neural network and MIL, the class of the image is predicted. The use of an instance selection approach to reduce the complexity of large scale image annotation systems in the SSL framework is proposed in [198]. The proposed method uses a single label prototype selection based instance selection method and extends it to the multi-label setting. The proposed method [198] removes those instances which do not affect the classification. The association of image objects and their attributes in a semi-supervised framework [22, 199] achieved an excellent performance. By using a non-parametric model in SSL, the authors tried to find the association between object and attribute.
The fusion of multigraph learning, matrix factorization and multimodal correlation using a hash function, for learning the annotation model from only a small number of labeled tags, is proposed in [25]. The proposed method can handle a very large number of unlabeled images.
4.4.3. Unsupervised Learning
Unsupervised learning based methods are among the most attractive annotation methods. They are perfectly suited to the large number of unorganized images available these days. Unsupervised learning based annotation methods do not require labeled training images (strongly or weakly labeled), which is one of the strongest points of these methods. However, it is not that unsupervised learning based annotation methods can annotate images out of nowhere. They also need text to label any unlabeled image, but the candidate labels are mined from the metadata. As we know, each image on the web has a URL, some text surrounding the image and some other information associated with the image. This information and text associated with the image are called metadata. An unsupervised learning based annotation method mines labels from this metadata and annotates the image. As the candidate labels are obtained from the image metadata, unsupervised learning based methods don't require a fully or partially labeled training dataset and can annotate images without training a model [15]. The metadata of the image, such as URL, GPS info, surrounding text, filename, etc., usually provides a significant clue about the concepts of the image. Although not all the texts/words present in the metadata are relevant for annotation, the metadata contains almost all the candidate labels that can perfectly describe the contents and concepts of the image. Mining the candidate labels from the metadata and finding associations among candidate labels to produce the final labels for an image is a challenging task. Mining the labels from metadata also enables variable length tags, which is again a driving factor for the attractiveness of unsupervised learning based image annotation. The metadata of images may be noisy and unstructured, thus the candidate label detection method should be robust enough.
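As a toy illustration of candidate label mining from metadata, the sketch below tokenizes a filename, URL and ALT text, removes stop words and keeps only tokens present in a controlled vocabulary. Real systems (e.g., the ImageCLEF entries discussed below) use far more elaborate text processing and visual-textual correlation; the vocabulary, stop word list and metadata here are invented for the example.

import re

STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "at", "img", "jpg", "www", "com"}
VOCAB = {"beach", "sunset", "dog", "sea", "mountain", "boat"}   # controlled keyword list

def candidate_labels(metadata):
    """metadata: dict of free-text fields (filename, url, alt, ...).
    Returns vocabulary words found in any field, as candidate tags."""
    tokens = set()
    for text in metadata.values():
        tokens.update(t for t in re.split(r"[^a-z]+", text.lower()) if t)
    return sorted((tokens - STOPWORDS) & VOCAB)

meta = {"filename": "Sunset_at_the_Beach_042.jpg",
        "url": "http://www.example.com/photos/sea/boat-trip.html",
        "alt": "a small boat on the sea at sunset"}
print(candidate_labels(meta))   # ['beach', 'boat', 'sea', 'sunset']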
In one of the early works on unsupervised learning based annotation [28], a PLSA based model is used which is first trained on the text and then on the visual features, and then the images are annotated using inference. A reinforcement learning based graphical model [17] obtains initial candidate annotations from surrounding texts (e.g., filename, URL, ALT text, etc.); then, by mining semantically and visually related images from a large scale image database, the initial candidate annotations are further refined, and a ranking technique is used to obtain the final annotation. The annotation of personal photo collections [16] uses GPS, timestamps and other metadata to find the correlation between scene labels, and between scene labels and event labels. A restricted Boltzmann machine (RBM) based model for geographical image annotation is presented in [13].

Since 2012, ImageCLEF has been organizing a scalable image annotation task, in which various expert teams from all over the world participate. The idea is to find a robust relationship between images and their surrounding text. The focus is mainly on obtaining the annotation from web metadata. The motive is to design an image annotation system where the number of keywords is scalable. A large number of teams participate in the event, and some teams have come up with excellent proposals [200, 46, 201, 202, 45, 44] to deal with noisy and irrelevant texts associated with web images. Although each year's annotation challenge includes various subtasks, scalable concept detection and tagging remain the center of the challenge. Overall, significant progress has been made towards unsupervised learning based image annotation. Still, finding the correlation between metadata and images is a very challenging task, as metadata provides very weak supervision.
Discussion: Although supervised annotation has reached an advanced stage and has achieved an excellent level of accuracy, it is not desirable in the current situation for a number of reasons: (i) It requires a large number of fully manually annotated images, which is a time consuming process and difficult to obtain. (ii) The training time of the supervised model is high. (iii) If the training data changes, we need to retrain the model. (iv) It is very difficult to scale the model, hence it is not suitable for large scale growing databases. (v) The length of the tag is fixed. (vi) It can't handle noisy or incomplete training sets. To deal with these problems, the SSL based annotation model has been introduced, which has gained significant popularity. Although it can deal with almost all the problems of the supervised model mentioned above, it still needs some manually annotated images. Also, a semi-supervised model usually has a limitation on the maximum number of unlabeled images it can handle. The length of the tags remains a challenge for the SSL based model, as it is usually fixed. Fixed length tags may not describe the contents accurately. All the images on the internet are associated with metadata (e.g., filename, alternate texts, surrounding texts, URL, etc.). This metadata can be mined to obtain the tags for the image. The unsupervised learning based models can produce variable length tags and don't require labeled data. Training the model over the image metadata is a challenging task, as most of the texts in the metadata are unrelated and/or redundant. This noisy metadata has to be mined and refined to obtain the tags. Although the task is challenging, it has become very popular as it suits the real life situation of dealing with large scale databases and web images.

The recent advancement in the processing capabilities of chips, the increased size of training data, and advances in machine learning research have led to the widespread application of deep learning techniques. A deep learning model consists of multiple stages of non-linear information processing units to represent features at successively higher levels of abstraction in a supervised or unsupervised manner [203]. To tackle the problem of the semantic gap, deep learning based feature representation techniques [204] can be utilized, which contain richer semantic representations than the handcrafted feature extraction methods. One of the key requirements for deep learning is to have a large number of training data. Also, the training of a deep neural network from scratch is a very time-consuming process. Nowadays, there exist several publicly available datasets with a very large number of training images. Also, we can generalize pre-trained deep learning models to tackle the problem of the requirement of a large number of training data and to reduce the training time. The use of deep learning in a semi-supervised learning framework for AIA can be seen in [197]. The authors proposed a MIL based deep learning architecture for AIA and achieved commendable results. Although deep learning models are extensively being used for various computer vision tasks and have shown benchmarking performance, the application of deep learning for AIA is still in its early stage and needs the active participation of the research community.
4.5. Image Annotation Based on User Interaction
Another way to view an image annotation system is through a user's interaction with the system. If the system is fully automatic, then there is no interaction with the user; once the data is submitted, the system automatically performs the annotation. If the system is fully manual, then a human expert performs the annotation, which itself is a tedious, time consuming and very expensive process. Annotation can also be performed in a semi-automatic manner where most of the processing is handled by the system and the user can interact with the system using relevance feedback or other mechanisms to improve the confidence of the model. The focus of most researchers is towards AIA [174, 26, 25, 19, 142, 98, 150, 141, 10, 192, 50]. AIA can be achieved through CBIR provided that the problem of the semantic gap is addressed effectively. The advantages of AIA include efficient linguistic indexing where a query can be specified in natural language. The basic idea of AIA is to minimize user intervention during annotation. Although a human can annotate images with high accuracy, one can't expect that all humans have the same intelligence [205]. A novice user has very little knowledge about the contents of the image in association with correlated keywords. Hence, an image annotated by a novice user may contain a completely different set of labels from the labels assigned by an expert. However, when a machine performs annotation, this subjectivity of the annotated labels doesn't make any sense, and the same set of labels is produced on every run.
Although AIA removes the subjectivity, once the model produces a final annotation, it can't be changed or improved. Also, AIA requires an accurate training dataset so that the classifier can be trained accurately. If the training dataset is of low quality, the performance of the model will degrade [206]. This has led the focus of researchers towards semi-automatic image annotation. The performance of an annotation model can be improved greatly with a little intervention from the user. The user intervention can be in the form of relevance feedback or identification of the relevant areas of the image to be annotated. Also, a semi-automatic annotation system doesn't require a high quality ground truth dataset. Some of the promising work in the field of semi-automatic image annotation is presented in [207, 208]. A semi-automatic annotation system called Photocopain [207] takes information from camera metadata, GPS data, calendar data, etc. to annotate an image and has the scaling ability to adopt new annotation or classification services. To represent the semantics of an image, tags can be linked using predicate words [208]. These semantic links among image tags are created semi-automatically.

Relevance feedback based annotation systems try to improve the candidate annotation based on the user's feedback [209, 208, 210]. In this kind of model, first a set of candidate labels is obtained automatically, and then, using the relevance feedback, the candidate tags are further refined to generate the final annotation. Here, relevance feedback is one of the forms of a user's interaction with the system during the annotation process, which is why we call it a form of a semi-automatic system. A feature selection strategy to annotate images [209, 211] also incorporates the keyword similarity and relevance feedback to find the similar keywords and visual contents that can be used for the retrieval of similar and relevant images. A relevance feedback based model [210] for medical images uses a classification approach for annotation. The authors used keyword based image retrieval with relevance feedback.

No matter what strategy is used, the semi-automatic annotation model has the advantage of on-site user interaction to improve the quality of annotation. However, it requires a user to be present during the annotation process in one form or other. But it has the advantage of dealing with an incomplete dataset. On the other hand, AIA doesn't require the user's presence during annotation, and the system also has consistency. Because of the availability of a large number of datasets and the advancement in AIA, AIA has been much more popular than semi-automatic annotation. Due to the advancement in AIA techniques, it can even deal with a noisy, incomplete and unstructured dataset.
4.5.1. Image Annotation via Crowdsourcing
Image annotation via crowdsourcing is a collaborative approach to obtain annotated images. The images are annotated by non-expert users (paid or volunteer users). To obtain a fully annotated image database for training and ground truth purposes, either the dataset is labeled by expert annotators, which is time consuming and expensive, or it can be annotated via crowdsourcing, which requires much less time and is less costly. One of the most popular platforms for crowdsourcing is Amazon Mechanical Turk (MTurk). MTurk is an online crowdsourcing platform that allows a requester to post a task, called a human intelligence task (HIT), on MTurk, and turkers (workers) across the world execute the given task in a very short span of time for a small amount of money expended by the requester [212]. The MTurk turkers are non-expert annotators, and a large number of workers usually perform the annotation.

The annotation on MTurk is performed by thousands of non-expert turkers, hence quality control is an essential part of obtaining high-quality annotated data. MTurk has no effective built-in mechanism for quality control and offers minimal control over the participants (turkers) who are allowed to perform annotation [212]. If the images are labeled with free-form text, it will result in a very large, random collection of labels, which is undesirable. Even if the image is tagged from a set of concepts provided by the requesters, images may be annotated with irrelevant and noisy labels, primarily due to lack of motivation, lack of knowledge, and carelessness of the turkers. Also, as an image may be annotated by a large number of turkers with different expertise and backgrounds, consistency of the labels cannot be guaranteed.

MTurk provides a quality test for workers, which may improve the quality of the result if a careful quality test for each worker is performed. We can also reduce the diversity of labels by setting a limit on the number of workers. Apart from these, there are various inter-annotator agreement tests to check the quality of annotated data [213, 214]. Often different turkers label the same image with different sets of labels; if the number of workers is odd, we can break the tie, else we can use an inter-annotator agreement test to judge the quality of the annotated image data.
5. Image Annotation Evaluation Methods
The keywords assigned to an image represent the semantic contents of the image. When an image is assigned only one keyword, it can be considered as single label annotation, or simply a binary classification where a classifier ascertains only the presence or absence of the keyword. Only one label can't represent the true contents of the image. Hence, image annotation methods usually assign multiple keywords to indicate the presence of multiple objects in the image. This type of method is called a multi-label annotation system, or it can also be considered as a multi-class classification system.

To ascertain the accuracy of the annotation system, there exist two broad classes of evaluation measures: (i) the qualitative measure and (ii) the quantitative measure. The qualitative measure [215] deals with human subject based assessment. The subjects are asked to evaluate the performance of the system so that a more comprehensive picture of the annotation system can be obtained. The quantitative evaluation of the system deals with system level evaluation where a ground truth dataset is used to ascertain the accuracy of the system.

For a single label annotation system, the accuracy of the system can be considered as the performance evaluation criterion [151]. Here, accuracy refers merely to the overall percentage of correctly classified test images over the total number of test images. But, in the case of multi-label annotation, the performance evaluation criteria are much more complicated. Also, many annotation systems are ranking based models where labels are ranked based on some confidence factor, which may require a different set of evaluation criteria [216].
For a single label annotation system, the system is provided with a test dataset and the obtained result is evaluated using its precision (1), recall (2) and F1 score (3) as performance evaluation criteria [183].

Precision $P = \frac{TP}{TP + FP}$    (1)

Recall $R = \frac{TP}{TP + FN}$    (2)

F1 score $= 2 \times \frac{P \times R}{P + R}$    (3)

where,
TP (True Positive) = Both the actual and obtained results are the same and indicate the presence of the label.
TN (True Negative) = Both the actual and obtained results are the same and indicate the absence of the label.
FN (False Negative) = Although the label is present in the actual annotation (ground truth), the obtained result shows the absence of the label.
FP (False Positive) = The obtained result shows the presence of the label even though the label is absent in the ground truth.
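For a single test image, these quantities can be computed directly from the predicted and ground truth label sets; a small sketch (with made-up label sets) follows.

def prf(predicted, ground_truth):
    """Per-image precision, recall and F1 from predicted / ground truth label sets."""
    tp = len(predicted & ground_truth)            # correctly predicted labels
    fp = len(predicted - ground_truth)            # predicted but not in ground truth
    fn = len(ground_truth - predicted)            # missed labels
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

print(prf({"sky", "sea", "dog"}, {"sky", "sea", "boat"}))   # approx. (0.67, 0.67, 0.67)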
For multi-label annotation, three evaluation measures, equations (4)-(6), have been proposed [217]. A multi-label annotation system can produce fixed length or variable length keyword lists. In fixed length annotation, multiple keywords are assigned to images, but the number of keywords is fixed; whereas in variable length annotation, the number of assigned keywords varies from image to image. The "one-error" is the multi-label counterpart of the accuracy calculation of the single-label annotation system. It measures how many times the top ranked label is not in the set of true labels; that is, it measures the false positives at the top rank. The second measure, coverage, evaluates the performance of the system in terms of its label ranking over all possible labels. It calculates the average of the maximum rank of the true labels, which indicates how far down the ranked label list one must go to cover all the true labels of an image [217]. The average precision, an image retrieval (IR) performance measure, can be used for label ranking; it calculates the average fraction of true labels ranked at or above a particular true label. Here, a particular rank refers to a fixed value used as a threshold point, and all the labels ranked above the threshold are considered as the output of the system. The image retrieval (IR) evaluation method measures whether the retrieved images are relevant to the query image or not, and how many retrieved images are relevant to the user's query. Whereas, the image annotation evaluation method measures the prediction of accurate labels. It checks how many tags are accurately predicted, how many tags are missing from the result, etc. for a query image.
Let m be the number of training samples. The input dataset is in the form $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_m, Y_m)\}$, where $X_i$ and $Y_i$ are the input and output sets respectively for the $i$th training data. $y_i^j$ is an element of set $Y_i$, $h(X_i)$ represents the set of top K predicted labels for $X_i$, and $rank_h(X_i, l)$ is the rank, derived from the real valued confidence factor, of label $l$ among the top K predicted labels for $X_i$.

One-error $= \frac{1}{m}\sum_{i=1}^{m} \left[\, h(X_i) \notin Y_i \,\right]$    (4)

Coverage $= \frac{1}{m}\sum_{i=1}^{m} \max_{l \in Y_i} rank_h(X_i, l) - 1$    (5)

Avg. precision $= \frac{1}{m}\sum_{i=1}^{m} \frac{1}{|Y_i|} \sum_{l \in Y_i} \frac{\left|\{\, l' \in Y_i \mid rank_h(X_i, l') \le rank_h(X_i, l) \,\}\right|}{rank_h(X_i, l)}$    (6)
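A direct NumPy transcription of equations (4)-(6), assuming each image's labels are ranked by decreasing confidence (rank 1 = most confident), is sketched below with a tiny invented example; it is an illustrative sketch, not the evaluation code of any cited work.

import numpy as np

def rank_of(scores):
    """Convert per-label confidence scores into ranks (1 = highest score)."""
    order = np.argsort(-scores)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def one_error(score_list, Y_list):
    return np.mean([int(np.argmax(s) not in Y) for s, Y in zip(score_list, Y_list)])

def coverage(score_list, Y_list):
    return np.mean([max(rank_of(s)[sorted(Y)]) - 1 for s, Y in zip(score_list, Y_list)])

def avg_precision(score_list, Y_list):
    ap = []
    for s, Y in zip(score_list, Y_list):
        r = rank_of(s)
        ap.append(np.mean([sum(r[l2] <= r[l] for l2 in Y) / r[l] for l in Y]))
    return float(np.mean(ap))

# Two images, 4 labels; truth holds the index sets of the true labels
scores = [np.array([0.9, 0.2, 0.7, 0.1]), np.array([0.1, 0.8, 0.3, 0.6])]
truth = [{1, 2}, {1, 3}]
print(one_error(scores, truth), coverage(scores, truth), avg_precision(scores, truth))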
Later in [216], the authors used equations (4)-(6) along with four other evaluation criteria for multi-label annotation (7)-(9) and label ranking (10). Hamming loss counts the misclassifications of image-label pairs; that is, it takes the average over the cases where the actual and predicted labels differ. Macro-F1 averages the F1-measure over the predictions of the different labels. Micro-F1 calculates the F1-measure over the predictions of all labels as a whole. A low value of Hamming loss and large values of micro-F1 and macro-F1 indicate an excellent performance of the system. Ranking loss is used to evaluate the performance of label ranking and calculates the average fraction of label pairs that are not correctly ordered.

Hamming loss $= \frac{1}{m}\sum_{i=1}^{m} \frac{\| h(X_i) \oplus Y_i \|_1}{N}$    (7)

Macro-F1 $= \frac{1}{N}\sum_{i=1}^{N} \frac{2 \times \sum_{j=1}^{m} h^i(X_j)\, y_j^i}{\sum_{j=1}^{m} y_j^i + \sum_{j=1}^{m} h^i(X_j)}$    (8)

Micro-F1 $= \frac{2 \times \sum_{i=1}^{m} \| h(X_i) \wedge Y_i \|_1}{\sum_{i=1}^{m} \| Y_i \|_1 + \sum_{i=1}^{m} \| h(X_i) \|_1}$    (9)

Ranking loss $= \frac{1}{m}\sum_{i=1}^{m} \frac{1}{|Y_i|\,|\overline{Y_i}|} \left|\{\, (y_t, y_s) \in Y_i \times \overline{Y_i} \mid h(X_i, y_t) \le h(X_i, y_s) \,\}\right|$    (10)

where N is the number of labels, $\|\cdot\|_1$ is the $l_1$ norm, $\oplus$ is the logical XOR operation, $\wedge$ is the element-wise logical AND operation, and the complement of $Y_i$ is denoted by $\overline{Y_i}$.
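Equations (7)-(10) translate directly into NumPy, as sketched below for binary prediction and ground truth matrices H and Y of shape (m, N) and a score matrix for the ranking loss; the small matrices used here are invented for illustration.

import numpy as np

def hamming_loss(H, Y):
    return np.mean(H != Y)                       # eq. (7): fraction of wrong image-label pairs

def macro_f1(H, Y):
    tp2 = 2 * (H * Y).sum(axis=0)                # per-label 2*TP
    denom = Y.sum(axis=0) + H.sum(axis=0)        # per-label (TP+FN) + (TP+FP)
    return np.mean(np.where(denom > 0, tp2 / np.maximum(denom, 1), 0.0))   # eq. (8)

def micro_f1(H, Y):
    return 2 * (H * Y).sum() / (Y.sum() + H.sum())   # eq. (9)

def ranking_loss(scores, Y):
    losses = []
    for s, y in zip(scores, Y):
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        pairs = [(t, u) for t in pos for u in neg]
        bad = sum(s[t] <= s[u] for t, u in pairs)    # positive label ranked no higher than a negative one
        losses.append(bad / len(pairs))
    return float(np.mean(losses))                    # eq. (10)

Y = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])           # ground truth, m=2 images, N=4 labels
H = np.array([[1, 0, 0, 0], [0, 1, 1, 1]])           # binary predictions
S = np.array([[0.9, 0.2, 0.4, 0.1], [0.2, 0.8, 0.6, 0.7]])   # confidence scores
print(hamming_loss(H, Y), macro_f1(H, Y), micro_f1(H, Y), ranking_loss(S, Y))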
All these evaluation measures have been frequently used for the performance evaluation of annotation systems [113, 151, 58, 155, 183, 92]. For evaluating the performance of large scale image annotation systems, two complementary evaluation metrics [218] ascertain the robustness and stability of the annotation system. To assess the robustness of the annotation system, the authors proposed the zero-rate annotation accuracy, which counts the number of keywords that have never been predicted accurately. To assess the stability of the system, the authors proposed the coefficient of variation (CV), which measures the variation in the annotation accuracy among keywords. For a system to be stable, the value of the CV should be low, which indicates that the annotation accuracies of the keywords are close to each other. This helps in the retrieval of similar images when the query can be any of the keywords [218].
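A minimal sketch of these two complementary metrics, given per-keyword annotation accuracies (here invented numbers, and with the zero-rate taken as the fraction of never-correct keywords): the coefficient of variation is the ratio of the standard deviation to the mean of the per-keyword accuracies.

import numpy as np

per_keyword_acc = np.array([0.8, 0.75, 0.0, 0.6, 0.7, 0.0])   # invented per-keyword accuracies

zero_rate = np.mean(per_keyword_acc == 0)                     # fraction of never-correct keywords
cv = per_keyword_acc.std() / per_keyword_acc.mean()           # lower CV = more stable system
print(zero_rate, cv)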
6. The Database for Image Annotation
There are several publicly available databases for the training and evaluation of image annotation and retrieval systems. For the annotation system to be effective, it is necessary that the system is trained with a large and balanced dataset. However, efforts have been made to train systems even if the dataset is unbalanced [219]. The creation of a balanced and manually annotated dataset is an expensive and time-consuming process. Thus, efforts have been made to develop semi-supervised models that can be trained using noisy or incomplete tags [27, 101, 82]. Nevertheless, to train a model, manually annotated images are needed one way or another. Efforts have been made by various communities to develop standard balanced databases for training and evaluation purposes. Details about some of the globally accepted and standard databases are given below.
6.1. The Corel Database
The Corel database is created from the Corel Photo Gallery [187]. Several research groups [27, 183, 81, 220, 221] have used Corel data for the evaluation of their annotation systems. The Corel dataset is entirely manually annotated, labeled by human experts. There are various versions of the Corel dataset, as explained below.
Corel5K: Corel5K contains 5,000 manually annotated images. Each image is either 192 × 128 or 128 × 192 pixels. There are 371 unique words in total, and each image is annotated with 1 to 5 keywords [220]. The Corel5K dataset is tiny and has a small vocabulary; hence, when an annotation system is evaluated only on Corel5K, it is difficult to determine whether the proposed system has good generalization capability.
Corel30K: Corel30K is an extension of the Corel5K dataset containing 31,695 images. The images are 384 × 256 or 256 × 384 pixels. The vocabulary size is increased to 5,587, each image is labeled with 1 to 5 keywords, and the average number of words per image is around 3.6 [127].
Corel60K: Another extension of the Corel database is known as Corel60K. It is a balanced dataset containing 60,000 images from 600 different categories, with around 100 images per category of 384 × 256 or 256 × 384 pixels. There are 417 distinct keywords, and each image is tagged with 1 to 7 keywords.
Although the Corel dataset is a standard dataset, Ref. [191] pointed out some of its disadvantages. The authors implemented three automatic image annotation methods (CSD-prop, SvdCos, and CSD-svm) [222] on the Corel dataset and compared the results with some state-of-the-art methods (the translation model [223], the CRM model [98], the MBRM model [97], and the MIX-Hier model [42]). The authors showed that, when both the training set and the testing set are drawn from the Corel dataset itself, it is relatively easy to perform annotation [191]. They also argued that the Corel dataset contains redundant training information and that a model can be trained using only 25% of the training information.
6.2. ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
Since 2010, ImageNet has organized a yearly competition for the detection, classification, localization, etc. of objects in an image [188]. ILSVRC-10 is a competition targeted at evaluating the effectiveness of image annotation (multi-label classification) methods, where the goal is to annotate each image with at most five labels in descending order of confidence (an illustrative sketch of this top-five output is given at the end of this subsection). There are 1,000 object categories, and the labels are organized hierarchically in three levels with 1,000 non-overlapping nodes at the leaf level. The dataset comprises 1.2 million training images, each having at most five labels, along with 50,000 validation images and 150,000 test images collected from Flickr and other sites. All the training images are manually labeled without any segmentation. In ILSVRC-2011, the organizers added one extra task: to classify and localize the objects. The goal of this new task is to predict the top five class labels and five bounding boxes, one for each class label. Later, in ILSVRC-2013, the organizers ran the competition with two tasks: object detection (a new task) and classification and localization (the same as in ILSVRC-2011). The object detection challenge is similar to the PASCAL VOC challenge [224] but with a substantially larger number of object categories and images. The goal of the object detection task is to identify the object class (200 categories) from fully manually annotated images with bounding boxes. Later, object detection from video and scene classification tasks were included in the competition (ILSVRC-2015 to ILSVRC-2017).
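For illustration only (this is not the official ILSVRC evaluation code), the required output of at most five labels in descending order of confidence, and a simple check of whether any ground-truth label is covered by them, could be sketched as follows; the array shapes and the top5_error helper are assumptions introduced here.

```python
# Illustrative sketch of the top-five annotation output used in ILSVRC-10.
import numpy as np

def top5_labels(class_scores):
    # indices of the five highest-scoring classes, best first
    return list(np.argsort(-class_scores)[:5])

def top5_error(all_scores, ground_truth):
    # all_scores: n_images x n_classes array; ground_truth: list of sets of true class ids
    misses = 0
    for scores, truth in zip(all_scores, ground_truth):
        if not set(top5_labels(scores)) & set(truth):
            misses += 1
    return misses / len(ground_truth)
```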
6.3. IAPR TC-12 Benchmark
The IAPR TC-12 benchmark consists of 20,000 natural images [189]. The database contains the images used during the 2006 to 2008 ImageCLEF evaluation campaigns. The images in the IAPR TC-12 dataset contain multiple objects and come with complete annotations (full text as well as English, German, and Random) and light annotations (all annotations except for the description). There are around 291 unique labels, and each image has approximately 1 to 23 labels. On average, there are 5.7 labels per image and 153 to 4,999 images per label, with 347 images per label on average. The dataset is publicly available without any copyright restriction.
There is an extended version of the IAPR TC-12 benchmark called the segmented and annotated IAPR TC-12 (SAIAPR TC-12) [91]. SAIAPR TC-12 includes all the images of IAPR TC-12 along with segmentation masks and segmented images, and it provides region-wise extracted features along with the labels assigned to each region. Region-level annotations according to a hierarchy, as well as spatial relationship information, are also listed in the dataset. The SAIAPR TC-12 benchmark is publicly available without any copyright restriction.
6.4. ImageCLEF Photo Annotation
Launched in 2003 as part of the Cross Language Evaluation Forum (CLEF) for the performance evaluation of concept detection, annotation, and retrieval methods, ImageCLEF started organizing a visual concept detection and annotation challenge for photo images in 2008 [225], although it had been organizing medical image annotation and retrieval tasks since 2005. At first, only a small number of photo images was available for training a model (1,800 in ImageCLEF-2008, 5,000 in ImageCLEF-2009, and 8,000 in ImageCLEF-2010 and ImageCLEF-2011), and all of these images were manually annotated. Since 2012, the organizers have introduced a new challenge called the scalable image annotation task. The idea is to rank the annotated keywords and decide how many keywords can be assigned to an image (a simple thresholding sketch of this idea is given at the end of this subsection). In addition, the training dataset contains only textual features (URL, surrounding text, etc.) that can be mined and used as labels, and the size of the training set has been increased significantly. The ImageCLEF 2013 photo annotation task is a benchmark for visual concept detection, annotation, and retrieval of photos [226]. The dataset contains 250,000 training images downloaded from the internet, 1,000 development-set images, and 2,000 test-set images belonging to 95 categories; ground truth is provided only for the development set. The dataset is designed to check the scalability of the proposed annotation system; hence, the list of concepts differs between the training set and the development and test sets. There are no manually annotated images in the training set. The organizers have provided textual as well as visual features with the dataset. Textual features include the complete web page in XML form, a list of word-score pairs, the image URL, and the rank of the image when it is retrieved through a search engine. Visual features include four types of SIFT features (SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT), two kinds of GIST features (GIST, GIST2), LBP center features, two types of color features (COLOR HIST, HSVHIST) [227], and GETLF. The dataset has been used in [15, 81].
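As a hedged illustration of this idea (not the official ImageCLEF protocol), keywords could be ranked by score and kept only while the score stays above a threshold; the threshold, min_k, and max_k parameters below are hypothetical.

```python
# Illustrative sketch: rank keywords and decide how many to assign per image.
def select_keywords(scored_keywords, threshold=0.5, min_k=1, max_k=10):
    ranked = sorted(scored_keywords, key=lambda kv: kv[1], reverse=True)
    kept = [kw for kw, score in ranked if score >= threshold][:max_k]
    if len(kept) < min_k:                  # always assign at least min_k keywords
        kept = [kw for kw, _ in ranked[:min_k]]
    return kept

# Example: select_keywords([("dog", 0.92), ("grass", 0.61), ("car", 0.12)])
# returns ["dog", "grass"].
```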
6.5. NUS-WIDE Database
The dataset was created by NUS's Lab for Media Search for the annotation and retrieval of Flickr images [190]. It contains 269,648 images with 5,018 unique tags and also includes six types of low-level features. The dataset is divided into 161,789 training images and 107,859 testing images, and ground truth for 81 concepts is provided for evaluation purposes. The six features are a 64-D color histogram, a 144-D color correlogram, a 73-D edge direction histogram, a 128-D wavelet texture, 225-D block-wise color moments, and a 500-D bag of words based on SIFT descriptors. The NUS-WIDE dataset also comes in three additional versions: (i) a light version called NUS-WIDE-LITE, (ii) the NUS-WIDE-OBJECT dataset, which contains only one object in each image, and (iii) the NUS-WIDE-SCENE dataset, where each image has one scene. Several research groups have used the NUS-WIDE dataset [27, 198, 101, 82, 168, 48] to evaluate the performance of their proposed methods.
6.6. ESP Game Database
The ESP Game dataset contains images from the ESP Game, where images are labeled collaboratively by users. The dataset covers a wide variety of images, ranging from drawings and logos to many other kinds of images, including those of humans. It includes 67,796 images with labels. A subset of this dataset has been used in multiple studies [1, 21, 53, 27, 183].
There exist various other datasets, such as PASCAL VOC, MIR Flickr, LabelMe, and MS COCO, which can also be used to check the competitiveness of proposed methods. Table 2 shows several primary statistics of the datasets mentioned above.
Table 2: Some of the predominant databases (Corel [187], ILSVRC [188], IAPR TC-12 [189], SAIAPR TC-12 [91], ImageCLEF [225, 226], NUS-WIDE [190], ESP Game [1]) used for the training and evaluation of image annotation methods. For each dataset (Corel5K, Corel30K, Corel60K, ILSVRC-10, ILSVRC-11 to ILSVRC-14 (C&L), ImageCLEF-2008 to ImageCLEF-2011, ImageCLEF-2012 to ImageCLEF-2016 (web data), IAPR TC-12, SAIAPR TC-12, NUS-WIDE, and ESP Game), the table reports the number of training, validation, and test images, the number of concepts, and whether the dataset is fully manually annotated, partially annotated, or unannotated. Notation: 1.2m = 1.2 million; (C&L) = classification and localization; (web data) = collection of automatically obtained web images and their associated metadata.
Discussion: It is vital that a system be evaluated on a well-balanced, unbiased dataset to check the competitiveness of any proposed method. The demerits of the Corel dataset are pointed out in [191]. Most of the available datasets are manually labeled by experts or by users. Manual annotations are subjective [205], i.e., the set of labels assigned to an image may vary from person to person. As the number of images keeps increasing, the need of the hour is to develop scalable systems using weakly supervised and unsupervised learning. A scalable annotation system can easily change or scale the number of keywords used for image annotation. Very few datasets exist for the development of such scalable systems (ImageCLEF 2012 to 2016 Photo Annotation and NUS-WIDE), and these datasets do not contain labels for drawings, logos, or complex hidden objects. We believe that more datasets designed with scalability in mind and containing a greater variety of images would support the development of a full-fledged annotation system.
7. Conclusion
This article presented a comprehensive study of the methods and emerging directions that have evolved in the exciting field of image annotation. The study started with the early years of image annotation and showed that the semantic gap was the most challenging problem faced by almost all approaches in the first decade. Approaches proposed in the second decade largely resolved the semantic gap problem, but new hindrances, such as the limitations of supervised learning based AIA and large-scale image annotation, unfolded in this decade.
Although this article covered every aspect of image annotation reported in the last three decades, much attention was paid to the progress made in the previous decade and to current trends in AIA. The article discussed new approaches that do away with the need for fully labeled training data, explored current trends in unsupervised learning based approaches for AIA, and concluded that the field of unsupervised learning based AIA is contemporary, very promising, and needs to be explored meticulously.
This article also presented details about some of the predominant databases used for the training and evaluation of annotation methods. Most of the available databases are intended for the training and evaluation of supervised learning based image annotation methods, and only a handful exist for the training and evaluation of semi-supervised and unsupervised methods. Apart from that, we have also described the performance evaluation measures. The evaluation measures for multi-label annotation methods are considerably more involved than those for single-label annotation systems. Performance evaluation measures for multi-label annotation and ranking-based annotation methods have also been described in this paper.
Each section/subsection of the article concludes with a discussion detailing the prominent challenges faced and conjecturing specific future directions. We believe this discourse will help the reader gain intuition for exploring the field in the foreseeable future.
References
[1] J. H. Su, C. L. Chou, C. Y. Lin, V. S. Tseng, Effective semantic an-
PT
notation by image-to-concept distribution model, IEEE Transactions on
CE
Multimedia 13 (3) (2011) 530–538.
[2] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Contentbased image retrieval at the end of the early years, IEEE Transactions
AC
on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1349–1380.
doi:10.1109/34.895972.
[3] D. Zhang, M. M. Islam, G. Lu, A review on automatic image annotation
techniques, Pattern Recognition 45 (1) (2012) 346 – 362. doi:https:
//doi.org/10.1016/j.patcog.2011.05.013.
[4] R. Datta, D. Joshi, J. Li, J. Z. Wang, Image retrieval: Ideas, influences,
and trends of the new age, ACM Comput. Surv. 40 (2) (2008) 5:1–5:60.
doi:10.1145/1348246.1348248.
[5] M. Ivasic-Kos, M. Pobar, S. Ribaric, Two-tier image annotation model
T
based on a multi-label classifier and fuzzy-knowledge representation
IP
scheme, Pattern Recognition 52 (Supplement C) (2016) 287 – 305.
CR
[6] Z. Ma, Y. Yang, F. Nie, J. Uijlings, N. Sebe, Exploiting the entire feature
space with sparsity for automatic image annotation, in: Proceedings of the
19th ACM International Conference on Multimedia, MM ’11, ACM, New
US
York, NY, USA, 2011, pp. 283–292. doi:10.1145/2072298.2072336.
[7] R. Hong, M. Wang, Y. Gao, D. Tao, X. Li, X. Wu, Image annotation
AN
by multiple-instance learning with discriminative feature mapping and
selection, IEEE Transactions on Cybernetics 44 (5) (2014) 669–680.
M
[8] A. Makadia, V. Pavlovic, S. Kumar, Baselines for image annotation, International Journal of Computer Vision 90 (1) (2010) 88–105.
ED
[9] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation,
in: D. Forsyth, P. Torr, A. Zisserman (Eds.), Computer Vision – ECCV
PT
2008, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 316–329.
[10] S. Xia, P. Chen, J. Zhang, X. Li, B. Wang, Utilization of rotation-invariant
CE
uniform {LBP} histogram distribution and statistics of connected regions
in automatic image annotation based on multi-label learning, Neurocom-
AC
puting 228 (2017) 11 – 18, advanced Intelligent Computing: Theory and
Applications. doi:https://doi.org/10.1016/j.neucom.2016.09.087.
[11] Y. Verma, C. V. Jawahar, Image annotation by propagating labels from semantic neighbourhoods, International Journal of Computer Vision 121 (1)
(2017) 126–148. doi:10.1007/s11263-016-0927-0.
[12] Y. Wang, T. Mei, S. Gong, X.-S. Hua, Combining global, regional and
contextual features for automatic image annotation, Pattern Recognition
42 (2) (2009) 259 – 266, learning Semantics from Multimedia Content.
doi:https://doi.org/10.1016/j.patcog.2008.05.010.
[13] K. Li, C. Zou, S. Bu, Y. Liang, J. Zhang, M. Gong, Multi-modal feature
T
fusion for geographic image annotation, Pattern Recognition 73 (Supple-
IP
ment C) (2018) 1 – 14. doi:https://doi.org/10.1016/j.patcog.2017.
CR
06.036.
[14] J. Fan, Y. Gao, H. Luo, Integrating concept ontology and multitask learning to achieve more effective classifier training for multilevel image anno-
US
tation, IEEE Transactions on Image Processing 17 (3) (2008) 407–426.
[15] L. Pellegrin, H. J. Escalante, M. Montes-y Gómez, F. A. González, Local
AN
and global approaches for unsupervised image annotation, Multimedia
Tools and Applications 76 (15) (2017) 16389–16414.
M
[16] L. Cao, J. Luo, H. Kautz, T. S. Huang, Image annotation within the
context of personal photo collections using hierarchical event and scene
ED
models, IEEE Transactions on Multimedia 11 (2) (2009) 208–219.
[17] X. Rui, M. Li, Z. Li, W.-Y. Ma, N. Yu, Bipartite graph reinforcement
PT
model for web image annotation, in: Proceedings of the 15th ACM International Conference on Multimedia, MM ’07, ACM, New York, NY, USA,
CE
2007, pp. 585–594. doi:10.1145/1291233.1291378.
[18] A. Ulges, M. Worring, T. Breuel, Learning visual contexts for image anno-
AC
tation from flickr groups, IEEE Transactions on Multimedia 13 (2) (2011)
330–341.
[19] R. Jin, J. Y. Chai, L. Si, Effective automatic image annotation via a
coherent language model and active learning, in: Proceedings of the
12th Annual ACM International Conference on Multimedia, MULTIMEDIA ’04, ACM, New York, NY, USA, 2004, pp. 892–899. doi:10.1145/1027527.1027732.
[20] J. Tang, Z. J. Zha, D. Tao, T. S. Chua, Semantic-gap-oriented active
learning for multilabel image annotation, IEEE Transactions on Image
Processing 21 (4) (2012) 2354–2360.
T
[21] B. Wu, S. Lyu, B.-G. Hu, Q.Ji, Multi-label learning with missing labels for
IP
image annotation and facial action unit recognition, Pattern Recognition
48 (7) (2015) 2279 – 2289.
CR
[22] Z. Shi, Y. Yang, T. M. Hospedales, T. Xiang, Weakly-supervised image
annotation and segmentation with objects and attributes, IEEE Transac-
US
tions on Pattern Analysis and Machine Intelligence 39 (12) (2017) 2525–
2538.
AN
[23] M. Hu, Y. Yang, F. Shen, L. Zhang, H. T. Shen, X. Li, Robust web image annotation via exploring multi-facet and structural knowledge, IEEE
M
Transactions on Image Processing 26 (10) (2017) 4871–4884.
[24] L. Wu, R. Jin, A. K. Jain, Tag completion for image retrieval, IEEE
716–727.
ED
Transactions on Pattern Analysis and Machine Intelligence 35 (3) (2013)
PT
[25] J. Wang, G. Li, A multi-modal hashing learning framework for automatic
image annotation, in: 2017 IEEE Second International Conference on
CE
Data Science in Cyberspace (DSC), 2017, pp. 14–21.
[26] T. Uricchio, L. Ballan, L. Seidenari, A. D. Bimbo, Automatic image annotation via label transfer in the semantic space, Pattern Recognition
AC
71 (Supplement C) (2017) 144 – 157. doi:https://doi.org/10.1016/
j.patcog.2017.05.019.
[27] X. Li, B. Shen, B. D. Liu, Y. J. Zhang, Ranking-preserving low-rank
factorization for image annotation with missing labels, IEEE Transactions
on Multimedia PP (99) (2017) 1–1.
[28] F. Monay, D. Gatica-Perez, Plsa-based image auto-annotation: Constraining the latent space, in: Proceedings of the 12th Annual ACM International Conference on Multimedia, MULTIMEDIA ’04, ACM, New York,
NY, USA, 2004, pp. 348–351. doi:10.1145/1027527.1027608.
[29] A. Hanbury, A survey of methods for image annotation, Journal of Visual
T
Languages & Computing 19 (5) (2008) 617 – 627. doi:https://doi.org/
IP
10.1016/j.jvlc.2008.01.002.
[30] Y. Liu, D. Zhang, G. Lu, W.-Y. Ma, A survey of content-based image
CR
retrieval with high-level semantics, Pattern Recognition 40 (1) (2007) 262
– 282. doi:https://doi.org/10.1016/j.patcog.2006.04.045.
US
[31] Q. Cheng, Q. Zhang, P. Fu, C. Tu, S. Li, A survey and analysis on
automatic image annotation, Pattern Recognition 79 (2018) 242 – 259.
AN
doi:https://doi.org/10.1016/j.patcog.2018.02.017.
[32] X. Li, T. Uricchio, L. Ballan, M. Bertini, C. G. M. Snoek, A. D. Bimbo,
M
Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval, ACM Comput. Surv. 49 (1) (2016)
ED
14:1–14:39. doi:10.1145/2906152.
[33] C. Wang, L. Zhang, H.-J. Zhang, Learning to reduce the semantic gap in
PT
web image retrieval and annotation, in: Proceedings of the 31st Annual
International ACM SIGIR Conference on Research and Development in
Information Retrieval, SIGIR ’08, ACM, New York, NY, USA, 2008, pp.
CE
355–362. doi:10.1145/1390334.1390396.
[34] X.-J. Wang, L. Zhang, F. Jing, W.-Y. Ma, Annosearch: Image auto-
AC
annotation by search, in: 2006 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, 2006, pp.
1483–1490. doi:10.1109/CVPR.2006.58.
[35] C. Cusano, G. Ciocca, R. Schettini, Image annotation using svm,
Proc.SPIE 5304 (2003) 5304 – 5304 – 9. doi:10.1117/12.526746.
[36] C. Yang, M. Dong, J. Hua, Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning, in:
2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, 2006, pp. 2057–2063.
[37] C. Wang, S. Yan, L. Zhang, H. J. Zhang, Multi-label sparse coding for
IP
Vision and Pattern Recognition, 2009, pp. 1643–1650.
T
automatic image annotation, in: 2009 IEEE Conference on Computer
[38] X. Qi, Y. Han, Incorporating multiple svms for automatic image an-
CR
notation, Pattern Recognition 40 (2) (2007) 728 – 741.
doi:https:
//doi.org/10.1016/j.patcog.2006.04.042.
US
[39] J. Johnson, L. Ballan, L. Fei-Fei, Love thy neighbors: Image annotation
by exploiting image metadata, in: 2015 IEEE International Conference on
AN
Computer Vision (ICCV), 2015, pp. 4624–4632.
[40] M. B. Mayhew, B. Chen, K. S. Ni, Assessing semantic information in
M
convolutional neural network representations of images via image annotation, in: 2016 IEEE International Conference on Image Processing (ICIP),
ED
2016, pp. 2266–2270.
[41] A. Fakhari, A. M. E. Moghadam, Combination of classification and re-
PT
gression in decision tree for multi-labeling image annotation and retrieval, Applied Soft Computing 13 (2) (2013) 1292 – 1302. doi:https:
CE
//doi.org/10.1016/j.asoc.2012.10.019.
[42] G. Carneiro, N. Vasconcelos, Formulating semantic image annotation as a
supervised learning problem, in: 2005 IEEE Computer Society Conference
AC
on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, 2005,
pp. 163–168 vol. 2.
[43] P. Niyogi, Manifold regularization and semi-supervised learning: Some
theoretical analyses, J. Mach. Learn. Res. 14 (1) (2013) 1229–1250.
[44] L. Pellegrin, H. J. Escalante, M. Montes-y Gómez, Evaluating termexpansion for unsupervised image annotation, in: A. Gelbukh, F. C. Espinoza, S. N. Galicia-Haro (Eds.), Human-Inspired Computing and Its
Applications, Springer International Publishing, Cham, 2014, pp. 151–
162.
[45] L. Pellegrin, J. A. Vanegas, J. E. A. Ovalle, V. Beltrán, H. J. Escalante,
T
M. M. y Gómez, F. A. González, Inaoe-unal at imageclef 2015: Scalable
IP
concept image annotation, in: CLEF, 2015.
[46] T. Uricchio, M. Bertini, L. Ballan, A. D. Bimbo, Micc-unifi at imageclef
shop, Valencia, Spain, 2013, (Benchmark).
CR
2013 scalable concept image annotation, in: Proc. of ImageCLEF Work-
US
[47] T. Tommasi, F. Orabona, B. Caputo, Discriminative cue integration for
medical image annotation, Pattern Recognition Letters 29 (15) (2008)
doi:https://doi.org/10.1016/j.
AN
1996 – 2002, image CLEF 2007.
patrec.2008.03.009.
M
[48] A. Kumar, S. Dyer, J. Kim, C. Li, P. H. Leong, M. Fulham, D. Feng,
Adapting content-based image retrieval techniques for the semantic an-
ED
notation of medical images, Computerized Medical Imaging and Graphics
49 (Supplement C) (2016) 37 – 45. doi:https://doi.org/10.1016/j.
PT
compmedimag.2016.01.001.
[49] G. Zhang, C.-H. R. Hsu, H. Lai, X. Zheng, Deep learning based feature
representation for automated skin histopathological image annotation,
CE
Multimedia Tools and Applicationsdoi:10.1007/s11042-017-4788-5.
[50] A. Mueen, R. Zainuddin, M. S. Baba, Automatic multilevel medical image
AC
annotation and retrieval, Journal of Digital Imaging 21 (2007) 290–295.
[51] D. Bratasanu, I. Nedelcu, M. Datcu, Bridging the semantic gap for satellite
image annotation and automatic mapping applications, IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing 4 (1)
(2011) 193–204.
[52] J. Fan, Y. Gao, H. Luo, G. Xu, Automatic image annotation by using
concept-sensitive salient objects for image content representation, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’04, ACM, New
York, NY, USA, 2004, pp. 361–368. doi:10.1145/1008992.1009055.
T
[53] X. Y. Jing, F. Wu, Z. Li, R. Hu, D. Zhang, Multi-label dictionary learn-
IP
ing for image annotation, IEEE Transactions on Image Processing 25 (6)
CR
(2016) 2712–2725.
[54] S. Lindstaedt, R. Mörzinger, R. Sorschag, V. Pammer, G. Thallinger, Automatic image annotation using visual content and folksonomies, Mul-
US
timedia Tools and Applications 42 (1) (2009) 97–113. doi:10.1007/
s11042-008-0247-7.
AN
[55] P. Brodatz, Textures: A Photographic Album for Artists and Designers,
Peter Smith Publisher, Incorporated, 1981.
M
[56] G. Leboucher, G. Lowitz, What a histogram can really tell the classifier,
Pattern Recognition 10 (5) (1978) 351 – 357. doi:https://doi.org/10.
ED
1016/0031-3203(78)90006-7.
[57] R. M. Haralick, K. Shanmugam, I. Dinstein, Textural features for im-
PT
age classification, IEEE Transactions on Systems, Man, and Cybernetics
SMC-3 (6) (1973) 610–621. doi:10.1109/TSMC.1973.4309314.
CE
[58] V. Kovalev, M. Petrou, Multidimensional co-occurrence matrices for object recognition and matching, Graphical Models and Image Processing
AC
58 (3) (1996) 187 – 197. doi:https://doi.org/10.1006/gmip.1996.
0016.
[59] K. Valkealahti, E. Oja, Reduced multidimensional co-occurrence histograms in texture classification, IEEE Transactions on Pattern Analysis
and Machine Intelligence 20 (1) (1998) 90–94. doi:10.1109/34.655653.
[60] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and
rotation invariant texture classification with local binary patterns, IEEE
Transactions on Pattern Analysis and Machine Intelligence 24 (7) (2002)
971–987. doi:10.1109/TPAMI.2002.1017623.
[61] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition
T
under difficult lighting conditions, IEEE Transactions on Image Processing
IP
19 (6) (2010) 1635–1650.
CR
[62] T. Jabid, M. H. Kabir, O. Chae, Local directional pattern (ldp) for face
recognition, in: 2010 Digest of Technical Papers International Conference
on Consumer Electronics (ICCE), 2010, pp. 329–330. doi:10.1109/ICCE.
US
2010.5418801.
[63] M. H. Kabir, T. Jabid, O. Chae, A local directional pattern variance
AN
(ldpv) based face descriptor for human facial expression recognition, in:
2010 7th IEEE International Conference on Advanced Video and Signal
M
Based Surveillance, 2010, pp. 526–532. doi:10.1109/AVSS.2010.9.
[64] I. Daubechies, Ten Lectures on Wavelets, Society for Industrial and Ap-
ED
plied Mathematics, Philadelphia, PA, USA, 1992.
[65] G. E. Lowitz, Can a local histogram really map texture information?,
PT
Pattern Recognition 16 (2) (1983) 141 – 147. doi:https://doi.org/10.
1016/0031-3203(83)90017-1.
CE
[66] B. Julesz, Experiments in the visual perception of texture, Scientific American 232 (4) (1975) 34–43.
AC
[67] M. J. Swain, D. H. Ballard, Color indexing, International Journal of Computer Vision 7 (1) (1991) 11–32. doi:10.1007/BF00130487.
[68] A. Bahrololoum, H. Nezamabadi-pour, A multi-expert based framework
for automatic image annotation, Pattern Recognition 61 (Supplement C)
(2017) 169 – 184. doi:https://doi.org/10.1016/j.patcog.2016.07.
034.
[69] S. R. Kodituwakku, S. Selvarajah, Comparison of color features for image
retrieval, Indian Journal of Computer Science and Engineering 1 (3) (2010)
207–2011.
T
[70] A. K. Jain, A. Vailaya, Image retrieval using color and shape, Pattern
IP
Recognition 29 (8) (1996) 1233 – 1244. doi:https://doi.org/10.1016/
CR
0031-3203(95)00160-3.
[71] T. Deselaers, D. Keysers, H. Ney, Features for image retrieval: an
doi:10.1007/s10791-007-9039-3.
US
experimental comparison, Information Retrieval 11 (2) (2008) 77–107.
[72] S. Zheng, M. M. Cheng, J. Warrell, P. Sturgess, V. Vineet, C. Rother,
AN
P. H. S. Torr, Dense semantic image segmentation with objects and attributes, in: 2014 IEEE Conference on Computer Vision and Pattern
M
Recognition, 2014, pp. 3214–3221. doi:10.1109/CVPR.2014.411.
[73] G. Singh, J. Kosecka, Nonparametric scene parsing with adaptive fea-
ED
ture relevance and semantic context, in: 2013 IEEE Conference on
Computer Vision and Pattern Recognition, 2013, pp. 3151–3157. doi:
PT
10.1109/CVPR.2013.405.
[74] A. Vezhnevets, V. Ferrari, J. M. Buhmann, Weakly supervised structured
CE
output learning for semantic segmentation, in: 2012 IEEE Conference
on Computer Vision and Pattern Recognition, 2012, pp. 845–852. doi:
AC
10.1109/CVPR.2012.6247757.
[75] M. Rubinstein, C. Liu, W. T. Freeman, Annotation propagation in large
image databases via dense image correspondence, in: A. Fitzgibbon,
S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision
– ECCV 2012, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp.
85–99.
[76] H. Zhang, J. E. Fritts, S. A. Goldman, Image segmentation evaluation:
A survey of unsupervised methods, Computer Vision and Image Understanding 110 (2) (2008) 260 – 280. doi:https://doi.org/10.1016/j.
cviu.2007.08.003.
[77] E. Shelhamer, J. Long, T. Darrell, Fully convolutional networks for seman-
T
tic segmentation, IEEE Transactions on Pattern Analysis and Machine
IP
Intelligence 39 (4) (2017) 640–651. doi:10.1109/TPAMI.2016.2572683.
CR
[78] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille,
Deeplab: Semantic image segmentation with deep convolutional nets,
atrous convolution, and fully connected crfs, IEEE Transactions on Pat-
US
tern Analysis and Machine Intelligence PP (99) (2017) 1–1. doi:10.1109/
TPAMI.2017.2699184.
AN
[79] T. Li, B. Cheng, B. Ni, G. Liu, S. Yan, Multitask low-rank affinity graph
for image segmentation and image annotation, ACM Trans. Intell. Syst.
M
Technol. 7 (4) (2016) 65:1–65:18. doi:10.1145/2856058.
[80] J. Fan, Y. Gao, H. Luo, Multi-level annotation of natural scenes us-
ED
ing dominant image components and semantic concepts, in: Proceedings of the 12th Annual ACM International Conference on Multime-
PT
dia, MULTIMEDIA ’04, ACM, New York, NY, USA, 2004, pp. 540–547.
doi:10.1145/1027527.1027660.
CE
[81] M. Jiu, H. Sahbi, Nonlinear deep kernel learning for image annotation,
IEEE Transactions on Image Processing 26 (4) (2017) 1820–1832.
AC
[82] L. Tao, H. H. Ip, A. Zhang, X. Shu, Exploring canonical correlation analysis with subspace and structured sparsity for web image annotation, Image
and Vision Computing 54 (Supplement C) (2016) 22 – 30.
[83] N. Sebe, Q. Tian, E. Loupias, M. Lew, T. Huang, Evaluation of salient
point techniques, Image and Vision Computing 21 (13) (2003) 1087 –
1095, british Machine Vision Computing 2001. doi:https://doi.org/
10.1016/j.imavis.2003.08.012.
[84] K. S. Pedersen, M. Loog, P. Dorst, Salient point and scale detection by
minimum likelihood, in: N. D. Lawrence, A. Schwaighofer, J. Q. Candela
(Eds.), Gaussian Processes in Practice, Vol. 1 of Proceedings of Machine
T
Learning Research, PMLR, Bletchley Park, UK, 2007, pp. 59–72.
Int. J. Comput. Vision 60 (2) (2004) 91–110.
doi:10.1023/B:VISI.
CR
0000029664.99615.94.
IP
[85] D. G. Lowe, Distinctive image features from scale-invariant keypoints,
[86] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection,
US
in: 2005 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’05), Vol. 1, 2005, pp. 886–893 vol. 1. doi:
AN
10.1109/CVPR.2005.177.
[87] H. Bay, T. Tuytelaars, L. V. Gool, Surf: Speeded up robust features, in:
M
A. Leonardis, H. Bischof, A. Pinz (Eds.), Computer Vision – ECCV 2006,
Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 404–417.
ED
[88] E. Rosten, T. Drummond, Machine learning for high-speed corner detection, in: A. Leonardis, H. Bischof, A. Pinz (Eds.), Computer Vision
PT
– ECCV 2006, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp.
430–443.
CE
[89] M. Calonder, V. Lepetit, C. Strecha, P. Fua, Brief: Binary robust independent elementary features, in: K. Daniilidis, P. Maragos, N. Paragios (Eds.), Computer Vision – ECCV 2010, Springer Berlin Heidelberg,
AC
Berlin, Heidelberg, 2010, pp. 778–792.
[90] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, Orb: An efficient alternative to sift or surf, in: 2011 International Conference on Computer Vision,
2011, pp. 2564–2571. doi:10.1109/ICCV.2011.6126544.
[91] H. J. Escalante, C. A. Hernández, J. A. Gonzalez, A. López-López,
M. Montes, E. F. Morales, L. E. Sucar, L. Villasenor, M. Grubinger,
The segmented and annotated iapr tc-12 benchmark, Computer Vision
and Image Understanding 114 (4) (2010) 419 – 428, special issue on Image and Video Retrieval Evaluation. doi:https://doi.org/10.1016/j.
cviu.2009.03.008.
T
[92] X. Ding, B. Li, W. Xiong, W. Guo, W. Hu, B. Wang, Multi-instance multi-
IP
label learning combining hierarchical context and its application to image
CR
annotation, IEEE Transactions on Multimedia 18 (8) (2016) 1616–1627.
[93] D. M. Blei, M. I. Jordan, Modeling annotated data, in: Proceedings of
the 26th Annual International ACM SIGIR Conference on Research and
US
Development in Informaion Retrieval, SIGIR ’03, ACM, New York, NY,
USA, 2003, pp. 127–134. doi:10.1145/860435.860460.
AN
[94] D. Putthividhy, H. T. Attias, S. S. Nagarajan, Topic regression multimodal latent dirichlet allocation for image annotation, in: 2010 IEEE
M
Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3408–3415.
ED
[95] L. Song, M. Luo, J. Liu, L. Zhang, B. Qian, M. H. Li, Q. Zheng, Sparse
multi-modal topical coding for image annotation, Neurocomputing 214
005.
PT
(2016) 162 – 174. doi:https://doi.org/10.1016/j.neucom.2016.06.
CE
[96] R. Zhang, Z. Zhang, M. Li, W.-Y. Ma, H.-J. Zhang, A probabilistic semantic model for image annotation and multimodal image retrieval, in: Tenth
IEEE International Conference on Computer Vision (ICCV’05) Volume 1,
AC
Vol. 1, 2005, pp. 846–851 Vol. 1.
[97] S. L. Feng, R. Manmatha, V. Lavrenko, Multiple bernoulli relevance
models for image and video annotation, in: Proceedings of the 2004
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, 2004. CVPR 2004., Vol. 2, 2004, pp. II–1002–II–1009 Vol.2.
doi:10.1109/CVPR.2004.1315274.
[98] J. Jeon, V. Lavrenko, R. Manmatha, Automatic image annotation and
retrieval using cross-media relevance models, in: Proceedings of the 26th
Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR ’03, ACM, New York, NY, USA,
IP
T
2003, pp. 119–126. doi:10.1145/860435.860459.
[99] J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, S. Ma, Dual cross-media
CR
relevance model for image annotation, in: Proceedings of the 15th ACM
International Conference on Multimedia, MM ’07, ACM, New York, NY,
US
USA, 2007, pp. 605–614. doi:10.1145/1291233.1291380.
[100] D. Tao, L. Jin, W. Liu, X. Li, Hessian regularized support vector ma-
AN
chines for mobile image annotation on the cloud, IEEE Transactions on
Multimedia 15 (4) (2013) 833–844.
M
[101] X. Xu, A. Shimada, H. Nagahara, R. i. Taniguchi, L. He, Image annotation with incomplete labelling by modelling image specific structured loss,
ED
IEEJ Transactions on Electrical and Electronic Engineering 11 (1) (2016)
73–82.
PT
[102] J. J. McAuley, A. Ramisa, T. S. Caetano, Optimization of robust loss functions for weakly-labeled image taxonomies, International Journal of Com-
CE
puter Vision 104 (3) (2013) 343–361. doi:10.1007/s11263-012-0561-4.
[103] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: A geometric
framework for learning from labeled and unlabeled examples, J. Mach.
AC
Learn. Res. 7 (2006) 2399–2434.
[104] W. Liu, H. Liu, D. Tao, Y. Wang, K. Lu, Manifold regularized kernel
logistic regression for web image annotation, Neurocomputing 172 (Supplement C) (2016) 3 – 8. doi:https://doi.org/10.1016/j.neucom.
2014.06.096.
[105] M.-L. Zhang, Z.-H. Zhou, Ml-knn: A lazy learning approach to multi-label
learning, Pattern Recognition 40 (7) (2007) 2038 – 2048. doi:https:
//doi.org/10.1016/j.patcog.2006.12.019.
[106] M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid, Tagprop: Discrimina-
T
tive metric learning in nearest neighbor models for image auto-annotation,
CR
pp. 309–316. doi:10.1109/ICCV.2009.5459266.
IP
in: 2009 IEEE 12th International Conference on Computer Vision, 2009,
[107] Y. Verma, C. V. Jawahar, Image annotation using metric learning in
semantic neighbourhoods, in: A. Fitzgibbon, S. Lazebnik, P. Perona,
US
Y. Sato, C. Schmid (Eds.), Computer Vision – ECCV 2012, Springer
Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 836–849.
AN
[108] Z. Feng, R. Jin, A. Jain, Large-scale image annotation by efficient and
robust kernel metric learning, in: 2013 IEEE International Conference on
M
Computer Vision, 2013, pp. 1609–1616.
[109] D. R. Hardoon, J. Shawe-taylor, Kcca for different level precision in
ED
content-based image retrieval, in: In Submitted to Third International
Workshop on Content-Based Multimedia Indexing, IRISA, 2003.
PT
[110] L. Feng, B. Bhanu, Semantic concept co-occurrence patterns for image
annotation and retrieval, IEEE Transactions on Pattern Analysis and Ma-
CE
chine Intelligence 38 (4) (2016) 785–799.
[111] X. Zhu, W. Nejdl, M. Georgescu, An adaptive teleportation random walk
AC
model for learning social tag relevance, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in
Information Retrieval, SIGIR ’14, ACM, New York, NY, USA, 2014, pp.
223–232. doi:10.1145/2600428.2609556.
[112] C. Wang, F. Jing, L. Zhang, H.-J. Zhang, Image annotation refinement
using random walk with restarts, in: Proceedings of the 14th ACM International Conference on Multimedia, MM ’06, ACM, New York, NY, USA,
2006, pp. 647–650. doi:10.1145/1180639.1180774.
[113] C. Lei, D. Liu, W. Li, Social diffusion analysis with common-interest
model for image annotation, IEEE Transactions on Multimedia 18 (4)
(2016) 687–701.
T
[114] J. Liu, M. Li, Q. Liu, H. Lu, S. Ma, Image annotation via graph learning,
IP
Pattern Recognition 42 (2) (2009) 218 – 228, learning Semantics from
Multimedia Content. doi:https://doi.org/10.1016/j.patcog.2008.
CR
04.012.
[115] J. Tang, R. Hong, S. Yan, T.-S. Chua, G.-J. Qi, R. Jain, Image annota-
US
tion by knn-sparse graph-based label propagation over noisily tagged web
images, ACM Trans. Intell. Syst. Technol. 2 (2) (2011) 14:1–14:15.
AN
[116] G. Chen, J. Zhang, F. Wang, C. Zhang, Y. Gao, Efficient multi-label
classification with hypergraph regularization, in: 2009 IEEE Conference
M
on Computer Vision and Pattern Recognition, 2009, pp. 1658–1665. doi:
10.1109/CVPR.2009.5206813.
ED
[117] Y. Han, F. Wu, Q. Tian, Y. Zhuang, Image annotation by input-output
structural grouping sparsity, IEEE Transactions on Image Processing
PT
21 (6) (2012) 3066–3079.
[118] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, D. M. Blei, Reading
CE
tea leaves: How humans interpret topic models, in: Proceedings of the
22Nd International Conference on Neural Information Processing Systems,
AC
NIPS’09, Curran Associates Inc., USA, 2009, pp. 288–296.
[119] A. P. Dempster, M. N. Laird, D. B. Rubin, Maximum likelihood from
incomplete data via the EM algorithm, Journal of the Royal Statistical
Society: Series B (Statistical Methodology) 39 (1977) 1–22.
[120] D. Kong, C. Ding, H. Huang, H. Zhao, Multi-label relieff and f-statistic
feature selections for image annotation, in: 2012 IEEE Conference on
Computer Vision and Pattern Recognition, 2012, pp. 2352–2359.
[121] X. Jia, F. Sun, H. Li, Y. Cao, X. Zhang, Image multi-label annotation
based on supervised nonnegative matrix factorization with new matching
measurement, Neurocomputing 219 (Supplement C) (2017) 518 – 525.
T
doi:https://doi.org/10.1016/j.neucom.2016.09.052.
IP
[122] K. S. Goh, E. Y. Chang, B. Li, Using one-class and two-class svms for
Engineering 17 (10) (2005) 1333–1346.
CR
multiclass image annotation, IEEE Transactions on Knowledge and Data
[123] D. Grangier, S. Bengio, A discriminative kernel-based approach to rank
US
images from text queries, IEEE Transactions on Pattern Analysis and
Machine Intelligence 30 (8) (2008) 1371–1384. doi:10.1109/TPAMI.2007.
AN
70791.
[124] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, T. Huang, Large-
M
scale image classification: Fast feature extraction and svm training, in:
CVPR 2011, 2011, pp. 1689–1696. doi:10.1109/CVPR.2011.5995477.
ED
[125] K. Kuroda, M. Hagiwara, An image retrieval system by impression words
and specific object namesiris, Neurocomputing 43 (1) (2002) 259 – 276,
PT
selected engineering applications of neural networks. doi:https://doi.
org/10.1016/S0925-2312(01)00344-7.
CE
[126] R. C. F. Wong, C. H. C. Leung, Automatic semantic annotation of realworld web images, IEEE Transactions on Pattern Analysis and Machine
AC
Intelligence 30 (11) (2008) 1933–1944. doi:10.1109/TPAMI.2008.125.
[127] G. Carneiro, A. B. Chan, P. J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (3) (2007) 394–410.
[128] S. Chengjian, S. Zhu, Z. Shi, Image annotation via deep neural network, in:
2015 14th IAPR International Conference on Machine Vision Applications
(MVA), 2015, pp. 518–521. doi:10.1109/MVA.2015.7153244.
[129] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with
deep convolutional neural networks, in: F. Pereira, C. J. C. Burges,
L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information
Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
T
URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-n
IP
pdf
CR
[130] W. Liu, D. Tao, Multiview hessian regularization for image annotation,
IEEE Transactions on Image Processing 22 (7) (2013) 2676–2687.
US
[131] O. Boiman, E. Shechtman, M. Irani, In defense of nearest-neighbor based
image classification, in: 2008 IEEE Conference on Computer Vision and
AN
Pattern Recognition, 2008, pp. 1–8. doi:10.1109/CVPR.2008.4587598.
[132] K. Q. Weinberger, L. K. Saul, Distance metric learning for large margin
M
nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244.
[133] J. Verbeek, M. Guillaumin, T. Mensink, C. Schmid, Image annotation
ED
with tagprop on the mirflickr set, in: Proceedings of the International
Conference on Multimedia Information Retrieval, MIR ’10, ACM, New
PT
York, NY, USA, 2010, pp. 537–546. doi:10.1145/1743384.1743476.
[134] X. Xu, A. Shimada, R.-i. Taniguchi, Image annotation by learning label-
CE
specific distance metrics, in: A. Petrosino (Ed.), Image Analysis and
Processing – ICIAP 2013, Springer Berlin Heidelberg, Berlin, Heidelberg,
AC
2013, pp. 101–110.
[135] M. M. Kalayeh, H. Idrees, M. Shah, Nmf-knn: Image annotation using
weighted multi-view non-negative matrix factorization, in: 2014 IEEE
Conference on Computer Vision and Pattern Recognition, 2014, pp. 184–
191.
[136] Z. Lin, G. Ding, M. Hu, Image auto-annotation via tag-dependent random
search over range-constrained visual neighbours, Multimedia Tools Appl.
74 (11) (2015) 4091–4116. doi:10.1007/s11042-013-1811-3.
[137] L. Ballan, T. Uricchio, L. Seidenari, A. D. Bimbo, A cross-media model for
automatic image annotation, in: Proceedings of International Conference
on Multimedia Retrieval, ICMR ’14, ACM, New York, NY, USA, 2014,
T
pp. 73:73–73:80. doi:10.1145/2578726.2578728.
IP
[138] T. Mensink, J. Verbeek, F. Perronnin, G. Csurka, Metric learning for large
scale image classification: Generalizing to new classes at near-zero cost,
CR
in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.),
Computer Vision – ECCV 2012, Springer Berlin Heidelberg, Berlin, Hei-
US
delberg, 2012, pp. 488–501.
[139] H. Wang, H. Huang, C. Ding, Image annotation using bi-relational graph
AN
of images and semantic labels, in: CVPR 2011, 2011, pp. 793–800.
[140] X. Chen, Y. Mu, S. Yan, T.-S. Chua, Efficient large-scale image annotation
M
by probabilistic collaborative multi-label propagation, in: Proceedings of
the 18th ACM International Conference on Multimedia, MM ’10, ACM,
ED
New York, NY, USA, 2010, pp. 35–44. doi:10.1145/1873951.1873959.
[141] F. Su, L. Xue, Graph learning on k nearest neighbours for automatic image
PT
annotation, in: Proceedings of the 5th ACM on International Conference
on Multimedia Retrieval, ICMR ’15, ACM, New York, NY, USA, 2015,
CE
pp. 403–410.
[142] P. Ji, X. Gao, X. Hu, Automatic image annotation by combining generative and discriminant models, Neurocomputing 236 (Supplement C)
AC
(2017) 48 – 55, good Practices in Multimedia Modeling. doi:https:
//doi.org/10.1016/j.neucom.2016.09.108.
[143] B. Xie, Y. Mu, D. Tao, K. Huang, m-sne: Multiview stochastic neighbor
embedding, IEEE Transactions on Systems, Man, and Cybernetics, Part
B (Cybernetics) 41 (4) (2011) 1088–1096. doi:10.1109/TSMCB.2011.
2106208.
[144] C. Xu, D. Tao, C. Xu, Multi-view intact space learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (12) (2015) 2531–
2544. doi:10.1109/TPAMI.2015.2417578.
T
[145] T. Xia, D. Tao, T. Mei, Y. Zhang, Multiview spectral embedding, IEEE
IP
Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)
CR
40 (6) (2010) 1438–1446. doi:10.1109/TSMCB.2009.2039566.
[146] Y. Luo, D. Tao, C. Xu, C. Xu, H. Liu, Y. Wen, Multiview vector-valued
manifold regularization for multilabel image classification, IEEE Transac-
doi:10.1109/TNNLS.2013.2238682.
US
tions on Neural Networks and Learning Systems 24 (5) (2013) 709–722.
AN
[147] J. Yu, Y. Rui, Y. Y. Tang, D. Tao, High-order distance-based multiview
stochastic learning in image classification, IEEE Transactions on Cyber-
M
netics 44 (12) (2014) 2431–2442. doi:10.1109/TCYB.2014.2307862.
[148] S. C. H. Hoi, W. Liu, M. R. Lyu, W.-Y. Ma, Learning distance met-
ED
rics with contextual constraints for image retrieval, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition
PT
(CVPR’06), Vol. 2, 2006, pp. 2072–2078. doi:10.1109/CVPR.2006.167.
[149] Z. Wang, S. Gao, L.-T. Chia, Learning class-to-image distance via large
CE
margin and l1-norm regularization, in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision – ECCV 2012, Springer
AC
Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 230–244.
[150] P. Wu, S. C.-H. Hoi, P. Zhao, Y. He, Mining social images with distance metric learning for automated image tagging, in: Proceedings of
the Fourth ACM International Conference on Web Search and Data
Mining, WSDM ’11, ACM, New York, NY, USA, 2011, pp. 197–206.
doi:10.1145/1935826.1935865.
[151] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, Y.-F. Li, Multi-instance multi-
label learning, Artificial Intelligence 176 (1) (2012) 2291 – 2320. doi:
https://doi.org/10.1016/j.artint.2011.10.002.
[152] F. Briggs, X. Z. Fern, R. Raich, Rank-loss support instance machines for
T
miml instance annotation, in: Proceedings of the 18th ACM SIGKDD In-
IP
ternational Conference on Knowledge Discovery and Data Mining, KDD
’12, ACM, New York, NY, USA, 2012, pp. 534–542.
CR
2339530.2339616.
doi:10.1145/
[153] C.-T. Nguyen, D.-C. Zhan, Z.-H. Zhou, Multi-modal image annotation
US
with multi-instance multi-label lda, in: Proceedings of the Twenty-Third
International Joint Conference on Artificial Intelligence, IJCAI ’13, AAAI
AN
Press, 2013, pp. 1558–1564.
[154] M. L. Zhang, A k-nearest neighbor based multi-instance multi-label learning algorithm, in: 2010 22nd IEEE International Conference on Tools with
M
Artificial Intelligence, Vol. 2, 2010, pp. 207–212. doi:10.1109/ICTAI.
ED
2010.102.
[155] J. He, H. Gu, Z. Wang, Bayesian multi-instance multi-label learning using
gaussian process prior, Machine Learning 88 (1) (2012) 273–295. doi:
PT
10.1007/s10994-012-5283-x.
[156] S. Huang, Z. Zhou, Fast multi-instance multi-label learning, CoRR
CE
abs/1310.2049.
[157] C. Desai, D. Ramanan, C. C. Fowlkes, Discriminative models for multi-
AC
class object layout, International Journal of Computer Vision 95 (1) (2011)
1–12. doi:10.1007/s11263-011-0439-x.
[158] B. Hariharan, S. V. N. Vishwanathan, M. Varma, Efficient max-margin
multi-label classification with applications to zero-shot learning, Machine
Learning 88 (1) (2012) 127–155. doi:10.1007/s10994-012-5291-x.
[159] Y. Guo, S. Gu, Multi-label classification using conditional dependency
networks, in: Proceedings of the Twenty-Second International Joint
Conference on Artificial Intelligence - Volume Volume Two, IJCAI’11,
AAAI Press, 2011, pp. 1300–1305. doi:10.5591/978-1-57735-516-8/
IJCAI11-220.
T
[160] Z.-J. Zha, T. Mei, J. Wang, Z. Wang, X.-S. Hua, Graph-based semi-
IP
supervised learning with multiple labels, Journal of Visual Communication and Image Representation 20 (2) (2009) 97 – 103, special issue on
CR
Emerging Techniques for Multimedia Content Sharing, Search and Understanding. doi:https://doi.org/10.1016/j.jvcir.2008.11.009.
US
[161] Z.-H. Zhou, M.-L. Zhang, Multi-instance multi-label learning with application
to scene classification, in: B. Schölkopf, J. C. Platt, T. Hoffman (Eds.),
AN
Advances in Neural Information Processing Systems 19, MIT Press, 2007,
pp. 1609–1616.
M
[162] Y. Liu, R. Jin, L. Yang, Semi-supervised multi-label learning by constrained non-negative matrix factorization, in: Proceedings of the 21st
ED
National Conference on Artificial Intelligence - Volume 1, AAAI’06, AAAI
Press, 2006, pp. 421–426.
PT
[163] F. Kang, R. Jin, R. Sukthankar, Correlated label propagation with application to multi-label learning, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2,
CE
2006, pp. 1719–1726. doi:10.1109/CVPR.2006.90.
[164] S. S. Bucak, R. Jin, A. K. Jain, Multi-label learning with incomplete class
AC
assignments, in: CVPR 2011, 2011, pp. 2801–2808. doi:10.1109/CVPR.
2011.5995734.
[165] H.-F. Yu, P. Jain, P. Kar, I. S. Dhillon, Large-scale multi-label learning
with missing labels, in: Proceedings of the 31st International Conference
on International Conference on Machine Learning - Volume 32, ICML’14,
JMLR.org, 2014, pp. I–593–I–601.
[166] H. Yang, J. T. Zhou, J. Cai, Improving multi-label learning with missing
labels by structured semantic correlations, in: B. Leibe, J. Matas, N. Sebe,
M. Welling (Eds.), Computer Vision – ECCV 2016, Springer International
T
Publishing, Cham, 2016, pp. 835–851.
IP
[167] M. L. Zhang, Z. H. Zhou, A review on multi-label learning algorithms,
1819–1837. doi:10.1109/TKDE.2013.39.
CR
IEEE Transactions on Knowledge and Data Engineering 26 (8) (2014)
[168] S. Feng, C. Lang, Graph regularized low-rank feature mapping for multi-
US
label learning with application to image annotation, Multidimensional
Systems and Signal Processingdoi:10.1007/s11045-017-0505-9.
AN
[169] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using gaussian fields and harmonic functions, in: Proceedings of the Twentieth In-
M
ternational Conference on International Conference on Machine Learning,
ICML’03, AAAI Press, 2003, pp. 912–919.
ED
[170] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, B. Schölkopf, Learning
with local and global consistency, in: Proceedings of the 16th Interna-
PT
tional Conference on Neural Information Processing Systems, NIPS’03,
MIT Press, Cambridge, MA, USA, 2003, pp. 321–328.
CE
[171] T. G. Dietterich, R. H. Lathrop, T. Lozano-Prez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence
89 (1) (1997) 31 – 71. doi:https://doi.org/10.1016/S0004-3702(96)
AC
00034-3.
[172] O. Maron, A. L. Ratan, Multiple-instance learning for natural scene classification, in: Proceedings of the Fifteenth International Conference on
Machine Learning, ICML ’98, Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 1998, pp. 341–349.
[173] Learning with multiple views (2005).
[174] A. Fakeri-Tabrizi, M. R. Amini, P. Gallinari, Multiview semi-supervised
ranking for automatic image annotation, in: Proceedings of the 21st ACM
International Conference on Multimedia, MM ’13, ACM, New York, NY,
T
USA, 2013, pp. 513–516. doi:10.1145/2502081.2502136.
IP
[175] Y. Li, X. Shi, C. Du, Y. Liu, Y. Wen, Manifold regularized multi-view
feature selection for social image annotation, Neurocomputing 204 (Sup-
CR
plement C) (2016) 135 – 141, big Learning in Social Media Analytics.
doi:https://doi.org/10.1016/j.neucom.2015.07.151.
abs/1304.5634. arXiv:1304.5634.
US
[176] C. Xu, D. Tao, C. Xu, A survey on multi-view learning, CoRR
AN
[177] S. Sun, A survey of multi-view machine learning, Neural Computing and Applications 23 (7) (2013) 2031–2038.
M
s00521-013-1362-6.
doi:10.1007/
[178] W. Liu, D. Tao, J. Cheng, Y. Tang, Multiview hessian discriminative
ED
sparse coding for image annotation, Computer Vision and Image Understanding 118 (Supplement C) (2014) 50 – 60. doi:https://doi.org/10.
PT
1016/j.cviu.2013.03.007.
[179] C. Jin, S.-W. Jin, Image distance metric learning based on neighborhood
CE
sets for automatic image annotation, Journal of Visual Communication
and Image Representation 34 (Supplement C) (2016) 167 – 175. doi:
AC
https://doi.org/10.1016/j.jvcir.2015.10.017.
[180] G. Carneiro, N. Vasconcelos, A database centric view of semantic image
annotation and retrieval, in: Proceedings of the 28th Annual International
ACM SIGIR Conference on Research and Development in Information
Retrieval, SIGIR ’05, ACM, New York, NY, USA, 2005, pp. 559–566.
doi:10.1145/1076034.1076129.
[181] C. Wang, F. Jing, L. Zhang, H. J. Zhang, Content-based image annotation
refinement, in: 2007 IEEE Conference on Computer Vision and Pattern
Recognition, 2007, pp. 1–8.
[182] J. Wang, Y. Yang, J. Mao, Z. Huang, C. Huang, W. Xu, CNN-RNN: A uni-
T
fied framework for multi-label image classification, CoRR abs/1604.04573.
IP
arXiv:1604.04573.
CR
[183] J. Jin, H. Nakayama, Annotation order matters: Recurrent image annotator for arbitrary length image tagging, in: 2016 23rd International
Conference on Pattern Recognition (ICPR), 2016, pp. 2452–2457.
US
[184] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, C. Sun, Semantic regularisation for recurrent image annotation, in: 2017 IEEE Conference on
AN
Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4160–4168.
doi:10.1109/CVPR.2017.443.
M
[185] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural
image caption generator, in: 2015 IEEE Conference on Computer Vision
ED
and Pattern Recognition (CVPR), 2015, pp. 3156–3164. doi:10.1109/
CVPR.2015.7298935.
PT
[186] Y. Zhou, Y. Wu, Analyses on influence of training data set to neural network supervised learning performance, in: D. Jin, S. Lin (Eds.), Advances
CE
in Computer Science, Intelligent System and Environment, Springer Berlin
Heidelberg, Berlin, Heidelberg, 2011, pp. 19–25.
AC
[187] J. Z. Wang, J. Li, G. Wiederhold, Simplicity: semantics-sensitive integrated matching for picture libraries, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (9) (2001) 947–963.
[188] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet
Large Scale Visual Recognition Challenge, Vol. 115, 2015, pp. 211–252.
doi:10.1007/s11263-015-0816-y.
[189] M. Grubinger, Analysis and evaluation of visual information systems performance, thesis (Ph. D.)–Victoria University (Melbourne, Vic.), 2007
(2007).
[190] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, NUS-WIDE: A real-world web image database from National University of Singapore, in: Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR '09, ACM, New York, NY, USA, 2009, pp. 48:1–48:9. doi:10.1145/1646396.1646452.
[191] J. Tang, P. H. Lewis, A study of quality issues for image auto-annotation with the Corel dataset, IEEE Transactions on Circuits and Systems for Video Technology 17 (3) (2007) 384–389.
[192] Y. Xiang, X. Zhou, Z. Liu, T. S. Chua, C. W. Ngo, Semantic context modeling with maximal margin conditional random fields for automatic image annotation, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3368–3375.
[193] S. Gao, L. T. Chia, I. W. H. Tsang, Z. Ren, Concurrent single-label image classification and annotation via efficient multi-layer group sparse coding, IEEE Transactions on Multimedia 16 (3) (2014) 762–771.
[194] C. Wang, D. Blei, F. F. Li, Simultaneous image classification and annotation, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1903–1910. doi:10.1109/CVPR.2009.5206800.
[195] F. Nie, D. Xu, I. W. H. Tsang, C. Zhang, Flexible manifold embedding:
A framework for semi-supervised and unsupervised dimension reduction,
IEEE Transactions on Image Processing 19 (7) (2010) 1921–1932. doi:
10.1109/TIP.2010.2044958.
[196] H. Wang, H. Huang, C. Ding, Image annotation using multi-label correlated Green's function, in: 2009 IEEE 12th International Conference on Computer Vision, 2009, pp. 2029–2034. doi:10.1109/ICCV.2009.5459447.
[197] J. Wu, Y. Yu, C. Huang, K. Yu, Deep multiple instance learning for image classification and auto-annotation, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3460–3469.
[198] H. K. Shooroki, M. A. Z. Chahooki, Selection of effective training instances
for scalable automatic image annotation, Multimedia Tools and Applications 76 (7) (2017) 9643–9666. doi:10.1007/s11042-016-3572-2.
[199] Z. Shi, Y. Yang, T. M. Hospedales, T. Xiang, Weakly supervised learning of objects, attributes and their associations, CoRR abs/1504.00045. arXiv:1504.00045.
[200] J. Sánchez-Oro, S. Montalvo, A. S. Montemayor, J. J. Pantrigo, A. Duarte, V. Fresno-Fernández, R. Martínez-Unanue, URJC&UNED at ImageCLEF 2013 Photo Annotation Task, in: CLEF, 2013.
[201] S. Stathopoulos, T. Kalamboukis, IPL at ImageCLEF 2014: Scalable concept image annotation, in: CLEF 2014 Evaluation Labs and Workshop, Online Working Notes, 2014.
[202] P. Budíková, J. Botorek, M. Batko, P. Zezula, DISA at ImageCLEF 2014 revised: Search-based image annotation with DeCAF features, CoRR abs/1409.4627. arXiv:1409.4627.
[203] L. Deng, D. Yu, Deep learning: Methods and applications, Foundations and Trends in Signal Processing 7 (3–4) (2014) 197–387. doi:10.1561/2000000039.
[204] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, J. Li, Deep learning for content-based image retrieval: A comprehensive study, in: Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, ACM, New York, NY, USA, 2014, pp. 157–166. doi:10.1145/2647868.2654948.
[205] S. Nowak, S. Rüger, How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation, in: Proceedings of the International Conference on Multimedia Information Retrieval, MIR '10, ACM, New York, NY, USA, 2010, pp. 557–566. doi:10.1145/1743384.1743478.
[206] J. Moehrmann, G. Heidemann, Semi-automatic image annotation, in: R. Wilson, E. Hancock, A. Bors, W. Smith (Eds.), Computer Analysis of Images and Patterns, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 266–273.
[207] M. Tuffield, S. Harris, D. P. Dupplaw, A. Chakravarthy, C. Brewster, N. Gibbins, K. O'Hara, F. Ciravegna, D. Sleeman, Y. Wilks, N. R. Shadbolt, Image annotation with Photocopain, in: First International Workshop on Semantic Web Annotations for Multimedia (SWAMM 2006) at WWW2006, May 2006.
[208] D.-H. Im, G.-D. Park, Linked tag: image annotation using semantic relationships between image tags, Multimedia Tools and Applications 74 (7) (2015) 2273–2287. doi:10.1007/s11042-014-1855-z.
[209] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, D. N. Metaxas, Automatic image annotation using group sparsity, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3312–3319.
[210] B. C. Ko, J. Lee, J.-Y. Nam, Automatic medical image annotation and
keyword-based image retrieval using relevance feedback, Journal of Digital
Imaging 25 (4) (2012) 454–465. doi:10.1007/s10278-011-9443-5.
[211] S. Zhang, J. Huang, H. Li, D. N. Metaxas, Automatic image annotation and retrieval using group sparsity, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42 (3) (2012) 838–849.
[212] C. Rashtchian, P. Young, M. Hodosh, J. Hockenmaier, Collecting image annotations using Amazon's Mechanical Turk, in: Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, CSLDAMT '10, Association for Computational Linguistics, Stroudsburg, PA, USA, 2010, pp. 139–147. URL http://dl.acm.org/citation.cfm?id=1866696.1866717
[213] S. Nowak, S. Rüger, How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation, in: Proceedings of the International Conference on Multimedia Information Retrieval, MIR '10, ACM, New York, NY, USA, 2010, pp. 557–566. doi:10.1145/1743384.1743478.
[214] P. Welinder, P. Perona, Online crowdsourcing: Rating annotators and obtaining cost-effective labels, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 2010, pp. 25–32. doi:10.1109/CVPRW.2010.5543189.
[215] C.-F. Tsai, K. McGarry, J. Tait, Qualitative evaluation of automatic assignment of keywords to images, Information Processing & Management 42 (1) (2006) 136–154, Formal Methods for Information Retrieval. doi:10.1016/j.ipm.2004.11.001.
[216] Y. Zhang, Z.-H. Zhou, Multilabel dimensionality reduction via dependence
maximization, ACM Trans. Knowl. Discov. Data 4 (3) (2010) 14:1–14:21.
doi:10.1145/1839490.1839495.
[217] R. E. Schapire, Y. Singer, BoosTexter: A boosting-based system for text categorization, Machine Learning 39 (2) (2000) 135–168. doi:10.1023/A:1007649029923.
[218] W.-C. Lin, S.-W. Ke, C.-F. Tsai, Robustness and reliability evaluations
of image annotation, The Imaging Science Journal 64 (2) (2016) 94–99.
[219] X. Ke, M. Zhou, Y. Niu, W. Guo, Data equilibrium based automatic image annotation by fusing deep model and semantic propagation, Pattern Recognition 71 (Supplement C) (2017) 60–77. doi:10.1016/j.patcog.2017.05.020.
[220] L. Sun, H. Ge, S. Yoshida, Y. Liang, G. Tan, Support vector description of clusters for content-based image annotation, Pattern Recognition 47 (3) (2014) 1361–1374, Handwriting Recognition and other PR Applications. doi:10.1016/j.patcog.2013.10.015.
[221] J. Li, J. Z. Wang, Real-time computerized annotation of pictures, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (6) (2008) 985–1002. doi:10.1109/TPAMI.2007.70847.
[222] J. Tang, P. H. Lewis, Image auto-annotation using 'easy' and 'more challenging' training sets, in: 7th International Workshop on Image Analysis for Multimedia Interactive Services, 2006, pp. 121–124, April 19–21.
[223] P. Duygulu, K. Barnard, J. F. G. de Freitas, D. A. Forsyth, Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary, in: A. Heyden, G. Sparr, M. Nielsen, P. Johansen (Eds.), Computer Vision — ECCV 2002, Springer Berlin Heidelberg, Berlin, Heidelberg, 2002, pp. 97–112.
[224] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision 88 (2) (2010) 303–338.
[225] H. Müller, P. Clough, T. Deselaers, B. Caputo, ImageCLEF: Experimental Evaluation in Visual Information Retrieval, 1st Edition, Springer Publishing Company, Incorporated, 2010.
[226] M. Villegas, R. Paredes, B. Thomee, Overview of the ImageCLEF 2013
Scalable Concept Image Annotation Subtask, in: CLEF 2013 Evaluation
Labs and Workshop, Online Working Notes, Valencia, Spain, 2013.
[227] J. Sánchez-Oro, S. Montalvo, A. S. Montemayor, J. J. Pantrigo, A. Duarte, V. Fresno, R. Martínez, URJC&UNED at ImageCLEF 2013 Photo Annotation Task, in: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes, Valencia, Spain, 2013.