Image Annotation: Then and Now

P K Bhagat*, Prakash Choudhary
National Institute of Technology Manipur, Imphal, India - 795001

To appear in: Image and Vision Computing (2018). PII: S0262-8856(18)30162-8, doi:10.1016/j.imavis.2018.09.017. Received 6 August 2018; accepted 24 September 2018.

Abstract
Automatic image annotation (AIA) plays a vital role in dealing with the exponentially growing collections of digital images. Image annotation supports effective retrieval, organization, classification, and auto-illustration of images. Research on the topic started in the early 1990s, and over the last three decades extensive work on AIA has produced a variety of new approaches. In this article, we review more than 200 references related to image annotation proposed during this period. The paper discusses the predominant approaches, their constraints, and ways to deal with them. Each section closes with a discussion that summarizes the findings, the future research directions, and their hurdles. The paper also presents performance evaluation measures along with the relevant and influential image annotation datasets.

Keywords: Image annotation, automatic image annotation, multi-label classification, image labeling, image tagging, annotation dataset, annotation performance evaluation, image features, image retrieval.

* Corresponding author. Email addresses: pkbhagat@nitmanipur.ac.in (P K Bhagat), choudharyprakash@nitmanipur.ac.in (Prakash Choudhary).

1. Introduction

Humans have a remarkable capability to organize pictures: we store and recall images by the objects present in them. Nowadays the number of digital images is proliferating, in fact faster than expected, so a coherent organization of digital images benefits dynamic retrieval.

Figure 1: Image and corresponding labels from the ESP Game database [1].

An unorganized collection in an extensive image database leads to very inefficient search and may also return irrelevant images. For example, searching for images containing a dog in such an unorganized collection takes a long response time and, due to imprecise and superficial annotation, may return irrelevant images of other animals such as cat, fox, or wolf. Image annotation is the process of labeling an image with keywords that represent its contents, which helps in the intelligent retrieval of relevant images through a simple query representation. An example of an image with labels is shown in Figure 1. The assignment of keywords can be performed manually or automatically; the latter is called automatic image annotation (AIA).
The retrieval of images is performed in two ways: content-based image retrieval and text-based image retrieval. In content-based image retrieval (CBIR), images are organized using visual features, and during retrieval a visual similarity index between the query image and the database images is the key. Visual similarity refers to similarity based on visual features such as color, texture, and shape. The semantic gap is one of the most critical concerns associated with CBIR. The gap between low-level contents and high-level semantic concepts is known as the semantic gap [2]; it arises because CBIR relies on visual similarity for the retrieval of similar images. On the other hand, the text-based image retrieval (TBIR) technique relies on metadata associated with the image for its retrieval. In TBIR, images are organized according to their contents, and these contents are associated with the image in the form of metadata; at retrieval time, this metadata is used to fetch the related images. In TBIR, images are stored using their semantic concepts, and text is used as a query to retrieve them. The manual assignment of keywords to images is a time-consuming and very expensive process; thus, researchers have tried to assign keywords to images automatically.

AIA techniques attempt to learn a model from the training data and use the trained model to assign semantic labels to new images automatically [3]. AIA assigns one or multiple labels to an image, either based on visual features or by exploiting image metadata. Image retrieval can be performed using many techniques, such as classification and clustering, similarity measures, search paradigms, and visual similarity [4]. Image annotation can be considered a submodule of an image retrieval system. Image retrieval, which can be implemented using either CBIR or TBIR, has different working principles. CBIR analyzes the visual contents (color, texture, shape, etc.) of images without considering the image metadata (labels, surrounding texts, etc.) for retrieval; the query in CBIR can only be an image itself or its visual features. TBIR, in contrast, exploits the image metadata for retrieval: semantic concepts are used to represent the image in the database, and the query can be text or an image itself. When an image is presented as a query, it is first interpreted in semantic (textual) form.

AIA can be performed using visual features [5, 6, 7, 8, 9], visual features combined with textual features [10, 11, 12, 13, 14], or metadata associated with images [15, 16, 17, 18]. Visual features, which are based on pixel intensity values and their spatial relations and are extracted directly from the image, include color, texture, shape, etc. Textual features are derived from image metadata such as keywords, filenames, URLs, surrounding texts, and GPS information. Whichever way annotation is achieved, the basic idea is to learn a model that annotates new images so that images can be efficiently organized for a simplified retrieval process. Recently, the focus has been on learning the model from weakly supervised [19, 20, 21, 22, 23, 24, 25, 26, 27] or unsupervised training data [28, 13, 15, 17, 16], because image databases are growing at a very large scale and it is difficult to prepare fully labeled training data.
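To make the CBIR working principle concrete, the following is a minimal sketch (an illustrative assumption, not taken from the reviewed works) of content-based retrieval that ranks database images by color-histogram similarity to a query image; the image paths, bin counts, and the use of histogram intersection as the similarity index are placeholder choices.

```python
# Minimal CBIR sketch: rank database images by color-histogram similarity
# to a query image. Paths and bin counts are illustrative assumptions.
import cv2
import numpy as np

def color_histogram(path, bins=(8, 8, 8)):
    """Return a normalized 3-D BGR color histogram as a flat vector."""
    img = cv2.imread(path)                      # BGR image
    hist = cv2.calcHist([img], [0, 1, 2], None, bins,
                        [0, 256, 0, 256, 0, 256]).flatten()
    return hist / (hist.sum() + 1e-8)           # normalize to unit mass

def histogram_intersection(h1, h2):
    """Visual similarity index: higher means more similar."""
    return float(np.minimum(h1, h2).sum())

# Hypothetical database of image paths.
database = ["db/beach.jpg", "db/dog.jpg", "db/forest.jpg"]
query_hist = color_histogram("query.jpg")

ranked = sorted(database,
                key=lambda p: histogram_intersection(query_hist,
                                                     color_histogram(p)),
                reverse=True)
print(ranked)  # most visually similar images first
```

A TBIR system, by contrast, would skip the visual comparison entirely and match the textual query against keywords previously attached to each image.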
At present, there exist various survey papers on AIA [29, 3, 30, 31, 32]. In [32], the authors present a comprehensive review of AIA with a primary focus on the efficiency of social tagging. They discuss in depth the issues of social image tagging and the refinement of noisy and missing social tags, and present a comparative study of various annotation methods under a common experimental protocol to check their efficiency for tag assignment, refinement, and retrieval. Although the paper covers a wide range of AIA methods, several important aspects of AIA (semi-supervised learning, unsupervised learning, deep learning, etc.) are not covered in detail.

A recent paper [31] presents a much more generalized categorization of AIA methods, focusing on the presentation and explanation of core AIA techniques. Since the implementation strategy varies from article to article, one also needs to pay attention to how an appropriate technique is implemented in the relevant research work. Moreover, AIA is a vast field; any attempt to explain its methods requires a well-defined categorization covering all aspects of AIA. In contrast to [31], we adopt a different basis for the classification of AIA methods, one that is more specialized and informative. Notably, we pay much more attention to how efficiently a research work implements AIA methods to achieve a useful result. In this survey, we try to cover every aspect of AIA while explaining how each branch of AIA has expanded and what the scope for further improvement is. The paper discusses practical approaches, their constraints, and ways to deal with them. Each section closes with a discussion that summarizes the findings, the future research directions, and their hurdles. The paper also presents performance evaluation measures and the relevant and predominant image annotation datasets.

In this paper, we comprehensively review more than 200 papers related to image annotation and its baselines and present the current progress and future trends. The categorization of image annotation methods followed in this survey is shown in Figure 2.

Figure 2: A broad categorization of image annotation methods.

The rest of the article is organized as follows. Section 2 gives an overview of the history of image annotation methods. Section 3 describes some of the dominant visual features and extraction methods. Section 4 presents detailed descriptions of image annotation methods. Section 5 describes the most popular evaluation measures for verifying the performance of an annotation system. Section 6 presents the datasets most used for training and evaluating annotation models, followed by the conclusion in Section 7. Each section concludes with a discussion of the challenges faced, the limitations, and the future scope of the methods of that section. Although there is a slight difference between a method and a technique, we ignore it and use the two terms interchangeably.

2. History of Image Annotation: An Overview

2.1. First Decade

Image annotation has been studied since 1990. The years 1990 to 2000 can be considered the very early years of image annotation; especially after 1996, there was a sharp increase in the number of papers published on the topic. A comprehensive review of these years is summarized in [2].
Most of the methods proposed in the first decade used manually extracted features followed by some classifier to annotate an image. The main concern of all these methods is to fill the semantic gap, i.e., the gap between how the user and the machine perceive an image. The user interprets an image in high-level form, while the computer interprets the same image in low-level form; this gap of interpretation is known as the semantic gap. In the early years, low-level features (texture, color, shape, etc.) were extracted using handcrafted feature extraction techniques, and there was no relationship between these low-level features and textual features; therefore, filling the semantic gap was a problem. All the proposed methods used supervised learning, and retrieval was based on CBIR. The paper [2] reviewed the work carried out in this early era of annotation and retrieval and set the goal for the next age; most of the articles in the later decade followed the direction provided in [2].

2.2. Second Decade

In the early years the problem of the semantic gap was recognized, and in the medieval era researchers explored various methods to deal with it. From 2000 to 2008 there was extensive research on filling the semantic gap; a thorough study of these techniques is structured in [4]. Even though manually extracted features were still used in most papers, the results produced were far better and dealt more effectively with the semantic gap [33], and many state-of-the-art techniques and baselines were produced during this era [9, 34, 8]. The research of these years focused mainly on finding the correlation between visual and textual features; we explore these methods in the following sections. We call the time span 2002-2008 the medieval era of research and development on image annotation and retrieval techniques. In this medieval period, machine learning techniques were used extensively, so the era was dedicated to the use of machine learning for image annotation and retrieval [35, 36, 37, 19, 38, 33]. Later, [4] reviewed some of the most promising related work carried out during 2000 to 2008 and set the future direction of the image annotation and retrieval area. In the early years and the medieval era, almost all image annotation methods were based on supervised learning, in which the training dataset is provided with the complete set of manually annotated labels. Most of the methods proposed in this era followed CBIR-based image retrieval.

2.3. Third Decade

After 2010, deep learning techniques have been used extensively for the annotation and retrieval process [11, 39, 40, 27]. Convolutional neural network (CNN) based features [39, 11] and features extracted by CNN-based pre-trained AlexNet and VGGNet networks [40, 27] are being utilized for AIA. Recently, a restricted Boltzmann machine (RBM) based model was designed for the automatic annotation of geographical images [13]. The limitations of CBIR shifted the research focus toward TBIR, in which semantic keywords are used for the retrieval of images [15, 41, 42]; TBIR requires the image to be annotated with semantic keywords. Also, due to the limitation of supervised learning, which requires a large amount of labeled training data, researchers are exploring semi-supervised and unsupervised learning methods.
After 2010, researchers shifted their focus towards semi-supervised image annotation, where a model is trained on training data with incomplete labels for large-scale image annotation [43, 26, 24]. In semi-supervised learning (SSL) based large-scale image annotation, the real challenge is to deal with massive, noisy datasets in which the number of images is increasing more rapidly than ever anticipated. Recently, researchers have started exploring unsupervised image annotation techniques [15, 44, 45, 46], where the training dataset is not labeled at all and only metadata (URL, surrounding texts, filename, etc.) are provided. Unsupervised learning based annotation methods are in the early stage of their evolution.

Discussion: The semantic gap problem recognized in the early era was minimized, or almost solved, in the medieval era within a supervised learning framework. Over the years, a shift can be observed from handcrafted feature extraction techniques, to the correlation between textual and visual features, to deep learning based features. The focus has also shifted from CBIR to TBIR. Then the problem of large-scale databases arose, and it continues to grow. Currently, researchers are exploring the possibility of SSL to deal with the increasing size of datasets. Recent advances in deep learning and the capabilities of high-performance computer systems give us the opportunity to think differently and do away with handcrafted feature extraction techniques. Image annotation has also been applied to medical images [47, 48, 49, 50] and satellite images [51, 13] and has provided excellent results using current state-of-the-art methods. A highlight of image annotation methods in the last three decades is presented in Table 1.

Table 1: Some highlights of image annotation methods in the last three decades.

                 Early Era                   Medieval Era                        Last Decade
Features         handmade feature            correlation of visual              correlation of visual and textual
                 extraction                  and textual features               features, deep learning features
Learning         supervised learning         supervised learning                semi-supervised learning*, unsupervised learning+
Retrieval        CBIR                        CBIR                               TBIR
Problem faced    semantic gap                image based query                  learning from noisy and unorganized texts

* Most of the methods followed this path only.  + Few of the methods started following this path.

3. Image Processing Features

The contents of an image correspond to the objects present in it. The objects in an image appear at different levels of abstraction and can be symbolically represented by moments, shapes, contours, etc., described by location and relative relationships with other objects. The tags predicted by an annotation model are associated with the image for its semantic retrieval. The main idea behind AIA is to use semantic learning with high-level semantic concepts [30]. Localization of the objects is one of the essential aspects of describing the contents of the image [52, 12, 22, 7], although there have been various instances where annotation is performed efficiently without object localization [53, 5, 54, 38, 14]. However, whether or not the objects are segmented, low-level features such as texture, color, shape, and corners are extracted to describe the contents of the image. The challenge is to map these low-level features to high-level semantic concepts.
The content of an image can be described at various levels, each with its own worth. Image annotation does not rely exclusively on the visual contents of the image; it may also process metadata associated with the image and find associations between visual contents and metadata to annotate an image. Considering only visual contents, however, an image can be represented at different levels, viz. the texture level, color level, object level, and salient point level.

Texture Features: Texture plays an important role in the human visual perception system. In the absence of any single definition of texture, the surface characteristics and appearance of an object can be considered its texture [55]. Computer vision researchers describe texture as everything in an image that remains after the color components and local shapes are removed [2]. The various methods for extraction of texture features include histogram based first-order texture features [56], the gray-tone spatial dependence matrix [57] and its variants [58, 59], the local binary pattern (LBP) [60] and its variants [61], the local directional pattern [62] and its variants [63], wavelet features [64], etc. According to [65], if we get the same histogram twice, then the same class of spatial stimulus has been encountered; for a 512 x 512 image there are 26 x 10^6 possible permutations of pixels, so the probability of obtaining the same histogram twice for different textures is almost zero [65]. Texture is an important property of an image, but it alone cannot describe the image in its entirety. Although it has been asserted that texture pairs agreeing in their second-order statistics cannot be discriminated [66], texture features alone are not sufficient to annotate an image.

Color Features: Color-based features are among the most powerful representations of image content. An image is composed of RGB colors; hence, extracting RGB-based features has been a very active area of image content representation, and opponent color representations of RGB have also been reported in the literature [67]. The color histogram has been used by several authors to represent image contents [5, 10, 68, 6]; an excellent review of color image processing can be found in [2]. Of the various visual features of an image, color is the most straightforward. Color plays an indispensable role in human recognition of objects in an image, as human eyes are sensitive to color. Color features are invariant to rotation and transformation, which makes them one of the most dominant visual features; if normalization is used, color features are also invariant to scaling. A comparative analysis of color features can be found in [69]. A small variation in color value is less noticeable to the human eye than a variation in gray-level value, which helps in designing more compact color-based features. The color space can be represented using RGB, HSL, HSI, HSV, etc., and such representations are invariant to rotation and translation. Whatever color space is used, color information can be represented in a 3-D space (one dimension per color channel) [70]. An MPEG-7 based scalable color descriptor, represented in the HSV color space and encoded by a Haar transform, is presented in [71].
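As a concrete illustration of these two feature families, the sketch below (an assumption for illustration, not drawn from the cited works) computes a uniform-LBP texture histogram and a BGR color histogram for a single image using scikit-image and OpenCV; the image path, LBP radius, and bin counts are placeholder choices.

```python
# Illustrative extraction of texture (LBP) and color (histogram) features.
# The image path, radius, and bin counts are assumptions for the sketch.
import cv2
import numpy as np
from skimage.feature import local_binary_pattern

img = cv2.imread("example.jpg")                     # BGR image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Texture: uniform LBP codes summarized as a normalized histogram.
radius, n_points = 1, 8
lbp = local_binary_pattern(gray, n_points, radius, method="uniform")
texture_hist, _ = np.histogram(lbp, bins=np.arange(0, n_points + 3),
                               density=True)

# Color: normalized 3-D BGR histogram flattened into a vector.
color_hist = cv2.calcHist([img], [0, 1, 2], None, [8, 8, 8],
                          [0, 256, 0, 256, 0, 256]).flatten()
color_hist /= color_hist.sum() + 1e-8

# A simple global descriptor concatenates the two views.
feature_vector = np.concatenate([texture_hist, color_hist])
print(feature_vector.shape)   # (10 + 512,) = (522,)
```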
Object Level Features: Recognition of the objects present in the image is one of the most important aspects of image annotation. The best way to recognize objects is to segment the objects present in the image and then extract features from those segmented regions. Recognition of the objects helps both in the annotation of the image and in region-label matching. However, segmentation of objects is itself a complex task, and unsupervised segmentation in particular is brittle. Semantic segmentation requires each pixel to be labeled with an object class [22] and can be implemented in a supervised [72, 73], weakly supervised [74, 75], or unsupervised manner [76]. Deep neural networks are also being used extensively for object segmentation [77, 78]. The accuracy of the segmentation algorithm plays a crucial role in the semantic annotation of the image, and various image annotation methods have been reported that use segmentation techniques for the recognition of objects [22, 79, 52, 80, 12]. Robust segmentation, in which one segmented region contains all the pixels of one object and no pixels of any other object, is essential for annotating an image semantically but is hard to achieve; therefore, researchers have tried to use weak segmentation instead. Alternatively, segmentation can be omitted and other types of features used for image annotation [81, 15, 82, 24, 7]. Although accurate, robust semantic segmentation is difficult to achieve, segmented regions are useful and powerful features for annotation.

Salient Image Point Features: Salient attributes present in the image, usually identified by color, texture, or local shapes, are used to produce salient features. Color and texture features are commonly used to represent the contents of an image, but salient points deliver more discriminative features. In the absence of object-level segmentation, salient points act as a weak segmentation and play an essential role in the representation of an image. Salient points may be present at different locations in the image and need not be corners, i.e., they can lie on smooth lines as well. Ref. [83] compared salient points extracted using wavelets with a corner detection algorithm, and a salient point detection and scale estimation method based on the local minima of fractional Brownian motion is proposed in [84]. Although salient points are a substitute for segmentation (segmentation being a brittle task), using salient points together with segmentation gives much more discriminative features. Salient points have been used by various authors for image annotation [52].

Recently, scale-invariant feature transform (SIFT) [85] based features have become much more popular. SIFT is a scale- and rotation-invariant local feature descriptor based on edge-orientation histograms, which extracts interest points and their descriptors. A related method that computes a histogram of gradient directions in localized portions of an image, called the histogram of oriented gradients (HOG) [86], is a robust feature descriptor. Later, a sped-up version of SIFT called speeded-up robust features (SURF) was introduced in [87]. For real-time applications, a machine learning based corner detection algorithm called features from accelerated segment test (FAST) was introduced in [88]. SIFT and SURF descriptors are usually converted into binary strings to speed up the matching process. However, binary robust independent elementary features (BRIEF) [89] provides a shortcut for finding binary strings directly without computing the descriptors; it is worth mentioning that BRIEF is a feature descriptor, not a feature detector. SIFT uses a 128-dimensional descriptor and SURF a 64-dimensional descriptor, i.e., they have a high computational complexity. An alternative to SIFT and SURF called oriented FAST and rotated BRIEF (ORB) is proposed in [90]. SIFT and SURF are patented algorithms, whereas ORB is not patented and has comparable performance with lower computational complexity than SIFT and SURF.
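For illustration, the following small sketch (an assumption, not taken from the cited works) extracts ORB keypoints and binary descriptors with OpenCV; the image path and the number of features are placeholder values.

```python
# Illustrative local-feature extraction with ORB (binary descriptors).
# The image path and feature count are placeholder assumptions.
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create(nfeatures=500)          # FAST keypoints + rotated BRIEF
keypoints, descriptors = orb.detectAndCompute(img, None)

print(len(keypoints))        # number of detected interest points
print(descriptors.shape)     # (n_keypoints, 32): 256-bit binary descriptors

# Binary descriptors are matched with the Hamming distance.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
```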
Discussion: Feature extraction is one of the essential components of an image annotation system. Visual features play a significant role in the identification and recognition of objects and in image content representation. Different types of features (texture, color, SIFT, etc.) have different characteristics. Local features (SIFT, SURF, shape, etc.) describe image patches, whereas global features represent the image as a whole; local features thus give a specification of an image and are used for object recognition, while global features give a generalization of an image and are suitable for object detection. The recent feature extraction methods (SIFT, SURF, HOG, etc.) are compelling object representation methods and have become standard. However, global features such as texture and color have their own significance and will continue to be used in practice. An example of features at different levels is shown in Figure 3.

Figure 3: Image feature representation at different levels of abstraction. (a) Original image taken from [91]. (b) Manually segmented image for object-level feature representation. (c) 100 SIFT keypoints, their size, and orientation. (d) Texture representation of (a) using LBP. (e) BGR color histogram of (a).

4. Classification of Image Annotation Techniques

Humans have an incredible visual interpretation system. The human visual system interprets an image remarkably well while making the association between the subject and the objects present in the image [22], and a human can effortlessly describe the objects and their associated attributes. Computer vision researchers have tried endlessly over the last few decades to make computer systems capable of imitating this human ability [22]. AIA is a step forward in this direction, where the aim is to detect each object present in the image and assign the corresponding tags to describe the image contents. There have been intensive studies in the last two decades (with a few instances before as well), and many techniques have been proposed for image annotation.

The proposed image annotation approaches can be categorized in many ways. We have classified the annotation approaches according to Figure 2 and, following the structure of this flow, present the details of each subpart of Figure 2 in the following sections.
AN followed the structure of this flow and presented the details of each subpart of M 4.1. Model-Based Annotation Approach Model-based approach trains an annotation model from the training set fea- ED tures (either visual, textual or both) to annotate the unknown images. The model-based approach can be broadly categorized in the generative model, dis- PT criminative model, graph-based model and nearest neighbor based models. All of these models and the prominent methods used by these models are presented in figure 5. The primary objective of the model-based approach is to learn an CE annotation model from the training data through visual and textual correlation so that the learned model can accurately annotate the unknown image. AC The visual and textual correlation is a relationship established between visual features and textual metadata (labels, surrounding texts, etc.) between a pair of instances or multiple instances. A relationship among instances is shown in figure 4. Multiple instances in an image may not be independent of each other [92]. There are some dependencies between instances. A strong correlation shows high interdependence among instances while a weak correlation points low dependence among images. 14 ACCEPTED MANUSCRIPT Model Based Approach SVM Topical model Graph based Model NN based Model KNN [9] Random walk [112, 110] Multi-label KNN [105] RWR [110] PLSA [28] Multiclass SVM [35, 54] LDA [33] MKL SVM [14] cLDA [93] Tr-mmLDA [94] HR-SVM [100] OVA-SVM [101] TagProp [106] SMMTC [95] OVA-SSVM [102] Distance metirc learning Mixture model Weighted KNN [106] Adaptive RW [111] 2PKNN [107, 11] Bi-partite graph [17] KML [109] Laplacian SVM [103] CR RKML [108] Two layer MKL [81] FMM [52] EM [96] Multi-layer MKL [81] Adaptive EM [52] Linear logistic regression [21] GMM [37] Hypergraph [116] Diffusion graph [113] Graph learning [114] Sparse graph reconstruction [115] US Kernel logistic regression [104] Relevance model T Discriminative Model IP Generative Model Multiple Bernoulli relevance model [97] CMRM [98, 12] Dual CMRM [99] AN Extended CMRM [12] 4.1.1. Generative Model M Figure 5: Categorization of the model based annotation approach. ED The generative model aims at learning a joint distribution over visual and contextual features so that the learned model can predict the conditional prob- PT ability of tags given the image features [117]. The generative model to work correctly, the model has to capture dependency between visual features and associated labels accurately. The generative model based image annotation is CE presented in [28, 33, 80, 51, 95, 93, 37, 96, 52, 99, 12]. Generative models are usually based on topical models [95, 93], mixture models [96, 80] and relevance AC models [97, 99]. Several works based on topical models have been reported in the literature [28, 33, 94, 51, 95, 93]. A topic model, which uses contextual clues to connect words with similar meaning, can find meaning from the large volume of unlabeled texts. A topical model is a powerful unsupervised tool for analyzing texts documents. A Topical model can predict future texts and consists of topics which are distributed over words and assumes that each document can be described as a mixture of these topics [118]. In one of the early use of the topical model for image annotation, probabilistic latent semantic analysis (PLSA) 15 ACCEPTED MANUSCRIPT is used to annotate Corel images [28]. 
The authors used both textual (the distribution of textual labels) and visual (global and regional) features to annotate an unknown image [28, 12]. The work in [33] uses Latent Dirichlet Allocation (LDA) to reduce the noise present in images for application to web image retrieval. Furthermore, the work in [93] extended LDA to correspondence LDA (cLDA), and later in [94] LDA was extended to topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA) to capture the statistical association between image and text. LDA has also been used to reduce the semantic gap in the annotation of satellite images [51]. In [95], the authors presented an extension of sparse topical coding (STC), which is the non-probabilistic formulation of the probabilistic topic model (PTM); the extended method is called sparse multi-modal topical coding (SMMTC) [95].

The mixture model is a parametric model in which the basic idea is to learn the missing model parameters using expectation maximization methods [119]. Given the image features and tags, the mixture model computes their joint distribution, and for an unknown image the conditional probability of tags is obtained from the visual features of the image. The works in [80, 52, 96, 37] use a mixture model to annotate images. The work in [52] is based on a finite mixture model (FMM) and proposes an adaptive expectation maximization algorithm used for the selection of the optimal model and for model parameter estimation. The authors propose multi-level annotation in which FMM is used at the concept level, while at the content level the segmented regions of the image are classified using a support vector machine (SVM) with an optimal model parameter search scheme. Later, [96] proposed a probabilistic semantic model with hidden layers and used an expectation maximization (EM) based model to determine the probabilities of visual features and words in a concept class. In [37], the authors used a Gaussian mixture model (GMM) for feature extraction and proposed a sparse coding framework for image annotation; the proposed GMM is inspired by discrete cosine transform (DCT) features and uses a subspace learning algorithm to exploit the multi-label information for feature extraction efficiently [37].
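As a toy illustration of the mixture-model idea (not the exact formulation of [52] or [37]), the sketch below fits one Gaussian mixture per tag over the visual features of the training images carrying that tag and scores a new image by the per-tag likelihood; the feature arrays and tag names are assumed placeholders.

```python
# Toy generative annotation: one Gaussian mixture per tag over visual features.
# Feature arrays and tag vocabulary are placeholder assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical training features (random stand-ins for visual descriptors) per tag.
train_features = {
    "sky":   rng.normal(0.0, 1.0, size=(200, 32)),
    "grass": rng.normal(1.0, 1.0, size=(200, 32)),
    "dog":   rng.normal(2.0, 1.0, size=(150, 32)),
}

# Fit p(x | tag) with an EM-trained mixture for every tag.
tag_models = {
    tag: GaussianMixture(n_components=3, covariance_type="diag",
                         random_state=0).fit(feats)
    for tag, feats in train_features.items()
}

def annotate(x, top_k=2):
    """Rank tags for a new feature vector by log-likelihood under each mixture."""
    scores = {tag: gm.score_samples(x[None, :])[0] for tag, gm in tag_models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

print(annotate(rng.normal(1.0, 1.0, size=32)))
```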
The relevance model calculates the joint probability distribution over the tags and the feature vector [98]. The idea is to find the relevance between visual and textual features in order to learn a non-parametric model for image annotation. A non-parametric annotation model uses a vocabulary of blobs to describe an image, where each image is generated using a certain number of blobs. Relevance models can integrate features of arbitrary dimensions and treat visual and textual information as different features to find relationships among them. Moreover, learning the co-occurrence among global, regional, and contextual features helps in tagging an image as a whole entity with semantic meaning. Cross-media relevance model (CMRM) based image annotation has been proposed in [98, 42, 99, 12]. First, Jeon et al. [98] proposed CMRM, which learns a joint distribution over blobs, where blobs are clustered regions that form a vocabulary; the proposed model can be used to rank images as well as to generate a fixed-length annotation from the ranked images. In [97], the multiple Bernoulli relevance model is proposed, which uses visual features to find the correspondence between the feature vector and tags. Later, in [99], the authors extended CMRM to analyze word-to-word relations as well; this version, called dual CMRM, integrates word relations, image retrieval, and web search techniques to solve the annotation problem [99]. In similar work [12], the idea of dual CMRM is used to propose extended CMRM, which can integrate image patches as a bag-of-visual-words representation together with the textual representation.

4.1.2. Discriminative Model

Discriminative models consider each tag as an independent class. Thus, for image annotation a classifier based approach is followed: a separate classifier is trained for each tag using the visual features of the image, and for a test image the trained classifiers predict the tags for that image. The critical issues are the selection of a subset of visual features and the assignment of multiple labels to an image [117]. The high dimensionality of the various visual features raises concerns about their organization and the selection of a potent discriminative subset from these heterogeneous features [117]; different methods have been proposed in the literature for efficient feature selection [120, 121, 26].

As images are labeled with multiple tags, discriminative models can also be seen as multi-label classifiers. Image annotation frameworks based on the discriminative model are presented in [35, 122, 50, 123, 54, 47, 124, 125, 14, 41, 126]. The discriminative model requires a large amount of training data with accurate and complete annotated keywords [101], and generating a fully manually annotated training dataset is an expensive and very time-consuming process; recent research therefore focuses on training the classifier in a semi-supervised manner to alleviate this problem [101, 21, 100, 104, 81].

Most discriminative models are based on the support vector machine (SVM) or its variants [35, 122, 14, 124, 127, 50, 47]; decision trees (DT) are also used for image annotation [126, 41]. The use of multiclass SVM (where several classifiers are trained and their results are combined) is reported in [35, 54] to classify images into one of the predefined classes. Later, in [122], for multiclass annotation, binary SVM (support vector classification) is used for semantic prediction and one-class SVM (support vector regression) for predicting the confidence of the predicted semantic tags. In multiple kernel learning SVM [14], several different kernels (a color histogram kernel, a wavelet filter bank kernel, and an interest point matching kernel) are used to capture certain types of visual properties of the image and to approximate the underlying visual similarity relationships between images more precisely. In [123], the authors proposed a kernel based classifier that can bypass the annotation task for text-based retrieval. Discriminative models are used extensively for medical image annotation [50, 47], where SVM is used as the classifier. A rule-based approach using a decision tree (DT) and rule induction is reported in [126] for the automatic annotation of web images; later, in [41], the decision tree is enhanced to perform both classification and regression to store semantic keywords and their corresponding ranks.
In [54], annotation is performed using the idea of clustering based on visually similar and semantically related images. Artificial neural network (ANN) based discriminative models [125, 128, 129] have also been applied to image annotation. A five-layer convolutional neural network with five hundred thousand neurons is used on the ImageNet 2010 dataset to classify images into 1000 predefined classes [129]. A fusion method for multiple pre-trained deep learning models and their parameter tuning is presented in [128].

In the last decade, the focus of researchers has shifted towards semi-supervised training due to the unavailability of a sufficient number of fully annotated datasets and the associated cost. To achieve SSL, Hessian regularization (HR) is used with SVM in [100]. HR can exploit the intrinsic local geometry of the data distribution, where two data sets may roughly follow a similar structure from the outside but have quite different local structures, by fitting the data within the training domain and predicting data points beyond the boundary of the training data [130]. In [102], the authors extended one-versus-all SVM (OVA-SVM) to one-versus-all SVM with structured output learning capability (OVA-SSVM), which is further extended in [101] to the multi-label situation for image annotation. Although Laplacian regularization (LR) can also be used with SVM [103], the advantage of Hessian regularization is that it is more suitable for the SSL framework. To deal with missing labels, a smoothing function based on similarities measured between images and between classes is proposed, and linear logistic regression with re-weighted least squares is used for classification and label assignment [21]. Inspired by manifold regularization [103], an annotation framework applicable to the multiclass problem using manifold regularized kernel logistic regression (KLR) with a smooth loss function is proposed in [104]; here, for SSL, Laplacian regularization is used, which exploits the intrinsic structure of the data distribution [104]. Later, in [81], the authors proposed an extension of two-layer multiple kernel learning (MKL) for SSL. MKL learns the best linear combination of elementary kernels that fits a given data distribution; the proposed method extends standard (two-layer) MKL to multi-layer MKL (deep kernel learning), and the learned kernel is used with SVM (as well as its Laplacian variants) [81].
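To make the "one classifier per tag" idea concrete, here is a minimal sketch (an illustrative assumption, not the setup of any specific cited work) that trains a linear SVM per tag in a one-versus-rest fashion with scikit-learn; the features and the tag vocabulary are synthetic placeholders.

```python
# Minimal discriminative (one-vs-rest SVM) multi-label annotation sketch.
# Synthetic features and tags stand in for real visual descriptors.
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 64))                 # visual feature vectors
tag_sets = [["sky", "airplane"], ["grass", "dog"], ["sky"]] * 100

mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(tag_sets)                # binary indicator matrix

# One binary SVM per tag; each decides presence/absence of its tag.
clf = OneVsRestClassifier(LinearSVC(C=1.0, max_iter=5000))
clf.fit(X_train, Y_train)

X_test = rng.normal(size=(2, 64))
pred = mlb.inverse_transform(clf.predict(X_test))    # back to tag sets
print(pred)
```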
4.1.3. Nearest Neighbor based Model

Nearest neighbor (NN) based models primarily focus on selecting similar neighbors and then propagating their labels to the test image. The similar neighbors can be defined by image-to-image similarity (visual similarity), image-to-label similarity, or both, and a distance metric is used for selecting them. The quality of the distance metric plays a vital role in choosing relevant and appropriate neighbors, and hence in the overall performance of the NN method. NN is a non-parametric classifier that can work directly on the data without any learned parameters [131].

Various image annotation works using NN based models are presented in [9, 106, 132, 133, 134, 108, 135, 136, 137, 11, 10, 40]. Image annotation by finding the neighborhood of the test image based on color and texture features is presented in [9], where the authors proposed a classical label transfer scheme for the query image: the nearest neighbors of the query image are computed by combining basic distance metrics, and then, by ranking the keywords according to their frequency, a fixed number of keywords (n) is assigned to the test image [9]. In 2009, a novel NN based model for AIA called tag propagation (TagProp) was designed [106]. The authors proposed tag propagation based on a weighted NN model in which the weight of each neighbor is assigned according to its rank or distance [106]; a comparative analysis of the variants of TagProp and their effectiveness is given in [133]. Learning the distance metric is key to NN models, and various methods for learning distances have been proposed in the literature [132, 138, 134, 107]. In [132, 138], learning a Mahalanobis distance matrix is proposed to design the large margin nearest neighbor (LMNN) classifier, and LMNN is later extended to multi-label annotation in [107]. In [134], a label-specific distance metric learning method is proposed to distinguish the multiple labels of an image, and neighbors are fetched based on the distance for each label; this approach helps in reducing false positive and false negative labels.

The main problem with the NN model is that it requires a completely manually annotated training set; also, each label should have a sufficient number of images, with roughly the same number of images per label. A novel algorithm called two-pass KNN (2PKNN) is proposed in [107, 11] to alleviate the problems of class imbalance and weak labeling. 2PKNN uses two types of similarity in two passes: in the first pass image-to-label similarity is used, and in the second pass image-to-image similarity [107]. A combined feature set of TagProp [106] features, convolutional neural network (CNN) features, Fisher vectors, and vectors of locally aggregated descriptors (VLAD), with the combination created using canonical correlation analysis (CCA) and kernel CCA [109], is used with 2PKNN for image annotation [11]. A KCCA [109] based image annotation method is presented in [137], where KCCA finds the correlation between visual and textual features and an NN based model is then used for annotation. A comparison between pre-trained CNNs (AlexNet and VGG16) and 15 different manual features for the 2PKNN model is presented in [40]. In [108], an extension of kernel metric learning (KML) [109] called robust kernel metric learning (RKML), a regression-based distance calculation technique, is used to find the visually similar neighbors of an image, and majority-based ranking is then used to propagate the labels to the query image. A flexible neighborhood selection strategy is proposed in [136], where the selected neighbors lie within a predefined range (rather than being a fixed number) and are tag dependent, which gives the flexibility to choose only relevant neighbors. A multi-label KNN, introduced in [105] and utilized in [10] for multi-label annotation, combines various features using a feature fusion technique and then uses multi-label KNN to annotate images automatically.
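The following toy sketch (an assumption in the spirit of the weighted-KNN label transfer described above, not the exact TagProp or 2PKNN formulation) ranks tags for a query image by distance-weighted votes from its K nearest training images; the features, tag lists, and weighting scheme are placeholders.

```python
# Toy distance-weighted KNN label transfer for a query image.
# Features, tag lists, and the weighting scheme are illustrative assumptions.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
train_X = rng.normal(size=(500, 64))                  # training visual features
train_tags = [["sky", "airplane"] if i % 2 else ["grass", "dog"]
              for i in range(500)]                    # placeholder labels

def annotate_knn(query, k=10, n_tags=3):
    """Transfer the most strongly voted tags from the k nearest neighbors."""
    dists = np.linalg.norm(train_X - query, axis=1)   # Euclidean distances
    neighbors = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for idx in neighbors:
        weight = 1.0 / (1.0 + dists[idx])             # closer neighbors count more
        for tag in train_tags[idx]:
            votes[tag] += weight
    return sorted(votes, key=votes.get, reverse=True)[:n_tags]

print(annotate_knn(rng.normal(size=64)))
```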
4.1.4. Graph based Model

The basic idea behind the graph-based model is to design a graph from the visual and textual features in such a way that the correlation between visual and textual features is represented by vertices and edges and their dependencies can be explained. The data points (the visual features of images) and the labels can be represented as separate subgraphs, with edges representing the correlation between the subgraphs [139]. The semantic correlation between labels can be represented using interconnected nodes, which helps in multi-label image annotation. The graphical model can also be used to find the correlation among labels; in that case, vertices represent labels and edges represent the correlation among them. Figure 6 shows an example of the representation of correlation among visual and textual features using a graphical model.

Figure 6: Example of a graphical model. The label subgraph and the data subgraph represent label-to-label correlation and visual correlation, respectively. The correlations between labels and data instances are represented by the edges connecting the label and data subgraphs.

The graph-based model can be used in both supervised and semi-supervised frameworks to model the intrinsic structure of image data from both labeled and unlabeled images [23]. Significant works on graph-based models are presented in [112, 17, 114, 116, 140, 139, 115, 111, 141, 113, 79, 92, 23, 25]. When the results of an image annotation method are unsatisfactory, refinement of the results may improve the overall accuracy. A random walk based refinement process, in which the obtained candidate tags are re-ranked, is presented in [112, 110]. The authors used a random walk with restart algorithm, a probability-based graphical model that either selects a particular edge or jumps to another node with some probability, to re-rank the original candidate tags; the top-ranked tags are then chosen as the final annotation of the image [112]. Further, in [111], an adaptive random walk method is proposed, where the choice of the next route is determined not only by the probability but also by a confidence factor based on the number of connected neighbors. A bipartite graph based representation of candidate annotations, followed by a reinforcement algorithm on the graph to re-rank the candidate tags, is presented in [17]; the resulting top-ranked tags are used as the final annotation. In a further application of bipartite graphs and random walks, the work in [139] constructs two subgraphs (a data graph and a label graph) as parts of a graph called the bi-relation graph (BG) and then, using a random walk with restart on the BG, computes a class-to-class asymmetric relationship that improves label prediction.

A graph learning based annotation method is presented in [114], where image-based graph learning based on the nearest spanning chain (NSC) is used to capture the similarity and the structural data distribution information; the image-based graph learning produces candidate tags, which are further refined using word-based graph learning [114]. Image annotation can be considered a kind of multi-label classification, and a graph-based multi-label classification approach is presented in [116]: a hypergraph is constructed in which each instance is represented by a vertex, and the corresponding labels are connected with edges to describe the correlation between the image and its various labels.
In [140], the authors proposed a weighted-graph approach in the semi-supervised framework for image annotation, where the weights are assigned based on neighborhood construction. A sparse graph reconstruction method using only the K nearest neighbors is proposed in [115]: the method uses only K neighbors (one-vs-KNN) instead of all the remaining images (one-vs-all) to construct a sparse graph modeling the relationships among images and their concepts; it is implemented in the semi-supervised framework and deals efficiently with noisy tags. A hybrid graph learning and KNN based annotation method [141] uses the K nearest neighbors of the test image to propagate labels to the query image; it uses the image-label similarity as a graph weight, and a final list of tags is generated from the image-to-label distances and assigned to the query image. The method can be implemented efficiently even when the number of labels is substantial [141]. A technique based on records of common interest between users of a social networking site is proposed for AIA in [113]. The method, called social diffusion analysis, constructs a diffusion graph from common social diffusion records of users to represent their common interests; by utilizing this diffusion graph, a learning-to-rank technique is used to annotate images.

Discussion: The focus of model-based learning is to train a model from training data so that the trained model can tag unknown images. A shift from supervised to semi-supervised models can be noticed in the last decade. The progress made in SSL based image annotation models is remarkable, but the performance of real-time semi-supervised annotation models is not yet up to the mark; nevertheless, these models have paved the way and given hope that a real-time annotation model in the semi-supervised framework is achievable. The simplicity and performance of the nearest neighbor model cannot be ignored while designing a model-based annotation system. At the same time, the generative model is at its best in representing the correlation among images and labels, and the discriminative power of a classifier can play a very prominent role in the design of an annotation system. For the refinement of candidate tags, graphical models can be used. Hybrid models combining generative and discriminative [142], generative and nearest neighbor [135], and graph learning and nearest neighbor [141] approaches have shown good results. Whatever kind of model is used, future research should focus on designing semi-supervised and unsupervised learning based annotation models with satisfactory accuracy that can be implemented in real time.
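As an illustration of graph-based tag refinement, the sketch below (an assumed toy example, not the formulation of [112] or [110]) re-ranks candidate tags with a random walk with restart over a small tag co-occurrence graph; the graph weights, restart probability, and initial scores are placeholders.

```python
# Toy random walk with restart (RWR) for re-ranking candidate tags.
# The tag graph, restart probability, and initial scores are assumptions.
import numpy as np

tags = ["sky", "clouds", "airplane", "grass", "dog"]
# Symmetric tag co-occurrence weights (hypothetical).
W = np.array([[0, 3, 2, 1, 0],
              [3, 0, 2, 0, 0],
              [2, 2, 0, 1, 0],
              [1, 0, 1, 0, 2],
              [0, 0, 0, 2, 0]], dtype=float)
P = W / W.sum(axis=0, keepdims=True)       # column-normalized transition matrix

restart = 0.3
r0 = np.array([0.5, 0.1, 0.4, 0.0, 0.0])   # initial candidate-tag scores
r0 = r0 / r0.sum()

r = r0.copy()
for _ in range(100):                        # iterate toward the stationary scores
    r = (1 - restart) * P @ r + restart * r0

reranked = [tags[i] for i in np.argsort(-r)]
print(reranked)                             # refined tag ranking
```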
4.2. Learning Based Annotation Approach

Visual features are not sufficient to annotate all the objects present in an image; indeed, visual elements alone do not represent the contents of the image in their entirety. To label the contents of an image accurately, the model should learn features from the various available sources. The correlation among labels is one such source of features, and exploiting tag correlation also helps in dealing with noisy and incomplete tags. Learning based models primarily focus on learning a model from multiple sources in the training set; most of them exploit the input feature space and are able to deal with incomplete tags. Learning from various sources (visual features, textual features) improves the generalization capability of the model. When a model explores multiple views of features and stores the features of each view in a separate feature vector, these feature vectors need to be concatenated systematically so that the obtained single feature vector is physically meaningful and has more discriminative power than features from only one view. The learning based models can be classified into four categories: multi-label learning, multi-instance multi-label learning, multi-view learning, and distance metric learning (Figure 7). Figure 7 also shows the prominent methods used by learning based models.

Figure 7: Categorization of the learning-based annotation approach into multi-label learning, multi-instance multi-label learning, multi-view learning, and distance metric learning, with the prominent methods of each category (e.g., MIMLBoost, MIMLSVM, D-MIMLSVM, MIML-KNN, MIMLfast, CMIML; m-SNE, HD-MSL; KML, robust KML, LMNN, multi-label LMNN).

4.2.1. Multi-label Learning

Most images contain more than one object, so a single label does not describe an image's content properly. Multi-label learning (MLL) means the assignment of multiple tags to an instance. A model based on multi-label learning deals with multiple classes of labels and assigns multiple labels to an image based on its contents [162]. Multi-class learning differs from multi-label learning: in the former, only one of many classes is assigned to an instance, whereas in the latter, multiple classes can be assigned to an instance. MLL usually assumes that the training dataset is completely labeled [163]. However, recent progress in semi-supervised MLL is remarkable, and various methods have been proposed for semi-supervised MLL [162, 164, 165, 166] in which an MLL model is trained with missing labels. Images downloaded from Flickr or other internet sources are not fully labeled, and semi-supervised learning methods are extremely desirable due to the unavailability of large, fully manually annotated datasets. A detailed review of MLL methods can be found in [167]. MLL techniques can be implemented using discriminative models [157, 158], nearest neighbor based models [105, 10], graph-based models [159, 160], and various other techniques [20, 37, 53, 168, 27]. The discriminative models [157, 158] try to learn a classifier from incomplete training data.
Such training data has incomplete label information along with some noisy labels. In [21], the missing labels are filled in automatically using a smoothing function based on similarities measured between images and between their labels. The smoothing function assumes smoothness at two levels: the image level and the class level. Image-level smoothness assumes that if two images have similar features, then the labels represented by these two images are close; class-level smoothness assumes that if two labels have close semantic meanings, then there exist similar instantiations. In [157], an image is represented as overlapping windows at different scales, and then, using parameter estimation and a cutting-plane algorithm, all the objects are labeled simultaneously. A machine learning technique for handling densely correlated labels without incorporating one-to-one correlation is proposed in [158], which can deal with different training and testing label correlations.

Graph based models for the multi-label framework, called multi-label Gaussian random field (ML-GRF) and multi-label local and global consistency (ML-LGC), inspired by the single-label Gaussian random field (GRF) [169] and single-label local and global consistency (LGC) [170] respectively, are proposed in [159]; they exploit label correlations and label consistency over the graph. A bidirectional fully connected graph is introduced to tackle the MLL problem in [160]; the graph is based on a generalization of the conditional dependency network, which calculates the conditional joint distribution over output labels given the input features. Although nearest neighbor based binary classifiers have proved their effectiveness for binary classification, it is their multi-label version [105] that paved the way for MLL. Nearest neighbor based MLL using multi-label KNN [105, 10] is easy to implement with low complexity and provides very effective classification accuracy with various feature fusion techniques [10]. A subspace learning method to exploit the multi-label information together with a sparse coding based image annotation method is proposed in [37]. An extension of sparse graph based SSL [115], called the multi-label sparse graph based SSL framework, is proposed in [20] with an optimal sample selection strategy to incorporate semantic correlation for multi-label annotation. A regression based structured feature selection method, which selects a subset of the extracted features, is proposed in [117]; it uses tree-structured grouping sparsity to represent the hierarchical correlation among output labels and to speed up the annotation process. In [53], dictionary learning is used in the input feature space to enable MLL: the authors proposed multi-label dictionary learning (MLDC), which exploits the label information to represent an identical label set as a cluster, with partially identical label sets connected [53]. An MLL model in a semi-supervised framework that can efficiently deal with noisy tags is proposed in [168]; it uses the graph Laplacian for unlabeled images and trace-norm regularization to reduce the model complexity and to find the co-occurrence of labels and their dependencies. Finding and utilizing the tag correlation is one of the leading aspects of the MLL framework.
In [27], the authors exploit the tag correlations, and missing labels are dealt with using tag ranking under the regularization of tag correlation and sample similarity.

4.2.2. Multi-instance Multi-label Learning

Multi-instance learning (MIL) is a way of dealing with objects that are described by numerous instances. The training set in MIL usually has incomplete labels. Multi-instance means that an object has various instances with a single label: in complex applications, an object has several instances, hence multiple feature vectors, and only one of these vectors needs to represent that object; this setting is called MIL [171]. The ambiguity associated with such a training dataset, where at least one of the instances present in the bag represents the object, can be resolved efficiently using MIL [172]. The presence of at least one such instance in the bag means that the bag is positive; otherwise the bag is labeled negative [172]. On the other hand, MLL deals with the situation where one object is described by more than one class label [162]. In MLL, an instance can be classified into more than one class, which means the classes are not mutually exclusive. Multi-instance multi-label (MIML) learning is the combination of MIL and MLL. MIML is essentially a supervised learning problem with an ambiguous training set where each object in the training set has multiple instances and belongs to multiple classes [151, 161]. One of the early works in the MIML framework comprises the MIMLBoost and MIMLSVM methods [161] for scene classification.

Single-instance single-label supervised learning is a degenerated version of multi-instance (single-label) or (single-instance) multi-label learning, which in turn are degenerated versions of MIML learning [161]. In this process of degeneration, some of the important information in the training set may be lost. To tackle this problem, an extended version of [161], called direct MIMLSVM (D-MIMLSVM), is proposed in [151]; it can also handle class imbalance by utilizing either MIL or MLL as a bridge. Later, in [154], a nearest neighbor based MIML method (MIML-KNN) is proposed to deal with the degeneration problem; it utilizes the K nearest neighbors and their citers. Finding the co-occurrence among labels, and between instances and labels, is key to MIML learning. To model the relationship between instances and labels, and among labels, an efficient MIML method using a Gaussian process prior is proposed in [155]. An annotation method that uses labeled data only at the bag level and optimizes a regularized rank-loss objective, called the rank-loss support instance machine, is proposed in [152]. An LDA based MIML learning method can be seen in [153], where the proposed method exploits both visual and textual features by combining MIML learning with a generative model. Finding the correlation between labels, and between instances and labels, simultaneously is a time consuming process; to reduce the complexity of the MIML system, MIMLfast [156] can be used, which is much faster than existing MIML learning methods.
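To make the notion of a bag concrete, the following is a minimal illustrative sketch (not taken from any of the cited methods; the class name MIMLBag and the random region descriptors are assumptions made purely for illustration) of how an image can be represented in the MIML setting as a bag of instance feature vectors with an associated label set:

```python
from dataclasses import dataclass, field
from typing import List, Set
import numpy as np

@dataclass
class MIMLBag:
    """Toy MIML representation: an image is a bag of instance feature vectors
    (e.g. one vector per region or superpixel) associated with a set of labels.
    Names and dimensions here are illustrative, not taken from any cited method."""
    instances: List[np.ndarray] = field(default_factory=list)
    labels: Set[str] = field(default_factory=set)

# an image segmented into three regions, annotated with two labels
bag = MIMLBag(
    instances=[np.random.rand(16) for _ in range(3)],  # 3 region descriptors
    labels={"beach", "people"},
)
print(len(bag.instances), bag.labels)
```

A MIML learner then reasons jointly over the instances in each bag and the label set attached to the bag.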
In [22], given image-level labels and object attributes, the image is first over-segmented into superpixels, and the association of objects and their attributes with the segmented regions is performed using a Bayesian model built on a weakly supervised, hierarchical non-parametric Indian Buffet Process (IBP). The proposed method, called the weakly supervised Markov Random Field Stacked Indian Buffet Process (WS-MRF-SIBP), can be deemed an MIML learning method where each superpixel is considered an instance and each image a bag.

A multi-instance multi-label learning framework based image annotation method, called context aware MIML (CMIML) [92], works in three stages. In stage one, multiple graph structures are constructed for each bag to model the instance context. In stage two, a multi-label classifier with a kernel is designed by mapping the graphs into a Reproducing Kernel Hilbert Space. In the third stage, a test image is annotated using the outputs of stages one and two. Recently, a MIML framework based medical image annotation method [49] utilizes CNN features: a CNN is used to extract region based features, and the authors then use a sparse Bayesian MIML framework which employs a basic learner and learns the weights using a relevance vector machine (RVM).

4.2.3. Multi-view Learning

An object has various characteristics and can be represented using multiple feature sets. These feature sets are obtained from different views (texture, shape, color, etc.). The different feature sets are usually independent but complementary in nature, and one view alone is not sufficient to classify an object. The combination of all feature sets gives much more discriminative power to characterize an object. When we combine all the feature sets directly in a vector form (without following any systematic concatenation rule), the concatenation is not meaningful, as each feature set has its own statistical properties. Multi-view learning is the branch of machine learning concerned with the systematic concatenation of features from different views (multiple distinct feature sets) in such a way that the obtained feature vector is physically meaningful. In multi-view learning, heterogeneous features from different views are integrated to exploit the complementary nature of the different views in the training set. The concatenation of features from two or more views can be implemented using various methods [173, 145, 143, 144, 147, 174, 175, 130]. The methods explored in [173] are primarily based on machine learning techniques and cover supervised and semi-supervised multi-view learning. The complementary property of different views can be explored using [145] with low-dimensional embedding, where the integration of features from different views has an almost smooth distribution over all views. A probabilistic multi-view learning framework (m-SNE) [143] uses pairwise distances to obtain a probability distribution by learning the optimal combination coefficients over all views. Later, in [147], this pairwise distance is replaced with a high-order distance using a hypergraph, where each data sample is represented by a vertex, and a centroid and its K nearest neighbors are used to connect a vertex with other vertices via hyperedges. The resulting probabilistic method (HD-MSL) [147] produced excellent classification accuracy.
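As a toy illustration of why views must be combined systematically rather than blindly stacked, the sketch below z-score normalizes each view before weighting and concatenating it. The function name fuse_views, the per-view weights and the normalization scheme are assumptions for illustration only and do not correspond to any particular method surveyed above:

```python
import numpy as np

def fuse_views(views, weights=None):
    """Naive multi-view fusion: z-score normalize each view, scale it by a
    per-view weight, and concatenate into a single feature vector.

    views   -- list of 2-D arrays, one (n_samples, d_v) matrix per view
    weights -- optional per-view combination coefficients (default: uniform)
    """
    if weights is None:
        weights = np.ones(len(views)) / len(views)
    fused = []
    for w, v in zip(weights, views):
        mu, sigma = v.mean(axis=0), v.std(axis=0) + 1e-8
        fused.append(w * (v - mu) / sigma)   # put every view on a comparable scale
    return np.hstack(fused)                  # (n_samples, sum of view dimensions)

# toy usage: 100 images described by a colour view and a texture view
color   = np.random.rand(100, 64)
texture = np.random.rand(100, 32)
X = fuse_views([color, texture], weights=[0.6, 0.4])  # shape (100, 96)
```

Methods such as m-SNE or HD-MSL replace the fixed weights above with combination coefficients learned from the data.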
A detailed review of different multi-view learning approaches can be found in [176, 177]; in [176], the various subparts of multi-view learning are discussed in detail. Manifold regularization based multi-view learning approaches [146, 175] explore the intrinsic local geometry of the different views and produce a low-dimensional embedding. Inspired by vector-valued manifold regularization, a multi-view vector-valued manifold regularization (MV3MR) method is proposed in [146]; it assumes that different views have different weights and learns combination coefficients to integrate all the views for multi-label image classification. Semi-supervised multi-label learning methods [175, 174] can also exploit the unlabeled set to improve annotation performance. In [175], the correlations among labels, multiple features and data distributions are explored simultaneously using manifold regularization; the proposed method is called manifold regularized multi-view feature selection (MRMVFS). A bipartite ranking framework to handle class imbalance in a semi-supervised multi-view learning based annotation method [174] learns view-specific rankers from the labeled data and then improves these rankers iteratively. Multi-view learning can also be implemented using Hessian regularization [130, 178], which has the advantage of providing an unbiased classification function. The multi-view Hessian regularized (mHR) features can be used with any classifier for the classification (used with SVM in [130]). In [178], Hessian regularization with discriminative sparse coding, which has a maximum margin for class separability, is proposed for multi-view learning based AIA. In [23], multiple features from different views are concatenated and used with a graph based semi-supervised annotation model. To deal with the large volume of storage space required for a large number of images, the authors generated prototypes in the feature space and the concept space using a clustering algorithm; then, using a feature fusion method, the best subset of features in both spaces is chosen, and for any test image its nearest cluster is selected in both spaces.

4.2.4. Distance Metric Learning

When image annotation is performed based on visual features, the problem of the semantic gap arises. Another approach to annotation is retrieval based annotation: a set of similar images is retrieved, and an unknown image is tagged based on the tags of the retrieved images. To find similar images in a training set, the distance between images has to be measured, and learning these distances among images is called distance metric learning. The accuracy and efficiency of the learned distance metric play an important role in neighbor selection, i.e., the selection of similar images. If the distance metric is inaccurate, the selected "similar" images may not actually be similar, which leads to the assignment of irrelevant tags to a new image. Various methods for distance metric learning have been proposed [148, 132, 107, 134, 106, 150, 149, 108, 153]. The distance among images is learned based on visual similarity and label-specific similarity.

When the dataset is provided with contextual constraints, a linear data transformation function (discriminative component analysis [148]) is learned to optimize the Mahalanobis distance metric.
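For concreteness, a Mahalanobis metric parameterized by a linear transform L (so that M = LᵀL) can be evaluated as sketched below. This is only a minimal illustration with an arbitrary, un-learned L, whereas methods such as DCA or LMNN would supply L by optimizing it on the training data:

```python
import numpy as np

def mahalanobis_distance(x, y, L):
    """Distance under a linear transform L (so M = L^T L is positive semidefinite):
    d(x, y) = sqrt((x - y)^T M (x - y)) = ||L x - L y||_2.
    L would normally come from a metric-learning method such as DCA or LMNN;
    here it is just a given matrix."""
    diff = L @ (x - y)
    return float(np.sqrt(diff @ diff))

# toy usage with an arbitrary (not learned) transform
rng = np.random.default_rng(0)
L = rng.standard_normal((5, 5))
x, y = rng.standard_normal(5), rng.standard_normal(5)
print(mahalanobis_distance(x, y, L))
```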
For a non-linear distance metric, discriminative component analysis (DCA) is used with a kernel, and this extended method is called kernel DCA (KDCA) [148]. A linear metric is often unable to accurately capture the complexity of the task (multimodal data, non-linear class boundaries), whereas a non-linear metric can represent complex multi-dimensional data and capture non-linear relationships between data instances using the contextual information. For a nearest neighbor classifier, the accurate calculation of distances among images or tags plays a significant role in the overall accuracy of the method. A classical method for learning a Mahalanobis distance metric for the KNN classifier is proposed in [132] and called large margin nearest neighbor (LMNN). LMNN uses a labeled training set; the target nearest neighbors are initially determined using the Euclidean distance and are iteratively updated during the learning process. Later, in [107], LMNN is extended to the multi-label problem and used with the proposed two-pass KNN (2PKNN) for AIA. Inspired by linear discriminant metric learning, which is based on the Mahalanobis distance, a label-specific distance metric, where a distance is calculated for each specific label, is proposed in [134]; the method is used with a weighted KNN for multi-label annotation and, as claimed, reduces false positive and false negative labels. A weighted nearest neighbor based model (TagProp) [106] uses either rank or distance to update the weights; when distance is used, a distance metric is employed to calculate the weights by directly maximizing the log-likelihood with a gradient algorithm. An inductive metric learning method, inspired by [132], and a transductive metric learning method, where both textual and visual features are exploited to learn the distance, are unified to learn the distance metric in [150]; this combined technique is called unified distance metric learning (UDML). In [138], two types of classifiers are used with distance metric learning: LMNN [132] is used with KNN, and multiclass logistic regression based distance metric learning is used with the nearest class mean (NCM). For multi-label classification, a class-to-image (C2I) distance learning method [149] uses a large margin constraint for the error term and L1-norm regularization in the objective function. While learning the distance metric, if a linear transformation is used (as in linear Mahalanobis distance metric learning [132]), the non-linear relations among images cannot be represented; to deal with non-linear relationships, a kernel is usually used with the linear function. A kernel metric learning (KML) based distance calculation technique is proposed in [108], where the authors propose a robust KML (RKML) to measure the distance; RKML is based on regression to deal with the high-dimensionality problem of KML. An AIA approach using probability density function based distance metric learning (PDR) is proposed in [179]; the proposed distance learning technique can also deal with some missing labels.

Discussion: The ease, efficiency and good accuracy of learning based image annotation techniques have paved the way for future research in this direction. On the one hand, multi-label learning is the desirable facet of image annotation; on the other hand, multi-view learning provides more discriminative feature selection capability.
The power of multi-view learning can be best utilized if we incorporate multi-view features with MIML and distance learning techniques. The combination of visual and textual features in the nearest neighbor method is easy to implement and produces high accuracy with excellent efficiency. Although distance metric learning based annotation has not been explored to its full extent, we believe it can produce excellent results for annotation. We hope that a lot more work in AIA using learning based methods will appear in the near future.

4.3. Image Annotation Based on Tag Length

When we consider the length of the tag for AIA, either a fixed number of tags or a variable number of tags can be associated with images. When an image is annotated, it is assigned labels based on its contents. A fixed number of tags refers to the assignment of n labels to every image: once n is fixed, it remains constant for all images. Here, the number of objects present in the image does not have much significance, although the contents of the image may still be utilized for annotation. This type of annotation may result in a higher false positive rate, especially when the number of objects present in the image is less than n. In reality, an image may contain many objects, so while labeling such an image, labels have to be assigned to all the relevant objects; labeling all the relevant objects may result in a variable number of labels for different images. While it is comparatively easy to implement image annotation with a fixed number of tags, variable length tags represent the realistic contents of an image. Both fixed length tags and variable length tags are part of multi-label annotation as long as n > 1.

4.3.1. Fixed Length Tags

The conventional annotation approach follows fixed length annotation, where the number of tags is fixed for all test images irrespective of their content. In the fixed length tag annotation approach, a set of candidate labels is obtained, and the final list of tags is fixed from the candidate labels. Once the candidate labels are obtained, selecting the final list of annotation tags is essentially a label transfer problem. Label transfer schemes deal with the decision making process in which a list of n keywords is selected as the final tags for a query image. A probability value is assigned to each candidate label, and all the candidate labels are ranked according to their probabilities; this probability indicates the confidence score of the candidate annotation. Ranking based annotation approaches revolve around the following methods: (i) the candidate annotations are ranked according to their probabilities, and the top n labels with the highest probability are selected as the final annotation; (ii) a threshold based approach can be followed, where all the candidate labels whose probability is greater than the threshold are selected as final labels. When the top n labels are selected as the final annotation, the annotation approach is easy to evaluate. The value of n is variable, but once it is fixed, the same number of labels is selected for all images. Selecting the appropriate value of n is a point of concern: if n is too small, the recall of the annotation will be degraded; if n is too large, the precision of the annotation method will fall. The size of n is usually fixed at 5 [12, 97, 180, 27], or sometimes it may be variable [181, 108, 113, 15].
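The two ranking-based selection schemes described above can be sketched as follows; the function transfer_labels and the toy confidence scores are illustrative assumptions, not part of any cited method:

```python
def transfer_labels(scores, n=None, threshold=0.5):
    """Label-transfer sketch covering the two ranking strategies described
    above: keep the top-n candidates (fixed length), or keep every candidate
    whose confidence exceeds a threshold (variable length).

    scores -- dict mapping candidate label -> confidence score
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if n is not None:                          # strategy (i): fixed-length top-n
        return [label for label, _ in ranked[:n]]
    return [label for label, s in ranked       # strategy (ii): threshold-based
            if s > threshold]

candidates = {"sky": 0.91, "tree": 0.84, "grass": 0.77, "water": 0.62,
              "dog": 0.15, "car": 0.08}
print(transfer_labels(candidates, n=5))            # always five labels
print(transfer_labels(candidates, threshold=0.5))  # as many labels as pass the cut
```

With n fixed, every image receives the same number of labels, whereas the threshold variant naturally yields a variable number of tags per image.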
In the threshold based label transfer approach, each candidate label is assigned a probability, and a threshold acts as a boundary line: all labels with a probability greater than the threshold are selected as the final annotation. The threshold based approach may produce a different number of tags for different images. Selecting an appropriate value for the threshold is again a matter of concern; the optimal value can be decided by an expert or on expert advice, as it may vary from image to image. Various works have used the threshold based approach [112, 110, 17]. A modified threshold based approach is proposed in [17], where the number of labels is decided according to the number of final labels in the candidate set. A greedy approach for label transfer is proposed in [9]: the authors propose a two-stage label transfer algorithm in which neighbors are arranged according to their distance from the query image; if possible, the method first tries to create the final label set from the first neighbor and, if not, it uses the other neighbors as well. The labels of the neighbors are ranked based on their frequency in the training set [9].

The popularity of fixed length tag annotation approaches is a result of their easy and simple evaluation criteria. Also, the availability of various state-of-the-art methods for fixed length annotation, and the lack of baselines and state-of-the-art methods for arbitrary length annotation, are reasons behind the inclination towards the fixed length annotation approach; the state-of-the-art methods help in the comparative evaluation of proposed methods.

4.3.2. Variable Length Tags

All the relevant objects present in the image should be assigned a label; thus, the number of labels is determined by the contents of the image. Annotating an image with all the relevant labels is the realistic way of image annotation. Predicting the proper number of tags is the foremost task of any variable length tag annotation approach, so that the appropriate tags can be assigned from a list of candidates. Variable length tags can be predicted using a slightly modified threshold based label transfer approach [9, 17]: when a threshold value is fixed, all labels with a confidence score greater than the threshold are selected as final labels, and the number of such labels may vary from image to image, resulting in variable length tags. However, finding the optimal threshold remains a cause for concern. Recently, deep learning techniques [182, 183, 184] have been used for variable length label prediction. The basic idea behind deep learning based arbitrary length tag annotation is to use a neural model that can automatically generate a list of labels. Inspired by image caption generation methods, a recurrent neural network (RNN) based variable length annotation method is proposed in [183]. The proposed RNN model uses a long short-term memory network (LSTM) as its sub-module, which has been used extensively for image caption generation [185]. As an LSTM requires ordered sentences in the training set for caption generation, the tags in the training set of the annotation dataset have to be ordered; four methods to order the tags are suggested and implemented in [183]. The authors use tag-to-tag correlation, where the prediction of the next keyword is informed by the immediately previous prediction.
CNN-RNN based multi-label classification [182] extracts visual features using a CNN, and the semantic label dependency is obtained using an RNN with LSTM units; a multilayer perceptron is then used to predict the labels. Later, in [184], an extension of [182, 183] is presented to make the learning more stable and faster: a semantically regularized layer is placed between the CNN and the RNN, which regularizes the CNN and also provides the interface between the two networks to produce a more accurate result [184].

Discussion: Fixed length annotation methods have tried to solve the problems of class imbalance, noisy datasets, incomplete labels, etc. by utilizing the correlation among labels and visual features. Previous multi-label classification based methods use one-vs-all classifiers, hence multiple classifiers have to be trained for multi-label classification. Variable length tag based annotation is a realistic way of label assignment, as it allows labeling all the relevant contents present in the image. However, arbitrary length tag annotation methods have not been explored in detail. The success of [183, 182] gives hope that deep learning based models can annotate images with variable length tags, but this line of work is still in its early years. Also, the time taken to train a deep learning based model is very high, although the increasing capabilities of modern computer systems have made it easier to train deep models. We believe that much more work on variable length tag annotation will be carried out in the near future.

4.4. Image Annotation Based on Training Dataset

To design an annotation model, the first thing to consider is the training dataset; its type has a significant influence on the final annotation model. Datasets can be categorized into three types. (i) The training set has complete labels: all the images in the training set are manually annotated and contain a complete set of tags. Training with this type of dataset is known as supervised learning. Obtaining a fully manually annotated dataset is expensive and time consuming, but once it is obtained, a very efficient model can be designed using supervised learning. (ii) The training set has incomplete labels: these datasets are either manually annotated or annotated by users on social sites, but the images are not fully annotated. As the images are labeled by users, the training images may contain noisy tags or an incomplete set of labels. Training with this kind of dataset is known as semi-supervised, weakly supervised or active learning. SSL based annotation models are very appealing, as they can adapt themselves to very large and expanding datasets. (iii) The training dataset is not annotated at all: the images are not labeled with any tags, but each image is equipped with some metadata such as GPS data, URL, etc. In such cases, the candidate annotation tags have to be mined from the metadata associated with the images. Training with this kind of dataset is known as unsupervised learning. It is difficult to design an unsupervised learning based annotation model with high efficiency and accuracy.

The representational accuracy of the training dataset has a direct influence on the performance of any learning model [186]. If the dataset contains any noise, it has to be removed before training a model.
The refinement of noisy or incomplete datasets leads to active learning based approaches.

4.4.1. Supervised Learning

When the training dataset is provided with corresponding output labels, designing the model is somewhat straightforward. The supervised training dataset is given as {(X_1, Y_1), (X_2, Y_2), ..., (X_m, Y_m)}, where X_i is the input set and Y_i is the corresponding output set. In the multiclass scenario, the ith output label set is Y_i = {y_i^1, y_i^2, ..., y_i^P}, where P is the number of output classes for the ith input. The input data can also have multiple features, represented as X_i = {x_i^1, x_i^2, ..., x_i^N}, where N is the number of features for the ith input. For image annotation, an image may be annotated with multiple keywords, hence a single training item is given as (x_i^1, x_i^2, ..., x_i^N, y_i^1, y_i^2, ..., y_i^P), where x_i^j is an element of the set X_i of input image features and y_i^j is an element of the set Y_i of output keywords for the ith training image. The input features can be either features extracted from the input images or correlated features extracted from the image and the tags collectively. The size of the training dataset may also influence the overall performance of the model, as more training images mean a lower generalization error.

Obtaining a large fully manually annotated dataset is burdensome, time consuming and expensive. Examples of supervised datasets for annotation include [187, 188, 189, 190]. The advantage of supervised learning is that the training dataset enables the model to learn the concept characteristics and the classification rule; thus, a large training set enhances the capability of the trained model. The basic idea behind all image annotation models based on supervised learning [180, 191, 18, 120, 192, 1, 193, 121, 92, 142, 110, 194] is to effectively exploit the labeled data in the training set. The discriminative supervised annotation models [192, 180, 194] train a classifier by exploiting the visual and textual features of labeled images. Class imbalance is also an issue in fully annotated datasets: each label should have a sufficient and roughly similar number of training examples to avoid underfitting of the decision boundary, which poses a scalability problem. A supervised annotation model in which both retrieval and annotation are considered classification problems defined by the database rather than the query is proposed in [180]. The dataset [187] used in [180] is fully manually annotated, and the proposed probabilistic model calculates the conditional probability density given the feature vector and class label. An extension of the regression model into a classification model [60], by exploring the annotations and class labels jointly, modified supervised LDA (sLDA) into multiclass sLDA for simultaneous classification and annotation. The multi-label annotation task, where the number of labels varies between images, further complicates the design and complexity of the supervised model; if there is a vast number of output labels, then the training time of the annotation model becomes very high.

In supervised learning, once the model is trained, the training dataset becomes obsolete, so the training time of the model is usually considered as the computational complexity of the model.
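To make the supervised multi-label notation introduced above concrete, a hypothetical training set can be stored as a feature matrix X together with a binary label matrix Y; the keyword list and the random features below are purely illustrative assumptions:

```python
import numpy as np

# Hypothetical supervised multi-label training set: m = 3 images, each with an
# N = 6 dimensional feature vector X_i and a P = 4 dimensional binary label
# vector Y_i (Y[i, j] = 1 iff keyword j is assigned to image i).
keywords = ["sky", "tree", "dog", "water"]
X = np.random.rand(3, 6)
Y = np.array([[1, 1, 0, 0],   # image 0: sky, tree
              [0, 0, 1, 0],   # image 1: dog
              [1, 0, 0, 1]])  # image 2: sky, water

# a single training item (x_i^1, ..., x_i^N, y_i^1, ..., y_i^P) is one row of each
xi, yi = X[0], Y[0]
print([k for k, v in zip(keywords, yi) if v == 1])   # ['sky', 'tree']
```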
The supervised image annotation model of [92] uses a multi-instance learning technique to model the instance context, and a kernel based classifier is used for multi-label annotation. The graph based model [110] utilizes the co-occurrence of semantic concepts of the training images, and, using visual features and semantic concepts, the probabilities of the candidate tags are obtained.

4.4.2. Semi-supervised Learning

The main drawback of supervised learning is that it requires a large number of annotated training images, which are difficult to obtain; also, for large scale datasets, the training time of supervised models is usually very high. To deal with these complications, a different way to train the model, called SSL, weakly supervised learning or active learning, has been proposed in the literature [195, 43]. SSL uses only a small amount of labeled data and makes use of unlabeled data to train a model. SSL can deal with noisy, incomplete and imbalanced training datasets; the basic motive behind SSL based models is to reduce the size of the labeled training set. When an image dataset is annotated by users [188, 190], the tagged images are usually noisy: the tagged labels may not represent the contents of the images accurately, and user tags are either incomplete or over-tagged. Over-tagging is a kind of noise, and such tags have to be removed; the noisy tags must be replaced with relevant tags that accurately reflect the concepts of the image. Various methods for denoising tags [26, 168, 24] have been proposed in the literature, and various techniques [21, 24] deal with incomplete tags. A statistical model that explores word-to-word correlation [19] can be used directly with any active learning scheme to significantly reduce the required number of annotated examples. The use of a mixture model in a semi-supervised framework [80] showed good results in the early days of semi-supervised learning. Green's function is very effective for single label tagging in the semi-supervised framework, and its extended version [196], which exploits label correlations, can be used for multi-label annotation in the semi-supervised framework.

The popularity of SSL based image annotation is due to its effectiveness on growing image databases with noisy and incomplete tags. An SSL based training dataset contains only a small number of fully labeled images and a majority of unlabeled or noisily labeled images. SSL based annotation methods either first train the model on a small set of labeled data and then refine the noisy labels using various correlations, or first refine the noisy labels and then train a model on the whole dataset. Hessian regularization based annotation models [100, 130, 178], which operate in the semi-supervised framework, have produced good results: multi-view Hessian regularization (mHR) [130] combines multiple kernels with the Hessian regularizers obtained from various views, and the sparse coding based SSL model [178] that uses mHR for annotation produced competitive results.

Graph based models in the SSL framework [159, 140, 6, 139, 174] have exploited both labeled and unlabeled images; graph based models can also be used to find the consistency among labels and the label correlation [159]. Usually, SSL based annotation models cannot handle a very large number of unlabeled images and have a limit on the maximum number of unlabeled images.
In [140], large scale multi-label propagation (LSMP) is proposed to handle a large number of unlabeled images in the SSL framework. A graph Laplacian based method for labeled and unlabeled images is proposed in [6]; the method, called structural feature selection with sparsity (SFSS), constructs a graph Laplacian on the visual features of the images to find label consistency. Random walk with restart (RWR) over a bi-relational graph (BG) [139] measures class-to-image correlation and is a very good example of a graph based model for AIA in the SSL framework.

To learn the agreement between keywords and image regions, multiple instance learning (MIL) techniques have been exploited [197] with deep learning features in the SSL framework. A simple assumption about the keywords associated with the training set is that at least one keyword provided with the initial data is correct; then, using a deep neural network and MIL, the class of the image is predicted. The use of an instance selection approach to reduce the complexity of a large scale image annotation system in the SSL framework is proposed in [198]; the method uses a single label prototype selection based instance selection technique, extends it to the multi-label setting, and removes those instances which do not affect the classification. The association of image objects with their attributes in a semi-supervised framework [22, 199] has achieved excellent performance: by using a non-parametric model in SSL, the authors tried to find the association between objects and attributes.

The fusion of multigraph learning, matrix factorization and multimodal correlation using a hash function, for learning the annotation model with only a small number of labeled tags, is proposed in [25]. The proposed method can handle a very large number of unlabeled images.

4.4.3. Unsupervised Learning

Unsupervised learning based methods are among the most attractive annotation methods. They are perfectly suited to the large numbers of unorganized images available these days. Unsupervised learning based annotation methods do not require labeled training images (strongly or weakly labeled), which is one of their strongest points. This does not mean that unsupervised annotation methods can annotate images out of nowhere: they also need text to label an unlabeled image, but the candidate labels are mined from the metadata. Each image on the web has a URL, some text surrounding the image and other associated information; this information and text associated with the image are called metadata. An unsupervised learning based annotation method mines labels from this metadata and annotates the image. As candidate labels are obtained from the image metadata, unsupervised learning based methods do not require a fully or partially labeled training dataset and can annotate images without training a model [15]. Metadata such as the URL, GPS information, surrounding text, filename, etc. usually provides significant clues about the concepts of the image. Although not all the text in the metadata is relevant for annotation, the metadata contains almost all the candidate labels needed to describe the contents and concepts of the image. Mining the candidate labels from the metadata and finding associations among candidate labels to produce the final labels for an image is a challenging task.
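A very rough sketch of such candidate-label mining is given below; the stop-word list, the vocabulary filtering and the example metadata fields are assumptions made for illustration only and do not reflect any specific system discussed above:

```python
import re

STOPWORDS = {"the", "a", "of", "and", "in", "on", "jpg", "jpeg", "png",
             "img", "www", "com", "http", "https"}

def mine_candidate_labels(metadata, vocabulary):
    """Toy candidate-label mining from image metadata (filename, URL,
    surrounding text): tokenize every field, drop stop words, and keep the
    tokens that appear in a known annotation vocabulary.

    metadata   -- dict of free-text fields associated with the image
    vocabulary -- set of admissible keywords
    """
    tokens = []
    for text in metadata.values():
        tokens += re.split(r"[^a-z0-9]+", text.lower())
    return sorted({t for t in tokens
                   if t and t not in STOPWORDS and t in vocabulary})

metadata = {"filename": "golden_retriever_beach_01.jpg",
            "url": "http://example.com/photos/dog/beach",
            "alt_text": "A dog running on the beach"}
vocab = {"dog", "beach", "cat", "mountain", "running"}
print(mine_candidate_labels(metadata, vocab))  # ['beach', 'dog', 'running']
```

Real systems additionally weight the mined candidates (e.g. by their visual or semantic relevance) before producing the final annotation.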
Mining the labels from metadata also enables variable length tags, which is another driving factor behind the attractiveness of unsupervised learning based image annotation. The metadata of images may be noisy and unstructured, so the candidate label detection method should be robust enough.

In one of the early works on unsupervised learning based annotation [28], a PLSA based model is used which is first trained on the text and then on the visual features; the images are then annotated by inference. A reinforcement learning based graphical model [17] obtains initial candidate annotations from the surrounding text (e.g., filename, URL, ALT text, etc.); then, by mining semantically and visually related images from a large scale image database, the initial candidate annotations are further refined, and a ranking technique is used to obtain the final annotation. The annotation of personal photo collections [16] uses GPS, timestamps and other metadata to find the correlation among scene labels and between scene labels and event labels. A restricted Boltzmann machine (RBM) based model for geographical image annotation is presented in [13].

Since 2012, ImageCLEF has been organizing a scalable image annotation task in which various expert teams from all over the world participate. The idea is to find a robust relationship between images and their surrounding text; the focus is mainly on obtaining the annotation from web metadata, and the motive is to design an image annotation system in which the number of keywords is scalable. A large number of teams participate in the event, and some teams have come up with excellent proposals [200, 46, 201, 202, 45, 44] to deal with the noisy and irrelevant text associated with web images. Although each year's annotation challenge includes various subtasks, scalable concept detection and tagging remain at the center of the challenge. Overall, significant progress has been made towards unsupervised learning based image annotation. Still, finding the correlation between metadata and images is a very challenging task, as metadata provides only very weak supervision.

Discussion: Although supervised annotation has reached an advanced stage and achieved an excellent level of accuracy, it is not desirable in the current situation for a number of reasons: (i) it requires a large number of fully manually annotated images, which is a time consuming process and difficult to obtain; (ii) the training time of the supervised model is high; (iii) if the training data changes, the model needs to be retrained; (iv) it is very difficult to scale the model, hence it is not suitable for large, growing databases; (v) the length of the tag is fixed; (vi) it cannot handle noisy or incomplete training sets. To deal with these problems, SSL based annotation models have been introduced and have gained significant popularity. Although SSL can deal with almost all the problems of the supervised model mentioned above, it still needs some manually annotated images; also, a semi-supervised model usually has a limit on the maximum number of unlabeled images it can handle. The length of the tags remains a challenge for SSL based models, as it is usually fixed, and fixed length tags may not describe the contents accurately. All the images on the internet are associated with metadata (e.g., filename, alternate text, surrounding text, URL, etc.), and this metadata can be mined to obtain the tags for the image.
Unsupervised learning based models can produce variable length tags and do not require labeled data. Training the model on the image metadata is a challenging task, as most of the text in the metadata is unrelated and/or redundant; this noisy metadata has to be mined and refined to obtain the tags. Although the task is challenging, it has become very popular as it suits the real-life situation of dealing with large scale databases and web images.

The recent advancement in the processing capabilities of chips, the increased size of training data, and advances in machine learning research have led to the widespread application of deep learning techniques. A deep learning model consists of multiple stages of non-linear information processing units that represent features at successively higher levels of abstraction, in a supervised or unsupervised manner [203]. To tackle the problem of the semantic gap, deep learning based feature representation techniques [204] can be utilized, as they contain richer semantic representations than handcrafted feature extraction methods. One of the key requirements of deep learning is a large amount of training data; also, training a deep neural network from scratch is a very time-consuming process. Nowadays, several publicly available datasets with a very large number of training images exist, and pre-trained deep learning models can be generalized to tackle the requirement for large training sets and to reduce the training time. The use of deep learning in a semi-supervised learning framework for AIA can be seen in [197], where the authors propose a MIL based deep learning architecture for AIA and achieve commendable results. Although deep learning models are extensively used for various computer vision tasks and have shown benchmark performance, the application of deep learning for AIA is still in its early stage and needs the active participation of the research community.

4.5. Image Annotation Based on User Interaction

Another way to view an image annotation system is through a user's interaction with the system. If the system is fully automatic, then there is no interaction with the user: once the data is submitted, the system performs the annotation automatically. If the system is fully manual, then a human expert performs the annotation, which is itself a tedious, time consuming and very expensive process. Annotation can also be performed in a semi-automatic manner, where most of the processing is handled by the system and the user interacts with the system using relevance feedback or other mechanisms to improve the confidence of the model. The focus of most researchers is towards AIA [174, 26, 25, 19, 142, 98, 150, 141, 10, 192, 50]. AIA can be achieved through CBIR, provided that the problem of the semantic gap is addressed effectively. The advantages of AIA include efficient linguistic indexing, where a query can be specified in natural language. The basic idea of AIA is to minimize user intervention during annotation. Although a human can annotate images with high accuracy, one cannot expect all humans to have the same expertise [205]: a novice user has very little knowledge about the contents of the image and the associated correlated keywords, hence an image annotated by a novice user may contain a completely different set of labels from those assigned by an expert.
However, when a machine performs the annotation, this subjectivity of the annotated labels disappears, and the same set of labels is produced on every run.

Although AIA removes the subjectivity, once the model produces a final annotation it cannot be changed or improved. Also, AIA requires an accurate training dataset so that the classifier can be trained accurately; if the training dataset is of low quality, the performance of the model will degrade [206]. This has turned the focus of researchers towards semi-automatic image annotation. The performance of an annotation model can be improved greatly with a little intervention from the user, which can take the form of relevance feedback or the identification of the relevant areas of the image to be annotated. Also, a semi-automatic annotation system does not require a high quality ground truth dataset. Some promising work in the field of semi-automatic image annotation is presented in [207, 208]. A semi-automatic annotation system called Photocopian [207] takes information from camera metadata, GPS data, calendar data, etc. to annotate an image and has the ability to adopt new annotation or classification services. To represent the semantics, image tags can be linked using predicate words [208]; these semantic links among image tags are constructed semi-automatically.

Relevance feedback based annotation systems try to improve the candidate annotation based on the user's feedback [209, 208, 210]. In this kind of model, a set of candidate labels is first obtained automatically, and then, using the relevance feedback, the candidate tags are further refined to generate the final annotation. Here, relevance feedback is one form of a user's interaction with the system during the annotation process, which is why we call it a form of semi-automatic system. A feature selection strategy to annotate images [209, 211] also incorporates keyword similarity and relevance feedback to find the similar keywords and visual contents that can be used for retrieval of similar and relevant images. A relevance feedback based model [210] for medical images uses a classification approach for annotation; the authors use keyword based image retrieval with relevance feedback.

No matter which strategy is used, the semi-automatic annotation model has the advantage of on-site user interaction to improve the quality of the annotation. However, it requires a user to be present during the annotation process in one form or another, although it has the advantage of dealing with incomplete datasets. On the other hand, AIA does not require the user's presence during annotation, and the system is consistent. Because of the availability of a large number of datasets and advancements in AIA, AIA has become much more popular than semi-automatic annotation; due to these advancements, it can even deal with noisy, incomplete and unstructured datasets.

4.5.1. Image Annotation via Crowdsourcing

Image annotation via crowdsourcing is a collaborative approach to obtaining annotated images. The images are labeled by non-expert users (paid or volunteer). To obtain a fully annotated image database for training and ground truth purposes, either the dataset is labeled by expert annotators, which is time consuming and expensive, or it can be annotated via crowdsourcing, which requires much less time and is less costly.
One of the most popular platforms for crowdsourcing is Amazon Mechanical Turk (MTurk). MTurk is an online crowdsourcing platform that allows a requester to post a task, called a human intelligence task (HIT), on MTurk; turkers (workers) across the world execute the given task in a very short span of time, at a small cost to the requester [212]. The MTurk turkers are non-expert annotators, and a large number of workers usually performs the annotation.

Annotation on MTurk is performed by thousands of non-expert turkers, hence quality control is essential to obtain high-quality annotated data. MTurk has no effective built-in mechanism for quality control and offers minimal control over which participants (turkers) are allowed to perform the annotation [212]. If images are labeled with free-form text, the result is a very large, random collection of labels, which is undesirable. Even if the image is tagged from a set of concepts provided by the requester, images may be annotated with irrelevant and noisy labels, primarily due to the turkers' lack of motivation, lack of knowledge, or carelessness. Also, as an image may be annotated by a large number of turkers with different expertise and backgrounds, the consistency of the labels cannot be guaranteed.

MTurk provides a qualification test for workers, which may improve the quality of the result if a careful test is applied to each worker. We can also reduce the diversity of labels by setting a limit on the number of workers. Apart from these, various inter-annotator agreement tests exist to check the quality of annotated data [213, 214]. Different turkers often label the same image with different sets of labels; if the number of workers is odd, we can break the tie, otherwise we can use an inter-annotator agreement test to judge the quality of the annotated image data.

5. Image Annotation Evaluation Methods

The keywords assigned to an image represent the semantic contents of the image. When an image is assigned only one keyword, it can be considered single label annotation, or simply binary classification, where a classifier ascertains only the presence or absence of the keyword. One label alone cannot represent the true contents of the image; hence, image annotation methods usually assign multiple keywords to indicate the presence of multiple objects in the image. This type of method is called a multi-label annotation system, or it can also be considered a multi-class classification system.

To ascertain the accuracy of an annotation system, there exist two broad classes of evaluation measure: (i) qualitative measures and (ii) quantitative measures. The qualitative measures [215] deal with human subject based assessment: subjects are asked to evaluate the performance of the system so that a more comprehensive picture of the annotation system can be obtained. The quantitative evaluation deals with system level evaluation, where a ground truth dataset is used to ascertain the accuracy of the system.

For a single label annotation system, the accuracy of the system can be used as the performance evaluation criterion [151]; here, accuracy refers merely to the overall percentage of correctly classified test images over the total number of test images. In the case of multi-label annotation, however, the performance evaluation criteria are much more complicated.
Also, many annotation systems are ranking based models where the labels are ranked by some confidence factor, which may require a different set of evaluation criteria [216].

For a single label annotation system, the system is provided with a test dataset, and the obtained result is evaluated using precision (1), recall (2) and F1 score (3) as performance evaluation criteria [183]:

\[ P = \frac{TP}{TP + FP} \tag{1} \]
\[ R = \frac{TP}{TP + FN} \tag{2} \]
\[ F1 = 2 \times \frac{P \times R}{P + R} \tag{3} \]

where
TP (True Positive): both the actual and the obtained results indicate the presence of the label;
TN (True Negative): both the actual and the obtained results indicate the absence of the label;
FN (False Negative): the label is present in the actual annotation (ground truth), but the obtained result shows the absence of the label;
FP (False Positive): the obtained result shows the presence of the label even though the label is absent in the ground truth.

For multi-label annotation, three evaluation measures, equations (4)-(6), have been proposed [217]. A multi-label annotation system can produce a fixed or a variable number of keywords: in fixed length annotation, multiple keywords are assigned to each image, but their number is fixed, whereas in variable length annotation the number of assigned keywords varies from image to image. The "one-error" is the multi-label counterpart of the accuracy of a single-label annotation system: it measures how often the top-ranked label is not in the set of relevant labels, that is, it counts false positives. The second measure, coverage, evaluates the system in terms of its label ranking: it measures how far down the ranked label list one must go, on average, to cover all the relevant labels of an image [217]. The average precision, a measure borrowed from image retrieval (IR), can be used for label ranking; it calculates the average fraction of labels ranked above a given relevant output label that are themselves relevant. Note that IR evaluation methods measure whether the retrieved images are relevant to the query and how many retrieved images are relevant to the user's query, whereas image annotation evaluation methods measure the prediction of accurate labels: how many tags are accurately predicted, how many tags are missing from the result, etc., for a query image.

Let m be the number of training samples. The input dataset is given as <(X_1, Y_1), (X_2, Y_2), ..., (X_m, Y_m)>, where X_i and Y_i are the input and output sets respectively for the ith training sample, y_i^j is an element of the set Y_i, h(X_i) represents the set of top-K predicted labels for X_i, and rank_h(X_i, l) is the real-valued confidence factor for label l among the top-K predicted labels for X_i.

\[ \text{One-error} = \frac{1}{m} \sum_{i=1}^{m} \left[ h(X_i) \notin Y_i \right] \tag{4} \]
\[ \text{Coverage} = \frac{1}{m} \sum_{i=1}^{m} \max_{l \in Y_i} \operatorname{rank}_h(X_i, l) - 1 \tag{5} \]
\[ \text{Avg. precision} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|Y_i|} \sum_{l \in Y_i} \frac{\left| \{ l' \in Y_i \mid \operatorname{rank}_h(X_i, l') \le \operatorname{rank}_h(X_i, l) \} \right|}{\operatorname{rank}_h(X_i, l)} \tag{6} \]

In (4), [ · ] equals 1 if the condition inside holds and 0 otherwise, and h(X_i) denotes the top-ranked predicted label.

Later, in [216], the authors used equations (4)-(6) along with four other evaluation criteria: three for multi-label annotation, (7)-(9), and one for label ranking, (10). Hamming loss counts the misclassification of image-label pairs, that is, it averages how often an actual and a predicted label differ. "Macro-F1" averages the F1-measure over the predictions for the different labels, whereas "Micro-F1" calculates the F1-measure over the predictions of all labels taken as a whole. A low value of Hamming loss and large values of micro-F1 and macro-F1 indicate excellent performance of the system. "Ranking loss" evaluates the performance of the label ranking and calculates the average fraction of label pairs that are not correctly ordered.

\[ \text{Hamming loss} = \frac{1}{m} \sum_{i=1}^{m} \frac{\lVert h(X_i) \oplus Y_i \rVert_1}{N} \tag{7} \]
\[ \text{Macro-F1} = \frac{1}{N} \sum_{i=1}^{N} \frac{2 \sum_{j=1}^{m} h^i(X_j)\, y_j^i}{\sum_{j=1}^{m} y_j^i + \sum_{j=1}^{m} h^i(X_j)} \tag{8} \]
\[ \text{Micro-F1} = \frac{2 \sum_{i=1}^{m} \lVert h(X_i) \cap Y_i \rVert_1}{\sum_{i=1}^{m} \lVert Y_i \rVert_1 + \sum_{i=1}^{m} \lVert h(X_i) \rVert_1} \tag{9} \]
\[ \text{Ranking loss} = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{|Y_i|\,|\bar{Y}_i|} \left| \{ (y_t, y_s) \in Y_i \times \bar{Y}_i \mid \operatorname{rank}_h(X_i, y_t) \le \operatorname{rank}_h(X_i, y_s) \} \right| \tag{10} \]

where N is the number of labels for an image, \(\lVert \cdot \rVert_1\) is the \(\ell_1\) norm, \(\oplus\) is the logical XOR operation, \(\cap\) denotes the element-wise intersection (logical AND), and \(\bar{Y}_i\) denotes the complement of \(Y_i\).
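A minimal sketch of how several of these measures can be computed from binary label matrices is given below (assuming the predictions h(X_i) are already binarized; the function name and the pooled computation of precision and recall over all image-label pairs are implementation choices made for illustration only):

```python
import numpy as np

def evaluation_metrics(Y_true, Y_pred):
    """Minimal sketch of some of the measures above for binary label matrices
    of shape (m images, N labels). Precision, recall and F1 follow eqs. (1)-(3)
    with TP/FP/FN pooled over all image-label pairs; Hamming loss and Macro-F1
    follow eqs. (7) and (8)."""
    Y_true = np.asarray(Y_true, dtype=bool)
    Y_pred = np.asarray(Y_pred, dtype=bool)
    tp = np.sum(Y_true & Y_pred)
    fp = np.sum(~Y_true & Y_pred)
    fn = np.sum(Y_true & ~Y_pred)
    precision = tp / max(tp + fp, 1)                              # eq. (1)
    recall = tp / max(tp + fn, 1)                                 # eq. (2)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)  # eq. (3)
    hamming = np.mean(Y_true ^ Y_pred)                            # eq. (7)
    tp_per_label = np.sum(Y_true & Y_pred, axis=0)
    denom = np.maximum(Y_true.sum(axis=0) + Y_pred.sum(axis=0), 1)
    macro_f1 = np.mean(2 * tp_per_label / denom)                  # eq. (8)
    return {"precision": float(precision), "recall": float(recall),
            "F1": float(f1), "hamming_loss": float(hamming),
            "macro_F1": float(macro_f1)}

# toy ground truth and predictions for 2 images over 4 labels
Y_true = [[1, 0, 1, 0], [0, 1, 1, 0]]
Y_pred = [[1, 0, 0, 0], [0, 1, 1, 1]]
print(evaluation_metrics(Y_true, Y_pred))
```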
All these evaluation measures have been frequently used for the performance evaluation of annotation systems [113, 151, 58, 155, 183, 92]. For evaluating the performance of large scale image annotation systems, two complementary evaluation metrics [218] ascertain the robustness and stability of the annotation system. To assess robustness, the authors propose the zero-rate annotation accuracy, which counts the number of keywords that have never been predicted accurately. To assess stability, the authors propose the coefficient of variation (CV), which measures the variation in annotation accuracy across keywords: for a system to be stable, the value of CV should be low, which indicates that the annotation accuracies are close to each other. This helps in the retrieval of similar images when the query can be any of the keywords [218].

6. The Database for Image Annotation

There are several publicly available databases for the training and evaluation of image annotation and retrieval systems. For an annotation system to be effective, it must be trained on a large, balanced dataset, although efforts have been made to train systems even when the dataset is unbalanced [219]. The creation of a balanced and manually annotated dataset is an expensive and time-consuming process; thus, efforts have been made to develop semi-supervised models that can be trained using noisy or incomplete tags [27, 101, 82]. Nevertheless, to train a model, manually annotated images are needed one way or another. Various communities have made efforts to develop standard, balanced databases for training and evaluation purposes. Details of some of the globally accepted and standard databases are given below.

6.1. The Corel Database

The Corel database is created from the Corel Photo Gallery [187]. Several research groups [27, 183, 81, 220, 221] have used Corel data for the evaluation of their annotation systems. The Corel dataset is an entirely manually annotated dataset labeled by human experts. There are various versions of the Corel dataset, as explained below.

Corel5K: Corel5K contains 5000 manually annotated images. Each image is either 192 × 128 or 128 × 192 pixels. There are in total 371 unique words, and each image is annotated with 1 to 5 keywords [220]. The Corel5K dataset is tiny and has a small vocabulary.
Hence, when an annotation system is evaluated on the Corel5K dataset, it is difficult to determine whether the proposed system has good generalization capability.

Corel30K: This is an extension of the Corel5K dataset with 31695 images. The images are 384 × 256 or 256 × 384 pixels. The size of the vocabulary is increased to 5587. Each image is labeled with 1 to 5 keywords, and the average number of words per image is around 3.6 [127].

Corel60K: There has been another extension of the Corel database, known as Corel60K. It is a balanced dataset containing 60000 images of 600 different categories; each category has around 100 images of 384 × 256 or 256 × 384 pixels. There are 417 distinct keywords, and each image is tagged with 1 to 7 keywords.

Although the Corel dataset is a standard dataset, Ref. [191] pointed out some of its disadvantages. The authors implemented three automatic image annotation methods (CSD-prop, SvdCos and CSD-svm) [222] on the Corel dataset and compared the results with some of the state-of-the-art methods (the translation model [223], CRM model [98], MBRM model [97], and MIX-Hier model [42]). The authors showed that, when the training and testing sets are both drawn from the Corel dataset itself, it is relatively easy to perform annotation [191]. They also argued that the Corel dataset contains redundant training information, and that a model can be trained using only 25% of the training information.

6.2. ImageNet Large Scale Visual Recognition Challenge (ILSVRC)

Since 2010, ImageNet has organized a competition every year for the detection, classification, localization, etc. of objects in images [188]. ILSVRC-10 is a competition targeted at evaluating the effectiveness of image annotation (multi-label classification) methods, where the goal is to annotate each image with at most five labels in descending order of confidence. There are 1,000 object categories, and the labels are organized hierarchically with three levels, containing 1,000 non-overlapping nodes at the leaf level. The dataset comprises 1.2 million training images, each having at most five labels, together with 50,000 validation images and 150,000 test images collected from Flickr and other sites. All the training images are manually labeled without any segmentation. In ILSVRC-2011, the organizers added an extra task, namely to classify and localize the objects; the goal of this new task is to predict the top five class labels and five bounding boxes, one for each class label. Later, in ILSVRC-2013, the organizers ran the competition with two tasks: object detection (a new task) and classification and localization (the same as ILSVRC-2011). The object detection challenge is similar to the PASCAL VOC challenge [224] but with a substantially larger number of object categories and a larger image dataset; its goal is to identify the object class (200 categories) from entirely manually annotated images with bounding boxes. Later, object detection in video and scene classification tasks were included in the competition (ILSVRC-2015, ILSVRC-2016, ILSVRC-2017).

6.3. IAPR TC-12 Benchmark

The IAPR TC-12 benchmark consists of 20000 natural images [189]. The database contains images used during the 2006 to 2008 ImageCLEF evaluation campaigns.
6.3. IAPR TC-12 Benchmark

The IAPR TC-12 benchmark consists of 20000 natural images [189]. The database contains the images used during the 2006 to 2008 ImageCLEF evaluation campaigns. The images in IAPR TC-12 contain multiple objects and come with complete annotations (full-text descriptions in English, German and Spanish) as well as light annotations (all annotations except the description). There are around 291 unique labels, and each image has approximately 1 to 23 labels. On average, there are 5.7 labels per image and between 153 and 4999 images per label, with 347 images per label on average. The dataset is publicly available without any copyright restriction.

There is an extended version of the IAPR TC-12 benchmark called the segmented and annotated IAPR TC-12 (SAIAPR TC-12) [91]. SAIAPR TC-12 includes all the images of IAPR TC-12 along with segmentation masks and segmented images. It contains region-wise extracted features together with the labels assigned to each region; region-level annotations organized according to a hierarchy and spatial-relationship information are also provided. The SAIAPR TC-12 benchmark is likewise publicly available without any copyright restriction.

6.4. ImageCLEF Photo Annotation

ImageCLEF was launched in 2003 as part of the Cross-Language Evaluation Forum (CLEF) for the performance evaluation of concept detection, annotation, and retrieval methods; it has organized medical image annotation and retrieval tasks since 2005 and started the visual concept detection and annotation challenge for photo images in 2008 [225]. At first, only a small number of manually annotated photo images was available for training a model (1800 in ImageCLEF-2008, 5000 in ImageCLEF-2009, 8000 in ImageCLEF-2010 and ImageCLEF-2011). Since 2012, the organizers have introduced a new challenge called the scalable image annotation task. The idea is to rank the annotated keywords and decide how many keywords can be assigned to an image. Moreover, the training dataset contains only textual features (URLs, surrounding texts, etc.) that can be mined and used as labels, and the size of the training set has been increased significantly. The ImageCLEF 2013 photo annotation task is a benchmark for visual concept detection, annotation and retrieval of photos [226]. The dataset contains 250,000 training images downloaded from the internet, 1,000 development-set images and 2,000 test-set images belonging to 95 categories, and ground truth is provided only for the development set. The dataset is designed to check the scalability of the proposed annotation system; hence, the concept lists differ between the training set and the development and test sets, and there are no manually annotated images in the training set. The organizers provide textual as well as visual features with the dataset. The textual features include the complete web pages in XML form, lists of word-score pairs, the image URLs, and the rank of each image when it is retrieved through a search engine. The visual features include four types of SIFT features (SIFT, C-SIFT, RGB-SIFT, OPPONENT-SIFT), two kinds of GIST features (GIST, GIST2), LBP center features, two types of color features (COLOR HIST, HSVHIST) [227] and GETLF. The dataset has been used in [15, 81].
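For the scalable ImageCLEF task just described, labels for training images must be mined from the accompanying textual metadata rather than taken from manual annotations. The toy sketch below (not the ImageCLEF baseline; the word-score format, threshold and concept list are illustrative assumptions) shows one simple way such mining could be done:

```python
from typing import Dict, List

def mine_weak_labels(word_scores: Dict[str, float], concepts: List[str],
                     threshold: float = 0.1, max_labels: int = 5) -> List[str]:
    """Keep the concepts whose surrounding-text score passes a threshold,
    rank them by score and cut the list at max_labels (weak labels only)."""
    hits = [(c, word_scores.get(c, 0.0)) for c in concepts]
    hits = [(c, s) for c, s in hits if s >= threshold]
    hits.sort(key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in hits[:max_labels]]

# Word-score pairs mined from one image's web page (hypothetical values).
word_scores = {"beach": 0.82, "sea": 0.64, "car": 0.05, "sunset": 0.31}
concepts = ["beach", "car", "dog", "sea", "sunset"]
print(mine_weak_labels(word_scores, concepts))  # ['beach', 'sea', 'sunset']
```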
6.5. NUS-WIDE Database

The NUS-WIDE dataset was created by NUS's Lab for Media Search for annotation and retrieval of Flickr images [190]. It contains 269,648 images with 5,018 unique tags and also includes six types of low-level features. The dataset is divided into 161,789 training images and 107,859 testing images. For evaluation purposes, ground truth for 81 concepts is also provided. The six features are a 64-D color histogram, a 144-D color correlogram, a 73-D edge direction histogram, a 128-D wavelet texture, 225-D block-wise color moments and a 500-D bag of words based on SIFT descriptors. NUS-WIDE also comes in three additional versions: (i) a light version called NUS-WIDE-LITE, (ii) the NUS-WIDE-OBJECT dataset, which contains only one object per image, and (iii) the NUS-WIDE-SCENE dataset, where each image depicts one scene. Several research groups have used the NUS-WIDE dataset [27, 198, 101, 82, 168, 48] to evaluate the performance of their proposed methods.

6.6. ESP Game Database

The ESP Game dataset contains images from the ESP Game, where images are labeled collaboratively by users. It covers a wide range of images, from drawings to logos to photographs including humans. The dataset includes 67,796 labeled images, and subsets of it have been used in multiple studies [1, 21, 53, 27, 183].

There exist various other datasets like PASCAL VOC, MIR Flickr, LabelMe, MS COCO, etc. All these datasets can be used to check the competitiveness of proposed methods. Table 2 shows several primary statistics of the datasets mentioned above.

Table 2: Some of the predominant databases (Corel [187], ILSVRC [188], IAPR TC-12 [189], SAIAPR TC-12 [91], ImageCLEF [225, 226], NUS-WIDE [190], ESP Game [1]) used for the training and evaluation of image annotation methods. A dash indicates that the figure is not reported.

Dataset                               Training   Validation   Test      Concepts   Annotation
Corel5K                               5,000      -            -         371        Fully manual
Corel30K                              31,695     -            -         5,587      Fully manual
Corel60K                              60,000     -            -         417        Fully manual
ILSVRC-10                             1.2M       50,000       150,000   1,000      Fully manual
ILSVRC-11 to ILSVRC-14 (C&L)*         1.2M       50,000       -         1,000      Fully manual
ImageCLEF-2008                        1,800      -            1,000     16         Fully manual
ImageCLEF-2009                        5,000      -            1,300     53         Fully manual
ImageCLEF-2010                        8,000      -            10,000    93         Fully manual
ImageCLEF-2011                        8,000      -            10,000    99         Fully manual
ImageCLEF-2012 to 2016 (web data)**   250,000    1,000        2,000     -          Web metadata only, no manual labels
IAPR TC-12                            20,000     -            -         291        Fully manual
SAIAPR TC-12                          20,000     -            -         -          Fully manual (region level)
NUS-WIDE                              161,789    -            107,859   81         User tags; manual ground truth for 81 concepts
ESP Game                              67,796     -            -         -          Collaborative user labeling

* (C&L) = Classification and Localization task. ** (web data) = collection of automatically obtained web images and their associated metadata.
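The per-dataset statistics reported above and in Table 2 (labels per image, images per label, degree of imbalance) are straightforward to compute from an annotation file. The sketch below assumes a simple dictionary format mapping image identifiers to label lists; both the format and the function name are illustrative assumptions rather than part of any released toolkit.

```python
from collections import Counter

def annotation_stats(annotations):
    """annotations: dict mapping image id -> list of keyword labels (no duplicates).
    Returns the average number of labels per image and the per-label image
    counts, which expose how balanced (or imbalanced) a dataset is."""
    labels_per_image = [len(labels) for labels in annotations.values()]
    images_per_label = Counter(label for labels in annotations.values() for label in labels)
    avg_labels = sum(labels_per_image) / len(labels_per_image)
    return avg_labels, images_per_label

# Tiny toy annotation file.
annotations = {"img1": ["sky", "tree", "dog"], "img2": ["sky", "sea"], "img3": ["dog"]}
avg_labels, images_per_label = annotation_stats(annotations)
print(avg_labels)                      # 2.0 labels per image on average
print(images_per_label.most_common())  # e.g. [('sky', 2), ('dog', 2), ('tree', 1), ('sea', 1)]
```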
Discussion: It is vital that a system is evaluated on a well-balanced, unbiased dataset to check the competitiveness of any proposed method. The demerits of the Corel dataset are pointed out in [191]. Most of the available datasets are manually labeled by experts or by users, and manually annotated images are subjective [205], i.e., the set of labels a person assigns to an image may vary from person to person. As the number of images keeps increasing, the need of the hour is to develop scalable systems using weakly supervised and unsupervised learning. A scalable annotation system can easily change, or scale, the number of keywords used for image annotation. Very few datasets exist for developing such scalable systems (ImageCLEF 2012 to 2016 Photo Annotation and NUS-WIDE), and these datasets do not contain labels for drawings, logos or complex hidden objects. We believe that more datasets designed with scalability in mind and containing a wider variety of images would support the development of full-fledged annotation systems.

7. Conclusion

This article presented a comprehensive study of the methods and emerging directions that have evolved in the field of image annotation. The study started with the early years of image annotation and showed that the semantic gap was the most challenging problem faced by almost all approaches in the first decade. Approaches proposed in the second decade largely resolved the semantic-gap problem, but new hindrances, such as the limitations of supervised-learning-based AIA and the demands of large-scale image annotation, unfolded in that decade. Although this article touches on every aspect of image annotation reported over the last three decades, much attention is paid to the progress made in the previous decade and to current trends in AIA. The article discussed new approaches that do away with the need for fully labeled training data, explored current trends in unsupervised-learning-based approaches for AIA, and concluded that unsupervised-learning-based AIA is a current and promising field that needs to be explored meticulously.

This article also presented details of some of the predominant databases used for the training and evaluation of annotation methods. Most of the available databases are intended for the training and evaluation of supervised-learning-based image annotation methods, and only a handful of databases exist for the training and evaluation of semi-supervised and unsupervised methods. Apart from that, we have also described the performance evaluation measures. The evaluation measures for multi-label annotation methods are considerably more involved than those for single-label annotation systems; performance evaluation measures for multi-label and ranking-based annotation methods have been described in this paper.

Each section and subsection of the article concludes with a discussion detailing the prominent challenges faced and conjecturing specific future directions. We believe this discourse will help readers gain intuition for exploring the field in the foreseeable future.

References

[1] J. H. Su, C. L. Chou, C. Y. Lin, V. S. Tseng, Effective semantic annotation by image-to-concept distribution model, IEEE Transactions on Multimedia 13 (3) (2011) 530-538.

[2] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1349-1380. doi:10.1109/34.895972.

[3] D. Zhang, M. M. Islam, G. Lu, A review on automatic image annotation techniques, Pattern Recognition 45 (1) (2012) 346-362. doi:https://doi.org/10.1016/j.patcog.2011.05.013.

[4] R. Datta, D. Joshi, J. Li, J. Z. Wang, Image retrieval: Ideas, influences, and trends of the new age, ACM Comput. Surv. 40 (2) (2008) 5:1-5:60. doi:10.1145/1348246.1348248.

[5] M. Ivasic-Kos, M. Pobar, S. Ribaric, Two-tier image annotation model based on a multi-label classifier and fuzzy-knowledge representation scheme, Pattern Recognition 52 (Supplement C) (2016) 287-305.

[6] Z. Ma, Y. Yang, F. Nie, J. Uijlings, N.
Sebe, Exploiting the entire feature space with sparsity for automatic image annotation, in: Proceedings of the 19th ACM International Conference on Multimedia, MM ’11, ACM, New US York, NY, USA, 2011, pp. 283–292. doi:10.1145/2072298.2072336. [7] R. Hong, M. Wang, Y. Gao, D. Tao, X. Li, X. Wu, Image annotation AN by multiple-instance learning with discriminative feature mapping and selection, IEEE Transactions on Cybernetics 44 (5) (2014) 669–680. M [8] A. Makadia, V. Pavlovic, S. Kumar, Baselines for image annotation, International Journal of Computer Vision 90 (1) (2010) 88–105. ED [9] A. Makadia, V. Pavlovic, S. Kumar, A new baseline for image annotation, in: D. Forsyth, P. Torr, A. Zisserman (Eds.), Computer Vision – ECCV PT 2008, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008, pp. 316–329. [10] S. Xia, P. Chen, J. Zhang, X. Li, B. Wang, Utilization of rotation-invariant CE uniform {LBP} histogram distribution and statistics of connected regions in automatic image annotation based on multi-label learning, Neurocom- AC puting 228 (2017) 11 – 18, advanced Intelligent Computing: Theory and Applications. doi:https://doi.org/10.1016/j.neucom.2016.09.087. [11] Y. Verma, C. V. Jawahar, Image annotation by propagating labels from semantic neighbourhoods, International Journal of Computer Vision 121 (1) (2017) 126–148. doi:10.1007/s11263-016-0927-0. [12] Y. Wang, T. Mei, S. Gong, X.-S. Hua, Combining global, regional and contextual features for automatic image annotation, Pattern Recognition 60 ACCEPTED MANUSCRIPT 42 (2) (2009) 259 – 266, learning Semantics from Multimedia Content. doi:https://doi.org/10.1016/j.patcog.2008.05.010. [13] K. Li, C. Zou, S. Bu, Y. Liang, J. Zhang, M. Gong, Multi-modal feature T fusion for geographic image annotation, Pattern Recognition 73 (Supple- IP ment C) (2018) 1 – 14. doi:https://doi.org/10.1016/j.patcog.2017. CR 06.036. [14] J. Fan, Y. Gao, H. Luo, Integrating concept ontology and multitask learning to achieve more effective classifier training for multilevel image anno- US tation, IEEE Transactions on Image Processing 17 (3) (2008) 407–426. [15] L. Pellegrin, H. J. Escalante, M. Montes-y Gómez, F. A. González, Local AN and global approaches for unsupervised image annotation, Multimedia Tools and Applications 76 (15) (2017) 16389–16414. M [16] L. Cao, J. Luo, H. Kautz, T. S. Huang, Image annotation within the context of personal photo collections using hierarchical event and scene ED models, IEEE Transactions on Multimedia 11 (2) (2009) 208–219. [17] X. Rui, M. Li, Z. Li, W.-Y. Ma, N. Yu, Bipartite graph reinforcement PT model for web image annotation, in: Proceedings of the 15th ACM International Conference on Multimedia, MM ’07, ACM, New York, NY, USA, CE 2007, pp. 585–594. doi:10.1145/1291233.1291378. [18] A. Ulges, M. Worring, T. Breuel, Learning visual contexts for image anno- AC tation from flickr groups, IEEE Transactions on Multimedia 13 (2) (2011) 330–341. [19] R. Jin, J. Y. Chai, L. Si, Effective automatic image annotation via a coherent language model and active learning, in: Proceedings of the 12th Annual ACM International Conference on Multimedia, MULTIMEDIA ’04, ACM, New York, NY, USA, 2004, pp. 892–899. 10.1145/1027527.1027732. 61 doi: ACCEPTED MANUSCRIPT [20] J. Tang, Z. J. Zha, D. Tao, T. S. Chua, Semantic-gap-oriented active learning for multilabel image annotation, IEEE Transactions on Image Processing 21 (4) (2012) 2354–2360. T [21] B. Wu, S. Lyu, B.-G. 
Hu, Q.Ji, Multi-label learning with missing labels for IP image annotation and facial action unit recognition, Pattern Recognition 48 (7) (2015) 2279 – 2289. CR [22] Z. Shi, Y. Yang, T. M. Hospedales, T. Xiang, Weakly-supervised image annotation and segmentation with objects and attributes, IEEE Transac- US tions on Pattern Analysis and Machine Intelligence 39 (12) (2017) 2525– 2538. AN [23] M. Hu, Y. Yang, F. Shen, L. Zhang, H. T. Shen, X. Li, Robust web image annotation via exploring multi-facet and structural knowledge, IEEE M Transactions on Image Processing 26 (10) (2017) 4871–4884. [24] L. Wu, R. Jin, A. K. Jain, Tag completion for image retrieval, IEEE 716–727. ED Transactions on Pattern Analysis and Machine Intelligence 35 (3) (2013) PT [25] J. Wang, G. Li, A multi-modal hashing learning framework for automatic image annotation, in: 2017 IEEE Second International Conference on CE Data Science in Cyberspace (DSC), 2017, pp. 14–21. [26] T. Uricchio, L. Ballan, L. Seidenari, A. D. Bimbo, Automatic image annotation via label transfer in the semantic space, Pattern Recognition AC 71 (Supplement C) (2017) 144 – 157. doi:https://doi.org/10.1016/ j.patcog.2017.05.019. [27] X. Li, B. Shen, B. D. Liu, Y. J. Zhang, Ranking-preserving low-rank factorization for image annotation with missing labels, IEEE Transactions on Multimedia PP (99) (2017) 1–1. [28] F. Monay, D. Gatica-Perez, Plsa-based image auto-annotation: Constraining the latent space, in: Proceedings of the 12th Annual ACM Interna62 ACCEPTED MANUSCRIPT tional Conference on Multimedia, MULTIMEDIA ’04, ACM, New York, NY, USA, 2004, pp. 348–351. doi:10.1145/1027527.1027608. [29] A. Hanbury, A survey of methods for image annotation, Journal of Visual T Languages & Computing 19 (5) (2008) 617 – 627. doi:https://doi.org/ IP 10.1016/j.jvlc.2008.01.002. [30] Y. Liu, D. Zhang, G. Lu, W.-Y. Ma, A survey of content-based image CR retrieval with high-level semantics, Pattern Recognition 40 (1) (2007) 262 – 282. doi:https://doi.org/10.1016/j.patcog.2006.04.045. US [31] Q. Cheng, Q. Zhang, P. Fu, C. Tu, S. Li, A survey and analysis on automatic image annotation, Pattern Recognition 79 (2018) 242 – 259. AN doi:https://doi.org/10.1016/j.patcog.2018.02.017. [32] X. Li, T. Uricchio, L. Ballan, M. Bertini, C. G. M. Snoek, A. D. Bimbo, M Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval, ACM Comput. Surv. 49 (1) (2016) ED 14:1–14:39. doi:10.1145/2906152. [33] C. Wang, L. Zhang, H.-J. Zhang, Learning to reduce the semantic gap in PT web image retrieval and annotation, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, ACM, New York, NY, USA, 2008, pp. CE 355–362. doi:10.1145/1390334.1390396. [34] X.-J. Wang, L. Zhang, F. Jing, W.-Y. Ma, Annosearch: Image auto- AC annotation by search, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, 2006, pp. 1483–1490. doi:10.1109/CVPR.2006.58. [35] C. Cusano, G. Ciocca, R. Schettini, Image annotation using svm, Proc.SPIE 5304 (2003) 5304 – 5304 – 9. doi:10.1117/12.526746. [36] C. Yang, M. Dong, J. Hua, Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning, in: 63 ACCEPTED MANUSCRIPT 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, 2006, pp. 2057–2063. [37] C. Wang, S. Yan, L. Zhang, H. J. 
Zhang, Multi-label sparse coding for IP Vision and Pattern Recognition, 2009, pp. 1643–1650. T automatic image annotation, in: 2009 IEEE Conference on Computer [38] X. Qi, Y. Han, Incorporating multiple svms for automatic image an- CR notation, Pattern Recognition 40 (2) (2007) 728 – 741. doi:https: //doi.org/10.1016/j.patcog.2006.04.042. US [39] J. Johnson, L. Ballan, L. Fei-Fei, Love thy neighbors: Image annotation by exploiting image metadata, in: 2015 IEEE International Conference on AN Computer Vision (ICCV), 2015, pp. 4624–4632. [40] M. B. Mayhew, B. Chen, K. S. Ni, Assessing semantic information in M convolutional neural network representations of images via image annotation, in: 2016 IEEE International Conference on Image Processing (ICIP), ED 2016, pp. 2266–2270. [41] A. Fakhari, A. M. E. Moghadam, Combination of classification and re- PT gression in decision tree for multi-labeling image annotation and retrieval, Applied Soft Computing 13 (2) (2013) 1292 – 1302. doi:https: CE //doi.org/10.1016/j.asoc.2012.10.019. [42] G. Carneiro, N. Vasconcelos, Formulating semantic image annotation as a supervised learning problem, in: 2005 IEEE Computer Society Conference AC on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, 2005, pp. 163–168 vol. 2. [43] P. Niyogi, Manifold regularization and semi-supervised learning: Some theoretical analyses, J. Mach. Learn. Res. 14 (1) (2013) 1229–1250. [44] L. Pellegrin, H. J. Escalante, M. Montes-y Gómez, Evaluating termexpansion for unsupervised image annotation, in: A. Gelbukh, F. C. Espinoza, S. N. Galicia-Haro (Eds.), Human-Inspired Computing and Its 64 ACCEPTED MANUSCRIPT Applications, Springer International Publishing, Cham, 2014, pp. 151– 162. [45] L. Pellegrin, J. A. Vanegas, J. E. A. Ovalle, V. Beltrán, H. J. Escalante, T M. M. y Gómez, F. A. González, Inaoe-unal at imageclef 2015: Scalable IP concept image annotation, in: CLEF, 2015. [46] T. Uricchio, M. Bertini, L. Ballan, A. D. Bimbo, Micc-unifi at imageclef shop, Valencia, Spain, 2013, (Benchmark). CR 2013 scalable concept image annotation, in: Proc. of ImageCLEF Work- US [47] T. Tommasi, F. Orabona, B. Caputo, Discriminative cue integration for medical image annotation, Pattern Recognition Letters 29 (15) (2008) doi:https://doi.org/10.1016/j. AN 1996 – 2002, image CLEF 2007. patrec.2008.03.009. M [48] A. Kumar, S. Dyer, J. Kim, C. Li, P. H. Leong, M. Fulham, D. Feng, Adapting content-based image retrieval techniques for the semantic an- ED notation of medical images, Computerized Medical Imaging and Graphics 49 (Supplement C) (2016) 37 – 45. doi:https://doi.org/10.1016/j. PT compmedimag.2016.01.001. [49] G. Zhang, C.-H. R. Hsu, H. Lai, X. Zheng, Deep learning based feature representation for automated skin histopathological image annotation, CE Multimedia Tools and Applicationsdoi:10.1007/s11042-017-4788-5. [50] A. Mueen, R. Zainuddin, M. S. Baba, Automatic multilevel medical image AC annotation and retrieval, Journal of Digital Imaging 21 (2007) 290–295. [51] D. Bratasanu, I. Nedelcu, M. Datcu, Bridging the semantic gap for satellite image annotation and automatic mapping applications, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 4 (1) (2011) 193–204. [52] J. Fan, Y. Gao, H. Luo, G. 
Xu, Automatic image annotation by using concept-sensitive salient objects for image content representation, in: Pro65 ACCEPTED MANUSCRIPT ceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’04, ACM, New York, NY, USA, 2004, pp. 361–368. doi:10.1145/1008992.1009055. T [53] X. Y. Jing, F. Wu, Z. Li, R. Hu, D. Zhang, Multi-label dictionary learn- IP ing for image annotation, IEEE Transactions on Image Processing 25 (6) CR (2016) 2712–2725. [54] S. Lindstaedt, R. Mörzinger, R. Sorschag, V. Pammer, G. Thallinger, Automatic image annotation using visual content and folksonomies, Mul- US timedia Tools and Applications 42 (1) (2009) 97–113. doi:10.1007/ s11042-008-0247-7. AN [55] P. Brodatz, Textures: A Photographic Album for Artists and Designers, Peter Smith Publisher, Incorporated, 1981. M [56] G. Leboucher, G. Lowitz, What a histogram can really tell the classifier, Pattern Recognition 10 (5) (1978) 351 – 357. doi:https://doi.org/10. ED 1016/0031-3203(78)90006-7. [57] R. M. Haralick, K. Shanmugam, I. Dinstein, Textural features for im- PT age classification, IEEE Transactions on Systems, Man, and Cybernetics SMC-3 (6) (1973) 610–621. doi:10.1109/TSMC.1973.4309314. CE [58] V. Kovalev, M. Petrou, Multidimensional co-occurrence matrices for object recognition and matching, Graphical Models and Image Processing AC 58 (3) (1996) 187 – 197. doi:https://doi.org/10.1006/gmip.1996. 0016. [59] K. Valkealahti, E. Oja, Reduced multidimensional co-occurrence histograms in texture classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1) (1998) 90–94. doi:10.1109/34.655653. [60] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE 66 ACCEPTED MANUSCRIPT Transactions on Pattern Analysis and Machine Intelligence 24 (7) (2002) 971–987. doi:10.1109/TPAMI.2002.1017623. [61] X. Tan, B. Triggs, Enhanced local texture feature sets for face recognition T under difficult lighting conditions, IEEE Transactions on Image Processing IP 19 (6) (2010) 1635–1650. CR [62] T. Jabid, M. H. Kabir, O. Chae, Local directional pattern (ldp) for face recognition, in: 2010 Digest of Technical Papers International Conference on Consumer Electronics (ICCE), 2010, pp. 329–330. doi:10.1109/ICCE. US 2010.5418801. [63] M. H. Kabir, T. Jabid, O. Chae, A local directional pattern variance AN (ldpv) based face descriptor for human facial expression recognition, in: 2010 7th IEEE International Conference on Advanced Video and Signal M Based Surveillance, 2010, pp. 526–532. doi:10.1109/AVSS.2010.9. [64] I. Daubechies, Ten Lectures on Wavelets, Society for Industrial and Ap- ED plied Mathematics, Philadelphia, PA, USA, 1992. [65] G. E. Lowitz, Can a local histogram really map texture information?, PT Pattern Recognition 16 (2) (1983) 141 – 147. doi:https://doi.org/10. 1016/0031-3203(83)90017-1. CE [66] B. Julesz, Experiments in the visual perception of texture, Scientific American 232 (4) (1975) 34–43. AC [67] M. J. Swain, D. H. Ballard, Color indexing, International Journal of Computer Vision 7 (1) (1991) 11–32. doi:10.1007/BF00130487. [68] A. Bahrololoum, H. Nezamabadi-pour, A multi-expert based framework for automatic image annotation, Pattern Recognition 61 (Supplement C) (2017) 169 – 184. doi:https://doi.org/10.1016/j.patcog.2016.07. 034. 67 ACCEPTED MANUSCRIPT [69] S. R. Kodituwakku, S. 
Selvarajah, Comparison of color features for image retrieval, Indian Journal of Computer Science and Engineering 1 (3) (2010) 207–2011. T [70] A. K. Jain, A. Vailaya, Image retrieval using color and shape, Pattern IP Recognition 29 (8) (1996) 1233 – 1244. doi:https://doi.org/10.1016/ CR 0031-3203(95)00160-3. [71] T. Deselaers, D. Keysers, H. Ney, Features for image retrieval: an doi:10.1007/s10791-007-9039-3. US experimental comparison, Information Retrieval 11 (2) (2008) 77–107. [72] S. Zheng, M. M. Cheng, J. Warrell, P. Sturgess, V. Vineet, C. Rother, AN P. H. S. Torr, Dense semantic image segmentation with objects and attributes, in: 2014 IEEE Conference on Computer Vision and Pattern M Recognition, 2014, pp. 3214–3221. doi:10.1109/CVPR.2014.411. [73] G. Singh, J. Kosecka, Nonparametric scene parsing with adaptive fea- ED ture relevance and semantic context, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3151–3157. doi: PT 10.1109/CVPR.2013.405. [74] A. Vezhnevets, V. Ferrari, J. M. Buhmann, Weakly supervised structured CE output learning for semantic segmentation, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 845–852. doi: AC 10.1109/CVPR.2012.6247757. [75] M. Rubinstein, C. Liu, W. T. Freeman, Annotation propagation in large image databases via dense image correspondence, in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision – ECCV 2012, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 85–99. [76] H. Zhang, J. E. Fritts, S. A. Goldman, Image segmentation evaluation: A survey of unsupervised methods, Computer Vision and Image Under68 ACCEPTED MANUSCRIPT standing 110 (2) (2008) 260 – 280. doi:https://doi.org/10.1016/j. cviu.2007.08.003. [77] E. Shelhamer, J. Long, T. Darrell, Fully convolutional networks for seman- T tic segmentation, IEEE Transactions on Pattern Analysis and Machine IP Intelligence 39 (4) (2017) 640–651. doi:10.1109/TPAMI.2016.2572683. CR [78] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. L. Yuille, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE Transactions on Pat- US tern Analysis and Machine Intelligence PP (99) (2017) 1–1. doi:10.1109/ TPAMI.2017.2699184. AN [79] T. Li, B. Cheng, B. Ni, G. Liu, S. Yan, Multitask low-rank affinity graph for image segmentation and image annotation, ACM Trans. Intell. Syst. M Technol. 7 (4) (2016) 65:1–65:18. doi:10.1145/2856058. [80] J. Fan, Y. Gao, H. Luo, Multi-level annotation of natural scenes us- ED ing dominant image components and semantic concepts, in: Proceedings of the 12th Annual ACM International Conference on Multime- PT dia, MULTIMEDIA ’04, ACM, New York, NY, USA, 2004, pp. 540–547. doi:10.1145/1027527.1027660. CE [81] M. Jiu, H. Sahbi, Nonlinear deep kernel learning for image annotation, IEEE Transactions on Image Processing 26 (4) (2017) 1820–1832. AC [82] L. Tao, H. H. Ip, A. Zhang, X. Shu, Exploring canonical correlation analysis with subspace and structured sparsity for web image annotation, Image and Vision Computing 54 (Supplement C) (2016) 22 – 30. [83] N. Sebe, Q. Tian, E. Loupias, M. Lew, T. Huang, Evaluation of salient point techniques, Image and Vision Computing 21 (13) (2003) 1087 – 1095, british Machine Vision Computing 2001. doi:https://doi.org/ 10.1016/j.imavis.2003.08.012. 69 ACCEPTED MANUSCRIPT [84] K. S. Pedersen, M. Loog, P. Dorst, Salient point and scale detection by minimum likelihood, in: N. D. Lawrence, A. 
Schwaighofer, J. Q. Candela (Eds.), Gaussian Processes in Practice, Vol. 1 of Proceedings of Machine T Learning Research, PMLR, Bletchley Park, UK, 2007, pp. 59–72. Int. J. Comput. Vision 60 (2) (2004) 91–110. doi:10.1023/B:VISI. CR 0000029664.99615.94. IP [85] D. G. Lowe, Distinctive image features from scale-invariant keypoints, [86] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, US in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1, 2005, pp. 886–893 vol. 1. doi: AN 10.1109/CVPR.2005.177. [87] H. Bay, T. Tuytelaars, L. V. Gool, Surf: Speeded up robust features, in: M A. Leonardis, H. Bischof, A. Pinz (Eds.), Computer Vision – ECCV 2006, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 404–417. ED [88] E. Rosten, T. Drummond, Machine learning for high-speed corner detection, in: A. Leonardis, H. Bischof, A. Pinz (Eds.), Computer Vision PT – ECCV 2006, Springer Berlin Heidelberg, Berlin, Heidelberg, 2006, pp. 430–443. CE [89] M. Calonder, V. Lepetit, C. Strecha, P. Fua, Brief: Binary robust independent elementary features, in: K. Daniilidis, P. Maragos, N. Paragios (Eds.), Computer Vision – ECCV 2010, Springer Berlin Heidelberg, AC Berlin, Heidelberg, 2010, pp. 778–792. [90] E. Rublee, V. Rabaud, K. Konolige, G. Bradski, Orb: An efficient alternative to sift or surf, in: 2011 International Conference on Computer Vision, 2011, pp. 2564–2571. doi:10.1109/ICCV.2011.6126544. [91] H. J. Escalante, C. A. Hernández, J. A. Gonzalez, A. López-López, M. Montes, E. F. Morales, L. E. Sucar, L. Villasenor, M. Grubinger, The segmented and annotated iapr tc-12 benchmark, Computer Vision 70 ACCEPTED MANUSCRIPT and Image Understanding 114 (4) (2010) 419 – 428, special issue on Image and Video Retrieval Evaluation. doi:https://doi.org/10.1016/j. cviu.2009.03.008. T [92] X. Ding, B. Li, W. Xiong, W. Guo, W. Hu, B. Wang, Multi-instance multi- IP label learning combining hierarchical context and its application to image CR annotation, IEEE Transactions on Multimedia 18 (8) (2016) 1616–1627. [93] D. M. Blei, M. I. Jordan, Modeling annotated data, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and US Development in Informaion Retrieval, SIGIR ’03, ACM, New York, NY, USA, 2003, pp. 127–134. doi:10.1145/860435.860460. AN [94] D. Putthividhy, H. T. Attias, S. S. Nagarajan, Topic regression multimodal latent dirichlet allocation for image annotation, in: 2010 IEEE M Computer Society Conference on Computer Vision and Pattern Recognition, 2010, pp. 3408–3415. ED [95] L. Song, M. Luo, J. Liu, L. Zhang, B. Qian, M. H. Li, Q. Zheng, Sparse multi-modal topical coding for image annotation, Neurocomputing 214 005. PT (2016) 162 – 174. doi:https://doi.org/10.1016/j.neucom.2016.06. CE [96] R. Zhang, Z. Zhang, M. Li, W.-Y. Ma, H.-J. Zhang, A probabilistic semantic model for image annotation and multimodal image retrieval, in: Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, AC Vol. 1, 2005, pp. 846–851 Vol. 1. [97] S. L. Feng, R. Manmatha, V. Lavrenko, Multiple bernoulli relevance models for image and video annotation, in: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004., Vol. 2, 2004, pp. II–1002–II–1009 Vol.2. doi:10.1109/CVPR.2004.1315274. 71 ACCEPTED MANUSCRIPT [98] J. Jeon, V. Lavrenko, R. 
Manmatha, Automatic image annotation and retrieval using cross-media relevance models, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR ’03, ACM, New York, NY, USA, IP T 2003, pp. 119–126. doi:10.1145/860435.860459. [99] J. Liu, B. Wang, M. Li, Z. Li, W. Ma, H. Lu, S. Ma, Dual cross-media CR relevance model for image annotation, in: Proceedings of the 15th ACM International Conference on Multimedia, MM ’07, ACM, New York, NY, US USA, 2007, pp. 605–614. doi:10.1145/1291233.1291380. [100] D. Tao, L. Jin, W. Liu, X. Li, Hessian regularized support vector ma- AN chines for mobile image annotation on the cloud, IEEE Transactions on Multimedia 15 (4) (2013) 833–844. M [101] X. Xu, A. Shimada, H. Nagahara, R. i. Taniguchi, L. He, Image annotation with incomplete labelling by modelling image specific structured loss, ED IEEJ Transactions on Electrical and Electronic Engineering 11 (1) (2016) 73–82. PT [102] J. J. McAuley, A. Ramisa, T. S. Caetano, Optimization of robust loss functions for weakly-labeled image taxonomies, International Journal of Com- CE puter Vision 104 (3) (2013) 343–361. doi:10.1007/s11263-012-0561-4. [103] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, J. Mach. AC Learn. Res. 7 (2006) 2399–2434. [104] W. Liu, H. Liu, D. Tao, Y. Wang, K. Lu, Manifold regularized kernel logistic regression for web image annotation, Neurocomputing 172 (Supplement C) (2016) 3 – 8. doi:https://doi.org/10.1016/j.neucom. 2014.06.096. [105] M.-L. Zhang, Z.-H. Zhou, Ml-knn: A lazy learning approach to multi-label 72 ACCEPTED MANUSCRIPT learning, Pattern Recognition 40 (7) (2007) 2038 – 2048. doi:https: //doi.org/10.1016/j.patcog.2006.12.019. [106] M. Guillaumin, T. Mensink, J. Verbeek, C. Schmid, Tagprop: Discrimina- T tive metric learning in nearest neighbor models for image auto-annotation, CR pp. 309–316. doi:10.1109/ICCV.2009.5459266. IP in: 2009 IEEE 12th International Conference on Computer Vision, 2009, [107] Y. Verma, C. V. Jawahar, Image annotation using metric learning in semantic neighbourhoods, in: A. Fitzgibbon, S. Lazebnik, P. Perona, US Y. Sato, C. Schmid (Eds.), Computer Vision – ECCV 2012, Springer Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 836–849. AN [108] Z. Feng, R. Jin, A. Jain, Large-scale image annotation by efficient and robust kernel metric learning, in: 2013 IEEE International Conference on M Computer Vision, 2013, pp. 1609–1616. [109] D. R. Hardoon, J. Shawe-taylor, Kcca for different level precision in ED content-based image retrieval, in: In Submitted to Third International Workshop on Content-Based Multimedia Indexing, IRISA, 2003. PT [110] L. Feng, B. Bhanu, Semantic concept co-occurrence patterns for image annotation and retrieval, IEEE Transactions on Pattern Analysis and Ma- CE chine Intelligence 38 (4) (2016) 785–799. [111] X. Zhu, W. Nejdl, M. Georgescu, An adaptive teleportation random walk AC model for learning social tag relevance, in: Proceedings of the 37th International ACM SIGIR Conference on Research &#38; Development in Information Retrieval, SIGIR ’14, ACM, New York, NY, USA, 2014, pp. 223–232. doi:10.1145/2600428.2609556. [112] C. Wang, F. Jing, L. Zhang, H.-J. Zhang, Image annotation refinement using random walk with restarts, in: Proceedings of the 14th ACM International Conference on Multimedia, MM ’06, ACM, New York, NY, USA, 2006, pp. 647–650. doi:10.1145/1180639.1180774. 
73 ACCEPTED MANUSCRIPT [113] C. Lei, D. Liu, W. Li, Social diffusion analysis with common-interest model for image annotation, IEEE Transactions on Multimedia 18 (4) (2016) 687–701. T [114] J. Liu, M. Li, Q. Liu, H. Lu, S. Ma, Image annotation via graph learning, IP Pattern Recognition 42 (2) (2009) 218 – 228, learning Semantics from Multimedia Content. doi:https://doi.org/10.1016/j.patcog.2008. CR 04.012. [115] J. Tang, R. Hong, S. Yan, T.-S. Chua, G.-J. Qi, R. Jain, Image annota- US tion by knn-sparse graph-based label propagation over noisily tagged web images, ACM Trans. Intell. Syst. Technol. 2 (2) (2011) 14:1–14:15. AN [116] G. Chen, J. Zhang, F. Wang, C. Zhang, Y. Gao, Efficient multi-label classification with hypergraph regularization, in: 2009 IEEE Conference M on Computer Vision and Pattern Recognition, 2009, pp. 1658–1665. doi: 10.1109/CVPR.2009.5206813. ED [117] Y. Han, F. Wu, Q. Tian, Y. Zhuang, Image annotation by input-output structural grouping sparsity, IEEE Transactions on Image Processing PT 21 (6) (2012) 3066–3079. [118] J. Chang, J. Boyd-Graber, S. Gerrish, C. Wang, D. M. Blei, Reading CE tea leaves: How humans interpret topic models, in: Proceedings of the 22Nd International Conference on Neural Information Processing Systems, AC NIPS’09, Curran Associates Inc., USA, 2009, pp. 288–296. [119] A. P. Dempster, M. N. Laird, D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 39 (1977) 1–22. [120] D. Kong, C. Ding, H. Huang, H. Zhao, Multi-label relieff and f-statistic feature selections for image annotation, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2352–2359. 74 ACCEPTED MANUSCRIPT [121] X. Jia, F. Sun, H. Li, Y. Cao, X. Zhang, Image multi-label annotation based on supervised nonnegative matrix factorization with new matching measurement, Neurocomputing 219 (Supplement C) (2017) 518 – 525. T doi:https://doi.org/10.1016/j.neucom.2016.09.052. IP [122] K. S. Goh, E. Y. Chang, B. Li, Using one-class and two-class svms for Engineering 17 (10) (2005) 1333–1346. CR multiclass image annotation, IEEE Transactions on Knowledge and Data [123] D. Grangier, S. Bengio, A discriminative kernel-based approach to rank US images from text queries, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (8) (2008) 1371–1384. doi:10.1109/TPAMI.2007. AN 70791. [124] Y. Lin, F. Lv, S. Zhu, M. Yang, T. Cour, K. Yu, L. Cao, T. Huang, Large- M scale image classification: Fast feature extraction and svm training, in: CVPR 2011, 2011, pp. 1689–1696. doi:10.1109/CVPR.2011.5995477. ED [125] K. Kuroda, M. Hagiwara, An image retrieval system by impression words and specific object namesiris, Neurocomputing 43 (1) (2002) 259 – 276, PT selected engineering applications of neural networks. doi:https://doi. org/10.1016/S0925-2312(01)00344-7. CE [126] R. C. F. Wong, C. H. C. Leung, Automatic semantic annotation of realworld web images, IEEE Transactions on Pattern Analysis and Machine AC Intelligence 30 (11) (2008) 1933–1944. doi:10.1109/TPAMI.2008.125. [127] G. Carneiro, A. B. Chan, P. J. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (3) (2007) 394–410. [128] S. Chengjian, S. Zhu, Z. Shi, Image annotation via deep neural network, in: 2015 14th IAPR International Conference on Machine Vision Applications (MVA), 2015, pp. 518–521. 
doi:10.1109/MVA.2015.7153244. 75 ACCEPTED MANUSCRIPT [129] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: F. Pereira, C. J. C. Burges, L. Bottou, K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105. T URL http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-n IP pdf CR [130] W. Liu, D. Tao, Multiview hessian regularization for image annotation, IEEE Transactions on Image Processing 22 (7) (2013) 2676–2687. US [131] O. Boiman, E. Shechtman, M. Irani, In defense of nearest-neighbor based image classification, in: 2008 IEEE Conference on Computer Vision and AN Pattern Recognition, 2008, pp. 1–8. doi:10.1109/CVPR.2008.4587598. [132] K. Q. Weinberger, L. K. Saul, Distance metric learning for large margin M nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244. [133] J. Verbeek, M. Guillaumin, T. Mensink, C. Schmid, Image annotation ED with tagprop on the mirflickr set, in: Proceedings of the International Conference on Multimedia Information Retrieval, MIR ’10, ACM, New PT York, NY, USA, 2010, pp. 537–546. doi:10.1145/1743384.1743476. [134] X. Xu, A. Shimada, R.-i. Taniguchi, Image annotation by learning label- CE specific distance metrics, in: A. Petrosino (Ed.), Image Analysis and Processing – ICIAP 2013, Springer Berlin Heidelberg, Berlin, Heidelberg, AC 2013, pp. 101–110. [135] M. M. Kalayeh, H. Idrees, M. Shah, Nmf-knn: Image annotation using weighted multi-view non-negative matrix factorization, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 184– 191. [136] Z. Lin, G. Ding, M. Hu, Image auto-annotation via tag-dependent random search over range-constrained visual neighbours, Multimedia Tools Appl. 74 (11) (2015) 4091–4116. doi:10.1007/s11042-013-1811-3. 76 ACCEPTED MANUSCRIPT [137] L. Ballan, T. Uricchio, L. Seidenari, A. D. Bimbo, A cross-media model for automatic image annotation, in: Proceedings of International Conference on Multimedia Retrieval, ICMR ’14, ACM, New York, NY, USA, 2014, T pp. 73:73–73:80. doi:10.1145/2578726.2578728. IP [138] T. Mensink, J. Verbeek, F. Perronnin, G. Csurka, Metric learning for large scale image classification: Generalizing to new classes at near-zero cost, CR in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision – ECCV 2012, Springer Berlin Heidelberg, Berlin, Hei- US delberg, 2012, pp. 488–501. [139] H. Wang, H. Huang, C. Ding, Image annotation using bi-relational graph AN of images and semantic labels, in: CVPR 2011, 2011, pp. 793–800. [140] X. Chen, Y. Mu, S. Yan, T.-S. Chua, Efficient large-scale image annotation M by probabilistic collaborative multi-label propagation, in: Proceedings of the 18th ACM International Conference on Multimedia, MM ’10, ACM, ED New York, NY, USA, 2010, pp. 35–44. doi:10.1145/1873951.1873959. [141] F. Su, L. Xue, Graph learning on k nearest neighbours for automatic image PT annotation, in: Proceedings of the 5th ACM on International Conference on Multimedia Retrieval, ICMR ’15, ACM, New York, NY, USA, 2015, CE pp. 403–410. [142] P. Ji, X. Gao, X. Hu, Automatic image annotation by combining generative and discriminant models, Neurocomputing 236 (Supplement C) AC (2017) 48 – 55, good Practices in Multimedia Modeling. doi:https: //doi.org/10.1016/j.neucom.2016.09.108. [143] B. Xie, Y. Mu, D. Tao, K. 
Huang, m-sne: Multiview stochastic neighbor embedding, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 41 (4) (2011) 1088–1096. doi:10.1109/TSMCB.2011. 2106208. 77 ACCEPTED MANUSCRIPT [144] C. Xu, D. Tao, C. Xu, Multi-view intact space learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (12) (2015) 2531– 2544. doi:10.1109/TPAMI.2015.2417578. T [145] T. Xia, D. Tao, T. Mei, Y. Zhang, Multiview spectral embedding, IEEE IP Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) CR 40 (6) (2010) 1438–1446. doi:10.1109/TSMCB.2009.2039566. [146] Y. Luo, D. Tao, C. Xu, C. Xu, H. Liu, Y. Wen, Multiview vector-valued manifold regularization for multilabel image classification, IEEE Transac- doi:10.1109/TNNLS.2013.2238682. US tions on Neural Networks and Learning Systems 24 (5) (2013) 709–722. AN [147] J. Yu, Y. Rui, Y. Y. Tang, D. Tao, High-order distance-based multiview stochastic learning in image classification, IEEE Transactions on Cyber- M netics 44 (12) (2014) 2431–2442. doi:10.1109/TCYB.2014.2307862. [148] S. C. H. Hoi, W. Liu, M. R. Lyu, W.-Y. Ma, Learning distance met- ED rics with contextual constraints for image retrieval, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition PT (CVPR’06), Vol. 2, 2006, pp. 2072–2078. doi:10.1109/CVPR.2006.167. [149] Z. Wang, S. Gao, L.-T. Chia, Learning class-to-image distance via large CE margin and l1-norm regularization, in: A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, C. Schmid (Eds.), Computer Vision – ECCV 2012, Springer AC Berlin Heidelberg, Berlin, Heidelberg, 2012, pp. 230–244. [150] P. Wu, S. C.-H. Hoi, P. Zhao, Y. He, Mining social images with distance metric learning for automated image tagging, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, ACM, New York, NY, USA, 2011, pp. 197–206. doi:10.1145/1935826.1935865. [151] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, Y.-F. Li, Multi-instance multi- 78 ACCEPTED MANUSCRIPT label learning, Artificial Intelligence 176 (1) (2012) 2291 – 2320. doi: https://doi.org/10.1016/j.artint.2011.10.002. [152] F. Briggs, X. Z. Fern, R. Raich, Rank-loss support instance machines for T miml instance annotation, in: Proceedings of the 18th ACM SIGKDD In- IP ternational Conference on Knowledge Discovery and Data Mining, KDD ’12, ACM, New York, NY, USA, 2012, pp. 534–542. CR 2339530.2339616. doi:10.1145/ [153] C.-T. Nguyen, D.-C. Zhan, Z.-H. Zhou, Multi-modal image annotation US with multi-instance multi-label lda, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI ’13, AAAI AN Press, 2013, pp. 1558–1564. [154] M. L. Zhang, A k-nearest neighbor based multi-instance multi-label learning algorithm, in: 2010 22nd IEEE International Conference on Tools with M Artificial Intelligence, Vol. 2, 2010, pp. 207–212. doi:10.1109/ICTAI. ED 2010.102. [155] J. He, H. Gu, Z. Wang, Bayesian multi-instance multi-label learning using gaussian process prior, Machine Learning 88 (1) (2012) 273–295. doi: PT 10.1007/s10994-012-5283-x. [156] S. Huang, Z. Zhou, Fast multi-instance multi-label learning, CoRR CE abs/1310.2049. [157] C. Desai, D. Ramanan, C. C. Fowlkes, Discriminative models for multi- AC class object layout, International Journal of Computer Vision 95 (1) (2011) 1–12. doi:10.1007/s11263-011-0439-x. [158] B. Hariharan, S. V. N. Vishwanathan, M. 
Varma, Efficient max-margin multi-label classification with applications to zero-shot learning, Machine Learning 88 (1) (2012) 127–155. doi:10.1007/s10994-012-5291-x. [159] Y. Guo, S. Gu, Multi-label classification using conditional dependency networks, in: Proceedings of the Twenty-Second International Joint 79 ACCEPTED MANUSCRIPT Conference on Artificial Intelligence - Volume Volume Two, IJCAI’11, AAAI Press, 2011, pp. 1300–1305. doi:10.5591/978-1-57735-516-8/ IJCAI11-220. T [160] Z.-J. Zha, T. Mei, J. Wang, Z. Wang, X.-S. Hua, Graph-based semi- IP supervised learning with multiple labels, Journal of Visual Communication and Image Representation 20 (2) (2009) 97 – 103, special issue on CR Emerging Techniques for Multimedia Content Sharing, Search and Understanding. doi:https://doi.org/10.1016/j.jvcir.2008.11.009. US [161] Z. h. Z., M. l. Zhang, Multi-instance multi-label learning with application to scene classification, in: B. Schölkopf, J. C. Platt, T. Hoffman (Eds.), AN Advances in Neural Information Processing Systems 19, MIT Press, 2007, pp. 1609–1616. M [162] Y. Liu, R. Jin, L. Yang, Semi-supervised multi-label learning by constrained non-negative matrix factorization, in: Proceedings of the 21st ED National Conference on Artificial Intelligence - Volume 1, AAAI’06, AAAI Press, 2006, pp. 421–426. PT [163] F. Kang, R. Jin, R. Sukthankar, Correlated label propagation with application to multi-label learning, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, CE 2006, pp. 1719–1726. doi:10.1109/CVPR.2006.90. [164] S. S. Bucak, R. Jin, A. K. Jain, Multi-label learning with incomplete class AC assignments, in: CVPR 2011, 2011, pp. 2801–2808. doi:10.1109/CVPR. 2011.5995734. [165] H.-F. Yu, P. Jain, P. Kar, I. S. Dhillon, Large-scale multi-label learning with missing labels, in: Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, JMLR.org, 2014, pp. I–593–I–601. 80 ACCEPTED MANUSCRIPT [166] H. Yang, J. T. Zhou, J. Cai, Improving multi-label learning with missing labels by structured semantic correlations, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision – ECCV 2016, Springer International T Publishing, Cham, 2016, pp. 835–851. IP [167] M. L. Zhang, Z. H. Zhou, A review on multi-label learning algorithms, 1819–1837. doi:10.1109/TKDE.2013.39. CR IEEE Transactions on Knowledge and Data Engineering 26 (8) (2014) [168] S. Feng, C. Lang, Graph regularized low-rank feature mapping for multi- US label learning with application to image annotation, Multidimensional Systems and Signal Processingdoi:10.1007/s11045-017-0505-9. AN [169] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using gaussian fields and harmonic functions, in: Proceedings of the Twentieth In- M ternational Conference on International Conference on Machine Learning, ICML’03, AAAI Press, 2003, pp. 912–919. ED [170] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Proceedings of the 16th Interna- PT tional Conference on Neural Information Processing Systems, NIPS’03, MIT Press, Cambridge, MA, USA, 2003, pp. 321–328. CE [171] T. G. Dietterich, R. H. Lathrop, T. Lozano-Prez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89 (1) (1997) 31 – 71. doi:https://doi.org/10.1016/S0004-3702(96) AC 00034-3. [172] O. Maron, A. L. 
Ratan, Multiple-instance learning for natural scene classification, in: Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1998, pp. 341–349. [173] Learning with multiple views (2005). 81 ACCEPTED MANUSCRIPT [174] A. Fakeri-Tabrizi, M. R. Amini, P. Gallinari, Multiview semi-supervised ranking for automatic image annotation, in: Proceedings of the 21st ACM International Conference on Multimedia, MM ’13, ACM, New York, NY, T USA, 2013, pp. 513–516. doi:10.1145/2502081.2502136. IP [175] Y. Li, X. Shi, C. Du, Y. Liu, Y. Wen, Manifold regularized multi-view feature selection for social image annotation, Neurocomputing 204 (Sup- CR plement C) (2016) 135 – 141, big Learning in Social Media Analytics. doi:https://doi.org/10.1016/j.neucom.2015.07.151. abs/1304.5634. arXiv:1304.5634. US [176] C. Xu, D. Tao, C. Xu, A survey on multi-view learning, CoRR AN [177] S. Sun, A survey of multi-view machine learning, Neural Computing and Applications 23 (7) (2013) 2031–2038. M s00521-013-1362-6. doi:10.1007/ [178] W. Liu, D. Tao, J. Cheng, Y. Tang, Multiview hessian discriminative ED sparse coding for image annotation, Computer Vision and Image Understanding 118 (Supplement C) (2014) 50 – 60. doi:https://doi.org/10. PT 1016/j.cviu.2013.03.007. [179] C. Jin, S.-W. Jin, Image distance metric learning based on neighborhood CE sets for automatic image annotation, Journal of Visual Communication and Image Representation 34 (Supplement C) (2016) 167 – 175. doi: AC https://doi.org/10.1016/j.jvcir.2015.10.017. [180] G. Carneiro, N. Vasconcelos, A database centric view of semantic image annotation and retrieval, in: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’05, ACM, New York, NY, USA, 2005, pp. 559–566. doi:10.1145/1076034.1076129. [181] C. Wang, F. Jing, L. Zhang, H. J. Zhang, Content-based image annotation 82 ACCEPTED MANUSCRIPT refinement, in: 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8. [182] J. W., Y. Y., J. Mao, Z. Huang, C. Huang, W. Xu, CNN-RNN: A uni- T fied framework for multi-label image classification, CoRR abs/1604.04573. IP arXiv:1604.04573. CR [183] J. Jin, H. Nakayama, Annotation order matters: Recurrent image annotator for arbitrary length image tagging, in: 2016 23rd International Conference on Pattern Recognition (ICPR), 2016, pp. 2452–2457. US [184] F. Liu, T. Xiang, T. M. Hospedales, W. Yang, C. Sun, Semantic regularisation for recurrent image annotation, in: 2017 IEEE Conference on AN Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4160–4168. doi:10.1109/CVPR.2017.443. M [185] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: 2015 IEEE Conference on Computer Vision ED and Pattern Recognition (CVPR), 2015, pp. 3156–3164. doi:10.1109/ CVPR.2015.7298935. PT [186] Y. Zhou, Y. Wu, Analyses on influence of training data set to neural network supervised learning performance, in: D. Jin, S. Lin (Eds.), Advances CE in Computer Science, Intelligent System and Environment, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011, pp. 19–25. AC [187] J. Z. Wang, J. Li, G. Wiederhold, Simplicity: semantics-sensitive integrated matching for picture libraries, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (9) (2001) 947–963. [188] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. 
Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, Vol. 115, 2015, pp. 211–252. doi:10.1007/s11263-015-0816-y. 83 ACCEPTED MANUSCRIPT [189] M. Grubinger, Analysis and evaluation of visual information systems performance, thesis (Ph. D.)–Victoria University (Melbourne, Vic.), 2007 (2007). T [190] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, Nus-wide: A IP real-world web image database from national university of singapore, in: Proceedings of the ACM International Conference on Image and Video CR Retrieval, CIVR ’09, ACM, New York, NY, USA, 2009, pp. 48:1–48:9. doi:10.1145/1646396.1646452. US [191] J. Tang, P. H. Lewis, A study of quality issues for image auto-annotation with the corel dataset, IEEE Transactions on Circuits and Systems for AN Video Technology 17 (3) (2007) 384–389. [192] Y. Xiang, X. Zhou, Z. Liu, T. S. Chua, C. W. Ngo, Semantic context mod- M eling with maximal margin conditional random fields for automatic image annotation, in: 2010 IEEE Computer Society Conference on Computer ED Vision and Pattern Recognition, 2010, pp. 3368–3375. [193] S. Gao, L. T. Chia, I. W. H. Tsang, Z. Ren, Concurrent single-label image PT classification and annotation via efficient multi-layer group sparse coding, IEEE Transactions on Multimedia 16 (3) (2014) 762–771. CE [194] W. Chong, D. Blei, F. F. Li, Simultaneous image classification and annotation, in: 2009 IEEE Conference on Computer Vision and Pattern AC Recognition, 2009, pp. 1903–1910. doi:10.1109/CVPR.2009.5206800. [195] F. Nie, D. Xu, I. W. H. Tsang, C. Zhang, Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction, IEEE Transactions on Image Processing 19 (7) (2010) 1921–1932. doi: 10.1109/TIP.2010.2044958. [196] H. Wang, H. Huang, C. Ding, Image annotation using multi-label correlated green’s function, in: 2009 IEEE 12th International Conference 84 ACCEPTED MANUSCRIPT on Computer Vision, 2009, pp. 2029–2034. doi:10.1109/ICCV.2009. 5459447. [197] J. Wu, Y. Yu, C. Huang, K. Yu, Deep multiple instance learning for T image classification and auto-annotation, in: 2015 IEEE Conference on IP Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3460–3469. CR [198] H. K. Shooroki, M. A. Z. Chahooki, Selection of effective training instances for scalable automatic image annotation, Multimedia Tools and Applications 76 (7) (2017) 9643–9666. doi:10.1007/s11042-016-3572-2. US [199] Z. Shi, Y. Yang, T. M. Hospedales, T. Xiang, Weakly supervised learning of objects, attributes and their associations, CoRR abs/1504.00045. AN arXiv:1504.00045. [200] J. Sánchez-Oro, S. Montalvo, A. S. Montemayor, J. J. Pantrigo, A. Duarte, M V. Fresno-Fernández, R. Martı́nez-Unanue, Urjc&uned at imageclef 2013 photo annotation task, in: CLEF, 2012. ED [201] S. Stathopoulos, T. Kalamboukis, T.: Ipl at imageclef 2014: Scalable concept image annotation, in: In: CLEF 2014 Evaluation Labs and Work- PT shop, Online Working, 2014. [202] P. Budı́ková, J. Botorek, M. Batko, P. Zezula, DISA at imageclef CE 2014 revised: Search-based image annotation with decaf features, CoRR abs/1409.4627. arXiv:1409.4627. AC [203] L. Deng, D. Yu, Deep learning: Methods and applications, Foundations and Trends in Signal Processing 7 (34) (2014) 197–387. doi:10.1561/ 2000000039. URL http://dx.doi.org/10.1561/2000000039 [204] J. Wan, D. Wang, S. C. H. Hoi, P. Wu, J. Zhu, Y. Zhang, J. 
Li, Deep learning for content-based image retrieval: A comprehensive study, in: Proceedings of the 22Nd ACM International Conference on Multimedia, 85 ACCEPTED MANUSCRIPT MM ’14, ACM, New York, NY, USA, 2014, pp. 157–166. URL http://doi.acm.org/10.1145/2647868.2654948 [205] S. Nowak, S. Rüger, How reliable are annotations via crowdsourcing: A T study about inter-annotator agreement for multi-label image annotation, IP in: Proceedings of the International Conference on Multimedia Information Retrieval, MIR ’10, ACM, New York, NY, USA, 2010, pp. 557–566. CR doi:10.1145/1743384.1743478. [206] J. Moehrmann, G. Heidemann, Semi-automatic image annotation, in: US R. Wilson, E. Hancock, A. Bors, W. Smith (Eds.), Computer Analysis of Images and Patterns, Springer Berlin Heidelberg, Berlin, Heidelberg, AN 2013, pp. 266–273. [207] M. Tuffield, S. Harris, D. P. Dupplaw, A. Chakravarthy, C. Brewster, M N. Gibbins, K. O’Hara, F. Ciravegna, D. Sleeman, Y. Wilks, N. R. Shadbolt, Image annotation with photocopain, in: First International Work- ED shop on Semantic Web Annotations for Multimedia (SWAMM 2006) at WWW2006, 2006, event Dates: May 2006. PT [208] D.-H. Im, G.-D. Park, Linked tag: image annotation using semantic relationships between image tags, Multimedia Tools and Applications 74 (7) CE (2015) 2273–2287. doi:10.1007/s11042-014-1855-z. [209] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, D. N. Metaxas, Automatic image annotation using group sparsity, in: 2010 IEEE Computer Society AC Conference on Computer Vision and Pattern Recognition, 2010, pp. 3312– 3319. [210] B. C. Ko, J. Lee, J.-Y. Nam, Automatic medical image annotation and keyword-based image retrieval using relevance feedback, Journal of Digital Imaging 25 (4) (2012) 454–465. doi:10.1007/s10278-011-9443-5. [211] S. Zhang, J. Huang, H. Li, D. N. Metaxas, Automatic image annotation 86 ACCEPTED MANUSCRIPT and retrieval using group sparsity, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42 (3) (2012) 838–849. [212] C. Rashtchian, P. Young, M. Hodosh, J. Hockenmaier, Collecting im- T age annotations using amazon’s mechanical turk, in: Proceedings of the IP NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, Association for Com- CR putational Linguistics, Stroudsburg, PA, USA, 2010, pp. 139–147. URL http://dl.acm.org/citation.cfm?id=1866696.1866717 US [213] S. Nowak, S. Rüger, How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation, AN in: Proceedings of the International Conference on Multimedia Information Retrieval, MIR ’10, ACM, New York, NY, USA, 2010, pp. 557–566. doi:10.1145/1743384.1743478. M URL http://doi.acm.org/10.1145/1743384.1743478 ED [214] P. Welinder, P. Perona, Online crowdsourcing: Rating annotators and obtaining cost-effective labels, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 2010, pp. PT 25–32. doi:10.1109/CVPRW.2010.5543189. [215] C.-F. Tsai, K. McGarry, J. Tait, Qualitative evaluation of automatic CE assignment of keywords to images, Information Processing & Management 42 (1) (2006) 136 – 154, formal Methods for Information Retrieval. AC doi:https://doi.org/10.1016/j.ipm.2004.11.001. [216] Y. Zhang, Z.-H. Zhou, Multilabel dimensionality reduction via dependence maximization, ACM Trans. Knowl. Discov. Data 4 (3) (2010) 14:1–14:21. doi:10.1145/1839490.1839495. [217] R. E. Schapire, Y. 
Singer, Boostexter: A boosting-based system for text categorization, Machine Learning 39 (2) (2000) 135–168. doi:10.1023/A: 1007649029923. 87 ACCEPTED MANUSCRIPT [218] W.-C. Lin, S.-W. Ke, C.-F. Tsai, Robustness and reliability evaluations of image annotation, The Imaging Science Journal 64 (2) (2016) 94–99. [219] X. Ke, M. Zhou, Y. Niu, W. Guo, Data equilibrium based automatic T image annotation by fusing deep model and semantic propagation, Pattern IP Recognition 71 (Supplement C) (2017) 60 – 77. doi:https://doi.org/ CR 10.1016/j.patcog.2017.05.020. [220] L. Sun, H. Ge, S. Yoshida, Y. Liang, G. Tan, Support vector description of clusters for content-based image annotation, Pattern Recognition 47 (3) US (2014) 1361 – 1374, handwriting Recognition and other PR Applications. doi:https://doi.org/10.1016/j.patcog.2013.10.015. AN [221] J. Li, J. Z. Wang, Real-time computerized annotation of pictures, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (6) (2008) M 985–1002. doi:10.1109/TPAMI.2007.70847. [222] J. Tang, P. H. Lewis, Image auto-annotation using ’easy’ and ’more chal- ED lenging’ training sets, in: 7th International Workshop on Image Analysis for Multimedia Interactive Services, 2006, pp. 121–124, event Dates: April PT 19-21. [223] P. Duygulu, K. Barnard, J. F. G. de Freitas, D. A. Forsyth, Object recog- CE nition as machine translation: Learning a lexicon for a fixed image vocabulary, in: A. Heyden, G. Sparr, M. Nielsen, P. Johansen (Eds.), Computer Vision — ECCV 2002, Springer Berlin Heidelberg, Berlin, Heidelberg, AC 2002, pp. 97–112. [224] M. Everingham, L. V. Gool, C. K. I. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc) challenge, International Journal of Computer Vision 88 (2) (2010) 303–338. [225] H. Mller, P. Clough, T. Deselaers, B. Caputo, ImageCLEF: Experimental Evaluation in Visual Information Retrieval, 1st Edition, Springer Publishing Company, Incorporated, 2010. 88 ACCEPTED MANUSCRIPT [226] M. Villegas, R. Paredes, B. Thomee, Overview of the ImageCLEF 2013 Scalable Concept Image Annotation Subtask, in: CLEF 2013 Evaluation Labs and Workshop, Online Working Notes, Valencia, Spain, 2013. T [227] J. Sánchez-Oro, S. Montalvo, A. S. Montemayor, J. J. Pantrigo, A. Duarte, IP V. Fresno, R. Martı́nez, URJC&UNED at ImageCLEF 2013 Photo An- AC CE PT ED M AN US Working Notes, Valencia, Spain, 2013. CR notation Task, in: CLEF 2013 Evaluation Labs and Workshop, Online 89