Task-Dependent Visual-Codebook Compression

Rongrong Ji, Member, IEEE, Hongxun Yao, Member, IEEE, Wei Liu, Student Member, IEEE, Xiaoshuai Sun, Student Member, IEEE, and Qi Tian, Senior Member, IEEE

Abstract—A visual codebook serves as a fundamental component in many state-of-the-art computer vision systems. Most existing codebooks are built by quantizing local feature descriptors extracted from training images. Subsequently, each image is represented as a high-dimensional bag-of-words histogram. Such a highly redundant image description lacks efficiency in both storage and retrieval, since only a few bins are nonzero and distributed sparsely. Furthermore, most existing codebooks are built based solely on the visual statistics of local descriptors, without considering the supervised labels coming from the subsequent recognition or classification tasks. In this paper, we propose a task-dependent codebook compression framework to handle the above two problems. First, we propose to learn a compression function to map an originally high-dimensional codebook into a compact codebook while maintaining its visual discriminability. This is achieved by a codeword sparse coding scheme with Lasso regression, which minimizes the descriptor distortions of training images after codebook compression. Second, we propose to adapt our codebook compression to the subsequent recognition or classification tasks. This is achieved by introducing a label constraint kernel (LCK) into our compression loss function. In particular, our LCK can model heterogeneous kinds of supervision, i.e., (partial) category labels, correlative semantic annotations, and image query logs. We validated our codebook compression in three computer vision tasks: 1) object recognition in PASCAL Visual Object Class 07; 2) near-duplicate image retrieval in UKBench; and 3) web image search in a collection of 0.5 million Flickr photographs. Our compressed codebook has shown superior performances over several state-of-the-art supervised and unsupervised codebooks.

Index Terms—Image retrieval, indexing, local feature, object classification, supervised quantization, visual codebook.

Manuscript received October 27, 2010; revised March 02, 2011 and June 03, 2011; accepted June 04, 2011. Date of publication November 22, 2011; date of current version March 21, 2012. This work was supported in part by the National Science Foundation of China (Key Program) under Grant 61133003 and in part by the Natural Science Foundation of China under Grant 61071180 and Grant 60775024. The work of Q. Tian was supported in part by NSF IIS 1052851, by the Faculty Research Awards of Google and FXPAL, and by the NEC Laboratories of America. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Mark Liao. R. Ji and X. Sun are with the Department of Computer Science, Harbin Institute of Technology, Harbin 150001, China. H. Yao is with the Department of Computer Science, Harbin Institute of Technology, Harbin 150001, China (e-mail: H.Yao@hit.edu.cn). W. Liu is with the Department of Electrical Engineering, Columbia University, New York, NY 10027 USA. Q. Tian is with the Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249 USA. Digital Object Identifier 10.1109/TIP.2011.2176950

I. INTRODUCTION

Visual-codebook representations are widely used in many computer vision tasks such as visual search, object categorization, scene recognition, and video event detection. In a typical scenario, given the local feature descriptors extracted from training images, a visual codebook is built by quantizing these descriptors into discrete codeword regions in the descriptor feature space. Subsequently, each image is represented as a bag-of-words (BoW) histogram, which is robust against the photographing variances in different viewpoints, scales, and occlusions.

One important issue of this visual-codebook representation is its dimensionality. Generally speaking, most existing codebooks require more than 10 000 codewords to be tuned optimally [1]–[5]. Such a high dimensionality usually introduces obvious computational cost in both processing time (to match visual descriptors online) and storage space (to maintain the search model in memory). This is extremely crucial for the state-of-the-art mobile visual search scenario [6], [7], where mobile devices directly extract the BoW histogram and transmit it over the wireless link. On the contrary, since the number of local features extracted from each image is limited (typically hundreds), each image is represented as a sparse histogram with few nonzero codewords. Moreover, due to the dimension curse in the subsequent classifier training procedure, most object recognition systems expect the BoW histogram to be compact. Inspired by the above contradictions, our first goal is to compress the state-of-the-art high-dimensional codebooks [1], [3], [4], [8], [9] to improve their storage and retrieval efficiency while maintaining their visual discriminative capability.

Our second goal is to introduce the supervised labels from the subsequent recognition or classification tasks to guide our codebook compression. Our inspirations are twofold: First, most existing visual codebooks [1], [3]–[5], [8]–[11] are built based solely on the visual statistics of training images, without considering their semantic labels to improve their discriminability. Second, recent works in supervised codebook learning [12]–[15] have directly incorporated the supervised labels into the initial codebook building procedure, which is computationally burdensome and not scalable for new labels, new data sets, or new tasks (even in the same data set). In addition, by binding supervision to codebook building, the existing supervised codebook learning strategies cannot be reused among each other [12]–[15]. An alternative approach is to refine an initial codebook. However, existing works [16]–[18] are not sufficiently flexible and scalable to model heterogeneous kinds of supervised labels, such as correlative semantic labels and image query logs.

In this paper, we present a task-dependent visual-codebook compression framework to achieve the above goals. We introduce a supervised sparse coding model to learn a compression function, which maps an originally high-dimensional codebook into a compact basis dictionary. In this model, we integrate both visual discriminability and task-dependent semantic discriminability (from supervised labeling) into the Lasso-based regression cost [19] for compression function learning.

Fig. 1. Proposed task-dependent visual-codebook compression framework.

Using the
learned basis dictionary, each original BoW histogram is nonlinearly transformed into a low-dimensional reconstruction parametric vector, which serves as the compressed histogram for a given image. This histogram is more compact and discriminative for the subsequent recognition classifier training, retrieval model indexing, or a similarity ranking procedure. To integrate the visual discriminability into our basis dictionary learning, we model the codewords’ importance using their term frequencies and inverted document frequencies [20], which are embedded as a weighting vector into the compression cost function. To integrate the task-dependent discriminability into our basis dictionary learning, we introduce a label constraint kernel (LCK) to model the labeling consistency loss in the compression function. In particular, we show how to specify the LCK for different tasks even in an identical database, including: 1) (partial) category labels for object recognition; 2) correlative semantic annotations for image understanding; and 3) image query logs (obtained from the ground-truth image ranking labeling) for web image ranking. Fig. 1 outlines the proposed framework. II. RELATED WORK Visual-codebook construction: A visual codebook is typically constructed based on unsupervised vector quantization such as k-means clustering [8], [21], which subdivides the local feature space into discrete codeword regions. Such a division represents an image as a BoW histogram, in which each bin counts how many local features of this image fall into the corresponding codeword region of this bin. In recent years, there have been numerous vector quantization approaches to build visual codebooks, such as spatial pyramids [22], vocabulary trees (VTs) [1], approximate k-means [4], and their variations [3], [9]. In addition to direct quantization, hashing-based approaches are also well exploited in the literature, e.g., locality sensitive hashing [23] and its kernelized version [10]. Recent works have also investigated the codeword uncertainty and ambiguity using methods such as Hamming embedding [11], soft assignments [24], [25], and kernelized codebook histograms [5]. Supervised codebook construction: Rather than using solely visual content statistics, works in [12]–[15] also proposed to integrate the semantic labels to supervise the codebook construction. For instance, Mairal et al. [12] used category-aware sparse coding to build a supervised vocabulary for object categorization. Lazebnik et al. [13] adopted minimizing mutual information loss to build a supervised codebook based on a fully labeled local feature set. Moosmann et al. [14] proposed to build supervised indexing trees using an ERC-Forest that considered semantic labels as stopping tests. Ji et al. [15] adopted a hidden Markov random field to use correlative web labeling for supervised vocabulary construction. In image and video compression, there are also similar works proposed for learning-based vector quantization [26]–[28]. Methods such as self-organizing maps [27] and regression loss minimization [28] are utilized to reconstruct the original input signals that minimized visual quantization distortions. Finally, there are also works in learning visual parts [29], [30] from the images of identical categories by clustering local patches with spatial configurations. 
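For concreteness, the following Python sketch illustrates the unsupervised BoW pipeline summarized at the start of this section: a flat k-means codebook built from pooled local descriptors and the per-image BoW histogram. The helper names, the use of scikit-learn's KMeans, and the synthetic descriptors are illustrative assumptions of this sketch; practical systems use the hierarchical or approximate quantizers cited above to reach 10 000-word scales.

```python
# Minimal sketch (not the authors' code): building an unsupervised visual
# codebook with flat k-means and mapping an image to a BoW histogram.
# Assumes each image is already described by a set of 128-D SIFT-like
# descriptors stored as NumPy arrays; flat KMeans stands in for the
# hierarchical/approximate quantizers cited above.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_sets, num_codewords=1000, seed=0):
    """Cluster all training descriptors into `num_codewords` codeword centers."""
    all_descriptors = np.vstack(descriptor_sets)          # (total_features, 128)
    codebook = KMeans(n_clusters=num_codewords, n_init=4, random_state=seed)
    codebook.fit(all_descriptors)
    return codebook

def bow_histogram(codebook, descriptors):
    """Quantize one image's descriptors and count hits per codeword."""
    words = codebook.predict(descriptors)                 # nearest codeword per feature
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                    # L1-normalized BoW histogram

# Toy usage: 20 synthetic "images", each with ~300 random descriptors.
rng = np.random.default_rng(0)
train = [rng.random((300, 128)) for _ in range(20)]
cb = build_codebook(train, num_codewords=50)
V = np.stack([bow_histogram(cb, d) for d in train])       # n x K BoW matrix
print(V.shape)   # (20, 50); for real images each row is sparse
```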
Learning-based codeword refinement: Rather than directly supervising the codebook building procedure, works in [16]–[18] and [31] proposed to merge or split the initial codewords for the subsequent classifier training. For instance, Perronnin et al. [16] integrated category labels to adapt an initial vocabulary into several class-specific vocabularies. Winn et al. [31] also learned class-specific vocabularies from an initial vocabulary by merging codeword pairs, in which the codeword distribution was modeled by a Gaussian mixture model. Although the works in [16]–[18] and [31] worked well for limited category numbers, these methods cannot be scaled up to general scenarios that contain numerous and correlative category labels. To a certain degree, topic models [e.g., probabilistic latent semantic analysis (pLSA) [32] and latent Dirichlet allocation (LDA) [33]] can also be treated as unsupervised codeword refinements, which work in a generative manner constrained by superparameters.

Sparse coding: Sparse coding theory has recently been well investigated for effective and efficient high-dimensional signal representation and compression [19], [34], [35]. Its main idea is to represent a signal as a sparse linear combination of bases from an overcomplete dictionary. Although the exact recovery of this dictionary is NP-hard, a sufficiently sparse linear representation can be efficiently approximated by convex optimization [34]. For instance, Tibshirani et al. [19] proposed the Lasso regression strategy to achieve coefficient sparsity by adding an ℓ1-norm penalty to the regression loss. Works in [12] and [36] also adopted sparse coding to directly build codebooks from the original local patch set. However, facing millions of local patches, the computational efficiency of direct sparse coding [12], [36] remains an open problem. Compared with the above works, our novelties are twofold. First, our codeword-level sparse coding improves the essential computational efficiency: We aim to learn a compression transform to compress the quantized BoW histograms, rather than directly encoding the local patch collection. This is very important for scalable applications that contain hundreds of millions of local patches, for which it is unsuitable to directly use sparse coding to learn codebooks. Second, the existing supervised sparse coding models only allow supervision at the local patch level. On the contrary, the coding model proposed in this paper enables the modeling of heterogeneous types of semantic labels based on our proposed LCK.

Fig. 2. Graphical model of task-dependent codebook compression.

Fig. 3. Visualized flowchart of our codebook compression pipeline.

Our contributions: Compared with the previous works that directly supervised the codeword quantization procedure, we emphasize compressing an initially high-dimensional codebook into an extremely compact codebook (see Figs. 2 and 3). Our first contribution is the efficiency: We only operate on the BoW histograms and subsequently avoid the burden of directly supervising the quantization of millions or billions of local features. Our second contribution is the extensibility: Our approach can be deployed on most existing codebooks [1], [3], [4], [8], [9]. We treat their BoW histograms as input, and our subsequent compression is independent of the initial codebook building strategies.
Our last consideration is the flexibility: We propose a "task-dependent" discriminability embedding to model different kinds of semantic labels, either within an identical database or among different databases. This is achieved by introducing a flexible label constraint kernel in Section IV-B to model heterogeneous kinds of semantic labels.

III. PROBLEM FORMULATION

We denote scalars as italic letters, e.g., n; vectors as bold italic letters, e.g., x; the instance space for n instances as X = {x_1, ..., x_n}; the inner product between two vectors x and y as ⟨x, y⟩; the ℓ_p norm of a vector x as \|x\|_p; and the transposition of a matrix M as M^T and the inverse of M as M^{-1}.

The input of our codebook compression algorithm is an initial codebook W containing K codewords, which represents the n training images as a set of BoW histograms V = {V_1, ..., V_n}, in which each V_i denotes a K-bin BoW histogram. This initial codebook can be derived from most existing approaches such as [1], [3], [4], [8], and [9]. Then, suppose a subsequent task (such as image retrieval or object recognition) provides its labels as L = {L_1, ..., L_n}. To incorporate partial labels in our subsequent supervised codebook compression, we further assign L_i = ∅ to indicate that the ith image is unlabeled.

We aim to learn a basis dictionary D ∈ R^{K×M} to compress the initial codebook W, which serves as a compression function to transfer each original K-bin BoW histogram V_i into a more compact M-bin histogram U_i. Therefore, for a new image, we first use W to generate its original BoW histogram V_new and then use D to map V_new into a compact histogram U_new. In general, we have M ≪ K. For instance, K is above 10 000 for many widely used codebooks, and M is at a hundred level in our subsequent experiments.

Our first goal is to ensure that the basis dictionary D can still preserve the visual discriminability of the original codebook W. To maintain such discriminability, we should slightly compress the "important" codewords and heavily compress the "chaos-like" codewords. To this end, we model the visual discriminability of each codeword into our compression loss function to learn D in Section IV-A.

Our second goal is to integrate the task-dependent label constraints to supervise our compression. Such labels come from numerous sources, such as object classes, correlative image annotations, and image ranking orders from user-query logs. "Task dependent" means that the compression process should depend on its specific task even in an identical database, e.g., searching semantically similar images or searching near-duplicate (ND) objects. Task-dependent embedding is a challenging issue that remains unexplored in previous works. In Section IV-B, we further show how to integrate heterogeneous labels from different tasks into our codebook compression framework.

IV. TASK-DEPENDENT CODEBOOK COMPRESSION

Using the initial BoW histogram set {V_i}_{i=1}^{n} extracted from the training images, we learn the basis dictionary D from the original codebook W by minimizing the following cost:

Cost(D) = \sum_{i=1}^{n} Loss(V_i, D)    (1)

where Loss(V_i, D) denotes the loss function measuring whether the current basis dictionary D is "good" at reconstructing an initial BoW histogram V_i. To this end, we are trying to learn the best compression function to transfer each high-dimensional sparse V_i into the compact descriptor U_i. To evaluate the compression distortion, we reversely measure the recovery residual of V_i from U_i. Furthermore, considering that each histogram V_i is sparse with only a few nonzero bins, we hope that our reconstruction coefficient vector U_i is also sparse. We add this sparse constraint into the loss function Loss(V_i, D) using an ℓ1 regularization form as follows:

Loss(V_i, D) = \min_{U_i} \|V_i - D U_i\|_2^2 + \lambda \|U_i\|_1    (2)

where U_i ∈ R^M is the coefficient vector to reconstruct V_i from D, which meanwhile serves as the new compressed BoW histogram for the ith image, and λ controls a tradeoff between the sparsity of the reconstructed signal and its precision in recovering V_i. While guaranteeing real sparsity through the ℓ0 norm is intractable [39], using a Lasso-based ℓ1 penalty [19] can still approximate a sparse solution for the coefficients U_i. To avoid arbitrarily small U_i, we conduct a normalization operation on each column d_j of D after each round of optimizing D in (1) as follows:

d_j \leftarrow d_j / \|d_j\|_2 \quad \text{s.t.} \quad \|d_j\|_2 = 1, \; j = 1, \ldots, M.    (3)

Note that Cost(D) in (1) is not convex with respect to D and {U_i} jointly. Therefore, we resort to a joint optimization between the basis dictionary D and the compressed BoW histograms {U_i} iteratively.^1 That is, in each round of optimization, we fix one parameter set (D or {U_i}) and optimize the rest ({U_i} or D). We then rewrite (2) with both D and U_i as explicit variables:

Loss(V_i, D, U_i) = \|V_i - D U_i\|_2^2 + \lambda \|U_i\|_1.    (4)

^1 Nevertheless, other strategies, such as linear programming, can also be used to accelerate the learning of D and the estimation of U_i.

One essential advantage is that, compared with direct sparse coding to build a codebook as in existing works [12], [36], we only operate on the dictionary level (typically containing only 10 000 codewords), rather than on the entire local feature collection (which typically contains millions or billions of local features). Hence, the computational efficiency is largely improved. Meanwhile, as will be shown in our subsequent experiments, our codebook compression performance is comparable with that of the state-of-the-art approaches [12], [36] that directly operate on the entire initial local feature collection.

A. Visual Discriminability Embedding

Visual discriminability refers to whether a codeword is discriminative in distinguishing the training images in the BoW histogram. We exploit this criterion to guide our codebook compression: Discriminative codewords should be slightly compressed, whereas chaos-like codewords should be heavily compressed. Following the principle of term frequency and inverted document frequency [20], we quantitatively measure the visual discriminability of the kth codeword c_k as

w_k = \frac{n_k}{N} \times \log \frac{|I|}{|I_k|}    (5)

where w_k denotes the discriminative weighting of codeword c_k, n_k is the number of local features quantized into c_k, N is the total number of local features extracted from all training images, |I| is the total number of training images, and |I_k| is the number of images that contain local features quantized into c_k. In the right part of (5), the first term measures whether c_k can represent many local features (the more the better), and the second term measures whether c_k is only discriminative for a few images (the less the better). The ensemble of {w_k}_{k=1}^{K} forms a visual discriminability weighting vector w, which is embedded to refine our compression loss in (4) as

Loss(V_i, D, U_i) = \|\mathrm{diag}(w)(V_i - D U_i)\|_2^2 + \lambda \|U_i\|_1.    (6)

Equation (6) highlights whether a given BoW histogram contains discriminative local features, as measured by its weighting. If so, the learning of D would emphasize more on the compression loss of V_i and vice versa.
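For illustration, the following sketch shows one reading of the visually weighted compression in this subsection: the tf-idf codeword weights of (5) and the weighted Lasso coding of (6), with scikit-learn's Lasso as an off-the-shelf solver. The function names, the alpha rescaling, and the toy data are assumptions of this sketch rather than the implementation used in the paper.

```python
# Minimal sketch (our reading of Section IV-A, not the released implementation):
# tf-idf codeword weighting as in (5) and the visually weighted Lasso coding of
# (6) for one BoW histogram. Symbols (V, D, w, lam) follow the notation assumed
# in this rewrite.
import numpy as np
from sklearn.linear_model import Lasso

def codeword_weights(V):
    """V: (n_images, K) raw BoW counts -> per-codeword tf-idf weights w_k (eq. 5)."""
    n_k = V.sum(axis=0)                       # features quantized into codeword k
    N = max(V.sum(), 1.0)                     # all local features in the training set
    df_k = (V > 0).sum(axis=0)                # images containing codeword k
    idf = np.log(V.shape[0] / np.maximum(df_k, 1))
    return (n_k / N) * idf

def compress_histogram(V_i, D, w, lam=0.15):
    """Solve min_U ||diag(w)(V_i - D U)||_2^2 + lam * ||U||_1 (eq. 6)."""
    Dw = D * w[:, None]                       # rows of D scaled by the weights
    Vw = V_i * w
    K = len(V_i)
    # scikit-learn's Lasso divides the squared error by 2*n_samples (= 2K here),
    # so alpha is rescaled accordingly.
    solver = Lasso(alpha=lam / (2 * K), fit_intercept=False, max_iter=5000)
    solver.fit(Dw, Vw)
    return solver.coef_                       # compressed M-dim histogram U_i

# Toy usage with a random 200-word codebook compressed to 20 basis vectors.
rng = np.random.default_rng(1)
V = rng.poisson(0.2, size=(50, 200)).astype(float)
D = rng.standard_normal((200, 20))
D /= np.linalg.norm(D, axis=0)                # column normalization as in (3)
w = codeword_weights(V)
U0 = compress_histogram(V[0], D, w)
print(U0.shape, np.count_nonzero(U0))         # (20,), typically sparse
```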
B. Task-Dependent Discriminability Embedding

Task-dependent discriminability refers to whether the compressed codebook is suitable for its subsequent recognition or retrieval task. In this subsection, we integrate the task labels to supervise the learning of our basis dictionary D. This is achieved by introducing a label constraint kernel (LCK) into our compression loss for each BoW histogram V_i to evaluate its labeling consistency, as shown in

LCK(V_i) = Logistic\Big( \sum_{V_j \in N_\varepsilon(V_i)} S(L_i, L_j) \Big)    (7)

where Logistic(x) = \log(1 + e^{-x}) represents a logistic loss function for a scalar x, which enjoys properties similar to the loss function of support vector machines (SVMs), S(L_i, L_j) defines a consistency measurement between two labels L_i and L_j, which will be revisited in Sections IV-B1–IV-B3, and N_\varepsilon(V_i) denotes the labels of nearby BoW histograms that fall within an ε-neighborhood of V_i.

Then, our goal turns out to be learning D with LCK constraints. To that effect, we propose a refined loss function of (6) as

Loss(V_i, D, U_i) = (1 + \gamma \, LCK(V_i)) \, \|\mathrm{diag}(w)(V_i - D U_i)\|_2^2 + \lambda \|U_i\|_1    (8)

where γ controls the degree of task-dependent embedding. In addition, learning the overall Cost(D) = \sum_i Loss(V_i, D, U_i) with the above refined loss is still not convex with respect to D and {U_i}. Similarly, we resort to a joint optimization between D and {U_i}, as stated before. It is worth mentioning that, before the compression procedure, we preestimate and store the LCK for each V_i in a lookup table, which largely accelerates our subsequent iterative regression.

Our LCK can be used in several kinds of recognition or retrieval tasks by modeling heterogeneous labels, such as one-class, two-class, and multiple-class labels; partial labels; correlative labels; and lists of image ranking orders. We specify the LCK in (7) as follows.

1) Modeling Category Labels: In such a case, the label vector of training images comes from several discrete categories. We assume that any two given categories are independent. Hence, S(L_i, L_j) only takes effect once two identical labels fall into N_\varepsilon(V_i):

LCK(V_i) = Logistic\Big( \sum_{V_j \in N_\varepsilon(V_i)} \delta(L_i, L_j) \Big).    (9)

Here, we specialize the consistency measurement S(L_i, L_j) as the count of identical labels within N_\varepsilon(V_i), in which δ(L_i, L_j) produces 1 for two identical labels and 0 otherwise. Intuitively, if N_\varepsilon(V_i) contains more nearby and identical category labels, the learning of D would "pay more attention to" the reconstruction loss of V_i. It is worth mentioning that partial labeling, as well as one-class labeling, can be easily modeled using our category-based LCK in (9).

2) Modeling Correlative Semantic Annotations: We further remove the category independence constraints, which allows the labels to be correlative to each other. For instance, a "furniture" annotation is closer to a "chair" annotation than to an "airplane" annotation. WordNet [40] is a powerful electronic lexical database that provides such semantic relationships. We adopt the WordNet::Similarity [41] measurement Sim(L_i, L_j) to quantitatively measure the semantic closeness between two related annotations L_i and L_j, which refines the LCK as

LCK(V_i) = Logistic\Big( \sum_{V_j \in N_\varepsilon(V_i)} Sim(L_i, L_j) \Big).    (10)

Here, we specialize the consistency degree of V_i as its accumulated WordNet::Similarity score within N_\varepsilon(V_i). Intuitively, if N_\varepsilon(V_i) contains more correlative annotations, V_i is more semantically sensitive. Therefore, V_i will receive less compression loss in generating D and U_i. On the contrary, more diverse semantic annotations in N_\varepsilon(V_i) will lead to heavier compression of V_i.

3) Modeling Image Ranking Lists: Rather than category or semantic labels, in some other cases, the task supervision cannot be directly obtained. For instance, content-based image search engines (e.g., TinEye [51] and Google Goggles [52]) usually collect a large amount of user-query logs, which offer similarity ranking orders instead of direct semantic labels. Suppose that there are in total Q groups of user-query logs, each giving the ranking order list R_q for the image query q. We give the following ranking consistency modeling of S:

S_q(V_i, V_j) = \mathrm{sign}\big(Rank_q(V_j) - Rank_q(V_i)\big)    (11)

where Rank_q(V_i) denotes the ranking order of V_i in R_q. We set S_q(V_i, V_j) = 0 once V_i or V_j is outside R_q, therefore ensuring a zero contribution for unlabeled (unranked) images. Subsequently, our LCK is refined as follows:

LCK(V_i) = Logistic\Big( \sum_{q=1}^{Q} \sum_{V_j \in N_\varepsilon(V_i)} S_q(V_i, V_j) \Big)    (12)

in which S_q(V_i, V_j) takes the value +1, −1, or 0. Based on (12), we specialize the consistency measurement of V_i as the sum of positive or negative sequential ranking pairs within N_\varepsilon(V_i). Intuitively, if we get more positive sequential ranking pairs in N_\varepsilon(V_i), V_i should be ranked higher. Hence, our LCK will give less compression loss for V_i in generating the basis dictionary D and U_i.

C. Iterative Dictionary Learning

Learning the basis dictionary D in Cost(D) is an optimization problem with respect to D and {U_i}. As stated before, the jointly convex optimization of both D and {U_i} is unavailable, and we resort to an iterative convex optimization between them using block coordinate descent [12], [42]–[44]. It iteratively optimizes the learning of D (denoted as supervised dictionary learning) and the inference of {U_i} (denoted as task-dependent sparse coding), as outlined in Algorithm 1.

Algorithm 1: Iterative Dictionary Learning
1 Input: Training images with initial BoW histograms {V_i}, initial visual codebook W, task-dependent labels {L_i}, iteration number t, and maximum iteration number T.
2 Initialization: Set D^(0) to a random Gaussian matrix with normalized columns. Use D^(0) and {V_i} to calculate the initial {U_i^(0)}.
3 while {t ≤ T} do
4   For a given task, calculate and store the LCK response for each V_i using the respective scheme in Section IV-B.
5   Supervised Dictionary Learning:
6   for each {V_i in the tth iteration} do
7     Use Lasso [19] to learn D^(t) at the tth iteration based on the following loss:
8     Loss(V_i, D^{(t)}) = (1 + \gamma \, LCK(V_i)) \, \|\mathrm{diag}(w)(V_i - D^{(t)} U_i^{(t-1)})\|_2^2 + \lambda \|U_i^{(t-1)}\|_1    (13)
9   end
10  Task-Dependent Sparse Coding:
11  for each {V_i in the tth iteration} do
12    Adopt D^(t) to estimate U_i^(t) using (14):
13    U_i^{(t)} = \arg\min_{U_i} (1 + \gamma \, LCK(V_i)) \, \|\mathrm{diag}(w)(V_i - D^{(t)} U_i)\|_2^2 + \lambda \|U_i\|_1    (14)
14  end
15  t ← t + 1
16 end
17 Output: The basis dictionary D; the new compressed histograms {U_i}.

It is worth mentioning that, in practice, we learn the compression function D and the compressed histograms {U_i} simultaneously using Algorithm 1 with iterative optimization between D and {U_i}. From this viewpoint, our proposed approach can be regarded as a single-stage process.
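The following compact sketch illustrates the alternating scheme of Algorithm 1 as reconstructed above. It is not the authors' implementation: the LCK values are assumed precomputed and folded into per-image weights (1 + γ·LCK), scikit-learn's Lasso performs the task-dependent sparse coding step, and a weighted least-squares update stands in for the Lasso-based dictionary step of (13).

```python
# Minimal sketch of the alternating optimization in Algorithm 1 (an illustration
# under the notation assumed in this rewrite, not the authors' code).
import numpy as np
from sklearn.linear_model import Lasso

def iterative_dictionary_learning(V, w, lck, M=20, lam=0.15, gamma=1.0, T=10, seed=0):
    """V: (n, K) BoW histograms; w: (K,) visual weights; lck: (n,) precomputed LCKs."""
    n, K = V.shape
    sample_w = 1.0 + gamma * np.asarray(lck)            # per-image task-dependent weight
    Vw = V * w                                          # diag(w) applied to every histogram
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((K, M))
    D /= np.linalg.norm(D, axis=0)                      # column normalization, eq. (3)

    def sparse_code(D):
        """Task-dependent sparse coding (eq. 14) for every histogram, D fixed."""
        Dw = D * w[:, None]
        U = np.zeros((n, M))
        for i in range(n):
            # scikit-learn's Lasso scales the squared error by 1/(2K); folding the
            # per-image weight into alpha reproduces the weighted objective.
            coder = Lasso(alpha=lam / (2 * K * sample_w[i]),
                          fit_intercept=False, max_iter=3000)
            coder.fit(Dw, Vw[i])
            U[i] = coder.coef_
        return U

    U = sparse_code(D)                                   # initial codes from D^(0)
    for _ in range(T):
        # Supervised dictionary learning with U fixed (stand-in for eq. 13);
        # the per-codeword weights w factor out of this row-wise least squares.
        G = U.T @ (U * sample_w[:, None]) + 1e-6 * np.eye(M)
        P = V.T @ (U * sample_w[:, None])
        D = np.linalg.solve(G, P.T).T                    # K x M
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)   # eq. (3)
        U = sparse_code(D)                               # task-dependent sparse coding
    return D, U

# Toy usage: 40 random BoW histograms, uniform visual weights, random LCK values.
rng = np.random.default_rng(3)
V = rng.poisson(0.2, size=(40, 200)).astype(float)
D, U = iterative_dictionary_learning(V, w=np.ones(200), lck=rng.random(40), M=16, T=3)
print(D.shape, U.shape)   # (200, 16) (40, 16)
```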
V. EXPERIMENTAL COMPARISONS

In this section, we compare our task-dependent codebook compression with alternative approaches and five state-of-the-art works [1], [16], [33], [36], [50]. We carry out our comparisons on two benchmark databases and a large-scale web photo data set: 1) object recognition in PASCAL Visual Object Class (VOC) 07, which aims to classify whether a given image contains a certain object (output a binary judgment); 2) ND image retrieval in UKBench, which aims to find ND images that contain identical objects or identical scenes without regard to the photographing variances; and 3) a 0.5 million Flickr data set, which provides quantitative performance evaluations for three kinds of task-dependent codebook compressions, including category image search, semantically similar image search, and image ranking.

A. Benchmark Databases and Evaluations

1) PASCAL VOC 07 Object Recognition Benchmark: The PASCAL VOC 2007 benchmark [45] contains 20 image categories, including Aeroplane, Bicycle, Bird, Boat, Bottle, Bus, Car, Cat, Chair, Cow, Diningtable, Dog, Horse, Motorbike, Person, Pottedplant, Sheep, Sofa, Train, and Tvmonitor. There are around 2500 training images, 2500 validation images, and 5000 test images in total. We use PASCAL VOC 07 to evaluate our category-based LCK (see Section IV-B1) with comparisons to [16] and [36].

2) UKBench ND Retrieval Benchmark: The UKBench data set [1] contains over 10 000 images of 2500 objects. There are four images per object to offer sufficient variances in viewpoints, lighting conditions, scales, occlusions, and affine transforms. These category labels are modeled into our category LCK, with the four images of each object treated as one category. To compare with the MSER + SIFT baseline reported in [1], and identical to the setting of [1], we build our retrieval model over the entire UKBench database and then select the first image per object to form a query set. This ensures that the top returned photo is the query itself. Subsequently, the performance depends on whether we can rank the remaining three photos of this object in the highest possible positions. This is evaluated by the "correct returning" measurement [1], which counts the average number of correctly returned images in the top four results. The UKBench data set is used to evaluate our codebook compression with the category-based LCK in Section IV-B1, with comparisons to [1] and [33].

3) 0.5 Million Flickr Database: We crawled over 500 000 collaboratively labeled photos from Flickr, which gives a real-world evaluation for our task-dependent compression. It contains over 180 million local features with over 450 000 labels (over 3600 unique keywords). Within this 0.5 million Flickr data set, we carry out three task-dependent codebook compressions for the three following tasks:

1) ND-0.5Million Evaluation: Our first task is ND image search in this 0.5 million Flickr data set. Following the TinEye search evaluation methodology [51], we collected and manually labeled 23 groups of web images (1100 in total) from both TinEye and Google image search. The images in each group are partial duplicates of each other. We then add these ground-truth images into our 0.5 million data set. This enriched data set is referred to as ND-0.5Million (ND evaluation in a 0.5 million database).² We quantitatively evaluate our task-dependent codebook compression using a category-aware LCK, with comparison to the related works in [1] and [50]. We measure our retrieval performance by the mean average precision (MAP) at the top N positions. It represents the mean precision of the Q queries, each of which reveals its position-sensitive ranking precision in the top N positions, as follows:

MAP@N = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{m_q} \sum_{r=1}^{m_q} P_q(r)    (15)

where Q is the number of queries, r is the rank, m_q is the number of related images for query q, and P_q(r) is the precision at the cutoff rank of the rth related image.

2) SS-0.5Million Evaluation: Our second task is semantically sensitive image retrieval in this 0.5 million Flickr data set. We select 20 labels identical to the categories in PASCAL VOC 07. For each label, we randomly pick two images from our 0.5 million Flickr data set and rank the top 100 similar photos for each. For each ranking list, we ask a group of volunteers to identify whether each ranked image comes from the identical category of the query. If so, we treat it as a "correct" image, otherwise as an "incorrect" image. Similar to the ND-0.5Million Evaluation, we also use MAP to measure our performance. We denote this evaluation as SS-0.5Million (semantically similar evaluation in a 0.5 million database), which is used to compare performances between our semantic-annotation-based LCK (see Section IV-B2) and the work in [50].

3) RS-0.5Million Evaluation: Our third task is image ranking based on user-query log learning.
For each query in the above SS-0.5Million Evaluation (containing 40 queries), we collect the correct returning images and their ranking orders as user-query logs, which are used to evaluate our ranking-based LCK in Section IV-B3. We denote this group as a ranking-sensitive evaluation in 0.5 million database) (RS-0.5Million evaluation), which evaluates our ranking consistency LCK (see Section IV-B3) with the normalized discounted cumulative gain (NDCG) measurement as (16) where is the NDCG at the rank position of and is the last position of the relevant sample within the first samples in the ranked list. represents the rate of relevant between the th sample in the ranked list and the query bounded in [0, 1]. is the normalized constant for a given ranked list. Compared with precision and recall, NDCG is sensitive to position of the highest rated image, regardless of the variable lengths in different ranking lists. 2Since the original 0.5 million Flickr data set could also contain partial duplicate images of our 23 ground-truth categories, we use every image in the ground-truth set to query the database and, subsequently, to identify and remove any duplicates from the original 0.5 million Flickr data set. 2288 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 4, APRIL 2012 Fig. 4. Parameter tuning of best compressed codebook size in three databases. The tuning performance in (a) is measured by Precision@10. The tuning performance in (b) is measured by corrected returning. The tuning performances in (c) and (d) are measured by MAP. (a) Tuning (K, , ) in PASCAL 07. (b) Tuning (K, , ) in UKBench. (c) Tuning (K, , ) in ND-0.5Million Evaluation. (d) Tuning (K, , ) in SS-0.5Million Evaluation. B. Baselines and Alternative Approaches We summarize our baselines [1], [12], [16], [33], [50] and alternative approaches compared in the subsequent quantitative comparisons 1) VT [1]: The first baseline comes from the unsupervised visual-codebook generation based on k-means clustering. We resort to its hierarchical version [1] to ensure scalable indexing and search. [33]: Bosch et al. [33] reported the 2) state-of-the-art recognition performances on Caltech 101. They use pLSA to extract topics from the original BoW histograms, then train an SVM over the topic-level features. Since [33] also compressed an initial codebook, we directly compare our performance to the quantitative results in [33]. In Section V-E, we will further discuss the theoretical differences between [33] and this paper. 3) Learning-based codebook refinement [16], [31] (BoW Learning): To build supervised vocabulary for object recognition tasks, we also reimplement the work in [16], which incorporated category learning to adapt an initial vocabulary into several class-specific vocabularies. We compare our work to [16] in both the object categorization task (in PASCAL VOC 07) and the ND Image Retrieval task (in UKBench). 4) Sparse-coding-based visual codebook [36] (Sparse Coding): We further compare our codebook compression with a recent work in [36]. Different from our scheme that operates over the BoW histogram, the work in [36] used fast sparse coding to generate a compact dictionary from local patch collections. Since [36] is also deployed on the PASCAL VOC 07, we can directly offer quantitative comparisons to [36] in the object categorization task. 5) Aggregating local features (AggreSIFT): Jegou et al. 
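For reference, the following sketch computes the two measures used above, average precision/MAP and NDCG@n, on toy ranking lists. The graded-relevance NDCG form with a base-2 logarithmic discount and ideal-DCG normalization reflects our reading of (16) and is an assumption of this sketch rather than the exact evaluation scripts of the paper.

```python
# Minimal sketch (not the paper's evaluation scripts) of the two retrieval
# measures described above: average precision per query / MAP, and NDCG@n with
# graded relevance r(j) in [0, 1].
import numpy as np

def average_precision(relevant, ranked, n=None):
    """relevant: set of ground-truth ids; ranked: list of returned ids."""
    ranked = ranked[:n] if n else ranked
    hits, precisions = 0, []
    for r, img_id in enumerate(ranked, start=1):
        if img_id in relevant:
            hits += 1
            precisions.append(hits / r)      # precision at the cutoff of each hit
    return sum(precisions) / max(len(relevant), 1)

def mean_average_precision(queries, n=None):
    """queries: list of (relevant_set, ranked_list) pairs, one per query."""
    return float(np.mean([average_precision(rel, rk, n) for rel, rk in queries]))

def ndcg_at_n(relevance, n):
    """relevance: graded relevance r(j) of the ranked list, in ranked order."""
    rel = np.asarray(relevance[:n], dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))          # log(1 + j)
    dcg = np.sum((2.0 ** rel - 1.0) / discounts)
    ideal = np.sort(np.asarray(relevance, dtype=float))[::-1][:n]
    idcg = np.sum((2.0 ** ideal - 1.0) / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

# Toy usage: one query with 3 relevant images among the top 5 results.
print(mean_average_precision([({1, 2, 3}, [1, 9, 2, 8, 3])], n=5))   # ~0.756
print(ndcg_at_n([1, 0, 1, 0, 1], n=5))
```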
[50] proposed to aggregate local descriptors with principal component analysis (PCA) and hashing indexing, which produced an approximate 32-bit descriptor per image, as a successive work of [53]. We also compare our scheme to [50] in our 0.5 million Flickr data set. 6) Compressed codebook without visual discriminability embedding (without visual): As an alternative approach, we do not embed both visual discriminability weighting and LCK to our loss function [see (8)]. Hence, only direct sparse coding is used to compress the initial codebook [use (2) to replace (8)]. This baseline demonstrates our effectiveness in visual discriminability embedding; Fig. 5. Parameter tuning of the best hierarchical layers for the initial visual codebooks within three benchmark databases, respectively. (a) Tuning H in PASCAL07. (b) Tuning H in UKBench. (c) Tuning H in 0.5 million Flickr database. 7) Compressed codebook without task-dependent embedding (without task): As another alternative approach, we only but not embed the visual discriminability weighting LCK, to compress the initial codebook [using (6) to replace (8)]. - : A straightfor8) ward approach is to compress the high-dimensional BoW histograms using PCA. In such a case, PCA can reduce the dimensionality while maintaining the main characteristics of the BoW histogram. In Section V-D, we compare our codebook compression with BoW PCA in both UKBench and the 0.5 million Flickr data set. C. Parameter Tunings 1) Codebook Construction Tuning: For each benchmark data set, we extract SIFT [46] features from its training images. Using all SIFT features, we build a VT [1] to get the initial for each codebook , which generates a BoW histogram training image . We denote the hierarchical level (decide how and the branching factor many layers to build the VT) as (decide how many children clusters for each parent node) as . We stop the quantization division once there are less than 1000 SIFT features in a word. This setting gives at most words in the finest level. Fig. 4 gives the parameter tuning of codebook size in three data sets, respectively. Based on tuning results in Fig. 5, the initial vocabulary are settled as and for UKBench data set, and for and for Flickr PASCAL VOC 07 data set, and 0.5 million data set, respectively. For the object classification task, the linear SVM [48] is adopted to predict the category labels. For the ND retrieval (in UKBench) and image search (in 0.5 million Flickr data sets), we use L2 distance to rank image similarity with inverted indexing. JI et al.: TASK-DEPENDENT VISUAL-CODEBOOK COMPRESSION 2289 Fig. 6. Average precision in all 20 object categories of PASCAL VOC 07 data set, with comparisons to state-of-the-art works in VT [1], BoW pLSA [33], learning codebook [16], and sparse coding [36]. The line “codebook compression” denotes our final approach with both visual and task-dependent discriminability embedding, the lines “without visual” denote baseline (6), and the lines “without task” denote baseline (7). Our final approach is the one that includes both visual discriminability and task-dependent (category-based LCK) discriminability in codebook compression. In “Without Visual,” “Without Task,” and “Final Approach,” for learning the basis dictionary . we settle 2) Codebook Compression Tuning: Based on the best tuned visual codebook, we carry out our codebook compression to obtain the compressed codebook . 
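As an illustration of the retrieval setup described above, the following sketch ranks images by L2 distance over sparse histograms using a small inverted file: only nonzero bins are stored, and the squared distance is accumulated from dot products over shared codewords. The class name and the toy data are illustrative assumptions, not the indexing structure used in this paper.

```python
# Minimal sketch (a generic inverted-file ranker, not the paper's system) of
# L2 ranking with inverted indexing. Candidates sharing no codeword with the
# query are skipped, so the cost depends on histogram sparsity.
from collections import defaultdict
import numpy as np

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(list)   # codeword id -> [(image id, bin value)]
        self.sq_norms = {}                  # image id -> ||v||^2

    def add(self, image_id, hist):
        self.sq_norms[image_id] = float(np.dot(hist, hist))
        for k in np.flatnonzero(hist):
            self.postings[k].append((image_id, float(hist[k])))

    def query(self, q, top=10):
        # ||q - v||^2 = ||q||^2 + ||v||^2 - 2 <q, v>, with <q, v> accumulated
        # from the posting lists of the query's nonzero bins.
        dots = defaultdict(float)
        for k in np.flatnonzero(q):
            for image_id, value in self.postings[k]:
                dots[image_id] += float(q[k]) * value
        q_sq = float(np.dot(q, q))
        scored = [(q_sq + self.sq_norms[i] - 2.0 * d, i) for i, d in dots.items()]
        return sorted(scored)[:top]          # smallest L2 distance first

# Toy usage with random sparse histograms.
rng = np.random.default_rng(2)
index = InvertedIndex()
hists = rng.poisson(0.05, size=(100, 500)).astype(float)
for i, h in enumerate(hists):
    index.add(i, h)
print(index.query(hists[7], top=3))          # image 7 should rank first
```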
Here, we briefly discuss the choice of three compression parameters: [control reconstruction sparsity in (2) (4) (6) (8)], [controls the degree of task-dependent embedding in (8)], and [control the dimension of basis dictionary in (2) (4) (6) (8)]. While a straightforward choice is to tune all of them simultaneously using cross-validation, we resort to a more efficient strategy that sequentially optimizes , , and . That is, we increase the volumes of . In addition, for pair as follows. First, as empireach , we tune the best ically found in [12], the sparsity parameter is settled as 0.15. Then, serves as a regularization parameter to control that the does not overfit to the task labels : compressed codebook A larger results to good fit to the training label (low variance) but will simultaneously increase the classification bias and vice versa. Then, we estimate the best by leave-one-out cross-valin each data set. idation. Fig. 4 presents our best tuned D. Quantitative Results 1) Object Recognition in PASCAL VOC 07: Fig. 6 shows our object recognition performance in PASCAL VOC 07. First, a solely visual codebook [1] [baseline (1)] with a linear SVM [48] obtains the lowest performance among all approaches. This performance was also validated in the PASCAL VOC 07 evaluations of BoW [49]. Employing pLSA [33] to compress the codebook [baseline (2)] improves performance by reducing the feature dimension to avoid the dimension curses in the SVM (this is also observed in [33]). Using codebook learning [16] (baseline (3) that learned a class-specific codebook is learned for each object class) with a one-class SVM, we find that the integration of semantic supervision can largely improve the performance. Employing sparse coding to directly learn a compressed codebook [36] [baseline (4)] outperforms baselines (1)–(3) in effectiveness. However, its time cost is extremely huge since it directly operates over the entire local feature collection (Instead, our coding operates on the BoW histogram). For baseline (6) that compresses the initial codebook without either visual discriminability or task-dependent discriminability embedding, we achieve better performance than both VT [1] and BoW pLSA [33] (the latter one is the most similar work to our alternative approach [baseline (7)] with solely visual discriminability embedding). However, we achieve lower performance than the learning codebook [16] [baseline (3)] and the sparse coding [36] [baseline (4)]. We explain this phenomenon from the fact that both baselines (3) [16] and (4) [36] integrate label supervision to build codebooks. Compared with our scheme that only embeds visual discriminability, such integration largely boosts their performances. With solely visual discriminability embedding [baseline (7)] without task-dependent embedding, we outperform the VT pLSA [33] [baseline (2)], and [1] [baseline (1)] the BoW the learning codebook [16] [baseline (3)]. In addition, we also achieve comparable performance to the sparse coding [36] scheme [baseline (4)]. Finally, we embed both visual and task-dependent discriminability to obtain the best performances over all baselines. In particular, our approach outperforms the supervised sparse coding [36] in 16 categories. Advantages in computational efficiency: It is worth mentioning that the work in [36] directly encoded the gigantic collection of the original local descriptors (such as SIFT). This is extremely time-consuming and cannot be scaled up to millions or billions of descriptors. 
Indeed, the time complexity of [36] is already almost huge for PASCAL VOC 07 (contain 10 000 images). On the other hand, the work of [16] learned category-specific codebooks and hence also cannot be scaled up to real-world database that contains over thousands of correlative semantic categories. This is a similar issue for the work in [31], which also learned one codebook per category to train the subsequent classifiers. Table I further shows the visual search efficiency in the 0.5 million Flickr data set. Using both visual and task-dependent discriminability embedding does not significantly increase the computational cost compared with the building of the original VT. In the online search, due to the online coding requirement, we also need additional cost to build the compressed BoW histogram, but this is still slightly efficient compared with the state-of-the-art works in extracting the pLSA topic features in [33]. In addition, adding GNP would significantly increase the 2290 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 4, APRIL 2012 TABLE I TIME AND STORAGE ANALYSIS IN THE 0.5 MILLION FLICKR DATABASE Fig. 7. Performance of the ND image search task in the UKBench database. We compare our task-dependent codebook compression with six baselines: baseline (1) VT [1], baseline (2) BoW pLSA [33], baseline (3) learning codebook [16], baseline (4) direct sparse coding [36], baseline (5) without visual discriminative PCA embedding, baseline (6) without task embedding, and Bag-of-Words [baseline (8)]. In such case, the four images in an identical category are with an ). identical category label in our category LCK LCK online query time. Note that the additional search time cost comes from twofold: 1) to transform the BoW histogram into the compressed histogram and 2) to search in the new inverted indexing of the compressed codewords. The later costs more time since the inverted file for compact codebook is now less sparse. 2) ND Image Retrieval in UKBench: Fig. 7 further shows the performance of above baselines in UKBench, where similar observations to Fig. 6 can be found. Note that the increase in the axis and the measurement in the axis are identical to the quantitative comparisons in [1]. In addition, there are two interesting observations. 1) ND image retrieval concerns more on the visual discriminability and less on the task-dependent discriminability. This can be validated by the large margin between baselines (6) and (7), where baseline (7) performs better due to its visual discriminability embedding. 2) For ND search, the original VT [1] [baseline (1)] even pLSA [33] [baseline (2)] and outperforms both BoW learning codebook [16] [baseline (3)]. It is because the labels (four images as an identical category) do not always offer further information in codebook building since the intracategory images are already ND. Another reason is that the ND search is without classifier training. Hence, the dimension curses for [1] in Fig. 6 can be largely avoided. 3) Image Ranking Tasks in 0.5 Million Flickr Data Set: Fig. 8(a) further shows the ND-0.5Million Evaluation in the Flickr data set. While similar performance to Fig. 6 can be found, there are two additional observations compared with ND image retrieval. 1) The scenario of semantic image search should concern more on the task-dependent embedding and less on the visual discriminability embedding. Fig. 8(a) shows the large margin of our approach and baseline (7) to baselines (1), (5), and (6). 
2) Baseline (5) [50] outperforms our alternative approaches [baselines (6) and (7)], but performs worse than our final approach. However, we should note that baseline (5) from [50] gives an extremely compact image signature with only approximate 32 B, whereas our final approach gives over 1-kB representation, as shown in Table I. Fig. 8(b) further shows the SS-0.5Million Evaluation in the 0.5 million Flickr data set. Similarly, there is a large margin from the learning codebook [16] [baseline (3)] and our final approach to the VT [1] [baseline (1)], adaptive vocabulary [16] [baseline (3)], and our alternative approaches [baseline (7)]. Note that baseline (3) outperforms our alternative approach [baseline (7)] with only visual discriminability embedding. However, our final approach still outperforms baseline (3) using both visual discriminability and task-dependent discriminability embedding. The main advantage comes from the usage of the annotation-based LCK in Section IV-B2. Finally, Fig. 8(c) shows the RS-0.5Million Evaluation in our 0.5 million Flickr data set, which is used to evaluate our ranking-based LCK in Section IV-B3, with comparisons to our alternative approaches [baselines (6) and (7)]. In such case, learning from the user-query logs can largely improve our ranking performance. By linearly increasing the user-query logs, our learning complexity does not increase a lot (at most linear). As shown in Figs. 7 and 8, serving as a straightforward solution, BoW PCA does not have better performance comparing with our scheme. While PCA preserves the principal component in the original BoW histogram, it is hard to capture the visual discriminability of the BoW in search. In addition, due to the computational cost in principal component extraction, PCA is less efficient for large codebooks. 4) Where is the Matching?: We have recorded the spatial locations of local patches. After quantizing them into into using the initial BoW histogram , we transmit . Then, we recover from using . Since there are only partial bins in that are nonzero, we can plot these nonzero bins back to the local features that are quantized into these bins, which is shown in Fig. 9. JI et al.: TASK-DEPENDENT VISUAL-CODEBOOK COMPRESSION 2291 Fig. 8. (a) Performance of ND image search task in the 0.5 million Flickr data set (ND-0.5Million Evaluation), with comparisons to VT [1] [baseline (1)] and aggregate local features [50] [baseline (5)], without visual discriminative embedding [baseline (6)], without task embedding [baseline (7)], and Bag-of-Words PCA [baseline (8)]. In such case, the images annotations are considered as correlative annotation supervision for using our category LCK LCK . (b) Performance of semantic sensitive image search task in 0.5 million Flickr data set (SS-0.5Million Evaluation), with comparisons to baseline (1) VT [1], aggregate local features [50] [baseline (5)], without visual discriminative embedding [baseline (6)], without task embedding [baseline (7)], and Bag-of-Words PCA . (c) Perfor[baseline (8)]. In such case, the images annotations are considered as correlative annotation supervision for using our category LCK LCK mance of ranking-sensitive image search task in 0.5 million Flickr data set (RS-0.5Million Evaluation), with comparisons to baseline (1) VT [1], aggregate local features [50] [baseline (5)], without visual discriminative embedding [baseline (6)], without task embedding [baseline (7)], and Bag-of-Words PCA [baseline Note that we leverage (8)]. 
In such case, the images annotations are considered as correlative annotation supervision for using our category LCK LCK the NDCG measurement rather than the MAP measurement in our previous two groups evaluations. Fig. 9. Spatial locations of the recovered local patches from our compressed signature . In each subfigure, the top left part is the image in the database, the top right patches are the local patch collection extracted from this image, the down part is the patches that belong to the nonzero visual words as well as their . spatial locations after reconstructing from Fig. 11. Ranking example in the ND-0.5Million Evaluation, in which each left figure (large) denotes a query generated from our ground-truth set. In addition, we show two groups of ranking results: 1) Upper row: original VT-based ranking [1]; 2) Lower row: Our codebook compression-based ranking results with category-based LCK embedding. Fig. 10. Average distribution of compressed bag-of-words histogram struction parameter vectors). (recon- training classifiers based on such a low-dimensional histogram can also well avoid the dimension curse caused in the previous high-dimensional BoW histograms. E. Further Discussions 5) Compressed Histogram Distribution: In Fig. 10, we sum from 100 test images to get an average histogram, 3 in which each bin shows the averaged reconstruction parameters. Note in the compression function , which that we set is learned from a 10 000-word vocabulary. We can see that, compared with the originally sparse BoW histogram , the compressed reconstruction parameter histogram (which is also our compressing descriptor) is very dense, which largely reduces the storage space (in memory) and retrieval efficiency. Finally, 3We sum all for . Then, we obtain the average value for each bin by subdividing each bin with 100. Therefore, we obtain an average reconstruction parameter histogram to see the new codeword distribution. On correlations to visual codebook topic model: Nevertheless, one straightforward solution in codebook compression is to use unsupervised topic models (such as pLSA [32], [33] and LDA [33]) to build a higher level abstraction of the initial codebook (see Fig. 11). In addition to our superior quantitative performances, we further discuss two main differences between our scheme and topic models: 1) pLSA and LDA are all generative models, which need predefined superparameters to decide topic numbers. However, it is unsuitable to assert that a given codebook has fixed topics in its compression for different task scenarios (even in an identical database). 2292 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 21, NO. 4, APRIL 2012 2) pLSA and LDA are computationally inefficient. For instance the singular vector decomposition is hard to scale up. In contrast, our codebook compression uses the gradient decent operations to iteratively learn the basis dictionary, which can be easily paralleled (on the optimizing of or ). either On correlations to ICA dictionary learning: The work from independent component analysis (ICA) [47] also builds dictionary coding from the local patch collection, which is unsuitable for our case due to the following. 1) Components in ICA should be orthogonal between each other. On the contrary, there is not such constraints in need not to our compression, where any two bases in be orthogonal. Subsequently, our lossy constraint enables the integration of both visual and task-dependent discriminability. 
2) ICA is computational inefficiency, which is deployed over the initial local patch collection. This cost is unaffordable in large scale (an identical drawback to the sparse coding based codebooks [12], [36]). 3) ICA focuses on the reconstruction capability of the original input signal. On the contrary, our model emphasizes on the discriminability (both visual and task dependent) for the subsequent visual search or object recognition tasks. [2] J. Yang, Y. Jiang, A. Hauptmann, and C. W. Ngo, “Evaluating bag-ofvisual-words representations in scene classification,” in Proc. Multimedia Inf. Retrieval, 2007, pp. 197–206. [3] R. Ji, X. Xie, H. Yao, and W. Y. Ma, “Vocabulary hierarchy optimization for effective and transferable retrieval,” in Proc. CVPR, 2009, pp. 1161–1168. [4] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrieval with large vocabulary and fast spatial matching,” in Proc. CVPR, 2007, pp. 1–8. [5] J. van Gemert, C. Veenman, A. Smeulders, and J. Geusebroek, “Visual word ambiguity,” IEEE Trans. Pattern Anal. Mach. Intel., vol. 32, no. 7, pp. 1271–1283, Jul. 2009. [6] D. Chen, S. Tsai, and V. Chandrasekhar, “Tree histogram coding for mobile image matching,” in Proc. Data Compression Conf., 2009, pp. 143–152. [7] D. Chen, S. Tsai, V. Chandrasekhar, G. Takacs, R. Vedantham, R. Grzeszczuk, and B. Girod, “Inverted index compression for scalable image matching,” in Proc. Data Compression Conf., 2010, p. 525. [8] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” in Proc. ICCV, 2003, pp. 1470–1477. [9] G. Schindler and M. Brown, “City-scale location recognition,” in Proc. CVPR, 2007, pp. 1–7. [10] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for scalable image search,” in Proc. ICCV, 2009, pp. 2130–2137. [11] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak geometric consistency for large scale image search,” in Proc. ECCV, 2008, pp. 304–317. [12] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised dictionary learning,” in Proc. NIPS, 2008, pp. 1033–1040. [13] S. Lazebnik and M. Raginsky, “Supervised learning of quantizer codebooks by information losss minimization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 7, pp. 1294–1309, Jul. 2009. [14] F. Moosmann, B. Triggs, and F. Jurie, “Fast discriminative visual codebooks using randomized clustering forests,” in Proc. NIPS, 2006, pp. 985–992. [15] R. Ji, H. Yao, X. Sun, B. Zhong, and W. Gao, “Towards semantic embedding in visual vocabulary,” in Proc. CVPR, 2010, pp. 918–925. [16] F. Perronnin, C. Dance, G. Csurka, and M. Bressan, “Adapted vocabularies for generic visual categ,” in Proc. ECCV, 2006, pp. 464–475. [17] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: A comprehensive review,” Int. J. Comput. Vis., vol. 73, no. 2, pp. 213–238, Jun. 2007. [18] J. Liu, Y. Yang, and M. Shah, “Learning semantic visual vocabularies using diffusion distance,” in Proc. CVPR, 2009, pp. 461–468. [19] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” J. Roy. Stat. Soc., vol. 58, no. 1, pp. 267–288, 1996. [20] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, 1988. [21] P. Quelhas, F. Monay, J. M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. J. Van Gool, “Modeling scenes with local descriptors and latent aspects,” in Proc. ICCV, 2005, pp. 
883–890. [22] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in Proc. ICCV, 2005, pp. 1458–1465. [23] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proc. VLDB, 1999, pp. 518–529. [24] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in quantization: Improving particular object retrieval in large scale image databases,” in Proc. CVPR, 2008, pp. 1–8. [25] Y.-G. Jiang, C.-W. Ngo, and J. Yang, “Towards optimal bag-of-features for object categorization and semantic video retrieval,” in Proc. CIVR, 2007, pp. 494–501. [26] T. Kohonen, Learning vector quantization for pattern recognition Helsinki Inst. Technol., Helsinki, Finland, Tech. Rep. TKK-F-A601, 1986. [27] T. Kohonen, Self-Organizing Maps, 3rd ed. New York: SpringerVerlag, 2000. [28] A. Rao, D. Miller, K. Rose, and A. Gersho, “A generalized VQ method for combined compression and estimation,” in Proc. ICASSP, 1996, pp. 2032–2035. [29] B. Leibe, A. Leonardis, and B. Schiele, “Combined object categorization and segmentation with an implicit shape model,” in Proc. ECCV, 2004, pp. 17–32. [30] S. Agarwal and D. Roth, “Learning a sparse representation for object detection,” in Proc. ECCV, 2002, pp. 113–130. [31] J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned universal visual dictionary,” in Proc. ICCV, 2005, pp. 1800–1807. [32] F.-F. Li and P. Perona, “A Bayesian hierarchical model for learning natural scene categories,” in Proc. ICCV, 2007, pp. 524–531. VI. CONCLUSION AND FURTHER WORKS In this paper, we have proposed to compress an originally high-dimensional visual codebook with the help of supervised task labels based on sparse coding. In addition to the computational benefits in both processing time and storage space, our compressed codebook is also discriminative, extensible, and flexible. Our discriminability comes from integrating the codeword term weighting measurement into our compression cost, by which the “important” codewords are slightly compressed and vice versa. Our extensibility comes from deploying our compression scheme over the most existing codebooks [1], [3], [4], [8], [9] as a task-dependent post adaption. Our flexibility comes from supervising the compression via task labels, achieved by a task-dependent LCK in our compression loss. We have conducted extensive experiments on PASCAL VOC 07, UKBench, and a 0.5 million Flickr data set. We have shown superior performances compared with state-of-the-art codebooks [1], [16], [33], [36], [50]. Two interesting questions remain open in this paper. First, because there are numerous user-query logs generated everyday in most existing image search engines, how to adapt the supervised codebook compression to the incremental query logs is still an open problem. To this end, it would be interesting to extend our codebook compression into an incremental version. Second, to a certain degree, a formerly compressed codebook also offers valuable knowledge for the new tasks. We would further exploit the feasibility of task-dependent codebooks, e.g., adapting a compressed codebook from PASCAL VOC 07 to Caltech101. REFERENCES [1] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary tree,” in Proc. CVPR, 2006, pp. 2161–2168. JI et al.: TASK-DEPENDENT VISUAL-CODEBOOK COMPRESSION [33] A. Bosch, A. Zisserman, and X. Munoz, “Scene classification using a hybrid generative/discriminative approach,” IEEE Trans. Pattern Anal. Mach. 
[33] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.
[34] D. Donoho, "For most large underdetermined systems of equations, the minimal l1-norm near-solution approximates the sparsest near-solution," Commun. Pure Appl. Math., vol. 59, no. 7, pp. 907–934, 2006.
[35] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[36] S. Bengio, F. Pereira, and Y. Singer, "Group sparse coding," in Proc. NIPS, 2009, pp. 1–9.
[37] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. ICML, 2009, pp. 689–696.
[38] H. Liu, M. Palatucci, and J. Jiang, "Blockwise coordinate descent procedures for the multi-task Lasso, with applications to neural semantic basis discovery," in Proc. ICML, 2009, pp. 649–656.
[39] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1999.
[40] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
[41] T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity: Measuring the relatedness of concepts," in Proc. AAAI, 2004, pp. 1024–1025.
[42] B. A. Olshausen and D. J. Field, "Sparse coding with an overcomplete basis set: A strategy employed by V1?," Vis. Res., vol. 37, no. 23, pp. 3311–3325, Dec. 1997.
[43] M. Aharon, M. Elad, and A. M. Bruckstein, "The K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[44] H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Proc. NIPS, 2006, pp. 801–808.
[45] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. [Online]. Available: http://www.pascalnetwork.org/challenges/VOC/voc2007
[46] D. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[47] A. Hyvarinen and E. Oja, "Independent component analysis: Algorithms and applications," Neural Netw., vol. 13, no. 4/5, pp. 411–430, May/Jun. 2000.
[48] J. Suykens and J. Vandewalle, "Least squares support vector machine classifiers," Neural Process. Lett., vol. 9, no. 3, pp. 293–300, Jun. 1999.
[49] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[50] H. Jegou, M. Douze, C. Schmid, and P. Perez, "Aggregating local descriptors into a compact image representation," in Proc. CVPR, 2010, pp. 3304–3311.
[51] [Online]. Available: http://www.tineye.com/coolsearches
[52] [Online]. Available: http://www.google.com/mobile/goggles
[53] H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.

Rongrong Ji (M'11) received the Ph.D. degree from the Harbin Institute of Technology, Harbin, China, in 2011. From 2007 to 2008, he was a Research Intern with the Web Search and Mining Group, Microsoft Research Asia, Beijing, China, mentored by Xing Xie. From May 2010 to June 2010, he was a Visiting Student with the University of Texas at San Antonio, San Antonio, where he cooperated with Professor Qi Tian.
From July 2010 to November 2011, he was a Visiting Student with the Institute of Digital Media, Peking University, Beijing, under the supervision of Professor Wen Gao. He is currently a Postdoctoral Researcher with the Department of Electrical Engineering, Columbia University, New York. He is the author of over 40 refereed journal and conference papers, including papers in the International Journal of Computer Vision, the IEEE TRANSACTIONS ON IMAGE PROCESSING, Computer Vision and Pattern Recognition, ACM Multimedia, the International Joint Conference on Artificial Intelligence, IEEE MULTIMEDIA, etc. His research interests include image retrieval and annotation, and video retrieval and understanding.

Dr. Ji is the recipient of a Microsoft Fellowship in 2007. He won the Best Paper Award of ACM Multimedia 2011. He was a Session Chair of ICME 2008 and a Program Committee Member of ACM Multimedia 2011. He is a Reviewer for IEEE Signal Processing Magazine, the IEEE TRANSACTIONS ON MULTIMEDIA, SMC, TKDE, and ACM Multimedia conferences, and an Associate Editor for the International Journal of Computer Applications.

Hongxun Yao (M'03) received the B.S. and M.S. degrees in computer science from the Harbin Shipbuilding Engineering Institute, Harbin, China, in 1987 and 1990, respectively, and the Ph.D. degree in computer science from the Harbin Institute of Technology, Harbin, in 2003. Currently, she is a Professor with the School of Computer Science and Technology, Harbin Institute of Technology. She is the author of five books and over 200 scientific papers. Her research interests include pattern recognition, multimedia processing, and digital watermarking.

Wei Liu (S'06) received the B.S. degree from Zhejiang University, Hangzhou, China, in 2001 and the M.E. degree from the Chinese Academy of Sciences, Beijing, China, in 2004. He is currently working towards the Ph.D. degree in the Department of Electrical Engineering, Columbia University, New York. He was a Research Assistant with the Department of Information Engineering, Chinese University of Hong Kong, Hong Kong.

Xiaoshuai Sun (S'10) is currently working towards the Ph.D. degree at the Harbin Institute of Technology, Harbin, China. His research interests include image and video understanding, with a focus on saliency analysis.

Qi Tian (M'96–SM'03) received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University, Philadelphia, PA, in 1996, and the Ph.D. degree in electrical and computer engineering from the University of Illinois, Urbana–Champaign, Urbana, in 2002. He is currently an Associate Professor with the Department of Computer Science, University of Texas at San Antonio (UTSA), San Antonio. His research interests include multimedia information retrieval and computer vision.

Dr. Tian is the recipient of a Best Student Paper Award at ICASSP 2006, a Best Paper Candidate at PCM 2007, the 2010 ACM Service Award, and the Top 10% Best Paper Award at MMSP 2011. He has been serving as a Program Chair, an Organization Committee Member, and a TPC member for numerous IEEE and ACM conferences, including ACM Multimedia, SIGIR, ICCV, ICME, etc. He was a Guest Editor of the IEEE TRANSACTIONS ON MULTIMEDIA, Computer Vision and Image Understanding, Pattern Recognition Letters, the EURASIP Journal on Advances in Signal Processing, and the Journal of Visual Communication and Image Representation.
He is on the Editorial Board of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the Journal of Multimedia, and Machine Vision and Applications.