Query-Dependent Visual Dictionary Adaptation for Image Reranking

Jialong Wang, Cheng Deng, Xinbo Gao
Xidian University, Xi'an, 710071, China
wangjialong@gmail.com, chdeng@mail.xidian.edu.cn, xbgao@mail.xidian.edu.cn

Wei Liu
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
weiliu@us.ibm.com

Rongrong Ji
Xiamen University, Xiamen, 361005, China
rrji@xmu.edu.cn

Xiangyu Chen
I2R, A*STAR, Singapore, 138632, Singapore
chenxy@i2r.a-star.edu.sg

ABSTRACT
Although text-based image search engines are popular for ranking images of a user's interest, state-of-the-art ranking performance is still far from satisfactory. One major issue stems from the visual similarity metric used in the ranking operation, which depends solely on visual features. A feasible way to tackle this issue is to incorporate semantic concepts, also known as image attributes, into image ranking; however, the optimal combination of visual features and image attributes remains unknown. In this paper, we propose a query-dependent image reranking approach that leverages higher-level attribute detection among the top returned images to adapt the dictionary built over visual features in a query-specific fashion. We first learn transposition probabilities between visual codewords and attributes offline, then use these probabilities to adapt the dictionary online, and finally produce a query-dependent, semantics-induced metric for image ranking. Extensive evaluations on several benchmark image datasets demonstrate the effectiveness and efficiency of the proposed approach in comparison with state-of-the-art methods.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Experimentation, Performance

Keywords
Image Reranking, Query Dependent, Dictionary Adaptation

Figure 1: Framework of our proposed approach.

1. INTRODUCTION
Most popular web image search engines, such as Google and Microsoft Bing, are built under the "query by keyword" scenario, in which related images are returned by using the textual information associated with web pages, including the title, description, and surrounding captions of the images [1]. However, text-based search schemes are known to be unsatisfying because textual information does not always describe image content accurately. Moreover, mismatches between images and their associated text inevitably introduce irrelevant images into the search results.

To boost the precision of text-based image search, image reranking [2], which refers to refining an initially ranked image list queried by an input keyword, has received increasing attention in recent years. By asking a user to select a query image that reflects the user's search intention, the initial ranking list is reordered by exploiting a similarity metric based on visual features to achieve better search performance. One basic assumption of image reranking is that visually similar images tend to be ranked together. Although many image reranking approaches have been proposed from different perspectives, the major challenge of capturing the user's search intention well remains unsolved. To cope with this challenge, recent efforts focus on "query expansion", which correlates low-level visual features with high-level semantic meanings of images by absorbing more image cues.
The existing query expansion techniques in image reranking can be classified into two main categories: visual expansion and semantic expansion [3].

The goal of visual expansion is to obtain multiple positive instances in order to learn a robust similarity metric that is specific to a particular query image. As a conventional means, relevance feedback is used to expand the positive instances [4]: a user is asked to label multiple relevant and irrelevant image instances. However, relevance feedback imposes an extra labeling burden on the user. To reduce this burden, pseudo relevance feedback [5][6] expands query images by taking top-ranked results as positive instances and bottom-ranked results as negative instances; these instances are then used to train a classifier that outputs the ranking scores. Unfortunately, methods based on pseudo relevance feedback are not guaranteed to work well owing to the presence of falsely relevant images.

The idea of semantic expansion, initially proposed for document retrieval, is to expand the original query image with additional query terms that are relevant to the query keyword [7]. Lexical methods leverage linguistic word relationships, e.g., synonyms and hypernyms, to expand query keywords. Statistical approaches, such as term clustering [8] and Latent Semantic Indexing (LSI) [9], attempt to discover term relationships based on term-document co-occurrence statistics. Other methods attempt to reduce topic drift by seeking frequently co-occurring patterns within the same context rather than the entire document. However, the additional terms produced by these methods are not always consistent with the semantic concepts of the query images, which makes the ranking results unsatisfying.

In this paper, we propose a novel image reranking approach with query-dependent visual dictionary adaptation. Given an image retrieval list, we first construct a query-specific transposition probability dictionary between visual features and semantic concepts. Different from most existing methods, we represent semantic concepts as image attributes, which can be obtained with off-the-shelf approaches such as Classeme (http://www.cs.dartmouth.edu/~lorenzo/projects/classemes/) and ObjectBank (http://vision.stanford.edu/projects/objectbank/). For a query image, its salient visual words and attributes are acquired by encoding with the learned dictionary. On one hand, the semantic concepts of the query image are well expanded; on the other hand, the obtained semantic concepts are not only query-specific but also related to the common concepts of the whole dataset. Hence, these image cues can well reflect the user's search intention.
To further preserve the similarities between the query image and the reranked top-K images both visually and semantically, we then learn a similarity metric under which the query image and its related images are kept as close as possible in a new feature space, while irrelevant images are successfully filtered out.

The contributions of this work are summarized as follows:

(1) We construct a corpus-oriented dictionary that explicitly captures the latent consistency between visual features and semantic concepts. In our work, we use image attributes to represent semantic concepts. To the best of our knowledge, this is the first time such a co-occurring dictionary has been built in the context of image reranking.

(2) We simultaneously accomplish visual expansion and semantic expansion for any query image with the learned co-occurring dictionary, which results in more flexible and accurate image reranking.

(3) We learn a similarity metric that yields a new feature space in which visual similarities and semantic correlations are well preserved for any query image and its related top-K images. By doing so, the reranking performance is further boosted.

The rest of the paper is organized as follows: we present our approach in Section 2, describe the experiments, including the experimental settings, results, and discussions, in Section 3, and conclude the paper in Section 4.

2. THE APPROACH
Figure 1 illustrates the proposed image reranking framework, which consists of offline transposition probability learning and online similarity metric learning. We detail the two components below.

2.1 Formulation
Given a query image $q$, its retrieval set is denoted as $I_q = \{I_1, \cdots, I_N\}$. For each image $I_i$ in $I_q$, we extract its visual vector $\mathbf{V}^{(i)} = [v_1^{(i)}, v_2^{(i)}, \cdots, v_m^{(i)}]$ and its attribute vector $\mathbf{A}^{(i)} = [a_1^{(i)}, a_2^{(i)}, \cdots, a_n^{(i)}]$. In our work, Bag-of-Words (BoW) is used to describe visual features, and Classeme is used to extract the attribute vector describing semantic concepts. We form a visual dictionary $\mathbf{V}$ and a semantic dictionary $\mathbf{A}$ on the retrieval set $I_q$, respectively. The transposition probability matrix $\tilde{\mathbf{W}}$ can then be built from the co-occurrence of entries between $\mathbf{V}$ and $\mathbf{A}$. For the query image $q$, the visual and semantic query expansions correspond to the significant visual and semantic elements selected according to $\tilde{\mathbf{W}}$. This procedure can be formulated as

$$\tilde{\mathbf{V}}^{(q)} = f\big(\mathbf{V}^{(q)}, \tilde{\mathbf{W}}\big), \qquad \tilde{\mathbf{A}}^{(q)} = g\big(\mathbf{A}^{(q)}, \tilde{\mathbf{W}}\big). \tag{1}$$

Here, $f(\cdot)$ and $g(\cdot)$ are the respective selection functions, which will be detailed later.
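To make the formulation concrete, the following is a minimal NumPy sketch of how the transposition probability matrix $\tilde{\mathbf{W}}$ could be estimated from visual-attribute co-occurrence over the retrieval set, and how selection functions $f$ and $g$ might pick the significant elements of Equation (1). The row normalization and the top-t thresholding rule are our own illustrative assumptions; the paper does not commit to a specific form for $f$ and $g$.

```python
import numpy as np

def transposition_matrix(V, A, eps=1e-12):
    """Estimate the m x n transposition probability matrix from the
    co-occurrence of visual words and attributes over the retrieval set.
    V: N x m BoW histograms; A: N x n attribute activations.
    Row i is normalized to read as P(attribute j | visual word i)."""
    C = V.T @ A                               # m x n co-occurrence counts
    return C / (C.sum(axis=1, keepdims=True) + eps)

def expand_query(v_q, a_q, W, top=20):
    """Illustrative selection functions f and g of Eq. (1): keep only the
    elements best supported by cross-modal evidence propagated through W."""
    v_scores = v_q * (W @ a_q)                # visual words backed by attributes
    a_scores = a_q * (W.T @ v_q)              # attributes backed by visual words
    v_sel, a_sel = np.zeros_like(v_q), np.zeros_like(a_q)
    v_idx = np.argsort(v_scores)[::-1][:top]  # indices of the top visual words
    a_idx = np.argsort(a_scores)[::-1][:top]  # indices of the top attributes
    v_sel[v_idx], a_sel[a_idx] = v_q[v_idx], a_q[a_idx]
    return v_sel, a_sel
```

In this reading, an element survives the selection only when its own activation and the evidence propagated from the other modality agree, which matches the intuition that the expansion should be both query-specific and corpus-oriented.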
For image reranking, we hope the reordered top-K images are more related to the query image in terms of both visual and semantic similarity. To that end, we propose to learn a similarity metric matrix $\mathbf{M}$ online, which adapts the original visual features into a new subspace where semantics are preserved.

Figure 2: Similarity metric learning supervised by semantic expansion.

Figure 2 shows an exemplar illustration of such a space, where an image close to the query image should be both visually and semantically relevant. In other words, we aim to learn a semantics-induced manifold structure in the visual feature space that captures the essence of both metrics. This corresponds to minimizing both visual and semantic dissimilarities.

In terms of visual similarity, the accumulated distances between the top-K images and the query image should be minimized:

$$\min_{\mathbf{M}} \sum_{k=1}^{K} \|\mathbf{V}^{(k)}\mathbf{M} - \mathbf{V}^{(q)}\mathbf{M}\|_2^2. \tag{2}$$

In terms of semantic relevance, the accumulated distances of attribute vectors between the top-K images and the query image should be minimized:

$$\min_{\mathbf{M}} \sum_{k=1}^{K} \|\mathbf{V}^{(k)}\tilde{\mathbf{W}}\mathbf{M} - \mathbf{V}^{(q)}\tilde{\mathbf{W}}\mathbf{M}\|_2^2. \tag{3}$$

Combining Equations (2) and (3), we derive the overall objective function

$$O = \min_{\mathbf{M}} \sum_{k=1}^{K} \|\mathbf{V}^{(k)}\mathbf{M} - \mathbf{V}^{(q)}\mathbf{M}\|_2^2 + \sum_{k=1}^{K} \|\mathbf{V}^{(k)}\tilde{\mathbf{W}}\mathbf{M} - \mathbf{V}^{(q)}\tilde{\mathbf{W}}\mathbf{M}\|_2^2 + \lambda\|\mathbf{M}\|_1, \tag{4}$$

where $\|\cdot\|_1$ is the $\ell_1$-norm of a matrix and $\lambda$ is a constant controlling the degree of sparsity. The main role of the sparse matrix $\mathbf{M}$ is to select only the significant elements; thus, the expanded visual features and semantic concepts are not only query-dependent but also corpus-oriented.

Once the sparse metric matrix $\mathbf{M}$ is obtained, we calculate similarities in the new space and use the similarity scores as reranking scores to reorder the search results. The similarity between the query $q$ and an image $I_j$ is defined as

$$\mathrm{sim}(q, I_j) = \frac{\sum_{i=1}^{d} \min(x_{qi}, x_{ji})}{\sum_{i=1}^{d} \max(x_{qi}, x_{ji})}. \tag{5}$$

2.2 Optimization
The first two terms in Equation (4) are convex and smooth, so the overall objective is an $\ell_1$-penalized convex problem. Therefore, a general Smoothing Proximal Gradient (SPG) approach is adopted in the optimization step; more specifically, we solve this $\ell_1$-norm penalized sparse learning problem with the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [10]. SPG has been proven to achieve an $O(1/\epsilon)$ convergence rate for a desired accuracy $\epsilon$. The FISTA method is presented in Algorithm 1.

Algorithm 1: FISTA Algorithm.
1  Input: $\mathbf{V}^{(k)}$, $\tilde{\mathbf{W}}$, $\mathbf{M}_0$, $\lambda$.
2  Initialization: set $\theta_0 = 1$, $\mathbf{X}_0 = \mathbf{M}_0 = \mathbf{I}$.
3  for $t = 0, 1, 2, \ldots$ until convergence of $\mathbf{M}_t$ do
4      Compute $\nabla H(\mathbf{X}_t)$, the gradient of the smooth part $H$ of Equation (4).
5      Compute the Lipschitz constant $L$ of $\nabla H$, i.e., the largest eigenvalue of the (constant) Hessian of $H$.
6      Perform the generalized gradient update step:
       $\mathbf{M}_{t+1} = \arg\min_{\mathbf{M}} \frac{1}{2}\big\|\mathbf{M} - \big(\mathbf{X}_t - \frac{1}{L}\nabla H(\mathbf{X}_t)\big)\big\|_2^2 + \frac{\lambda}{L}\|\mathbf{M}\|_1$.
7      Set $\theta_{t+1} = \frac{2}{t+3}$.
8      Set $\mathbf{X}_{t+1} = \mathbf{M}_{t+1} + \frac{1-\theta_t}{\theta_t}\theta_{t+1}(\mathbf{M}_{t+1} - \mathbf{M}_t)$.
9  end
10 Output: $\mathbf{M} = \mathbf{M}_{t+1}$.
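Because the smooth part $H$ of Equation (4) is quadratic, Algorithm 1 can be written out in a few lines. The sketch below is a non-authoritative illustration under one assumption the paper leaves implicit: for the two terms of $H$ to act on the same $\mathbf{M}$, we treat $\tilde{\mathbf{W}}$ as already brought into an $m \times m$ visual-space form (e.g., $\tilde{\mathbf{W}}\tilde{\mathbf{W}}^{\top}$, propagating visual words through attributes and back). With $\mathbf{D}_k = \mathbf{V}^{(k)} - \mathbf{V}^{(q)}$ and $\mathbf{S} = \sum_k \mathbf{D}_k^{\top}\mathbf{D}_k$, the gradient is $\nabla H(\mathbf{M}) = 2(\mathbf{S} + \tilde{\mathbf{W}}^{\top}\mathbf{S}\tilde{\mathbf{W}})\mathbf{M}$, and the proximal step of line 6 has the closed-form soft-thresholding solution used below.

```python
import numpy as np

def fista_metric(Vk, v_q, W, lam=0.05, iters=200):
    """Sketch of Algorithm 1 (FISTA) for the objective of Eq. (4).
    Vk: K x m visual vectors of the top-K images; v_q: query vector (m,);
    W: assumed m x m visual-space form of W-tilde (see lead-in).
    Smooth part H(M) = ||D M||_F^2 + ||D W M||_F^2 with D = Vk - v_q."""
    D = Vk - v_q                                 # K x m difference vectors
    S = D.T @ D                                  # sum_k D_k^T D_k
    Q = S + W.T @ S @ W                          # constant Hessian factor of H
    L = 2.0 * np.linalg.eigvalsh(Q).max()        # Lipschitz constant of grad H
    M = X = np.eye(Vk.shape[1])                  # line 2: X_0 = M_0 = I
    theta = 1.0
    for t in range(iters):
        G = X - (2.0 / L) * (Q @ X)              # gradient step on H (line 6)
        M_next = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)  # l1 prox
        theta_next = 2.0 / (t + 3)               # line 7
        X = M_next + (1 - theta) / theta * theta_next * (M_next - M)  # line 8
        M, theta = M_next, theta_next
    return M                                     # sparse metric matrix
```

The soft-thresholding line solves the argmin of line 6 exactly, which is what makes FISTA attractive here: each iteration costs only matrix products, and the $\ell_1$ penalty is handled in closed form.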
3. EXPERIMENTS
In this section, we first present the experimental settings and then illustrate and discuss the experimental results. We also compare our approach with state-of-the-art reranking methods, namely noise resistant graph-based image reranking (NRGIR) [11] and random walk-based image reranking (RWIR) [2], to demonstrate its effectiveness.

3.1 Experimental Settings
Datasets: Experiments are conducted on four popular datasets: Oxford Buildings (http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/), Paris (http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/), INRIA Holidays (http://lear.inrialpes.fr/~jegou/data.php), and UKBench (http://vis.uky.edu/~stewe/ukbench/). Oxford and Paris contain 5,062 and 6,412 images, respectively, all provided with manually annotated ground truth. INRIA Holidays includes 1,491 relevant images of 500 scenes or objects, where the first image of each group is used as a query. UKBench contains 10,200 images organized in groups of four that show the same object.

Features: We use dense SIFT descriptors [12] computed from 16 × 16 image patches with a step size of 8 pixels using the VLFeat library. A 1,024-word visual vocabulary is then constructed from one million descriptors. We compute BoW histograms over 1 × 1, 2 × 2, and 3 × 1 sub-regions and concatenate them as the final visual feature. In addition, we extract an attribute vector for each image with Classeme.

Baselines: We design two baselines to measure the improvement in retrieval accuracy: (I) visual feature based reranking, which uses the BoW visual feature alone to evaluate the similarity between images; and (II) semantic feature based reranking, which uses the attribute feature alone.

Evaluation Metric: We use mean Average Precision (mAP) to evaluate the performance on the first three datasets, while the performance on UKBench is measured as the average number of correct images among the top-4 returned results, denoted as Ave. Top Num.

3.2 Results and Analysis
Parameter Tuning: Our method considers the top-K dataset candidates for the query image $I_q$ when evaluating reranking performance. In learning the objective function, we set λ = 0.05. We first evaluate the performance of our approach under different numbers of top dataset candidates K on baseline I.

Figure 3: mAP under different numbers of candidates K for the three datasets.

Figure 3 shows the performance on the first three datasets as K varies. When K becomes larger, the mAP values on Oxford decrease: as K increases from 20 to 300, the mAP drops from 0.87 to 0.65. A similar trend is observed on the other datasets, for which we use the same setting (K from 20 to 300). Since each query in UKBench has only three relevant images, K is set to 4 there. In the subsequent experiments, unless otherwise specified, we fix K = 200 for all datasets except UKBench.

Comparison Results: Figure 4(a) reports the performance of the baselines and several state-of-the-art methods on the four datasets. As shown in Figure 4(a), all reranking methods outperform baseline I, i.e., direct reranking based on the BoW visual feature. Our approach is about 0.14 higher in mAP than baseline I on the first three datasets and 1.01 higher in Ave. Top Num. on UKBench. Similarly, compared with NRGIR and RWIR, our approach obtains improvements of 0.23 and 0.1 on the first three datasets, and 1.43 and nearly 1.0 on UKBench, respectively.

Figure 4: (a) Comparison results and (b) visual results on the different datasets.

Figure 4(b) shows some visual results of our image reranking, in which the three rows for each query are the results of NRGIR, RWIR, and our approach.
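For reference, here is a compact sketch of the reranking score of Equation (5) computed in the metric-adapted space, together with the average precision underlying the mAP figures above. The function and variable names are illustrative, not from the paper.

```python
import numpy as np

def rerank_scores(x_q, X, M):
    """Eq. (5): min/max ratio similarity between the query and each
    candidate, computed on features adapted by the sparse metric M.
    x_q: query visual vector (m,); X: N x m candidate matrix."""
    q, Z = x_q @ M, X @ M                        # project into the new space
    num = np.minimum(Z, q).sum(axis=1)           # elementwise min vs. query
    den = np.maximum(Z, q).sum(axis=1)           # elementwise max vs. query
    return num / np.maximum(den, 1e-12)          # one score per candidate

def average_precision(relevant_in_rank_order):
    """AP of one ranked list; input is a boolean relevance array in rank
    order. mAP is the mean of this value over all queries."""
    rel = np.asarray(relevant_in_rank_order, dtype=float)
    if rel.sum() == 0:
        return 0.0
    ranks = np.flatnonzero(rel) + 1              # 1-indexed ranks of the hits
    return float((np.cumsum(rel)[rel > 0] / ranks).mean())

# Usage on toy data: rerank one query, then score the permuted ground truth.
# order = np.argsort(-rerank_scores(x_q, X, M))
# ap = average_precision(ground_truth[order])
```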
4. CONCLUSIONS
In this paper, we propose an image reranking approach using query-dependent visual dictionary adaptation. Through offline visual-semantic co-occurring dictionary learning, we not only effectively capture the query-specific relationship between low-level visual features and high-level semantic concepts but also judiciously extend the query image's semantic space, thus capturing the user's search intention. Furthermore, we conduct similarity metric learning supervised by the attribute-based semantic concepts, which brings the query image and its relevant images closer in a new feature space. The experimental results show that our approach performs significantly better than the state-of-the-art methods.

Although our approach achieves good performance, it is more suitable for constrained datasets than for unconstrained web-scale image collections: its main drawback is that a limited number of semantic concepts cannot cover the enormous variety of images on the Internet. Therefore, our future work will focus on gaining more diverse and accurate semantic concepts by mining more textual information from massive Internet images. We also plan to explore scalable similarity metric learning using robust large graphs [13] and to accelerate the reranking operation using principled hashing methods such as [14][15].

5. ACKNOWLEDGMENTS
We thank the anonymous reviewers for their helpful comments and suggestions. This research was supported in part by the National Natural Science Foundation of China (Nos. 61125204 and 61101250), the Program for New Century Excellent Talents in University (NCET-12-0917), and the Program for New Scientific and Technological Star of Shaanxi Province (No. 2012KJXX-24).

6. REFERENCES
[1] X. Tian, L. Yang, J. Wang, Y. Yang, X. Wu, and X.-S. Hua. Bayesian video search reranking. In Proc. ACM Multimedia, 2008.
[2] W. Hsu, L. Kennedy, and S.-F. Chang. Reranking methods for visual search. IEEE Multimedia, 14(3):14–22, 2007.
[3] X. Tang, K. Liu, J. Cui, F. Wen, and X. Wang. Intentsearch: Capturing user intention for one-click internet image search. IEEE Trans. PAMI, 34(7):1342–1353, 2012.
[4] Y. Lu, H. Zhang, L. Wenyin, and C. Hu. Joint semantics and feature based image retrieval using relevance feedback. IEEE Trans. Multimedia, 5(3):339–347, 2003.
[5] R. Yan, A. G. Hauptmann, and R. Jin. Negative pseudo-relevance feedback in content-based video retrieval. In Proc. ACM Multimedia, 2003.
[6] N. Morioka and J. Wang. Robust visual reranking via sparsity and ranking constraints. In Proc. ACM Multimedia, 2011.
[7] A. Natsev, A. Haubold, J. Tešić, L. Xie, and R. Yan. Semantic concept-based query expansion and re-ranking for multimedia retrieval. In Proc. ACM Multimedia, 2007.
[8] K. Sparck Jones. Automatic Keyword Classification for Information Retrieval. Archon Books, 1971.
[9] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[10] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[11] W. Liu, Y.-G. Jiang, J. Luo, and S.-F. Chang. Noise resistant graph ranking for improved web image search. In Proc. CVPR, 2011.
[12] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proc. CVPR, 2006.
[13] W. Liu, J. He, and S.-F. Chang. Large graph construction for scalable semi-supervised learning. In Proc. ICML, 2010.
[14] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with graphs. In Proc. ICML, 2011.
[15] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. CVPR, 2012.