Query-Dependent Visual Dictionary Adaptation for Image Reranking

Jialong Wang, Xidian University, Xi'an, 710071 China, wangjialong@gmail.com
Cheng Deng, Xidian University, Xi'an, 710071 China, chdeng@mail.xidian.edu.cn
Wei Liu, IBM T. J. Watson Research Center, Yorktown Heights, NY, USA, weiliu@us.ibm.com
Rongrong Ji, Xiamen University, Xiamen, 361005 China, rrji@xmu.edu.cn
Xiangyu Chen, I2R, A*STAR, Singapore, 138632 Singapore, chenxy@i2r.a-star.edu.sg
Xinbo Gao, Xidian University, Xi'an, 710071 China, xbgao@mail.xidian.edu.cn
ABSTRACT
Although text-based image search engines are popular for retrieving images of interest to users, their ranking performance is still far from satisfactory. One major issue stems from the visual similarity metric used in the ranking operation, which depends solely on visual features. A feasible remedy is to incorporate semantic concepts, also known as image attributes, into image ranking; however, the optimal combination of visual features and image attributes remains unknown. In this paper, we propose a query-dependent image reranking approach that leverages higher-level attribute detection among the top returned images to adapt the dictionary built over visual features in a query-specific fashion. We first learn transposition probabilities between visual codewords and attributes offline, then use these probabilities to adapt the dictionary online, and finally produce a query-dependent, semantics-induced metric for image ranking. Extensive evaluations on several benchmark image datasets demonstrate the effectiveness and efficiency of the proposed approach in comparison with state-of-the-art methods.
Figure 1: Framework of our proposed approach.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms
Algorithms, Experimentation, Performance

Keywords
Image Reranking, Query Dependent, Dictionary Adaptation
1. INTRODUCTION
Most popular web image search engines, such as Google and Microsoft Bing, are built under the "query by keyword" scenario, in which related images are returned based on the associated textual information from web pages, including the title, description, and surrounding captions of the images [1]. However, text-based search schemes are known to be unsatisfying because textual information does not always describe image content accurately. Moreover, mismatches between images and their associated text inevitably introduce irrelevant images into the search results.
To boost the precision of text-based image search, image reranking [2], which refers to refining an initially ranked image list queried by an input keyword, has received increasing attention in recent years. By asking a user to select a query image that reflects his or her search intention, the initial ranking list is reordered using a similarity metric based on visual features, yielding better search performance. One basic assumption of image reranking is that visually similar images tend to be ranked together.
Although many image reranking approaches have been proposed from different perspectives, the major challenge of capturing the user's search intention well remains unsolved. To cope with this challenge, recent efforts focus on "query expansion", which correlates low-level visual features with high-level semantic meanings of images by absorbing more image cues. Existing query expansion methods in image reranking fall into two main categories:
visual expansion and semantic expansion [3]. The goal of visual expansion is to obtain multiple positive instances for learning a robust similarity metric that is specific to a particular query image. As a conventional means, relevance feedback is used to expand the positive instances [4]: a user is asked to label multiple relevant and irrelevant image instances. However, relevance feedback places an extra labeling burden on the user. To reduce this burden, pseudo relevance feedback [5][6] is adopted to expand query images by taking top-ranked results as positive instances and bottom-ranked results as negative instances. These instances are then leveraged to train a classifier that outputs the ranking scores. Unfortunately, methods based on pseudo relevance feedback are not guaranteed to work well due to the existence of falsely relevant images.
The idea of semantic expansion is to expand the original query image with additional query terms relevant to the query keyword [7]. Semantic expansion was initially proposed for document retrieval. Lexical methods leveraged linguistic word relationships, e.g., synonyms and hypernyms, to expand query keywords. Statistical approaches, such as term clustering [8] and Latent Semantic Indexing (LSI) [9], attempted to discover term relationships based on term-document co-occurrence statistics. Other methods attempted to reduce topic drift by seeking patterns that frequently co-occur within the same context rather than across the entire document. However, the additional terms expanded by these methods are not always consistent with the semantic concepts of the query images, which makes the ranking results unsatisfying.
In this paper, we propose a novel image reranking approach with query-dependent visual dictionary adaptation. Given an image retrieval list, we first construct a query-specific transposition probability dictionary between visual features and semantic concepts. Different from most existing methods, we represent semantic concepts as image attributes, which can be obtained using off-the-shelf approaches such as Classeme (http://www.cs.dartmouth.edu/~lorenzo/projects/classemes/) and ObjectBank (http://vision.stanford.edu/projects/objectbank/). For a query image, its salient visual words and attributes can be acquired by encoding with the learned dictionary. On one hand, the semantic concepts of the query image are well expanded; on the other hand, the obtained semantic concepts are not only query-specific but also related to common concepts belonging to the whole dataset. Hence, these image cues can well reflect the user's search intention. To further preserve the visual and semantic similarities between the query image and the reranked top-K images, we then learn a similarity metric under which the query image and its related images are kept as close as possible in a new feature space, while irrelevant images are filtered out.
The contributions of this work are summarized as follows:

(1) We construct a corpus-oriented dictionary which explicitly captures the latent consistency between visual features and semantic concepts. In our work, we use image attributes to represent semantic concepts. To the best of our knowledge, this is the first time such a co-occurring dictionary has been built in the context of image reranking.

(2) We simultaneously accomplish visual expansion and semantic expansion for any query image with the learned co-occurring dictionary, which results in more flexible and accurate image reranking.

(3) We learn a similarity metric which yields a new feature space, where visual similarities and semantic correlations can be well preserved for any query image and its related top-K images. By doing so, the reranking performance can be further boosted.

Figure 2: Similarity metric learning supervised by semantic expansion.

The rest of the paper is organized as follows: we present our approach in Section 2; we describe the experiments, including the experimental settings, results, and discussions, in Section 3; we conclude the paper in Section 4.
2. THE APPROACH
Figure 1 illustrates the proposed image reranking framework, which consists of offline transposition probability learning and online similarity metric learning. We detail the two components in turn below.
2.1 Formulation
Given a query image $q$, its retrieval set is denoted as $\mathcal{I}_q = \{I_1, \cdots, I_N\}$. For each image $I_i$ in $\mathcal{I}_q$, we extract its visual vector $\mathbf{V}^{(i)} = [v_1^{(i)}, v_2^{(i)}, \cdots, v_m^{(i)}]$ and its attribute vector $\mathbf{A}^{(i)} = [a_1^{(i)}, a_2^{(i)}, \cdots, a_n^{(i)}]$.
In our work, Bag-of-Words (BoW) is used to describe visual features, and Classeme is used to extract an attribute vector describing semantic concepts. We form the visual dictionary $\mathbf{V}$ and the semantic dictionary $\mathbf{A}$ on the retrieval set $\mathcal{I}_q$, respectively. The transposition probability matrix $\tilde{\mathbf{W}}$ can then be built based on the co-occurrence of entries between $\mathbf{V}$ and $\mathbf{A}$. For the query image $q$, the visual and semantic query expansions correspond to the significant visual and semantic elements selected according to $\tilde{\mathbf{W}}$, respectively. This procedure can be formulated as
$$\tilde{\mathbf{V}}^{(q)} = f\big(\mathbf{V}^{(q)}, \tilde{\mathbf{W}}\big), \qquad \tilde{\mathbf{A}}^{(q)} = g\big(\mathbf{A}^{(q)}, \tilde{\mathbf{W}}\big). \qquad (1)$$
Here, $f(\cdot)$ and $g(\cdot)$ are the respective selection functions, which will be detailed later.
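As a concrete illustration, the following is a minimal sketch of one plausible instantiation, assuming a simple row-normalized co-occurrence estimate for $\tilde{\mathbf{W}}$ and top-k scoring for the selection step; the function names, the normalization, and the selection rule are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def learn_transposition_matrix(V, A, eps=1e-12):
    """Estimate a codeword-to-attribute transposition probability matrix.

    V: (N, m) BoW histograms of the retrieval set.
    A: (N, n) attribute vectors (e.g., Classeme scores) of the same images.
    Returns W: (m, n) row-normalized co-occurrence statistics
    (an illustrative estimate, not the paper's exact learning step).
    """
    co = V.T @ A                                       # codeword/attribute co-occurrence
    return co / (co.sum(axis=1, keepdims=True) + eps)  # rows as probabilities

def select_expansion(v_q, a_q, W, k_vis=50, k_sem=20):
    """Select salient codewords and attributes for query expansion.

    v_q: (m,) query BoW vector; a_q: (n,) query attribute vector.
    Keeps the top-k entries scored through W (an illustrative selection
    rule standing in for the functions f and g).
    """
    vis_score = v_q * (W @ a_q)    # codewords supported by the query's attributes
    sem_score = a_q * (W.T @ v_q)  # attributes supported by the query's codewords
    return (np.argsort(vis_score)[::-1][:k_vis],
            np.argsort(sem_score)[::-1][:k_sem])
```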
For image reranking, we expect the reordered top-K images to be more relevant to the query image in terms of both visual and semantic similarity. To that end, we propose to learn a similarity metric matrix M online, which adapts the original visual features into a new subspace where
semantics are preserved. Figure 2 shows an exemplar illustration of such a space, in which an image close to the query image should be both visually and semantically relevant. In other words, we aim to learn a semantics-induced manifold structure in the visual feature space that captures the essence of both metrics. This corresponds to minimizing both visual and semantic dissimilarities.
In terms of visual similarity, the accumulated distances between the top-K images and the query image should be minimized:
$$\min_{\mathbf{M}} \sum_{k=1}^{K} \big\|\mathbf{V}^{(k)}\mathbf{M} - \mathbf{V}^{(q)}\mathbf{M}\big\|_2^2. \qquad (2)$$
In terms of semantic relevance, the accumulated distances of attribute vectors between the top-K images and the query image should also be minimized:
$$\min_{\mathbf{M}} \sum_{k=1}^{K} \big\|\mathbf{V}^{(k)}\tilde{\mathbf{W}}\mathbf{M} - \mathbf{V}^{(q)}\tilde{\mathbf{W}}\mathbf{M}\big\|_2^2. \qquad (3)$$
Figure 3: mAP under different numbers of candidates K on three datasets.

Combining Equation (2) and Equation (3), we derive the overall objective function
$$O = \min_{\mathbf{M}} \sum_{k=1}^{K} \big\|\mathbf{V}^{(k)}\mathbf{M} - \mathbf{V}^{(q)}\mathbf{M}\big\|_2^2 + \sum_{k=1}^{K} \big\|\mathbf{V}^{(k)}\tilde{\mathbf{W}}\mathbf{M} - \mathbf{V}^{(q)}\tilde{\mathbf{W}}\mathbf{M}\big\|_2^2 + \lambda\|\mathbf{M}\|_1, \qquad (4)$$
where $\|\cdot\|_1$ is the $\ell_1$-norm of a matrix and $\lambda$ is a constant controlling the degree of sparsity. The main role of the sparse matrix $\mathbf{M}$ is to select only the significant elements, so that the expanded visual features and semantic concepts are not only query-dependent but also corpus-oriented.
Once the sparse metric matrix $\mathbf{M}$ is obtained, we calculate similarities in the new space and use the similarity scores as reranking scores to reorder the search results. The similarity between query $q$ and image $I_j$ is defined as
$$\mathrm{sim}(q, I_j) = \frac{\sum_{i=1}^{d} \min(x_{qi}, x_{ji})}{\sum_{i=1}^{d} \max(x_{qi}, x_{ji})}. \qquad (5)$$
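As a usage illustration, the sketch below applies a learned M and the similarity of Equation (5) to reorder a candidate list; the variable names are assumptions, and the min/max ratio presumes nonnegative (BoW-like) projected features.

```python
import numpy as np

def rerank(v_query, V_candidates, M, eps=1e-12):
    """Rerank candidates by the min/max ratio similarity of Eq. (5),
    computed in the feature space adapted by the metric matrix M.

    v_query: (m,) query BoW vector; V_candidates: (N, m); M: (m, d).
    Returns indices sorted by descending similarity, plus the scores.
    """
    x_q = v_query @ M                       # project the query
    X = V_candidates @ M                    # project the candidates
    sims = (np.minimum(X, x_q).sum(axis=1)
            / (np.maximum(X, x_q).sum(axis=1) + eps))
    return np.argsort(sims)[::-1], sims
```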
2.2 Optimization
The first two terms in Equation (4) are smooth and convex, so a general Smoothing Proximal Gradient (SPG) approach can be adopted in the optimization step. More specifically, we solve this $\ell_1$-norm penalized sparse learning problem with the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) [10]. SPG has been proven to achieve an $O(1/\epsilon)$ convergence rate for a desired accuracy $\epsilon$. The FISTA method is presented in Algorithm 1.

Algorithm 1: FISTA.
1. Input: $\mathbf{V}^{(k)}$, $\tilde{\mathbf{W}}$, $\mathbf{M}_0$, $\lambda$.
2. Initialization: set $\theta_0 = 1$, $\mathbf{X}_0 = \mathbf{M}_0 = \mathbf{I}$.
3. For $t = 0, 1, 2, \ldots$ until convergence of $\mathbf{M}_t$:
4.   Compute $\nabla O(\mathbf{X}_t)$.
5.   Compute the Lipschitz constant $L = \lambda_{\max}(\nabla O(\mathbf{X}_t))$, where $\lambda_{\max}$ denotes the largest eigenvalue.
6.   Perform the generalized gradient update step: $\mathbf{M}_{t+1} = \arg\min_{\mathbf{M}} \frac{1}{2}\big\|\mathbf{M} - \big(\mathbf{X}_t - \frac{1}{L}\nabla H(\mathbf{X}_t)\big)\big\|_2^2 + \frac{\lambda}{L}\|\mathbf{M}\|_1$.
7.   Set $\theta_{t+1} = \frac{2}{t+3}$ and $\mathbf{X}_{t+1} = \mathbf{M}_{t+1} + \frac{1-\theta_t}{\theta_t}\theta_{t+1}(\mathbf{M}_{t+1} - \mathbf{M}_t)$.
8. Output: $\mathbf{M} = \mathbf{M}_{t+1}$.
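To make the optimization concrete, below is a minimal runnable sketch of this FISTA procedure under stated assumptions: the rows of D stack the differences $\mathbf{V}^{(k)} - \mathbf{V}^{(q)}$, $\tilde{\mathbf{W}}$ is taken as a square $m \times m$ surrogate so that both smooth terms of Equation (4) typecheck against one metric matrix, a fixed stepsize from the largest eigenvalue is used, and all helper names are illustrative.

```python
import numpy as np

def soft_threshold(Z, tau):
    """Elementwise soft-thresholding: the proximal operator of tau * ||.||_1."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def fista_metric(D, W, lam=0.05, max_iter=500, tol=1e-6):
    """FISTA sketch for Eq. (4), following Algorithm 1.

    D: (K, m), rows are V^(k) - V^(q).  W: (m, m) square surrogate for
    the transposition matrix (assumption for dimensional consistency).
    Smooth part: H(M) = ||D M||_F^2 + ||D W M||_F^2, whose gradient is
    grad H(M) = 2 Q M with Q = D^T D + W^T D^T D W.
    """
    m = D.shape[1]
    DtD = D.T @ D
    Q = DtD + W.T @ DtD @ W                      # PSD matrix defining the gradient
    L = 2.0 * np.linalg.eigvalsh(Q)[-1] + 1e-12  # Lipschitz constant of grad H
    M = np.eye(m)
    theta = 1.0
    X = M.copy()
    for t in range(max_iter):
        grad = 2.0 * Q @ X                              # gradient of smooth part at X
        M_next = soft_threshold(X - grad / L, lam / L)  # generalized gradient step
        theta_next = 2.0 / (t + 3.0)
        X = M_next + ((1.0 - theta) / theta) * theta_next * (M_next - M)
        if np.linalg.norm(M_next - M) <= tol * max(np.linalg.norm(M), 1.0):
            return M_next
        M, theta = M_next, theta_next
    return M
```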
3. EXPERIMENTS

In this section, we first present the experimental settings and then illustrate and discuss the experimental results. We also compare our approach with state-of-the-art reranking methods, including noise-resistant graph-based image reranking (NRGIR) [11] and random-walk-based image reranking (RWIR) [2], to demonstrate its effectiveness.

3.1 Experimental Settings
Datasets: Experiments are conducted on four popular datasets: Oxford Buildings (http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/), Paris (http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/), INRIA Holidays (http://lear.inrialpes.fr/~jegou/data.php), and UKBench (http://vis.uky.edu/~stewe/ukbench/). Oxford and Paris contain 5,062 and 6,412 images, respectively, all provided with manually annotated ground truth. INRIA Holidays includes 1,491 relevant images of 500 scenes or objects, where the first image of each group is used as the query. UKBench contains 10,200 images, grouped in fours that show the same object.
Feature: We use dense SIFT descriptors [12] computed from 16 × 16 image patches with a step size of 8 pixels using the VLFeat library. A visual dictionary of 1,024 words is then constructed from one million descriptors. We use 1 × 1, 2 × 2, and 3 × 1 sub-regions to compute BoW histograms, which are concatenated as the final visual feature. In addition, we extract an attribute vector for each image with Classeme.
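To illustrate the pooling scheme, here is a minimal sketch that builds the pyramid BoW from precomputed codeword assignments and patch locations; the function name and inputs are assumptions, and the dense-SIFT extraction itself is left to VLFeat. The 3 × 1 grid is read here as three horizontal stripes.

```python
import numpy as np

def pyramid_bow(assignments, xs, ys, width, height, K=1024):
    """Concatenate BoW histograms over 1x1, 2x2, and 3x1 sub-regions.

    assignments: (P,) codeword index of each dense-SIFT patch.
    xs, ys: (P,) patch center coordinates; width, height: image size.
    Returns an L1-normalized feature of length (1 + 4 + 3) * K.
    """
    feats = []
    for gx, gy in [(1, 1), (2, 2), (1, 3)]:            # grid = cols x rows
        cols = np.minimum((xs * gx // width).astype(int), gx - 1)
        rows = np.minimum((ys * gy // height).astype(int), gy - 1)
        for r in range(gy):
            for c in range(gx):
                mask = (rows == r) & (cols == c)
                feats.append(np.bincount(assignments[mask], minlength=K))
    f = np.concatenate(feats).astype(float)
    return f / max(f.sum(), 1.0)                       # L1 normalization
```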
Baselines: We design two baselines to show the improvement in retrieval accuracy: (I) visual-feature-based reranking, which uses only the BoW visual feature to evaluate the similarity between images; and (II) semantic-feature-based reranking, which uses only the attribute vector to evaluate image similarity.
Evaluation Metric: We use mean Average Precision (mAP) to evaluate performance on the first three datasets, while performance on UKBench is measured by the average number of correct images among the top four returned, denoted as Ave. Top Num.
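For reference, a minimal sketch of the standard AP/mAP computation over ranked lists follows; the exact evaluation protocol of each benchmark (e.g., "junk" image handling in Oxford) may differ slightly.

```python
import numpy as np

def average_precision(ranked_ids, relevant_ids):
    """AP of one ranked list: precision at each hit, averaged over
    the total number of relevant images."""
    relevant = set(relevant_ids)
    hits, prec_sum = 0, 0.0
    for rank, img_id in enumerate(ranked_ids, start=1):
        if img_id in relevant:
            hits += 1
            prec_sum += hits / rank        # precision at this recall point
    return prec_sum / max(len(relevant), 1)

def mean_average_precision(all_ranked, all_relevant):
    """mAP: average AP over all queries."""
    return float(np.mean([average_precision(r, g)
                          for r, g in zip(all_ranked, all_relevant)]))
```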
Figure 4: (a) Comparison results and (b) visual reranking results on the different datasets.
3.2 Results and Analysis
Parameter Tuning: In our method, the top-K candidates for the query image are considered when evaluating reranking performance. In learning the objective function, we set λ = 0.05.
We first evaluate the performance of our approach under different numbers of top candidates K on baseline I. Figure 3 shows the performance on the first three datasets as K varies. As K becomes larger, the mAP values on Oxford decrease: with K increasing from 20 to 300, the mAP drops from 0.87 to 0.65, as shown in Figure 3. A similar trend is observed on all the other datasets, for which we use the same range (K from 20 to 300). As a special case, since each query in UKBench has only three relevant images, K is set to 4 there. In the subsequent experiments, unless otherwise specified, we fix K = 200 for all datasets except UKBench.
Comparison Results: Figure 4(a) illustrates the mAP of the baselines and several state-of-the-art methods on the four datasets. As shown in Figure 4(a), all reranking methods are superior to baseline I, i.e., direct reranking based on the BoW visual feature. The mAP of our approach is about 0.14 higher than that of baseline I on the first three datasets, and its Ave. Top Num. is about 1.01 higher on UKBench. Similarly, compared with NRGIR and RWIR, our approach obtains improvements of 0.23 and 0.1 on the first three datasets, and of 1.43 and nearly 1.0 on UKBench, respectively. Figure 4(b) shows some visual results of our image reranking, in which the three rows for each query are the results of NRGIR, RWIR, and our approach.
4. CONCLUSIONS

In this paper, we have proposed an image reranking approach using query-dependent visual dictionary adaptation. Through offline visual-semantic co-occurring dictionary learning, we not only effectively capture the query-specific relationship between low-level visual features and high-level semantic concepts but also sensibly extend the query image's semantic space, thus capturing the user's search intention. Furthermore, we conduct similarity metric learning supervised by the attribute-based semantic concepts, which brings the query image and its relevant images closer in a new feature space. The experimental results show that our approach performs significantly better than state-of-the-art methods.

Although our approach achieves good performance, it is more suitable for constrained datasets than for unconstrained web-scale image collections. Its main drawback is that a limited number of semantic concepts cannot cover the enormous variety of images on the Internet. Our future work will therefore focus on obtaining more diverse and accurate semantic concepts by mining more textual information from massive Internet images. We also plan to explore scalable similarity metric learning using robust large graphs [13] and to accelerate the reranking operation using principled hashing methods such as [14][15].

5. ACKNOWLEDGMENTS
We thank the anonymous reviewers for their helpful comments and suggestions. This research was supported in part by the National Natural Science Foundation of China (Nos. 61125204 and 61101250), the Program for New Century Excellent Talents in University (NCET-12-0917), and the Program for New Scientific and Technological Star of Shaanxi Province (No. 2012KJXX-24).
6. REFERENCES
[1] X. Tian, L. Yang, J. Wang, Y. Yang, X. Wu, and X.-S. Hua.
Bayesian video search reranking. In Proc. ACM Multimedia,
2008.
[2] W. Hsu, L. Kennedy, and S.-F. Chang. Reranking methods for
visual search. IEEE Multimedia, 14(3):14–22, 2007.
[3] X. Tang, K. Liu, J. Cui, F. Wen, and X. Wang. Intentsearch:
Capturing user intention for one-click internet image search.
IEEE Trans. PAMI, 34(7):1342–1353, 2012.
[4] Y. Lu, H. Zhang, L. Wenyin, and C. Hu. Joint semantics and feature based image retrieval using relevance feedback. IEEE Trans. Multimedia, 5(3):339–347, 2003.
[5] R. Yan, A. G. Hauptmann, and R. Jin. Negative
pseudo-relevance feedback in content-based video retrieval. In
Proc. ACM Multimedia, 2003.
[6] N. Morioka and J. Wang. Robust visual reranking via sparsity
and ranking constraints. In Proc. ACM Multimedia, 2011.
[7] A. Natsev, A. Haubold, J. Tešić, L. Xie, and R. Yan. Semantic
concept-based query expansion and re-ranking for multimedia
retrieval. In Proc. ACM Multimedia, 2007.
[8] K. Sparck Jones. Automatic Keyword Classification for
Information Retrieval. Archon Books, 1971.
[9] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[10] A. Beck and M. Teboulle. A fast iterative
shrinkage-thresholding algorithm for linear inverse problems.
SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
[11] W. Liu, Y.-G. Jiang, J. Luo, and S.-F. Chang. Noise resistant
graph ranking for improved web image search. In Proc. CVPR,
2011.
[12] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features:
Spatial pyramid matching for recognizing natural scene
categories. In Proc. CVPR, 2006.
[13] W. Liu, J. He, and S.-F. Chang. Large graph construction for
scalable semi-supervised learning. In Proc. ICML, 2010.
[14] W. Liu, J. Wang, S. Kumar, and S.-F. Chang. Hashing with
graphs. In Proc. ICML, 2011.
[15] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang.
Supervised hashing with kernels. In Proc. CVPR, 2012.