Task-Dependent Visual-Codebook Compression
Rongrong Ji, Member, IEEE, Hongxun Yao, Member, IEEE, Wei Liu, Student Member, IEEE,
Xiaoshuai Sun, Student Member, IEEE, and Qi Tian, Senior Member, IEEE
Abstract—A visual codebook serves as a fundamental component in many state-of-the-art computer vision systems. Most
existing codebooks are built based on quantizing local feature
descriptors extracted from training images. Subsequently, each
image is represented as a high-dimensional bag-of-words histogram. Such a highly redundant image description lacks efficiency in both storage and retrieval because only a few bins are nonzero and they are sparsely distributed. Furthermore, most existing codebooks
are built based solely on the visual statistics of local descriptors,
without considering the supervised labels coming from the subsequent recognition or classification tasks. In this paper, we propose
a task-dependent codebook compression framework to handle the
above two problems. First, we propose to learn a compression
function to map an originally high-dimensional codebook into a
compact codebook while maintaining its visual discriminability.
This is achieved by a codeword sparse coding scheme with Lasso
regression, which minimizes the descriptor distortions of training
images after codebook compression. Second, we propose to adapt
our codebook compression to the subsequent recognition or classification tasks. This is achieved by introducing a label constraint
kernel (LCK) into our compression loss function. In particular,
our LCK can model heterogeneous kinds of supervision, i.e., (partial) category labels, correlative semantic annotations, and image
query logs. We validated our codebook compression in three computer vision tasks: 1) object recognition in PASCAL Visual Object
Class 07; 2) near-duplicate image retrieval in UKBench; and 3)
web image search in a collection of 0.5 million Flickr photographs.
Our compressed codebook has shown superior performance over
several state-of-the-art supervised and unsupervised codebooks.
Index Terms—Image retrieval, indexing, local feature, object
classification, supervised quantization, visual codebook.
I. INTRODUCTION
VISUAL-CODEBOOK representations are widely used in many computer vision tasks such as visual search, object categorization, scene recognition, and video event detection. In a typical scenario, given the local feature descriptors extracted
from training images, a visual codebook is built by quantizing these descriptors into discrete codeword regions in the descriptor feature space. Subsequently, each image is represented as a bag-of-words (BoW) histogram, which is robust against photographing variances in viewpoint, scale, and occlusion.
One important issue of this visual-codebook representation
is its dimensionality. Generally speaking, most existing codebooks require more than 10 000 codewords to be tuned optimally [1]–[5]. Such a high dimensionality usually introduces an obvious computational cost in both processing time (to match visual descriptors online) and storage space (to maintain the search model in memory). This is especially critical for the state-of-the-art mobile visual search scenario [6], [7], where mobile devices directly extract the BoW histogram and transmit it over the wireless link. At the same time, since the number of local features extracted from each image is limited (typically hundreds), each image is represented as a sparse histogram with only a few nonzero codewords. Moreover, due to the curse of dimensionality in the subsequent classifier training procedure, most object recognition systems expect the BoW histogram to be compact. Motivated by the above contradictions, our first goal is to compress the state-of-the-art high-dimensional codebooks [1], [3], [4], [8], [9] to improve their storage and retrieval efficiency while maintaining the visual discriminative capability.
Our second goal is to introduce the supervised labels from
the subsequent recognition or classification tasks to guide
our codebook compression. Our inspirations are twofold:
First, most existing visual codebooks [1], [3]–[5], [8]–[11] are
built based solely on the visual statistics of training images,
without considering their semantic labels to improve their
discriminability. Second, recent works in supervised codebook
learning [12]–[15] have directly incorporated the supervised labels into the initial codebook building procedure, which is computationally burdensome and not scalable to new labels, new data sets, or new tasks (even within the same data set). In addition, by binding supervision to codebook building, the existing supervised codebook learning strategies cannot be reused among each other [12]–[15]. An alternative approach
is to refine an initial codebook. However, existing works
[16]–[18] are not sufficiently flexible and scalable to model
heterogeneous kinds of supervised labels, such as correlative
semantic labels and image query logs.
In this paper, we present a task-dependent visual-codebook
compression framework to achieve the above goals. We introduce a supervised sparse coding model to learn a compression
function, which maps an originally high-dimensional codebook
into a compact basis dictionary. In this model, we integrate both
visual discriminability and task-dependent semantic discriminability (from supervised labeling) into the Lasso-based regression cost [19] for compression function learning. Using the
Fig. 1. Proposed task-dependent visual-codebook compression framework.
learned basis dictionary, each original BoW histogram is nonlinearly transformed into a low-dimensional reconstruction parametric vector, which serves as the compressed histogram for a
given image. This histogram is more compact and discriminative for the subsequent recognition classifier training, retrieval
model indexing, or a similarity ranking procedure.
To integrate the visual discriminability into our basis dictionary learning, we model the codewords’ importance using
their term frequencies and inverse document frequencies [20],
which are embedded as a weighting vector into the compression
cost function. To integrate the task-dependent discriminability
into our basis dictionary learning, we introduce a label constraint kernel (LCK) to model the labeling consistency loss in
the compression function. In particular, we show how to specify
the LCK for different tasks even in an identical database, including: 1) (partial) category labels for object recognition; 2)
correlative semantic annotations for image understanding; and
3) image query logs (obtained from the ground-truth image
ranking labeling) for web image ranking. Fig. 1 outlines the
proposed framework.
II. RELATED WORK
Visual-codebook construction: A visual codebook is typically constructed based on unsupervised vector quantization
such as k-means clustering [8], [21], which subdivides the local
feature space into discrete codeword regions. Such a division
represents an image as a BoW histogram, in which each bin
counts how many local features of this image fall into the corresponding codeword region of this bin. In recent years, there
have been numerous vector quantization approaches to build visual codebooks, such as spatial pyramids [22], vocabulary trees
(VTs) [1], approximate k-means [4], and their variations [3],
[9]. In addition to direct quantization, hashing-based approaches
are also well exploited in the literature, e.g., locality sensitive
hashing [23] and its kernelized version [10]. Recent works have
also investigated the codeword uncertainty and ambiguity using
methods such as Hamming embedding [11], soft assignments
[24], [25], and kernelized codebook histograms [5].
Supervised codebook construction: Rather than using
solely visual content statistics, works in [12]–[15] also proposed
to integrate the semantic labels to supervise the codebook construction. For instance, Mairal et al. [12] used category-aware
sparse coding to build a supervised vocabulary for object
categorization. Lazebnik et al. [13] adopted mutual information loss minimization to build a supervised codebook based on a
fully labeled local feature set. Moosmann et al. [14] proposed
to build supervised indexing trees using an ERC-Forest that
considered semantic labels as stopping tests. Ji et al. [15]
adopted a hidden Markov random field to use correlative web
labeling for supervised vocabulary construction. In image and
video compression, similar works have been proposed for learning-based vector quantization [26]–[28]. Methods such as self-organizing maps [27] and regression loss minimization [28] reconstruct the original input signal while minimizing visual quantization distortion. Finally, there are
also works in learning visual parts [29], [30] from the images
of identical categories by clustering local patches with spatial
configurations.
Learning-based codeword refinement: Rather than directly supervising the codebook building procedure, works
in [16]–[18] and [31] proposed to merge or split the initial
codewords for the subsequent classifier training. For instance,
Perronnin et al. [16] integrated category labels to adapt an
initial vocabulary into several class-specific vocabularies.
Winn et al. [31] also learned class-specific vocabularies from
an initial vocabulary by merging codeword pairs, in which the
codeword distribution was modeled by the Gaussian mixture
model. Although the works in [16]–[18] and [31] perform well for a limited number of categories, these methods cannot be scaled up to general scenarios that contain numerous and correlative category labels. To a certain degree, topic models [e.g., probabilistic latent semantic analysis (pLSA) [32] and latent Dirichlet allocation (LDA) [33]] can also be treated as unsupervised codeword refinements, which work in a generative manner governed by hyperparameters.
Sparse coding: Sparse coding theory has been recently
well investigated for effective and efficient high-dimensional
signal representation and compression [19], [34], [35]. Its main
idea is to represent a signal as a linear combination of sparse
bases from an overcomplete dictionary. Although the exact recovery of such a dictionary is NP-hard, a sufficiently sparse linear representation can be efficiently approximated by convex optimization [34]. For instance, Tibshirani [19] proposed the Lasso regression strategy, which achieves coefficient sparsity by adding an $\ell_1$-norm penalty to the regression loss. Works in [12] and [36] also adopted sparse coding to directly build codebooks from the original local patch set. However, when facing millions of local patches, the computational efficiency of direct sparse coding [12], [36] remains an open problem. Compared with the above works, our novelties are twofold. First, our codeword-level sparse coding improves the essential computational efficiency: We aim to learn a compression transform to
compress the quantized BoW histograms, rather than directly encoding the local patch collection. This is very important for scalable applications that contain hundreds of millions of local patches, for which it is unsuitable to directly use sparse coding to learn codebooks. Second, the existing supervised sparse coding models only allow supervision at the local patch level. On the contrary, the coding model proposed in this paper enables the modeling of heterogeneous types of semantic labels based on our proposed LCK.

Fig. 2. Graphical model of task-dependent codebook compression.

Fig. 3. Visualized flowchart of our codebook compression pipeline.
Our contributions: Compared with the previous works
that directly supervised the codeword quantization procedure,
we focus on compressing an initially high-dimensional codebook into an extremely compact codebook (see Figs. 2 and 3). Our first contribution is efficiency: We only operate on the BoW histograms and thus avoid the burden of directly supervising the quantization of millions or billions of local features. Our second contribution is extensibility: Our approach can be deployed on most existing codebooks [1], [3], [4], [8], [9]. We treat their BoW histograms as input, and our subsequent compression is independent of the initial
codebook building strategies. Our last consideration is the
flexibility: We propose a “task dependent” discriminability
embedding to model different kinds of semantic labels, either
within an identical database or among different databases. This
is achieved by introducing a flexible labeling correlation kernel
in Section IV-B to model heterogeneous kinds of semantic
labels.
III. PROBLEM FORMULATION
We denote scalars as italic letters, e.g., $x$; vectors as bold italic letters, e.g., $\mathbf{x}$; the instance space of $n$ instances as $\{\mathbf{x}_i\}_{i=1}^{n}$; the inner product between two vectors $\mathbf{x}$ and $\mathbf{y}$ as $\langle \mathbf{x}, \mathbf{y}\rangle$; the $\ell_p$ norm over $\mathbf{x}$ with $p \ge 1$ as $\|\mathbf{x}\|_p$; and the transposition of a matrix $\mathbf{A}$ as $\mathbf{A}^{T}$ and its inverse as $\mathbf{A}^{-1}$.
The input of our codebook compression algorithm is an initial codebook $C$ containing $M$ codewords, which represents the $n$ training images as a set of BoW histograms $\{\mathbf{v}_i\}_{i=1}^{n}$, in which each $\mathbf{v}_i$ denotes an $M$-bin BoW histogram. This initial codebook can be derived from most existing approaches such as [1], [3], [4], [8], and [9]. Then, suppose a subsequent task (such as image retrieval or object recognition) provides its labels as $L = \{l_i\}_{i=1}^{n}$. To incorporate partial labels into our subsequent supervised codebook compression, we further assign $l_i = 0$ to indicate that the $i$th image is unlabeled.

We aim to learn a basis dictionary $\mathbf{D} \in \mathbb{R}^{M \times K}$ to compress the initial codebook $C$, which serves as a compression function to transfer each original BoW histogram $\mathbf{v}_i \in \mathbb{R}^{M}$ into a more compact histogram $\mathbf{a}_i \in \mathbb{R}^{K}$. Therefore, for a new image, we first use $C$ to generate its original BoW histogram $\mathbf{v}$ and then use $\mathbf{D}$ to map $\mathbf{v}$ into a compact histogram $\mathbf{a}$. In general, we have $K \ll M$. For instance, $M$ is on the order of 10 000 for many widely used codebooks, and $K$ is at the hundred level in our subsequent experiments.
Our first goal is to ensure that the basis dictionary $\mathbf{D}$ can still preserve the visual discriminability of the original codebook $C$. To maintain such discriminability, we should slightly compress the "important" codewords and heavily compress the "chaos-like" codewords. To this end, we model the visual discriminability of each codeword $c_j$ into our compression loss function to learn $\mathbf{D}$ in Section IV-A.
Our second goal is to integrate the task-dependent label constraints to supervise our compression. Such labels come from numerous sources, such as object classes, correlative image annotations, and image ranking orders from user-query logs. "Task dependent" means that the compression process should depend on its specific task even in an identical database, e.g., searching for semantically similar images or for near-duplicate (ND) objects. Task-dependent embedding is a challenging issue
that remains unexplored in all previous works. In Section IV-B,
we further show how to integrate heterogeneous labels from
different tasks into our codebook compression framework.
IV. TASK-DEPENDENT CODEBOOK COMPRESSION
Using the initial BoW histogram set $\{\mathbf{v}_i\}_{i=1}^{n}$ extracted from the training images, we learn the basis dictionary $\mathbf{D}$ from the original codebook $C$ by minimizing the following cost:

$$\mathrm{Cost}(\mathbf{D}) = \frac{1}{n}\sum_{i=1}^{n}\mathrm{Loss}(\mathbf{v}_i, \mathbf{D}) \qquad (1)$$

where $\mathrm{Loss}(\mathbf{v}_i, \mathbf{D})$ denotes the loss function that measures whether the current basis dictionary $\mathbf{D}$ is "good" at reconstructing an initial BoW histogram $\mathbf{v}_i$. To this end, we are trying to learn the best compression function to transfer each high-dimensional sparse $\mathbf{v}_i$ into the compact descriptor $\mathbf{a}_i$.
To evaluate the compression distortion, we reversely measure the recovery residual of $\mathbf{v}_i$ from $\mathbf{D}\mathbf{a}_i$. Furthermore, considering that each histogram is sparse with only a few nonzero bins, we hope that our reconstruction coefficient vector $\mathbf{a}_i$ is also sparse. We add this sparse constraint into the loss function using an $\ell_1$-norm form as follows:

$$\mathrm{Loss}(\mathbf{v}_i, \mathbf{D}) = \min_{\mathbf{a}_i}\;\frac{1}{2}\|\mathbf{v}_i - \mathbf{D}\mathbf{a}_i\|_2^2 + \lambda\|\mathbf{a}_i\|_1. \qquad (2)$$

$\mathbf{a}_i$ is the coefficient vector used to reconstruct $\mathbf{v}_i$ from $\mathbf{D}$, which meanwhile serves as the new compressed BoW histogram for the $i$th image. $\lambda$ controls a tradeoff between the sparsity of the reconstructed signal $\mathbf{a}_i$ and its precision in recovering $\mathbf{v}_i$. While guaranteeing real sparsity through the $\ell_0$ norm is intractable [39], a Lasso-based $\ell_1$ penalty [19] can still approximate a sparse solution for the coefficients $\mathbf{a}_i$.

To avoid arbitrarily small $\mathbf{a}_i$, we conduct a normalization operation on each column $\mathbf{d}_j$ of $\mathbf{D}$ after each round of optimization of (1) as follows:

$$\mathbf{d}_j \leftarrow \frac{\mathbf{d}_j}{\|\mathbf{d}_j\|_2}, \quad \text{s.t. } \|\mathbf{d}_j\|_2 \le 1, \quad j = 1, \ldots, K. \qquad (3)$$
Note that $\mathrm{Cost}(\mathbf{D})$ in (1) is not jointly convex with respect to $\mathbf{D}$ and $\{\mathbf{a}_i\}$. Therefore, we resort to a joint optimization that iterates between the basis dictionary $\mathbf{D}$ and the compressed BoW histograms $\{\mathbf{a}_i\}$.¹ That is, in each round of optimization, we fix one parameter set ($\mathbf{D}$ or $\{\mathbf{a}_i\}$) and optimize the other. We then rewrite (2) with the constraints in (3) as

$$\mathrm{Loss}(\mathbf{v}_i, \mathbf{D}, \mathbf{a}_i) = \frac{1}{2}\|\mathbf{v}_i - \mathbf{D}\mathbf{a}_i\|_2^2 + \lambda\|\mathbf{a}_i\|_1, \quad \text{s.t. } \|\mathbf{d}_j\|_2 \le 1 \;\;\forall j. \qquad (4)$$

¹Nevertheless, other strategies, such as linear programming, can also be used to accelerate the learning of $\mathbf{D}$ and the estimation of $\mathbf{a}_i$.
One essential advantage is that, compared with direct sparse coding to build a codebook as in existing works [12], [36], we only operate at the codeword level (typically only about 10 000 codewords), rather than on the entire local feature collection (which typically contains millions or billions of local features). Hence, the computational efficiency is largely improved. Meanwhile, as will be shown in our subsequent experiments, our codebook compression performance is comparable with that of the state-of-the-art approaches [12], [36] that directly operate on the entire initial local feature collection.
A. Visual Discriminability Embedding
Visual discriminability refers to whether a codeword is
discriminative to distinguish each training image in the BoW
histogram. We exploit this criterion to guide our codebook
compression: Discriminative codewords should be slightly
compressed, whereas chaos-like codewords should be heavily
compressed. Following the principle of term frequency and inverse document frequency [20], we quantitatively measure the visual discriminability of the $j$th codeword as

$$w_j = \frac{n_j}{N_{\mathrm{feat}}} \times \log\frac{N}{N_j} \qquad (5)$$

where $w_j$ denotes the discriminative weighting of codeword $c_j$, $n_j$ is the number of local features quantized into $c_j$, $N_{\mathrm{feat}}$ is the total number of local features extracted from all training images, $N$ is the total number of training images, and $N_j$ is the number of images that contain local features quantized into $c_j$. On the right side of (5), the first term measures whether $c_j$ represents many local features (the more the better), and the second term measures whether $c_j$ is only discriminative for a few images (the fewer the better). The ensemble of $\{w_j\}_{j=1}^{M}$ forms a visual discriminability weighting vector $\mathbf{w}$, which is embedded to refine our compression loss in (4) as
$$\mathrm{Loss}(\mathbf{v}_i, \mathbf{D}, \mathbf{a}_i) = \frac{1}{2}\big\|\mathrm{diag}(\mathbf{w})(\mathbf{v}_i - \mathbf{D}\mathbf{a}_i)\big\|_2^2 + \lambda\|\mathbf{a}_i\|_1. \qquad (6)$$

Equation (6) highlights whether a given BoW histogram contains discriminative local features, as measured by its $\mathbf{w}$ weighting. If so, the learning of $\mathbf{D}$ would emphasize more the compression loss of $\mathbf{v}_i$ and vice versa.
B. Task-Dependent Discriminability Embedding
Task-dependent discriminability refers to whether the compressed codebook is suitable for its subsequent recognition or retrieval task. In this subsection, we integrate the task labels $L = \{l_i\}_{i=1}^{n}$ to supervise the learning of our basis dictionary $\mathbf{D}$. This is achieved by introducing an LCK into our compression loss $\mathrm{Loss}(\mathbf{v}_i, \mathbf{D}, \mathbf{a}_i)$ for each BoW histogram $\mathbf{v}_i$ to evaluate its labeling consistency, as shown in

$$\mathrm{LCK}(\mathbf{v}_i) = \mathrm{Logistic}\Big(\sum_{\mathbf{v}_j \in \mathcal{N}(\mathbf{v}_i)} S(l_i, l_j)\Big) \qquad (7)$$

where $\mathrm{Logistic}(\cdot)$ represents a logistic loss function on a scalar argument, which enjoys properties similar to the loss function of support vector machines (SVMs); $S(l_i, l_j)$ defines a consistency measurement between two labels $l_i$ and $l_j$, which will be revisited in Sections IV-B1–IV-B3; and $\mathcal{N}(\mathbf{v}_i)$ denotes the set of nearby BoW histograms (with their labels) that fall within the neighborhood of $\mathbf{v}_i$.

Then, our goal turns out to be learning $\mathbf{D}$ with LCK constraints. To that effect, we propose a refined loss function of (6) as

$$\mathrm{Loss}(\mathbf{v}_i, \mathbf{D}, \mathbf{a}_i) = \frac{1}{2}\big(1 + \gamma\,\mathrm{LCK}(\mathbf{v}_i)\big)\big\|\mathrm{diag}(\mathbf{w})(\mathbf{v}_i - \mathbf{D}\mathbf{a}_i)\big\|_2^2 + \lambda\|\mathbf{a}_i\|_1. \qquad (8)$$

In addition, learning the overall $\mathrm{Cost}(\mathbf{D})$ using the above refined Loss is still not convex with respect to $\mathbf{D}$ and $\{\mathbf{a}_i\}$. Similarly, we resort to a joint optimization between $\mathbf{D}$ and $\{\mathbf{a}_i\}$, as stated before. It is worth mentioning that, before the compression procedure, we preestimate and store the LCK for each $\mathbf{v}_i$ in a lookup table, which largely accelerates our subsequent iterative regression.

Our LCK can be used in several kinds of recognition or retrieval tasks by modeling heterogeneous labels, such as one-class, two-class, and multiple-class labels; partial labels; correlative labels; and lists of image ranking orders. We specify the LCK in (7) for each case as follows.
1) Modeling Category Labels: In such a case, the label vector $L = \{l_i\}_{i=1}^{n}$ of the training images comes from several discrete categories. We assume that any two given categories are independent. Hence, $S$ only takes effect on $\mathrm{LCK}(\mathbf{v}_i)$ once two identical labels fall into $\mathcal{N}(\mathbf{v}_i)$:

$$\mathrm{LCK}(\mathbf{v}_i) = \mathrm{Logistic}\Big(\sum_{\mathbf{v}_j \in \mathcal{N}(\mathbf{v}_i)} \delta(l_i, l_j)\Big). \qquad (9)$$

Here, we specialize the consistency measurement $S$ as the count of identical labels within $\mathcal{N}(\mathbf{v}_i)$: $S = \sum_{\mathbf{v}_j \in \mathcal{N}(\mathbf{v}_i)} \delta(l_i, l_j)$, in which $\delta(\cdot,\cdot)$ produces 1 for two identical labels and 0 otherwise. Intuitively, if $\mathcal{N}(\mathbf{v}_i)$ contains more nearby images with identical category labels, the learning of $\mathbf{D}$ would "pay more attention to" the reconstruction loss of $\mathbf{v}_i$. It is worth mentioning that partial labeling, as well as one-class labeling, can be easily modeled using our category-based LCK in (9).
2) Modeling Correlative Semantic Annotations: We further remove the category independence constraint, which allows the labels to be correlative to each other. For instance, a "furniture" annotation is closer to a "chair" annotation than to an "airplane" annotation. WordNet [40] is a powerful electronic lexical database that provides such semantic relationships. We adopt the WordNet::Similarity [41] measurement $\mathrm{Sim}(l_i, l_j)$ to quantitatively measure the semantic closeness between two correlative annotations $l_i$ and $l_j$, which refines the LCK as

$$\mathrm{LCK}(\mathbf{v}_i) = \mathrm{Logistic}\Big(\sum_{\mathbf{v}_j \in \mathcal{N}(\mathbf{v}_i)} \mathrm{Sim}(l_i, l_j)\Big). \qquad (10)$$

Here, we specialize the consistency degree of $\mathbf{v}_i$ as its accumulated WordNet::Similarity score within $\mathcal{N}(\mathbf{v}_i)$: $S = \sum_{\mathbf{v}_j \in \mathcal{N}(\mathbf{v}_i)} \mathrm{Sim}(l_i, l_j)$. Intuitively, if $\mathcal{N}(\mathbf{v}_i)$ contains more correlative annotations, $\mathbf{v}_i$ is more semantically sensitive. Therefore, $\mathbf{v}_i$ will receive less compression loss in generating $\mathbf{D}$ and $\mathbf{a}_i$. On the contrary, more diverse semantic annotations in $\mathcal{N}(\mathbf{v}_i)$ will lead to heavier compression of $\mathbf{v}_i$.

3) Modeling Image Ranking Lists: Rather than category or semantic labels, in some other cases, the task supervision cannot be directly obtained. For instance, content-based image search engines (e.g., TinEye [51] and Google Goggles [52]) usually collect large amounts of user-query logs, which offer similarity ranking orders instead of direct semantic labels. Suppose that there are in total $G$ groups of user-query logs and each gives the ranking order list $R_q$ for the image query $q$. We give the following ranking consistency modeling of two images $\mathbf{v}_i$ and $\mathbf{v}_j$ as

$$S(\mathbf{v}_i, \mathbf{v}_j) = \begin{cases} +1, & \mathrm{Rank}(\mathbf{v}_i) < \mathrm{Rank}(\mathbf{v}_j) \\ -1, & \mathrm{Rank}(\mathbf{v}_i) > \mathrm{Rank}(\mathbf{v}_j) \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

in which $\mathrm{Rank}(\cdot)$ denotes the ranking order within $R_q$. We set $S(\mathbf{v}_i, \mathbf{v}_j) = 0$ once $\mathbf{v}_i$ or $\mathbf{v}_j$ is outside $R_q$, therefore ensuring $S = 0$ for unlabeled (unranked) images. Subsequently, our LCK is refined as follows:

$$\mathrm{LCK}(\mathbf{v}_i) = \mathrm{Logistic}\Big(\sum_{\mathbf{v}_j \in \mathcal{N}(\mathbf{v}_i)} S(\mathbf{v}_i, \mathbf{v}_j)\Big), \qquad (12)$$

in which $S(\mathbf{v}_i, \mathbf{v}_j)$ is $+1$, $-1$, or $0$. Based on (12), we specialize the measurement of $\mathbf{v}_i$ as the sum of positive ($+1$) or negative ($-1$) sequential ranking pairs within $\mathcal{N}(\mathbf{v}_i)$. Intuitively, if we get more positive sequential ranking pairs in $\mathcal{N}(\mathbf{v}_i)$, $\mathbf{v}_i$ should be ranked higher. Hence, our LCK will give $\mathbf{v}_i$ less compression loss in generating the basis dictionary $\mathbf{D}$ and $\mathbf{a}_i$.

C. Iterative Dictionary Learning

Learning the basis dictionary $\mathbf{D}$ in (8) is an optimization problem with respect to $\mathbf{D}$ and $\{\mathbf{a}_i\}$. As stated before, since the jointly convex optimization of both $\mathbf{D}$ and $\{\mathbf{a}_i\}$ is unavailable, we resort to an iterative convex optimization between $\mathbf{D}$ and $\{\mathbf{a}_i\}$ using block coordinate descent [12], [42]–[44]. It iteratively optimizes the learning of $\mathbf{D}$ (denoted as supervised dictionary learning) and the inference of $\{\mathbf{a}_i\}$ (denoted as task-dependent sparse coding), as outlined in Algorithm 1.

Algorithm 1: Iterative Dictionary Learning

1 Input: Training images with initial BoW histograms $\{\mathbf{v}_i\}_{i=1}^{n}$, initial visual codebook $C$, task-dependent labels $L = \{l_i\}_{i=1}^{n}$, iteration number $t$, and maximum iteration number $T$.
2 Initialization: Set $\mathbf{D}^{(0)}$ to a random Gaussian matrix with normalized columns. Use $\mathbf{D}^{(0)}$ and $\{\mathbf{v}_i\}$ to calculate the initial $\{\mathbf{a}_i^{(0)}\}$.
3 while {$t < T$} do
4  For the given task, calculate and store the LCK response for each $\mathbf{v}_i$ using the respective scheme in Section IV-B.
5  Supervised Dictionary Learning:
6  for {each $\mathbf{v}_i$ in the $t$th iteration} do
7   Use Lasso [19] to learn $\mathbf{D}^{(t)}$ at the $t$th iteration based on the following loss:
8   $\mathrm{Loss}(\mathbf{v}_i, \mathbf{D}, \mathbf{a}_i^{(t-1)}) = \frac{1}{2}\big(1 + \gamma\,\mathrm{LCK}(\mathbf{v}_i)\big)\big\|\mathrm{diag}(\mathbf{w})(\mathbf{v}_i - \mathbf{D}\mathbf{a}_i^{(t-1)})\big\|_2^2 + \lambda\|\mathbf{a}_i^{(t-1)}\|_1$ (13)
9  end
10 Task-Dependent Sparse Coding:
11 for {each $\mathbf{v}_i$ in the $t$th iteration} do
12  Adopt $\mathbf{D}^{(t)}$ to estimate $\mathbf{a}_i^{(t)}$ using (14)
13  $\mathbf{a}_i^{(t)} = \arg\min_{\mathbf{a}}\frac{1}{2}\big(1 + \gamma\,\mathrm{LCK}(\mathbf{v}_i)\big)\big\|\mathrm{diag}(\mathbf{w})(\mathbf{v}_i - \mathbf{D}^{(t)}\mathbf{a})\big\|_2^2 + \lambda\|\mathbf{a}\|_1$ (14)
14 end
15 $t \leftarrow t + 1$
16 end
17 Output: The basis dictionary $\mathbf{D}$; the new compressed histograms $\{\mathbf{a}_i\}_{i=1}^{n}$.

It is worth mentioning that, in practice, we learn the compression function $\mathbf{D}$ and the compressed histograms $\{\mathbf{a}_i\}$ simultaneously using Algorithm 1 with iterative optimization between $\mathbf{D}$ and $\{\mathbf{a}_i\}$. From this viewpoint, our proposed approach can be regarded as single-stage processing.
V. EXPERIMENTAL COMPARISONS
In this section, we compare our task-dependent codebook compression with alternative approaches and five
state-of-the-art works [1], [16], [33], [36], [50]. We carry out
our comparisons in two benchmark databases and a large-scale
web photo data set: 1) object recognition in PASCAL Visual
Object Class (VOC) 07, which aims to classify whether a given
image contains a certain object (output a binary judgement); 2)
ND image retrieval in UKBench, which aims to find ND images
that contain identical objects or identical scenes without regard
to the photographing variances; and 3) a 0.5 million Flickr
data set, which provides quantitative performance evaluations
for three kinds of task-dependent codebook compressions, including category image search, semantic similar image search,
and image ranking.
A. Benchmark Databases and Evaluations
1) PASCAL VOC 07 Object Recognition Benchmark: The
PASCAL VOC 2007 benchmark [45] contains 20 image categories, including Aeroplane, Bicycle, Bird, Boat, Bottle, Bus,
Car, Cat, Chair, Cow, Diningtable, Dog, Horse, Motorbike,
Person, Pottedplant, Sheep, Sofa, Train, and Tvmonitor. There
are around 2500 training images, 2500 validation images, and
5000 test images in total. We use PASCAL VOC 07 to evaluate
our category-based LCK (see Section IV-B1) with comparisons
to [16] and [36].
2) UKBench ND Retrieval Benchmark: The UKBench
data set [1] contains over 10 000 images with 2500 objects.
There are four images per object to offer sufficient variances
in viewpoints, lighting conditions, scales, occlusions, and
affine transforms. These category labels (four images per object) are modeled into our category LCK. To compare against the MSER + SIFT baseline reported in [1], and identical to the setting of [1], we build our retrieval model over the entire UKBench database and then select the first image per object to form the query set. This ensures that the top returned photo would be the query itself. Subsequently, the performance depends on whether we can rank the remaining three photos of the object in the highest possible positions. This is evaluated by the "correct returning" measurement [1], which reports the average number of correctly returned images in the top four results. The UKBench data set is used
to evaluate our codebook compression with the category-based
LCK in Section IV-B1, with comparisons to that in [1] and [33].
3) 0.5 Million Flickr Database: We crawled over 500 000
collaboratively labeled photos from Flickr, which gives a real-world evaluation for our task-dependent compression. It contains over 180 million local features with over 450 000 labels (over 3600 unique keywords). Within this 0.5 million Flickr data set, we carry out three task-dependent codebook compressions for the following three tasks:
1) ND-0.5Million Evaluation: Our first task is ND image search in this 0.5 million Flickr data set. Following the TinEye search evaluation methodology [51], we collected and manually labeled 23 groups of web images (1100 in total) from both TinEye and Google image search. The images in each group are partial duplicates of each other. We then add these ground-truth images into our 0.5 million data set. This enriched data set is referred to as ND-0.5Million (ND evaluation in a 0.5 million database).² We quantitatively evaluate our task-dependent codebook compression using the category-aware LCK, with comparisons to the related works in [1] and [50].
We measure our retrieval performance by mean average precision (MAP) at $n$. It represents the mean of the average precision over all queries, where each query reveals its position-sensitive ranking precision in the top $n$ positions as follows:

$$\mathrm{MAP}@n = \frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{m_q}\sum_{r=1}^{m_q} P_q(r) \qquad (15)$$

where $Q$ is the number of queries, $r$ is the rank, $m_q$ is the number of relevant images for query $q$, and $P_q(r)$ is the precision at the cutoff rank of the $r$th relevant image.
2) SS-0.5Million Evaluation: Our second task is semantically sensitive image retrieval in this 0.5 million Flickr data set. We select 20 labels identical to the categories in PASCAL VOC 07. For each label, we randomly pick two images from our 0.5 million Flickr data set and rank the top 100 similar photos for each. For each ranking list, we ask a group of volunteers to identify whether each ranked image comes from the identical category of the query. If so, we treat it as a "correct" image; otherwise, as an "incorrect" image. Similar to the ND-0.5Million Evaluation, we also use MAP to measure our performance. We denote this evaluation as the semantically similar evaluation in the 0.5 million database (SS-0.5Million), which is used to quantitatively compare our semantic-annotation-based LCK (see Section IV-B2) with the work in [50].
3) RS-0.5Million Evaluation: Our third task is image ranking based on user-query log learning. For each query in the above SS-0.5Million Evaluation (containing 40 queries), we collect the correctly returned images and their ranking orders as user-query logs, which are used to evaluate our ranking-based LCK in Section IV-B3. We denote this group as the ranking-sensitive evaluation in the 0.5 million database (RS-0.5Million evaluation), which evaluates our ranking consistency LCK (see Section IV-B3) with the normalized discounted cumulative gain (NDCG) measurement

$$\mathrm{NDCG}@p = Z_p \sum_{j=1}^{p} \frac{2^{\mathrm{rel}_j} - 1}{\log_2(1 + j)} \qquad (16)$$

where $\mathrm{NDCG}@p$ is the NDCG at rank position $p$, and $p$ is the last position of a relevant sample within the first returned samples in the ranked list; $\mathrm{rel}_j$ represents the relevance between the $j$th sample in the ranked list and the query, bounded in [0, 1]; and $Z_p$ is the normalization constant for a given ranked list. Compared with precision and recall, NDCG is sensitive to the position of the highest rated image, regardless of the variable lengths of different ranking lists. A computational sketch of both the MAP and NDCG measures follows below.
2Since the original 0.5 million Flickr data set could also contain partial duplicate images of our 23 ground-truth categories, we use every image in the
ground-truth set to query the database and, subsequently, to identify and remove
any duplicates from the original 0.5 million Flickr data set.
Fig. 4. Parameter tuning of the best compressed codebook size in three databases. The tuning performance in (a) is measured by Precision@10, that in (b) by correct returning, and those in (c) and (d) by MAP. (a) Tuning $(K, \lambda, \gamma)$ in PASCAL VOC 07. (b) Tuning $(K, \lambda, \gamma)$ in UKBench. (c) Tuning $(K, \lambda, \gamma)$ in the ND-0.5Million Evaluation. (d) Tuning $(K, \lambda, \gamma)$ in the SS-0.5Million Evaluation.
B. Baselines and Alternative Approaches
We summarize our baselines [1], [12], [16], [33], [50] and alternative approaches compared in the subsequent quantitative comparisons as follows.
1) VT [1]: The first baseline comes from the unsupervised
visual-codebook generation based on k-means clustering.
We resort to its hierarchical version [1] to ensure scalable
indexing and search.
2) BoW + pLSA [33]: Bosch et al. [33] reported state-of-the-art recognition performance on Caltech 101. They use pLSA to extract topics from the original BoW histograms and then train an SVM over the topic-level features. Since [33] also compressed an initial codebook, we directly compare our performance with the quantitative results in [33]. In Section V-E, we will further discuss the theoretical differences between [33] and this paper.
3) Learning-based codebook refinement [16], [31] (BoW
Learning): To build supervised vocabulary for object
recognition tasks, we also reimplement the work in [16],
which incorporated category learning to adapt an initial
vocabulary into several class-specific vocabularies. We
compare our work to [16] in both the object categorization
task (in PASCAL VOC 07) and the ND Image Retrieval
task (in UKBench).
4) Sparse-coding-based visual codebook [36] (Sparse
Coding): We further compare our codebook compression with a recent work in [36]. Different from our scheme
that operates over the BoW histogram, the work in [36]
used fast sparse coding to generate a compact dictionary
from local patch collections. Since [36] is also deployed on
the PASCAL VOC 07, we can directly offer quantitative
comparisons to [36] in the object categorization task.
5) Aggregating local features (AggreSIFT): Jegou et al. [50]
proposed to aggregate local descriptors with principal
component analysis (PCA) and hashing indexing, which
produced an approximate 32-bit descriptor per image, as a
successive work of [53]. We also compare our scheme to
[50] in our 0.5 million Flickr data set.
6) Compressed codebook without visual discriminability embedding (without visual): As an alternative approach, we
embed neither the visual discriminability weighting $\mathbf{w}$ nor the LCK into our loss function [see (8)]. Hence, only direct sparse coding is used to compress the initial codebook [using (2) to replace (8)]. This baseline demonstrates the effectiveness of our visual discriminability embedding.
Fig. 5. Parameter tuning of the best hierarchical layers for the initial visual
codebooks within three benchmark databases, respectively. (a) Tuning H in
PASCAL07. (b) Tuning H in UKBench. (c) Tuning H in 0.5 million Flickr
database.
7) Compressed codebook without task-dependent embedding
(without task): As another alternative approach, we embed only the visual discriminability weighting $\mathbf{w}$, but not the LCK, to compress the initial codebook [using (6) to replace (8)].
8) Bag-of-Words + PCA (BoW PCA): A straightforward approach is to compress the high-dimensional BoW histograms using PCA. In such a case, PCA can reduce the dimensionality while maintaining the main characteristics of the BoW histogram. In Section V-D, we compare our codebook compression with BoW PCA in both UKBench and the 0.5 million Flickr data set.
C. Parameter Tunings
1) Codebook Construction Tuning: For each benchmark
data set, we extract SIFT [46] features from its training images.
Using all SIFT features, we build a VT [1] to obtain the initial codebook $C$, which generates a BoW histogram $\mathbf{v}_i$ for each training image. We denote the hierarchical level (which decides how many layers to build in the VT) as $H$ and the branching factor (which decides how many child clusters each parent node has) as $B$. We stop the quantization division once there are fewer than 1000 SIFT features in a word. This setting gives at most $B^{H}$ words in the finest level. Fig. 4 gives the parameter tuning of the codebook size in the three data sets, respectively. Based on the tuning results in Fig. 5, the best initial-vocabulary settings of $(H, B)$ are fixed separately for the UKBench, PASCAL VOC 07, and 0.5 million Flickr data sets. For the object classification task, the linear SVM [48] is adopted to predict the category labels. For ND retrieval (in UKBench) and image search (in the 0.5 million Flickr data set), we use the L2 distance to rank image similarity with inverted indexing.
Fig. 6. Average precision for all 20 object categories of the PASCAL VOC 07 data set, with comparisons to the state-of-the-art works VT [1], BoW + pLSA [33], learning codebook [16], and sparse coding [36]. The line "codebook compression" denotes our final approach with both visual and task-dependent discriminability embedding, the line "without visual" denotes baseline (6), and the line "without task" denotes baseline (7). Our final approach is the one that includes both visual discriminability and task-dependent (category-based LCK) discriminability in codebook compression. In "Without Visual," "Without Task," and the final approach, we use the same setting of $K$ for learning the basis dictionary $\mathbf{D}$.
2) Codebook Compression Tuning: Based on the best tuned visual codebook, we carry out our codebook compression to obtain the compressed codebook. Here, we briefly discuss the choice of the three compression parameters: $\lambda$ [controls the reconstruction sparsity in (2), (4), (6), and (8)], $\gamma$ [controls the degree of task-dependent embedding in (8)], and $K$ [controls the dimension of the basis dictionary in (2), (4), (6), and (8)]. While a straightforward choice is to tune all of them simultaneously using cross-validation, we resort to a more efficient strategy that sequentially optimizes $K$, $\lambda$, and $\gamma$. That is, we increase the volume of $K$, and for each $K$, we tune the best $(\lambda, \gamma)$ pair as follows. First, as empirically found in [12], the sparsity parameter $\lambda$ is set to 0.15. Then, $\gamma$ serves as a regularization parameter that controls how strongly the compressed codebook fits the task labels: A larger $\gamma$ results in a good fit to the training labels (low variance) but simultaneously increases the classification bias and vice versa. We then estimate the best $\gamma$ by leave-one-out cross-validation in each data set. Fig. 4 presents the best tuned $(K, \lambda, \gamma)$ in each data set.
D. Quantitative Results
1) Object Recognition in PASCAL VOC 07: Fig. 6 shows
our object recognition performance in PASCAL VOC 07. First,
a purely visual codebook [1] [baseline (1)] with a linear SVM [48] obtains the lowest performance among all approaches. This performance was also validated in the PASCAL VOC 07 evaluations of BoW [49]. Employing pLSA [33] to compress the codebook [baseline (2)] improves performance by reducing the feature dimension, thereby alleviating the curse of dimensionality in the SVM (this is also observed in [33]). Using the learning codebook [16] [baseline (3), in which a class-specific codebook is learned for each object class] with a one-class SVM, we find that the integration of semantic supervision can largely improve the performance. Employing sparse coding to directly learn a compressed codebook [36] [baseline (4)] outperforms baselines (1)–(3) in effectiveness. However, its time cost is extremely high since it directly operates over the entire local feature collection (in contrast, our coding operates on the BoW histogram).
For baseline (6) that compresses the initial codebook without
either visual discriminability or task-dependent discriminability
embedding, we achieve better performance than both VT [1] and
BoW pLSA [33] (the latter one is the most similar work to our
alternative approach [baseline (7)] with solely visual discriminability embedding). However, we achieve lower performance
than the learning codebook [16] [baseline (3)] and the sparse
coding [36] [baseline (4)]. We explain this phenomenon by the fact that both baselines (3) [16] and (4) [36] integrate label supervision to build codebooks. Compared with our scheme that
only embeds visual discriminability, such integration largely
boosts their performances.
With solely visual discriminability embedding [baseline (7)] and without task-dependent embedding, we outperform the VT [1] [baseline (1)], the BoW + pLSA [33] [baseline (2)], and the learning codebook [16] [baseline (3)]. In addition, we also achieve comparable performance to the sparse coding [36] scheme [baseline (4)].
Finally, we embed both visual and task-dependent discriminability to obtain the best performances over all baselines.
In particular, our approach outperforms the supervised sparse
coding [36] in 16 categories.
Advantages in computational efficiency: It is worth mentioning that the work in [36] directly encoded the gigantic collection of the original local descriptors (such as SIFT). This is
extremely time-consuming and cannot be scaled up to millions
or billions of descriptors. Indeed, the time complexity of [36] is already nearly prohibitive for PASCAL VOC 07 (which contains about 10 000 images). On the other hand, the work of [16] learned category-specific codebooks and hence also cannot be scaled up to real-world databases that contain thousands of correlative semantic categories. This is a similar issue for the work in [31], which
also learned one codebook per category to train the subsequent
classifiers.
Table I further shows the visual search efficiency in the 0.5
million Flickr data set. Using both visual and task-dependent
discriminability embedding does not significantly increase the
computational cost compared with the building of the original
VT. In the online search, due to the online coding requirement,
we also need additional cost to build the compressed BoW
histogram, but this is still slightly more efficient than the state-of-the-art approach of extracting pLSA topic features in [33]. In addition, adding GNP would significantly increase the
TABLE I
TIME AND STORAGE ANALYSIS IN THE 0.5 MILLION FLICKR DATABASE
Fig. 7. Performance of the ND image search task in the UKBench database. We compare our task-dependent codebook compression with the following baselines: VT [1] [baseline (1)], BoW + pLSA [33] [baseline (2)], learning codebook [16] [baseline (3)], direct sparse coding [36] [baseline (4)], without visual discriminative embedding [baseline (6)], without task embedding [baseline (7)], and Bag-of-Words + PCA [baseline (8)]. In such a case, the four images of an identical object share an identical category label in our category LCK [see (9)].
online query time. Note that the additional search time cost comes from two sources: 1) transforming the BoW histogram into the compressed histogram and 2) searching in the new inverted index of the compressed codewords. The latter costs more time since the inverted file for the compact codebook is now less sparse.
2) ND Image Retrieval in UKBench: Fig. 7 further shows
the performance of the above baselines in UKBench, where observations similar to those in Fig. 6 can be found. Note that the increments on the $x$ axis and the measurement on the $y$ axis are identical to those in the quantitative comparisons in [1]. In addition, there are two interesting observations.
1) ND image retrieval relies more on visual discriminability and less on task-dependent discriminability. This can be validated by the large margin between baselines (6) and (7), where baseline (7) performs better due to its visual discriminability embedding.
2) For ND search, the original VT [1] [baseline (1)] even outperforms both the BoW + pLSA [33] [baseline (2)] and the learning codebook [16] [baseline (3)]. This is because the labels (four images as an identical category) do not always offer further information in codebook building since the intracategory images are already ND. Another reason is that ND search involves no classifier training. Hence, the curse of dimensionality that affects [1] in Fig. 6 can be largely avoided.
3) Image Ranking Tasks in 0.5 Million Flickr Data Set:
Fig. 8(a) further shows the ND-0.5Million Evaluation in the Flickr data set. While performance similar to that in Fig. 6 can be found, there are two additional observations compared with ND image retrieval.
1) The scenario of semantic image search should rely more on the task-dependent embedding and less on the visual discriminability embedding. Fig. 8(a) shows the large margin of our approach and baseline (7) over baselines (1), (5), and (6).
2) Baseline (5) [50] outperforms our alternative approaches [baselines (6) and (7)] but performs worse than our final approach. However, we should note that baseline (5) from [50] gives an extremely compact image signature of only approximately 32 B, whereas our final approach gives an over 1-kB representation, as shown in Table I.
Fig. 8(b) further shows the SS-0.5Million Evaluation in the
0.5 million Flickr data set. Similarly, there is a large margin from
the learning codebook [16] [baseline (3)] and our final approach
to the VT [1] [baseline (1)], adaptive vocabulary [16] [baseline (3)], and our alternative approaches [baseline (7)]. Note
that baseline (3) outperforms our alternative approach [baseline (7)] with only visual discriminability embedding. However,
our final approach still outperforms baseline (3) using both visual discriminability and task-dependent discriminability embedding. The main advantage comes from the usage of the annotation-based LCK in Section IV-B2.
Finally, Fig. 8(c) shows the RS-0.5Million Evaluation in
our 0.5 million Flickr data set, which is used to evaluate our
ranking-based LCK in Section IV-B3, with comparisons to our
alternative approaches [baselines (6) and (7)]. In such case,
learning from the user-query logs can largely improve our
ranking performance. When the number of user-query logs increases linearly, our learning complexity grows at most linearly.
As shown in Figs. 7 and 8, serving as a straightforward solution, BoW PCA does not perform better than our scheme. While PCA preserves the principal components of the original BoW histogram, it hardly captures the visual discriminability of the BoW needed for search. In addition, due to the
computational cost in principal component extraction, PCA is
less efficient for large codebooks.
4) Where Is the Matching?: We have recorded the spatial locations of the local patches. After quantizing them into the initial BoW histogram $\mathbf{v}$, we transform $\mathbf{v}$ into the compressed histogram $\mathbf{a}$ using $\mathbf{D}$. Then, we recover $\hat{\mathbf{v}} = \mathbf{D}\mathbf{a}$ from $\mathbf{a}$. Since only part of the bins in $\hat{\mathbf{v}}$ are nonzero, we can map these nonzero bins back to the local features that were quantized into them, as shown in Fig. 9.
Fig. 8. (a) Performance of the ND image search task in the 0.5 million Flickr data set (ND-0.5Million Evaluation), with comparisons to VT [1] [baseline (1)], aggregated local features [50] [baseline (5)], without visual discriminative embedding [baseline (6)], without task embedding [baseline (7)], and Bag-of-Words + PCA [baseline (8)]. (b) Performance of the semantic-sensitive image search task in the 0.5 million Flickr data set (SS-0.5Million Evaluation), with comparisons to the same baselines. (c) Performance of the ranking-sensitive image search task in the 0.5 million Flickr data set (RS-0.5Million Evaluation), with comparisons to the same baselines. In each case, the image annotations are considered as correlative annotation supervision through our LCK. Note that in (c) we leverage the NDCG measurement rather than the MAP measurement used in the previous two groups of evaluations.
Fig. 9. Spatial locations of the recovered local patches from our compressed signature $\mathbf{a}$. In each subfigure, the top left part is the image in the database, the top right patches are the local patch collection extracted from this image, and the bottom part shows the patches that belong to the nonzero visual words, as well as their spatial locations, after reconstruction from $\mathbf{a}$.
Fig. 11. Ranking example in the ND-0.5Million Evaluation, in which each left
figure (large) denotes a query generated from our ground-truth set. In addition, we show two groups of ranking results: 1) Upper row: original VT-based
ranking [1]; 2) Lower row: Our codebook compression-based ranking results
with category-based LCK embedding.
Fig. 10. Average distribution of the compressed bag-of-words histograms (reconstruction parameter vectors).
5) Compressed Histogram Distribution: In Fig. 10, we sum $\mathbf{a}_i$ over 100 test images to get an average histogram,³ in which each bin shows the averaged reconstruction parameters. Note that we fix $K$ in the compression function $\mathbf{D}$, which is learned from a 10 000-word vocabulary. We can see that, compared with the originally sparse BoW histogram $\mathbf{v}$, the compressed reconstruction parameter histogram (which is also our compressed descriptor) is very dense, which largely reduces the storage space (in memory) and improves the retrieval efficiency. Finally, training classifiers based on such a low-dimensional histogram can also well avoid the curse of dimensionality caused by the previous high-dimensional BoW histograms.

³We sum all $\mathbf{a}_i$ for the 100 test images. Then, we obtain the average value for each bin by dividing each bin by 100. Therefore, we obtain an average reconstruction parameter histogram that shows the new codeword distribution.

E. Further Discussions
On correlations to visual codebook topic model: Nevertheless, one straightforward solution in codebook compression
is to use unsupervised topic models (such as pLSA [32], [33]
and LDA [33]) to build a higher level abstraction of the initial
codebook (see Fig. 11). In addition to our superior quantitative
performances, we further discuss two main differences between
our scheme and topic models:
1) pLSA and LDA are both generative models, which need predefined hyperparameters to decide the number of topics. However, it is unsuitable to assert that a given codebook has a fixed number of topics in its compression for different task scenarios (even in an identical database).
2) pLSA and LDA are computationally inefficient. For instance, the singular value decomposition is hard to scale up. In contrast, our codebook compression uses gradient descent operations to iteratively learn the basis dictionary, which can be easily parallelized (over the optimization of either $\mathbf{D}$ or $\{\mathbf{a}_i\}$).
On correlations to ICA dictionary learning: The work
from independent component analysis (ICA) [47] also builds
dictionary coding from the local patch collection, which is
unsuitable for our case due to the following.
1) Components in ICA should be orthogonal to each other. On the contrary, there is no such constraint in our compression, where any two bases in $\mathbf{D}$ need not be orthogonal. Subsequently, our lossy constraint enables the integration of both visual and task-dependent discriminability.
2) ICA is computationally inefficient, as it is deployed over the initial local patch collection. This cost is unaffordable at large scale (an identical drawback to the sparse-coding-based codebooks [12], [36]).
3) ICA focuses on the reconstruction capability of the original input signal. On the contrary, our model emphasizes the discriminability (both visual and task dependent) for the
subsequent visual search or object recognition tasks.
[2] J. Yang, Y. Jiang, A. Hauptmann, and C. W. Ngo, “Evaluating bag-ofvisual-words representations in scene classification,” in Proc. Multimedia Inf. Retrieval, 2007, pp. 197–206.
[3] R. Ji, X. Xie, H. Yao, and W. Y. Ma, “Vocabulary hierarchy optimization for effective and transferable retrieval,” in Proc. CVPR, 2009, pp.
1161–1168.
[4] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object
retrieval with large vocabulary and fast spatial matching,” in Proc.
CVPR, 2007, pp. 1–8.
[5] J. van Gemert, C. Veenman, A. Smeulders, and J. Geusebroek, “Visual
word ambiguity,” IEEE Trans. Pattern Anal. Mach. Intel., vol. 32, no.
7, pp. 1271–1283, Jul. 2009.
[6] D. Chen, S. Tsai, and V. Chandrasekhar, “Tree histogram coding for
mobile image matching,” in Proc. Data Compression Conf., 2009, pp.
143–152.
[7] D. Chen, S. Tsai, V. Chandrasekhar, G. Takacs, R. Vedantham, R.
Grzeszczuk, and B. Girod, “Inverted index compression for scalable
image matching,” in Proc. Data Compression Conf., 2010, p. 525.
[8] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to
object matching in videos,” in Proc. ICCV, 2003, pp. 1470–1477.
[9] G. Schindler and M. Brown, “City-scale location recognition,” in Proc.
CVPR, 2007, pp. 1–7.
[10] B. Kulis and K. Grauman, “Kernelized locality-sensitive hashing for
scalable image search,” in Proc. ICCV, 2009, pp. 2130–2137.
[11] H. Jegou, M. Douze, and C. Schmid, “Hamming embedding and weak
geometric consistency for large scale image search,” in Proc. ECCV,
2008, pp. 304–317.
[12] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, “Supervised
dictionary learning,” in Proc. NIPS, 2008, pp. 1033–1040.
[13] S. Lazebnik and M. Raginsky, “Supervised learning of quantizer codebooks by information losss minimization,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 31, no. 7, pp. 1294–1309, Jul. 2009.
[14] F. Moosmann, B. Triggs, and F. Jurie, “Fast discriminative visual codebooks using randomized clustering forests,” in Proc. NIPS, 2006, pp.
985–992.
[15] R. Ji, H. Yao, X. Sun, B. Zhong, and W. Gao, “Towards semantic embedding in visual vocabulary,” in Proc. CVPR, 2010, pp. 918–925.
[16] F. Perronnin, C. Dance, G. Csurka, and M. Bressan, “Adapted vocabularies for generic visual categ,” in Proc. ECCV, 2006, pp. 464–475.
[17] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local features
and kernels for classification of texture and object categories: A comprehensive review,” Int. J. Comput. Vis., vol. 73, no. 2, pp. 213–238,
Jun. 2007.
[18] J. Liu, Y. Yang, and M. Shah, “Learning semantic visual vocabularies
using diffusion distance,” in Proc. CVPR, 2009, pp. 461–468.
[19] R. Tibshirani, “Regression shrinkage and selection via the Lasso,” J.
Roy. Stat. Soc., vol. 58, no. 1, pp. 267–288, 1996.
[20] G. Salton and C. Buckley, “Term-weighting approaches in automatic
text retrieval,” Inf. Process. Manage., vol. 24, no. 5, pp. 513–523, 1988.
[21] P. Quelhas, F. Monay, J. M. Odobez, D. Gatica-Perez, T. Tuytelaars,
and L. J. Van Gool, “Modeling scenes with local descriptors and latent
aspects,” in Proc. ICCV, 2005, pp. 883–890.
[22] K. Grauman and T. Darrell, “The pyramid match kernel: Discriminative classification with sets of image features,” in Proc. ICCV, 2005,
pp. 1458–1465.
[23] A. Gionis, P. Indyk, and R. Motwani, “Similarity search in high dimensions via hashing,” in Proc. VLDB, 1999, pp. 518–529.
[24] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Lost in
quantization: Improving particular object retrieval in large scale image
databases,” in Proc. CVPR, 2008, pp. 1–8.
[25] Y.-G. Jiang, C.-W. Ngo, and J. Yang, “Towards optimal bag-of-features
for object categorization and semantic video retrieval,” in Proc. CIVR,
2007, pp. 494–501.
[26] T. Kohonen, Learning vector quantization for pattern recognition
Helsinki Inst. Technol., Helsinki, Finland, Tech. Rep. TKK-F-A601,
1986.
[27] T. Kohonen, Self-Organizing Maps, 3rd ed. New York: SpringerVerlag, 2000.
[28] A. Rao, D. Miller, K. Rose, and A. Gersho, “A generalized VQ method
for combined compression and estimation,” in Proc. ICASSP, 1996, pp.
2032–2035.
[29] B. Leibe, A. Leonardis, and B. Schiele, “Combined object categorization and segmentation with an implicit shape model,” in Proc. ECCV,
2004, pp. 17–32.
[30] S. Agarwal and D. Roth, “Learning a sparse representation for object
detection,” in Proc. ECCV, 2002, pp. 113–130.
[31] J. Winn, A. Criminisi, and T. Minka, “Object categorization by learned
universal visual dictionary,” in Proc. ICCV, 2005, pp. 1800–1807.
[32] F.-F. Li and P. Perona, “A Bayesian hierarchical model for learning
natural scene categories,” in Proc. CVPR, 2005, pp. 524–531.
VI. CONCLUSION AND FUTURE WORK
In this paper, we have proposed a sparse-coding-based framework that compresses an originally high-dimensional visual codebook with the help of supervised task labels. In addition to the computational benefits in both processing time and storage space, the compressed codebook is discriminative, extensible, and flexible. Its discriminability comes from integrating the codeword term-weighting measurement into our compression cost, so that “important” codewords are compressed only slightly while less important ones are compressed aggressively. Its extensibility comes from deploying our compression scheme over most existing codebooks [1], [3], [4], [8], [9] as a task-dependent postprocessing adaptation. Its flexibility comes from supervising the compression with task labels, achieved through a task-dependent LCK in our compression loss. We have conducted extensive experiments on PASCAL VOC 07, UKBench, and a 0.5 million Flickr data set, where our compressed codebook showed superior performance over state-of-the-art codebooks [1], [16], [33], [36], [50].
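To make the compression step concrete, the following is a minimal, hypothetical Python sketch of its unsupervised core: each bag-of-words histogram over the original codebook is reconstructed from a small compact basis by Lasso (L1-regularized) regression, so that the compressed description remains sparse. The random basis, the scikit-learn calls, and all parameter values are illustrative assumptions; the codeword term weighting and the LCK supervision described above are omitted.

# Hypothetical sketch (illustration only): Lasso-based compression of
# bag-of-words histograms onto a small compact basis.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
N, D, k = 200, 1000, 64            # images, original codebook size, compact size
X = rng.random((N, D))             # stand-in for bag-of-words histograms

# Assumed compact basis: k randomly sampled image signatures, shape (D, k).
B = X[rng.choice(N, size=k, replace=False)].T

# L1-penalized least-squares fit of each histogram onto the compact basis.
coder = Lasso(alpha=0.01, positive=True, max_iter=5000)
Z = np.zeros((N, k))
for i in range(N):
    coder.fit(B, X[i])
    Z[i] = coder.coef_             # compressed k-dimensional description

print(D / k, "x compression;", np.count_nonzero(Z, axis=1).mean(), "nonzeros per image")

In our framework, by contrast, the compression cost additionally carries the codeword term weighting and the task-dependent LCK described above, rather than a plain reconstruction error.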
Two interesting questions remain open. First, because most existing image search engines generate numerous user-query logs every day, how to adapt the supervised codebook compression to such incrementally arriving logs is still an open problem. To this end, it would be interesting to extend our codebook compression into an incremental version. Second, to a certain degree, a previously compressed codebook also offers valuable knowledge for new tasks. We plan to further explore the feasibility of transferring task-dependent codebooks, e.g., adapting a codebook compressed for PASCAL VOC 07 to Caltech101.
REFERENCES
[1] D. Nister and H. Stewenius, “Scalable recognition with a vocabulary
tree,” in Proc. CVPR, 2006, pp. 2161–2168.
[33] A. Bosch, A. Zisserman, and X. Munoz, “Scene classification using a
hybrid generative/discriminative approach,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 30, no. 4, pp. 712–727, Apr. 2008.
[34] D. Donoho, “For most large underdetermined systems of equations,
the minimal l1-norm near-solution approximates the sparsest near-solution,” Commun. Pure Appl. Math., vol. 59, no. 7, pp. 907–934, 2006.
[35] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robust face
recognition via sparse representation,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[36] S. Bengio, F. Pereira, and Y. Singer, “Group sparse coding,” in Proc.
NIPS, 2009, pp. 1–9.
[37] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online dictionary learning
for sparse coding,” in Proc. ICML, 2009, pp. 689–696.
[38] H. Liu, M. Palatucci, and J. Jiang, “Blockwise coordinate descent procedures for the multi-task Lasso, with applications to neural semantic
basis discovery,” in Proc. ICML, 2009, pp. 649–656.
[39] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition
by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1999.
[40] C. Fellbaum, WordNet: An electronic lexical database. Cambridge,
MA: MIT Press, 1998.
[41] T. Pedersen, S. Patwardhan, and J. Michelizzi, “WordNet::Similarity: Measuring the relatedness of concepts,” in Proc. AAAI, 2004, pp.
1024–1025.
[42] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete
basis set: A strategy employed by V1?,” Vis. Res., vol. 37, no. 23, pp.
3311–3325, Dec. 1997.
[43] M. Aharon, M. Elad, and A. M. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322,
Nov. 2006.
[44] H. Lee, A. Battle, R. Raina, and A. Y. Ng, “Efficient sparse coding
algorithms,” in Proc. NIPS, 2006, pp. 801–808.
[45] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, The PASCAL Visual Object Classes Challenge (VOC) Results, 2007. [Online]. Available: http://www.pascalnetwork.org/challenges/VOC/voc2007
[46] D. Lowe, “Distinctive image features from scale-invariant keypoints,”
Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, Nov. 2004.
[47] A. Hyvarinen and E. Oja, “Independent component analysis: Algorithms and applications,” Neural Netw., vol. 13, no. 4/5, pp. 411–430,
May/Jun. 2000.
[48] J. Suykens and J. Vandewalle, “Least squares support vector machine
classifiers,” Neural Process. Lett., vol. 9, no. 3, pp. 293–300, Jun. 1999.
[49] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes (VOC) challenge,” Int. J.
Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[50] H. Jegou, M. Douze, C. Schmid, and P. Perez, “Aggregating local descriptors into a compact image representation,” in Proc. CVPR, 2010,
pp. 3304–3311.
[51] [Online]. Available: http://www.tineye.com/coolsearches
[52] [Online]. Available: http://www.google.com/mobile/goggles
[53] H. Jegou, M. Douze, and C. Schmid, “Product quantization for nearest
neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no.
1, pp. 117–128, Jan. 2011.
Rongrong Ji (M’11) received the Ph.D. degree from
the Harbin Institute of Technology, Harbin, China, in
2011.
From 2007 to 2008, he was a Research Intern with
the Web Search and Mining Group, Microsoft Research Asia, Beijing, China, mentored by Xing Xie.
From May 2010 to June 2010, he was a Visiting Student with the University of Texas at San Antonio,
San Antonio, and cooperated with Professor Qi Tian.
From July 2010 to November 2011, he was a Visiting
Student with the Institute of Digital Media, Peking
University, Beijing, under the supervision of Professor Wen Gao. He is currently
a Postdoctoral Researcher with the Department of Electrical Engineering, Columbia University, New York. He is the author of over 40 refereed journal and conference papers, including papers in the International Journal of Computer Vision, the IEEE TRANSACTIONS ON IMAGE PROCESSING, Computer Vision and Pattern Recognition, ACM Multimedia, the International Joint Conference on Artificial Intelligence, IEEE MULTIMEDIA, etc. His research interests include image retrieval and annotation, and video retrieval and understanding.
Dr. Ji was the recipient of a Microsoft Fellowship in 2007. He won the Best Paper Award at ACM Multimedia 2011. He was a Session Chair of ICME 2008 and a Program Committee Member of ACM Multimedia 2011. He is a Reviewer for IEEE Signal Processing Magazine, the IEEE TRANSACTIONS ON MULTIMEDIA, SMC, TKDE, and the ACM Multimedia conference, and an Associate Editor of the International Journal of Computer Applications.
Hongxun Yao (M’03) received the B.S. and M.S.
degrees in computer science from the Harbin Shipbuilding Engineering Institute, Harbin, China, in
1987 and 1990, respectively, and the Ph.D. degree
in computer science from the Harbin Institute of
Technology, Harbin, in 2003.
Currently, she is a Professor with the School of
Computer Science and Technology, Harbin Institute
of Technology. She is the author of five books and
over 200 scientific papers. Her research interests
include pattern recognition, multimedia processing,
and digital watermarking.
Wei Liu (S’06) received the B.S. degree from Zhejiang University, Hangzhou, China, in 2001 and the
M.E. degree from the Chinese Academy of Sciences,
Beijing, China, in 2004. He is currently working towards the Ph.D. degree in the Department of Electrical Engineering, Columbia University, New York.
He was a Research Assistant with the Department
of Information Engineering, Chinese University of
Hong Kong, Hong Kong.
Xiaoshuai Sun (S’10) is currently working towards
the Ph.D. degree with the Harbin Institute of Technology, Harbin, China.
His research interests include image and video understanding, focusing on saliency analysis.
Qi Tian (M’96–SM’03) received the B.E. degree
in electronic engineering from Tsinghua University,
Beijing, China, in 1992, the M.S. degree in electrical
and computer engineering from Drexel University,
Philadelphia, PA, in 1996, and the Ph.D. degree
in electrical and computer engineering from the
University of Illinois, Urbana–Champaign, Urbana,
in 2002.
He is currently an Associate Professor with the Department of Computer Science, University of Texas
at San Antonio (UTSA), San Antonio. His research
interests include multimedia information retrieval and computer vision.
Dr. Tian is the recipient of a Best Student Paper Award at ICASSP 2006, a Best Paper Candidate distinction at PCM 2007, the 2010 ACM Service Award, and a Top 10% Best Paper Award at MMSP 2011. He has served as Program Chair, Organizing Committee Member, and Technical Program Committee Member for numerous IEEE and ACM conferences, including ACM Multimedia, SIGIR, ICCV, ICME, etc. He has been a Guest Editor of the IEEE TRANSACTIONS ON MULTIMEDIA, Computer Vision and Image Understanding, Pattern Recognition Letters, the EURASIP Journal on Advances in Signal Processing, and the Journal of Visual Communication and Image Representation. He is on the Editorial Board of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the Journal of Multimedia, and Machine Vision and Applications.