
Optimal Semi-Supervised Metric Learning
for Image Retrieval
Kun Zhao¹, Wei Liu², Jianzhuang Liu¹,³,⁴
¹ Shenzhen Key Lab for CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
² Electrical Engineering Department, Columbia University, New York, NY, USA
³ Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong
⁴ Media Lab, Huawei Technologies Co. Ltd., China
zkzkmm@gmail.com, wl2223@columbia.edu, liu.jianzhuang@huawei.com
ABSTRACT
In a typical content-based image retrieval (CBIR) system, images are represented as vectors and similarities between images are measured by a specified distance metric. However, the traditional Euclidean distance cannot always deliver satisfactory performance, so an effective metric adapted to the input data is desired. Many recent works on metric learning have exhibited promising performance, but most of them suffer from limited label information and expensive training costs. In this paper, we propose two novel metric learning approaches, Optimal Semi-Supervised Metric Learning and its kernelized version. In the proposed approaches, we incorporate information from both labeled and unlabeled data to design a convex and computationally tractable learning framework which results in a globally optimal solution to the target metric of much lower rank than the original data dimension. Experiments on several image benchmarks demonstrate that our approaches lead to consistently better distance metrics than the state-of-the-art methods in terms of accuracy for image retrieval.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval
General Terms
Algorithms
Keywords
Image retrieval, semi-supervised learning, metric learning
1. INTRODUCTION
In current popular image search engines, images are indexed by text descriptions extracted from the context of
these images and users input text queries to search the images of interest in an image database. This is more like a
variant of text retrieval. Obviously, manual annotations of
a huge number of images are extremely expensive. Much research has been devoted to automatically annotating images [2]. However, text descriptions generated by automatic annotation techniques are often inaccurate and limited to a small vocabulary. Moreover, the context of images is often complicated or confusing and hence cannot always provide accurate textual labels, in which case such text descriptions inevitably include much noise. These drawbacks make it difficult for such methods to cater to users' search intents.

Content-based image retrieval (CBIR) [11] has been an active research topic in recent years. In a typical CBIR system, images are usually represented by low-level features such as color, texture, and shape. CBIR systems try to bridge the gap between these low-level features and the high-level semantic information. Similarities between images are usually measured by a distance between feature vectors, so learning a rational distance metric is a key issue. Learning effective distance metrics for image retrieval has attracted increasing attention in recent years [17]. The related works can be divided into three main categories.

The first is unsupervised learning methods, which aim to find low-dimensional embeddings of high-dimensional data vectors. Representative techniques include Principal Component Analysis (PCA), Multidimensional Scaling (MDS) [7], and Locality Preserving Projections (LPP) [8].

The second group is supervised learning approaches. The category labels of the training data are given beforehand and a distance metric is learned to classify the training data. State-of-the-art methods include Linear Discriminant Analysis (LDA) [7], Neighbourhood Components Analysis (NCA) [6], Maximally Collapsing Metric Learning (MCML) [5], and distance metric learning for Large Margin Nearest Neighbor (LMNN) classification [14].

Unlike many unsupervised approaches that often rely on distributions of raw data, and unlike supervised methods that suffer from heavy manual labeling labor and the risk of overfitting, the third category is semi-supervised (or weakly supervised) learning methods, which try to learn distance metrics from pairwise constraints, also known as side information [16]. The constraints indicate whether two data samples are relevant (similar) or irrelevant (dissimilar). Xing et al. [16] cast the learning task into a convex optimization problem and applied the resulting solution to data clustering. Following this work, several emerging metric learning techniques proceed in this weakly supervised direction. Relevant Component Analysis (RCA) [1] learns a global linear transformation by merely exploiting relevant constraints (chunklets). Lately, Davis et al. proposed an Information-Theoretic Metric Learning (ITML) [4] approach to deal with both relevant and irrelevant constraints. However, ITML fails to incorporate and utilize the rich unlabeled data, with a risk of overfitting. Hoi et al. developed a truly semi-supervised approach, Laplacian Regularized Metric Learning (LRML) [9], to learn a metric from both labeled and unlabeled data, but the learned metric is of full rank.

The work conducted in this paper falls into the third category. We not only utilize the pairwise similar and dissimilar constraints on the input data but also consider the data distribution in the original feature space to avoid overfitting. For high-dimensional image representations, a low-rank metric is often preferred because it allows efficient distance computation and removes data noise. Although the previous work [3] proposed a low-rank version of the ITML algorithm, the rank of the target metric must be specified in advance. In contrast, our proposed approaches automatically learn optimal low-rank metrics from the training data as well as the given pairwise constraints.
2. OPTIMAL SEMI-SUPERVISED METRIC LEARNING

Let X = {x_i ∈ R^d}_{i=1}^n be a training data set, where n is the total number of data points and d indicates the dimensionality of the data. Two groups of pairwise constraints among X are known: a must-link (i.e., similar) constraint set M = {(x_i, x_j)} ⊂ X × X and a cannot-link (i.e., dissimilar) constraint set C = {(x_i, x_j)} ⊂ X × X. Note that each data pair (x_i, x_j) is unordered. Let M ∈ R^{d×d} be the desired distance metric, which is a symmetric and positive semidefinite matrix. Then for any two data vectors x_i and x_j, the distance between them is expressed as

D_M(x_i, x_j) = ∥x_i − x_j∥_M = √((x_i − x_j)^⊤ M (x_i − x_j)).   (1)

The objective of metric learning is to learn the metric matrix M from X subject to the similar and dissimilar constraints collected in M ∪ C.

In a typical scenario of semi-supervised learning, unlabeled data points are usually much more numerous than labeled data points. Under the semi-supervised metric learning setting studied in this paper, the labeled data points are those implicated in M ∪ C while the unlabeled data points are the remaining ones in X.

Similar to [9], we engage the unlabeled data in a neighborhood graph, e.g., the k-NN graph, which establishes a local similarity structure using an initial metric M_0 such as the Euclidean metric (i.e., the identity matrix I) or the Mahalanobis metric (i.e., the inverse covariance matrix). The local, rather than global, similarity structure that applying the initial metric M_0 produces can be assumed to hold under the mild condition that the neighborhood scale k is sufficiently small, e.g., 5 ≤ k ≤ 20. More concretely, the affinity matrix W ∈ R^{n×n} of the k-NN graph G is defined as

W_ij = 1 if j ∈ N_i, and W_ij = 0 otherwise,   (2)

in which N_i ⊂ [1 : n] denotes the set containing the k indexes corresponding to the k nearest neighbors of x_i in the entire data set X. Note that W is asymmetric, so the referred k-NN graph G is actually a directed graph. We further define a matrix L = I + D − W − W^⊤, where D ∈ R^{n×n} is a diagonal matrix whose diagonal entries are set to D_jj = Σ_{i=1}^n W_ij.
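To make the graph construction concrete, the following minimal NumPy sketch builds the asymmetric k-NN affinity matrix W of eq. (2) and the matrix L defined above; the function and variable names are ours, for illustration only.

```python
import numpy as np

def knn_affinity(X, k=10):
    """k-NN affinity matrix W of eq. (2): W[i, j] = 1 iff x_j is one of the
    k nearest neighbors of x_i under the initial (Euclidean) metric.
    X is d x n, one column per data point."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
    np.fill_diagonal(dist2, np.inf)                        # exclude self-neighbors
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist2[i])[:k]] = 1.0
    return W

def graph_matrix(W):
    """L = I + D - W - W^T, with D_jj = sum_i W_ij, as defined above."""
    D = np.diag(W.sum(axis=0))
    return np.eye(W.shape[0]) + D - W - W.T
```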
Now we propose our objective function Q(M) for optimal semi-supervised metric learning as follows:

Q(M) = (1/2)∥M − M_0∥_F² + (γ_1/n) Σ_{i,j=1}^n D_M²(x_i, x_j) W_ij + (γ_2/|M|) Σ_{(x_i,x_j)∈M} D_M²(x_i, x_j) − (γ_3/|C|) Σ_{(x_i,x_j)∈C} D_M²(x_i, x_j),   (3)

where ∥·∥_F represents the Frobenius norm of a matrix, and γ_1, γ_2, γ_3 > 0 are trade-off parameters which together control the extents of three regularization terms: the sum of the squared distances between all neighboring data pairs, the sum of the squared distances between all similar pairs, and the sum of the squared distances between all dissimilar pairs. Let us define a d × d matrix

S = (γ_1/n) Σ_{i,j=1}^n W_ij (x_i − x_j)(x_i − x_j)^⊤ + (γ_2/|M|) Σ_{(x_i,x_j)∈M} (x_i − x_j)(x_i − x_j)^⊤ − (γ_3/|C|) Σ_{(x_i,x_j)∈C} (x_i − x_j)(x_i − x_j)^⊤
  = (γ_1/n) X L X^⊤ + (γ_2/|M|) Σ_{(x_i,x_j)∈M} (x_i − x_j)(x_i − x_j)^⊤ − (γ_3/|C|) Σ_{(x_i,x_j)∈C} (x_i − x_j)(x_i − x_j)^⊤,

in which X = [x_1, · · · , x_n] ∈ R^{d×n} is the data matrix. So far, we can reformulate the original objective defined in eq. (3) as follows:

min_{M ≽ 0}  Q(M) = (1/2)∥M − M_0∥_F² + tr(M S),   (4)

where M_0 is the initial metric, which can be I or the inverse covariance matrix.
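As a concrete illustration, here is a minimal NumPy sketch (with our own naming, not taken from the paper) that assembles S from the data matrix, the graph matrix L, and the two constraint sets, and evaluates the objective of eq. (4):

```python
import numpy as np

def build_S(X, L, must_link, cannot_link, gamma1, gamma2, gamma3):
    """The d x d matrix S: (gamma1/n) X L X^T plus the must-link term minus
    the cannot-link term. X is d x n; the constraint sets are lists of (i, j) pairs."""
    n = X.shape[1]
    S = (gamma1 / n) * (X @ L @ X.T)
    for i, j in must_link:
        diff = (X[:, i] - X[:, j])[:, None]
        S += (gamma2 / len(must_link)) * (diff @ diff.T)
    for i, j in cannot_link:
        diff = (X[:, i] - X[:, j])[:, None]
        S -= (gamma3 / len(cannot_link)) * (diff @ diff.T)
    return S

def objective(M, M0, S):
    """Q(M) of eq. (4): 0.5 * ||M - M0||_F^2 + tr(M S)."""
    return 0.5 * np.linalg.norm(M - M0, 'fro') ** 2 + np.trace(M @ S)
```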
This optimization problem is not trivial, as the positive semidefinite constraint M ≽ 0 is difficult to deal with. To this end, we propose an iterative algorithm to exactly solve the problem in (4). We first compute the gradient of Q with respect to M:

∇_M Q = M − M_0 + S,   (5)

and perform a projected gradient descent procedure as follows:

M_{t+1}(α_t) = ⌊M_t − α_t ∇_{M_t} Q⌋_{≽0},   t = 0, 1, 2, · · · ,   (6)

where ⌊·⌋_{≽0} is the positive semidefinite cone projection operator; that is, for any symmetric matrix A ∈ R^{d×d},

⌊A⌋_{≽0} = U (Λ)_{≥0} U^⊤,   (7)

where the orthogonal matrix U ∈ R^{d×d} and the diagonal matrix Λ ∈ R^{d×d} stem from the eigen-decomposition of the matrix A = U Λ U^⊤, and the operator (Λ)_{≥0} zeros all negative diagonal entries in Λ. The defined matrix operator ⌊·⌋_{≽0} guarantees that the metric matrix M_t at each iteration is always positive semidefinite. The dynamic parameter α_t is an appropriate step size enabling effective gradient descent, which can be selected from β^z (0 < β < 1, z = 0, 1, 2, · · · ) such that the Wolfe condition prescribed below holds:

Q(M_{t+1}(β^z)) − Q(M_t) ≤ η tr((∇_{M_t} Q)^⊤ (M_{t+1}(β^z) − M_t)).   (8)

Here α_t is simply chosen as β^{z_t}, where z_t is the smallest nonnegative integer satisfying the Wolfe condition, and 0 < η < 1 is a constant. By making use of the projected gradient descent algorithm, we can achieve the globally optimal solution M* because the original objective function in eq. (3) is convex with respect to M and the feasible solution set M ≽ 0 is a convex set. More importantly, the projected gradient descent procedure shown in eq. (6) can automatically yield a low-rank metric matrix. Our proposed optimization algorithm is summarized in Algorithm 1.
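The cone projection of eq. (7) is straightforward to implement; a minimal sketch, assuming NumPy (the function name is ours):

```python
import numpy as np

def psd_project(A):
    """Projection onto the PSD cone, eq. (7): eigendecompose A = U Lam U^T
    and zero out the negative eigenvalues."""
    A = 0.5 * (A + A.T)                          # symmetrize against round-off
    lam, U = np.linalg.eigh(A)
    return (U * np.maximum(lam, 0.0)) @ U.T      # U diag(max(lam, 0)) U^T
```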
Algorithm 1
Input: the objective function Q(·), the initial metric matrix M_0, and the parameters β = 0.5, η = 0.1.
Output: the optimal metric matrix M* = M_t.
Initialize t ← 0;
repeat
  z_t ← 0; E ← ∇_{M_t} Q; M ← ⌊M_t − β^{z_t} E⌋_{≽0};
  while Q(M) − Q(M_t) > η tr(E^⊤ (M − M_t)) do
    z_t ← z_t + 1; M ← ⌊M_t − β^{z_t} E⌋_{≽0};
  end while
  M_{t+1} ← M; t ← t + 1;
until Q(M_t) converges.
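Read as code, Algorithm 1 amounts to a projected gradient loop with a backtracking step-size search. The sketch below is our illustrative NumPy reading of it (not the authors' implementation); the matrix S could come from a routine like build_S above, and the stopping tolerance is our own addition:

```python
import numpy as np

def os2ml(M0, S, beta=0.5, eta=0.1, max_iter=200, tol=1e-6):
    """Projected gradient descent on eq. (4) with the backtracking
    step-size search of Algorithm 1. M0 is the initial metric matrix."""
    def psd_project(A):
        A = 0.5 * (A + A.T)
        lam, U = np.linalg.eigh(A)
        return (U * np.maximum(lam, 0.0)) @ U.T

    def Q(M):
        return 0.5 * np.linalg.norm(M - M0, 'fro') ** 2 + np.trace(M @ S)

    M_t = M0.copy()
    for _ in range(max_iter):
        E = M_t - M0 + S                               # gradient, eq. (5)
        z = 0
        M_new = psd_project(M_t - (beta ** z) * E)     # trial step, eq. (6)
        # shrink the step until the sufficient-decrease condition of eq. (8) holds
        while Q(M_new) - Q(M_t) > eta * np.trace(E.T @ (M_new - M_t)):
            z += 1
            M_new = psd_project(M_t - (beta ** z) * E)
        if abs(Q(M_new) - Q(M_t)) < tol:
            return M_new
        M_t = M_new
    return M_t
```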
Figure 1: Nearest neighbor samples. The first row shows the queries. The second and third rows correspond to the nearest neighbors of the queries obtained by EU and OS²ML, respectively.
3. OPTIMAL SEMI-SUPERVISED KERNEL METRIC LEARNING

Linear distance metrics are not always sufficient to discover complicated nonlinear relationships among real-world image data. In the machine learning field, the kernel methodology is a powerful tool for exploring complex structures nonlinearly embedded in input data, and it has been successfully applied to kernelize many linear techniques, including Kernel PCA [15] and Kernel Fisher Discriminant Analysis [10]. In this paper, we also exploit the kernel trick to kernelize our proposed Optimal Semi-Supervised Metric Learning (OS²ML) method, yielding Optimal Semi-Supervised Kernel Metric Learning (OS²KML), in order to achieve better image retrieval accuracy.

In general, the kernel technique implicitly maps original data in the input space R^d to a (possibly infinite-dimensional) feature space Z via some vectorial mapping ϕ : R^d → Z. The similarity measure for the "new" data lying in Z is represented by a kernel function K(·, ·), which essentially induces the feature space Z. The kernel function computes the inner product between two data vectors ϕ(x_i) and ϕ(x_j) in Z:

K(x_i, x_j) = ϕ(x_i)^⊤ ϕ(x_j).   (9)

Suppose that the distance metric in Z is M_ϕ. Then we calculate the distance between two data vectors ϕ(x_i) and ϕ(x_j) as

D_{M_ϕ}(ϕ(x_i), ϕ(x_j)) = ∥ϕ(x_i) − ϕ(x_j)∥_{M_ϕ} = √((ϕ(x_i) − ϕ(x_j))^⊤ M_ϕ (ϕ(x_i) − ϕ(x_j))).   (10)

We equivalently write the metric matrix M_ϕ of rank r as H H^⊤, in which H ∈ R^{|Z|×r}. Following [10], we further assume that each column vector in H belongs to the subspace spanned by Φ = [ϕ(x_1), · · · , ϕ(x_n)], i.e., by all training data vectors in Z. Then we have H = ΦA, where A ∈ R^{n×r}, and we can rewrite M_ϕ = ΦA(ΦA)^⊤ = ΦAA^⊤Φ^⊤. Let us define a new vectorial map k : R^d → R^n as follows:

k(x) = [K(x_1, x), · · · , K(x_n, x)]^⊤.   (11)

It is easy to derive the following distance expression by substituting M_ϕ = ΦAA^⊤Φ^⊤ into eq. (10):

D_{M_k}(k(x_i), k(x_j)) = ∥k(x_i) − k(x_j)∥_{M_k} = √((k(x_i) − k(x_j))^⊤ M_k (k(x_i) − k(x_j))),   (12)

in which M_k = AA^⊤ is the kernel metric matrix and the resulting distance function D_{M_k} is thus a kernel distance measure. The derived kernel distance function is the key to kernelizing the linear OS²ML method, which is as simple as rerunning OS²ML over the kernel data vector collection K = {k(x_1), · · · , k(x_n)}, leading to the nonlinear OS²KML method.
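With a Gaussian kernel, the map of eq. (11) and the kernel distance of eq. (12) can be sketched as follows; this is a minimal illustration with our own function names, and OS²KML then amounts to running the OS²ML optimization on the n-dimensional vectors k(x_i) rather than on the raw features:

```python
import numpy as np

def kernel_map(X, x, sigma=1.0):
    """Empirical kernel map of eq. (11) with a Gaussian kernel:
    k(x) = [K(x_1, x), ..., K(x_n, x)]^T, where X is d x n and x is a d-vector."""
    d2 = np.sum((X - x[:, None]) ** 2, axis=0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_distance(ki, kj, Mk):
    """Kernel distance of eq. (12) between two mapped vectors k(x_i) and k(x_j),
    where Mk = A A^T is the (n x n) kernel metric matrix."""
    diff = ki - kj
    return np.sqrt(max(float(diff @ Mk @ diff), 0.0))

# Kernelized training data: one column k(x_i) per training point.
# K_vectors = np.column_stack([kernel_map(X, X[:, i]) for i in range(X.shape[1])])
```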
4. EXPERIMENTS
In this section, we evaluate the performance of the two proposed metric learning approaches on three image datasets: the USPS digit image database (http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html), a subset (SubCaltech) of the Caltech-101 database (http://www.vision.caltech.edu/Image_Datasets/Caltech101/), and the CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar.html). We describe the details of our experimental evaluation below.

USPS Database. The first image dataset, the United States Postal Service (USPS) database of handwritten digit characters, contains 11000 normalized grayscale images of size 16 × 16, with 1100 images for each of the ten classes from 0 to 9. In the experiment, we compare the Mean Average Precisions (MAPs) of the two proposed methods with the MAPs of five related methods: Euclidean distance (EU for short), Mahalanobis distance (Mah for short), LRML [9], ITML-1, and ITML-2 [4]. The first two are unsupervised, while the other three are semi-supervised. LRML uses both labeled and unlabeled data to learn a full-rank metric, whereas ITML-1 (initialized with the identity matrix) and ITML-2 (initialized with the inverse covariance matrix) use only labeled data. All the parameters of the algorithms are tuned for the best configurations. We randomly select 1000 data points from the database and label 30% of them. The test process is repeated 100 times, and the average MAP is used for comparison.
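The paper does not spell out its exact evaluation protocol, but MAP for retrieval is standard; the following sketch (helper names are ours) shows one common way to compute it from a learned distance matrix:

```python
import numpy as np

def average_precision(ranked_labels, query_label):
    """AP for one query: ranked_labels are the database labels sorted by
    increasing learned distance to the query."""
    rel = (np.asarray(ranked_labels) == query_label).astype(float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / (np.arange(rel.size) + 1)
    return float((prec_at_k * rel).sum() / rel.sum())

def mean_average_precision(D, labels, query_idx):
    """MAP over a set of queries; D[i, j] is the learned distance D_M(x_i, x_j)."""
    labels = np.asarray(labels)
    aps = []
    for q in query_idx:
        order = np.argsort(D[q])
        order = order[order != q]          # drop the query itself
        aps.append(average_precision(labels[order], labels[q]))
    return float(np.mean(aps))
```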
In our experiments, we select the Gaussian kernel K(x_i, x_j) = exp(−∥x_i − x_j∥² / (2σ²)) as the kernel function in running our OS²KML method. The results are shown in Table 1.
Table 1: Comparisons of MAPs on three databases (labeling rate = 30%).

Dataset      EU      Mah     LRML    ITML-1  ITML-2  OS²ML   OS²KML
USPS         0.5591  0.2166  0.5167  0.7264  0.2480  0.6224  0.8044
SubCaltech   0.6128  0.2557  0.5714  0.6965  0.1821  0.8224  0.8369
CIFAR-10     0.1825  0.1262  0.1727  0.2132  0.1390  0.2110  0.2418
From Table 1 we can see that OS²ML and OS²KML both perform well, and OS²KML obtains the best result among all the methods. Figure 1 shows some retrieval examples obtained by EU and OS²ML, indicating that OS²ML finds much better results.
SubCaltech Database. Second, we test our methods on SubCaltech, a subset of Caltech-101 constructed by us. It consists of 10 categories (3044 images in total): airplane, bonsai, car-side, chandelier, face, hawksbill, ketch, leopard, motorbike, and watch. Each category contains more than 100 images. We adopt Locality-constrained Linear Coding (LLC) [13] to represent the images. The comparison of the MAPs obtained by the seven methods is also shown in Table 1. Both OS²ML and OS²KML have higher MAPs than the others, and OS²KML again achieves the highest result.
CIFAR-10 Database. CIFAR-10 contains 6000 tiny images of size 32 × 32 in each of ten categories. We use a global feature, a 384-dimensional "gist" [12] vector, to represent each image. We compute the MAPs of the seven methods with different percentages of labeled data. The results are shown in Figure 2, where we can observe that both OS²ML and OS²KML perform well and OS²KML consistently outperforms all the other methods. Figure 3 shows the images retrieved for some query examples with the Euclidean distance and OS²KML, respectively. The results confirm that the proposed method effectively improves the retrieval performance. Note that since the gist feature is simple and cannot capture much information about an image, the MAPs on CIFAR-10 are relatively low.

Figure 2: Comparisons of the MAPs obtained by the seven algorithms at different labeling rates.

The above experiments clearly show that the proposed approaches outperform the state-of-the-art semi-supervised methods for image retrieval.
Figure 3: Retrieval results for some queries in CIFAR-10. In each block, the first row and the second row correspond to the nearest neighbors of the leftmost query obtained by EU and OS²KML, respectively.
5. CONCLUSIONS

This paper has proposed two novel metric learning methods, Optimal Semi-Supervised Metric Learning (OS²ML) and its kernelized version OS²KML, for image retrieval. We design a convex and computationally tractable learning framework which results in a globally optimal solution to the target metric of much lower rank than the original data dimension. The experimental results on three image datasets demonstrate that the proposed approaches outperform the state-of-the-art methods in terms of accuracy for image retrieval.

6. ACKNOWLEDGMENTS

This work was supported by grants from the Natural Science Foundation of China (60975029, 61070148), the Science, Industry, Trade, and Information Technology Commission of Shenzhen Municipality, China (JC200903180635A, JC201005270378A, ZYC201006130313A), and Guangdong Province through the Introduced Innovative R&D Team of Guangdong Province 201001D0104648280.

7. REFERENCES

[1] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. JMLR, 6:937–965, 2005.
[2] D. M. Blei and M. I. Jordan. Modeling annotated data. In Proc. SIGIR,
2003.
[3] J. V. Davis and I. S. Dhillon. Structured metric learning for high
dimensional problems. In Proc. KDD, 2008.
[4] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon.
Information-theoretic metric learning. In Proc. ICML, 2007.
[5] A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS
18, 2005.
[6] J. Goldberger, S. Roweis, and R. Salakhutdinov. Neighbourhood
components analysis. In NIPS 17, 2004.
[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning:
Data Mining, Inference, and Prediction. Second Edition, Springer, 2009.
[8] X. He and P. Niyogi. Locality preserving projections. In NIPS 16, 2003.
[9] S. C. Hoi, W. Liu, and S.-F. Chang. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Trans. Multimedia Computing, Communications and Applications, 6(3), article 18, 2010.
[10] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Muller. Fisher
discriminant analysis with kernels. In Proc. IEEE Signal Processing Society
Workshop, 1999.
[11] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain.
Content-based image retrieval at the end of the early years. IEEE Trans.
PAMI, 22(12):1349–1380, 2000.
[12] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image
databases for recognition. In Proc. CVPR, 2008.
[13] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong.
Locality-constrained linear coding for image classification. In Proc. CVPR,
2010.
[14] K. Q. Weinberger and L. K. Saul. Distance metric learning for large
margin nearest neighbor classification. JMLR, 10:207–244, 2009.
[15] C. Williams. On a connection between kernel PCA and metric multidimensional scaling. Machine Learning, 46(1):11–19, 2002.
[16] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric
learning with application to clustering with side-information. In NIPS 15,
2002.
[17] L. Yang and R. Jin. Distance metric learning: A comprehensive survey.
Technical report, Michigan State University, May 2006.