Optimal Semi-Supervised Metric Learning for Image Retrieval

Kun Zhao(1), Wei Liu(2), Jianzhuang Liu(1,3,4)
(1) Shenzhen Key Lab for CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
(2) Electrical Engineering Department, Columbia University, New York, NY, USA
(3) Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong
(4) Media Lab, Huawei Technologies Co. Ltd., China
zkzkmm@gmail.com, wl2223@columbia.edu, liu.jianzhuang@huawei.com

ABSTRACT
In a typical content-based image retrieval (CBIR) system, images are represented as vectors and similarities between images are measured by a specified distance metric. However, the traditional Euclidean distance cannot always deliver satisfactory performance, so an effective metric adapted to the input data is desired. Many recent works on metric learning have exhibited promising performance, but most of them suffer from limited label information and expensive training costs. In this paper, we propose two novel metric learning approaches, Optimal Semi-Supervised Metric Learning and its kernelized version. The proposed approaches incorporate information from both labeled and unlabeled data into a convex and computationally tractable learning framework, which yields a globally optimal solution for the target metric whose rank is much lower than the original data dimension. Experiments on several image benchmarks demonstrate that our approaches lead to consistently better distance metrics than the state of the art in terms of image retrieval accuracy.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms

Keywords
Image retrieval, semi-supervised learning, metric learning
1. INTRODUCTION
In current popular image search engines, images are indexed by text descriptions extracted from their surrounding context, and users input text queries to search for images of interest in an image database. This is essentially a variant of text retrieval. Obviously, manually annotating a huge number of images is extremely expensive, so much research has been devoted to automatically annotating images [2]. However, text descriptions generated by automatic annotation techniques are often inaccurate and limited to a small vocabulary. Moreover, the context of an image is often complicated or confusing and cannot always provide accurate textual labels, in which case the text descriptions inevitably include much noise. These drawbacks make it difficult for such methods to capture users' search intents.

Content-based image retrieval (CBIR) [11] has been an active research topic in recent years. In a typical CBIR system, images are represented by low-level features such as color, texture, and shape, and the system tries to bridge the gap between these low-level features and high-level semantic information. Similarities between images are usually measured by a distance between feature vectors, so learning a reasonable distance metric is a key issue.

Learning effective distance metrics for image retrieval has attracted more and more attention in recent years [17]. The related work can be divided into three main categories. The first consists of unsupervised learning methods, which aim to find low-dimensional embeddings of high-dimensional data vectors; representative techniques include Principal Component Analysis (PCA), Multidimensional Scaling (MDS) [7], and Locality Preserving Projections (LPP) [8]. The second group consists of supervised learning approaches, in which the category labels of the training data are given beforehand and a distance metric is learned to classify the training data; state-of-the-art methods include Linear Discriminant Analysis (LDA) [7], Neighbourhood Components Analysis (NCA) [6], Maximally Collapsing Metric Learning (MCML) [5], and distance metric learning for Large Margin Nearest Neighbor (LMNN) classification [14]. Unlike many unsupervised approaches that rely only on the distribution of the raw data, and unlike supervised methods that suffer from heavy manual labeling effort and the risk of overfitting, the third category consists of semi-supervised (or weakly supervised) learning methods, which learn distance metrics from pairwise constraints, also known as side information [16]. The constraints indicate whether two data samples are relevant (similar) or irrelevant (dissimilar). Xing et al. [16] cast the learning task as a convex optimization problem and applied the resulting solution to data clustering. Following this work, several metric learning techniques have proceeded in this weakly supervised direction. Relevant Component Analysis (RCA) [1] learns a global linear transformation by exploiting only relevant constraints (chunklets). Later, Davis et al. proposed Information-Theoretic Metric Learning (ITML) [4] to handle both relevant and irrelevant constraints; however, ITML fails to exploit the rich unlabeled data and thus risks overfitting. Hoi et al. developed a truly semi-supervised approach, Laplacian Regularized Metric Learning (LRML) [9], which learns a metric from both labeled and unlabeled data, but the learned metric has full rank.

The work presented in this paper falls into the third category. We not only utilize the pairwise similar and dissimilar constraints on the input data but also consider the data distribution in the original feature space to avoid overfitting. For high-dimensional image representations, a low-rank metric is often preferred because it leads to efficient distance computation and removes data noise. Although previous work [3] proposed a low-rank version of the ITML algorithm, the rank of the target metric must be specified in advance. In contrast, our proposed approaches automatically learn optimal low-rank metrics from the training data together with the given pairwise constraints.

2. OPTIMAL SEMI-SUPERVISED METRIC LEARNING
Let $\mathcal{X} = \{x_i \in \mathbb{R}^d\}_{i=1}^{n}$ be a training data set, where $n$ is the total number of data points and $d$ is the data dimensionality. Two groups of pairwise constraints on $\mathcal{X}$ are given: a must-link (i.e., similar) constraint set $\mathcal{M} = \{(x_i, x_j)\} \subset \mathcal{X} \times \mathcal{X}$ and a cannot-link (i.e., dissimilar) constraint set $\mathcal{C} = \{(x_i, x_j)\} \subset \mathcal{X} \times \mathcal{X}$. Note that each data pair $(x_i, x_j)$ is unordered. Let $M \in \mathbb{R}^{d \times d}$ be the desired distance metric, which is a symmetric positive semidefinite matrix. For any two data vectors $x_i$ and $x_j$, the distance between them is

$$D_M(x_i, x_j) = \|x_i - x_j\|_M = \sqrt{(x_i - x_j)^\top M (x_i - x_j)}. \quad (1)$$

The objective of metric learning is to learn the metric matrix $M$ from $\mathcal{X}$ subject to the similar and dissimilar constraints collected in $\mathcal{M} \cup \mathcal{C}$. In a typical semi-supervised learning scenario, unlabeled data points far outnumber labeled ones. In the semi-supervised metric learning setting studied in this paper, the labeled data points are those involved in $\mathcal{M} \cup \mathcal{C}$, while the unlabeled data points are the remaining ones in $\mathcal{X}$. Similar to [9], we place the unlabeled data on a neighborhood graph, e.g., the $k$-NN graph, which establishes a local similarity structure using an initial metric $M_0$ such as the Euclidean metric (i.e., the identity matrix $I$) or the Mahalanobis metric (i.e., the inverse covariance matrix). The local (rather than global) similarity structure induced by the initial metric $M_0$ can be assumed to hold under the mild condition that the neighborhood size $k$ is sufficiently small, e.g., $5 \le k \le 20$. More concretely, the affinity matrix $W \in \mathbb{R}^{n \times n}$ of the $k$-NN graph $G$ is defined as

$$W_{ij} = \begin{cases} 1, & j \in \mathcal{N}_i \\ 0, & \text{otherwise}, \end{cases} \quad (2)$$

in which $\mathcal{N}_i \subset [1:n]$ denotes the set of the $k$ indexes corresponding to the $k$ nearest neighbors of $x_i$ in the entire data set $\mathcal{X}$. Note that $W$ is asymmetric, so the $k$-NN graph $G$ is actually a directed graph. We further define a matrix $L = I + D - W - W^\top$, where $D \in \mathbb{R}^{n \times n}$ is a diagonal matrix whose diagonal entries are $D_{jj} = \sum_{i=1}^{n} W_{ij}$.
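For concreteness, the graph construction described above can be sketched as follows. This is an illustrative implementation under the Euclidean initial metric $M_0 = I$, not the authors' code; the function and variable names are ours, and a sparse representation would be preferable for large $n$.

```python
import numpy as np

def knn_affinity_and_L(X, k=10):
    """Build the directed k-NN affinity W (Eq. 2) and L = I + D - W - W^T.

    X : (n, d) array of data points; the Euclidean metric M0 = I is assumed.
    Returns W and L as dense (n, n) arrays.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(dist2, np.inf)           # exclude self-neighbors

    W = np.zeros((n, n))
    for i in range(n):
        neighbors = np.argsort(dist2[i])[:k]  # indexes of the k nearest neighbors of x_i
        W[i, neighbors] = 1.0                 # W is asymmetric (directed graph)

    D = np.diag(W.sum(axis=0))                # D_jj = sum_i W_ij (column sums)
    L = np.eye(n) + D - W - W.T
    return W, L
```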
Now we propose our objective function $Q(M)$ for optimal semi-supervised metric learning:

$$Q(M) = \frac{1}{2}\|M - M_0\|_F^2 + \frac{\gamma_1}{n}\sum_{i,j=1}^{n} D_M^2(x_i, x_j)\,W_{ij} + \frac{\gamma_2}{|\mathcal{M}|}\sum_{(x_i,x_j)\in\mathcal{M}} D_M^2(x_i, x_j) - \frac{\gamma_3}{|\mathcal{C}|}\sum_{(x_i,x_j)\in\mathcal{C}} D_M^2(x_i, x_j), \quad (3)$$

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, and $\gamma_1, \gamma_2, \gamma_3 > 0$ are trade-off parameters that control the strengths of the three regularization terms: the sum of the squared distances between all neighboring data pairs, the sum of the squared distances between all similar pairs, and the sum of the squared distances between all dissimilar pairs. Let us define a $d \times d$ matrix

$$S = \frac{\gamma_1}{n}\sum_{i,j=1}^{n} W_{ij}(x_i - x_j)(x_i - x_j)^\top + \frac{\gamma_2}{|\mathcal{M}|}\sum_{(x_i,x_j)\in\mathcal{M}}(x_i - x_j)(x_i - x_j)^\top - \frac{\gamma_3}{|\mathcal{C}|}\sum_{(x_i,x_j)\in\mathcal{C}}(x_i - x_j)(x_i - x_j)^\top$$
$$= \frac{\gamma_1}{n} X L X^\top + \frac{\gamma_2}{|\mathcal{M}|}\sum_{(x_i,x_j)\in\mathcal{M}}(x_i - x_j)(x_i - x_j)^\top - \frac{\gamma_3}{|\mathcal{C}|}\sum_{(x_i,x_j)\in\mathcal{C}}(x_i - x_j)(x_i - x_j)^\top,$$

in which $X = [x_1, \cdots, x_n] \in \mathbb{R}^{d \times n}$ is the data matrix. Since $D_M^2(x_i, x_j) = \mathrm{tr}\bigl(M (x_i - x_j)(x_i - x_j)^\top\bigr)$, we can reformulate the original objective in Eq. (3) as

$$\min_{M \succeq 0} \; Q(M) = \frac{1}{2}\|M - M_0\|_F^2 + \mathrm{tr}(MS), \quad (4)$$

where $M_0$ is the initial metric, which can be $I$ or the inverse covariance matrix. This optimization problem is nontrivial because the positive semidefinite constraint $M \succeq 0$ is difficult to handle directly. To this end, we propose an iterative algorithm that solves problem (4) exactly.
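Note that $S$ in Eq. (4) depends only on the data, the graph, and the constraint sets, so it can be precomputed once before optimization. A minimal sketch is given below; the helper names are ours, and the constraint sets are assumed to be given as lists of index pairs.

```python
import numpy as np

def build_S(X, L, must_link, cannot_link, gamma1, gamma2, gamma3):
    """Precompute S = (g1/n) X L X^T + (g2/|M|) sum_M d d^T - (g3/|C|) sum_C d d^T,
    where d = x_i - x_j and X is d x n (columns are data points)."""
    d_dim, n = X.shape
    S = (gamma1 / n) * X @ L @ X.T

    def outer_sum(pairs):
        T = np.zeros((d_dim, d_dim))
        for i, j in pairs:
            diff = X[:, i] - X[:, j]
            T += np.outer(diff, diff)        # (x_i - x_j)(x_i - x_j)^T
        return T

    if must_link:
        S += (gamma2 / len(must_link)) * outer_sum(must_link)
    if cannot_link:
        S -= (gamma3 / len(cannot_link)) * outer_sum(cannot_link)
    return S
```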
We first compute the gradient of $Q$ with respect to $M$:

$$\nabla_M Q = M - M_0 + S, \quad (5)$$

and then perform a projected gradient descent procedure:

$$M_{t+1}(\alpha_t) = \lfloor M_t - \alpha_t \nabla_{M_t} Q \rfloor_{\succeq 0}, \quad t = 0, 1, 2, \cdots, \quad (6)$$

where $\lfloor \cdot \rfloor_{\succeq 0}$ is the projection operator onto the positive semidefinite cone: for any symmetric matrix $A \in \mathbb{R}^{d \times d}$,

$$\lfloor A \rfloor_{\succeq 0} = U (\Lambda)_{\ge 0} U^\top, \quad (7)$$

where the orthogonal matrix $U \in \mathbb{R}^{d \times d}$ and the diagonal matrix $\Lambda \in \mathbb{R}^{d \times d}$ come from the eigendecomposition $A = U \Lambda U^\top$, and the operator $(\Lambda)_{\ge 0}$ zeroes out all negative diagonal entries of $\Lambda$. This projection guarantees that the metric matrix $M_t$ is positive semidefinite at every iteration. The dynamic parameter $\alpha_t$ is a step size enabling effective gradient descent; it is selected from $\beta^z$ ($0 < \beta < 1$, $z = 0, 1, 2, \cdots$) such that the following Wolfe-type sufficient-decrease condition holds:

$$Q\bigl(M_{t+1}(\beta^z)\bigr) - Q(M_t) \le \eta\, \mathrm{tr}\Bigl((\nabla_{M_t} Q)^\top \bigl(M_{t+1}(\beta^z) - M_t\bigr)\Bigr). \quad (8)$$

$\alpha_t$ is simply chosen as $\beta^{z_t}$, where $z_t$ is the smallest nonnegative integer satisfying this condition, and $0 < \eta < 1$ is a constant. Because the objective function in Eq. (3) is convex with respect to $M$ and the feasible set $\{M \succeq 0\}$ is convex, the projected gradient descent procedure attains the globally optimal solution $M^*$. More importantly, the projected gradient descent step in Eq. (6) automatically yields a low-rank metric matrix. The proposed optimization procedure is summarized in Algorithm 1.

Algorithm 1
Input: the objective function $Q(\cdot)$, the initial metric matrix $M_0$, and the parameters $\beta = 0.5$, $\eta = 0.1$.
Output: the optimal metric matrix $M^* = M_t$.
Initialize $t \leftarrow 0$.
repeat
  $z_t \leftarrow 0$; $E \leftarrow \nabla_{M_t} Q$; $M \leftarrow \lfloor M_t - \beta^{z_t} E \rfloor_{\succeq 0}$;
  while $Q(M) - Q(M_t) > \eta\, \mathrm{tr}\bigl(E^\top (M - M_t)\bigr)$ do
    $z_t \leftarrow z_t + 1$; $M \leftarrow \lfloor M_t - \beta^{z_t} E \rfloor_{\succeq 0}$;
  end while
  $M_{t+1} \leftarrow M$; $t \leftarrow t + 1$;
until $Q(M_t)$ converges.
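A compact sketch of Algorithm 1 is shown below. It is an illustrative re-implementation under our own naming, using $\beta = 0.5$ and $\eta = 0.1$ as in the paper; the convergence tolerance and iteration cap are our assumptions.

```python
import numpy as np

def os2ml(M0, S, beta=0.5, eta=0.1, max_iter=100, tol=1e-6):
    """Sketch of Algorithm 1: projected gradient descent with backtracking.

    M0 : (d, d) initial metric (identity or inverse covariance).
    S  : (d, d) matrix defined in Eq. (4).
    Returns the learned (typically low-rank) PSD metric M*.
    """
    def Q(M):                                   # objective of Eq. (4)
        return 0.5 * np.linalg.norm(M - M0, 'fro')**2 + np.trace(M @ S)

    def project_psd(A):                         # Eq. (7): zero out negative eigenvalues
        w, U = np.linalg.eigh(A)
        return (U * np.maximum(w, 0.0)) @ U.T

    M_t = M0.copy()
    for _ in range(max_iter):
        E = M_t - M0 + S                        # gradient, Eq. (5)
        z = 0
        M_new = project_psd(M_t - (beta ** z) * E)
        # Backtrack until the sufficient-decrease condition of Eq. (8) holds.
        while Q(M_new) - Q(M_t) > eta * np.trace(E.T @ (M_new - M_t)):
            z += 1
            M_new = project_psd(M_t - (beta ** z) * E)
        if abs(Q(M_new) - Q(M_t)) < tol:        # convergence check on Q
            return M_new
        M_t = M_new
    return M_t
```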
3. OPTIMAL SEMI-SUPERVISED KERNEL METRIC LEARNING
Linear distance metrics are not always sufficient to discover the complicated nonlinear relationships among real-world image data. In machine learning, kernel methods are a powerful tool for exploring complex structures nonlinearly embedded in the input data, and they have been successfully used to kernelize many linear techniques, including Kernel PCA [15] and Kernel Fisher Discriminant Analysis [10]. In this paper, we also exploit the kernel trick to kernelize the proposed Optimal Semi-Supervised Metric Learning (OS²ML) method, yielding Optimal Semi-Supervised Kernel Metric Learning (OS²KML), in order to achieve better image retrieval accuracy.

In general, the kernel technique implicitly maps the original data in the input space $\mathbb{R}^d$ to a (possibly infinite-dimensional) feature space $\mathcal{Z}$ via a vectorial mapping $\phi: \mathbb{R}^d \mapsto \mathcal{Z}$. The similarity measure for the mapped data in $\mathcal{Z}$ is represented by a kernel function $K(\cdot,\cdot)$, which essentially induces the feature space $\mathcal{Z}$ and computes the inner product between two data vectors $\phi(x_i)$ and $\phi(x_j)$ in $\mathcal{Z}$:

$$K(x_i, x_j) = \phi(x_i)^\top \phi(x_j). \quad (9)$$

Suppose the distance metric in $\mathcal{Z}$ is $M_\phi$. Then the distance between two data vectors $\phi(x_i)$ and $\phi(x_j)$ is

$$D_{M_\phi}\bigl(\phi(x_i), \phi(x_j)\bigr) = \|\phi(x_i) - \phi(x_j)\|_{M_\phi} = \sqrt{(\phi(x_i) - \phi(x_j))^\top M_\phi (\phi(x_i) - \phi(x_j))}. \quad (10)$$

We equivalently write the metric matrix $M_\phi$ of rank $r$ as $HH^\top$, in which $H \in \mathbb{R}^{|\mathcal{Z}| \times r}$. Following [10], we further assume that each column vector of $H$ lies in the subspace spanned by all training data vectors in $\mathcal{Z}$, i.e., the column space of $\Phi = [\phi(x_1), \cdots, \phi(x_n)]$. Then $H = \Phi A$ with $A \in \mathbb{R}^{n \times r}$, and we can rewrite $M_\phi = \Phi A (\Phi A)^\top = \Phi A A^\top \Phi^\top$. Let us define a new vectorial map $k: \mathbb{R}^d \mapsto \mathbb{R}^n$ as

$$k(x) = [K(x_1, x), \cdots, K(x_n, x)]^\top. \quad (11)$$

Substituting $M_\phi = \Phi A A^\top \Phi^\top$ into Eq. (10), it is easy to derive the distance expression

$$D_{M_k}\bigl(k(x_i), k(x_j)\bigr) = \|k(x_i) - k(x_j)\|_{M_k} = \sqrt{(k(x_i) - k(x_j))^\top M_k (k(x_i) - k(x_j))}, \quad (12)$$

in which $M_k = AA^\top$ is the kernel metric matrix, and the resulting distance function $D_{M_k}$ is thus a kernel distance measure. This kernel distance function is the key to kernelizing the linear OS²ML method: OS²KML simply reruns OS²ML over the collection of kernel data vectors $\mathcal{K} = \{k(x_1), \cdots, k(x_n)\}$.

4. EXPERIMENTS
In this section, we evaluate the performance of the two proposed metric learning approaches on three image datasets: the USPS digit image database (http://www-i6.informatik.rwth-aachen.de/~keysers/usps.html), a subset (SubCaltech) of the Caltech-101 database (http://www.vision.caltech.edu/Image_Datasets/Caltech101/), and the CIFAR-10 dataset (http://www.cs.toronto.edu/~kriz/cifar.html). The details of our experimental evaluation are described below.

USPS Database. The first dataset, the United States Postal Service (USPS) database of handwritten digit characters, contains 11,000 normalized grayscale images of size 16×16, with 1,100 images for each of the ten classes 0 to 9. In the experiments, we compare the Mean Average Precisions (MAPs) of the two proposed methods with those of five related methods: the Euclidean distance (EU), the Mahalanobis distance (Mah), LRML [9], ITML-1, and ITML-2 [4]. The first two are unsupervised, while the other three are semi-supervised. LRML uses both labeled and unlabeled data to learn a full-rank metric, whereas ITML-1 (initialized with the identity matrix) and ITML-2 (initialized with the inverse covariance matrix) use only labeled data. All algorithm parameters are tuned to their best configurations. We randomly select 1,000 data points from the database and label 30% of them. The test process is repeated 100 times, and the average MAP is used for comparison. In our experiments, we select the Gaussian kernel $K(x_i, x_j) = \exp\bigl(-\|x_i - x_j\|^2 / (2\sigma^2)\bigr)$ as the kernel function for running our OS²KML method.
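Kernelizing OS²ML thus amounts to mapping each image to its empirical kernel vector $k(x)$ of Eq. (11) and rerunning the linear algorithm on these $n$-dimensional vectors. A minimal sketch with the Gaussian kernel used in the experiments is given below; the bandwidth sigma, the function names, and the reuse of the earlier os2ml/build_S sketches are our own assumptions.

```python
import numpy as np

def gaussian_kernel_map(X_train, X_query, sigma=1.0):
    """Empirical kernel map of Eq. (11): k(x) = [K(x_1, x), ..., K(x_n, x)]^T
    with the Gaussian kernel K(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2)).

    X_train : (n, d) training data; X_query : (m, d) points to be mapped.
    Returns an (m, n) array whose rows are the kernel vectors k(x).
    """
    sq_train = np.sum(X_train**2, axis=1)
    sq_query = np.sum(X_query**2, axis=1)
    dist2 = sq_query[:, None] + sq_train[None, :] - 2.0 * X_query @ X_train.T
    return np.exp(-dist2 / (2.0 * sigma**2))

# OS2KML then reruns the linear OS2ML solver on the kernel vectors, e.g.
# (using the earlier sketches, an assumption of ours):
#   K_vecs = gaussian_kernel_map(X, X, sigma)     # rows are k(x_i)
#   S_k = build_S(K_vecs.T, L, must_link, cannot_link, g1, g2, g3)
#   M_k = os2ml(M0=np.eye(K_vecs.shape[0]), S=S_k)
```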
The results are shown in Table 1, from which we can see that both OS²ML and OS²KML perform well and that OS²KML obtains the best result among all the methods. Figure 1 shows some retrieval examples obtained by EU and OS²ML, indicating that OS²ML finds much better nearest neighbors.

Table 1: Comparison of MAPs on the three databases (labeling rate = 30%).

Database     EU      Mah     LRML    ITML-1  ITML-2  OS²ML   OS²KML
USPS         0.5591  0.2166  0.5167  0.7264  0.2480  0.6224  0.8044
SubCaltech   0.6128  0.2557  0.5714  0.6965  0.1821  0.8224  0.8369
CIFAR-10     0.1825  0.1262  0.1727  0.2132  0.1390  0.2110  0.2418

Figure 1: Nearest neighbor samples. The first row shows the queries; the second and third rows show the nearest neighbors of the queries obtained by EU and OS²ML, respectively.

SubCaltech Database. Second, we test our methods on SubCaltech, a subset of Caltech-101 constructed by us. It consists of 10 categories (3,044 images in total): airplane, bonsai, car-side, chandelier, face, hawksbill, ketch, leopard, motorbike, and watch; each category contains more than 100 images. We adopt Locality-constrained Linear Coding (LLC) [13] to represent the images. The MAPs obtained by the seven methods are also shown in Table 1. Both OS²ML and OS²KML achieve higher MAPs than the other methods, and OS²KML again achieves the highest.

CIFAR-10 Database. CIFAR-10 contains 60,000 tiny images of size 32×32 in ten categories. We use a global feature, a 384-dimensional GIST [12] descriptor, to represent each image. We compute the MAPs of the seven methods with different percentages of labeled data. The results are shown in Figure 2, from which we observe that both OS²ML and OS²KML perform well and that OS²KML consistently outperforms all the other methods. Figure 3 shows the images retrieved for some example queries with the Euclidean distance and with OS²KML; the results show that the proposed method effectively improves retrieval performance. Note that since the GIST feature is simple and cannot capture much of an image's content, the MAPs on CIFAR-10 are relatively low.

Figure 2: Comparison of the MAPs obtained by the seven algorithms at different labeling rates.

Figure 3: Retrieval results for some queries in CIFAR-10. In each block, the first and second rows show the nearest neighbors of the leftmost query obtained by EU and OS²KML, respectively.

The above experiments clearly show that the proposed approaches outperform the state-of-the-art semi-supervised methods for image retrieval.
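For reference, the mean average precision used throughout the experiments can be computed by ranking the database with the learned metric and averaging the precision at each relevant position. The sketch below is a generic retrieval-style MAP computation, not the authors' evaluation code; names and conventions are ours.

```python
import numpy as np

def average_precision(query, database, labels, query_label, M):
    """AP of one query: rank the database by the learned distance of Eq. (1),
    D_M(q, x) = sqrt((q - x)^T M (q - x)), and average precision@k over the
    ranks k at which a relevant (same-label) item is retrieved."""
    diffs = database - query                              # (n, d)
    dists = np.sqrt(np.einsum('nd,dk,nk->n', diffs, M, diffs))
    order = np.argsort(dists)                             # ascending distance
    relevant = (labels[order] == query_label).astype(float)
    if relevant.sum() == 0:
        return 0.0
    cum_hits = np.cumsum(relevant)
    precision_at_k = cum_hits / np.arange(1, len(relevant) + 1)
    return float(np.sum(precision_at_k * relevant) / relevant.sum())

# MAP is the mean of average_precision over all queries.
```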
5. CONCLUSIONS
This paper has proposed two novel metric learning methods, Optimal Semi-Supervised Metric Learning (OS²ML) and its kernelized version OS²KML, for image retrieval. We design a convex and computationally tractable learning framework that yields a globally optimal solution for the target metric, whose rank is much lower than the original data dimension. Experimental results on three image data sets demonstrate that the proposed approaches outperform the state of the art in terms of image retrieval accuracy.

6. ACKNOWLEDGMENTS
This work was supported by grants from the Natural Science Foundation of China (60975029, 61070148); the Science, Industry, Trade, and Information Technology Commission of Shenzhen Municipality, China (JC200903180635A, JC201005270378A, ZYC201006130313A); and the Introduced Innovative R&D Team of Guangdong Province (201001D0104648280).

7. REFERENCES
[1] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. JMLR, 6:937-965, 2005.
[2] D. M. Blei and M. I. Jordan. Modeling annotated data. In Proc. SIGIR, 2003.
[3] J. V. Davis and I. S. Dhillon. Structured metric learning for high dimensional problems. In Proc. KDD, 2008.
[4] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning. In Proc. ICML, 2007.
[5] A. Globerson and S. Roweis. Metric learning by collapsing classes. In NIPS 18, 2005.
[6] J. Goldberger, S. Roweis, and R. Salakhutdinov. Neighbourhood components analysis. In NIPS 17, 2004.
[7] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, Springer, 2009.
[8] X. He and P. Niyogi. Locality preserving projections. In NIPS 16, 2003.
[9] S. C. Hoi, W. Liu, and S.-F. Chang. Semi-supervised distance metric learning for collaborative image retrieval and clustering. ACM Trans. Multimedia Computing, Communications and Applications, 6(3), article 18, 2010.
[10] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Muller. Fisher discriminant analysis with kernels. In Proc. IEEE Signal Processing Society Workshop, 1999.
[11] A. W. M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Trans. PAMI, 22(12):1349-1380, 2000.
[12] A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In Proc. CVPR, 2008.
[13] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proc. CVPR, 2010.
[14] K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 10:207-244, 2009.
[15] C. Williams. On a connection between kernel PCA and metric multidimensional scaling. Machine Learning, 46(1):11-19, 2002.
[16] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning with application to clustering with side-information. In NIPS 15, 2002.
[17] L. Yang and R. Jin. Distance metric learning: A comprehensive survey. Technical report, Michigan State University, 2006.