To Appear in Proc. IEEE International Conference on Computer Vision (ICCV), 2015

Adaptive Hashing for Fast Similarity Search

Fatih Cakir    Stan Sclaroff
Department of Computer Science
Boston University, Boston, MA
{fcakir,sclaroff}@cs.bu.edu

Abstract

With the staggering growth in image and video datasets, algorithms that provide fast similarity search and compact storage are crucial. Hashing methods that map the data into Hamming space have shown promise; however, many of these methods employ a batch-learning strategy in which the computational cost and memory requirements may become intractable and infeasible with larger and larger datasets. To overcome these challenges, we propose an online learning algorithm based on stochastic gradient descent in which the hash functions are updated iteratively with streaming data. In experiments with three image retrieval benchmarks, our online algorithm attains retrieval accuracy that is comparable to competing state-of-the-art batch-learning solutions, while our formulation is orders of magnitude faster and, being online, it is adaptable to variations of the data. Moreover, our formulation yields improved retrieval performance over a recently reported online hashing technique, Online Kernel Hashing.

1. Introduction

While the current ease of collecting images and videos has made huge repositories available for the computer vision community, it has also challenged researchers to address the problems associated with such large-scale data. At the core of these problems lies the challenge of fast similarity search and the issue of storing these massive collections. One promising strategy for expediting similarity search involves mapping the data items into Hamming space via a set of hash functions, where linear search is fast and sublinear solutions often perform well. Furthermore, the resulting binary representations allow compact indexing structures with a very low memory footprint.

Relatively early work in this domain includes the data-independent Locality Sensitive Hashing and its variants [3, 5, 13], in which the retrieved nearest neighbors of a query are ensured to be within a scale of the true neighbors. Many subsequent studies have utilized training data to learn the hash functions. These data-dependent methods can be categorized as unsupervised, semi-supervised and supervised techniques. Ignoring any label information, unsupervised studies [23, 24, 7] are often used to preserve a metric-induced distance in the Hamming space. When the goal is the retrieval of semantically similar neighbors, supervised methods [12, 19, 18, 17] have been shown to outperform these unsupervised techniques. Finally, semi-supervised methods like [22] leverage label information while constraining the solution space with the use of unlabeled data.

Although data-dependent solutions tend to yield higher accuracy in retrieval tasks than their data-independent counterparts, the computational cost of the learning phase is a critical issue when large datasets are considered. Large-scale data are the norm for hashing applications, and as datasets continue to grow and include variations that were not originally present, hash functions must also accommodate this deviation. However, this may necessitate retraining from scratch for many data-dependent solutions employing a batch learning strategy.

To overcome these challenges, we propose an online learning algorithm in which the hash functions are updated swiftly in an iterative manner with streaming data. The proposed formulation employs stochastic gradient descent (SGD). SGD algorithms provide huge memory savings and can provide substantial performance improvements in large-scale learning. The properties of SGD have been extensively studied [1, 14], and SGD has been successfully applied to a wide variety of applications including, but not limited to, tracking [10], recognition [15], and learning [21].

Despite its feasibility and practicality, it is not straightforward to apply SGD to hashing. Please observe the setup in Fig. 1. Considering semantic-based retrieval, an appropriate goal would be to assign the same binary vector to instances sharing the same label. The two hash functions of the form f = sgn(w^T x) are ample to yield perfect empirical Mean Average Precision (mAP) scores based on Hamming rankings for the synthetic dataset in Fig. 1. However, in an online learning framework where pairs of points are available at each iteration, f_2 will produce identical mappings for a pair sampled from the two distinct classes C_1 and C_2, but it will be erroneous to "correct" this hashing. Simply put, the collective effort of the hash functions towards the end goal can lead to difficulty in assessing which hash functions to update or whether to update any at all.

Figure 1: A setup in R^2 with four classes and two hash functions f_1 and f_2. The feature space is divided into four regions and a binary code is assigned to instances of each class. Given the circled pair of points sampled from two distinct classes C_1 and C_2, the hash function f_2 assigns the same bit. Correcting this 'error' will give rise to the case where identical binary codes are assigned to different classes, a far cry from what is desired.

In this study, we thus propose an SGD-based solution for hashing. Being online, our method is amenable to subsequent variations of the data. Moreover, the method is orders of magnitude faster than state-of-the-art batch solutions while attaining comparable retrieval accuracy on three standard image retrieval benchmarks. We also provide improved retrieval accuracy over a recently reported online hashing method [9].

The remainder of the paper is organized as follows. Section 2 gives the framework for our method. In Section 3 we report experiments, followed by concluding remarks in Section 4.

2. Adaptive Hashing

In this section, we first define the notation and the feature representation used in our formulation. We then devise an online method for learning the hash functions via Stochastic Gradient Descent (SGD), along with an update strategy that selects the hash functions to be updated and a regularization formulation that discourages redundant hash functions.

2.1. Notation and Feature Representation

We are given a set of data points X = {x_1, ..., x_N} in which each point lies in a d-dimensional feature space, i.e., x ∈ R^d. In addition, let S denote a similarity matrix in which each element s_ij ∈ {−1, 1} indicates non-similarity or similarity for the pair x_i and x_j, respectively. S can be derived from a metric defined on R^d or from label information if available. Let H denote the Hamming space. In hashing, the ultimate goal is to assign binary codes to instances such that their proximity in feature space R^d, as embodied in S, is preserved in Hamming space H. A collection of hash functions {f_1, f_2, ..., f_b} is utilized for this purpose, where each function f : R^d → {−1, 1} accounts for the generation of one bit in the binary code.
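To make the promised speed of Hamming-space search concrete, the following is a minimal NumPy sketch (not part of the original paper) of how {−1, 1} codes can be packed into bytes and ranked by Hamming distance with XOR and bit counting; the function names and the 96-bit toy data are illustrative only.

```python
import numpy as np

def pack_codes(bits):
    """Pack a {-1,+1}-valued code matrix (n x b) into uint8 rows for compact storage."""
    return np.packbits(bits > 0, axis=1)

def hamming_rank(query_bits, db_packed, b):
    """Rank database items by Hamming distance to a single query code."""
    q = np.packbits(query_bits > 0)                       # pack the query the same way
    xor = np.bitwise_xor(db_packed, q)                    # differing bits, byte by byte
    dist = np.unpackbits(xor, axis=1)[:, :b].sum(axis=1)  # popcount per database item
    return np.argsort(dist), dist

# toy usage: 1000 random 96-bit codes and one query
rng = np.random.default_rng(0)
db = rng.choice([-1, 1], size=(1000, 96))
order, dist = hamming_rank(db[0], pack_codes(db), 96)
print(order[:5], dist[order[:5]])
```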
Following [18, 9], we use hash functions of the form

    f(x) = sgn(w^T φ(x) − w_0)        (1)

where φ(x) = [K(x_1, x), ..., K(x_m, x)]^T is a non-linear mapping, K(·,·) is a kernel function, and the points x_1, ..., x_m are uniformly sampled from X beforehand. This kernelized representation has been shown to be effective [13], especially for inseparable data, while also being efficient when m ≪ N.

For compact codes, maximizing the entropy of the hash functions has been shown to be beneficial [23, 22]. This implies ∫ f(x) dP(x) = 0, and we approximate it by setting the bias term w_0 to be (1/N) Σ_{i=1}^N w^T φ(x_i). f(x) can then be compactly written as sgn(w^T ψ(x)), where ψ(x) = φ(x) − (1/N) Σ_{i=1}^N φ(x_i). Finally, the binary code of x is computed and denoted by f(x) = sgn(W^T ψ(x)) = [f_1(x), ..., f_b(x)]^T, where sgn is an element-wise operation and W = [w_1, ..., w_b] ∈ R^{d×b}, with d now denoting the dimensionality of ψ(x).

After defining the feature representation, hashing studies usually approach the problem of learning the parameters w_1, ..., w_b by minimizing a certain loss measure [12, 19, 18] or by optimizing an objective function and subsequently inferring these parameters [23, 24]. In the next section, we present our SGD-based algorithm for learning the hash functions. From this perspective, we assume the pairs of points {x_i, x_j} arrive sequentially in an i.i.d. manner from an underlying distribution, and the similarity indicator s_ij is computed on the fly.

2.2. Learning Formulation

Following [17], which compared different loss functions, we employ the squared error loss

    l(f(x_i), f(x_j); W) = (f(x_i)^T f(x_j) − b s_ij)^2        (2)

where b is the length of the binary code. This loss function is mathematically attractive for gradient computations and has been empirically shown to offer superior performance [18]. If F = [f(x_1), ..., f(x_N)]^T = [sgn(W^T ψ(x_1)), ..., sgn(W^T ψ(x_N))]^T = sgn(W^T Φ)^T, where Φ = [ψ(x_1), ..., ψ(x_N)], then learning the hash function parameters can be formulated as the following least-squares optimization problem:

    min_{W ∈ R^{m×b}} J(W) = Σ_{ij} l(f(x_i), f(x_j); W) = ‖F F^T − bS‖_F^2        (3)

where ‖·‖_F is the Frobenius norm. In [18], this problem is solved in a sequential manner, by finding a parameter vector w at each iteration. Instead, we would like to minimize it via online gradient descent, where at each iteration a pair {x_i, x_j} is chosen at random and the parameter W is updated according to the following rule:

    W^{t+1} ← W^t − η^t ∇_W l(f(x_i), f(x_j); W^t)        (4)

where the learning rate η^t is a positive real number. To compute ∇_W l, we approximate the non-differentiable sgn function with the sigmoid σ(x) = 2/(1 + e^{−x}) − 1. Letting v_g = w_g^T ψ(x_i) and u_g = w_g^T ψ(x_j), the derivative of l with respect to w_fg is

    ∂l(x_i, x_j, W)/∂w_fg = 2 [σ(W^T ψ(x_i))^T σ(W^T ψ(x_j)) − b s_ij] × [σ(v_g) (∂σ(u_g)/∂u_g)(∂u_g/∂w_fg) + σ(u_g) (∂σ(v_g)/∂v_g)(∂v_g/∂w_fg)]        (5)

where ∂σ(u_g)/∂u_g = 2e^{−u_g}/(1 + e^{−u_g})^2 and ∂u_g/∂w_fg = ψ_f(x_j); the terms ∂σ(v_g)/∂v_g and ∂v_g/∂w_fg are evaluated similarly. Subsequently, we obtain ∂l/∂W = OP + QR, where O = [ψ(x_i), ..., ψ(x_i)] ∈ R^{m×b} and P = diag[σ(u_1) ∂σ(v_1)/∂v_1, ..., σ(u_b) ∂σ(v_b)/∂v_b] ∈ R^{b×b}; Q and R are obtained by replacing x_i with x_j in O and by swapping u_g and v_g in P.

Updating W solely based on the squared error loss would be erroneous, since this function incurs a penalty if the Hamming distance between non-similar (similar) points is not maximized (minimized). As illustrated in Fig. 1, a perfect retrieval can still be achieved when this criterion is not satisfied. Thus, before applying Eq. 4, an additional step is required to determine which parameters to update, as described next.
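As an illustration of Eqs. 1-5, the sketch below computes the centered kernel mapping ψ(x) and the sigmoid-relaxed gradient of the squared-error loss. It is not the authors' code: the RBF kernel, the bandwidth gamma, and keeping the scalar factor 2(·) explicit (rather than folding it into P and R) are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, X, gamma=0.5):
    """K(a, x) = exp(-gamma * ||a - x||^2); the kernel choice here is illustrative."""
    d2 = ((A[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def make_psi(X, anchors, gamma=0.5):
    """Return psi(x) = phi(x) - (1/N) sum_i phi(x_i), the zero-centered kernel mapping of Sec. 2.1."""
    mean_phi = rbf_kernel(X, anchors, gamma).mean(axis=0)
    return lambda x: rbf_kernel(np.atleast_2d(x), anchors, gamma)[0] - mean_phi

sigma = lambda z: 2.0 / (1.0 + np.exp(-z)) - 1.0                # smooth surrogate of sgn
dsigma = lambda z: 2.0 * np.exp(-z) / (1.0 + np.exp(-z)) ** 2   # its derivative

def grad_sq_loss(W, psi_i, psi_j, s_ij, b):
    """Gradient of (f(x_i)^T f(x_j) - b*s_ij)^2 with sgn relaxed to sigma (Eq. 5)."""
    v, u = W.T @ psi_i, W.T @ psi_j            # v_g = w_g^T psi(x_i), u_g = w_g^T psi(x_j)
    err = sigma(v) @ sigma(u) - b * s_ij
    # column g: 2*err*[sigma(v_g)*dsigma(u_g)*psi(x_j) + sigma(u_g)*dsigma(v_g)*psi(x_i)]
    return 2.0 * err * (np.outer(psi_j, sigma(v) * dsigma(u)) +
                        np.outer(psi_i, sigma(u) * dsigma(v)))

# toy usage: 5 anchors in R^3, a 16-bit code matrix W, and one pair of points
rng = np.random.default_rng(0)
anchors, W = rng.normal(size=(5, 3)), rng.normal(size=(5, 16))
psi = make_psi(rng.normal(size=(50, 3)), anchors)
G = grad_sq_loss(W, psi(rng.normal(size=3)), psi(rng.normal(size=3)), s_ij=1, b=16)
print(G.shape)  # (5, 16), matching W
```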
2.3. Update Strategy

The squared error function in Eq. 2 is practically convenient and performs well, but its gradient may be nonzero even when there is no need for an update (e.g., the case shown in Fig. 1). Thus, at each step of online learning, we need to decide whether or not to update any hash function at all and, if so, we seek to determine the amount and which of the hash functions need to be corrected. For this purpose, we employ the hinge-like loss function of [19, 9], defined as

    l_h(f(x_i), f(x_j)) = max(0, d_H − (1 − α)b),  if s_ij = 1
                          max(0, αb − d_H),        if s_ij = −1        (6)

where d_H ≡ ‖f(x_i) − f(x_j)‖_H is the Hamming distance and α ∈ [0, 1] is a user-defined parameter designating the extent to which the hash functions may produce a loss. If l_h = 0 then we do not perform any update; otherwise, ⌈l_h⌉ indicates the number of bits in the binary code to be corrected.

In determining which of the ⌈l_h⌉ functions to update, we consider updating the functions for which the hash mappings are the most erroneous. Geometrically, this corresponds to the functions for which similar (dissimilar) points are incorrectly mapped to different (identical) bits with a high margin. Formally, let

    Λ = { max(|f_1(x_i)|/‖w̄_1‖, |f_1(x_j)|/‖w̄_1‖), ..., max(|f_b(x_i)|/‖w̄_b‖, |f_b(x_j)|/‖w̄_b‖) }

where w̄ = [w, w_0]^T. We select the hash functions to be updated by sorting the set Λ in descending order and identifying the indices of the first ⌈l_h⌉ elements that incorrectly map {x_i, x_j}.

2.4. Regularization

Decorrelated hash functions are important for attaining good performance with compact codes and also for avoiding redundancy in the mappings [22]. During online learning, the decision boundaries may become progressively correlated, subsequently leading to degraded performance. In order to alleviate this issue, we add an orthogonality regularizer to Eq. 2. This provides an alternative to strictly constraining the hashings to be orthogonal, which may result in mappings along directions that have very low variance in the data. Eq. 2 then becomes

    l(f(x_i), f(x_j); W) = (f(x_i)^T f(x_j) − b s_ij)^2 + (λ/4) ‖W^T W − I‖_F^2        (7)

where λ is the regularization parameter. Consequently, ∂l/∂W is now equal to OP + QR + λ(W W^T − I)W.

The objective function in Eq. 7 is non-convex, and solving for the global minimum is difficult in practice. To get around this, a surrogate convex function could be considered for which a global minimum can be found [20] and convergence of learning can be analyzed. However, because of the strong assumptions that must be made, use of a convex surrogate can lead to poor approximation quality in representing the "real problem," resulting in inferior performance in practice [4, 2].

Algorithm 1: Adaptive hashing algorithm based on SGD for fast similarity search
    Input: streaming pairs {(x_i^t, x_j^t)}_{t=1}^T, α, λ, η^t
    Initialize W^0
    for t ← 1 to T do
        Compute binary codes f(x_i^t), f(x_j^t) and the similarity indicator s_ij
        Compute the loss l_h(f(x_i), f(x_j)) according to Eq. 6
        if ⌈l_h⌉ ≠ 0 then
            Compute ∇_W l(x_i, x_j, W^t) ∈ R^{d×b}
            Compute and sort Λ
            j ← indices of the first ⌈l_h⌉ elements of sorted Λ that incorrectly map {x_i, x_j}
            Zero the columns of ∇_W l not indexed by j
            W^{t+1} ← W^t − η^t ∇_W l(x_i, x_j, W^t)    // update
        else
            W^{t+1} ← W^t
        end
    end
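The following sketch puts the pieces of Algorithm 1 together for a single iteration: the hinge-like loss of Eq. 6, the margin-based selection via Λ, and the regularized gradient of Eq. 7. It reuses sigma and grad_sq_loss from the previous listing; the parameter values (α, λ, η), the use of the pre-threshold response magnitude for |f(x)| in Λ, and the tie-handling are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def hinge_like_loss(fi, fj, s_ij, alpha, b):
    """Eq. 6: zero when the Hamming distance already respects the margin alpha*b."""
    dH = np.count_nonzero(fi != fj)
    return max(0.0, dH - (1 - alpha) * b) if s_ij == 1 else max(0.0, alpha * b - dH)

def adaptive_step(W, wbar_norms, psi_i, psi_j, s_ij, alpha=0.4, lam=0.1, eta=0.1):
    """One SGD iteration in the spirit of Algorithm 1; wbar_norms[g] holds ||[w_g; w0_g]||."""
    b = W.shape[1]
    fi, fj = np.sign(W.T @ psi_i), np.sign(W.T @ psi_j)
    lh = hinge_like_loss(fi, fj, s_ij, alpha, b)
    if np.ceil(lh) == 0:
        return W                                          # no hash function needs correction
    # gradient of the regularized loss (Eq. 7)
    G = grad_sq_loss(W, psi_i, psi_j, s_ij, b) + lam * (W @ W.T - np.eye(W.shape[0])) @ W
    # margin-based ranking Lambda (pre-threshold magnitudes, normalized per hash function)
    Lam = np.maximum(np.abs(W.T @ psi_i), np.abs(W.T @ psi_j)) / wbar_norms
    wrong = (fi * fj != s_ij)                             # bits that map the pair incorrectly
    order = np.argsort(-Lam)
    selected = [g for g in order if wrong[g]][:int(np.ceil(lh))]
    keep = np.zeros(b, dtype=bool)
    keep[selected] = True
    G[:, ~keep] = 0.0                                     # update only the selected hash functions
    return W - eta * G
```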
In the computer vision literature, many have advocated direct minimization of non-convex objective functions, oftentimes using stochastic gradient descent (SGD), e.g., [11, 8, 6], yielding state-of-the-art results. In this work, we wish to faithfully represent the actual adaptive hashing learning task; therefore, we eschew the use of a convex surrogate for Eq. 7 and directly minimize it via SGD. We summarize our adaptive hashing algorithm in Alg. 1. We find that directly minimizing Eq. 7 via SGD yields excellent performance in adaptively learning the hashing functions in our experiments.

3. Experiments

For comparison with [9], we evaluate our approach on the 22K LabelMe and PhotoTourism-Half Dome datasets. In addition, we also demonstrate results on the large-scale Tiny 1M benchmark. We compare our method with five state-of-the-art solutions: Kernelized Locality Sensitive Hashing (KLSH) [13], Binary Reconstructive Embeddings (BRE) [12], Minimal Loss Hashing (MLH) [19], Supervised Hashing with Kernels (SHK) [18] and Fast Hashing (FastHash) [16]. These methods have outperformed earlier supervised and unsupervised techniques such as [3, 24, 23, 22]. In addition, we compare our approach against Online Kernel Hashing (OKH) [9].

3.1. Evaluation Protocol

We consider two large-scale retrieval schemes: one based on Hamming ranking and the other based on hash lookup. The first scheme ranks instances by their Hamming distance to the query. Although it requires a linear scan over the corpus, it is extremely fast owing to the binary representations of the data. To evaluate retrieval accuracy, we measure Mean Average Precision (mAP) scores for a set of queries evaluated at varying bit lengths (up to 256 bits). The second retrieval scheme, which is based on hash lookup, involves retrieving instances within a Hamming ball; specifically, we set the Hamming radius to 3 bits. This procedure has constant time complexity. To quantify retrieval accuracy under this scheme, we compute the Average Precision. If a query returns no neighbors within the Hamming ball, it is counted as zero precision. We also report mAP with respect to CPU time. Furthermore, when comparing against OKH, we also report the cumulative mean mAP and the area under curve (AUC) metric. For all experiments, we follow the protocol used in [18, 9] to construct training and testing sets. All experiments were conducted on a workstation with a 2.4 GHz Intel Xeon CPU and 512 GB RAM.

3.2. Datasets

22K LabelMe. The 22K LabelMe dataset has 22,019 images represented as 512-dimensional Gist descriptors. As pre-processing, we normalize each instance to have unit length. The dataset is randomly partitioned into two: a training and a testing set with 20K and 2K samples, respectively. A 2K subset of the training points is used as validation data for tuning algorithmic parameters, while another 2K samples are used to learn the hash functions; the remaining samples are used to populate the hash table. The l2 norm is used in determining the nearest neighbors. Specifically, x_i and x_j are considered a similar pair if their Euclidean distance is within the smallest 5% of the 20K distances. S is constructed accordingly, and the closest 5% of distances to a query are also used to determine its true neighbors.
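As a rough illustration of the protocol above, the sketch below marks true neighbors by a Euclidean-distance percentile and scores hash-lookup retrieval within a Hamming ball of radius 3; using a single global 5% threshold (rather than the per-dataset and per-query thresholds described here) and counting empty retrievals as zero precision are simplifying assumptions.

```python
import numpy as np

def true_neighbors(queries, database, pct=5.0):
    """Label x_j a true neighbor of a query if their Euclidean distance falls in the smallest pct%."""
    d = np.linalg.norm(queries[:, None, :] - database[None, :, :], axis=2)
    thresh = np.percentile(d, pct)           # one global threshold, as a simplification
    return d <= thresh                       # boolean (num_queries x num_database)

def hash_lookup_precision(query_codes, db_codes, relevant, radius=3):
    """Average precision of retrieval within a Hamming ball; empty retrievals count as zero."""
    precisions = []
    for q, rel in zip(query_codes, relevant):
        dH = np.count_nonzero(db_codes != q, axis=1)
        hits = dH <= radius
        precisions.append(rel[hits].mean() if hits.any() else 0.0)
    return float(np.mean(precisions))
```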
Half Dome. This dataset contains a series of patches with matching information obtained from the Photo Tourism reconstruction of Half Dome. The dataset includes 107,732 patches, in which the matching ones are assumed to be projected from the same 3D point into different images. We extract Gist descriptors for each patch and, as previously, we normalize each instance to have unit length. The dataset is then partitioned into a training and a testing set with 105K and 2K samples, respectively. A 2K subset of the training samples is used as validation data for tuning algorithmic parameters, while another 2K samples from the training set are used to learn the hash functions, and the remaining training samples are then used to populate the hash table. The match information associated with the patches is used to construct the matrix S and to determine the true neighbors of a query.

Tiny 1M. For this benchmark, a set of one million images is sampled from the Tiny image dataset, in which 2K distinct samples are used as test, training and validation data, while the rest is used for populating the hash table. The data is similarly preprocessed to have unit norm. As in the 22K LabelMe benchmark, the l2 norm is used in determining the nearest neighbors, where x_i and x_j are considered a similar pair if their Euclidean distance is within the smallest 5% of the 1M distances. S is constructed accordingly, and the closest 5% of distances to a query are also used to determine its true neighbors.

As described in the next section, when evaluating our method against OKH, the online learning process is continued until 10K, 50K and 25K pairs of training points are observed for the LabelMe, Half Dome and Tiny 1M datasets, respectively. This is done to provide a more detailed and fair comparison between the OKH online method and our online method for adaptively learning the hashing functions.

3.3. Comparison with Online Hashing [9]

Fig. 2 gives a comparison between our method and OKH. We report mAP and cumulative mean mAP values with respect to iteration number for 96-bit binary codes. We also report the Area Under Curve (AUC) score for the latter metric. As stated, the parameters for both methods have been set via cross-validation and the methods are initialized with KLSH. The online learning is continued for 10K, 50K and 25K points for the LabelMe, Half Dome and Tiny 1M datasets. The experiments are repeated five times with different randomly chosen initializations and orderings of pairs. Ours and Ours_r denote the un-regularized and regularized versions of our method, respectively.

Figure 2: Mean average precision and its cumulative mean with respect to iteration number for comparison with OKH [9]. Ours and Ours_r denote the unregularized and regularized versions of our method, respectively. (Top row) 22K LabelMe dataset. (Middle row) Half Dome dataset. (Bottom row) Tiny 1M dataset.

Analyzing the results, we do not observe significant improvements in performance for the 22K LabelMe dataset. Our method has a benign AUC score increase with respect to OKH. On the other hand, the performance improvements for the Half Dome and Tiny 1M datasets are much more noticeable. Our method converges faster in terms of iterations and shows a significant boost in AUC score (0.67 compared to 0.60 for Half Dome and 0.43 compared to 0.40 for Tiny 1M, specifically). Although both methods perform similarly at the early stages of learning, OKH yields inferior performance as the learning process continues compared to our method. Moreover, our technique learns better hash functions, achieving higher retrieval accuracy for both the Halfdome and Tiny 1M benchmarks. The regularization also has a positive, albeit slight, effect in improving the performance. This validates our earlier claim that the regularization term alleviates the possible gradual correlation among hash functions as the algorithm progresses and thus helps in avoiding inferior performance.

3.4. Comparison with Batch Solutions

We also conducted experiments to compare our online solution against leading batch methods. For these comparisons, we show performance values for hash codes of varying length, in which the number of bits is varied from 12 to 256. All methods are trained and tested with the same data splits. Specifically, as a standard setup, 2K training points are used for learning the hash functions for both batch and online solutions. Thus, for online techniques, the performance scores reported in the tables and figures are derived from an early stage of the algorithm, in which both our method and OKH achieve similar results, as will be shown. This is different when reporting mAP values with respect to CPU time, where the online learning is continued for 10K (LabelMe), 50K (Halfdome), and 25K (Tiny 1M) pairs of points, as is done in Section 3.3. The mAP vs. CPU time figures are thus important for demonstrating a fairer comparison. Other algorithmic parameters have been set via cross-validation. Finally, Ours denotes our regularized method.

Table 1 shows mAP values for the 22K LabelMe benchmark. Analyzing the results, we observe that while BRE performs the best in most cases, our method surpasses the other batch solutions especially when the number of bits is 96 or more; e.g., for 96- and 128-bit binary codes our method achieves 0.64 and 0.67 mAP values, respectively, surpassing SHK, MLH, FastHash and KLSH. More importantly, these results are achieved with drastic training time improvements. For example, it takes only ∼4 seconds to learn hash functions that give retrieval accuracy comparable with state-of-the-art batch solutions that take at best 600 seconds to train.

Results for the hash-lookup-based retrieval scheme are shown in Fig. 4 (center-right). Here, we again observe that the online learning techniques demonstrate similar hash lookup precision values and success rates. Not surprisingly, both the precision values and success rates drop near zero for lengthier codes due to the exponential decrease of instances in the hash bins. With fewer bits the hash lookup precision is also low, most likely due to the decrease in hashing quality with small binary codes. This suggests that implementers must appropriately select the Hamming ball radius and the code size for optimal retrieval performance.

Similarly, Table 2 reports mAP at varying numbers of bits for the Halfdome dataset. We observe that SHK performs the best in most cases while our method is a close runner-up; e.g., SHK achieves 0.74 and 0.79 mAP values compared to our 0.66 and 0.76 for 96 and 128 bits, respectively. Again, our method achieves these results with hash functions learned orders of magnitude faster than SHK and all other batch methods. Though KLSH also has a very low training time, it demonstrates poor retrieval performance compared to the other solutions. Fig. 5 (center-right) shows results for the hash lookup precision and success rate, in which we observe similar patterns and values compared to state-of-the-art techniques.

Table 3 reports mAP values for the large-scale Tiny 1M dataset.
Here we observe that FastHash and BRE perform best with compact codes and with lengthier codes, respectively. Our online technique performs competitively against these batch methods. The hash lookup precision and success rates for the online techniques are again competitive, as shown in Fig. 6 (center-right).

Figs. 4-6 (left) show mAP values of all the techniques with respect to CPU time. The batch learners correspond to single points, in which 2K samples are used for learning. For the 22K LabelMe dataset, we observe no significant difference in performance between our technique and OKH. However, both of these online methods outperform the state of the art (except for BRE) with significantly reduced computation time. For the Halfdome and Tiny 1M benchmarks, the OKH method is slightly faster than ours; however, in both benchmarks the accuracy of OKH degrades as the learning process continues. Overall, our method demonstrates higher retrieval accuracy than OKH and is generally the runner-up method. Yet, our results are obtained from hash functions learned orders of magnitude faster compared to the state-of-the-art batch techniques.

Table 1: Mean Average Precision for the 22K LabelMe dataset (Random ∼ 0.05). For all methods, 2K points are used in learning the hash functions. The last column reports the training time in seconds at 96 bits.

Method         12 bits  24 bits  48 bits  96 bits  128 bits  256 bits  Training time (s)
BRE [12]       0.26     0.48     0.58     0.67     0.70      0.74      929
MLH [19]       0.31     0.37     0.46     0.59     0.61      0.63      967
SHK [18]       0.39     0.48     0.56     0.61     0.63      0.67      1056
FastHash [16]  0.40     0.49     0.56     0.61     0.63      0.66      672
KLSH [13]      0.31     0.42     0.48     0.56     0.59      0.64      5×10^-4
OKH [9] (@2K)  0.32     0.43     0.53     0.63     0.67      0.73      3.7
Ours (@2K)     0.33     0.43     0.54     0.64     0.67      0.71      3.8

Table 2: Mean Average Precision for the Halfdome dataset (Random ∼ 1×10^-4). For all methods, 2K points are used in learning the hash functions. The last column reports the training time in seconds at 96 bits.

Method         12 bits  24 bits  48 bits  96 bits  128 bits  256 bits  Training time (s)
BRE [12]       0.003    0.078    0.40     0.62     0.69      0.80      1022
MLH [19]       0.012    0.13     0.38     0.62     0.67      0.80      968
SHK [18]       0.024    0.23     0.46     0.74     0.79      0.87      631
FastHash [16]  0.024    0.16     0.43     0.69     0.75      0.84      465
KLSH [13]      0.012    0.05     0.11     0.28     0.33      0.40      5×10^-4
OKH [9] (@2K)  0.022    0.18     0.45     0.69     0.73      0.81      6
Ours (@2K)     0.024    0.18     0.47     0.66     0.76      0.83      7.8

Table 3: Mean Average Precision for the Tiny 1M dataset. For all methods, 2K points are used in learning the hash functions. The last column reports the training time in seconds at 96 bits.

Method         12 bits  24 bits  48 bits  96 bits  128 bits  256 bits  Training time (s)
BRE [12]       0.24     0.32     0.38     0.47     0.50      0.59      669
MLH [19]       0.21     0.28     0.35     0.41     0.45      0.55      672
SHK [18]       0.24     0.30     0.36     0.41     0.42      0.45      3534
FastHash [16]  0.26     0.32     0.38     0.44     0.46      0.52      772
KLSH [13]      0.2      0.25     0.32     0.39     0.42      0.45      5×10^-4
OKH [9] (@2K)  0.23     0.28     0.36     0.44     0.47      0.55      1.9
Ours (@2K)     0.24     0.29     0.36     0.44     0.48      0.52      2.4

Figure 3: Sample pictures from the 22K LabelMe, Halfdome and Tiny 1M datasets.

Figure 4: Results for the 22K LabelMe dataset. (Left) Mean Average Precision (at 96 bits) with respect to CPU time, in which for OKH and our method the online learning is continued for 10K pairs of points. (Center) Mean hash lookup precision with Hamming radius 3. (Right) Hash lookup success rate.
Figure 5: Results for the Halfdome dataset. (Left) Mean Average Precision (at 96 bits) with respect to CPU time, in which for OKH and our method the online learning is continued for 50K pairs of points. (Center) Mean hash lookup precision with Hamming radius 3. (Right) Hash lookup success rates.

Figure 6: Results for the Tiny 1M dataset. (Left) Mean Average Precision (at 96 bits) with respect to CPU time, in which for OKH and our method the online learning is continued for 25K pairs of points. (Center) Mean hash lookup precision with Hamming radius 3. (Right) Hash lookup success rates.

4. Conclusion

In this study, we have proposed an algorithm based on stochastic gradient descent to efficiently learn hash functions for fast similarity search. Being online, it is adaptable to variations and growth in datasets. Our online algorithm attains retrieval accuracy that is comparable with state-of-the-art batch-learning methods on three standard image retrieval benchmarks, while being orders of magnitude faster than these competing batch-learning methods. In addition, our proposed formulation gives improved retrieval performance over the only competing online hashing technique, OKH, as demonstrated in experiments.

Adaptive online hashing methods are essential for large-scale applications where the dataset is not static, but continues to grow and diversify. This application setting presents important challenges for hashing-based fast similarity search techniques, since the hashing functions would need to continue to evolve and improve over time. While our results are promising, we believe further improvements are possible. In particular, we would like to investigate methods that can augment and grow the hash codes over time, while at the same time not revisiting (or relearning) hashes for previously seen items that are already stored in an index.

References

[1] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Proc. Advances in Neural Information Processing Systems (NIPS), pages 161-168, 2008.
[2] R. Collobert, F. Sinz, J. Weston, and L. Bottou. Trading convexity for scalability. In Proc. International Conf. on Machine Learning (ICML), 2006.
[3] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th Annual Symposium on Computational Geometry (SCG), 2004.
[4] T.-M.-T. Do and T. Artières. Regularized bundle methods for convex and non-convex risks. Journal of Machine Learning Research, 2012.
[5] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. 25th International Conference on Very Large Data Bases (VLDB), 1999.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[7] Y. Gong and S. Lazebnik. Iterative quantization: A procrustean approach to learning binary codes. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011.
[8] J. He, L. Balzano, and A. Szlam. Incremental gradient on the Grassmannian for online foreground and background separation in subsampled video. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.
[9] L.-K. Huang, Q. Yang, and W.-S. Zheng. Online hashing. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 2013.
[10] R. Kehl, M. Bray, and L. Van Gool. Full body tracking from multiple views using stochastic sampling. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. Advances in Neural Information Processing Systems (NIPS), 2012.
[12] B. Kulis and T. Darrell. Learning to hash with binary reconstructive embeddings. In Proc. Advances in Neural Information Processing Systems (NIPS), 2009.
[13] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing for scalable image search. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2009.
[14] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9-50. Springer-Verlag, London, UK, 1998.
[15] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2004.
[16] G. Lin, C. Shen, Q. Shi, A. van den Hengel, and D. Suter. Fast supervised hashing with decision trees for high-dimensional data. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2014.
[17] G. Lin, C. Shen, D. Suter, and A. van den Hengel. A general two-step approach to learning-based hashing. In Proc. IEEE International Conf. on Computer Vision (ICCV), 2013.
[18] W. Liu, J. Wang, R. Ji, Y.-G. Jiang, and S.-F. Chang. Supervised hashing with kernels. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012.
[19] M. Norouzi and D. J. Fleet. Minimal loss hashing for compact binary codes. In Proc. International Conf. on Machine Learning (ICML), 2011.
[20] S. Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 2012.
[21] S. Shalev-Shwartz, Y. Singer, and N. Srebro. Pegasos: Primal estimated sub-gradient solver for SVM. In Proc. International Conf. on Machine Learning (ICML), 2007.
[22] J. Wang, S. Kumar, and S.-F. Chang. Semi-supervised hashing for large-scale search. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[23] Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. In Proc. Advances in Neural Information Processing Systems (NIPS), 2008.
[24] D. Zhang, J. Wang, D. Cai, and J. Lu. Self-taught hashing for fast similarity search. In Proc. 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2010.