Nearest-Neighbor Search on a Time Budget via Max-Margin Trees

Parikshit Ram∗    Dongryeol Lee    Alexander G. Gray
∗School of Computational Science and Engineering, Georgia Institute of Technology

Abstract

Many high-profile applications pose high-dimensional nearest-neighbor search problems. Yet, it still remains difficult to achieve fast query times for state-of-the-art approaches which use multidimensional trees for either exact or approximate search, possibly in combination with hashing approaches. Moreover, a number of these applications only have a limited amount of time to answer nearest-neighbor queries. However, we observe empirically that the correct neighbor is often found early within the tree-search process, while the bulk of the time is spent on verifying its correctness. Motivated by this, we propose an algorithm for finding the best neighbor given any particular time limit, and develop a new data structure, the max-margin tree, to achieve accurate results even with small time budgets. Max-margin trees perform better in the limited-time setting than current commonly-used data structures such as the kd-tree and more recently developed data structures like the RP-tree.

1 Introduction

In this paper, we address the well-studied problem of nearest-neighbor search (nns) in large datasets of potentially high dimensionality. The task of nns is defined as follows: for a query point q, return its closest point with respect to a specified distance measure D from a given collection S of n points/objects with d features/dimensions. Formally, we need to find a point p ∈ S such that

(1.1)    D(q, p) = \min_{r \in S} D(q, r).

nns has extensive applications across many fields in computer science such as databases, computer vision, information retrieval and machine learning. Moreover, nn methods are heavily used in sciences such as astronomy, molecular physics and bioinformatics. Everyday applications of nn search include image queries and similar-image results in online information retrieval (web/image search), recommendations in music (Pandora, last.fm, GrooveShark) and movie libraries (Netflix, Blockbuster), and potential matches based on personality features on online dating sites (OkCupid.com, Match.com).

Many of these applications have a large collection of objects – both n and d are large. For example, online repositories contain petabytes of data with easily over millions of objects. Moreover, objects like images, online text articles and songs usually have tens of thousands of dimensions. This makes the problem hard, yet many of these applications have limited response times. However, these applications can afford to incur a slight loss in the accuracy of the result. Therefore, many algorithms employ approximation techniques to trade accuracy for scalability.

Examples of such time-limited applications are common. Online search queries are nowadays expected to be completed within a few milliseconds even though the results can be slightly inaccurate. New users at an online music library want to listen to a playlist of songs as soon as they sign up and input their preferences, but they do not require the perfect songs to be played right away; the search process needs to populate the first few songs of the playlist almost immediately, but can subsequently find more relevant music as the application gets more time. Users wish to explore potential matches on online dating sites soon after they create their profile with their character traits and preferences – they want potential matches as soon as possible, but are ready to wait for their “perfect match”. In all these applications, almost immediate results are required, yet users prefer the service provider with the better results (better accuracy/relevance for a small wait/search time).

Approximation in nns has been extensively studied, but none of the existing algorithms explicitly upper bounds the response time – to our knowledge, the problem of returning the best approximate answer within a given time has not been studied. We propose the problem of nns on a time budget, a meta-algorithm for it, and a new data structure for solving the problem accurately. We now review existing ideas in nns as background to our approach.
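As a concrete baseline for the discussion that follows, a minimal linear-scan implementation of Equation (1.1) might look like the sketch below. Python and the Euclidean metric are our illustrative choices here, not something prescribed by the paper.

    import numpy as np

    def linear_scan_nn(q, S):
        """Return the exact nearest neighbor of query q in the n x d array S
        under the Euclidean distance (Equation 1.1); O(nd) per query."""
        dists = np.linalg.norm(S - q, axis=1)   # D(q, r) for every r in S
        return S[np.argmin(dists)]

    # Example usage on random data.
    rng = np.random.default_rng(0)
    S = rng.standard_normal((1000, 64))   # n = 1000 points, d = 64 dimensions
    q = rng.standard_normal(64)
    p = linear_scan_nn(q, S)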
1.1 Approaches to Nearest-neighbor Search

The simplest approach of linear search over S to find the nn is easy to implement, but requires O(n) computations for a single nn query, making it intractable for moderately large n. The most generally efficient and common approach is to use multidimensional indexing data structures. Binary spatial partitioning trees, like kd-trees [1] (MakeKDTreeSplit subroutine in Algorithm 3), ball trees [2] and metric trees [3], index the data for efficient querying. Tree-based depth-first search algorithms (Algorithm 1) greedily traverse down the tree and subsequently attempt to prune away parts of the data set from the computation using the current neighbor candidate, while backtracking along the traversal path in the tree.

High dimensional data pose an inherent problem for Euclidean nns because of the concentration of the pairwise Euclidean distances [4]. This implies that in high dimensions, the Euclidean distances between uniformly distributed points lie in a small range of continuous values. The tree-based algorithms then perform no better than linear search because of their inability to employ tight upper bounds on the running nn distance in high dimensions [5, 6, 7]. This prompted interest in approximation of the nn problem to obtain scalability.

1.2 Approximations

What approximation principle (notion of error) should be used? Distance approximate nearest-neighbor search (dann) approximates the query distance to the nn, and any neighbor within that query distance is considered to be “good enough”. Applications where the values of the distances to the neighbor are essential (with no regard to the ordering of the neighbors) benefit from this approximation. The nn distance error is defined as the relative difference between the distances of the query to the returned neighbor and to the true nn. For a given query q and a returned neighbor candidate p ∈ S, the nn distance error ǫ is defined as:

(1.2)    \epsilon = \frac{D(q, p) - \min_{r \in S} D(q, r)}{\min_{r \in S} D(q, r)}.

Numerous techniques exist for dann and are fairly scalable to higher dimensions under certain assumptions. They can be broadly divided into two groups: methods utilizing and modifying tree-based search, and methods utilizing random projections [8]. Certain tree-based algorithms prune more aggressively on the standard tree data structures, utilizing the allowed approximation to obtain some speedup. Another approach limits the depth-first algorithm to the first leaf and modifies the tree data structures to bound the nn error (for example, spill trees [5]). These obtain significant speedup over the exact methods but still do not scale to high dimensions (however, in the same paper, a variant of the spill tree is presented which does scale to high dimensions; we discuss that result subsequently). Some other dann methods lower the dimensionality of the data with random projections [8] to obtain better scalability. Locality sensitive hashing (LSH) [9, 10] hashes the data into lower dimensional buckets using hash functions which guarantee that “close” points are hashed into the same bucket with high probability and “farther apart” points are hashed into the same bucket with low probability. This method has significant improvements in running times over traditional methods on high dimensional data and is shown to be highly scalable. Numerous other methods using random projections exist in the literature for dann. Hybrid spill trees [5] combine the two ideas of tree-based search and random projections by building spill trees on the randomly projected data; they have been shown to have improved scalability.

Note that there is a significant difference in the way LSH and the tree-based algorithms deal with approximations. Tree-based algorithms generally grow the tree on a data-independent (or weakly data-dependent) heuristic. The search strategy ensures the upper bound on the approximation by visiting all required parts of the tree (backtracking in Algorithm 1). LSH buckets the data at multiple scales and the search algorithm simply looks at the appropriate buckets, searching only within those buckets without any explicit control over the error. This makes LSH often highly efficient, albeit, at times, at the cost of having inaccurate results or no results at all. Using a tree allows the approximation decisions to be made during the search process for each query, yielding the possibility of more stable and accurate results (obviously, at the cost of some efficiency). Another significant difference is that for the tree-based methods, the preprocessing is done only once; the search process is adapted for different desired levels of approximation (that is, the allowed error ǫ as defined in Equation 1.2). The preprocessing of the data for LSH is done for a particular level of approximation (that is, for a fixed value of ǫ). To obtain another level of approximation, the preprocessing step has to be done again. In practice, this caveat is dealt with by preprocessing and storing the hashed data from LSH for multiple levels of approximation (that is, for multiple values of ǫ).

Another natural notion of nn error is the percentile of the data set closer to the query than the returned solution – the nn error is the relative rank of the returned solution. For a query q and a returned neighbor candidate p ∈ S, the nn rank error τ is defined as:

(1.3)    \tau = \frac{|\{r \in S : D(q, r) < D(q, p)\}|}{|S|},

where |A| denotes the cardinality of a set A. Rank approximate nearest-neighbor search (rann) [11] introduces the notion of nn rank error and approximates the problem of nns to returning a candidate neighbor whose distance to the query is ranked in a low percentile of the data. This form of approximation is more sensible in a number of applications where the ordering of the neighbors is more crucial than the absolute values of the query distances.

NN error in this paper. Our focus is on high-dimensional search, where the nn rank error seems more appropriate for the following reasons: (i) in high dimensional data, the pairwise distances tend to be concentrated in a small interval and dann can trivialize the nns, since all the points might satisfy the distance approximation; (ii) in high dimensional data sets like images, the actual (Euclidean) distances between the points (images) do not have any inherent meaning, and the distances are generally designed to correspond to certain ordering constraints. Hence, we consider the nn rank error as the notion of nn error in this paper.
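To make the two notions of error concrete, here is a small illustrative sketch (ours, assuming the Euclidean distance and a true nn at strictly positive distance) that evaluates both the distance error ǫ of Equation (1.2) and the rank error τ of Equation (1.3) for a returned candidate p:

    import numpy as np

    def nn_errors(q, p, S):
        """Distance error (Eq. 1.2) and rank error (Eq. 1.3) of candidate p for query q."""
        dists = np.linalg.norm(S - q, axis=1)          # D(q, r) for all r in S
        d_p = np.linalg.norm(q - p)                    # D(q, p)
        d_star = dists.min()                           # min_r D(q, r); assumed > 0 here
        eps = (d_p - d_star) / d_star                  # relative distance error
        tau = np.count_nonzero(dists < d_p) / len(S)   # fraction of S closer to q than p
        return eps, tau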
Algorithm 1 Error constrained nns

    ECNNS(Query q, Tree T, Error ǫ)
      if isLeaf(T) then
        q.nn ← arg min_{r ∈ T ∪ {q.nn}} d(q, r)
      else
        if ⟨T.w, q⟩ + T.b < 0 then
          ECNNS(q, T.left, ǫ)
          /* backtrack */
          if cannot_prune(q, T.w, T.b, ǫ) then
            ECNNS(q, T.right, ǫ)
          end if
        else
          ECNNS(q, T.right, ǫ)
          /* backtrack */
          if cannot_prune(q, T.w, T.b, ǫ) then
            ECNNS(q, T.left, ǫ)
          end if
        end if
      end if
      return q.nn

Algorithm 2 Time constrained nns

    TCNNS(Query q, Tree T, Time t)
      if t ≥ 0 then
        if isLeaf(T) then
          q.nn ← arg min_{r ∈ T ∪ {q.nn}} d(q, r)
          t ← t − 1
        else
          if ⟨T.w, q⟩ + T.b < 0 then
            TCNNS(q, T.left, t)
            TCNNS(q, T.right, t)
          else
            TCNNS(q, T.right, t)
            TCNNS(q, T.left, t)
          end if
        end if
      end if
      return q.nn

Figure 1: Recursive depth-first nns algorithms. The splitting hyperplane at each level of the binary tree T is denoted by (T.w, T.b). The field q.nn stores the current best neighbor candidate for the query q. The function cannot_prune(q, T.w, T.b, ǫ) checks whether a node can be removed from the computation (i.e. can be pruned) given the current neighbor candidate and the allowed approximation ǫ; otherwise the depth-first algorithm needs to traverse that part of the tree (i.e. backtrack along the depth-first path in the tree). The time constraint t is in the form of the allowed number of leaves to be visited (see text for details).
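The following sketch (ours, not the authors' code) mirrors the structure of Algorithm 2 on a simple axis-aligned median-split tree: the hyperplane at each node is (w, b) with w a coordinate vector, the search descends to the closer side first, and a mutable leaf budget plays the role of the time constraint t.

    import numpy as np

    class Node:
        def __init__(self, points=None, w=None, b=None, left=None, right=None):
            self.points, self.w, self.b = points, w, b
            self.left, self.right = left, right
            self.is_leaf = points is not None

    def build_kd_tree(S, leaf_size=30):
        """Median split along the coordinate with the largest spread (MakeKDTreeSplit)."""
        if len(S) <= leaf_size:
            return Node(points=S)
        d_s = np.argmax(S.max(axis=0) - S.min(axis=0))
        w = np.zeros(S.shape[1]); w[d_s] = 1.0
        b = -float(np.median(S[:, d_s]))
        mask = S @ w + b < 0
        if mask.all() or not mask.any():               # all points identical along d_s
            return Node(points=S)
        return Node(w=w, b=b, left=build_kd_tree(S[mask], leaf_size),
                    right=build_kd_tree(S[~mask], leaf_size))

    def tcnns(q, T, budget, best=None):
        """Time-constrained depth-first search in the spirit of Algorithm 2.
        budget is a one-element list holding the number of leaves still allowed."""
        if budget[0] <= 0:
            return best
        if T.is_leaf:
            budget[0] -= 1
            cand = T.points[np.argmin(np.linalg.norm(T.points - q, axis=1))]
            if best is None or np.linalg.norm(q - cand) < np.linalg.norm(q - best):
                best = cand
            return best
        first, second = (T.left, T.right) if q @ T.w + T.b < 0 else (T.right, T.left)
        best = tcnns(q, first, budget, best)
        best = tcnns(q, second, budget, best)   # "backtracking" side, no validation test
        return best

    # Usage sketch: search with a budget of 5 leaves.
    # rng = np.random.default_rng(0); S = rng.standard_normal((2000, 64))
    # tree = build_kd_tree(S); q = rng.standard_normal(64)
    # approx_nn = tcnns(q, tree, budget=[5])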
1.3 Learning from the data

Another line of research focuses on learning from the data to improve the efficiency of search trees. The trees are no longer grown on a data-independent heuristic and are shown to adapt better to the data.

Unsupervised learning. kd-trees are built on a heuristic (for example, median split or mean split), often choosing the coordinate with the largest spread and splitting the data at the median along that direction (MakeKDTreeSplit subroutine in Algorithm 3). The usual kd-trees [1, 12] are known to adapt to the ambient dimension, not to the intrinsic dimension [13]. (Adapting to the intrinsic dimension of the data implies that the data diameter of a node is halved after a number of splits linearly dependent on the intrinsic dimension of the data, not on the ambient dimension.) PCA-trees [14] split the data at each level at the median along the principal eigenvector of the covariance matrix of the data, and completely adapt to the intrinsic dimension of the data [13]. However, computing the covariance matrix and the principal eigenvector at each level of the tree is a computationally expensive building procedure. 2-means-trees split the data along the direction spanned by the centroids of the 2-means clustering solution, as per the cluster assignments. They are also known to perform well and adapt to the intrinsic dimension [13]. However, similar to PCA-trees, they are computationally expensive to build, requiring a 2-means clustering at each level of the tree. Both these trees perform considerably better than kd-trees for standard statistical tasks like vector quantization, nns and regression [13]. Random projection trees (RP-trees) are built recursively by splitting the data at the median along a random direction chosen from the surface of the unit sphere (MakeRPTreeSplit subroutine of Algorithm 3). They are shown to adapt to the intrinsic dimension of the data [15] and perform better than kd-trees for the aforementioned statistical tasks. RP-trees are not as adaptive as the PCA-trees and the 2-means-trees, but have comparable performance and are computationally much cheaper to build.

Supervised learning. Cayton et al. [16] and Maneewongvatana et al. [12] explicitly learn from the data, choosing optimal splits (with respect to minimizing the expected querying time) at each level of the kd-tree using a set of sample training queries. They define a cost function for a split in terms of the amount of computation required and a fixed nn error rate for the training queries, and subsequently pick the split with the least cost. Cayton et al. [16] also present an algorithm to learn the binning structure used in LSH. Clarkson [17] uses training queries to estimate the effective Voronoi tessellation of the data set. These papers focus on minimizing the expected query time given a fixed allowed error (the error is zero in the case of exact search). In Cayton et al. [16], the minimization of the cost function can be done by fixing the amount of computation and minimizing the nn error rate. However, the focus of these methods has been to minimize computation time for a fixed error. The “curse of dimensionality” makes the bounding of the error very expensive in high-dimensional data. Even the learned kd-tree obtains speedups of less than an order of magnitude (the highest being less than 5) over the standard median-split kd-tree, which is known to lack scalability in high dimensions (the speedups [16] are much better if the query is drawn from a distribution different from that of the data set queried, and enough training queries are provided).

In this paper, we will introduce a new kind of tree for nns based on a form of unsupervised learning.

Figure 2: Locating vs. validating on a kd-tree, for (a) Optdigits and (b) MNist (x-axis: number of leaves; y-axis: percentage of the queries). The histogram of the number of leaves visited by the query using the depth-first search algorithm (Algorithm 1 with ǫ = 0) to locate the true nearest neighbor is contrasted with the histogram of the total number of leaves eventually visited by the depth-first strategy using a kd-tree. These examples imply that most queries require examining a relatively small subset of the data set (that is, visiting a small number of leaves) before they find their true nn. However, the majority of the queries end up scanning most of the data to validate that their current neighbor candidate is in fact the true nn.

1.4 This paper

We propose a new way of utilizing tree-based algorithms for nns on a time budget in high dimensions. The rest of the paper is organized as follows:

• We define the problem of time-constrained nns and empirically motivate a tree-based meta-algorithm for time-constrained nns in Section 2.
• We motivate and present a new data structure for accurate time-constrained nns – the max-margin tree (MM-tree) – in Section 3, and demonstrate the superior accuracy of MM-trees compared to other trees in time-constrained nns on high dimensional data sets in Section 4.
• We conclude with some discussion and possible future directions for this work in Section 5.

2 Time-constrained NNS: Problem and Meta-Algorithm

The problem. The most common theme in nn research is to fix the approximation and try to develop algorithms that obtain results most efficiently. Exact search is a special case of this setting with zero error. Effectively, the nn methods are required to perform the following task:

    Error constrained search: minimize running time subject to the restriction that we guarantee a certain amount of error.

This formulation of nns generally requires the algorithm to ensure that the allowed approximation is always valid. Having located the appropriate result for the query, the algorithm still has to verify that the result is actually within the allowed approximation. This algorithmic behavior is not suitable for many applications of nns (mentioned in Section 1) where the response times are limited: (i) in this setting of limited time, it is hard to guess the appropriate amount of approximation for which the error-constrained algorithm would return an answer within the time limit; (ii) validation of the error bound can be considered a waste of limited resources in the limited-time setting. These applications do not necessarily require the best search result right away; rather, the focus can be on obtaining the best possible result (lowest nn error) within the limited time. The true nn would always be preferred but is not necessarily required. Given enough time, the algorithm should always return the true nn. To summarize, the nn methods would be required to perform the following task:

    Time constrained search: minimize nn error subject to a limited number of computations (amount of time).

This formulation of nns focuses on minimizing the nn error given the time, without the requirement of validating the error. One way of thinking about it is to effectively minimize the area under the curve of the error obtained with the returned nn candidate (without validation) with respect to the search time available (for example, Figure 4 presents such curves).
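In symbols (our notation, not used verbatim in the paper): writing $T(\mathcal{A}, q)$ for the time spent by a search procedure $\mathcal{A}$ on query $q$ and $\tau(\mathcal{A}, q)$ for the rank error of the candidate it returns, the two tasks can roughly be contrasted as

\[
\text{error constrained: } \min_{\mathcal{A}} \; T(\mathcal{A}, q) \ \ \text{s.t.}\ \ \tau(\mathcal{A}, q) \le \epsilon,
\qquad
\text{time constrained: } \min_{\mathcal{A}} \; \tau(\mathcal{A}, q) \ \ \text{s.t.}\ \ T(\mathcal{A}, q) \le t.
\]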
The algorithm. A possible algorithm for time-constrained search is a natural modification of the tree-based algorithm for error-constrained search in Algorithm 1. The algorithm performs a similar depth-first search, but performs no validation of the nn error and returns the current best neighbor candidate as the answer to the query when the time runs out (Algorithm 2). Even though tree-based algorithms appear not to scale well to high dimensions, depth-first search is still conjectured to be a useful strategy [17] – depth-first search on different data structures generally locates the correct nn quite early on as a neighbor candidate, but spends the subsequent computations searching other parts of the tree without any improvement in the result.

We perform an experiment to verify this fact empirically. We consider two OCR image data sets – Optdigits (64 dimensional) [18] and MNist (784 dimensional) [19]. We split the data sets into a reference set and a query set (a 75-25% random split for Optdigits and a 90-10% random split for MNist). We perform the depth-first nns for the exact nn of each query against the reference set (Algorithm 1 with ǫ = 0), using kd-trees (with a leaf size of 30) to index the reference sets. We present the histogram of the minimum number of leaves visited to locate the true nearest neighbor, and of the total number of leaves visited to find as well as validate its correctness, in Figure 2. The maximum number of leaves visited by any query was 76 for Optdigits and around 3500 for MNist. In both cases, around 60% of the queries located their true nearest neighbors in the first 5% of the leaves visited, but less than 5% of the queries finished validating within that time. This empirically exhibits the capability of depth-first nns on tree data structures to locate the correct answer early on, but its failure to perform efficient validation.

Note that we have chosen the time constraint in the form of a limit on the number of leaves to be visited. This is done for the sake of convenience, since the leaves of a tree can be thought of as a unit of computation in a tree. Here, we assume that there is enough time to perform the O(d log n) computations needed to reach a leaf at least once. However, Algorithm 2 can be generalized to the setting where the time is explicitly constrained (instead of being stated in units of computation).
3 Robust data structure: Max-margin Trees

Algorithm 2 can be used with any binary tree, but the quality of the results will depend on how well the tree has been able to index the data set. More specifically, consider the distribution of the number of leaves of the kd-tree required to locate the true neighbor in Figure 2. A tree that indexes the data appropriately will have a highly left-skewed distribution, which implies highly accurate results for small time budgets with Algorithm 2. A weakly data-dependent tree building heuristic (like the median split in kd-trees) would not necessarily perform well. We wish to utilize trees which learn from the data.

To guarantee low error with Algorithm 2, we need to ensure that the first few leaves visited by the depth-first strategy contain the true nn of the query q, implying that q falls on the side of the hyperplane at each level of the tree that contains the points closer to q. Since we are considering binary trees, the hyperplane at each node of the tree needs to effectively perform a simple binary classification task for q – learning a hyperplane that places q on the side containing its close neighborhood, with respect to the resolution of the data in the node, and does not split that close neighborhood. For this task, we explore a widely used technique in supervised learning – large margin discriminants.

Figure 3: Visual representations of (a) a kd-tree and (b) a maximum margin tree (MM-tree): the weakly data-dependent heuristic of a kd-tree can provide less compact partitions. A more data-dependent heuristic like maximum margin splits can be more useful for time-constrained nns.

Large margin methods. Maximum margins have been exploited in many areas of machine learning because of their superior theoretical and empirical performance in classification [20]. This can be attributed to the robustness of the discriminant functions estimated from the maximum margin formulation to labelling noise – minor perturbations of a point do not change its class. We seek a similar form of robustness – minor perturbations of a point do not change the side of the hyperplane it lies on, hence ensuring that close neighborhoods are not split by the hyperplane. (This also implies that if q is part of the neighborhood, q is placed on the appropriate side of the hyperplane.) A max-margin hyperplane is robust to small perturbations in the data distribution which tend to adversely affect the splitting hyperplane in traditional tree structures. This robustness of hierarchical large margin splits in a tree data structure would intuitively boost the accuracy of the time-constrained nn algorithm (Algorithm 2). The desired tree building procedure using hierarchical large margin splits is presented in Algorithm 4.

The idea of maximum margin splits in a tree data structure has been proposed earlier [21], using a set of training queries. However, that split is just a standard coordinate-axis-parallel kd-tree split rotated to get a maximum margin split on the partition induced by the kd-tree split. Such splits do not necessarily find the large margins present in the data, but rather just improve the kd-tree split – the kd-tree split might have already split a close neighborhood, and the rotation of the split might not help much. We wish to utilize the richer set of max-margin partitions by obtaining a completely unsupervised split.

Recently, maximum margin formulations have been used for the unsupervised task of maximum margin clustering (mmc) [22, 23, 24, 25, 26, 27]. The task is to find the optimal hyperplane (w∗, b∗) as well as the optimal labelling vector y∗ from the following optimization problem, given a set of points {x_1, x_2, . . . , x_n}:

(3.4)    \min_{y \in \{-1,+1\}^n} \; \min_{w, b, \xi_i} \; \frac{1}{2}\langle w, w \rangle + C \sum_{i=1}^{n} \xi_i
(3.5)    \text{s.t. } \; y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i, \quad \forall i = 1, \dots, n
(3.6)    \xi_i \ge 0, \quad \forall i = 1, \dots, n
(3.7)    -l \le \sum_{i=1}^{n} y_i \le l.

This formulation finds a maximum margin split in the data to obtain two clusters. The class balance constraint (Eq. 3.7) avoids trivial solutions, and the constant l ≥ 0 controls the class imbalance. The margin constraints (Eq. 3.5) enforce a robust separation of the data. This notion of robustness provides promising results in the case of clustering. We apply this method of binary splitting of the data recursively to build a max-margin tree (MM-tree). For the details of the mmc formulation see Xu et al. [22].

Algorithm 3 Different kinds of split in the data.

    MakeKDTreeSplit(Data S)
      d_s ← arg max_{i ∈ {1:d}} range(S(i))
      w ← e_{d_s}   // the d-dimensional vector with all entries zero except the d_s-th entry, which is one
      b ← −median(S(d_s))
      return (w, b)

    MakeRPTreeSplit(Data S)
      Pick a random unit d-dimensional vector v
      w ← v
      b ← −median({⟨x, v⟩ : x ∈ S})
      return (w, b)

    MakeMaxMargSplit(Data S)
      Formulate S as a mmc problem P
      (w, b) ← CuttingPlaneMethod(P)
      return (w, b)

Algorithm 4 Max-margin tree building

    BuildMaxMargTree(Data S)
      if size(S) ≤ N_min then
        T.points ← S
        T.isLeaf ← True
      else
        (w, b) ← MakeMaxMargSplit(S)
        if trivial_split(w, b, S) then
          (w, b) ← MakeKDTreeSplit(S)
        end if
        S_l ← {x ∈ S : ⟨w, x⟩ + b < 0}
        S_r ← {x ∈ S : ⟨w, x⟩ + b ≥ 0}
        T.w ← w, T.b ← b
        T.left ← BuildMaxMargTree(S_l)
        T.right ← BuildMaxMargTree(S_r)
      end if
      return T

Remark. Note that the presence of different configurations in the data, like clusters (regions with a large number of points but no large margins) or uniform spread (regions where every split has a large margin), implies that it might not always be possible (or useful) to obtain a large margin split in the data at the desired resolution. In such a situation, we employ the standard efficient kd-tree split. In the implementation, whenever the cutting plane method for the mmc problem produces a trivial split (where all the points end up being assigned to one side of the split), we conduct a kd-tree median split (Algorithm 4).
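As an illustrative sketch (ours, not the authors' implementation), the recursion in Algorithm 4 can be written as follows. The function make_max_margin_split is a hypothetical hook that would wrap an mmc solver such as CPMMC [25]; here it is stubbed with a 2-means-style hyperplane (the bisector of the two centroids, closer in spirit to a 2-means-tree split than to a true max-margin split) purely so the sketch runs.

    import numpy as np

    LEAF_SIZE = 30   # plays the role of N_min in Algorithm 4 (assumed value)

    class Node:      # same shape as the Node class in the earlier search sketch
        def __init__(self, points=None, w=None, b=None, left=None, right=None):
            self.points, self.w, self.b, self.left, self.right = points, w, b, left, right
            self.is_leaf = points is not None

    def make_kd_split(S):
        """MakeKDTreeSplit: median split along the coordinate with the largest spread."""
        d_s = np.argmax(S.max(axis=0) - S.min(axis=0))
        w = np.zeros(S.shape[1]); w[d_s] = 1.0
        return w, -float(np.median(S[:, d_s]))

    def make_max_margin_split(S, iters=10, seed=0):
        """Stand-in for MakeMaxMargSplit: a 2-means-style hyperplane as a placeholder."""
        rng = np.random.default_rng(seed)
        c = S[rng.choice(len(S), size=2, replace=False)].astype(float)
        for _ in range(iters):
            assign = np.linalg.norm(S[:, None, :] - c[None, :, :], axis=2).argmin(axis=1)
            for k in (0, 1):
                if (assign == k).any():
                    c[k] = S[assign == k].mean(axis=0)
        w = c[1] - c[0]
        nrm = np.linalg.norm(w)
        if nrm == 0:
            return w, 0.0                       # degenerate: caller falls back to the kd split
        w = w / nrm
        return w, -float(w @ (c[0] + c[1]) / 2.0)

    def build_max_marg_tree(S):
        """Recursive structure of Algorithm 4, including the fallback to a kd split
        whenever the (placeholder) max-margin split is trivial."""
        S = np.asarray(S, dtype=float)
        if len(S) <= LEAF_SIZE:
            return Node(points=S)
        w, b = make_max_margin_split(S)
        mask = S @ w + b < 0
        if mask.all() or not mask.any():        # trivial split: all points on one side
            w, b = make_kd_split(S)
            mask = S @ w + b < 0
            if mask.all() or not mask.any():    # duplicate points: stop splitting
                return Node(points=S)
        return Node(w=w, b=b,
                    left=build_max_marg_tree(S[mask]),
                    right=build_max_marg_tree(S[~mask]))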
3.1 Max-margin tree: Properties

Tree-depth. For the recursive construction of the MM-tree, we are required to choose an appropriate value for the parameter l in the balance constraint. Using an adaptive value for l, we provide a depth bound for the MM-tree.

Theorem 3.1. For a given dataset X of size n and an adaptive balance constraint l = ǫn (for a small fixed constant ǫ > 0) in the mmc formulation (line 1 of the MakeMaxMargSplit subroutine of Algorithm 3), the MM-tree constructed by Algorithm 4 has a depth of O(log n).

Proof. Let D(n) be the depth of the tree for n points. If l = ǫn, we get the following recursion:

    D(n) = D\!\left(\frac{(1 + \epsilon)n}{2}\right) + 1.

This gives us a depth of \log_{2/(1+\epsilon)} n, which is O(log n). Comparing it to the \log_2 n depth of the kd-trees and RP-trees, the ratio of the depth of the median-split tree to the depth of the MM-tree is given by

    \frac{\log_2 n}{\log_{2/(1+\epsilon)} n} = 1 - \log_2(1 + \epsilon),

which is close to 1 for small ǫ. In our experiments, we use ǫ = 0.001.
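As a brief expansion of the recursion in the proof (our unrolling, using the leaf size $N_{\min}$ from Algorithm 4): the balance constraint $|\sum_i y_i| \le l = \epsilon n$ means each child of a node with $n$ points receives at most $\frac{(1+\epsilon)n}{2}$ of them, so

\[
D(n) \;\le\; 1 + D\!\left(\frac{(1+\epsilon)n}{2}\right) \;\le\; k + D\!\left(\Big(\frac{1+\epsilon}{2}\Big)^{k} n\right),
\]

and the recursion terminates once $\big(\frac{1+\epsilon}{2}\big)^{k} n \le N_{\min}$, that is, after

\[
k \;\ge\; \log_{2/(1+\epsilon)}\!\big(n / N_{\min}\big) \;=\; O(\log n)
\]

levels.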
For a given number of leaves, we compare the nn rank error of the results obtained by each of the data structures. We compare the best performing RP-tree (c = 3.0 for Optdigits, 3.5 for MNist, 2.5 for Images) and MMtree (C = 0.01 for Optdigits, 0.00001 for MNist, 0.05 for Images) to kd-trees. The results are presented in Figure 4. The first standard deviation below the mean Solving the recursion gives the total time taken by is not shown in the plot because for all the cases, it Algorithm 4 as O(sn log n). was almost always zero. The results for Optdigits indicate that MM-trees achieve much lower error than The standard kd-tree splits require O(n) time for kd-trees and RP-trees given the same number of leaves a set of n points. The RP-tree splits require O(n) visited. In fact, the behavior of the means of kd-trees to split the data, but actually require O(n2 ) time in and RP-trees are very comparable to the behavior of compute the maximum and average diameters of the the first standard deviation of the MM-trees. The redata (Algorithm 3 in Freund, et.al. [29]). The total sults for MNist make the superiority of MM-trees more construction time for these splits would depend on the apparent by showing at least an order of magnitude less depth of the tree formed. For kd-trees and RP-trees, the nn rank error for MM-trees when compared to kd-trees median split ensures a depth of O(log n), resulting in and RP-trees with a time budget of less than 50 leaves. a total of O(n log n) and O(n2 log n) construction time Moreover, the first standard deviation from the mean respectively. of the nn error for MM-trees is as good as the mean of the nn error of the next best competitor. The results 4 Experiments for Images show that the mean nn error of kd-trees and We compare the performance of our proposed MM- MM-trees are close, but the MM-trees are more statree with kd-trees and RP-trees. We consider the ble than the kd-trees with the first standard deviation nn rank error as the quality of the query results since close to the mean and a lower maximum nn error. In the rank error is a more robust notion of nn error fact, in all the data sets, the first standard deviation [11]. For implementing the RP-tree, we use Algorithm above the mean and the maximum rank error demon3 in Fruend, et.al. [29]. We cross-validate for the best strate the significantly more stable performance of the value of c for the RP-tree and for the best value of MM-trees. These results justify the choice of the large the regularization parameter C for the MM-tree. In margin splits in the tree for reducing the nn error of the the next subsection, we present results for the time- time-constrained nn search algorithm (Algorithm 2). constrained setting, comparing the error obtained given fixed time. Though achieving efficiency in the time- 4.2 Error constrained search constrained setting is the goal of this work, the last In this experiment, we compare the time required by subsection compares the performance of the MM-tree to each of these data structures (with different parameter the kd-tree and RP-tree in the error-constrained setting settings) to obtain a bounded rank error using the rankapproximate nearest neighbor algorithm (the single for the sake of curiosity. tree algorithm in Figure 2 in Ram, et.al., 2009 [11]). 
4.2 Error constrained search

In this experiment, we compare the time required by each of these data structures (with different parameter settings) to obtain a bounded rank error using the rank-approximate nearest-neighbor algorithm (the single tree algorithm in Figure 2 of Ram et al., 2009 [11]). Similar to the time-constrained setting, we compare the performance in the following ways:

• the mean nn rank error obtained for a given error bound vs. the search time (which includes location and validation time) – provides a view of the overall performance of the data structure in the error-constrained setting;
• the maximum nn rank error obtained for a given error bound vs. the search time – provides a view of the stability of the performance, indicating the worst possible performance in the error-constrained setting.

Figure 5: Constrained rank error vs. time: results for error-constrained search on the Optdigits and MNist data sets with the different trees (panels show mean and maximum rank error for each data set; x-axis: average number of distance computations; y-axis: rank error in %-tile). The results indicate that the time required to find a neighbor candidate within a required level of approximation and validate its correctness is almost the same for all three data structures. This indicates that the MM-tree is as capable as the state-of-the-art tree-based error-constrained methods. However, the results from Figure 4 indicate that the times required to locate the actual correct candidates are very different.
We continue using the following two data sets: Optdigits and MNist. We compare the best performing RP-tree (c = 3.5 for Optdigits and MNist) and MM-tree (C = 0.01 for Optdigits and C = 0.00001 for MNist) to kd-trees. The results are presented in Figure 5. Firstly, the results indicate that the MM-trees perform comparably to the well-established kd-trees as well as the recent RP-trees. Secondly, they indicate that when validation is required, all the data structures perform comparably, without a clear winner. This observation supports our guess that no data structure can exhibit improved scalability when validation is enforced in high dimensions. We conjecture that learning is unable to obtain significant gains since the validation process still dominates the majority of the computation.

5 Discussion and Future Directions

In this paper, we extend the problem of approximate nn search to the limited-time setting and motivate its importance and applicability to real-world problems. Furthermore, we propose an algorithm and a data structure to solve this problem with high accuracy, and empirically demonstrate the superior performance of the new data structure. This data structure utilizes the robustness of the well-known large margin discriminants for the task of nn search, and has immediate applications in many real-world settings. We also provide theoretical analyses of the approach for additional insight into its performance properties.

Future work will include a theoretical analysis of this algorithm and the data structures in the form of time-dependent error rates. An interesting extension of this framework is to consider a time constraint for a set of queries rather than a single query. This presents interesting problems of adaptive time allocation (allocating less time for easy queries and more for harder ones) and estimation of the hardness of a query on the fly. The MM-trees can also be extended beyond metric spaces by using the kernel trick, and MM-trees can be useful in Hilbert spaces [25]. This would increase the scope of this algorithm to a wider spectrum of applications.

References

[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An Algorithm for Finding Best Matches in Logarithmic Expected Time. ACM Trans. Math. Softw., 1977.
[2] S. M. Omohundro. Five Balltree Construction Algorithms. Technical Report TR-89-063, International Computer Science Institute, December 1989.
[3] F. P. Preparata and M. I. Shamos. Computational Geometry: An Introduction. Springer, 1985.
[4] J. M. Hammersley. The Distribution of Distance in a Hypersphere. Annals of Mathematical Statistics, 21, 1950.
[5] T. Liu, A. W. Moore, A. G. Gray, and K. Yang. An Investigation of Practical Approximate Nearest Neighbor Algorithms. 2005.
[6] L. Cayton. Fast Nearest Neighbor Retrieval for Bregman Divergences. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[7] T. Liu, A. W. Moore, and A. G. Gray. Efficient Exact k-NN and Nonparametric Classification in High Dimensions. 2004.
[8] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz Mappings into a Hilbert Space. Contemporary Mathematics, 26:189–206, 1984.
[9] A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. Pages 518–529, 1999.
[10] P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. In STOC, pages 604–613, 1998.
[11] P. Ram, D. Lee, H. Ouyang, and A. Gray. Rank-Approximate Nearest Neighbor Search: Retaining Meaning and Speed in High Dimensions. In Advances in Neural Information Processing Systems 22, 2009.
[12] S. Maneewongvatana and D. M. Mount. Analysis of Approximate Nearest Neighbor Searching with Clustered Point Sets. In Data Structures, Near Neighbor Searches, and Methodology: Fifth and Sixth DIMACS Implementation Challenges, 2002.
[13] N. Verma, S. Kpotufe, and S. Dasgupta. Which Spatial Partition Trees are Adaptive to Intrinsic Dimension? In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009.
[14] J. McNames. A Fast Nearest-Neighbor Algorithm Based on a Principal Axis Search Tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 964–976, 2001.
[15] S. Dasgupta and Y. Freund. Random Projection Trees and Low Dimensional Manifolds. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, 2008.
[16] L. Cayton and S. Dasgupta. A Learning Framework for Nearest Neighbor Search. Advances in Neural Information Processing Systems, 20, 2007.
[17] K. L. Clarkson. Nearest Neighbor Queries in Metric Spaces. Discrete and Computational Geometry, 1999.
[18] C. L. Blake and C. J. Merz. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/, 1998.
[19] Y. LeCun. MNist Dataset, 2000. http://yann.lecun.com/exdb/mnist/.
[20] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of Classification: A Survey of Some Recent Advances. ESAIM Probability and Statistics, 9:323–375, 2005.
[21] S. Maneewongvatana and D. Mount. The Analysis of a Probabilistic Approach to Nearest Neighbor Searching. Algorithms and Data Structures, 2001.
[22] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans. Maximum Margin Clustering. Advances in Neural Information Processing Systems, 17, 2005.
[23] H. Valizadegan and R. Jin. Generalized Maximum Margin Clustering and Unsupervised Kernel Learning. Advances in Neural Information Processing Systems, 19, 2007.
[24] K. Zhang, I. W. Tsang, and J. T. Kwok. Maximum Margin Clustering Made Practical. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[25] B. Zhao, F. Wang, and C. Zhang. Efficient Maximum Margin Clustering via Cutting Plane Algorithm. In The 8th SIAM International Conference on Data Mining, 2008.
[26] Y. Hu, J. Wang, N. Yu, and X. S. Hua. Maximum Margin Clustering with Pairwise Constraints. In Proceedings of the Eighth IEEE International Conference on Data Mining (ICDM), 2008.
[27] Y. F. Li, I. W. Tsang, J. T. Kwok, and Z. H. Zhou. Tighter and Convex Maximum Margin Clustering. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater Beach, FL, 2009.
[28] F. Wang, B. Zhao, and C. Zhang. Linear Time Maximum Margin Clustering. IEEE Transactions on Neural Networks, 21(2):319–332, 2010.
[29] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma. Learning the Structure of Manifolds using Random Projections. Advances in Neural Information Processing Systems, 2007.
[30] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 2000.