Nearest-Neighbor Search on a Time Budget via Max-Margin Trees

Parikshit Ram∗    Dongryeol Lee    Alexander G. Gray

∗ School of Computational Science and Engineering, Georgia Institute of Technology

Abstract
Many high-profile applications pose high-dimensional nearest-neighbor search problems. Yet it remains difficult to achieve fast query times with state-of-the-art approaches, which use multidimensional trees for either exact or approximate search, possibly in combination with hashing approaches. Moreover, a number of these applications have only a limited amount of time to answer nearest-neighbor queries. However, we observe empirically that the correct neighbor is often found early within the tree-search process, while the bulk of the time is spent on verifying its correctness. Motivated by this, we propose an algorithm for finding the best neighbor given any particular time limit, and develop a new data structure, the max-margin tree, to achieve accurate results even with small time budgets. Max-margin trees perform better in the limited-time setting than commonly used data structures such as the kd-tree and more recently developed data structures like the RP-tree.

1 Introduction
In this paper, we address the well-studied problem of nearest-neighbor search (nns) in large datasets of potentially high dimensionality. The task of nns is defined as follows: for a query point q, return its closest point, with respect to a specified distance measure D, from a given collection S of n points/objects with d features/dimensions. Formally, we need to find a point p ∈ S such that

(1.1)    D(q, p) = \min_{r \in S} D(q, r).

nns has extensive applications across many fields of computer science such as databases, computer vision, information retrieval and machine learning. Moreover, nn methods are heavily used in different sciences such as astronomy, molecular physics and bioinformatics. Everyday applications of nn search include image queries and similar-image results in online information retrieval (Web/Image search), recommendations in music (Pandora, last.fm, GrooveShark) and movie libraries (Netflix, Blockbuster), and potential matches based on personality features on online dating sites (OkCupid.com, Match.com).

Many of these applications have large collections of objects – both n and d are large. For example, online repositories contain petabytes of data with easily millions of objects, and objects like images, online text articles and songs usually have tens of thousands of dimensions. This makes the problem hard, yet many of these applications have limited response times. However, these applications can afford a slight loss in the accuracy of the result. Therefore, many algorithms employ approximation techniques, trading accuracy for scalability.

Examples of such time-limited applications are common. Online search queries are nowadays expected to complete within a few milliseconds, even though the results can be slightly inaccurate. New users at an online music library want to listen to a playlist of songs as soon as they sign up and input their preferences, but they do not require the perfect songs to be played right away; the search process needs to populate the first few songs of the playlist almost immediately, and can subsequently find more relevant music as the application gets more time. Users wish to explore potential matches on online dating sites soon after they create a profile with their character traits and preferences – they want potential matches as soon as possible, but are ready to wait for their "perfect match". In all these applications, almost immediate results are required; at the same time, users prefer the service provider with the better results (better accuracy/relevance for a small wait/search time).

Approximation in nns has been studied extensively, but none of the existing algorithms explicitly upper bounds the response time – to our knowledge, the problem of returning the best approximate answer within a given time has not been studied. We propose the problem of nns on a time budget, a meta-algorithm for it, and a new data structure for solving the problem accurately. We now review existing ideas in nns as background to our approach.
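As a concrete baseline for definition (1.1), the following minimal Python sketch computes the exact nearest neighbor by the O(n) linear scan discussed in Section 1.1 below; the function and variable names are ours, for illustration only, and are not from the paper.

    import numpy as np

    def nearest_neighbor_linear(q, S):
        """Exact nns by linear scan: (index, distance) of argmin_{r in S} D(q, r)
        for the Euclidean distance D. S is an (n, d) array, q a length-d vector."""
        dists = np.linalg.norm(S - q, axis=1)   # D(q, r) for every r in S
        idx = int(np.argmin(dists))
        return idx, float(dists[idx])

    # Example: 1000 reference points in 64 dimensions, one query.
    rng = np.random.default_rng(0)
    S = rng.standard_normal((1000, 64))
    q = rng.standard_normal(64)
    print(nearest_neighbor_linear(q, S))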
1.1 Approaches to Nearest-neighbor Search
The simplest approach of linear search over S to find the nn is easy to implement, but requires O(n) computations for a single nn query, making it intractable for moderately large n. The most generally efficient and common approach is to use multidimensional indexing data structures. Binary spatial partitioning trees, like kd-trees [1] (the MakeKDTreeSplit subroutine in Algorithm 3), ball trees [2] and metric trees [3], index the data for efficient querying. Tree-based depth-first search algorithms (Algorithm 1) greedily traverse down the tree and subsequently attempt to prune away parts of the data set from the computation, using the current neighbor candidate, while backtracking along the traversal path in the tree.

High-dimensional data pose an inherent problem for Euclidean nns because of the concentration of the pairwise Euclidean distances [4]. This implies that in high dimensions, the Euclidean distances between uniformly distributed points lie in a small range of continuous values. The tree-based algorithms perform no better than linear search because of their inability to employ tight upper bounds on the running nn distance in high dimensions [5, 6, 7]. This prompted interest in approximations of the nn problem to obtain scalability.

1.2 Approximations
What approximation principle (notion of error) should be used? Distance-approximate nearest-neighbor search (dann) approximates the query distance to the nn; any neighbor within that query distance is considered to be "good enough". Applications where the values of the distances to the neighbor are essential (with no regard to the ordering of the neighbors) benefit from this approximation. The nn distance error is defined as the relative difference between the distances of the query to the returned neighbor and to the true nn. For a given query q and a returned neighbor candidate p ∈ S, the nn distance error ǫ is defined as

(1.2)    ǫ = \frac{D(q, p) - \min_{r \in S} D(q, r)}{\min_{r \in S} D(q, r)}.

Numerous techniques exist for dann and are fairly scalable to higher dimensions under certain assumptions. They can be broadly divided into two groups: methods utilizing and modifying tree-based search, and methods utilizing random projections [8].

Certain tree-based algorithms prune more aggressively on the standard tree data structures, utilizing the allowed approximation to obtain some speedup. Another approach limits the depth-first algorithm to the first leaf and modifies the tree data structure to bound the nn error (for example, spill trees [5]). These obtain significant speedups over the exact methods but still do not scale to high dimensions¹. Some other dann methods lower the dimensionality of the data with random projections [8] to obtain better scalability. Locality-sensitive hashing (LSH) [9, 10] hashes the data into lower-dimensional buckets using hash functions which guarantee that "close" points are hashed into the same bucket with high probability and "farther apart" points are hashed into the same bucket with low probability. This method gives significant improvements in running times over traditional methods on high-dimensional data and is shown to be highly scalable. Numerous other methods using random projections exist in the literature for dann. Hybrid spill trees [5] combine the two ideas of tree-based search and random projections by building spill trees on the randomly projected data. They have been shown to have improved scalability.

Note that there is a significant difference in the way LSH and the tree-based algorithms deal with approximation. Tree-based algorithms generally grow the tree on a data-independent (or weakly data-dependent) heuristic. The search strategy ensures the upper bound on the approximation by visiting all required parts of the tree (backtracking in Algorithm 1). LSH buckets the data at multiple scales, and the search algorithm simply looks at the appropriate buckets, searching only within those buckets without any explicit control over the error. This makes LSH often highly efficient, albeit, at times, at the cost of having inaccurate as well as no results. Using a tree allows the approximation decisions to be made during the search process for each query, yielding the possibility of more stable and accurate results (obviously, at the cost of some efficiency).²

Another natural notion of nn error is the percentile of the data set closer to the query than the returned solution – the nn error is the relative rank of the returned solution. For a query q and a returned neighbor candidate p ∈ S, the nn rank error τ is defined as

(1.3)    τ = \frac{|\{r \in S : D(q, r) < D(q, p)\}|}{|S|},

where |A| denotes the cardinality of a set A.

¹ However, in the same paper, a variant of the spill tree is presented which does scale to high dimensions. We will subsequently discuss that result.
² Another significant difference is that for the tree-based methods, the preprocessing is done only once. The search process is adapted for the different desired levels of approximation (that is, the allowed error ǫ as defined in Equation 1.2). The preprocessing of the data for LSH is done for a particular level of approximation (that is, for a fixed value of ǫ). To obtain another level of approximation, the preprocessing step has to be done again. In practice, this caveat is dealt with by preprocessing and storing the hashed data from LSH for multiple levels of approximation (that is, for multiple values of ǫ).
Algorithm 1 Error constrained nns
ECNNS(Query q, Tree T, Error ǫ)
  if isLeaf(T) then
    q.nn ← arg min_{r ∈ T ∪ {q.nn}} d(q, r)
  else
    if ⟨T.w, q⟩ + T.b < 0 then
      ECNNS(q, T.left, ǫ)
      /* backtrack */
      if cannot prune(q, T.w, T.b, ǫ) then
        ECNNS(q, T.right, ǫ)
      end if
    else
      ECNNS(q, T.right, ǫ)
      /* backtrack */
      if cannot prune(q, T.w, T.b, ǫ) then
        ECNNS(q, T.left, ǫ)
      end if
    end if
  end if
  return q.nn

Algorithm 2 Time constrained nns
TCNNS(Query q, Tree T, Time t)
  if t ≥ 0 then
    if isLeaf(T) then
      q.nn ← arg min_{r ∈ T ∪ {q.nn}} d(q, r)
      t ← t − 1
    else
      if ⟨T.w, q⟩ + T.b < 0 then
        TCNNS(q, T.left, t)
        TCNNS(q, T.right, t)
      else
        TCNNS(q, T.right, t)
        TCNNS(q, T.left, t)
      end if
    end if
  end if
  return q.nn
Figure 1: Recursive depth-first nns algorithms. The splitting hyperplane at each level of the binary tree T
is denoted by (T.w, T.b). The field q.nn stores the current best neighbor candidate for the query q. The function
cannot prune(q, T.w, T.b, ǫ) checks whether this node can be removed from the computation (i.e. can be pruned)
given the current neighbor candidate and the allowed approximation ǫ, otherwise the depth-first algorithm needs
to traverse that part of the tree (i.e. backtrack along the depth-first path in the tree). The time constraint t is
in the form of allowed number of leaves to be visited (see text for details).
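To make the two search procedures concrete, the following Python sketch mirrors the time-constrained variant (Algorithm 2) on a generic binary space-partitioning tree. It is our own illustrative code, not from the paper: the Node class, the function name tcnns and the leaf-budget bookkeeping are assumptions, and the error-constrained variant (Algorithm 1) would additionally carry the cannot prune test on the backtracking branch.

    import numpy as np

    class Node:
        """Binary space-partitioning tree node: either a leaf holding points, or an
        internal node with splitting hyperplane <w, x> + b = 0 and two children."""
        def __init__(self, points=None, w=None, b=None, left=None, right=None):
            self.points = points          # (m, d) array at a leaf, else None
            self.w, self.b = w, b         # hyperplane of an internal node
            self.left, self.right = left, right

        def is_leaf(self):
            return self.points is not None

    def tcnns(q, node, budget, best=None):
        """Time-constrained search in the spirit of Algorithm 2: depth-first descent
        with no pruning test and no validation, stopping once `budget` leaves have
        been visited. Returns (best_point, best_distance, remaining_budget)."""
        if best is None:
            best = (None, np.inf)
        if budget <= 0 or node is None:
            return best[0], best[1], budget
        if node.is_leaf():
            dists = np.linalg.norm(node.points - q, axis=1)
            i = int(np.argmin(dists))
            if dists[i] < best[1]:
                best = (node.points[i], float(dists[i]))
            return best[0], best[1], budget - 1
        # Visit the child on q's side of the hyperplane first, then the other side.
        near, far = ((node.left, node.right) if np.dot(node.w, q) + node.b < 0
                     else (node.right, node.left))
        p, d, budget = tcnns(q, near, budget, best)
        p, d, budget = tcnns(q, far, budget, (p, d))
        return p, d, budget

Under these assumptions, tcnns(q, tree, budget=k) visits at most k leaves and returns whatever candidate has been found by then.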
Rank-approximate nearest-neighbor search (rann) [11] introduces the notion of nn rank error and relaxes the problem of nns to returning a candidate neighbor whose distance to the query is ranked in a low percentile of the data. This form of approximation is more sensible in a number of applications where the ordering of the neighbors is more crucial than the absolute values of the query distances.

NN error in this paper. Our focus is on high-dimensional search, where the nn rank error seems more appropriate for the following reasons: (i) In high-dimensional data the pairwise distances tend to be concentrated in a small interval, and dann can trivialize the nns since all the points might satisfy the distance approximation. (ii) In high-dimensional data sets like images, the actual (Euclidean) distances between the points (images) do not have any inherent meaning, and the distances are generally designed to correspond to certain ordering constraints. Hence, we consider the nn rank error as the notion of nn error in this paper.
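For concreteness, both error notions can be computed directly from their definitions; the following short Python helpers (ours, with illustrative names) implement Equations (1.2) and (1.3) by a brute-force scan over S.

    import numpy as np

    def nn_distance_error(q, p, S):
        """Distance error of Equation (1.2): relative excess of D(q, p) over the
        true nn distance (0 when p is the exact nearest neighbor)."""
        dists = np.linalg.norm(S - q, axis=1)
        d_true = float(dists.min())
        return (float(np.linalg.norm(q - p)) - d_true) / d_true

    def nn_rank_error(q, p, S):
        """Rank error of Equation (1.3): fraction of reference points strictly
        closer to q than the returned candidate p."""
        dists = np.linalg.norm(S - q, axis=1)
        return float(np.count_nonzero(dists < np.linalg.norm(q - p))) / len(S)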
1.3 Learning from the data
Another line of research focuses on learning from the
data to improve the efficiency in search trees. The trees
are no longer grown on a data-independent heuristic and
are shown to adapt better to the data.
Unsupervised learning. kd-trees are built on a
heuristic (for example, median split or mean split), often choosing the coordinate with the largest spread, and
splitting the data at the median along that direction
(MakeKDTreeSplit subroutine in Algorithm 3). The
usual kd-trees [1, 12] are known to adapt to the ambient dimension, not to the intrinsic dimension³ [13].
PCA-trees [14] split the data in each level at the median along the principal eigenvector of the covariance
matrix of the data and completely adapt to the intrinsic dimension of the data [13]. However, computing the
covariance matrix and the principal eigenvector at each
level of the tree is a computationally expensive building procedure. 2-means-trees split the data along the
direction spanned by the centroids of the 2-means clustering solution as per the cluster assignments. They
are also known to perform well and adapt to the intrinsic dimension [13]. However, similar to PCA-trees,
³ Adapting to the intrinsic dimension of the data implies that the data diameter of a node is halved after a number of splits linearly dependent on the intrinsic dimension of the data, not on the ambient dimension.
[Figure 2 plots, "Locating vs. Validating on a KD-tree": histograms of the percentage of queries against the number of leaves visited, separately for locating and for validating, on (a) Optdigits and (b) MNist.]
Figure 2: The histogram of the number of leaves visited by a query under the depth-first search algorithm (Algorithm 1 with ǫ = 0) to locate its true nearest neighbor, contrasted with the histogram of the total number of leaves eventually visited by the depth-first strategy, using kd-trees. These examples imply that most queries require examining only a relatively small subset of the data set (that is, they visit a small number of leaves) before they find their true nn. However, a majority of the queries end up scanning most of the data to validate that their current neighbor candidate is in fact the true nn.
they are computationally expensive to build, requiring
a 2-means clustering at each level of the tree. Both
these trees perform considerably better than kd-trees
for standard statistical tasks like vector quantization,
nns and regression [13]. Random Projection Trees (RP-trees) are built recursively by splitting the data at the
median along a random direction chosen from the surface of the unit sphere (MakeRPTreeSplit subroutine of
Algorithm 3). They are shown to adapt to the intrinsic dimension of the data [15]. They perform better
than kd-trees for the aforementioned statistical tasks.
RP-trees are not as adaptive as the PCA-trees and the
2-means-trees, but have comparable performances and
are computationally much cheaper to build.
Supervised learning. Cayton et al. [16] and Maneewongvatana et al. [12] explicitly learn from the data, choosing optimal splits (with respect to minimizing the expected querying time) at each level of the kd-tree on the data set using a set of sample training queries. They define a cost function for a split in terms of the amount of computation required and the fixed nn error rate for the training queries, and subsequently pick the split with the least cost. Cayton et al. [16] also present an algorithm to learn the binning structure used in LSH. Clarkson [17] uses training queries to estimate the effective Voronoi tessellation of the data set. These papers focus on minimizing the expected query time given a fixed allowed error (the error is zero in the case of exact search). In Cayton et al. [16], the minimization of the cost function can be done by fixing the amount of computation and minimizing the nn error rate. However, the focus of these methods has been to minimize computation time for a fixed error. The "curse of dimensionality" makes bounding the error very expensive in high-dimensional data. Even the learned kd-tree obtains speedups of less than an order of magnitude (the highest being less than 5) over the standard median-split kd-tree, which is known to lack scalability in high dimensions (the speedups [16] are much better if the query comes from a distribution different from the data set queried, and enough training queries are provided).
In this paper, we will introduce a new kind of tree
for nns based on a form of unsupervised learning.
1.4 This paper
We propose a new way of utilizing tree-based algorithms
for nns on a time budget in high dimensions. The rest
of the paper is organized as follows:
• We define the problem of time-constrained nns and
empirically motivate a tree-based meta-algorithm
for time-constrained nns in Section 2.
• We motivate and present a new data structure for
accurate time-constrained nn – the Max-Margin
tree (MM-tree) in Section 3 and demonstrate the
superior accuracy of MM-trees compared to other
trees in time-constrained nns on high dimensional
data sets in Section 4.
• We conclude with some discussions and possible
future directions for this work in Section 5.
2 Time-constrained NNS: Problem and Meta-Algorithm
The problem. The most common theme in nn research is to fix the approximation and try to develop
algorithms that can obtain results most efficiently. The
exact search is a special case of this setting with zero error. Effectively, the nn methods are required to perform
the following task:
Error constrained search: minimize running time subject to the restriction that we
guarantee a certain amount of error.
This formulation of nn generally requires the algorithm
to ensure that the allowed approximation is always
valid. Having located the appropriate result for the
query, the algorithm still has to verify that the result is
actually within the allowed approximation.
This algorithmic behavior is not suitable for many
applications of nns (mentioned in Section 1) where the
response times are limited – (i) In this setting of limited
time, it is hard to guess the appropriate amount of approximation for which the error-constrained algorithm
would return an answer within the time limit. (ii) Validation of the error bound can be considered a waste of
limited resources in the limited time setting.
These applications do not necessarily require the
best search result right away, but the focus can be
on obtaining the best possible result (lowest nn error)
within the limited time. The true nn would always be
preferred but not necessarily required. Given enough
time, the algorithm should always return the true nn.
To summarize, the nn methods would be required to
perform the following task:
Time constrained search: minimize nn error subject to a limited number of computations (amount of time).
This formulation of nns focuses on minimizing the
nn error given the time without the requirement of
validating the error. One way of thinking about it
is to effectively minimize the area under the curve of
error obtained with the returned nn candidate (without
validation) with respect to the search time available (for
example, Figure 4 presents such curves).
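As a hypothetical illustration of that objective (the numbers below are invented for illustration, not taken from the paper's experiments), the area under an error-vs-time curve is just a numerical integral over the recorded (time, error) pairs:

    import numpy as np

    budgets = np.array([1, 5, 10, 25, 50])          # "time" in units of leaves visited
    errors = np.array([2.0, 0.9, 0.4, 0.1, 0.0])    # hypothetical rank errors (%-tile)
    auc = np.trapz(errors, budgets)                 # area under the error-vs-time curve
    print(auc)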
The algorithm. A possible algorithm for time-constrained search is a natural modification of the tree-based algorithm for error-constrained search in Algorithm 1. The algorithm performs a similar depth-first search, but performs no validation of the nn error and returns the current best neighbor candidate as the answer to the query when the time runs out (Algorithm 2). Even though tree-based algorithms appear to not
scale well to high dimensions, the depth-first search is
still conjectured to be a useful strategy [17] – depth-first
search on different data structures generally locates the
correct nn quite early on as a nn candidate, but spends
the subsequent computations searching other parts of
the tree without any improvement in the result.
We perform an experiment to verify this empirically. We consider two OCR image data sets – Optdigits (64-dimensional) [18] and MNist (784-dimensional) [19]. We split each data set into a reference set and a query set (a 75-25% random split for Optdigits and a 90-10% random split for MNist). We perform the depth-first nns for the exact nn of each of the queries in the reference set (Algorithm 1 with ǫ = 0), using kd-trees (with a leaf size of 30) to index the reference sets. We present the histogram of the minimum number of leaves visited to locate the true nearest neighbor and of the total number of leaves visited to find as well as validate its correctness in Figure 2. The maximum number of leaves visited by any query was 76 for Optdigits and around 3500 for MNist. In both cases, around 60% of the queries located their true nearest neighbors in the first 5% of the leaves visited, but fewer than 5% of the queries finished validating within that time. This empirically exhibits the
capability of the depth-first nns on tree data structures
to locate the correct answer early on, but the failure to
perform efficient validation.
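A minimal sketch of how such a measurement can be instrumented, assuming the Node class from the earlier search sketch; the helper name, the dictionary-based bookkeeping and the simple query-to-hyperplane prune bound below are our own illustrative choices, and an actual kd-tree implementation may prune with tighter bounds.

    import numpy as np   # assumes the Node class from the earlier search sketch

    def locate_vs_validate(q, node, state=None):
        """Exact depth-first search (Algorithm 1 with error 0) instrumented to record
        (i) how many leaves had been visited when the eventual nearest neighbor was
        found ("locating") and (ii) the total number of leaves visited before the
        search can terminate ("validating")."""
        if state is None:
            state = {"best": np.inf, "leaves": 0, "located_at": 0}
        if node.is_leaf():
            state["leaves"] += 1
            d = float(np.min(np.linalg.norm(node.points - q, axis=1)))
            if d < state["best"]:
                state["best"], state["located_at"] = d, state["leaves"]
            return state
        margin = np.dot(node.w, q) + node.b
        near, far = (node.left, node.right) if margin < 0 else (node.right, node.left)
        locate_vs_validate(q, near, state)
        # Backtrack: every point on the far side is at least |margin| / ||w|| away,
        # so that subtree can be pruned once the current best is closer than that.
        if abs(margin) / np.linalg.norm(node.w) < state["best"]:
            locate_vs_validate(q, far, state)
        return state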
Note that we have chosen the time constraint in the
form of a limit on the number of leaves to be visited.
This is done for the sake of convenience since the leaves
of a tree can be thought of as a unit of computation
in a tree. Here, we assume that there is enough time to
perform the O(d log n) computations to get to the leaf
at least once. However, Algorithm 2 can be generalized
to the setting where the time is explicitly constrained
(instead of being in terms of units of computation).
3 Robust data structure: Max-margin Trees
Algorithm 2 can be used with any binary tree, but the
quality of the results will depend on how well the tree
has been able to index the data set. More specifically,
consider the distribution of the number of leaves of the
kd-tree required to locate the true neighbor from Figure
2. A tree that indexes the data appropriately will have
a highly left-skewed distribution, which implies highly
accurate results for small time budgets with Algorithm
2. A weakly data-dependent tree building heuristic
(like the median split in kd-trees) would not necessarily
perform well. We wish to utilize trees which learn from
the data.
(a) kd-tree    (b) Maximum margin tree
Figure 3: Visual representations of a kd-tree and a MM-tree: The weakly data-dependent heuristic of a kd-tree can provide less compact partitions. A more data-dependent heuristic like maximum margin splits can be more useful for time-constrained nns.

To guarantee low error with Algorithm 2, we need to ensure that the first few leaves visited by the depth-first
strategy contain the true nn of the query q, implying
that q falls on the side of the hyperplane at each level of
the tree that contains the points closer to q. Since we are
considering binary trees, the hyperplane at each node
of the tree needs to effectively perform a simple binary
classification task for q – learning a hyperplane that
places q on the side containing its close neighborhood
with respect to the resolution of the data in the node
and does not split that close neighborhood. For this
task, we explore a widely used technique in supervised
learning – large margin discriminants.
Large margin methods. Maximum margins have
been exploited in many areas of machine learning because of their superior theoretical and empirical performance in classification [20]. This can be attributed
to the robustness of the discriminant functions estimated from the maximum margin formulation to labelling noise – minor perturbations of a point do not change its class. We seek a similar form of robustness – minor perturbations of a point do not change the side of the hyperplane it lies on, hence ensuring that close neighborhoods are not split by the hyperplane⁴. A max-margin hyperplane is robust to small perturbations in
the data distribution which tend to adversely affect the
splitting hyperplane in traditional tree structures. This
robustness of the hierarchical large margin splits in tree
data structure would intuitively boost the accuracy of
the time constrained nn algorithm (Algorithm 2). The
desired tree building procedure using hierarchical large margin splits is presented in Algorithm 4.
⁴ This also implies that if q is part of the neighborhood, q is placed on the appropriate side of the hyperplane.
The idea of maximum margin splits in the tree data
structure has been proposed earlier [21] using a set of
training queries. However, the split is just a standard
coordinate-axis-parallel kd-tree split rotated to get a
maximum margin split on the partition induced by the
kd-tree split. These splits do not necessarily find the
large margins present in the data, rather just improve
the kd-tree split – the kd-tree split might have already
split a close neighborhood and the rotation of the split
might not help much. We wish to utilize the richer
set of max-margin partitions by obtaining a completely
unsupervised split.
Recently, maximum margin formulations have been used for the unsupervised task of maximum margin clustering (mmc) [22, 23, 24, 25, 26, 27]. The task is to find the optimal hyperplane (w*, b*) as well as the optimal labelling vector y* from the following optimization problem, given a set of points {x_1, x_2, . . . , x_n}:

(3.4)    \min_{y \in \{-1,+1\}^n} \; \min_{w, b, \xi} \; \frac{1}{2}\langle w, w\rangle + C \sum_{i=1}^{n} \xi_i
(3.5)    \text{s.t.} \quad y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \quad \forall i = 1, \ldots, n,
(3.6)    \quad\quad\;\; \xi_i \ge 0, \quad \forall i = 1, \ldots, n,
(3.7)    \quad\quad\;\; -l \le \sum_{i=1}^{n} y_i \le l.

This formulation finds a maximum margin split in the data to obtain two clusters. The class balance constraint (Eq. 3.7) avoids trivial solutions and the constant l ≥ 0 controls the class imbalance.
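A remark that may help in reading the formulation (a standard observation about mmc, not a claim from this paper): for any fixed labelling y ∈ {−1,+1}^n, the inner minimization in (3.4)–(3.6) is exactly the soft-margin SVM training problem for the labels y,

\min_{w, b, \xi} \; \frac{1}{2}\langle w, w\rangle + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \ldots, n,

so mmc can be read as searching over the balance-constrained labellings for the one whose induced SVM attains the largest margin.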
Algorithm 3 Different kinds of split in the data.
MakeKDTreeSplit(Data S)
  ds ← arg max_{i ∈ {1:d}} range(S(i))
  w ← e_{ds}  // the d-dimensional vector with all entries zero except the ds-th entry, which is one
  b ← −median(S(ds))
  return (w, b)

MakeRPTreeSplit(Data S)
  Pick a random unit d-dimensional vector v
  w ← v
  b ← −median({⟨x, v⟩ : x ∈ S})
  return (w, b)

MakeMaxMargSplit(Data S)
  Formulate S as a mmc problem P
  (w, b) ← CuttingPlaneMethod(P)
  return (w, b)

Algorithm 4 Max-margin tree building
BuildMaxMargTree(Data S)
  if size(S) ≤ Nmin then
    T.points ← S
    T.isLeaf ← True
  else
    (w, b) ← MakeMaxMargSplit(S)
    if trivial split(w, b, S) then
      (w, b) ← MakeKDTreeSplit(S)
    end if
    Sl ← {x ∈ S : ⟨w, x⟩ + b < 0}
    Sr ← {x ∈ S : ⟨w, x⟩ + b ≥ 0}
    T.w ← w, T.b ← b
    T.left ← BuildMaxMargTree(Sl)
    T.right ← BuildMaxMargTree(Sr)
  end if
  return T
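The following Python sketch mirrors the split rules of Algorithm 3 and the recursive build with the trivial-split fallback of Algorithm 4; it reuses the Node class from the earlier search sketch. It is illustrative only: the names are ours, the RP split is the simplest median-along-a-random-direction variant, and the max-margin split itself is not implemented – an MM-tree would plug in a function that returns the (w, b) produced by a cutting-plane mmc solver such as CPMMC.

    import numpy as np   # reuses the Node class from the earlier search sketch

    def make_kdtree_split(S):
        """MakeKDTreeSplit: median split along the coordinate with the largest spread."""
        ds = int(np.argmax(S.max(axis=0) - S.min(axis=0)))
        w = np.zeros(S.shape[1])
        w[ds] = 1.0
        return w, -float(np.median(S[:, ds]))

    def make_rptree_split(S, rng=np.random.default_rng(0)):
        """MakeRPTreeSplit (simplest variant): median split along a random unit direction."""
        v = rng.standard_normal(S.shape[1])
        v /= np.linalg.norm(v)
        return v, -float(np.median(S @ v))

    def build_tree(S, make_split, min_leaf=30):
        """Recursive build in the spirit of Algorithm 4. For an MM-tree, `make_split`
        would wrap a maximum-margin-clustering solver such as CPMMC (not implemented
        here); as in Algorithm 4, a trivial split falls back to the kd-tree median split."""
        if len(S) <= min_leaf:
            return Node(points=S)
        w, b = make_split(S)
        left = S @ w + b < 0
        if left.all() or not left.any():              # trivial split
            w, b = make_kdtree_split(S)               # fall back to the kd split
            left = S @ w + b < 0
            if left.all() or not left.any():          # degenerate data: stop splitting
                return Node(points=S)
        return Node(w=w, b=b,
                    left=build_tree(S[left], make_split, min_leaf),
                    right=build_tree(S[~left], make_split, min_leaf))

Under these assumptions, build_tree(S, make_kdtree_split) gives a kd-tree and build_tree(S, make_rptree_split) a simplified RP-tree.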
The margin constraints (Eq. 3.5) enforce a robust separation of the data. This notion of robustness provides promising results in the case of clustering. We apply this method of binary splitting of the data recursively to build a max-margin tree (MM-tree). For the details of the mmc formulation, see Xu et al. [22].
Remark. Note that the presence of different configurations in the data, like clusters (regions with a large number of points but no large margins) or uniform spread (regions where every split has a large margin), implies that it might not always be possible (or useful) to obtain a large margin split in the data at the desired resolution. In such a situation, we employ the standard, efficient kd-tree split. In the implementation, whenever the cutting plane method for the mmc problem produces a trivial split (where all the points end up being assigned to one side of the split), we conduct a kd-tree median split instead (Algorithm 4).
3.1 Max-margin tree: Properties
Tree depth. For the recursive construction of the MM-tree, we need to choose an appropriate value for the parameter l in the balance constraint. Using an adaptive value for l, we provide a depth bound for the MM-tree.

Theorem 3.1. For a given dataset X of size n and an adaptive balance constraint l = ǫn (for a small fixed constant ǫ > 0) in the mmc formulation (line 1 in the MakeMaxMargSplit subroutine of Algorithm 3), the MM-tree constructed by Algorithm 4 has a depth of O(log n).

Proof. Let D(n) be the depth of the tree for n points. If l = ǫn, we get the following recursion:

D(n) = D\left(\frac{(1 + ǫ)n}{2}\right) + 1.

This gives us a depth of \log_{2/(1+ǫ)} n, which is O(log n).

Comparing this to the \log_2 n depth of the kd-trees and RP-trees, the ratio of the depth of the median-split tree to the depth of the MM-tree is given by

\frac{\log_2 n}{\log_{2/(1+ǫ)} n} = 1 - \log_2(1 + ǫ),

which is close to 1 for small ǫ. In our experiments, we use ǫ = 0.001.

Construction times. The mmc formulation results in a non-convex integer programming problem. Xu et al. [22] relax it to a semi-definite program and solve it using expensive off-the-shelf solvers, making it undesirable in the building process of a tree. The cutting plane method (CPMMC) solves the mmc problem in O(sn) time [25, 28].

Theorem 3.2. (Theorem 4.4 in [25]) For any dataset X with n samples and sparsity s, and any fixed value of (the regularization parameter) C > 0 and (the stopping criterion for the cutting plane method) ε > 0, the cutting plane (CPMMC) algorithm takes time O(sn).

We use this cutting plane method to obtain the maximum margin splits in the data while building the MM-tree. The cutting plane method applied to mmc is explained in detail in Zhao et al. [25].

Corollary 3.1. For a given dataset X of size n and sparsity s, any fixed value of (the regularization parameter) C > 0, (the stopping criterion for the cutting plane method) ε > 0 and the adaptive balance constraint l = ǫn throughout Algorithm 4, the runtime of Algorithm 4 is O(sn log n).
Proof. By Theorem 3.2, the total time spent in the MakeMaxMargSplit subroutine for a particular level of the tree is O(sn). If the MakeMaxMargSplit subroutine returns a trivial split, the MakeKDTreeSplit subroutine takes O(n) time. Moreover, splitting the data into the left and right subtrees given the max-margin split takes O(n) time. Hence, the total time spent at any particular level of the tree is O(sn). Then, if T(n) is the time taken by the algorithm to build a tree on n points, we get the following recursion:

T(n) = 2\,T\left(\frac{(1 + ǫ)n}{2}\right) + O(sn).

Solving the recursion gives the total time taken by Algorithm 4 as O(sn log n).

The standard kd-tree splits require O(n) time for a set of n points. The RP-tree splits require O(n) to split the data, but actually require O(n²) time to compute the maximum and average diameters of the data (Algorithm 3 in Freund et al. [29]). The total construction time for these splits depends on the depth of the tree formed. For kd-trees and RP-trees, the median split ensures a depth of O(log n), resulting in a total of O(n log n) and O(n² log n) construction time respectively.

4 Experiments
We compare the performance of our proposed MM-tree with kd-trees and RP-trees. We consider the nn rank error as the measure of the quality of the query results, since the rank error is a more robust notion of nn error [11]. For implementing the RP-tree, we use Algorithm 3 in Freund et al. [29]. We cross-validate for the best value of c for the RP-tree and for the best value of the regularization parameter C for the MM-tree. In the next subsection, we present results for the time-constrained setting, comparing the error obtained given fixed time. Though achieving efficiency in the time-constrained setting is the goal of this work, the last subsection also compares the performance of the MM-tree to the kd-tree and RP-tree in the error-constrained setting for the sake of curiosity.

4.1 Time-constrained Search
We continue with the two data sets presented earlier – the Optdigits and MNist digit data sets (with the same query-reference split mentioned in Section 2). We also present results for the 4096-dimensional Images data set [30] with a 75-25% reference-query split. We compare the performance vs. time (number of leaves visited in Algorithm 2) in two different ways:
• mean nn rank error and the first standard deviation above the mean – provides a view of the overall performance of the data structure;
• maximum nn rank error – provides a view of the stability of the performance, indicating the worst possible performance.
For a given number of leaves, we compare the nn rank error of the results obtained by each of the data structures. We compare the best performing RP-tree (c = 3.0 for Optdigits, 3.5 for MNist, 2.5 for Images) and MM-tree (C = 0.01 for Optdigits, 0.00001 for MNist, 0.05 for Images) to kd-trees. The results are presented in Figure 4. The first standard deviation below the mean is not shown in the plots because in all cases it was almost always zero. The results for Optdigits indicate that MM-trees achieve much lower error than kd-trees and RP-trees given the same number of leaves visited. In fact, the behavior of the means of kd-trees and RP-trees is very comparable to the behavior of the first standard deviation of the MM-trees. The results for MNist make the superiority of MM-trees more apparent, showing at least an order of magnitude less nn rank error for MM-trees when compared to kd-trees and RP-trees with a time budget of less than 50 leaves. Moreover, the first standard deviation from the mean of the nn error for MM-trees is as good as the mean of the nn error of the next best competitor. The results for Images show that the mean nn errors of kd-trees and MM-trees are close, but the MM-trees are more stable than the kd-trees, with the first standard deviation close to the mean and a lower maximum nn error. In fact, on all the data sets, the first standard deviation above the mean and the maximum rank error demonstrate the significantly more stable performance of the MM-trees. These results justify the choice of the large margin splits in the tree for reducing the nn error of the time-constrained nn search algorithm (Algorithm 2).

4.2 Error constrained search
In this experiment, we compare the time required by each of these data structures (with different parameter settings) to obtain a bounded rank error using the rank-approximate nearest neighbor algorithm (the single tree algorithm in Figure 2 in Ram et al., 2009 [11]). Similar to the time-constrained setting, we compare the performance in the following way:
[Figure 4 plots, "Time constrained NN": rank error (in %-tile, mean and mean + 1 standard deviation) and maximum rank error vs. number of leaves visited, for the MM-tree, KD-tree and RP-tree, on (a)–(b) Optdigits, (c)–(d) MNist and (e)–(f) Images.]
Figure 4: Rank error vs. Constrained time: Results for time-constrained search on Optdigits, MNIST and
Images data sets with different trees are presented in this figure. The results indicate that the MM-tree achieves
lower error rates compared to the other two tree data structures. The MM-tree has the lowest mean and maximum
rank-error. Moreover, in two of the above cases, the first standard deviation of the rank-error for the MM-tree is
very close to the mean rank-error of the other data structures.
[Figure 5 plots, "Error constrained NN": average rank error and maximum rank error (in %-tile) vs. average number of distance computations, for the MM-tree, KD-tree and RP-tree, on (a)–(b) Optdigits and (c)–(d) MNist.]
Figure 5: Constrained rank error vs. Time: Results for error-constrained search on the Optdigits and MNist data sets with different trees are presented in this figure. The results indicate that the time required to find a neighbor candidate within a required level of approximation and validate its correctness is almost the same for all three data structures. This indicates that the MM-tree is as capable as the state-of-the-art tree-based error-constrained methods. However, the results from Figure 4 indicate that the times required to locate the actual correct candidates are very different.
• the mean nn rank error obtained for a given error
bound vs. the search time (which includes location
and validation time) – provides a view of the overall
performance of the data structure in the errorconstrained setting
• the maximum nn rank error obtained for a given
error bound vs. the search time – provides a view
of the stability of the performance, indicating the
worst possible performance in the error constrained
setting.
We continue using the following two datasets: Optdigits and MNist. We compare the best performing RP-tree (c = 3.5 for Optdigits and MNist) and MM-tree
(C = 0.01 for Optdigits and C = 0.00001 for MNist) to
kd-trees. The results are presented in Figure 5. Firstly,
the results indicate that the MM-trees perform comparably to the well-established kd-trees as well as the
recent RP-trees. Secondly, they indicate that when validation is required, all the data structures perform comparably without having a clear winner. This observation
supports our guess that no data structure can exhibit
improved scalability when validation is enforced in high
dimensions. We conjecture that learning is unable to
obtain significant gain since the validation process still
dominates the majority of the computation.
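For reference, summary curves of the kind reported in Figures 4 and 5 can be produced with a short evaluation loop such as the sketch below. It reuses the illustrative tcnns, nn_rank_error and build_tree helpers from the earlier sketches; all names are ours rather than the paper's, and the exact protocol (splits, cross-validated parameters) follows the text above.

    import numpy as np

    def rank_error_curves(queries, S, tree, budgets):
        """For each leaf budget: mean, mean + 1 standard deviation and maximum
        nn rank error (in percentiles) over the query set, as in Figure 4."""
        curves = {}
        for t in budgets:
            errs = np.array([100.0 * nn_rank_error(q, tcnns(q, tree, budget=t)[0], S)
                             for q in queries])
            curves[t] = (errs.mean(), errs.mean() + errs.std(), errs.max())
        return curves

    # e.g. curves = rank_error_curves(Q, S, build_tree(S, make_kdtree_split), [1, 5, 10, 25, 50])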
5 Discussion and Future Directions
In this paper, we extend the problem of approximate
nn-search to the limited time setting and motivate its
importance and applicability to real-world problems.
Furthermore, we propose an algorithm and a data
structure to solve this problem with high accuracy and
empirically demonstrate the superior performance of the
new data structure. This data structure utilizes the
robustness of the well-known large margin discriminants
for the task of nn search. This method of nn search has immediate applications in many real-world settings.
We also provide theoretical analyses of the approach for
additional insight into its performance properties.
Future work will include a theoretical analysis of
this algorithm and the data structures in the form of
time-dependent error rates. An interesting extension of
this framework is to consider a time constraint for a
set of queries rather than a single query. This presents
interesting problems of adaptive time allocation (allocating less time for easy queries and more for harder
ones) and estimation of the hardness of a query on the
fly. The MM-trees can also be extended beyond metric spaces by using the kernel trick, making MM-trees useful in Hilbert spaces [25]. This would increase the
scope of this algorithm to a wider spectrum of applications.
References
[1] J. H. Friedman, J. L. Bentley, and R. A. Finkel. An
Algorithm for Finding Best Matches in Logarithmic
Expected Time. ACM Trans. Math. Softw., 1977.
[2] S. M. Omohundro. Five Balltree Construction Algorithms. Technical Report TR-89-063, International
Computer Science Institute, December 1989.
[3] F. P. Preparata and M. I. Shamos. Computational
Geometry: An Introduction. Springer, 1985.
[4] J. M. Hammersley. The Distribution of Distance in a
Hypersphere. Annals of Mathematical Statistics, 21,
1950.
[5] T. Liu, A. W. Moore, A. G. Gray, and K. Yang.
An Investigation of Practical Approximate Nearest
Neighbor Algorithms. 2005.
[6] L. Cayton. Fast Nearest Neighbor Retrieval for Bregman Divergences. Proceedings of the 25th international
conference on Machine learning, 2008.
[7] T. Liu, A. W. Moore, and A. G. Gray. Efficient
Exact k-NN and Nonparametric Classification in High
Dimensions. 2004.
[8] W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[9] A. Gionis, P. Indyk, and R. Motwani. Similarity Search
in High Dimensions via Hashing. pages 518–529, 1999.
[10] P. Indyk and R. Motwani. Approximate Nearest
Neighbors: Towards Removing the Curse of Dimensionality. In STOC, pages 604–613, 1998.
[11] P. Ram, D. Lee, H. Ouyang, and A. Gray. Rankapproximate nearest neighbor search: Retaining meaning and speed in high dimensions. In Advances in Neural Information Processing Systems 22. 2009.
[12] S. Maneewongvatana and D.M. Mount. Analysis of
approximate nearest neighbor searching with clustered
point sets. Data structures, near neighbor searches,
and methodology: fifth and sixth DIMACS implementation challenges: papers related to the DIMACS challenge on dictionaries and priority queues (1995-1996)
and the DIMACS challenge on near neighbor searches
(1998-1999), 2002.
[13] N. Verma, S. Kpotufe, and S. Dasgupta. Which spatial
partition trees are adaptive to intrinsic dimension? In
Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, 2009.
[14] J. McNames. A fast nearest-neighbor algorithm based
on a principal axis search tree. IEEE Transactions on
Pattern Analysis and Machine Intelligence, pages 964–
976, 2001.
[15] S. Dasgupta and Y. Freund. Random projection trees
and low dimensional manifolds. In Proceedings of the
40th annual ACM symposium on Theory of computing,
2008.
[16] L. Cayton and S. Dasgupta. A learning framework for
nearest neighbor search. Advances in Neural Information Processing Systems, 20, 2007.
[17] K.L. Clarkson. Nearest neighbor queries in metric
spaces. Discrete and Computational Geometry, 1999.
[18] C. L. Blake and C. J. Merz. UCI Machine Learning
Repository. http://archive.ics.uci.edu/ml/, 1998.
[19] Y. LeCun. MNist dataset, 2000. http://yann.lecun.com/exdb/mnist/.
[20] S. Boucheron, O. Bousquet, and G. Lugosi. Theory
of classification: A survey of some recent advances.
ESAIM Probability and statistics, 9:323–375, 2005.
[21] S. Maneewongvatana and D. Mount. The analysis of
a probabilistic approach to nearest neighbor searching.
Algorithms and Data Structures, 2001.
[22] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans.
Maximum margin clustering. Advances in neural information processing systems, 17, 2005.
[23] H. Valizadegan and R. Jin. Generalized maximum
margin clustering and unsupervised kernel learning.
Advances in Neural Information Processing Systems,
19, 2007.
[24] K. Zhang, I.W. Tsang, and J.T. Kwok. Maximum
margin clustering made practical. In Proceedings of
the 24th international conference on Machine learning,
2007.
[25] B. Zhao, F. Wang, and C. Zhang. Efficient maximum
margin clustering via cutting plane algorithm. In The
8th SIAM International Conference on Data Mining,
2008.
[26] Y. Hu, J. Wang, N. Yu, and X.S. Hua. Maximum
margin clustering with pairwise constraints. In Data
Mining, 2008. ICDM’08. Eighth IEEE International
Conference on, 2008.
[27] Y.F. Li, I.W. Tsang, J.T. Kwok, and Z.H. Zhou.
Tighter and convex maximum margin clustering. In
Proceeding of the 12th International Conference on
Artificial Intelligence and Statistics, Clearwater Beach,
FL, 2009.
[28] F. Wang, B. Zhao, and C. Zhang. Linear time
maximum margin clustering. Neural Networks, IEEE
Transactions on, 21(2):319–332, 2010.
[29] Y. Freund, S. Dasgupta, M. Kabra, and N. Verma.
Learning the structure of manifolds using random projections. Advances in Neural Information Processing
Systems, 2007.
[30] J. B. Tenenbaum, V. Silva, and J.C. Langford. A
Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 2000.