When Is Nearest Neighbors
Indexable?
Uri Shaft (Oracle Corp.)
Raghu Ramakrishnan (UW-Madison)
Motivation: Scalability Experiments
• Dozens of papers describe experiments
about index scalability with increased
dimensions.
– Constants are:
• Number of data points
• Data and Query distribution
• Index structure / search algorithm
– Variable:
• Number of dimensions
– Measurement:
• Performance of index.
Example From PODS 1997
[Figure: results of a scalability experiment from PODS 1997.]
Motivation
• In many cases the conclusion is that the
empirical evidence suggests the index
structures do scale with dimensionality
• We would like to investigate these
claims mathematically – supply a proof
of scalability or non-scalability.
Historical Context
• Continues work done in
“When Is Nearest Neighbors
Meaningful?” (ICDT 1999)
• The previous work was about the behavior of distance distributions.
• This work is about the behavior of indexing structures under similar conditions.
Contents
• Vanishing Variance property
• Convex Description index structures
• Indexing Theorem
– The performance of a CD index does not scale for VV workloads using the Euclidean distance.
• Conclusion
• Future Work
Vanishing Variance
• Same definition as in the ICDT 99 work (although it was not named there).
• In 1999 we showed that these workloads become meaningless – the ratios of distances between the query and the various data points become arbitrarily close to 1.
• We use the same result here.
Vanishing Variance
• A scalability experiment contains a
series of workloads W1,W2,…,Wm,…
– m is the number of dimensions
– each workload Wm has n data points and a query point (all drawn from the same distribution)
– the distance distribution is denoted Dm
• Vanishing Variance:
lim_{m→∞} var( Dm / E(Dm) ) = 0
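This property can be checked numerically. A minimal sketch in Python (our illustration, not from the talk), assuming i.i.d. uniform data in [0,1]^m and the L2 metric; the variance of Dm / E(Dm) shrinks roughly like 1/m:

    import numpy as np

    rng = np.random.default_rng(0)
    pairs = 10_000                            # sample size for estimating Dm
    for m in (2, 10, 100, 1000):              # number of dimensions
        # Dm: L2 distance between a random query point and a random data point
        d = np.linalg.norm(rng.random((pairs, m)) - rng.random((pairs, m)), axis=1)
        print(f"m={m:5d}   var(Dm / E(Dm)) = {np.var(d / d.mean()):.5f}")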
Contents
• Vanishing Variance property
• Convex Description index structures
• Indexing Theorem
– The performance of a CD index does not scale for VV workloads using the Euclidean distance.
• Conclusion
• Future Work
Convex Description Index
• Data points are distributed into buckets (e.g., disk pages). Access to a bucket is "all or nothing". We allow redundancy. A bucket contains at least two data points.
• Each bucket is associated with a description – a convex region containing all data points in the bucket.
• The search algorithm accesses at least all buckets whose convex region is closer to the query than the nearest neighbor (sketched below).
• The cost of a search is the number of data points retrieved.
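A minimal sketch of this access rule in Python (our illustration; the talk defines the abstract model, not this code), using bounding rectangles as the convex descriptions:

    import numpy as np

    def mindist(q, lo, hi):
        """L2 distance from query q to the axis-aligned rectangle [lo, hi] (0 if q is inside)."""
        return np.linalg.norm(np.maximum(lo - q, 0) + np.maximum(q - hi, 0))

    def cd_search_cost(buckets, q):
        """Points retrieved by a NN search that reads every bucket whose
        region is closer to q than the nearest neighbor (the access rule above)."""
        nn = min(np.linalg.norm(p - q) for b in buckets for p in b)   # true NN distance
        return sum(len(b) for b in buckets
                   if mindist(q, b.min(axis=0), b.max(axis=0)) <= nn)

    # Usage: 100 buckets of 10 uniform points in 50 dimensions.
    rng = np.random.default_rng(0)
    buckets = list(rng.random((100, 10, 50)))
    print(cd_search_cost(buckets, rng.random(50)))   # close to 1000: a near-linear scan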
Example: R-Tree
• Buckets are disk pages. Under normal construction, buckets contain more than two data points each.
• Bucket descriptions are convex and contain all the data points (bounding rectangles).
• The search algorithm accesses all buckets whose convex region is closer to the query than the nearest neighbor (and probably a few more).
Convex Description Indexes
• All R-Tree variants
• X-Tree
• M-Tree
• kdb-Tree
• SS-Tree and SR-Tree
• Many more
Other indexes (non-CD)
• Probability structures (P-Tree, VLDB
2000)
– Access is based on clusters; a near-enough bucket may not be accessed.
• Projection indexes (like the VA-file)
– Compression structures.
– All data points are accessed in pieces, not in buckets.
Contents
• Vanishing Variance property
• Convex Description index structures
• Indexing Theorem
– The performance of a CD index does not scale for VV workloads using the Euclidean distance.
• Conclusion
• Future Work
Indexing Theorem
• If:
– the scalability experiment uses a series of workloads with Vanishing Variance,
– the distance metric is Euclidean, and
– the indexing structure is Convex Description,
• Then:
– the expected cost of a query converges to the number of data points, i.e., a linear scan of the data (stated compactly below).
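Stated compactly, in our notation from the Vanishing Variance slide: for VV workloads W1, W2, … with n data points each, under the L2 metric and any CD index,

    lim_{m→∞} E[ cost(Wm) ] = n

i.e., the expected query cost approaches that of a linear scan.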
Sketch of Proof
• Because of Vanishing Variance, the ratios of distances between the query and the various data points become arbitrarily close to 1.
• When using the Euclidean distance, we can look at an arbitrary data bucket and a query point, choose two data points from the bucket, and form a triangle:
[Figure: query point Q, a bucket containing data points D1 and D2, and a point Y on the segment between D1 and D2.]
• The distances from Q to D1, D2, …, Dn are all about the same, so the triangle Q–D1–D2 is nearly equilateral.
• The midpoint Y of D1–D2 is then closer to Q by a constant factor (about √3/2 of a side), and Y lies inside the bucket's description because the description is convex.
• Therefore the distance from Q to the data bucket is less than the distance to the nearest neighbor, and the bucket must be accessed.
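The midpoint effect is easy to see numerically (our illustration, assuming uniform data in [0,1]^m): as m grows, |Q–Y| concentrates near √3/2 ≈ 0.87 of |Q–D1| and |Q–D2|, while Vanishing Variance forces all query–data distances, including the nearest neighbor's, to within a vanishing factor of each other.

    import numpy as np

    rng = np.random.default_rng(0)
    for m in (2, 10, 100, 10_000):               # number of dimensions
        q, d1, d2 = rng.random((3, m))           # query and two data points from one bucket
        y = (d1 + d2) / 2                        # midpoint of D1-D2; in the bucket's convex region
        print(f"m={m:6d}  |Q-D1|={np.linalg.norm(q - d1):7.2f}  "
              f"|Q-D2|={np.linalg.norm(q - d2):7.2f}  |Q-Y|={np.linalg.norm(q - y):7.2f}")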
Contents
• Vanishing Variance property
• Convex Description index structures
• Indexing Theorem
– The performance of a CD index does not scale for VV workloads using the Euclidean distance.
• Conclusion
• Future Work
Conclusion
• Dozens of papers describe experiments
about index scalability with increased
dimensions.
• We wanted to investigate these claims
mathematically – supply a proof of
scalability or non-scalability.
• We proved that the index structures used in many of these experiments do not scale in dimensionality.
Conclusion
• Use this theorem to channel indexing research into more useful and practical avenues.
• Review previous results accordingly.
Future Work
• Remove the restriction of at least two data points per bucket.
– An easy exercise; we need to take into account the cost of traversing a hierarchical data structure.
• Investigate other Lp metrics.
• Investigate projection indexes using the Euclidean metric (it looks like they do not scale either).
Future Work
• Find a scalable indexing structure for uniform data and the L∞ metric.
– Hint: use compression.
• Find the number of data points needed for an R-Tree to be practical on uniform data with the L2 metric.
– Approximately: n ≈ F · 3^m
Questions