When Is Nearest Neighbors Indexable?
Uri Shaft (Oracle Corp.)
Raghu Ramakrishnan (UW-Madison)

Motivation: Scalability Experiments
• Dozens of papers describe experiments about index scalability with increased dimensions.
  – Constants are:
    • Number of data points
    • Data and query distribution
    • Index structure / search algorithm
  – Variable:
    • Number of dimensions
  – Measurement:
    • Performance of the index

Example From PODS 1997

Motivation (cont.)
• In many cases the conclusion is that the empirical evidence suggests the index structures do scale with dimensionality.
• We would like to investigate these claims mathematically – supply a proof of scalability or non-scalability.

Historical Context
• Continues work done in “When Is Nearest Neighbors Meaningful?” (ICDT 1999).
• Previous work was about the behavior of distance distributions.
• This work is about the behavior of indexing structures under similar conditions.

Contents
• Vanishing Variance property
• Convex Description index structures
• Indexing Theorem
  – The performance of a CD index does not scale for VV workloads using Euclidean distances.
• Conclusion
• Future Work

Vanishing Variance
• Same definition used in the ICDT 99 work (although not named in that work).
• In 1999 we showed that the workloads become meaningless – ratios of distances between the query and various data points become arbitrarily close to 1.
• We use the same result here.

Vanishing Variance (cont.)
• A scalability experiment contains a series of workloads W1, W2, …, Wm, …
  – m is the number of dimensions
  – each workload Wm has n data points and a query point (same distribution)
  – Distance distribution marked as Dm
• Vanishing Variance:
  \lim_{m \to \infty} \operatorname{var}\left( \frac{D_m}{E(D_m)} \right) = 0

Convex Description Index
• Data points are distributed to buckets (e.g. disk pages).
  Access to a bucket is “all or nothing”. We allow redundancy. A bucket contains at least two data points.
• Each bucket is associated with a description – a convex region containing all data points in the bucket.
• The search algorithm accesses at least all buckets whose convex region is closer than the nearest neighbor.
• Cost of search is the number of data points retrieved.

Example: R-Tree
• Buckets are disk pages. Under normal construction buckets contain more than two data points each.
• Bucket descriptions are convex and contain all data points (Bounding Rectangles).
• The search algorithm accesses all buckets whose convex region is closer than the nearest neighbor (and probably a few more).

Convex Description Indexes
• All R-Tree variants
• X-Tree
• M-Tree
• kdb-Tree
• SS-Tree and SR-Tree
• Many more

Other indexes (non-CD)
• Probability structures (P-Tree, VLDB 2000)
  – Access based on clusters. A near enough bucket may not be accessed.
• Projection index (like VA-file)
  – Compression structures.
  – All data points accessed in pieces, not in buckets.

Indexing Theorem
• If:
  – The scalability experiment uses a series of workloads with Vanishing Variance
  – The distance metric is Euclidean
  – The indexing structure is Convex Description
• Then:
  – The expected cost of a query converges to the number of data points
  – I.e., a linear scan of the data

Sketch of Proof
• Because of Vanishing Variance, the ratio of distances between the query and various data points becomes arbitrarily close to 1.
• When using Euclidean distance, we can look at an arbitrary data bucket and a query point, choose two data points from the bucket and create a triangle:
Distances between Q, D1, D2, …, Dn are about the same.
Distance of Q to Y is much smaller.
[Figure: triangle formed by Q, D1 and D2, with the point Y and the bucket marked]
Therefore, the distance of Q to the data bucket is less than the distance to the nearest neighbor.

Conclusion
• Dozens of papers describe experiments about index scalability with increased dimensions.
• We wanted to investigate these claims mathematically – supply a proof of scalability or non-scalability.
• We proved that the index structures in many of these experiments do not scale with dimensionality.

Conclusion (cont.)
• Use this theorem to channel indexing research into more useful and practical avenues.
• Review previous results accordingly.

Future Work
• Remove the restriction of at least two data points per bucket.
  – Easy exercise; need to take into account the cost of traversing a hierarchical data structure.
• Investigate other Lp metrics.
• Investigate projection indexes using the Euclidean metric (looks like they do not scale either).

Future Work (cont.)
• Find a scalable indexing structure for Uniform data and L metric
  – Hint: use compression
• Find the number of data points needed for an R-Tree to be practical on uniform data, L2 metric.
  – Approx: n F 3 m

Questions
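
The Vanishing Variance condition and the triangle argument in the proof sketch can be checked numerically. What follows is a minimal sketch, not the authors' experiment. It assumes data and query points drawn i.i.d. uniform on the unit cube, and it takes Y to be the midpoint of D1 and D2; the slides fix neither the distribution nor the choice of Y. The function names `relative_variance` and `midpoint_vs_nearest` are hypothetical.

```python
import math
import random
import statistics


def uniform_point(m, rng):
    """Draw one point uniformly from the m-dimensional unit cube (assumed workload)."""
    return [rng.random() for _ in range(m)]


def relative_variance(m, samples, rng):
    """Estimate var(D_m / E(D_m)) for the query-to-data Euclidean distance D_m."""
    dists = [
        math.dist(uniform_point(m, rng), uniform_point(m, rng))
        for _ in range(samples)
    ]
    mean = statistics.fmean(dists)
    return statistics.pvariance([d / mean for d in dists])


def midpoint_vs_nearest(m, n, rng):
    """Return dist(Q, Y) / dist(Q, nearest neighbor).

    Y is the midpoint of two data points D1 and D2 (an assumption of this
    sketch). Y lies in any convex region that contains D1 and D2.
    """
    q = uniform_point(m, rng)
    data = [uniform_point(m, rng) for _ in range(n)]
    nearest = min(math.dist(q, d) for d in data)
    d1, d2 = data[0], data[1]
    y = [(a + b) / 2 for a, b in zip(d1, d2)]
    return math.dist(q, y) / nearest


if __name__ == "__main__":
    rng = random.Random(0)
    for m in (2, 20, 200):
        rv = relative_variance(m, 1000, rng)
        ratio = midpoint_vs_nearest(m, 100, rng)
        print(f"m={m:4d}  var(D/E[D])={rv:.5f}  dist(Q,Y)/NN={ratio:.3f}")
```

Under these assumptions the relative variance should shrink as m grows. For large m the ratio should fall below 1: the point Y, which belongs to the bucket's convex description, is closer to Q than the nearest neighbor is, so a CD index has to read that bucket. With i.i.d. uniform coordinates the ratio tends to sqrt(3)/2, about 0.87, since the squared distances grow like m/8 for Q to Y and m/6 for Q to a data point.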