iDistance -- Indexing the Distance
An Efficient Approach to KNN Indexing
C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the distance: an efficient method to KNN processing, VLDB 2001.

Query Requirement
• Similarity queries: similarity range queries and KNN queries
• Similarity range query: given a query point, find all data points within a given distance r of the query point.
• KNN query: given a query point, find the K nearest neighbours of the point, ranked by distance.
[Figure labels: r (radius of the range query), Kth NN (boundary of the KNN query)]

Other Methods
• SS-tree: R-tree based index structure; uses bounding spheres in internal nodes
• Metric-tree: R-tree based, but uses metric distance and bounding spheres
• VA-file: uses compression via bit strings for sequential filtering of unwanted data points
• Psphere-tree: two-level index structure; uses clusters and duplicates data based on sample queries; intended for approximate KNN
• A-tree: R-tree based, but uses relative bounding boxes
• Problem: these methods are hard to integrate into existing DBMSs

Basic Definition
• Euclidean distance: dist(p, q) = sqrt( sum over j = 1..d of (p_j - q_j)^2 )
• Relationship between data points (triangle inequality): dist(p, q) <= dist(p, O) + dist(O, q)
• Theorem 1: Let q be the query object, Oi the reference point for partition i, and p an arbitrary point in partition i. If dist(p, q) <= querydist(q) holds, then dist(Oi, q) - querydist(q) <= dist(Oi, p) <= dist(Oi, q) + querydist(q).

Basic Concept of iDistance
• Indexing points based on similarity: y = i * c + dist(Si, p)
[Figure: reference/anchor points S1, S2, S3, ..., Sk, Sk+1 laid out on a one-dimensional key axis at spacing c; a point at distance d from S1 maps to S1+d]

iDistance
• Data points are partitioned into clusters/partitions.
• Each partition has a reference point to which every data point in the partition refers.
• Data points are indexed by their similarity (metric distance) to that reference point using a classical B+-tree.
• KNN search is carried out with iterative range queries.

KNN Searching
[Figure: partitions S1, S2, ...; each searched region corresponds to a range in the B+-tree]
• The search region is enlarged until the K NNs are found.
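
The key mapping, the Theorem 1 range and the iterative enlargement of the search region can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a sorted list searched with `bisect` stands in for the B+-tree, each point is assigned to its nearest reference point, the constant `c` is assumed to be larger than any point-to-reference distance, and the radius grows by a fixed `step`. The names `IDistanceIndex`, `range_query`, `knn` and `step` are hypothetical.

```python
import bisect
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class IDistanceIndex:
    """Toy iDistance index; a sorted key list stands in for the B+-tree."""

    def __init__(self, points, refs, c):
        self.refs, self.c = refs, c  # c is assumed to exceed every partition radius
        self.dist_max = [0.0] * len(refs)
        entries = []
        for p in points:
            # Assumption: a point joins the partition of its nearest reference point.
            i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
            d = dist(refs[i], p)
            self.dist_max[i] = max(self.dist_max[i], d)
            entries.append((i * c + d, p))  # key y = i * c + dist(Si, p)
        entries.sort()
        self.keys = [key for key, _ in entries]
        self.points = [p for _, p in entries]

    def range_query(self, q, r):
        """Return every indexed point within distance r of q."""
        hits = []
        for i, ref in enumerate(self.refs):
            dq = dist(ref, q)
            # Theorem 1: a qualifying p has dq - r <= dist(Si, p) <= dq + r.
            lo, hi = max(dq - r, 0.0), min(dq + r, self.dist_max[i])
            if lo > hi:
                continue  # the query sphere cannot reach partition i
            start = bisect.bisect_left(self.keys, i * self.c + lo)
            stop = bisect.bisect_right(self.keys, i * self.c + hi)
            # The key range is only a filter, so check the true distance.
            hits.extend(p for p in self.points[start:stop] if dist(p, q) <= r)
        return hits

    def knn(self, q, k, step):
        """Enlarge the search radius by `step` until k neighbours are confirmed."""
        r = step
        # No indexed point can be farther from q than this bound.
        limit = max(dist(ref, q) + dmax for ref, dmax in zip(self.refs, self.dist_max))
        while True:
            hits = sorted(self.range_query(q, r), key=lambda p: dist(p, q))
            # Every hit lies within r of q, so k hits are the true k nearest.
            if len(hits) >= k or r >= limit:
                return hits[:k]
            r += step
```

For example, `IDistanceIndex(points, refs, c=10.0).knn(q, 3, 0.05)` returns the three nearest points to `q`, closest first. The sketch reruns the whole range query at each radius for brevity; an index backed by a real B+-tree would more plausibly extend the previously scanned key ranges outward instead of rescanning them.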
KNN Searching
[Figure: query point q with reference points S1 and S2; labels dist(S1, q), dist(S2, q), Dis_min(S1), Dis_max(S1), Dis_min(S2), Dis_max(S2)]
• Increasing search radius: r
[Figure: key axis for S1 and S2, each running from 0 through Dis_min(Si) and dist(Si, q) to Dis_max(Si), with the searched range widening by r on both sides of dist(Si, q)]

KNN Searching: Over Search?
Inefficient situation:
• When K = 3, the query sphere with radius r will retrieve the 3 NNs.
• Among them, only o1 can be guaranteed to be a true NN. Hence the search continues with an enlarged r until r > dist(q, o3).
[Figure: query point q, reference point S, dist(S, q), radius r, data points o1, o2, o3]

Stopping Criterion
• Theorem 2: The KNN search algorithm terminates when K NNs are found and the answers are correct.
• Case 1: dist(furthest(KNN'), q) < r
• Case 2: dist(furthest(KNN'), q) > r
[Figure: search sphere of radius r; in case 2 the Kth NN is still in question]

Space-based Partitioning: Equal-partitioning
• (centroid of hyperplane, closest distance)
• (external point, closest distance)

Space-based Partitioning: Equal-partitioning from furthest points
• (centroid of hyperplane, furthest distance)
• (external point, furthest distance)

Effect of Reference Points on Query Space
• Using an external point to reduce the searching area

Effect on Query Space
• The area bounded by these arcs is the affected searching area.
• Using (centroid, furthest distance) can greatly reduce the search area.

Data-based Partitioning I
• Using cluster centroids as reference points
[Figure: unit square with axis ticks 0, 0.20, 0.67, 1.0 and 0, 0.31, 0.70, 1.0]

Data-based Partitioning II
• Using edge points as reference points
[Figure: unit square with axis ticks 0, 0.20, 0.67, 1.0 and 0, 0.31, 0.70, 1.0]

Performance Study: Effect of Search Radius
• 100K uniform data set
• Using (external point, furthest distance)
• Effect of search radius on query accuracy
[Figure panels: Dimension = 8, Dimension = 16, Dimension = 30]

I/O Cost vs Search Radius
• 10-NN queries on 100K uniform data sets
• Using (external point, furthest distance)
• Effect of search radius on query cost

Effect of Reference Points
• 10-NN queries on a 100K 30-d uniform data set
• Different reference points

Effect of # of Partitions on Accuracy (Clustered Data)
• KNN queries on a 100K 30-d clustered data set
• Effect of query radius on query accuracy for different numbers of partitions

Effect of # of Partitions on I/O and CPU Cost
• 10-NN queries on a 100K 30-d clustered data set
• Effect of # of partitions on I/O and CPU costs

Effect of Data Sizes
• KNN queries on 100K and 500K 30-d clustered data sets
• Effect of query radius on query accuracy for different data set sizes

Effect of Clustered Data Sets
• 10-NN queries on 100K and 500K 30-d clustered data sets
• Effect of query radius on query cost for different data set sizes

Effect of Reference Points on Clustered Data Sets
• 10-NN queries on a 100K 30-d clustered data set
• Effect of reference points: cluster edge vs cluster centroid

iDistance ideal for Approximate KNN?
• 10-NN queries on 100K and 500K 30-d clustered data sets
• Query cost at varying query accuracy for different data set sizes

Performance Study: Comparing iMinMax and iDistance
• 10-NN queries on 100K 30-d clustered data sets
C. Yu, B. C. Ooi, K. L. Tan. Progressive KNN search Using B+trees.

iDistance vs A-tree

Summary of iDistance
• iDistance is simple but efficient.
• It is a metric-based index.
• The index can be integrated into existing systems easily.