dist

iDistance -- Indexing the Distance
An Efficient Approach to KNN Indexing
C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the distance: an efficient method to KNN processing, VLDB 2001.

Query Requirement
• Similarity queries: similarity range queries and KNN queries
• Similarity range query: given a query point, find all data points within a given distance r of the query point.
• KNN query: given a query point, find its K nearest neighbours by distance.
[Figure: a range query of radius r and a KNN query reaching the Kth NN]

Other Methods
• SS-tree: R-tree based index structure; uses bounding spheres in internal nodes
• Metric-tree: R-tree based, but uses metric distance and bounding spheres
• VA-file: uses compression via bit strings for sequential filtering of unwanted data points
• Psphere-tree: two-level index structure; uses clusters and duplicates data based on sample queries; intended for approximate KNN
• A-tree: R-tree based, but uses relative bounding boxes
• Problem: these methods are hard to integrate into existing DBMSs

Basic Definition
• Euclidean distance: dist(p, q) = sqrt( sum over j of (p_j - q_j)^2 )
• Relationship between data points (triangle inequality): dist(p1, p2) <= dist(p1, p3) + dist(p3, p2)
• Theorem 1: Let q be the query object, Oi the reference point for partition i, and p an arbitrary point in partition i. If dist(p, q) <= querydist(q) holds, then it follows that dist(Oi, q) - querydist(q) <= dist(Oi, p) <= dist(Oi, q) + querydist(q).

Basic Concept of iDistance
• Points are indexed based on similarity: y = i * c + dist(Si, p)
[Figure: reference/anchor points S1, S2, S3, ..., Sk, Sk+1 mapped onto a single axis; labels d, S1+d, c]

iDistance
• Data points are partitioned into clusters/partitions.
• For each partition, there is a reference point that every data point in the partition makes reference to.
• Data points are indexed based on their similarity (metric distance) to that point using a classical B+-tree.
• Iterative range queries are used in KNN searching.

KNN Searching
[Figure: partitions S1, S2, ...; each searched interval is a range in the B+-tree]
• The search region is enlarged until the K NNs are found.
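The key mapping y = i * c + dist(Si, p) and the bounds of Theorem 1 can be illustrated with a small sketch. It is not the authors' implementation: a sorted Python list searched with `bisect` stands in for the B+-tree, each point is assigned to its closest reference point (one possible partitioning rule), and the names `build_index` and `range_query` are hypothetical. The constant `c` is assumed to be larger than any distance between a point and its reference point, so that the key ranges of different partitions do not overlap.

```python
import bisect
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(points, refs, c):
    """Map every point to the key i * c + dist(refs[i], p).

    Returns (keys, pts): two parallel lists sorted by key. The sorted
    list is only a stand-in for a B+-tree (assumption of this sketch).
    """
    entries = []
    for p in points:
        # Assumption: a point belongs to the partition of its closest reference point.
        i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
        entries.append((i * c + dist(refs[i], p), p))
    entries.sort(key=lambda e: e[0])
    return [k for k, _ in entries], [p for _, p in entries]


def range_query(index, refs, c, q, r):
    """Return all points within distance r of q.

    For each partition i, Theorem 1 limits the keys to scan to
    [i*c + dist(Oi, q) - r, i*c + dist(Oi, q) + r].
    """
    keys, pts = index
    result = []
    for i, ref in enumerate(refs):
        d = dist(ref, q)
        start = bisect.bisect_left(keys, i * c + max(d - r, 0.0))
        # Never scan past the end of partition i's key range.
        end = min(bisect.bisect_right(keys, i * c + d + r),
                  bisect.bisect_left(keys, (i + 1) * c))
        # The key range is only a filter; verify the true distance.
        result.extend(p for p in pts[start:end] if dist(p, q) <= r)
    return result


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.2, 0.1), (0.9, 0.9), (1.0, 1.0), (0.5, 0.4)]
    refs = [(0.0, 0.0), (1.0, 1.0)]
    index = build_index(data, refs, c=10.0)
    print(range_query(index, refs, 10.0, q=(0.1, 0.1), r=0.2))
```

The final distance check matters: the key interval is a necessary condition from Theorem 1, not a sufficient one, so points inside the interval may still lie outside the query sphere.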
KNN Searching
[Figure: query point q and partitions S1, S2 with dist(S1, q), dist(S2, q), Dis_min(S1), Dis_max(S1), Dis_min(S2), Dis_max(S2); on the key axis, the search radius r is increased around dist(S1, q) and dist(S2, q)]
• Increasing search radius: r

KNN Searching: Over Search?
[Figure: reference point S, query point q with radius r, data points o1, o2, o3, and dist(S, q); label Q2]
• Inefficient situation: when K = 3, the query sphere with radius r will retrieve the 3 NNs.
• Among them, only o1 can be guaranteed to be a true NN. Hence the search continues with an enlarged r until r > dist(q, o3).

Stopping Criterion
• Theorem 2: The KNN search algorithm terminates when K NNs are found and the answers are correct.
• Case 1: dist(furthest(KNN’), q) < r
• Case 2: dist(furthest(KNN’), q) > r
[Figure: query sphere of radius r; in case 2 the Kth NN is still uncertain]

Space-based Partitioning: Equal-partitioning
• (centroid of hyperplane, closest distance)
• (external point, closest distance)

Space-based Partitioning: Equal-partitioning from furthest points
• (centroid of hyperplane, furthest distance)
• (external point, furthest distance)

Effect of Reference Points on Query Space
• Using an external point reduces the search area.

Effect on Query Space
• The area bounded by these arcs is the affected search area.
• Using (centroid, furthest distance) can greatly reduce the search area.

Data-based Partitioning I
• Using cluster centroids as reference points
[Figure: data space with both axes running from 0 to 1.0]

Data-based Partitioning II
• Using edge points as reference points
[Figure: data space with both axes running from 0 to 1.0]

Performance Study: Effect of Search Radius
[Figure panels: Dimension = 8, Dimension = 16, Dimension = 30]
• 100K uniform data set
• Using (external point, furthest distance)
• Effect of search radius on query accuracy

I/O Cost vs Search Radius
• 10-NN queries on 100K uniform data sets
• Using (external point, furthest distance)
• Effect of search radius on query cost

Effect of Reference Points
• 10-NN queries on a 100K 30-d uniform data set
• Different reference points

Effect of # of Partitions on Accuracy
• KNN queries on a 100K 30-d clustered data set
• Effect of query radius on query accuracy for different numbers of partitions

Effect of # of Partitions on I/O and CPU Cost
• 10-NN queries on a 100K 30-d clustered data set
• Effect of # of partitions on I/O and CPU costs

Effect of Data Sizes
• KNN queries on 100K and 500K 30-d clustered data sets
• Effect of query radius on query accuracy for different data set sizes

Effect of Clustered Data Sets
• 10-NN queries on 100K and 500K 30-d clustered data sets
• Effect of query radius on query cost for different data set sizes

Effect of Reference Points on Clustered Data Sets
• 10-NN queries on a 100K 30-d clustered data set
• Effect of reference points: cluster edge vs cluster centroid

iDistance ideal for Approximate KNN?
• 10-NN queries on 100K and 500K 30-d clustered data sets
• Query cost at varying query accuracy for different data set sizes

Performance Study: Compare iMinMax and iDistance
• 10-NN queries on 100K 30-d clustered data sets
• C. Yu, B. C. Ooi, K. L. Tan. Progressive KNN search Using B+trees.

iDistance vs A-tree
[Figures: comparison plots on two slides]

Summary of iDistance
• iDistance is simple, but efficient.
• It is a metric-based index.
• The index can be integrated into existing systems easily.
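The iterative KNN search and the stopping criterion of Theorem 2 from the KNN Searching slides can be sketched in the same spirit. This is a simplified illustration under stated assumptions, not the published algorithm: a sorted list again stands in for the B+-tree, every round rescans the enlarged key ranges from scratch instead of extending the previous scan, and `knn_search`, `delta_r` and `max_r` are hypothetical names. `max_r` is assumed to be at least the diameter of the data space, and `c` larger than any distance between a point and its reference point.

```python
import bisect
import math


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(points, refs, c):
    """Sorted (keys, pts) lists with key = i * c + dist(refs[i], p)."""
    entries = []
    for p in points:
        # Assumption: a point belongs to the partition of its closest reference point.
        i = min(range(len(refs)), key=lambda j: dist(refs[j], p))
        entries.append((i * c + dist(refs[i], p), p))
    entries.sort(key=lambda e: e[0])
    return [k for k, _ in entries], [p for _, p in entries]


def knn_search(index, refs, c, q, k, delta_r, max_r):
    """Enlarge the search radius r step by step until the K NNs are certain."""
    keys, pts = index
    r = 0.0
    while True:
        r = min(r + delta_r, max_r)
        candidates = []
        for i, ref in enumerate(refs):
            d = dist(ref, q)
            # Key interval of partition i that can hold points within r of q.
            start = bisect.bisect_left(keys, i * c + max(d - r, 0.0))
            end = min(bisect.bisect_right(keys, i * c + d + r),
                      bisect.bisect_left(keys, (i + 1) * c))
            candidates.extend(pts[start:end])
        candidates.sort(key=lambda p: dist(p, q))
        best = candidates[:k]
        # Case 1: the furthest of the K candidates lies within r, so no
        # unseen point can be closer and the answer is final.
        if len(best) == k and dist(best[-1], q) <= r:
            return best
        # Case 2 continues with a larger r; give up once max_r is reached.
        if r >= max_r:
            return best


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.2, 0.1), (0.9, 0.9),
            (1.0, 1.0), (0.5, 0.4), (0.3, 0.8)]
    refs = [(0.0, 0.0), (1.0, 1.0)]
    index = build_index(data, refs, c=10.0)
    print(knn_search(index, refs, 10.0, q=(0.1, 0.1), k=3,
                     delta_r=0.1, max_r=2.0))
```

Stopping earlier than case 1 allows, for example as soon as K candidates have been seen, would turn the same loop into an approximate search that trades accuracy for cost.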
