iDistance -- Indexing the Distance
An Efficient Approach to KNN Indexing
C. Yu, B. C. Ooi, K.-L. Tan, H. V. Jagadish. Indexing the Distance: An Efficient Method to KNN Processing. VLDB 2001.
Query Requirement
• Similarity queries: similarity range queries and KNN queries
• Similarity range query: Given a query point q, find all data points within a given distance r of q.
• KNN query: Given a query point q, find its K nearest neighbours by distance.
[Figure: a range query of radius r around q, and a KNN query bounded by the distance to the Kth NN.]
Other Methods
• SS-tree: R-tree based index structure; uses bounding spheres in internal nodes
• Metric-tree: R-tree based, but uses metric distances and bounding spheres
• VA-file: uses compression via bit strings for sequential filtering of unwanted data points
• P-Sphere tree: two-level index structure; uses clusters and duplicates data based on sample queries; designed for approximate KNN
• A-tree: R-tree based, but uses relative bounding boxes
• Problem: all of these are hard to integrate into existing DBMSs
Basic Definition
• Euclidean distance:
  dist(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... + (pd - qd)^2 )
• Relationship between data points (triangle inequality):
  dist(p, q) <= dist(p, O) + dist(O, q)
• Theorem 1: Let q be the query point, Oi the reference point of partition i, and p an arbitrary point in partition i. If dist(p, q) <= querydist(q), then dist(Oi, q) - querydist(q) <= dist(Oi, p) <= dist(Oi, q) + querydist(q). (See the sketch below.)
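As a minimal sketch of how Theorem 1 is used for pruning (the function names are illustrative, not the paper's code): given a query radius querydist(q), only points whose stored distance to Oi falls in the interval below need to be examined.

```python
import math

def dist(p, q):
    """Euclidean distance between points p and q given as coordinate sequences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def candidate_interval(O_i, q, querydist_q):
    """Theorem 1: every point p of partition i with dist(p, q) <= querydist(q)
    has dist(O_i, q) - querydist(q) <= dist(O_i, p) <= dist(O_i, q) + querydist(q),
    so only stored distances inside this interval need to be examined."""
    d = dist(O_i, q)
    return (d - querydist_q, d + querydist_q)
```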
Basic Concept of iDistance
• Indexing points based on similarity
y = i * c + dist (Si, p)
Reference/anchor points
[Figure: reference points S1, S2, S3, ... in the data space (left); each partition i is mapped to its own interval of length c on the single B+-tree axis S1, S2, S3, ..., Sk, Sk+1 (right), so a point at distance d from S1 maps to key S1 + d inside S1's interval.]
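A minimal sketch of the key mapping y = i * c + dist(Si, p) shown above, assuming points are plain coordinate sequences; the constant C, the helper names, and the sorted Python list standing in for the B+-tree are illustrative choices, not the paper's code.

```python
from math import dist   # Euclidean distance, Python 3.8+

C = 10_000.0  # constant c: must exceed any distance that can occur within a partition

def idistance_key(i, S_i, p):
    """Map point p of partition i to the one-dimensional key y = i * c + dist(S_i, p)."""
    return i * C + dist(S_i, p)

def build_index(partitions):
    """partitions: list of (reference_point, points) pairs, one per partition.
    Returns (key, point) pairs sorted by key -- a stand-in for the B+-tree."""
    index = [(idistance_key(i, S_i, p), p)
             for i, (S_i, points) in enumerate(partitions)
             for p in points]
    index.sort(key=lambda kp: kp[0])
    return index
```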
iDistance
• Data points are partitioned into clusters (partitions).
• Each partition has a reference point to which every data point in the partition refers.
• Data points are indexed by their similarity (metric distance) to this reference point using a CLASSICAL B+-tree.
• Iterative range queries are used in KNN searching.
KNN Searching
[Figure: partitions around S1, S2, ... map to separate key ranges in the B+-tree; a range query touches one contiguous key range per partition.]
• The search region is enlarged until the K NNs are found (see the sketch below).
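A sketch of the iterative range search, reusing dist, C, and the (key, point) index layout from the sketches above; delta_r, r_max, and the rescan-from-scratch loop are simplifications assumed here (a real implementation would extend the searched ranges incrementally instead of re-scanning them).

```python
import bisect
from math import dist

def knn_search(index, partitions, q, K, delta_r=0.05, r_max=1.0):
    """Iterative range search over the sorted (key, point) list built above.
    The radius r grows by delta_r until the K nearest candidates all lie
    within r of q, which guarantees they are the true K NNs (Theorem 2)."""
    keys = [k for k, _ in index]
    found = {}                           # point (as tuple) -> distance to q
    r = 0.0
    while r < r_max:
        r += delta_r
        for i, (S_i, _) in enumerate(partitions):
            d_iq = dist(S_i, q)
            # Theorem 1: candidates of partition i lie at keys within r of i*C + dist(S_i, q),
            # clipped to the key range [i*C, i*C + C) owned by partition i.
            lo = i * C + max(d_iq - r, 0.0)
            hi = i * C + min(d_iq + r, C)
            for _, p in index[bisect.bisect_left(keys, lo):bisect.bisect_right(keys, hi)]:
                found.setdefault(tuple(p), dist(p, q))
        knn = sorted(found.items(), key=lambda kv: kv[1])[:K]
        if len(knn) == K and knn[-1][1] <= r:    # stopping criterion (Theorem 2)
            return [p for p, _ in knn]
    return [p for p, _ in sorted(found.items(), key=lambda kv: kv[1])[:K]]
```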
KNN Searching
[Figure: query q between reference points S1 and S2 in the data space (top); on the one-dimensional B+-tree axis (bottom), each partition i is annotated with Dis_min(Si), dist(Si, q) and Dis_max(Si), and the searched key ranges grow as the search radius r around q increases.]
Over Search?
Inefficient situation:
• When K = 3, the range scan with query radius r already retrieves the 3 NNs o1, o2 and o3.
[Figure: query q inside the partition of reference point S; candidates o1, o2, o3 around q; radius r and dist(S, q) shown.]
• Among them, only o1 can be guaranteed to be a NN. Hence the search continues with enlarged r until r > dist(q, o3).
Stopping Criterion
• Theorem 2: The KNN search algorithm terminates when the K NNs found are guaranteed to be correct.
Case 1: dist(furthest(KNN'), q) <= r -- all K candidates lie inside the searched sphere, so no unexamined point can be closer; the search stops.
Case 2: dist(furthest(KNN'), q) > r -- the Kth candidate lies outside the searched sphere, so a closer point may still exist; the search continues with a larger r.
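The termination test of Case 1, written as a small check (a sketch; the function name and the candidates argument are assumptions):

```python
from math import dist

def can_terminate(candidates, q, r, K):
    """Theorem 2, case 1: if the K nearest candidates found so far all lie
    within the already-searched radius r of q, no unexamined point can be
    closer than the Kth candidate, so the answer set is final."""
    if len(candidates) < K:
        return False
    kth_dist = sorted(dist(p, q) for p in candidates)[K - 1]
    return kth_dist <= r
```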
Space-based Partitioning: Equal-partitioning
• (centroid of hyperplane, closest distance)
• (external point, closest distance)
Space-based Partitioning: Equal-partitioning from furthest points
• (centroid of hyperplane, furthest distance)
• (external point, furthest distance)
Effect of Reference Points on
Query Space
• Using an external point as reference reduces the searching area.
Effect on Query Space
[Figure: the area bounded by the arcs is the affected searching area.]
• Using (centroid, furthest distance) can greatly reduce the search area.
Data-based Partitioning I
Using cluster centroids as reference points
Data-based Partitioning II
Using edge points as reference points
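For the data-based variants, the reference points come from clustering the data: Partitioning I uses the cluster centroids directly, and one plausible reading of Partitioning II takes, per cluster, the member farthest from its centroid as an "edge" reference point. The sketch below (plain Lloyd's k-means; the helper names, iteration count, and the exact edge-point rule are assumptions, not the paper's procedure) illustrates both choices.

```python
import numpy as np

def kmeans_centroids(X, k, iters=20, seed=0):
    """A few rounds of plain Lloyd's k-means; returns cluster centroids and labels."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign every point to its nearest centroid
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1)
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels

def edge_reference_points(X, centroids, labels):
    """One reading of 'edge points': for each cluster, use the member farthest
    from its centroid as the reference point (the paper's exact rule may differ)."""
    X = np.asarray(X, dtype=float)
    refs = []
    for j, c in enumerate(centroids):
        members = X[labels == j]
        if len(members) == 0:
            refs.append(c)
            continue
        refs.append(members[np.argmax(((members - c) ** 2).sum(-1))])
    return np.vstack(refs)
```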
Performance Study: Effect of Search Radius
[Plots: query accuracy vs search radius for dimension = 8, 16 and 30]
• 100K uniform data set
• Using (external point, furthest
distance)
• Effect of search radius on query
accuracy
I/O Cost vs Search Radius
• 10-NN queries on 100K uniform data sets
• Using (external point, furthest distance)
• Effect of search radius on query cost
Effect of Reference Points
•10-NN queries on 100K 30-d uniform data set
•Different Reference Points
Effect of Clustered # of Partitions
on Accuracy
• KNN queries on 100K 30-d clustered data set
• Effect of query radius on query accuracy for different numbers of partitions
Effect of # of Partitions
on I/O and CPU Cost
• 10-NN queries on 100K 30-d clustered data set
• Effect of # of partitions on I/O and CPU Costs
Effect of Data Sizes
• KNN queries on 100K, 500K 30-d clustered data sets
• Effect of query radius on query accuracy for different data set sizes
Effect of Clustered Data Sets
• 10-NN queries on 100K and 500K 30-d clustered data sets
• Effect of query radius on query cost for different data set sizes
Effect of Reference Points on
Clustered Data Sets
• 10-NN queries on 100K 30-d clustered data set
• Effect of Reference Points: Cluster Edge vs Cluster Centroid
iDistance ideal for Approximate
KNN?
• 10-NN queries on 100K and 500K 30-d clustered data sets
• Query cost vs query accuracy for different data set sizes
Performance Study: Comparing iMinMax and iDistance
• 10-NN queries on 100K 30-d clustered data sets
• C. Yu, B. C. Ooi, K.-L. Tan. Progressive KNN Search Using B+-trees.
iDistance vs A-tree
Summary of iDistance
• iDistance is simple, but efficient
• It is a metric-based index
• The index can be integrated into existing systems easily.